[ https://issues.apache.org/jira/browse/ARROW-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872636#comment-16872636 ]
Francois Saint-Jacques commented on ARROW-5427:
-----------------------------------------------

I think this is related to the dask failure ARROW-5730. Applying this patch reduces the failures from 9 to 1.

> [Python] RangeIndex serialization change implications
> -----------------------------------------------------
>
>                 Key: ARROW-5427
>                 URL: https://issues.apache.org/jira/browse/ARROW-5427
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Joris Van den Bossche
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In 0.13, the conversion of a pandas DataFrame's RangeIndex changed: it is no
> longer serialized as an actual column in the arrow table, but only saved as
> metadata (in the pandas metadata) (ARROW-1639).
> This change led to a couple of issues:
> - In pandas, it can be unpredictable whether a DataFrame has a RangeIndex,
> which means that the resulting schema in arrow can be somewhat unexpected.
> See ARROW-5104: empty DataFrame has RangeIndex or not depending on how it
> was created.
> - The metadata is not always enough (or not updated) to reconstruct the
> index when the table has been modified / subsetted.
> For example, ARROW-5138: retrieving a single row group from parquet file
> doesn't restore index properly (since the RangeIndex metadata was for the
> full table, not this subset).
> And another one, ARROW-5139: empty column selection no longer restores
> index.
> I think we should decide whether we want to try to fix those (or give an
> option to avoid those issues), or close those as "won't fix".
> One idea I had that could potentially alleviate some of those issues:
> - Make it possible for the user to still force actual serialization of the
> index, always, even if it is a RangeIndex.
> - To avoid introducing a new option, we could reuse the {{preserve_index}}
> keyword: change the default to None (which means the current behaviour), and
> change {{True}} to mean "always serialize" (although this is not fully
> backwards compatible with 0.13.0 for those users who explicitly specified the
> keyword).
> I am not sure this is worth the added complexity (although I personally like
> providing the option where the index is simply always serialized as columns,
> without surprises). But ideally we decide on it for 0.14, to either fix or
> close the mentioned issues.
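
The three-state {{preserve_index}} proposal quoted above can be sketched in plain Python. This is only an illustration of the described semantics, not pyarrow code: the helper name `index_storage` and its return strings are hypothetical, and the meaning of {{False}} (index not stored at all) is an assumption, since the description only spells out {{None}} and {{True}}.

```python
from typing import Optional


def index_storage(is_range_index: bool,
                  preserve_index: Optional[bool] = None) -> str:
    """Say how a DataFrame index would be stored under the proposal.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    if preserve_index is None:
        # Proposed default, matching the 0.13 behaviour: a RangeIndex is
        # kept only as metadata, any other index is written as a column.
        return "metadata" if is_range_index else "column"
    if preserve_index:
        # Proposed meaning of True: always serialize the index as a column,
        # even when it is a RangeIndex.
        return "column"
    # Assumption: False means the index is not stored at all.
    return "dropped"


if __name__ == "__main__":
    for is_range in (True, False):
        for option in (None, True, False):
            print(is_range, option, index_storage(is_range, option))
```

With {{True}}, the index is a real column and survives subsetting, as in the row-group case of ARROW-5138, at the cost of storing the values explicitly.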