jorisvandenbossche commented on issue #14229:
URL: https://github.com/apache/arrow/issues/14229#issuecomment-1262049136
> I believe you saying that the number of columns shouldn't matter but somehow it is strange that I do not get an overflow error if I only load a single column subset of the dataframe ...

Does loading a single-column subset work only for some columns, or does it work for all of them? (You could test that with a loop over the columns.) There might be one specific column that is triggering the error.
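A small sketch of such a loop (the file name `data.parquet` is just a placeholder for the file that fails to read):

```python
import pandas as pd
import pyarrow.parquet as pq

path = "data.parquet"  # placeholder: path to the file that fails to read

# Get the column names from the file metadata without reading any data.
columns = pq.read_schema(path).names

# Try reading each column on its own to see which (if any) trigger the error.
for col in columns:
    try:
        pd.read_parquet(path, columns=[col])
    except Exception as exc:
        print(f"column {col!r} failed: {exc}")
```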
> But I could share the resulting dataframe that triggers the error when trying to read, although its 14GB large ...

That's indeed a bit large. Could you first check whether you still get the error after reducing the file size a bit? For example, if you save only half of the columns to the file, do you still see the issue on read? Or if you keep only 50% or 75% of the rows? (A rough sketch of what I mean follows below.)
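This assumes the original DataFrame (`df` below) is still available in memory before writing; the output file names are placeholders:

```python
import pandas as pd

# `df` is assumed to be the original DataFrame that produced the problematic file.

# Write only the first half of the columns and check whether reading still fails.
half_cols = df.columns[: len(df.columns) // 2]
df[half_cols].to_parquet("data_half_columns.parquet")
pd.read_parquet("data_half_columns.parquet")  # does this still error?

# Likewise, keep only 50% of the rows (all columns).
df.iloc[: len(df) // 2].to_parquet("data_half_rows.parquet")
pd.read_parquet("data_half_rows.parquet")  # does this still error?
```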
As for creating a script to generate the data instead, one approach that _might_ reproduce the issue is to take a tiny sample of the data, save it, and then check whether a script that builds a large file from that sample still triggers the error. For example:
```python
import pandas as pd

# Small representative sample of the data, saved beforehand.
subset = pd.read_parquet("data_subset.parquet")

# Adjust the repetition factor so the generated file is large enough to reproduce the issue.
df = pd.concat([subset] * 1000, ignore_index=True)
df.to_parquet("data.parquet")

# does this still error?
pd.read_parquet("data.parquet")
```