jorisvandenbossche commented on issue #14229:
URL: https://github.com/apache/arrow/issues/14229#issuecomment-1262049136
> I believe you saying that the number of columns shouldn't matter but somehow it is strange that I do not get an overflow error if I only load a single column subset of the dataframe ...

Does loading a single-column subset work only for some columns, or does it work for all of them? (You could test that with a loop over the columns.) There might be one specific column that is triggering the error.
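A small sketch of such a loop (the file name `data.parquet` is just a placeholder for the file that fails to read):

```python
import pandas as pd
import pyarrow.parquet as pq

path = "data.parquet"  # placeholder: path to the file that fails to read

# Get the column names from the file metadata without reading any data.
columns = pq.read_schema(path).names

# Try reading each column on its own to see which (if any) trigger the error.
for col in columns:
    try:
        pd.read_parquet(path, columns=[col])
    except Exception as exc:
        print(f"column {col!r} failed: {exc}")
```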
> But I could share the resulting dataframe that triggers the error when trying to read, although its 14GB large ...

That's indeed a bit large. Could you first check whether you still get the error after reducing the file size a bit? For example, if you save only half of the columns to the file, do you still see the issue on read? Or if you keep only 50% or 75% of the rows? (A rough sketch of what I mean follows below.)
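This assumes the original DataFrame (`df` below) is still available in memory before writing; the output file names are placeholders:

```python
import pandas as pd

# `df` is assumed to be the original DataFrame that produced the problematic file.

# Write only the first half of the columns and check whether reading still fails.
half_cols = df.columns[: len(df.columns) // 2]
df[half_cols].to_parquet("data_half_columns.parquet")
pd.read_parquet("data_half_columns.parquet")  # does this still error?

# Likewise, keep only 50% of the rows (all columns).
df.iloc[: len(df) // 2].to_parquet("data_half_rows.parquet")
pd.read_parquet("data_half_rows.parquet")  # does this still error?
```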
As for creating a script to generate the data instead, one approach that _might_ reproduce the issue is to take a tiny sample of the data, save it, and then check whether a script that builds a large file from that sample still triggers the error. For example:
```python
import pandas as pd

# Small representative sample of the data, saved beforehand.
subset = pd.read_parquet("data_subset.parquet")

# Adjust the repetition factor so the generated file is large enough to reproduce the issue.
df = pd.concat([subset] * 1000, ignore_index=True)
df.to_parquet("data.parquet")

# does this still error?
pd.read_parquet("data.parquet")
```