adrienchaton opened a new issue, #14229:
URL: https://github.com/apache/arrow/issues/14229

   Hello,
   
   I am storing pandas dataframes as .parquet files with DataFrame.to_parquet and then trying to load them back with pd.read_parquet.
   Loading fails with an error that I cannot find a solution for, and I would kindly ask for help in solving it.
   
   Here is the trace:
   
   `  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
       return impl.read(
     File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
       result = self.api.parquet.read_table(
     File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
     File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
       table = self._dataset.to_table(
     File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: List index overflow.`
   
   If I store a small dataframe, I do not get this error.
   If I store a larger dataframe, e.g. one with 295,912,999 rows, then I get this error.
   
   However, before saving it I print the index range, and it is bounded between 0 and 295912998.
   Saving the .parquet with index=True or index=False gives the same error, and I do not understand why there is an overflow on a bounded index ...
   
   Any hints are much appreciated, thanks!
   


