danepitkin commented on issue #38260:
URL: https://github.com/apache/arrow/issues/38260#issuecomment-1761794858

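   I am able to reproduce it locally. The slowdown appears only after upgrading
   Python and pandas (3.9.18 / 1.5.3 to 3.12.0 / 2.1.1); the pyarrow and numpy
   versions are the same in both runs. I'm not sure what the root cause is, though.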
   
   ```
   $ python arrow-38260.py 20000
   python version:  3.9.18
   pyarrow version: 13.0.0
   pandas version:  1.5.3
   numpy version:   1.26.0
   Conversion from pandas to pyarrow took 1.114816792 seconds for 20000 columns
   $ python arrow-38260.py 40000
   python version:  3.9.18
   pyarrow version: 13.0.0
   pandas version:  1.5.3
   numpy version:   1.26.0
   Conversion from pandas to pyarrow took 2.4374076250000005 seconds for 40000 columns
   
   $ python arrow-38260.py 20000
   python version:  3.12.0
   pyarrow version: 13.0.0
   pandas version:  2.1.1
   numpy version:   1.26.0
   Conversion from pandas to pyarrow took 5.036314583034255 seconds for 20000 columns
   $ python arrow-38260.py 40000
   python version:  3.12.0
   pyarrow version: 13.0.0
   pandas version:  2.1.1
   numpy version:   1.26.0
   Conversion from pandas to pyarrow took 19.435286541993264 seconds for 40000 columns
   ```
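
As a rough way to read the numbers above, the sketch below estimates how the conversion time scales with the column count in each environment. The helper name `scaling_exponent` is made up for illustration, and with only two data points per environment the result is indicative at best. On these timings it comes out at roughly 1.1 for the older environment and roughly 1.9 for the newer one, which would be consistent with close-to-linear versus close-to-quadratic growth.

```python
import math

# Timings copied from the runs above: (number of columns, seconds).
runs = {
    "python 3.9.18 / pandas 1.5.3": [(20000, 1.114816792), (40000, 2.4374076250000005)],
    "python 3.12.0 / pandas 2.1.1": [(20000, 5.036314583034255), (40000, 19.435286541993264)],
}


def scaling_exponent(small, large):
    """Return k such that time grows roughly like columns**k between two runs.

    Hypothetical helper for illustration; it fits a power law through
    exactly two points, so it cannot tell noise from a real trend.
    """
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


exponents = {env: scaling_exponent(*points) for env, points in runs.items()}
for env, k in exponents.items():
    print(f"{env}: time ~ columns^{k:.2f}")
```

More column counts (for example 10000 and 80000) would be needed before drawing any firm conclusion about the growth rate.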
   
   
   Here's the modified version of the above script I used:
   ```
   import argparse
   import platform
   import timeit
   
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   
   parser = argparse.ArgumentParser()
   parser.add_argument("num_cols", type=int)
   args = parser.parse_args()
   
   num_cols = args.num_cols
   num_dates = 8800
   dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
   data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
   df = pd.DataFrame(data, index=dates)
   
   tic = timeit.default_timer()
   pa.Table.from_pandas(df, preserve_index=True)
   total_time = timeit.default_timer() - tic
   print(f'python version:  {platform.python_version()}')
   print(f'pyarrow version: {pa.__version__}')
   print(f'pandas version:  {pd.__version__}')
   print(f'numpy version:   {np.__version__}')
   print(f'Conversion from pandas to pyarrow took {total_time} seconds for {num_cols} columns')
   ```
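
Since the root cause is still unknown, one possible next step is to profile the conversion and see where the time goes. The sketch below is an assumption about how that could be done, not something that has been run for this issue: `profile_call` is a hypothetical helper built only on the standard-library `cProfile` and `pstats` modules.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time. Hypothetical helper; `top` is
    reserved here and is not forwarded to func.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

In the script above, the timed call could be replaced with `_, report = profile_call(pa.Table.from_pandas, df, preserve_index=True)` followed by `print(report)`, run once in each environment so the two reports can be compared. Note that cProfile only sees Python-level calls, so time spent inside pyarrow's C++ code would show up as a single entry.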


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
