danepitkin commented on issue #38260:
URL: https://github.com/apache/arrow/issues/38260#issuecomment-1761794858
I can reproduce this locally. With pyarrow (13.0.0) and numpy (1.26.0) held
fixed, the slowdown only appears after upgrading Python and pandas together
(3.9.18 / 1.5.3 to 3.12.0 / 2.1.1). I'm not sure what the root cause is,
though.
```
$ python arrow-38260.py 20000
python version: 3.9.18
pyarrow version: 13.0.0
pandas version: 1.5.3
numpy version: 1.26.0
Conversion from pandas to pyarrow took 1.114816792 seconds for 20000 columns
$ python arrow-38260.py 40000
python version: 3.9.18
pyarrow version: 13.0.0
pandas version: 1.5.3
numpy version: 1.26.0
Conversion from pandas to pyarrow took 2.4374076250000005 seconds for 40000 columns
$ python arrow-38260.py 20000
python version: 3.12.0
pyarrow version: 13.0.0
pandas version: 2.1.1
numpy version: 1.26.0
Conversion from pandas to pyarrow took 5.036314583034255 seconds for 20000 columns
$ python arrow-38260.py 40000
python version: 3.12.0
pyarrow version: 13.0.0
pandas version: 2.1.1
numpy version: 1.26.0
Conversion from pandas to pyarrow took 19.435286541993264 seconds for 40000 columns
```
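As a rough check on how the time scales with column count, the ratios between the 40000-column and 20000-column runs can be computed from the numbers above. This is only a sketch: the `timings` dict and the `doubling_ratio` helper are names made up for illustration, and two data points per environment are not enough to establish a complexity class.

```python
# Timings copied from the runs above: {(pandas version, columns): seconds}
timings = {
    ("1.5.3", 20000): 1.114816792,
    ("1.5.3", 40000): 2.4374076250000005,
    ("2.1.1", 20000): 5.036314583034255,
    ("2.1.1", 40000): 19.435286541993264,
}


def doubling_ratio(version):
    """Return time(40000 columns) / time(20000 columns) for one pandas version."""
    return timings[(version, 40000)] / timings[(version, 20000)]


for version in ("1.5.3", "2.1.1"):
    print(f"pandas {version}: {doubling_ratio(version):.2f}x time for 2x columns")
```

A ratio near 2 would be consistent with roughly linear scaling in the number of columns, while a ratio near 4 would be consistent with roughly quadratic scaling. Here the pandas 1.5.3 runs come out close to 2 and the pandas 2.1.1 runs close to 4.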
Here's the modified version of the above script I used:
```
import argparse
import platform
import timeit

import numpy as np
import pandas as pd
import pyarrow as pa

# Number of DataFrame columns to generate, passed on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("num_cols", type=int)
args = parser.parse_args()
num_cols = args.num_cols

# Build a (num_dates x num_cols) integer frame indexed by business days.
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

# Time only the pandas -> pyarrow conversion.
tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic

print(f'python version: {platform.python_version()}')
print(f'pyarrow version: {pa.__version__}')
print(f'pandas version: {pd.__version__}')
print(f'numpy version: {np.__version__}')
print(f'Conversion from pandas to pyarrow took {total_time} seconds for {num_cols} columns')
```
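One possible way to narrow down the root cause would be to profile the conversion in both environments and compare where the time goes. The following is a minimal sketch using the standard-library `cProfile` and `pstats` modules; `profile_top` is a hypothetical helper name, not something from pyarrow or pandas, and the demo workload at the bottom is a stand-in so the snippet runs without either library installed.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile and return a text report of
    the `limit` entries with the highest cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in workload (assumption: replace with the real conversion call).
    print(profile_top(sorted, range(100_000), key=lambda x: -x))
```

To apply this to the script above, the timed call could be replaced with `print(profile_top(pa.Table.from_pandas, df, preserve_index=True))`, and the two reports compared between the pandas 1.5.3 and pandas 2.1.1 environments.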