[PR] GH-40720: [Python] Simplify and improve perf of creation of the column names in Table.to_pandas [arrow]

via GitHub Thu, 21 Mar 2024 10:40:58 -0700


jorisvandenbossche opened a new pull request, #40721:
URL: https://github.com/apache/arrow/pull/40721


   ### Rationale for this change
   
   Over the years, `pandas_compat.py` has grown quite complex and accumulated a 
lot of pandas compatibility code, much of which can probably be simplified now 
that old pandas and Python versions are no longer supported.
   
   One place where this applies is the reconstruction of the `.columns` Index 
object of the resulting DataFrame. Right now that always goes through a 
MultiIndex (even for simple column names), which adds considerable overhead in 
the simple case. It also contains some old Python/pandas compat code that could 
be removed.
   
   ### What changes are included in this PR?
   
   Besides simplifying the code, not going through a MultiIndex in the simple 
cases gives a nice speed-up (roughly 3.7x for this small table):
   
   ```python
   # assumes: import pyarrow as pa
   In [1]: table = pa.table({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3], 'c': [3, 4, 5]})
   
   In [2]: %timeit table.to_pandas()
   251 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   # <-- main
   68.1 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <-- PR
   ```
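
To illustrate where the overhead comes from, here is a minimal sketch using only pandas. It is not the actual `pandas_compat.py` code: the helper names `columns_via_multiindex` and `columns_direct` are hypothetical, and the one-level MultiIndex round-trip is only an approximation of what the current code path does for simple column names.

```python
import pandas as pd


def columns_via_multiindex(names):
    # Hypothetical stand-in for the old path: build a one-level MultiIndex
    # first, then flatten it back into a regular Index.
    mi = pd.MultiIndex.from_arrays([names])
    return mi.get_level_values(0)


def columns_direct(names):
    # Hypothetical stand-in for the simplified path: build the Index directly.
    return pd.Index(names)


names = ["a", "b", "c"]
via_multi = columns_via_multiindex(names)
direct = columns_direct(names)

# Both routes produce the same column labels; only the construction cost differs.
assert via_multi.equals(direct)
```

Timing the two helpers with `%timeit` should show that the direct route is cheaper, though the exact numbers will depend on the pandas version and the machine.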
   
   ### Are these changes tested?
   
   The extensive existing tests should already cover this.
   
   ### Are there any user-facing changes?
   
   There should not be any.

