Optimising pandas relational ops with pyarrow

Ivan Petrov Fri, 01 Jan 2021 09:24:19 -0800

Hi!
I plan to:
- join
- group by
- filter
data using pyarrow (I am new to it). The idea is to get better performance and
memory utilisation (Apache Arrow columnar compression) compared to pandas.
It seems that pyarrow has no support for joining two Tables / Datasets by key,
so I have to fall back to pandas.
I don’t really follow how the pyarrow <-> pandas integration works. Will pandas
rely on the Apache Arrow data structures? I’m fine with using only these flat
types for columns to avoid "corner cases":
- string
- int
- long
- decimal


I have a feeling that pandas will copy all the data from Apache Arrow and
double the memory footprint (according to the docs). Did I get it right?
What is the right way to join, group by and filter several "Tables" /
"Datasets" while utilizing the power of pyarrow (and the underlying Apache Arrow)?

Thank you!
