Thanks, Wes. Makes sense that compute expressions are the way to go.
I finally found the use of expressions when searching for "Projection":
https://arrow.apache.org/docs/python/dataset.html

Running the following with and without output_file shows a difference of
about 180ms, compared to about 2 seconds before, so this is definitely
10x faster! (The source parquet file has 33MM records.) The key learning
here is that I need to read *and* project simultaneously, not read *then*
project.

import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

projection = {
    'username': ds.field('username'),
    'output_file': pc.if_else(pc.is_null(ds.field('username')), pc.scalar(9),
        pc.if_else(ds.field('username') < 'Bil', pc.scalar(0),
        pc.if_else(ds.field('username') < 'Cou', pc.scalar(1),
        pc.if_else(ds.field('username') < 'Eve', pc.scalar(2),
        pc.if_else(ds.field('username') < 'IZh', pc.scalar(3),
        pc.if_else(ds.field('username') < 'Kib', pc.scalar(4),
        pc.if_else(ds.field('username') < 'Mat', pc.scalar(5),
        pc.if_else(ds.field('username') < 'Pat', pc.scalar(6),
        pc.if_else(ds.field('username') < 'Sco', pc.scalar(7),
        pc.if_else(ds.field('username') < 'Tok', pc.scalar(8),
                   pc.scalar(9)))))))))))
}

t = pq.read_table('/path/to/file', columns=projection)

Best,
Cedric

On Wed, Jun 1, 2022 at 9:04 AM Wes McKinney <[email protected]> wrote:
>
> > we may need to consider C++ implementations to get full benefit.
>
> The way that you are executing your expression is not the way that we
> intend users to write performant code.
>
> The intended way is to build an expression and then execute the
> expression — the current expression evaluator is not very efficient
> (it does not reuse temporary allocations yet), but that is where the
> optimization work should happen rather than in custom C++ code.
>
> I'm struggling to find documentation for this right now (expressions
> aren't discussed in https://arrow.apache.org/docs/python/compute.html),
> but there are many examples in the unit tests:
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py
>
> On Tue, May 31, 2022 at 11:25 PM Cedric Yau <[email protected]> wrote:
> >
> > I have code like below to range-partition a file. It looks like each of
> > the pc.less, pc.cast, and pc.add calls allocates a new array, so the code
> > appears to spend more time performing memory allocations than it does
> > doing the comparisons. The performance is still pretty good (and faster
> > than the alternatives), but it does make me think that as we start
> > shifting more calculations to Arrow, we may need to consider C++
> > implementations to get full benefit.
> >
> > Thanks,
> > Cedric
> >
> > import pyarrow.parquet as pq
> > import pyarrow.compute as pc
> >
> > t = pq.read_table('/path/to/file', columns=['username'])
> > ta = t.column('username')
> >
> > output_file = pc.cast(pc.less(ta, 'Bil'), 'int8')
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Cou'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Eve'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Ish'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Kib'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Mat'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Pat'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Sco'), 'int8'))
> > output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Tok'), 'int8'))
> > output_file = pc.coalesce(output_file, 9)
> >
> > On Tue, May 31, 2022 at 2:25 PM Weston Pace <[email protected]> wrote:
> >>
> >> I'd be more interested in some kind of buffer / array pool plus the
> >> ability to specify an output buffer for a kernel function. I think it
> >> would achieve the same goal (avoiding allocation) with more
> >> flexibility (e.g. you wouldn't have to overwrite your input buffer).
> >>
> >> At the moment, though, I wonder if this is a concern. Jemalloc should
> >> do some level of memory reuse. Is there a specific performance issue
> >> you are encountering?
> >>
> >> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <[email protected]> wrote:
> >> >
> >> > *In principle*, it would be possible to provide mutable output buffers
> >> > for a kernel's execution, so that input and output buffers could be
> >> > the same (essentially exposing the lower-level kernel execution
> >> > interface that underlies arrow::compute::CallFunction). But this would
> >> > be a fair amount of development work to achieve. If there are others
> >> > interested in exploring an implementation, we could create a Jira
> >> > issue.
> >> >
> >> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <[email protected]> wrote:
> >> > >
> >> > > I think even in Cython this might be difficult, as Array data
> >> > > structures are generally considered immutable, so this is inherently
> >> > > unsafe and would require care.
> >> > >
> >> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <[email protected]> wrote:
> >> > >>
> >> > >> Suppose I have an array with 1MM integers and I add 1 to them with
> >> > >> pyarrow.compute.add. It looks like a new array is allocated. Is
> >> > >> there a way to do this in place, or would Cython be required at
> >> > >> this point?
> >> > >>
> >> > >> ```
> >> > >> import pyarrow as pa
> >> > >> import pyarrow.compute as pc
> >> > >>
> >> > >> a = pa.array(range(1000000))
> >> > >> print(id(a))
> >> > >> a = pc.add(a, 1)
> >> > >> print(id(a))
> >> > >>
> >> > >> # output
> >> > >> # 139634974909024
> >> > >> # 139633492705920
> >> > >> ```
> >> > >>
> >> > >> Thanks,
> >> > >> Cedric
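As a side note, here is a minimal sketch (not part of the thread itself) that makes the allocation behavior behind the original question visible by watching the default memory pool; the 1MM-element array mirrors the example above, and everything else is standard pyarrow API.

```
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array(range(1_000_000))       # roughly 8 MB of int64 values
before = pa.total_allocated_bytes()

b = pc.add(a, 1)                     # 'a' is untouched; 'b' is a new array

after = pa.total_allocated_bytes()
print(after - before)                # on the order of another 8 MB allocated
```

This is why chaining many kernel calls, as in the range-partitioning snippet above, spends time in the allocator: each intermediate result gets its own buffers.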

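And here is a minimal, self-contained sketch of the read-and-project-at-once pattern the thread converges on, using pyarrow.dataset expressions evaluated during the scan. The file name 'data.parquet' and the single 'M' cutoff are placeholders for illustration; the actual use case above uses ten cutoffs and a parquet file with 33MM rows.

```
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset('data.parquet', format='parquet')

# Build expressions only; nothing is evaluated at this point.
projection = {
    'username': ds.field('username'),
    'output_file': pc.if_else(ds.field('username') < 'M',
                              pc.scalar(0), pc.scalar(1)),
}

# The expressions are evaluated while the file is scanned, so the
# projection happens in the same pass as the read.
table = dataset.to_table(columns=projection)
```

Dataset.to_table accepts a dict mapping output column names to expressions, which is what lets the projection ride along with the read instead of materializing the column first and transforming it afterwards.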