Thanks Wes.  Makes sense that compute expressions are the way to go.

I finally found an example of using expressions by searching for "Projection":
https://arrow.apache.org/docs/python/dataset.html

And the result: running the following with and without the output_file column
differs by about 180ms, compared to about 2 seconds for my earlier approach,
so this is definitely 10x faster!  (The source Parquet file has 33MM records.)

The key learning here is that I need to read *and* project simultaneously,
not read *then* project.

import pyarrow.dataset as ds
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Each value in the dict is a lazy expression; nothing is computed until the scan.
projection = {
    'username': ds.field('username'),
    # Bucket usernames by range into output files 0-9; nulls also go to 9.
    'output_file':
        pc.if_else(pc.is_null(ds.field('username')), pc.scalar(9),
        pc.if_else(ds.field('username') < 'Bil', pc.scalar(0),
        pc.if_else(ds.field('username') < 'Cou', pc.scalar(1),
        pc.if_else(ds.field('username') < 'Eve', pc.scalar(2),
        pc.if_else(ds.field('username') < 'IZh', pc.scalar(3),
        pc.if_else(ds.field('username') < 'Kib', pc.scalar(4),
        pc.if_else(ds.field('username') < 'Mat', pc.scalar(5),
        pc.if_else(ds.field('username') < 'Pat', pc.scalar(6),
        pc.if_else(ds.field('username') < 'Sco', pc.scalar(7),
        pc.if_else(ds.field('username') < 'Tok', pc.scalar(8),
                   pc.scalar(9))))))))))),
}

# The projection is applied while the file is read, in a single pass.
t = pq.read_table('/path/to/file', columns=projection)
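
My reading of that docs page is that the same projection dict can also be
passed through the dataset API directly, and that expressions work as scan
filters too.  This is just an untested sketch along those lines:

import pyarrow.dataset as ds

# Reuses the `projection` dict defined above.
dataset = ds.dataset('/path/to/file', format='parquet')

# Project while scanning: the expressions are evaluated batch by batch
# as the file is read.
t2 = dataset.to_table(columns=projection)

# An expression can also be used as a scan filter, e.g. one bucket only:
bucket_0 = dataset.to_table(columns=['username'],
                            filter=ds.field('username') < 'Bil')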

Best,
Cedric

On Wed, Jun 1, 2022 at 9:04 AM Wes McKinney <[email protected]> wrote:

> > we may need to consider C++ implementations to get the full benefit.
>
> The way that you are executing your expression is not the way that we
> intend users to write performant code.
>
> The intended way is to build an expression and then execute the
> expression — the current expression evaluator is not very efficient
> (it does not reuse temporary allocations yet), but that is where the
> optimization work should happen rather than in custom C++ code.
>
> I'm struggling to find documentation for this right now (expressions
> aren't discussed in https://arrow.apache.org/docs/python/compute.html)
> but there are many examples in the unit tests:
>
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py
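
For the archives, the pattern I took away from those tests is build first,
execute later; nothing is computed until the expression is handed to a scan.
A rough sketch:

import pyarrow.dataset as ds
import pyarrow.compute as pc

# Build: this only constructs an expression tree; no data is touched yet.
expr = pc.if_else(ds.field('username') < 'Bil', pc.scalar(0), pc.scalar(1))
print(expr)  # prints a description of the expression, not values

# Execute: the expression is evaluated when a dataset scan uses it,
# here as a projected column.
dataset = ds.dataset('/path/to/file', format='parquet')
t = dataset.to_table(columns={'username': ds.field('username'),
                              'bucket': expr})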
>
> On Tue, May 31, 2022 at 11:25 PM Cedric Yau <[email protected]> wrote:
> >
> > I have code like below to range-partition a file.  It looks like each of
> > the pc.less, pc.cast, and pc.add calls allocates a new array, so the code
> > appears to spend more time on memory allocation than on the comparisons
> > themselves.  The performance is still pretty good (and faster than the
> > alternatives), but it does make me think that as we start shifting more
> > calculations to Arrow, we may need to consider C++ implementations to get
> > the full benefit.
> >
> > Thanks,
> > Cedric
> >
> > import pyarrow.parquet as pq
> > import pyarrow.compute as pc
> >
> > t = pq.read_table('/path/to/file', columns=['username'])
> > ta = t.column('username')
> >
> > output_file = pc.cast( pc.less(ta,'Bil'), 'int8')
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Cou'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Eve'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Ish'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Kib'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Mat'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Pat'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Sco'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Tok'), 'int8'))
> > output_file = pc.coalesce(output_file,9)
> >
> > On Tue, May 31, 2022 at 2:25 PM Weston Pace <[email protected]> wrote:
> >>
> >> I'd be more interested in some kind of buffer / array pool plus the
> >> ability to specify an output buffer for a kernel function.  I think it
> >> would achieve the same goal (avoiding allocation) with more
> >> flexibility (e.g. you wouldn't have to overwrite your input buffer).
> >>
> >> At the moment though I wonder if this is a concern.  Jemalloc should
> >> do some level of memory reuse.  Is there a specific performance issue
> >> you are encountering?
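
As a side note, the per-call allocation is easy to see with the memory pool
counters.  A rough sketch (exact numbers depend on the allocator):

import pyarrow as pa
import pyarrow.compute as pc

a = pa.array(range(1_000_000))   # int64 values, roughly 8 MB of data

before = pa.total_allocated_bytes()
b = pc.add(a, 1)                 # the result lands in a newly allocated array
after = pa.total_allocated_bytes()

print(after - before)            # on the order of 8 MB for 1MM int64 values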
> >>
> >> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <[email protected]> wrote:
> >> >
> >> > *In principle*, it would be possible to provide mutable output buffers
> >> > for a kernel's execution, so that input and output buffers could be
> >> > the same (essentially exposing the lower-level kernel execution
> >> > interface that underlies arrow::compute::CallFunction). But this would
> >> > be a fair amount of development work to achieve. If there are others
> >> > interested in exploring an implementation, we could create a Jira
> >> > issue.
> >> >
> >> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <[email protected]> wrote:
> >> > >
> >> > > I think even in Cython this might be difficult, as Array data
> >> > > structures are generally considered immutable, so mutating them in
> >> > > place is inherently unsafe and requires great care.
> >> > >
> >> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <[email protected]> wrote:
> >> > >>
> >> > >> Suppose I have an array with 1MM integers and I add 1 to them with
> >> > >> pyarrow.compute.add.  It looks like a new array is allocated and
> >> > >> assigned.
> >> > >>
> >> > >> Is there a way to do this in place?  Would Cython be required at
> >> > >> this point?
> >> > >>
> >> > >> ```
> >> > >> import pyarrow as pa
> >> > >> import pyarrow.compute as pc
> >> > >>
> >> > >> a = pa.array(range(1000000))
> >> > >> print(id(a))
> >> > >> a = pc.add(a,1)
> >> > >> print(id(a))
> >> > >>
> >> > >> # output
> >> > >> # 139634974909024
> >> > >> # 139633492705920
> >> > >> ```
> >> > >>
> >> > >> Thanks,
> >> > >> Cedric
> >
> >
> >
>
