I think we would want to implement a scalar "list_isin" function as a core C++ function, so the type signature looks like this:
(Array<List<T>>, Scalar<T>) -> Array<Boolean> I couldn't find an issue like this with a quick Jira search so I created https://issues.apache.org/jira/browse/ARROW-12849 On Fri, May 21, 2021 at 8:06 AM Elad Rosenheim <[email protected]> wrote: > > Hi! > > One of the gaps I currently have in Funnel Rocket > (https://github.com/DynamicYieldProjects/funnel-rocket) is supporting nested > columns, as in: given a Parquet file with a column of type List(int64), be > able to find rows where the list holds a specific int element. > > Right now, the need is fortunately limited to lists of primitives (mostly > int) and maps of string->string, rather than any arbitrary complexity. > > Currently, I load Parquet files via pyarrow, then call to_pandas() and run > multiple filters on the DataFrame. > > After reading Uwe's blog post > (https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html) > and looking at the Fletcher project (https://github.com/xhochy/fletcher), > seems the "proper" way to do it would be: > > * Write an ExtensionDType/ExtensionArray can wrap an arrow ChunkedArray made > of ListArrays. Not even sure what the operator should be for lookup in a list > - should I treat a list_series==123 as "for each list in this series, look > for the element 123 in it?". > > * Potentially use a @jitclass for more performant lookup, as Uwe has > outlined. > > * For now, for any abstract method I'm not sure what to do with - start with > raising an exception, then run some unit tests based on my project's needs, > and see that they pass :-/ > > * When calling Table.to_pandas(), supply a type mapper argument to map the > specific supported types to the appropriate extension class. > > * If it seems to work, figure out if I've missed something important in the > concrete classes :-/ > > Am I getting this right, more or less? > > Thanks a lot, > Elad
