Hi!

Why don't you use the Arrow Table join directly?

https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join

You need to be careful with the join order, though, as speed may differ
depending on the order of the joined tables.

BR,

Jacek


On Thu, 15 Sep 2022 at 06:15, Weston Pace <[email protected]> wrote:

> Within Arrow C++, that is the only way I am aware of. You might be able to
> use DuckDB. It should be able to scan Parquet files.
>
> Is this the same program that you shared before?  Were you able to figure
> out threading?  Can you create a JIRA with some sample input files and a
> reproducible example?
>
> On Wed, Sep 14, 2022 at 5:14 PM 1057445597 <[email protected]> wrote:
>
>> Acero performs poorly, and core dumps occur frequently!
>>
>> In the scenario I'm working on, I read one Parquet file and then
>> several other Parquet files. These files all have a column with the same
>> name (UUID). I need to join (by UUID), project (remove UUID), and filter
>> (some custom filtering) the results of the two reads. I found that only
>> Acero could be used to do the join, but when I tested it, its performance
>> was very poor and very unstable, and core dumps happened often. Is there
>> another way? Or just another way to do a join?
>>
>>
>> ------------------------------
>> 1057445597
>> [email protected]
>
