Thanks for the detailed reproducer. I've added a few notes on the JIRA that I hope will help.
On Tue, Sep 20, 2022, 5:10 AM 1057445597 <[email protected]> wrote:

> I re-uploaded a copy of the code that can be compiled and run in
> join_test.zip, including cmakelists.txt, the test data files and the Python
> code that generated the test files. There is also Python code to view the
> data files. You will need to compile Arrow 9.0 yourself.
>
> ------------------------------
> 1057445597
> [email protected]
>
> ------------------ Original Message ------------------
> *From:* "user" <[email protected]>;
> *Sent:* Thursday, September 15, 2022, 10:27 PM
> *To:* "user" <[email protected]>;
> *Subject:* Re: [c++][compute]Is there any other way to use Join besides Acero?
>
> this jira
>
> https://issues.apache.org/jira/browse/ARROW-17740
>
> ------------------------------
> 1057445597
> [email protected]
>
> ------------------ Original Message ------------------
> *From:* "user" <[email protected]>;
> *Sent:* Thursday, September 15, 2022, 12:15 PM
> *To:* "user" <[email protected]>;
> *Subject:* Re: [c++][compute]Is there any other way to use Join besides Acero?
>
> Within Arrow-C++ that is the only way I am aware of. You might be able to
> use DuckDB. It should be able to scan Parquet files.
>
> Is this the same program that you shared before? Were you able to figure
> out threading? Can you create a JIRA with some sample input files and a
> reproducible example?
>
> On Wed, Sep 14, 2022 at 5:14 PM 1057445597 <[email protected]> wrote:
>
>> Acero performs poorly, and coredump occurs frequently!
>>
>> In the scenario I'm working on, I'll read one Parquet file and then
>> several other Parquet files. These files will have the same column name
>> (UUID). I need to join (by UUID), project (remove UUID), and filter (some
>> custom filtering) the results of the two reads. I found that Acero could
>> only be used to do join, but when I tested it, Acero performance was very
>> poor and very unstable, coredump often happened. Is there another way? Or
>> just another way to do a join!
>>
>> ------------------------------
>> 1057445597
>> [email protected]
