Hi! https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.join
However, in my case I want to stay within memory, and I found an ugly
workaround: unifying dictionaries and then building the final column with
pa.DictionaryArray.from_arrays

BR,

Jacek

On Tue, 16 Apr 2024 at 22:16, PASSWORD ADMINISTRATOR <ultimatepwdmas...@gmail.com> wrote:

> Can we join on a "dataset" yet using pyarrow? What I mean is, my parquet
> file, which is larger than memory, can I read it using dataset API and join
> with other dataset/in memory table? If yes, I couldn't find it in
> documentation, can you please guide how to do that join
>
> On Tue, Apr 16, 2024, 9:59 PM Ruoxi Sun <zanmato1...@gmail.com> wrote:
>
>> Hi Jacek,
>>
>> I recall an issue with a similar concern [1] that I was trying to answer,
>> hope that can help.
>>
>> Besides, if you do the join in parallel, e.g. by directly calling acero
>> API in C++ and the source node is parallel, there is another level of
>> uncertainty in the order of output rows, depending on when each
>> thread finishes.
>>
>> I think acero is kind of a SQL-like query engine. So, though not
>> explicitly documented, it follows the order convention of SQL - no order
>> guarantee unless specified using `order by`.
>>
>> [1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692
>>
>> Thanks.
>>
>> *Regards,*
>> *Rossi SUN*
>>
>>
>> On Tue, Apr 16, 2024 at 23:34, Weston Pace <weston.p...@gmail.com> wrote:
>>
>>> > Can someone confirm it?
>>>
>>> I can confirm that the current join implementation will potentially
>>> reorder input. The larger the input, the more likely reordering
>>> becomes.
>>>
>>> > I think that ordering is only guaranteed if it has been sorted.
>>>
>>> Close enough probably. I think there is an implicit order (the order
>>> defined by the files in the dataset and the rows in those files, or the
>>> original order when the input is in memory) that will be respected if there
>>> are no joins or aggregates.
>>>
>>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin <octalene....@pm.me> wrote:
>>>
>>>> I think that ordering is only guaranteed if it has been sorted.
>>>>
>>>> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>>>>
>>>>
>>>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka <jacek.plis...@gmail.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I just hit a very strange behaviour.
>>>>
>>>> I am joining two tables with "left outer" join.
>>>>
>>>> Naively I would expect that the output rows will match the order of the
>>>> left table.
>>>>
>>>> But sometimes the order of rows is different ...
>>>>
>>>> Can someone confirm it?
>>>>
>>>> I would expect this would be mentioned in the docs.
>>>>
>>>> I am using 12.0.1 due to Python 3.7 dependency.
>>>>
>>>> Best Regards,
>>>>
>>>> Jacek Pliszka