Can we join on a "dataset" using pyarrow yet? What I mean is: I have a Parquet file that is larger than memory. Can I read it with the dataset API and join it with another dataset or an in-memory table? If so, I couldn't find it in the documentation. Could you please point me to how to do that join?

On Tue, Apr 16, 2024, 9:59 PM Ruoxi Sun <zanmato1...@gmail.com> wrote:

> Hi Jacek,
>
> I recall an issue with a similar concern [1] that I was trying to answer,
> hope that can help.
>
> Besides, if you do the join in parallel, e.g. by directly calling the acero
> API in C++ with a parallel source node, there is another level of
> uncertainty in the order of output rows, depending on when each
> thread finishes.
>
> I think acero is kind of a SQL-like query engine. So, though not
> explicitly documented, it follows the order convention of SQL - no order
> guarantee unless specified using `order by`.
>
> [1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692
>
> Thanks.
>
> *Regards,*
> *Rossi SUN*
>
>
> Weston Pace <weston.p...@gmail.com> wrote on Tue, Apr 16, 2024 at 23:34:
>
>> > Can someone confirm it?
>>
>> I can confirm that the current join implementation will potentially
>> reorder input. The larger the input, the more likely the reordering.
>>
>> > I think that ordering is only guaranteed if it has been sorted.
>>
>> Close enough probably. I think there is an implicit order (the order
>> defined by the files in the dataset and the rows in those files, or the
>> original order when the input is in memory) that will be respected if there
>> are no joins or aggregates.
>>
>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin <octalene....@pm.me> wrote:
>>
>>> I think that ordering is only guaranteed if it has been sorted.
>>>
>>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka <jacek.plis...@gmail.com>
>>> wrote:
>>>
>>> Hi!
>>>
>>> I just hit a very strange behaviour.
>>>
>>> I am joining two tables with "left outer" join.
>>>
>>> Naively I would expect that the output rows will match the order of the
>>> left table.
>>>
>>> But sometimes the order of rows is different ...
>>>
>>> Can someone confirm it?
>>>
>>> I would expect this would be mentioned in the docs.
>>>
>>> I am using 12.0.1 due to Python 3.7 dependency.
>>>
>>> Best Regards,
>>>
>>> Jacek Pliszka