Can we join on a "dataset" yet using pyarrow? That is, if my Parquet file
is larger than memory, can I read it with the dataset API and join it with
another dataset or an in-memory table? If so, I couldn't find it in the
documentation. Could you please explain how to do that join?

On Tue, Apr 16, 2024, 9:59 PM Ruoxi Sun <zanmato1...@gmail.com> wrote:

> Hi Jacek,
>
> I recall an issue with similar concern [1] that I was trying to answer,
> hope that can help.
>
> Besides, if you do the join in parallel, e.g. by directly calling the acero
> API in C++ with a parallel source node, there is another level of
> uncertainty in the order of output rows, depending on when each thread
> finishes.
>
> I think acero is kind of a SQL-like query engine. So, though not
> explicitly documented, it follows the order convention of SQL - no order
> guarantee unless specified using `order by`.
>
> [1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692
>
> Thanks.
>
> *Regards,*
> *Rossi SUN*
>
>
> Weston Pace <weston.p...@gmail.com> wrote on Tue, Apr 16, 2024 at 23:34:
>
>> > Can someone confirm it?
>>
>> I can confirm that the current join implementation will potentially
>> reorder input.  The larger the input, the more likely reordering
>> becomes.
>>
>> > I think that ordering is only guaranteed if it has been sorted.
>>
>> Close enough probably.  I think there is an implicit order (the order
>> defined by the files in the dataset and the rows in those files, or the
>> original order when the input is in memory) that will be respected if there
>> are no joins or aggregates.
>>
>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin <octalene....@pm.me> wrote:
>>
>>> I think that ordering is only guaranteed if it has been sorted.
>>>
>>>
>>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka <jacek.plis...@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> I just hit a very strange behaviour.
>>>
>>> I am joining two tables with "left outer" join.
>>>
>>> Naively I would expect that the output rows will match the order of the
>>> left table.
>>>
>>> But sometimes the order of rows is different ...
>>>
>>> Can someone confirm it?
>>>
>>> I would expect this would be mentioned in the docs.
>>>
>>> I am using 12.0.1 due to Python 3.7 dependency.
>>>
>>> Best Regards,
>>>
>>> Jacek Pliszka
>>>
>>>
>>>
>>>
