Hi!

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.join

However, in my case I want to stay within memory, and I found an ugly
workaround: unifying the dictionaries and then building the final column
with pa.DictionaryArray.from_arrays.

BR,

Jacek

On Tue, 16 Apr 2024 at 22:16, PASSWORD ADMINISTRATOR <ultimatepwdmas...@gmail.com>
wrote:

> Can we join on a "dataset" yet using pyarrow? What I mean is: my parquet
> file is larger than memory; can I read it using the dataset API and join it
> with another dataset or in-memory table? If yes, I couldn't find it in the
> documentation. Can you please guide me on how to do that join?
>
> On Tue, Apr 16, 2024, 9:59 PM Ruoxi Sun <zanmato1...@gmail.com> wrote:
>
>> Hi Jacek,
>>
>> I recall an issue raising a similar concern [1] that I was trying to
>> answer; I hope that can help.
>>
>> Besides, if you do the join in parallel, e.g. by directly calling the acero
>> API in C++ with a parallel source node, there is another level of
>> uncertainty in the order of output rows, depending on when each thread
>> finishes.
>>
>> I think acero is a kind of SQL-like query engine. So, though not
>> explicitly documented, it follows the ordering convention of SQL: no order
>> guarantee unless specified using `order by`.
>>
>> [1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692
>>
>> Thanks.
>>
>> *Regards,*
>> *Rossi SUN*
>>
>>
>> On Tue, 16 Apr 2024 at 23:34, Weston Pace <weston.p...@gmail.com> wrote:
>>
>>> > Can someone confirm it?
>>>
>>> I can confirm that the current join implementation will potentially
>>> reorder input.  The larger the input, the more likely reordering
>>> becomes.
>>>
>>> > I think that ordering is only guaranteed if it has been sorted.
>>>
>>> Close enough, probably.  I think there is an implicit order (the order
>>> defined by the files in the dataset and the rows in those files, or the
>>> original order when the input is in memory) that will be respected if there
>>> are no joins or aggregates.
>>>
>>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin <octalene....@pm.me> wrote:
>>>
>>>> I think that ordering is only guaranteed if it has been sorted.
>>>>
>>>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka <jacek.plis...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I just hit some very strange behaviour.
>>>>
>>>> I am joining two tables with a "left outer" join.
>>>>
>>>> Naively, I would expect the output rows to match the order of the
>>>> left table.
>>>>
>>>> But sometimes the order of the rows is different ...
>>>>
>>>> Can someone confirm it?
>>>>
>>>> I would expect this to be mentioned in the docs.
>>>>
>>>> I am using 12.0.1 due to a Python 3.7 dependency.
>>>>
>>>> Best Regards,
>>>>
>>>> Jacek Pliszka