Re: rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
Hi!

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.join

However, in my case I want to stay in memory, and I found an ugly
workaround: unifying the dictionaries and then building the final column
with pa.DictionaryArray.from_arrays.

BR,

Jacek






On Tue, 16 Apr 2024 at 22:16, PASSWORD ADMINISTRATOR wrote:

> Can we join on a "dataset" yet using pyarrow? That is, can I read a Parquet
> file that is larger than memory using the dataset API and join it with
> another dataset or an in-memory table? If so, I couldn't find it in the
> documentation; could you please explain how to do that join?
>
> On Tue, Apr 16, 2024, 9:59 PM Ruoxi Sun wrote:
>
>> Hi Jacek,
>>
>> I recall an issue raising a similar concern [1] that I tried to answer;
>> I hope it helps.
>>
>> Besides, if you do the join in parallel, e.g. by calling the Acero API
>> directly in C++ with a parallel source node, there is another level of
>> uncertainty in the order of the output rows, depending on when each
>> thread finishes.
>>
>> I think Acero is a kind of SQL-like query engine. So, though it is not
>> explicitly documented, it follows the SQL ordering convention: no order
>> guarantee unless one is specified using `order by`.
>>
>> [1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692
>>
>> Thanks.
>>
>> *Regards,*
>> *Rossi SUN*
>>
>>
>> On Tue, Apr 16, 2024 at 23:34, Weston Pace wrote:
>>
>>> > Can someone confirm it?
>>>
>>> I can confirm that the current join implementation will potentially
>>> reorder input.  The larger the input the more likely the chance of
>>> reordering.
>>>
>>> > I think that ordering is only guaranteed if it has been sorted.
>>>
>>> Close enough probably.  I think there is an implicit order (the order
>>> defined by the files in the dataset and the rows in those files, or the
>>> original order when the input is in memory) that will be respected if there
>>> are no joins or aggregates.
>>>
>>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote:
>>>
>>>> I think that ordering is only guaranteed if it has been sorted.
>>>>
>>>> Sent from Proton Mail for iOS
>>>>
>>>>
>>>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka wrote:
>>>>
>>>> Hi!
>>>>
>>>> I just hit a very strange behaviour.
>>>>
>>>> I am joining two tables with "left outer" join.
>>>>
>>>> Naively I would expect that the output rows will match the order of the
>>>> left table.
>>>>
>>>> But sometimes the order of rows is different ...
>>>>
>>>> Can someone confirm it?
>>>>
>>>> I would expect this would be mentioned in the docs.
>>>>
>>>> I am using 12.0.1 due to Python 3.7 dependency.
>>>>
>>>> Best Regards,
>>>>
>>>> Jacek Pliszka





Re: rows reshuffled on join

2024-04-16 Thread PASSWORD ADMINISTRATOR
Can we join on a "dataset" yet using pyarrow? That is, can I read a Parquet
file that is larger than memory using the dataset API and join it with
another dataset or an in-memory table? If so, I couldn't find it in the
documentation; could you please explain how to do that join?

On Tue, Apr 16, 2024, 9:59 PM Ruoxi Sun wrote:

> Hi Jacek,
>
> I recall an issue raising a similar concern [1] that I tried to answer;
> I hope it helps.
>
> Besides, if you do the join in parallel, e.g. by calling the Acero API
> directly in C++ with a parallel source node, there is another level of
> uncertainty in the order of the output rows, depending on when each
> thread finishes.
>
> I think Acero is a kind of SQL-like query engine. So, though it is not
> explicitly documented, it follows the SQL ordering convention: no order
> guarantee unless one is specified using `order by`.
>
> [1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692
>
> Thanks.
>
> *Regards,*
> *Rossi SUN*
>
>
> On Tue, Apr 16, 2024 at 23:34, Weston Pace wrote:
>
>> > Can someone confirm it?
>>
>> I can confirm that the current join implementation will potentially
>> reorder input.  The larger the input the more likely the chance of
>> reordering.
>>
>> > I think that ordering is only guaranteed if it has been sorted.
>>
>> Close enough probably.  I think there is an implicit order (the order
>> defined by the files in the dataset and the rows in those files, or the
>> original order when the input is in memory) that will be respected if there
>> are no joins or aggregates.
>>
>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote:
>>
>>> I think that ordering is only guaranteed if it has been sorted.
>>>
>>> Sent from Proton Mail for iOS
>>>
>>>
>>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka wrote:
>>>
>>> Hi!
>>>
>>> I just hit a very strange behaviour.
>>>
>>> I am joining two tables with "left outer" join.
>>>
>>> Naively I would expect that the output rows will match the order of the
>>> left table.
>>>
>>> But sometimes the order of rows is different ...
>>>
>>> Can someone confirm it?
>>>
>>> I would expect this would be mentioned in the docs.
>>>
>>> I am using 12.0.1 due to Python 3.7 dependency.
>>>
>>> Best Regards,
>>>
>>> Jacek Pliszka
>>>
>>>
>>>
>>>


Re: rows reshuffled on join

2024-04-16 Thread Ruoxi Sun
Hi Jacek,

I recall an issue raising a similar concern [1] that I tried to answer;
I hope it helps.

Besides, if you do the join in parallel, e.g. by calling the Acero API
directly in C++ with a parallel source node, there is another level of
uncertainty in the order of the output rows, depending on when each
thread finishes.

I think Acero is a kind of SQL-like query engine. So, though it is not
explicitly documented, it follows the SQL ordering convention: no order
guarantee unless one is specified using `order by`.

[1] https://github.com/apache/arrow/issues/37542#issuecomment-1871692692

Thanks.

*Regards,*
*Rossi SUN*


On Tue, Apr 16, 2024 at 23:34, Weston Pace wrote:

> > Can someone confirm it?
>
> I can confirm that the current join implementation will potentially
> reorder input.  The larger the input the more likely the chance of
> reordering.
>
> > I think that ordering is only guaranteed if it has been sorted.
>
> Close enough probably.  I think there is an implicit order (the order
> defined by the files in the dataset and the rows in those files, or the
> original order when the input is in memory) that will be respected if there
> are no joins or aggregates.
>
> On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote:
>
>> I think that ordering is only guaranteed if it has been sorted.
>>
>> Sent from Proton Mail for iOS
>>
>>
>> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka wrote:
>>
>> Hi!
>>
>> I just hit a very strange behaviour.
>>
>> I am joining two tables with "left outer" join.
>>
>> Naively I would expect that the output rows will match the order of the
>> left table.
>>
>> But sometimes the order of rows is different ...
>>
>> Can someone confirm it?
>>
>> I would expect this would be mentioned in the docs.
>>
>> I am using 12.0.1 due to Python 3.7 dependency.
>>
>> Best Regards,
>>
>> Jacek Pliszka
>>
>>
>>
>>


Re: rows reshuffled on join

2024-04-16 Thread Weston Pace
> Can someone confirm it?

I can confirm that the current join implementation will potentially reorder
input.  The larger the input the more likely the chance of reordering.

> I think that ordering is only guaranteed if it has been sorted.

Close enough probably.  I think there is an implicit order (the order
defined by the files in the dataset and the rows in those files, or the
original order when the input is in memory) that will be respected if there
are no joins or aggregates.

On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote:

> I think that ordering is only guaranteed if it has been sorted.
>
> Sent from Proton Mail for iOS
>
>
> On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka wrote:
>
> Hi!
>
> I just hit a very strange behaviour.
>
> I am joining two tables with "left outer" join.
>
> Naively I would expect that the output rows will match the order of the
> left table.
>
> But sometimes the order of rows is different ...
>
> Can someone confirm it?
>
> I would expect this would be mentioned in the docs.
>
> I am using 12.0.1 due to Python 3.7 dependency.
>
> Best Regards,
>
> Jacek Pliszka
>
>
>
>


Re: rows reshuffled on join

2024-04-16 Thread Aldrin
I think that ordering is only guaranteed if it has been sorted.

Sent from Proton Mail for iOS

On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka jacek.plis...@gmail.com wrote:

> Hi!
>
> I just hit a very strange behaviour.
>
> I am joining two tables with "left outer" join.
>
> Naively I would expect that the output rows will match the order of the
> left table.
>
> But sometimes the order of rows is different ...
>
> Can someone confirm it?
>
> I would expect this would be mentioned in the docs.
>
> I am using 12.0.1 due to Python 3.7 dependency.
>
> Best Regards,
>
> Jacek Pliszka

