Hi!
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.join
However, in my case I wanted to stay in memory, and I found an ugly
workaround: unifying the dictionaries
and then building the final column with pa.DictionaryArray.from_arrays.
BR,
Jacek
Can we join on a "dataset" yet using pyarrow? What I mean is: my Parquet
file is larger than memory; can I read it using the dataset API and join it
with another dataset or an in-memory table? If so, I couldn't find it in the
documentation; could you please explain how to do that join?
On Tue, Apr 16, 2024, 9:59
Hi Jacek,
I recall an issue with a similar concern [1] that I was trying to answer;
I hope that can help.
Besides, if you do the join in parallel, e.g. by calling the Acero API
directly in C++ with a parallel source node, there is another level of
uncertainty in the order of the output rows, depending
> Can someone confirm it?
I can confirm that the current join implementation can reorder its
input. The larger the input, the higher the chance of reordering.
> I think that ordering is only guaranteed if it has been sorted.
Close enough probably. I think there is an implicit
I think that ordering is only guaranteed if it has been sorted.
On Tue, Apr 16, 2024 at 08:12, Jacek Pliszka jacek.plis...@gmail.com
wrote:
Hi!
I just hit a very strange behaviour.
I am joining two tables with "left outer" join.
Naively I would expect that the output rows will match the order of the
left table.
But sometimes the order of rows is different ...
Can someone confirm it?
I would expect this would be mentioned in the