RE: [Python] pyarrow dataset writing CSV with or without headers and quoting style

2024-04-16 Thread Lee, David (PAG)
OK, I figured it out. You have to create a pyarrow.dataset.CsvFileFormat object first and generate csv_file_options = CsvFileFormat().make_write_options(**{"include_header": True}) from it. Then pass file_options=csv_file_options in write_dataset(). The only issue I’ve seen is that when using

Re: rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
Hi! https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.join However in my case I want to stay within memory and I found an ugly workaround through unifying dictionaries and then building final column with pa.DictionaryArray.from_arrays BR, Jacek

Re: rows reshuffled on join

2024-04-16 Thread PASSWORD ADMINISTRATOR
Can we join on a "dataset" yet using pyarrow? That is, can I read my Parquet file, which is larger than memory, using the dataset API and join it with another dataset or an in-memory table? If so, I couldn't find it in the documentation; could you please point me to how to do that join? On Tue, Apr 16, 2024, 9:59

[Python] pyarrow dataset writing CSV with or without headers and quoting style

2024-04-16 Thread Lee, David (PAG)
How do you pass a csv.WriteOptions() object to pyarrow.dataset.write_dataset()? I tried passing file_options = pa.csv.WriteOptions(include_header=True) and file_options = {"include_header": True}. Both attempts came back with an error: object has no attribute 'format'. CSV cookbook example:

Re: rows reshuffled on join

2024-04-16 Thread Ruoxi Sun
Hi Jacek, I recall an issue with similar concern [1] that I was trying to answer, hope that can help. Besides, if you do the join in parallel, e.g. by directly calling acero API in C++ and the source node is parallel, there is another level of uncertainty of the order of output rows, depending

Re: rows reshuffled on join

2024-04-16 Thread Weston Pace
> Can someone confirm it?
I can confirm that the current join implementation will potentially reorder input. The larger the input the more likely the chance of reordering.
> I think that ordering is only guaranteed if it has been sorted.
Close enough probably. I think there is an implicit

Re: rows reshuffled on join

2024-04-16 Thread Aldrin
I think that ordering is only guaranteed if it has been sorted.

rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
Hi! I just hit a very strange behaviour. I am joining two tables with "left outer" join. Naively I would expect that the output rows will match the order of the left table. But sometimes the order of rows is different ... Can someone confirm it? I would expect this would be mentioned in the