kmitchener commented on issue #3629: URL: https://github.com/apache/arrow-datafusion/issues/3629#issuecomment-1259971237

In this particular case of the tpch binary, if you leave the default partitions setting at 1, the physical plan is _just_ the `CsvExec` node and it produces a single file. (The input is 1 CSV file, so there is only 1 output partition, and `write_parquet()` writes one file per partition.) Based on this usage of `DataFrame.repartition()` and Andy's suggestion in the Slack channel to use `repartition()` the same way, I think that round-robin splitting any input into N partitions is the intended use of `repartition()`. It just doesn't actually work, because the physical optimizer adds `RepartitionExec` nodes after it that undo the work:

```
RepartitionExec: partitioning=RoundRobinBatch(20)
RepartitionExec: partitioning=RoundRobinBatch(4)
RepartitionExec: partitioning=RoundRobinBatch(20)
```

I was thinking to fix it by adding some logic to the repartitioning pass of the physical optimizer so that it won't add two repartitions in a row.
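To illustrate the idea, here is a minimal, self-contained sketch of that guard. It does not use DataFusion's real API; `Plan`, `maybe_repartition`, and their shapes are all hypothetical stand-ins for the optimizer rule, just showing the "skip if the child is already a repartition" check:

```rust
// Hypothetical toy plan tree -- NOT DataFusion's actual ExecutionPlan types.
#[derive(Debug, PartialEq)]
enum Plan {
    CsvExec,
    RepartitionExec { partitions: usize, input: Box<Plan> },
}

// Sketch of the proposed optimizer guard: only wrap `plan` in a new
// round-robin RepartitionExec if it is not already one, so an explicit
// DataFrame.repartition() is never undone by a second repartition on top.
fn maybe_repartition(plan: Plan, target: usize) -> Plan {
    match plan {
        // Child is already a repartition: leave it alone.
        p @ Plan::RepartitionExec { .. } => p,
        // Otherwise, the optimizer may add its own repartition.
        other => Plan::RepartitionExec {
            partitions: target,
            input: Box::new(other),
        },
    }
}

fn main() {
    // User explicitly asked for 4 partitions...
    let user_plan = Plan::RepartitionExec {
        partitions: 4,
        input: Box::new(Plan::CsvExec),
    };
    // ...and the guard keeps it instead of stacking a 20-way repartition.
    let optimized = maybe_repartition(user_plan, 20);
    println!("{:?}", optimized);
}
```

With this check, the back-to-back `RoundRobinBatch(20)` / `RoundRobinBatch(4)` nodes from the plan above would collapse to just the user's requested partitioning.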
In this particular case of the tpch binary, if you leave the default partitions setting at 1, the physical plan is _just_ the `CsvExec` node and it results in 1 file. (it's coming from 1 CSV file, so output partitions is just 1, resulting in a single file being written by write_parquet()) Based on this usage of Dataframe.repartition() and Andy's suggestion to use repartition() in this same way in the Slack channel, I think that round-robin splitting any input into N partitions is the desired use of repartition(). It just doesn't actually work because the physical optimizer adds RepartitionExec nodes after it to undo the work. ``` RepartitionExec: partitioning=RoundRobinBatch(20) RepartitionExec: partitioning=RoundRobinBatch(4) RepartitionExec: partitioning=RoundRobinBatch(20) ``` I was thinking to fix it by adding some logical to the Repartitioning bit of the physical optimizer so that it won't add 2 repartitions in a row. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org