kmitchener commented on issue #3629:
URL: 
https://github.com/apache/arrow-datafusion/issues/3629#issuecomment-1259971237

   In this particular case of the tpch binary, if you leave the default 
partitions setting at 1, the physical plan is _just_ the `CsvExec` node and it 
produces 1 file. (The input is a single CSV file, so the output partition count 
is 1, and write_parquet() writes a single file.)
   
   Based on this usage of Dataframe.repartition(), and Andy's suggestion in the 
Slack channel to use repartition() the same way, I think round-robin splitting 
any input into N partitions is the intended use of repartition(). It just 
doesn't actually work, because the physical optimizer adds RepartitionExec 
nodes after it that undo the work:
   
   ```
     RepartitionExec: partitioning=RoundRobinBatch(20)
       RepartitionExec: partitioning=RoundRobinBatch(4)
         RepartitionExec: partitioning=RoundRobinBatch(20)
   ```
   
   I was thinking of fixing it by adding some logic to the repartitioning rule 
of the physical optimizer so that it won't add two RepartitionExec nodes in a 
row. 
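   A rough sketch of what I mean, using a toy plan type rather than 
DataFusion's actual `ExecutionPlan` API (the enum, field names, and 
`collapse_repartitions` function below are all hypothetical, just to illustrate 
collapsing directly-nested round-robin repartitions so only the outermost one 
survives):

   ```rust
// Toy model of a physical plan: either a leaf scan or a round-robin
// RepartitionExec wrapping an input plan. Not DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum PlanNode {
    CsvExec,
    RoundRobinRepartition { partitions: usize, input: Box<PlanNode> },
}

// Collapse chains of back-to-back round-robin repartitions: each inner
// repartition is immediately undone by the one above it, so keep only
// the outermost node in any such chain.
fn collapse_repartitions(plan: PlanNode) -> PlanNode {
    match plan {
        PlanNode::RoundRobinRepartition { partitions, input } => {
            // Recurse first so deeper chains are already flattened.
            let input = collapse_repartitions(*input);
            // If the child is itself a round-robin repartition, drop it.
            let input = match input {
                PlanNode::RoundRobinRepartition { input: inner, .. } => *inner,
                other => other,
            };
            PlanNode::RoundRobinRepartition { partitions, input: Box::new(input) }
        }
        leaf => leaf,
    }
}

fn main() {
    // Mirrors the plan above:
    // RoundRobinBatch(20) -> RoundRobinBatch(4) -> RoundRobinBatch(20) -> CsvExec
    let plan = PlanNode::RoundRobinRepartition {
        partitions: 20,
        input: Box::new(PlanNode::RoundRobinRepartition {
            partitions: 4,
            input: Box::new(PlanNode::RoundRobinRepartition {
                partitions: 20,
                input: Box::new(PlanNode::CsvExec),
            }),
        }),
    };
    // Only the outermost RoundRobinBatch(20) over CsvExec remains.
    println!("{:?}", collapse_repartitions(plan));
}
   ```

   In the real optimizer rule this would presumably be a check, before wrapping 
a node in a new RepartitionExec, for whether the input is already a round-robin 
RepartitionExec.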


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
