[ 
https://issues.apache.org/jira/browse/ARROW-9420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157211#comment-17157211
 ] 

Jorge commented on ARROW-9420:
------------------------------

Thanks a lot for the clarification. Makes a lot of sense.

I am a bit unsure how exactly we would do all of this, since all the 
parallelism in DataFusion is thread-based, while in a distributed environment 
we work at the inter-process level through some communication medium.

IMO for Ballista to inter-operate with DataFusion, the API should not be based 
on SQL, but on a smaller unit, such as intra-partition operations.

Also, in that case, DataFusion should not have to worry about cross-partition 
operations as they are normally planned differently in a distributed context.

Looking forward to the Jira issue. We currently use `Vec<RecordBatch>` as an 
iterator, which defeats the purpose of an iterator, since everything needs to 
fit in memory anyway. In theory, record batches can be processed concurrently 
up to an exchange, and IMO one way to do this is to replace our 
`Vec<RecordBatch>` with an iterator of async tasks that are spawned in parallel.
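
To make the idea concrete, here is a rough sketch of what I mean. It is only an illustration under some assumptions: `Batch` is a made-up stand-in for a record batch, `stream_partitions` is a hypothetical name, and I use plain threads with a bounded std channel in place of async tasks so that it has no dependencies. It is not how DataFusion does it today.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a record batch: a single column of values.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub values: Vec<i64>,
}

// Spawn one producer thread per partition and return an iterator that yields
// batches as they become available, instead of a fully materialized Vec.
// Each partition is itself a lazy iterator, so nothing is computed up front.
pub fn stream_partitions<I>(partitions: Vec<I>, capacity: usize) -> impl Iterator<Item = Batch>
where
    I: Iterator<Item = Batch> + Send + 'static,
{
    // Bounded channel: producers block once `capacity` batches are waiting,
    // which caps how many batches are held in memory at any time.
    let (tx, rx) = mpsc::sync_channel::<Batch>(capacity);
    for partition in partitions {
        let tx = tx.clone();
        thread::spawn(move || {
            for batch in partition {
                // Stop early if the consumer has been dropped.
                if tx.send(batch).is_err() {
                    break;
                }
            }
        });
    }
    // Drop the original sender so the iterator ends once all producers finish.
    drop(tx);
    rx.into_iter()
}
```

The consumer pulls batches with a normal `for batch in stream_partitions(parts, 2)` loop. Batches from different partitions arrive interleaved and in no particular order, which is fine for anything that runs before an exchange.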

> [Rust][DataFusion] Add repartition/shuffle plan
> -----------------------------------------------
>
>                 Key: ARROW-9420
>                 URL: https://issues.apache.org/jira/browse/ARROW-9420
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Jorge
>            Assignee: Jorge
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Some operations (group by, join, over(window.partition_by)) greatly benefit 
> from hash partitioning.
> This is a proposal to add hash partitioning (based on an expression) to this 
> library, so that optimizers can be written to optimize the plan based on the 
> required hashing.
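
As a minimal sketch of hash partitioning on an expression: the function below routes each row to a partition chosen by hashing the value of a key closure. The name `hash_partition`, the row-at-a-time layout, and the use of the standard library's `DefaultHasher` are all assumptions made for illustration; the proposal itself operates on record batches and does not prescribe a hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Split `rows` into `num_partitions` buckets so that all rows whose key
// expression evaluates to the same value land in the same bucket.
pub fn hash_partition<T, K, F>(rows: Vec<T>, num_partitions: usize, key: F) -> Vec<Vec<T>>
where
    K: Hash,
    F: Fn(&T) -> K,
{
    assert!(num_partitions > 0, "need at least one partition");
    let mut out: Vec<Vec<T>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for row in rows {
        // Hash the evaluated key expression, then map the hash to a bucket.
        let mut hasher = DefaultHasher::new();
        key(&row).hash(&mut hasher);
        let idx = (hasher.finish() % num_partitions as u64) as usize;
        out[idx].push(row);
    }
    out
}
```

For example, `hash_partition(rows, 4, |r| r.0)` groups tuples by their first field. A group by or join on that field can then process each partition independently, because no key spans two partitions.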



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
