Cody Allen created SPARK-24947:
----------------------------------

             Summary: aggregateAsync and foldAsync for RDD
                 Key: SPARK-24947
                 URL: https://issues.apache.org/jira/browse/SPARK-24947
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: Cody Allen


{{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
{{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
datasets asynchronously. If I want to aggregate some statistics on a large 
dataset and it's going to take an hour, I shouldn't need to completely block a 
thread for the hour to wait for the result.

 

I propose the following methods be added to {{AsyncRDDActions}}:

 

{{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): 
FutureAction[U]}}

{{}}{{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}

 

Locally I have a version of {{aggregateAsync}} implemented based on 
{{submitJob}} (similar to how {{countAsync}} is implemented), and a 
{{foldAsync}} implementation that simply delegates through to 
{{aggregateAsync}}. I haven't yet written unit tests for these, but I can do so 
if this is a contribution that would be accepted. Please let me know.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to