Cody Allen created SPARK-24947: ---------------------------------- Summary: aggregateAsync and foldAsync for RDD Key: SPARK-24947 URL: https://issues.apache.org/jira/browse/SPARK-24947 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.3.1 Reporter: Cody Allen
{{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, {{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing datasets asynchronously. If I want to aggregate some statistics on a large dataset and it's going to take an hour, I shouldn't need to completely block a thread for the hour to wait for the result. I propose the following methods be added to {{AsyncRDDActions}}: {{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): FutureAction[U]}} {{}}{{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}} Locally I have a version of {{aggregateAsync}} implemented based on {{submitJob}} (similar to how {{countAsync}} is implemented), and a {{foldAsync}} implementation that simply delegates through to {{aggregateAsync}}. I haven't yet written unit tests for these, but I can do so if this is a contribution that would be accepted. Please let me know. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org