[ https://issues.apache.org/jira/browse/SPARK-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-6535.
------------------------------
    Resolution: Not a Problem

I think it's fair to say that this would not require a change to Spark to implement the desired functionality, so closing it. (A sketch of one possible user-side implementation follows this message.)

> new RDD function that returns intermediate Future
> -------------------------------------------------
>
>                 Key: SPARK-6535
>                 URL: https://issues.apache.org/jira/browse/SPARK-6535
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>            Reporter: Eric Johnston
>            Priority: Minor
>              Labels: features, newbie
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm suggesting a possible Spark RDD method that I think could give value to a number of people. I'd be interested in thoughts and feedback. Is this a good or bad idea in general? Would it work well but be too specific for Spark Core?
>
>   def mapIO[U, V : ClassTag](f1 : T => Future[U], f2 : U => V, batchSize : Int) : RDD[V]
>
> The idea is that we often have an RDD[T] containing metadata, for example a file path or a unique identifier for data in an external database. We would like to retrieve this data, process it, and provide the output as an RDD. Right now, one way to do that is with two map calls: the first T => U, followed by U => V. However, this blocks on every T => U IO operation. Wrapping U in a Future avoids that blocking. The batchSize parameter is there because we do not want a Future outstanding for every row in a partition -- we may get too much data back at once. It limits the number of outstanding Futures within a partition: ideally it is big enough that there is always data ready to process, but small enough that not too much data is pulled at any one time. We could potentially default batchSize to 1.
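Below is a minimal sketch of how the proposed mapIO could be written outside Spark Core, as an enrichment of RDD -- which is the point of the "Not a Problem" resolution. The names IoFunctions and toIoFunctions, the per-batch Await, and the fetchAsync/parse calls in the usage comment are all illustrative assumptions, not part of any Spark API.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.language.implicitConversions
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Hypothetical enrichment of RDD[T]; the names IoFunctions and mapIO
    // come from this proposal and are not part of Spark's API.
    class IoFunctions[T](rdd: RDD[T]) {
      // f1 starts one asynchronous IO call per element, f2 processes each
      // result, and batchSize caps the number of Futures outstanding within
      // a partition. f1 and f2 are shipped to executors, so both must be
      // serializable.
      def mapIO[U, V: ClassTag](f1: T => Future[U], f2: U => V, batchSize: Int): RDD[V] =
        rdd.mapPartitions { iter =>
          iter.grouped(batchSize).flatMap { batch =>
            // Seq.map is strict, so all Futures in the batch are launched
            // before we start waiting on any of them.
            val inFlight = batch.map(f1)
            // Block per batch (an unbounded wait here; a real version would
            // want a timeout), then apply the processing step.
            inFlight.map(f => f2(Await.result(f, Duration.Inf)))
          }
        }
    }

    object IoFunctions {
      implicit def toIoFunctions[T](rdd: RDD[T]): IoFunctions[T] =
        new IoFunctions(rdd)
    }

Usage then matches the two-map pipeline from the description, with the IO step batched (store.fetchAsync and parse stand in for whatever external lookup and processing apply):

    import IoFunctions._
    val enriched = ids.mapIO(id => store.fetchAsync(id), parse, batchSize = 32)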