Regarding the specific problem of generating random folds more efficiently,
this should help:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions

It uses a sort of multiplexing formalism on RDDs:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions

I wrote a blog post to explain the idea here:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
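
For comparison, here is a minimal sketch of 5-fold cross-validation using only
the stock RDD API (randomSplit plus union). The local-mode SparkContext, the
toy data set, and the "model" (just a training-set mean) are placeholders for
your own setup. Note that each RDD returned by randomSplit re-filters the
(cached) parent, which is the overhead the multiplexed split in the links
above is meant to avoid by materializing all folds in a single pass.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CrossValidationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cv-sketch").setMaster("local[*]"))

    // Stand-in data; in practice this would be your real RDD.
    val data: RDD[Double] = sc.parallelize(1 to 1000).map(_.toDouble)
    data.cache()  // avoid recomputing the parent once per fold

    val k = 5
    // randomSplit assigns each element to one of k folds,
    // in proportion to the weights.
    val folds: Array[RDD[Double]] =
      data.randomSplit(Array.fill(k)(1.0), seed = 42L)

    for (i <- folds.indices) {
      val testing = folds(i)
      // Training set = union of the other k-1 folds. union only builds a
      // new RDD lineage; it does not duplicate the underlying data.
      val training = sc.union(folds.indices.filter(_ != i).map(j => folds(j)))

      // Placeholder "model": the training mean, evaluated on the test fold.
      val mean = training.mean()
      val mse = testing.map(x => (x - mean) * (x - mean)).mean()
      println(s"fold $i: mse = $mse")
    }

    sc.stop()
  }
}

The relevant point for the space question in the original message is that
neither randomSplit nor union copies data: each fold and each training set is
just a lineage over the same cached parent, so there is no k-fold duplication
of the input.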



----- Original Message -----
> Hi there,
> 
> I'd like to write some iterative computation, i.e., computation that can be
> done via a for loop. I understand that in Spark foreach is a better choice.
> However, foreach and foreachPartition seem to be for self-contained
> computation that only involves the corresponding Row or Partition,
> respectively. But in my application each computational task involves not only
> its own partition but also other partitions; every task just has a specific
> way of using its own partition versus the others. An example application
> would be cross-validation in machine learning, where each fold corresponds
> to a partition: the whole data set is divided into 5 folds, then for task 1
> I use fold 1 for testing and folds 2, 3, 4, 5 for training; for task 2 I use
> fold 2 for testing and folds 1, 3, 4, 5 for training; and so on.
> 
> In this case, if I were to use foreachPartition, it seems that I would need
> to duplicate the data as many times as there are folds (or iterations of my
> for loop). More generally, I would still need to prepare a partition for
> every distributed task, and that partition would need to include all the
> data needed for the task, which could be a huge waste of space.
> 
> Are there any other solutions? Thanks.
> 
> f.
> 
