The question is whether passing an Iterable (instead of a List) would save memory.

It's trivial for me to build a list out of my iterable.  If I understood the
code correctly, Spark takes that List and converts it to an array, so I
built an ArrayList out of the iterable in the hope that Spark would use the
underlying array natively in the RDD.  If that is the case, then no effort
or memory has been wasted (at least on a single node).  If (when) that is
not the case, and the RDD throws away my array, then it would indeed be
more efficient to pass Spark an iterable and let it build whatever internal
representation it needs.
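
For concreteness, here is roughly what I'm doing (a sketch; the toRdd
helper and the generic T are my own naming, but
JavaSparkContext.parallelize(List) is the real Spark Java API entry point):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterableToRdd {
    // Materialize an Iterable into an ArrayList, then hand it to Spark.
    // ArrayList keeps its elements in a backing array, which is what I
    // hoped Spark could reuse directly in the RDD.
    public static <T> JavaRDD<T> toRdd(JavaSparkContext sc, Iterable<T> rows) {
        List<T> buffer = new ArrayList<>();
        for (T row : rows) {
            buffer.add(row);
        }
        // parallelize(List) partitions the list into the RDD; as discussed
        // above, it may copy rather than share the underlying array.
        return sc.parallelize(buffer);
    }
}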

Our bigger picture is that we have a proprietary streaming system that
streams rows of data (the "row" is a Java data structure of ours, delivered
in the form of an Iterable<ProprietaryRow>).  The Iterable may invoke other
upstream Iterables, which may in turn invoke a streaming read from a
database, a file, etc. (a sketch of the pattern is below).  So far we have
been careful to avoid collecting the entire stream in memory unless
absolutely necessary.  By experimenting with Spark and RDDs, we are taking
the leap of collecting the entire dataset in (potentially distributed)
memory to see if it can help us parallelize and scale.
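
To make the chaining concrete, each stage in our pipeline looks roughly
like this (names invented for illustration; the real classes are
proprietary): a thin Iterable wrapping an upstream Iterable, so rows are
pulled one at a time and nothing is buffered until a consumer iterates.

import java.util.Iterator;
import java.util.function.Function;

// Each stage wraps an upstream Iterable (e.g. a streaming database or
// file read) and transforms rows lazily, one at a time.
public final class MappingIterable<A, B> implements Iterable<B> {
    private final Iterable<A> upstream;
    private final Function<A, B> transform;

    public MappingIterable(Iterable<A> upstream, Function<A, B> transform) {
        this.upstream = upstream;
        this.transform = transform;
    }

    @Override
    public Iterator<B> iterator() {
        Iterator<A> it = upstream.iterator();  // pulls lazily from upstream
        return new Iterator<B>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public B next() { return transform.apply(it.next()); }
        };
    }
}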
