Hello there,

(Part of my problem is that the docs mark parallelize as "undocumented" <https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html#parallelize%28scala.collection.Seq,%20int,%20scala.reflect.ClassTag%29>, which leaves me reading books for examples that don't always pertain.)

I am trying to create an RDD of length N = 10^6 by executing N operations of a Java class we have; I can have that class implement Serializable or any Function if necessary. I don't have a fixed-length dataset up front; I am trying to create one. I'm trying to figure out whether to create a dummy array of length N to parallelize, or to pass in a function that runs N times.

I'm not sure which approach is valid or better. I can see that in Spark, if I start out with a well-defined dataset like the words in a document, the length/count of those words is already known, and I just parallelize it and map or filter to do some operation on that data.
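
For instance, the pattern I keep seeing in examples looks roughly like this (a minimal sketch; spark is our JavaSparkContext and the data is made up):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// The usual case: a fixed, known-length dataset gets parallelized,
// then map/filter operate on the elements that already exist.
List<String> words = Arrays.asList("the", "quick", "brown", "fox");
JavaRDD<String> wordsRDD = spark.parallelize(words);
JavaRDD<Integer> lengthsRDD = wordsRDD.map(w -> w.length());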

In my case I think it's different: I'm trying to parallelize the creation of an RDD that will contain 10^6 elements. Here's a lot more info if you want...

DESCRIPTION:

In Java 8, using Spark 1.5.1, we have a Java method doDrop() that takes a PipeLinkageData and returns a DropResult.

I am thinking I could use map() or flatMap() to call a one-to-many function. I was trying to do something like this in another question that never quite worked <http://stackoverflow.com/questions/33882283/build-spark-javardd-list-from-dropresult-objects>:

JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount()))
    .map(new Function<Integer, DropResult>() {
        public DropResult call(Integer i) {
            return pld.doDrop();
        }
    });
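
For completeness, here is roughly how I picture that version spelled out. makeRange() is just our own helper (not a Spark method), and I'm assuming PipeLinkageData implements Serializable since pld gets captured by the Function:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Our own helper (hypothetical, not part of Spark): builds a dummy
// List<Integer> of from..to whose only job is to give parallelize N elements.
static List<Integer> makeRange(int from, int to) {
    List<Integer> range = new ArrayList<>();
    for (int i = from; i <= to; i++) {
        range.add(i);
    }
    return range;
}

// One doDrop() call per dummy element; the Integer itself is ignored.
JavaRDD<DropResult> simCountRDD =
    spark.parallelize(makeRange(1, getSimCount()))
        .map(new Function<Integer, DropResult>() {
            public DropResult call(Integer i) {
                return pld.doDrop();
            }
        });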

Or is something like this more the correct approach? This version has more context, if desired:

// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into first param
List<PipeLinkageData> pldListofOne = new ArrayList();

// make an ArrayList of one
pldListofOne.add(pld);

int howMany = 1000000;

JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne)
    .flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
        public Iterable<DropResult> call(PipeLinkageData pld) {
            List<DropResult> returnRDD = new ArrayList();
            // is Spark good at spreading a for loop like this?
            for (int i = 0; i < howMany; i++) {
                returnRDD.add(pld.doDrop());
            }
            return returnRDD;
        }
    });
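
On that last comment: my understanding (please correct me if I'm off) is that parallelizing a one-element list puts that element in a single partition, so the whole for loop above would run inside one task no matter how big the cluster is. The range version at least lets me say how many partitions to split across, via the numSlices overload of parallelize:

// Second argument is numSlices: the 1..N dummy range gets split into
// this many partitions, so the doDrop() calls can run in parallel tasks.
JavaRDD<Integer> rangeRDD = spark.parallelize(makeRange(1, howMany), 8);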

One other concern: is JavaRDD correct here? I can see that I need to call FlatMapFunction, but I don't need a FlatMappedRDD, do I? And since I am never trying to flatten a group of arrays or lists into a single array or list, do I really ever need to flatten anything?
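
(From what I can tell in the API docs, flatMap on a JavaRDD is declared as <U> JavaRDD<U> flatMap(FlatMapFunction<T, U> f), i.e. it already hands back a plain JavaRDD<U>, so I'm assuming no FlatMappedRDD type ever shows up on the Java side.)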
