Hello there,
(Part of my problem is docs that mark parallelize as "undocumented"
<https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html#parallelize%28scala.collection.Seq,%20int,%20scala.reflect.ClassTag%29>
which leaves me reading books for examples that don't always pertain.)
I am trying to create an RDD of length N = 10^6 by executing N operations
of a Java class we have; I can have that class implement Serializable or
any Function if necessary. I don't have a fixed-length dataset up front,
I am trying to create one. I'm trying to figure out whether to give
parallelize a dummy array of length N, or to pass it a function that runs
N times.
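To make the first idea concrete, here's a minimal sketch of what I mean,
assuming spark is our JavaSparkContext; N and the slice count are
placeholder numbers, and the two-argument parallelize(list, numSlices) is
the overload from the javadoc linked above:

int N = 1000000;
List<Integer> dummy = new ArrayList<>(N);
for (int i = 0; i < N; i++) {
    dummy.add(i);  // contents don't matter, only the length
}
// second arg = numSlices, i.e. how many partitions to split the list into;
// 8 is an arbitrary guess
JavaRDD<Integer> seed = spark.parallelize(dummy, 8);

(needs java.util.List/ArrayList and org.apache.spark.api.java.JavaRDD
imports, of course)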
I'm not sure which approach is valid/better. I see that in Spark, if I am
starting out with a well-defined data set, like the words in a doc, the
length/count of those words is already defined, and I just parallelize
some map or filter to do some operation on that data.
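For example, the usual pattern I see in the books looks roughly like this
(a sketch, not our code; spark is again a JavaSparkContext):

// data exists up front, so the RDD's size is implied by the data itself
JavaRDD<String> words = spark.parallelize(Arrays.asList("the", "quick", "fox"));
JavaRDD<Integer> lengths = words.map(w -> w.length());  // just transforms existing elements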
In my case I think it's different: I'm trying to parallelize the creation
of an RDD that will contain 10^6 elements... here's a lot more info if
you want...
DESCRIPTION:
In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a
PipeLinkageData and returns a DropResult.
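For clarity, the shapes involved are roughly this; the bodies here are
stubs standing in for our real simulation code:

import java.io.Serializable;

public class PipeLinkageData implements Serializable {
    // the real doDrop() runs one simulation drop; stubbed here
    public DropResult doDrop() {
        return new DropResult();
    }
}

class DropResult implements Serializable {
    // fields describing one drop's outcome would live here
}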
I am thinking I could use map() or flatMap() to call a one-to-many
function. I was trying to do something like this in another question that
never quite worked
<http://stackoverflow.com/questions/33882283/build-spark-javardd-list-from-dropresult-objects>:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount()))
    .map(new Function<Integer, DropResult>() {
        public DropResult call(Integer i) {
            return pld.doDrop();
        }
    });
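(makeRange(1, getSimCount()) above is just a little helper of ours that
builds a List<Integer>; roughly this sketch:)

// builds the list [start, start+1, ..., end] to give parallelize something to chew on
private static List<Integer> makeRange(int start, int end) {
    List<Integer> list = new ArrayList<>(end - start + 1);
    for (int i = start; i <= end; i++) {
        list.add(i);
    }
    return list;
}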
Is something like this more the correct approach? This version has more
context if desired:
// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into the first param
List<PipeLinkageData> pldListofOne = new ArrayList<>();
pldListofOne.add(pld);  // make an ArrayList of one

int howMany = 1000000;

JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne)
    .flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
        public Iterable<DropResult> call(PipeLinkageData pld) {
            List<DropResult> returnRDD = new ArrayList<>();
            // is Spark good at spreading a for loop like this?
            for (int i = 0; i < howMany; i++) {
                returnRDD.add(pld.doDrop());
            }
            return returnRDD;
        }
    });
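If parallelizing a one-element list means all 10^6 doDrop() calls land in
a single task, I'm guessing I could spread the results afterwards with
something like:

// redistribute across more partitions; 64 is an arbitrary guess
JavaRDD<DropResult> spread = nSizedRDD.repartition(64);

but I suspect the for loop itself would still run on one executor, which
is really what I'm asking about.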
One other concern: is a JavaRDD correct here? I can see needing to call a
FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never
trying to flatten a group of arrays or lists into a single array or list,
do I really ever need to flatten anything?
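If I'm reading the javadoc right, both transformations are declared to
return a plain JavaRDD either way, which is part of why I'm asking:

// In the Java API both transformations hand back a plain JavaRDD:
//   <R> JavaRDD<R> map(Function<T, R> f)             -- exactly one output per input
//   <U> JavaRDD<U> flatMap(FlatMapFunction<T, U> f)  -- one Iterable<U> per input (Spark 1.x)
// so either way I seem to end up holding a JavaRDD<DropResult>, never a FlatMappedRDD.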
- Or do I just give parallelize a dummy ArrayList of length N to control
the RDD size?

Jim