Hello there,
(Part of my problem is docs that mark parallelize as "undocumented"
<https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html#parallelize%28scala.collection.Seq,%20int,%20scala.reflect.ClassTag%29>
which leaves me reading books for examples that don't always pertain.)
I am trying to create an RDD of length N = 10^6 by executing N operations
of a Java class we have; I can have that class implement Serializable or
any Function if necessary. I don't have a fixed-length dataset up front,
I am trying to create one. I'm trying to figure out whether to give
parallelize a dummy array of length N, or to pass it a function that runs
N times.
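To make the first idea concrete, here's a minimal sketch of what I mean,
assuming spark is our JavaSparkContext; N and the slice count are
placeholder numbers, and the two-argument parallelize(list, numSlices) is
the overload from the javadoc linked above:

int N = 1000000;
List<Integer> dummy = new ArrayList<>(N);
for (int i = 0; i < N; i++) {
    dummy.add(i);  // contents don't matter, only the length
}
// second arg = numSlices, i.e. how many partitions to split the list into;
// 8 is an arbitrary guess
JavaRDD<Integer> seed = spark.parallelize(dummy, 8);

(needs java.util.List/ArrayList and org.apache.spark.api.java.JavaRDD
imports, of course)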
I'm not sure which approach is valid/better. I see that in Spark, if I am
starting out with a well-defined data set, like the words in a doc, the
length/count of those words is already defined, and I just parallelize
some map or filter to do some operation on that data.
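For example, the usual pattern I see in the books looks roughly like this
(a sketch, not our code; spark is again a JavaSparkContext):

// data exists up front, so the RDD's size is implied by the data itself
JavaRDD<String> words = spark.parallelize(Arrays.asList("the", "quick", "fox"));
JavaRDD<Integer> lengths = words.map(w -> w.length());  // just transforms existing elements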
In my case I think it's different: I'm trying to parallelize the creation
of an RDD that will contain 10^6 elements... here's a lot more info if
you want...
DESCRIPTION:
In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a
PipeLinkageData and returns a DropResult.
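For clarity, the shapes involved are roughly this; the bodies here are
stubs standing in for our real simulation code:

import java.io.Serializable;

public class PipeLinkageData implements Serializable {
    // the real doDrop() runs one simulation drop; stubbed here
    public DropResult doDrop() {
        return new DropResult();
    }
}

class DropResult implements Serializable {
    // fields describing one drop's outcome would live here
}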
I am thinking I could use map() or flatMap() to call a one-to-many
function. I was trying to do something like this in another question that
never quite worked
<http://stackoverflow.com/questions/33882283/build-spark-javardd-list-from-dropresult-objects>:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount()))
    .map(new Function<Integer, DropResult>() {
        public DropResult call(Integer i) {
            return pld.doDrop();
        }
    });
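(makeRange(1, getSimCount()) above is just a little helper of ours that
builds a List<Integer>; roughly this sketch:)

// builds the list [start, start+1, ..., end] to give parallelize something to chew on
private static List<Integer> makeRange(int start, int end) {
    List<Integer> list = new ArrayList<>(end - start + 1);
    for (int i = start; i <= end; i++) {
        list.add(i);
    }
    return list;
}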
Is something like this more the correct approach? This version has more
context if desired:
// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into the first param
List<PipeLinkageData> pldListofOne = new ArrayList<>();
pldListofOne.add(pld);  // make an ArrayList of one

int howMany = 1000000;

JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne)
    .flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
        public Iterable<DropResult> call(PipeLinkageData pld) {
            List<DropResult> returnRDD = new ArrayList<>();
            // is Spark good at spreading a for loop like this?
            for (int i = 0; i < howMany; i++) {
                returnRDD.add(pld.doDrop());
            }
            return returnRDD;
        }
    });
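If parallelizing a one-element list means all 10^6 doDrop() calls land in
a single task, I'm guessing I could spread the results afterwards with
something like:

// redistribute across more partitions; 64 is an arbitrary guess
JavaRDD<DropResult> spread = nSizedRDD.repartition(64);

but I suspect the for loop itself would still run on one executor, which
is really what I'm asking about.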
One other concern: is a JavaRDD correct here? I can see needing to call a
FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never
trying to flatten a group of arrays or lists into a single array or list,
do I really ever need to flatten anything?
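If I'm reading the javadoc right, both transformations are declared to
return a plain JavaRDD either way, which is part of why I'm asking:

// In the Java API both transformations hand back a plain JavaRDD:
//   <R> JavaRDD<R> map(Function<T, R> f)             -- exactly one output per input
//   <U> JavaRDD<U> flatMap(FlatMapFunction<T, U> f)  -- one Iterable<U> per input (Spark 1.x)
// so either way I seem to end up holding a JavaRDD<DropResult>, never a FlatMappedRDD.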
- Or do I just give parallelize a dummy ArrayList of length N to control
the RDD size?

Jim