Assume I don't care about the values themselves, since those can be created
in a later map step - in Scala I can say
val rdd = sc.parallelize(1 to 1000000000, numSlices = 1000)
but in Java, JavaSparkContext can only parallelize a List, which is limited
to Integer.MAX_VALUE elements and has to exist in memory on the driver - the
best I can do for memory is to build my own List backed by a BitSet.
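One workaround I can sketch (assuming Java 8 lambdas and Spark 2.x, where the
Java FlatMapFunction returns an Iterator rather than the Iterable of 1.x; the
HugeRange class and hugeRange method are just illustrative names) is to
parallelize only the slice indices and let each executor expand its own
sub-range lazily, so the driver never holds the elements:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HugeRange {
    // Builds an RDD of `count` longs without materializing them on the
    // driver: only the numSlices Integer indices are parallelized, and
    // each executor expands its slice into a lazy range of longs.
    public static JavaRDD<Long> hugeRange(JavaSparkContext sc, long count, int numSlices) {
        long perSlice = count / numSlices;
        List<Integer> slices = new ArrayList<>(numSlices);
        for (int i = 0; i < numSlices; i++) {
            slices.add(i);
        }
        return sc.parallelize(slices, numSlices).flatMap(slice -> {
            long start = slice * perSlice;
            // the last slice picks up the remainder when count % numSlices != 0
            long end = (slice == numSlices - 1) ? count : start + perSlice;
            // lazy iterator: the slice never exists as an in-memory collection
            return new Iterator<Long>() {
                private long next = start;
                public boolean hasNext() { return next < end; }
                public Long next() { return next++; }
            };
        });
    }
}

With that, hugeRange(sc, 2000000000L, 1000) should give two billion elements
in 1000 partitions while the driver only ever holds the 1000 Integer indices.
(Scala's SparkContext.range since 1.5, and SparkSession.range in 2.x, cover
this directly, but I don't know of a Java-friendly equivalent in older
releases.)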
Is there a JIRA asking for JavaSparkContext.parallelize to take an Iterable
or an Iterator?
I am trying to make an RDD with at least 100 million elements, and if
possible several billion, to test performance issues in a large application.
