Hi,
I would like to ask if it is possible to use a generator that generates data bigger than the size of RAM across all the machines as the input for sc = SparkContext(), sc.parallelize(generator). I would like to create an RDD this way.
When I am trying to create an RDD by sc.textFile(file), where file ha
__
sc.parallelize() distributes a list of data into a number of partitions, but a generator cannot be cut and serialized automatically.
If you can partition your generator, then you can try this:
sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x))
for example, if you want to generate xrange
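A minimal runnable sketch of that xrange case (the total size M and the body of generate_partition below are illustrative; only the parallelize/flatMap pattern is from the mail above):

    from pyspark import SparkContext

    sc = SparkContext()

    N = 100         # number of partitions
    M = 100000000   # total number of records, more than one machine's RAM

    def generate_partition(i):
        # each worker lazily generates only its own chunk of the range
        return xrange(i * M // N, (i + 1) * M // N)  # range() on Python 3

    # only N small integers are serialized and shipped to the cluster;
    # the heavy data is produced on the workers inside flatMap
    rdd = sc.parallelize(range(N), N).flatMap(generate_partition)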
__
From: Davies Liu
To:
Date: 06.10.2014 18:09
Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
The files should be accessible by workers; put them in a DFS or NFS in cluster mode. In local mode, maybe you should use an absolute path for the files.
Davies
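For illustration, the two cases look roughly like this (both paths are assumptions, not from the original mail):

    # cluster mode: the file lives on a DFS that every worker can reach
    rdd = sc.textFile("hdfs:///data/input.txt")

    # local mode: use an absolute path on the local filesystem
    rdd = sc.textFile("file:///home/user/input.txt")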
> __
>> From: Davies Liu
>> To:
>> Date: 06.10.2014 18:09
Try a custom Hadoop InputFormat - I can give you some samples. While I have not tried this, an input split has only a length (which can be ignored if the format treats the input as non-splittable) and a String for a location. If the location is a URL into wikipedia, the whole thing should work.
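For reference, a custom InputFormat written in Java would be consumed from PySpark roughly like this; the WikipediaInputFormat class name is hypothetical and stands for whatever format the samples implement:

    # assumes the jar with the custom InputFormat is on the classpath
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="com.example.WikipediaInputFormat",  # hypothetical
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.Text")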
> Hadoop InputFormat
> from wikipedia (it can be also some non-public wikipedia-like site)
__
From: Steve Lewis
To: Davies Liu
Date: 06.10.2014 22:39
Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
__
From: Steve Lewis
To:
Date: 07.10.2014 01:25
Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
Say more about the one file you have - is the file itself large, and is it text? Here are 3 samples - I have tested the first two in Spark like this