Hi,
I would like to ask if it is possible to use a generator that generates data
bigger than the size of RAM across all the machines as the input for sc =
SparkContext(), sc.parallelize(generator). I would like to create an RDD this way.
When I am trying to create an RDD by sc.textFile(file), where file
sc.parallelize() needs to distribute a list of data into a number of partitions,
but a generator cannot be cut and serialized automatically.
If you can partition your generator, then you can try this:
sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x))
such as you want to generate
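To flesh that out a bit, here is a minimal sketch of the same idea;
generate_partition() is a hypothetical helper that, given a partition index,
produces only that slice of the data, so each worker builds its part of the
RDD independently:

from pyspark import SparkContext

def generate_partition(i):
    # placeholder: yield only the items that belong to partition i
    for j in range(1000):
        yield (i, j)

sc = SparkContext()
N = 100  # number of partitions
rdd = sc.parallelize(range(N), N).flatMap(generate_partition)
print(rdd.count())  # 100 * 1000 items, all generated on the workers

The key point is that only the partition indices are serialized from the
driver; the actual data is generated inside the tasks.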
Hi,
Thank you for your advice. It really might work, but to specify my problem a
bit more: think of my data as one generated item being one parsed Wikipedia
page. I am getting this generator from the parser and I don't want to save it
to storage, but directly apply parallelize and
Try a Hadoop custom InputFormat - I can give you some samples.
While I have not tried this, an input split has only a length (which could be
ignored if the format treats the input as non-splittable) and a String for a
location. If the location is a URL into Wikipedia, the whole thing should work.
Hadoop
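For what it's worth, a rough sketch of how such a format could be wired up
from PySpark, assuming a custom Java/Scala InputFormat has been built and
shipped with --jars (the class com.example.WikiPageInputFormat and the S3
path below are hypothetical; the format is assumed to emit one <page>
element per record):

from pyspark import SparkContext

sc = SparkContext()
# each record is (byte offset, raw page XML); key/value classes are the
# standard Hadoop Writables
pages = sc.newAPIHadoopFile(
    "s3n://my-bucket/wikipedia/enwiki-latest-pages-articles.xml",
    "com.example.WikiPageInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
snippets = pages.map(lambda kv: kv[1][:100])  # inspect/parse the XML on the workers
print(snippets.take(1))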
@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the
bottleneck on the master node.
Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes
to store the whole data there that needs to be parsed by extract_pages. I have my
data on S3 and I kind
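One way this might be pushed off the master entirely, sketched under the
assumption that the dump on S3 is already split into several bz2 parts small
enough for a single worker, and that boto and gensim are installed on every
node (bucket and key names below are placeholders):

import bz2
from io import BytesIO

import boto
from gensim.corpora.wikicorpus import extract_pages
from pyspark import SparkContext

def parse_s3_part(key_name):
    # runs on a worker: fetch one dump part from S3, decompress and parse it
    # there, so nothing is stored on the master or on local disk
    conn = boto.connect_s3()  # credentials from environment / IAM role
    key = conn.get_bucket("my-wiki-bucket").get_key(key_name)
    xml = bz2.decompress(key.get_contents_as_string())
    for page in extract_pages(BytesIO(xml)):
        yield page

sc = SparkContext()
part_keys = ["dump/part-%02d.xml.bz2" % i for i in range(20)]  # placeholder keys
pages = sc.parallelize(part_keys, len(part_keys)).flatMap(parse_s3_part)

Each part still has to fit in a worker's memory here; splitting the dump
finer, or streaming the S3 object instead of reading it whole, would relax
that.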