Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, I would like to ask if it is possible to use a generator that generates data bigger than the size of RAM across all the machines as the input for sc = SparkContext(); sc.parallelize(generator). I would like to create an RDD this way. When I am trying to create an RDD by sc.textFile(file), where file

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
sc.parallelize() distributes a list of data into a number of partitions, but a generator cannot be cut and serialized automatically. If you can partition your generator, then you can try this: sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x)), such as if you want to generate
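
Below is a minimal PySpark sketch of the pattern described above. The partition count N and the generate_partition function are illustrative assumptions, not from the original message; the point is that only the partition indices are shipped to the executors, and each task generates its own slice of the data lazily.

    # Minimal sketch of the pattern above; N and generate_partition are
    # illustrative assumptions.
    from pyspark import SparkContext

    sc = SparkContext(appName="generator-demo")

    N = 100  # number of partitions; each task generates its own slice

    def generate_partition(i):
        # Hypothetical per-partition generator: lazily yields the items that
        # belong to partition i, so no single machine holds all of the data.
        for j in range(1000):
            yield (i, j)

    # Distribute only the partition indices, then generate the actual data on
    # the executors; flatMap accepts a function returning an iterable.
    rdd = sc.parallelize(range(N), N).flatMap(generate_partition)
    print(rdd.count())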

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, thank you for your advice. It really might work, but to specify my problem a bit more: think of my data as if one generated item were one parsed Wikipedia page. I am getting this generator from the parser and I don't want to save it to storage, but directly apply parallelize and

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
On Mon, Oct 6, 2014 at 1:08 PM, jan.zi...@centrum.cz wrote: Hi, thank you for your advice. It really might work, but to specify my problem a bit more: think of my data as if one generated item were one parsed Wikipedia page. I am getting this generator from the parser and I don't want to

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Steve Lewis
Try a Hadoop custom InputFormat - I can give you some samples. While I have not tried this, an input split has only a length (which could be ignored if the format treats the input as non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole thing should work. Hadoop
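
For readers following along in PySpark, a rough approximation of the same idea (without writing an actual Hadoop InputFormat) is to distribute only the locations and let each task fetch and parse its page on an executor. The URLs and the parse_page helper below are assumptions for illustration.

    # Rough PySpark approximation of the "split carries only a location" idea;
    # the URLs and parse_page are illustrative assumptions.
    import urllib.request

    from pyspark import SparkContext

    sc = SparkContext(appName="url-location-demo")

    urls = [
        "https://en.wikipedia.org/wiki/Apache_Spark",
        "https://en.wikipedia.org/wiki/Apache_Hadoop",
    ]

    def parse_page(url):
        # Hypothetical parser: download the page and return (url, size in bytes).
        with urllib.request.urlopen(url) as resp:
            return (url, len(resp.read()))

    pages = sc.parallelize(urls, len(urls)).map(parse_page)
    print(pages.collect())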

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
@Davies I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck on the master node. Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes to store the whole data that needs to be parsed by extract_pages. I have my data on S3 and I kind
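
A hedged sketch of how the earlier suggestion could be combined with the S3 setup described here: distribute only the S3 keys and parse each object on an executor, so nothing has to be stored on (or streamed through) the master. The bucket name, keys, and chunk layout are placeholders, and the code assumes extract_pages yields (title, text, pageid) tuples from a decompressed dump file.

    # Hedged sketch only: bucket, keys, and chunk layout are placeholders, and
    # the extract_pages output format is an assumption.
    import bz2
    import io

    import boto3
    from gensim.corpora.wikicorpus import extract_pages
    from pyspark import SparkContext

    sc = SparkContext(appName="s3-wiki-demo")

    BUCKET = "my-bucket"                      # placeholder
    keys = ["wiki/chunk-00.xml.bz2",          # placeholder: one dump chunk per key
            "wiki/chunk-01.xml.bz2"]

    def parse_s3_object(key):
        # Hypothetical helper: fetch one chunk from S3, decompress it, and let
        # gensim's extract_pages yield pages from it on the executor.
        body = boto3.client("s3").get_object(Bucket=BUCKET, Key=key)["Body"].read()
        with bz2.open(io.BytesIO(body)) as f:
            for title, text, pageid in extract_pages(f):
                yield (title, text)

    pages = sc.parallelize(keys, len(keys)).flatMap(parse_s3_object)
    print(pages.count())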