Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, I would like to ask whether it is possible to use a generator that generates data bigger than the combined RAM of all the machines as the input for sc = SparkContext(); sc.parallelize(generator). I would like to create an RDD this way. When I try to create an RDD by sc.textFile(file), where the file has…
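A minimal sketch of why the naive call is problematic, assuming PySpark (all names are illustrative): sc.parallelize() materializes its argument on the driver before slicing it into partitions, so a generator yielding more data than RAM either exhausts driver memory or is rejected because slicing needs a known length:

    from pyspark import SparkContext

    sc = SparkContext(appName="generator-demo")

    def huge_generator():
        # stand-in for a source yielding more data than fits in RAM
        for i in range(10**10):
            yield i

    # Does NOT work as hoped: PySpark materializes the input on the
    # driver in order to slice it into partitions.
    # rdd = sc.parallelize(huge_generator())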

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
sc.parallelize() distributes a list of data into a number of partitions, but a generator cannot be split and serialized automatically. If you can partition your generator, then you can try this: sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x)). For example, say you want to generate xrange…
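A runnable sketch of this suggestion, assuming PySpark; generate_partition() is a hypothetical user-supplied function that lazily yields the records of one partition:

    from pyspark import SparkContext

    sc = SparkContext(appName="partitioned-generator")

    N = 100  # number of partitions; tune to your data and cluster

    def generate_partition(i):
        # lazily yield the records belonging to partition i
        # (here: one million integers per partition, as an illustration)
        start = i * 1000000
        for x in range(start, start + 1000000):
            yield x

    # Only the small list of partition indices lives on the driver;
    # the bulk of the data is generated on the executors.
    rdd = sc.parallelize(range(N), N).flatMap(generate_partition)
    print(rdd.count())

Since flatMap consumes the generator on the workers, no single machine ever has to hold the whole dataset in memory.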

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
[Quotes Davies Liu's reply of 06.10.2014 18:09 above.]

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
…the files should be accessible by the workers: put them in a DFS or NFS in cluster mode. In local mode, you may need to use absolute paths for the files. Davies
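A short sketch of this path advice, with illustrative paths: in cluster mode the input must live somewhere every worker can reach (HDFS, an NFS mount), and in local mode an absolute path avoids working-directory surprises:

    from pyspark import SparkContext

    sc = SparkContext(appName="path-demo")

    rdd_dfs   = sc.textFile("hdfs:///data/dump.txt")            # shared DFS, cluster mode
    rdd_nfs   = sc.textFile("file:///mnt/nfs/data/dump.txt")    # NFS mounted on all nodes
    rdd_local = sc.textFile("file:///home/user/data/dump.txt")  # local mode, absolute path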

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Steve Lewis
Try a custom Hadoop InputFormat; I can give you some samples. While I have not tried this, an input split has only a length (which can be ignored if the format treats the input as non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole thing should work. Hadoop InputFormat…
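A hedged sketch of how such a format could be wired into PySpark via sc.newAPIHadoopFile(); com.example.WikiPageInputFormat is a hypothetical class standing in for Steve's samples, and its jar would have to be shipped to the executors (e.g. with --jars):

    from pyspark import SparkContext

    sc = SparkContext(appName="custom-inputformat-demo")

    # The custom (possibly non-splittable) InputFormat interprets the
    # "path" as a URL and fetches the page on the worker side.
    pages = sc.newAPIHadoopFile(
        "http://en.wikipedia.org/wiki/Apache_Spark",   # illustrative URL
        "com.example.WikiPageInputFormat",             # hypothetical InputFormat
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.Text",
    )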

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
…from Wikipedia (it can also be some non-public, Wikipedia-like site).

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-07 Thread jan.zikes
(Quoting Steve Lewis, 07.10.2014 01:25) Say more about the one file you have: is the file itself large, and is it text? Here are 3 samples; I have tested the first two in Spark like this…