@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the
bottleneck on the master node.
Unfortunately I am running Spark on EC2 and I don't have enough space on my
nodes to store the whole data set that needs to be parsed by extract_pages. My
data is on S3, and I had hoped that after reading it into an RDD with
sc.textFile(file_on_s3) I could pass that RDD to extract_pages, but
unfortunately that does not work for me. If it worked, it would be by far the
best option for me.
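
One workaround that might help here (sketched below, untested on my side) is to
avoid running extract_pages on the driver at all: read the dump with a custom
record delimiter so that each RDD record is one <page> element, and strip the
wiki markup on the executors with gensim's filter_wiki. The S3 path is just a
placeholder, and the sketch assumes the dump is uncompressed XML.

import re
from gensim.corpora.wikicorpus import filter_wiki

# Each record becomes everything up to the next "</page>" tag, i.e. one page.
pages = sc.newAPIHadoopFile(
    "s3n://my-bucket/enwiki-pages-articles.xml",              # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "</page>"},
)

def parse_page(record):
    # record is (byte offset, xml chunk); skip chunks that are not a full page
    _, chunk = record
    title = re.search(r"<title>(.*?)</title>", chunk, re.DOTALL)
    text = re.search(r"<text[^>]*>(.*?)</text>", chunk, re.DOTALL)
    if title and text:
        yield title.group(1), filter_wiki(text.group(1))

parsed = pages.flatMap(parse_page)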
 
@Steve
I can try a Hadoop custom InputFormat; it would be great if you could send me
some samples. But if I understand it correctly, I'm afraid it won't work for
me, because I don't actually have any URL into Wikipedia. I only have a file
that is opened, parsed, and returned as a generator yielding parsed page names
and text from Wikipedia (it can also be some non-public Wikipedia-like site).
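
Another rough idea I'm considering, assuming the dump can be pre-split into
several smaller, self-contained dump files that fit on a worker's local disk
(or can be fetched from S3 inside the task): parallelize the list of chunk
paths instead of the parsed pages, and run gensim's extract_pages inside the
tasks. The paths below are placeholders, and the chunks are assumed to be
reachable from every worker.

import bz2
from gensim.corpora.wikicorpus import extract_pages

dump_chunks = [
    "/mnt/dumps/chunk-00.xml.bz2",   # placeholder paths
    "/mnt/dumps/chunk-01.xml.bz2",
]

def pages_from_chunk(path):
    # parse one chunk on the worker; extract_pages yields (title, text, ...)
    with bz2.BZ2File(path) as f:
        for page in extract_pages(f):
            yield page

parsed = sc.parallelize(dump_chunks, len(dump_chunks)).flatMap(pages_from_chunk)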
______________________________________________________________
From: Steve Lewis <lordjoe2...@gmail.com>
To: Davies Liu <dav...@databricks.com>
CC: "user"
Date: 06.10.2014 22:39
Subject: Re: Spark and Python using generator of data bigger than RAM as input
to sc.parallelize()

Try a Hadoop custom InputFormat - I can give you some samples. While I have not
tried this, an input split has only a length (which could be ignored if the
format treats the input as non-splittable) and a String for a location. If the
location is a URL into Wikipedia, the whole thing should work. Hadoop
InputFormats seem to be the best way to get large (say multi-gigabyte) files
into RDDs.
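
(For reference, consuming such a custom InputFormat from PySpark might look
roughly like the sketch below. "com.example.WikiPageInputFormat" is a
hypothetical class name standing in for whatever the samples provide, and the
key/value classes assume it emits Text pairs of page title and page markup.)

pages = sc.newAPIHadoopFile(
    "s3n://my-bucket/enwiki-pages-articles.xml",   # placeholder location
    "com.example.WikiPageInputFormat",             # hypothetical custom InputFormat
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.Text",
)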