The file itself is for now just a Wikipedia dump, which can be downloaded from http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. It's basically one big .xml file that I need to parse so that the title + text of each article end up on one line of the data. For this I currently use gensim.corpora.wikicorpus.extract_pages, which can be seen here: https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py. This returns a generator from which I'd like to make an RDD.
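Concretely, the parsing step looks roughly like the following sketch (the output file name is arbitrary, and depending on the gensim version extract_pages yields (title, text) or (title, text, pageid) tuples):

    import bz2
    from gensim.corpora.wikicorpus import extract_pages

    # Stream the compressed dump and write one "title<TAB>text" record
    # per line; whitespace is flattened so each article fits on one line.
    with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as dump, \
            open('wiki_title_text.txt', 'w') as out:
        for page in extract_pages(dump):
            title, text = page[0], page[1]
            out.write('%s\t%s\n' % (title, ' '.join(text.split())))

A file produced this way could then be read back with sc.textFile('wiki_title_text.txt') and split on the tab, but materializing it first is exactly the disk-space problem described below.
______________________________________________________________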
From: Steve Lewis <lordjoe2...@gmail.com>
To: <jan.zi...@centrum.cz>
Date: 07.10.2014 01:25
Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
Say more about the one file you have - is the file itself large, and is it text? Here are 3 samples - I have tested the first two in Spark like this:

    Class inputFormatClass = MGFInputFormat.class;
    Class keyClass = String.class;
    Class valueClass = String.class;

    JavaPairRDD<String, String> spectraAsStrings = ctx.newAPIHadoopFile(
            path,
            inputFormatClass,
            keyClass,
            valueClass,
            ctx.hadoopConfiguration());

I have not tested with a non-local cluster or gigabyte-sized files on Spark, but the equivalent Hadoop code - like this, but returning Hadoop Text - works well at those scales.

On Mon, Oct 6, 2014 at 2:33 PM, <jan.zi...@centrum.cz> wrote:

@Davies I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck on the master node. Unfortunately, I am using Spark on EC2 and I don't have enough space on my nodes to store the whole data set that needs to be parsed by extract_pages. I have my data on S3, and I had hoped that after reading the data from S3 into an RDD (sc.textFile(file_on_s3)) it would be possible to pass the RDD to extract_pages; unfortunately, this does not work for me. If it worked, it would be by far the best way to go for me.

@Steve I can try a Hadoop custom InputFormat. It'd be great if you could send me some samples. But if I understand correctly, I'm afraid it won't work for me, because I don't actually have any URL to Wikipedia; I only have a file that is opened, parsed, and returned as a generator yielding the parsed page name and text from Wikipedia (it can also be some non-public Wikipedia-like site).
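To illustrate, a sketch of that kind of attempt (the bucket path is made up); a likely reason it fails is that each partition only sees an arbitrary slice of the file's lines, so extract_pages never receives a complete, well-formed XML stream:

    from io import StringIO  # on Python 2: from StringIO import StringIO
    from gensim.corpora.wikicorpus import extract_pages

    def parse_partition(lines):
        # Glue the partition's lines back into a file-like object. A <page>
        # element split across a partition boundary arrives truncated here,
        # which is why this breaks on a real dump.
        return extract_pages(StringIO(u'\n'.join(lines)))

    pages = sc.textFile('s3n://my-bucket/enwiki-latest-pages-articles.xml') \
              .mapPartitions(parse_partition)
______________________________________________________________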
From: Steve Lewis <lordjoe2...@gmail.com>
To: Davies Liu <dav...@databricks.com>
Date: 06.10.2014 22:39
Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
CC: "user"
Try a Hadoop custom InputFormat - I can give you some samples. While I have not tried this, an input split has only a length (which could be ignored if the format treats the file as non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole thing should work. Hadoop InputFormats seem to be the best way to get large (say multi-gigabyte) files into RDDs.
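From PySpark, such a custom InputFormat could be wired up roughly as follows - a sketch only: the class name com.example.WikiPageInputFormat and the path are hypothetical, the format's jar must be on the executors' classpath, and it is assumed to emit Text title/text pairs:

    pages = sc.newAPIHadoopFile(
        's3n://my-bucket/enwiki-latest-pages-articles.xml',  # hypothetical path
        'com.example.WikiPageInputFormat',                   # hypothetical class
        'org.apache.hadoop.io.Text',
        'org.apache.hadoop.io.Text')

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com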