Maybe sc.wholeTextFiles() is what you want; you get each file's whole text as one string and can parse it yourself.
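As a rough sketch: read the dump (split into manageable chunks) with sc.wholeTextFiles(), then parse each chunk's <page> elements yourself, e.g. with xml.etree.ElementTree. Below is a minimal local illustration of the parsing step only; the path, the chunking, and the extract_pages name are illustrative assumptions, not the gensim implementation:

```python
import xml.etree.ElementTree as ET

def extract_pages(xml_text):
    """Yield (title, text) pairs from a MediaWiki-style XML export string."""
    root = ET.fromstring(xml_text)
    for page in root.iter():
        # MediaWiki exports use a versioned namespace; compare local tag names.
        if page.tag.rsplit('}', 1)[-1] != 'page':
            continue
        title, text = None, None
        for elem in page.iter():
            tag = elem.tag.rsplit('}', 1)[-1]
            if tag == 'title':
                title = elem.text
            elif tag == 'text':
                text = elem.text
        yield (title, text)

# Local check on a toy document:
sample = """<mediawiki>
  <page><title>Foo</title><revision><text>Foo body</text></revision></page>
  <page><title>Bar</title><revision><text>Bar body</text></revision></page>
</mediawiki>"""
print(list(extract_pages(sample)))

# In Spark, something along these lines (untested sketch):
#   chunks = sc.wholeTextFiles("/path/to/xml/chunks/")   # RDD of (filename, content)
#   pages = chunks.flatMap(lambda kv: list(extract_pages(kv[1])))
```

Note that wholeTextFiles() loads each file entirely into memory on one executor, so a single multi-GB dump would need to be split into smaller well-formed XML files first.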
On Tue, Oct 7, 2014 at 1:06 AM, <jan.zi...@centrum.cz> wrote:
> Hi,
>
> I have already unsuccessfully asked a quite similar question on Stack Overflow,
> specifically here:
> http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
> I have also tried a workaround, without success; that workaround problem is described at
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.
>
> Specifically, what I am trying to do: I have an .xml dump of Wikipedia as the
> input. The .xml is quite big and its elements spread across multiple lines. You can
> check it out at
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
>
> My goal is to parse this .xml the same way gensim.corpora.wikicorpus.extract_pages
> does; the implementation is at
> https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.
> Unfortunately this approach does not work, because RDD.flatMap() processes the
> RDD line by line as strings.
>
> Does anyone have a suggestion for how to parse such Wikipedia-like .xml
> loaded in an RDD using Python?
>
> Thank you in advance for any suggestions, advice, or hints.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org