Hi, I have already unsuccessfully asked a quite similar question on Stack Overflow, here: http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim. I have also tried a workaround, again without success; that attempt is described at http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.
Specifically, what I am trying to do: I have an .xml dump of Wikipedia as input. The .xml is quite big and each <page> element spans multiple lines. You can get it at http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. My goal is to parse this .xml the same way gensim.corpora.wikicorpus.extract_pages does; the implementation is at https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py. Unfortunately that approach does not work directly, because RDD.flatMap() processes the RDD line by line as strings, whereas extract_pages needs a continuous stream covering whole <page> elements. Does anyone have a suggestion for how to parse a multi-line .xml like Wikipedia's, loaded in an RDD, using Python? Thank you in advance for any suggestions, advice or hints.
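For reference, the workaround from the second link looked roughly like this: parse the dump with gensim on the driver and feed the resulting pages into Spark in batches, so the generator never has to be materialized all at once. This is only a minimal sketch, assuming extract_pages yields one (title, text) tuple per article; the dump path and batch size are placeholders:

    import bz2
    import itertools

    from pyspark import SparkContext
    from gensim.corpora.wikicorpus import extract_pages

    sc = SparkContext(appName="WikiXmlParse")

    def page_batches(dump_path, batch_size=1000):
        """Parse the .xml.bz2 dump on the driver with gensim and
        yield lists of (title, text) tuples, batch_size at a time,
        so the whole dump never has to sit in driver RAM at once."""
        with bz2.BZ2File(dump_path) as f:
            pages = extract_pages(f)
            while True:
                batch = list(itertools.islice(pages, batch_size))
                if not batch:
                    break
                yield batch

    pages_rdd = None
    for batch in page_batches("enwiki-latest-pages-articles.xml.bz2"):
        part = sc.parallelize(batch)
        pages_rdd = part if pages_rdd is None else pages_rdd.union(part)

This runs, but the repeated union() calls build up a very long lineage and all the actual parsing still happens single-threaded on the driver, so it does not really scale to the full dump. I have also seen suggestions to read the dump with a Hadoop input format that splits records on <page>...</page> tags (e.g. Mahout's XmlInputFormat via sc.newAPIHadoopFile, with xmlinput.start/xmlinput.end set in the Hadoop conf), but I have not managed to get that working from Python; if that is the right direction, a pointer to a working example would be great.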