Hi, I have already unsuccessfully asked a quite similar question on Stack Overflow, here: http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim. I have also tried a workaround, again without success; that attempt is described at http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.
Specifically, what I am trying to do: I have an .xml dump of Wikipedia as input. The .xml is quite big and each <page> element spans multiple lines. You can get it at http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. My goal is to parse this .xml the same way gensim.corpora.wikicorpus.extract_pages does; the implementation is at https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py. Unfortunately that approach does not work directly, because RDD.flatMap() processes the RDD line by line as strings, whereas extract_pages needs a continuous stream covering whole <page> elements. Does anyone have a suggestion for how to parse a multi-line .xml like Wikipedia's, loaded in an RDD, using Python? Thank you in advance for any suggestions, advice or hints.
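For reference, the workaround from the second link looked roughly like this: parse the dump with gensim on the driver and feed the resulting pages into Spark in batches, so the generator never has to be materialized all at once. This is only a minimal sketch, assuming extract_pages yields one (title, text) tuple per article; the dump path and batch size are placeholders:

    import bz2
    import itertools

    from pyspark import SparkContext
    from gensim.corpora.wikicorpus import extract_pages

    sc = SparkContext(appName="WikiXmlParse")

    def page_batches(dump_path, batch_size=1000):
        """Parse the .xml.bz2 dump on the driver with gensim and
        yield lists of (title, text) tuples, batch_size at a time,
        so the whole dump never has to sit in driver RAM at once."""
        with bz2.BZ2File(dump_path) as f:
            pages = extract_pages(f)
            while True:
                batch = list(itertools.islice(pages, batch_size))
                if not batch:
                    break
                yield batch

    pages_rdd = None
    for batch in page_batches("enwiki-latest-pages-articles.xml.bz2"):
        part = sc.parallelize(batch)
        pages_rdd = part if pages_rdd is None else pages_rdd.union(part)

This runs, but the repeated union() calls build up a very long lineage and all the actual parsing still happens single-threaded on the driver, so it does not really scale to the full dump. I have also seen suggestions to read the dump with a Hadoop input format that splits records on <page>...</page> tags (e.g. Mahout's XmlInputFormat via sc.newAPIHadoopFile, with xmlinput.start/xmlinput.end set in the Hadoop conf), but I have not managed to get that working from Python; if that is the right direction, a pointer to a working example would be great.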