Maybe sc.wholeTextFiles() is what you want: it gives you each file's whole
text as a single string, which you can then parse yourself.
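Something like this rough sketch might work. It assumes you first split the
dump into self-contained, well-formed XML chunk files (the path
"hdfs:///wiki-chunks/*.xml" is just a placeholder, and the full single-file
dump would be too big to hold as one string on a worker anyway), and that
gensim is installed on the workers:

    from io import StringIO
    from gensim.corpora import wikicorpus

    # wholeTextFiles() returns an RDD of (path, content) pairs, one pair
    # per file, so multi-line XML records stay intact instead of being
    # split line by line the way textFile() splits them.
    chunks = sc.wholeTextFiles("hdfs:///wiki-chunks/*.xml")

    def parse_chunk(path_and_content):
        _, content = path_and_content
        # extract_pages() expects a file-like object, so wrap the string
        return wikicorpus.extract_pages(StringIO(content))

    # RDD of the page tuples yielded by extract_pages()
    pages = chunks.flatMap(parse_chunk)

The catch is that each chunk file has to be valid XML on its own, since
extract_pages() parses its input with ElementTree's iterparse.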

On Tue, Oct 7, 2014 at 1:06 AM,  <jan.zi...@centrum.cz> wrote:
> Hi,
>
> I have already unsuccessfully asked a quite similar question on Stack
> Overflow, here:
> http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
> I've also tried a workaround, without success; that attempt is described at
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.
>
> Specifically, I have an .xml dump of Wikipedia as the input. The .xml is
> quite big and each record spreads across multiple lines. You can check it
> out at
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
>
> My goal is to parse this .xml in the same way as
> gensim.corpora.wikicorpus.extract_pages does; the implementation is at
> https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.
> Unfortunately this approach does not work, because RDD.flatMap() processes
> the RDD line by line, as strings.
>
> Does anyone have a suggestion for how to parse the Wikipedia .xml loaded
> into an RDD using Python?
>
> Thank you in advance for any suggestions, advice, or hints.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
