Thank you, this seems to be the way to go, but unfortunately, when I try to
use sc.wholeTextFiles() on a file stored on Amazon S3, I get the following
error:
 
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
 
My code is the following:
 
from pyspark import SparkContext
import gensim.corpora.wikicorpus

sc = SparkContext(appName="Process wiki")
# Read every file under the S3 prefix as one (path, whole file content) pair.
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
    print item.getvalue()
sc.stop()
 
So my question is: is it possible to read whole files from S3? Based on the
documentation it should be possible, but it does not seem to work for me.
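
For reference, here is a minimal sketch of what I would expect to work, with
the S3 credentials passed explicitly through the Hadoop configuration. The
fs.s3n.* keys are the standard s3n settings; the placeholder credentials and
the use of sc._jsc are my assumptions, not something I have verified fixes
this:

from pyspark import SparkContext

sc = SparkContext(appName="Process wiki")

# Hand the S3 credentials to the Hadoop s3n filesystem used by wholeTextFiles().
# (fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are the standard s3n keys;
#  sc._jsc is PySpark's handle to the underlying JavaSparkContext.)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "MY_ACCESS_KEY")        # placeholder
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "MY_SECRET_KEY")    # placeholder

# Each element should be a (file path, whole file content) pair.
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
print distData.map(lambda kv: kv[0]).take(1)    # just check the paths resolve
sc.stop()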
 
______________________________________________________________
From: Davies Liu <dav...@databricks.com>
To: <jan.zi...@centrum.cz>
CC: "u...@spark.incubator.apache.org"
Date: 07.10.2014 17:38
Subject: Re: Parsing one big multiple line .xml loaded in RDD using Python
Maybe sc.wholeTextFiles() is what you want; you can get the whole text
and parse it yourself.
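
(My own illustration of the difference, not Davies' code, using a hypothetical
local path: textFile() yields one record per line, while wholeTextFiles()
yields one (path, content) pair per file, so a multi-line document stays
together.)

from pyspark import SparkContext

sc = SparkContext(appName="wholeTextFiles sketch")

# One element per *line* -- a multi-line XML document gets split apart.
lines = sc.textFile('/tmp/wikiinput/wiki.xml')       # hypothetical local path
print lines.count()                                  # number of lines

# One element per *file*, as a (path, full content) pair -- the document
# can then be parsed as a whole on the worker.
files = sc.wholeTextFiles('/tmp/wikiinput')
path, content = files.first()
print path, len(content)

sc.stop()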

On Tue, Oct 7, 2014 at 1:06 AM,  <jan.zi...@centrum.cz> wrote:
Hi,

I have already asked a quite similar question on Stack Overflow, unfortunately
without success, here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim
I've also tried a workaround, again without success; the workaround problem
can be found at
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html

In particular, what I'm trying to do: I have an .xml dump of Wikipedia as the
input. The .xml is quite big and it spreads across multiple lines. You can
check it out at
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

My goal is to parse this .xml in the same way as
gensim.corpora.wikicorpus.extract_pages does; the implementation is at
https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py
Unfortunately this approach does not work, because RDD.flatMap() processes the
RDD line by line, as strings.
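
To make the mismatch concrete, here is a small sketch of what extract_pages
expects versus what a line-based RDD gives it (my own sketch; I'm only
assuming extract_pages' usual file-object interface, and the StringIO
wrapping is untested):

from StringIO import StringIO            # Python 2; io.StringIO on Python 3
import gensim.corpora.wikicorpus as wikicorpus

# Locally, extract_pages consumes an open file-like object over the *whole*
# document and yields one tuple per <page> element.
with open('/tmp/wikiinput/wiki.xml') as f:            # hypothetical local path
    for page in wikicorpus.extract_pages(f):
        print page[0]                                 # page title

# With sc.textFile() each RDD element is a single line (a short string), so
# extract_pages never sees a complete document.  The whole-file content from
# sc.wholeTextFiles() (mentioned above) would have to be wrapped in a
# file-like object first, e.g.:
def parse_whole_document(xml_text):
    return list(wikicorpus.extract_pages(StringIO(xml_text)))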

Does anyone have a suggestion for how to parse Wikipedia-like .xml loaded into
an RDD using Python?

Thank you in advance for any suggestions, advice or hints.



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
