, this seems to work for the test code and rough use cases.
Maybe you can give it a try :).
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Parsing-a-large-XML-file-using-Spark-tp19239p25272.html
Sent from the Apache Spark User List mailing list archive
Actually, it's a real
On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
one solution.
One issue with those XML files is that they cannot be processed line by
line in parallel; plus you
(sorry about the previous spam... google inbox didn't allow me to cancel
the miserable sent action :-/)
So what I was about to say: it's a real PAIN in the ass to parse the
wikipedia articles in the dump due to these multiline articles...
However, there is a way to manage that quite easily,
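One approach that has come up on the list (and in the linked mail-archive thread) is to split the dump on the closing tag of the repeating element, e.g. `</page>`, which is essentially what setting Hadoop's `textinputformat.record.delimiter` lets each Spark task do. A minimal pure-Python sketch of the idea, with made-up sample data (not the actual Spark code from the thread):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump shaped like a Wikipedia export:
# many <page> elements inside one huge <mediawiki> root.
dump = """<mediawiki>
<page><title>A</title><text>first article</text></page>
<page><title>B</title><text>second
article spans lines</text></page>
</mediawiki>"""

def split_records(blob, start_tag="<page>", end_tag="</page>"):
    """Split a dump into self-contained <page>...</page> records.
    This mimics delimiting input records on </page> instead of
    newlines, so each record can be parsed independently."""
    records = []
    for chunk in blob.split(end_tag)[:-1]:
        start = chunk.find(start_tag)
        if start != -1:
            records.append(chunk[start:] + end_tag)
    return records

# Each record is now a small, well-formed XML document.
titles = [ET.fromstring(r).findtext("title") for r in split_records(dump)]
print(titles)  # ['A', 'B']
```

In Spark this maps to reading the file with a custom record delimiter (or an XML-aware input format) so that each partition receives whole `<page>` records rather than arbitrary lines.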
If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the larger
dump that also contains all revision information) stored in HDFS, is it
possible to parse it in parallel/faster using Spark? Or do we have to use
something like a PullParser or Iteratee?
My current solution is to read the single
Hi,
see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one
solution.
One issue with those XML files is that they cannot be processed line by
line in parallel; plus you inherently need shared/global state to parse XML
or check for well-formedness, I think. (Same issue with
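To illustrate the well-formedness point: a fragment cut at an arbitrary line boundary is not parseable on its own, while a fragment cut at a record boundary is — which is why the split must happen on element boundaries, not lines. A small sketch with made-up sample data:

```python
import xml.etree.ElementTree as ET

# One record spanning several lines, like a multiline wiki article.
record = "<page>\n  <title>X</title>\n  <text>body</text>\n</page>"

# Cutting at an arbitrary line boundary yields a fragment that is
# not well-formed by itself (the <page> element is never closed).
lines = record.splitlines()
try:
    ET.fromstring("\n".join(lines[:2]))
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False

# The complete record, cut at the </page> boundary, parses fine.
print(ET.fromstring(record).findtext("title"))  # X
```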