Re: Parsing a large XML file using Spark

2015-11-04 Thread Jin
This looks like it works for the test code and rough use cases. Maybe you can try this :). Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parsing-a-large-XML-file-using-Spark-tp19239p25272.html
Sent from the Apache Spark User List mailing list archive

Re: Parsing a large XML file using Spark

2014-11-21 Thread Prannoy

Re: Parsing a large XML file using Spark

2014-11-21 Thread Paul Brown

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
Actually, it's a real

On Tue, Nov 18, 2014 at 2:52 AM Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi, see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one solution. One issue with those XML files is that they cannot be processed line by line in parallel; plus you

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
(Sorry about the previous spam... Google Inbox didn't allow me to cancel the miserable send action :-/) So what I was about to say: it's a real PAIN in the ass to parse the Wikipedia articles in the dump due to these multiline articles... However, there is a way to manage that quite easily,
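[A common way to handle the multiline-article problem Andy describes is to split the input on the record's start/end tags rather than on newlines, which is what Hadoop's XmlInputFormat (usable from Spark via newAPIHadoopFile) does. Below is a minimal pure-Python sketch of that splitting idea only, not Andy's actual code; the <page>/</page> tag names are assumptions based on the Wikipedia dump format.]

```python
def split_records(text, start_tag="<page>", end_tag="</page>"):
    """Yield-style record splitter: emit everything between a start
    and end tag, so each <page>...</page> block is one record no
    matter how many lines it spans."""
    records = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start)
        if end == -1:
            break  # drop an incomplete trailing record
        records.append(text[start:end + len(end_tag)])
        pos = end + len(end_tag)
    return records

dump = """<mediawiki>
<page><title>A</title>
<text>first article</text></page>
<page><title>B</title>
<text>second
article</text></page>
</mediawiki>"""

pages = split_records(dump)  # two records, the second spanning three lines
```

In Spark, the same effect is typically achieved by configuring the input format's start/end tag properties and then parsing each emitted record string as a standalone XML fragment.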

Parsing a large XML file using Spark

2014-11-18 Thread Soumya Simanta
If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the larger dump that has all revision information as well) stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a PullParser or Iteratee? My current solution is to read the single
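[The PullParser option Soumya mentions can be sketched in a few lines: stream the document and handle each record as it closes, so memory stays bounded even for a 44 GB file. This is an illustrative sketch using Python's xml.etree.ElementTree.iterparse, not the poster's actual solution; the page/title element names are assumptions based on the Wikipedia dump schema.]

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(stream):
    """Pull-parse the stream, yielding each page's title as soon as
    its </page> tag is seen, then clearing the subtree to free memory."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            elem.clear()  # drop the processed subtree

sample = io.StringIO(
    "<mediawiki>"
    "<page><title>Alpha</title></page>"
    "<page><title>Beta</title></page>"
    "</mediawiki>"
)

titles = list(iter_pages(sample))  # ["Alpha", "Beta"]
```

Note this is sequential: a single pull parser gives constant memory but no parallelism, which is exactly the trade-off the replies below discuss.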

Re: Parsing a large XML file using Spark

2014-11-18 Thread Tobias Pfeiffer
Hi, see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one solution. One issue with those XML files is that they cannot be processed line by line in parallel; plus you inherently need shared/global state to parse XML or check for well-formedness, I think. (Same issue with
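[Tobias's point that these files "cannot be processed line by line in parallel" can be seen with a tiny illustration: when a record spans multiple lines, the individual lines are not well-formed XML on their own, so naive per-line workers have nothing valid to parse. A minimal demonstration, assuming nothing beyond the standard library:]

```python
import xml.etree.ElementTree as ET

# A two-line record: valid as a whole, invalid line by line.
doc = "<page><title>Spark\n</title></page>"

# Parsing the whole document succeeds:
whole_ok = ET.fromstring(doc).tag == "page"

# Parsing each line in isolation fails, because every line has an
# unmatched open or close tag:
line_ok = []
for line in doc.splitlines():
    try:
        ET.fromstring(line)
        line_ok.append(True)
    except ET.ParseError:
        line_ok.append(False)
```

This is why record boundaries (e.g. start/end tags) rather than newlines have to define the unit of parallelism for XML input.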