If there is one big XML file (e.g., the 44GB Wikipedia dump, or the larger
dump that also contains all revision information) stored in HDFS, is it
possible to parse it in parallel/faster using Spark? Or do we have to use
something like a pull parser or an Iteratee?
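
Something like the following is what I had in mind; it is only a rough
sketch, assuming a Hadoop version whose TextInputFormat honors
textinputformat.record.delimiter (Hadoop 2.x does), and the input path is
made up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ParallelXmlRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wiki-xml"))

    // Break records on </page> instead of newlines, so every record a
    // worker sees is (roughly) one complete <page> element.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "</page>")

    val pages = sc
      .newAPIHadoopFile("hdfs:///data/enwiki.xml",   // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map { case (_, text) => text.toString }       // copy: Hadoop reuses Text
      .filter(_.contains("<page>"))                  // drop header/trailer fragments
      .map(s => s.substring(s.indexOf("<page>")) + "</page>")

    // Each record is now a small, self-contained XML fragment.
    val titles = pages.map(s => (scala.xml.XML.loadString(s) \ "title").text)
    titles.take(5).foreach(println)
  }
}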

My current solution is to read the single XML file in a first pass,
splitting it into many small files written back to HDFS, and then read
those small files in parallel on the Spark workers.
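
The first pass looks roughly like this (a simplified sketch: paths are
hypothetical, and it relies on each <page> starting and ending on its own
line, which holds for the Wikipedia dumps):

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SplitDump {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val in = new BufferedReader(new InputStreamReader(
      fs.open(new Path("hdfs:///data/enwiki.xml"))))   // hypothetical path

    def newChunk(i: Int) =
      new PrintWriter(fs.create(new Path(f"hdfs:///data/chunks/part-$i%05d")))

    val pagesPerChunk = 10000
    var pages = 0
    var chunk = 0
    var out = newChunk(chunk)
    var line = in.readLine()
    while (line != null) {
      out.println(line)
      if (line.trim == "</page>") {        // a page just ended
        pages += 1
        if (pages % pagesPerChunk == 0) {  // roll over to the next chunk file
          out.close(); chunk += 1; out = newChunk(chunk)
        }
      }
      line = in.readLine()
    }
    out.close(); in.close()
  }
}

The chunks can then be read back in parallel, e.g. with
sc.wholeTextFiles("hdfs:///data/chunks"). Note that only the first chunk
carries the <mediawiki> header, so the second pass parses the individual
<page> fragments rather than whole documents.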

Thanks
-Soumya
