If there is one big XML file (e.g., the 44GB Wikipedia dump, or the larger
dump that also contains all revision information) stored in HDFS, is it
possible to parse it in parallel/faster using Spark? Or do we have to use
something like a pull parser or an Iteratee?
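
Something like the following is what I had in mind; it is only a rough
sketch, assuming a Hadoop version whose TextInputFormat honors
textinputformat.record.delimiter (Hadoop 2.x does), and the input path is
made up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ParallelXmlRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wiki-xml"))

    // Break records on </page> instead of newlines, so every record a
    // worker sees is (roughly) one complete <page> element.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "</page>")

    val pages = sc
      .newAPIHadoopFile("hdfs:///data/enwiki.xml",   // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map { case (_, text) => text.toString }       // copy: Hadoop reuses Text
      .filter(_.contains("<page>"))                  // drop header/trailer fragments
      .map(s => s.substring(s.indexOf("<page>")) + "</page>")

    // Each record is now a small, self-contained XML fragment.
    val titles = pages.map(s => (scala.xml.XML.loadString(s) \ "title").text)
    titles.take(5).foreach(println)
  }
}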

My current solution is to read the single XML file in a first pass,
splitting it into many small files written back to HDFS, and then read
those small files in parallel on the Spark workers.
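
The first pass looks roughly like this (a simplified sketch: paths are
hypothetical, and it relies on each <page> starting and ending on its own
line, which holds for the Wikipedia dumps):

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SplitDump {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val in = new BufferedReader(new InputStreamReader(
      fs.open(new Path("hdfs:///data/enwiki.xml"))))   // hypothetical path

    def newChunk(i: Int) =
      new PrintWriter(fs.create(new Path(f"hdfs:///data/chunks/part-$i%05d")))

    val pagesPerChunk = 10000
    var pages = 0
    var chunk = 0
    var out = newChunk(chunk)
    var line = in.readLine()
    while (line != null) {
      out.println(line)
      if (line.trim == "</page>") {        // a page just ended
        pages += 1
        if (pages % pagesPerChunk == 0) {  // roll over to the next chunk file
          out.close(); chunk += 1; out = newChunk(chunk)
        }
      }
      line = in.readLine()
    }
    out.close(); in.close()
  }
}

The chunks can then be read back in parallel, e.g. with
sc.wholeTextFiles("hdfs:///data/chunks"). Note that only the first chunk
carries the <mediawiki> header, so the second pass parses the individual
<page> fragments rather than whole documents.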

Thanks
-Soumya
