If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the larger dump that also includes all revision history) stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a pull parser or an Iteratee?
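For context, a minimal sketch of what I mean by the pull-parser approach, in plain Python (no Spark here; the `page` tag name is an assumption based on the Wikipedia dump format):

```python
import xml.etree.ElementTree as ET

def stream_pages(path, tag="page"):
    """Yield each <page> element from a large XML file one at a time,
    without loading the whole document into memory (pull-parser style)."""
    # iterparse fires an "end" event once an element is fully read
    for event, elem in ET.iterparse(path, events=("end",)):
        # strip any namespace prefix, e.g. "{ns}page" -> "page"
        if elem.tag.rsplit("}", 1)[-1] == tag:
            yield elem
            elem.clear()  # free the already-processed subtree
```

Each yielded element could then be serialized out as a small record/file, which is essentially the first pass I describe below.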
My current solution is to read the single XML file in a first pass, split it into smaller files written back to HDFS, and then read those small files in parallel on the Spark workers. Thanks -Soumya