Hi, see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one solution.
One issue with those XML files is that they cannot be split and processed line by line in parallel. On top of that, I think you inherently need shared/global state (for example, which tags are currently open) to parse XML or check it for well-formedness. (Multi-line JSON has the same issue, by the way.)

Tobias
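
P.S. A quick way to see the line-by-line problem outside of Spark is the small Python sketch below. The sample documents and the helper names (parses_as_xml, parses_as_json) are made up for illustration, and it only uses the standard library, so take it as a rough demonstration rather than anything Spark-specific.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample documents, each spread over several lines.
XML_DOC = """<records>
  <record id="1">
    <name>alpha</name>
  </record>
</records>"""

JSON_DOC = """{
  "id": 1,
  "name": "alpha"
}"""


def parses_as_xml(text):
    """Return True if `text` on its own is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parses_as_json(text):
    """Return True if `text` on its own is a valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Each document parses fine when the parser sees it as a whole ...
whole_xml_ok = parses_as_xml(XML_DOC)
whole_json_ok = parses_as_json(JSON_DOC)

# ... but not when every line is handed to the parser in isolation,
# which is what a line-oriented input split effectively does.
xml_lines_ok = [parses_as_xml(line) for line in XML_DOC.splitlines()]
json_lines_ok = [parses_as_json(line) for line in JSON_DOC.splitlines()]

print("whole XML ok:", whole_xml_ok, "| per line:", xml_lines_ok)
print("whole JSON ok:", whole_json_ok, "| per line:", json_lines_ok)
```

Even where a single line happens to be well-formed by itself, it has lost the enclosing elements it belongs to, so the parser needs the state accumulated from the earlier lines to interpret it correctly.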