There was a previous discussion about this here:

http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html

How big are the XML or JSON files you're looking to deal with?

It may not be practical to deserialize the entire document at once. In that
case an obvious work-around would be to have some kind of pre-processing
step that separates XML nodes/JSON objects with newlines so that you *can*
analyze the data with Spark in a "line-oriented format". Your preprocessor
wouldn't have to parse/deserialize the massive document; it would just have
to track opening/closing tags/braces to know when to insert a newline.
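
For the JSON case, a minimal sketch of such a preprocessor might look like
this (Python; the function name is mine, and I'm assuming the input is one
big JSON array of objects):

    def split_json_objects(in_path, out_path):
        # Never parses the document; just tracks brace depth (and string
        # quoting, so braces inside strings don't count) and emits a
        # newline after each top-level object.
        depth = 0
        in_string = False
        escaped = False
        with open(in_path) as src, open(out_path, "w") as dst:
            for ch in iter(lambda: src.read(1), ""):
                if in_string:
                    dst.write(ch)
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    dst.write(ch)
                elif ch == "{":
                    depth += 1
                    dst.write(ch)
                elif ch == "}":
                    depth -= 1
                    dst.write(ch)
                    if depth == 0:
                        dst.write("\n")  # end of a top-level object
                elif depth > 0:
                    dst.write(ch)
                # top-level '[', ']', ',' and whitespace are dropped

Reading one character at a time is just for clarity; a buffered version
would follow the same state machine.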

Then you'd just open the line-delimited result and deserialize the
individual objects/nodes with map().
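
For line-delimited JSON, that step might look something like this in
PySpark (assuming the shell's SparkContext sc; the path and field name
are placeholders):

    import json

    # each line is now one complete JSON object, so textFile() yields
    # one record per object and json.loads deserializes each of them
    records = sc.textFile("hdfs:///path/to/split-output.jsonl").map(json.loads)

    # e.g. pull one field out of every object
    names = records.map(lambda obj: obj.get("name"))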

Nick


On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll <dcarr...@cloudera.com> wrote:

> Has anyone got a working example of a Spark application that analyzes data
> in a non-line-oriented format, such as XML or JSON?  I'd like to do this
> without re-inventing the wheel...anyone care to share?  Thanks!
>
> Diana
>
