I don't actually have any data. I'm writing a course that teaches students how to do this sort of thing, and I'm interested in looking at a variety of real-life examples of people doing exactly that. I'd love to see some working code implementing the "obvious work-around" you mention...do you have any to share? It's an approach that makes a lot of sense, and as I said, I'd rather not re-invent the wheel if someone else has already written that code. Thanks!
Diana

On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> There was a previous discussion about this here:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>
> How big are the XML or JSON files you're looking to deal with?
>
> It may not be practical to deserialize the entire document at once. In
> that case an obvious work-around would be to have some kind of
> pre-processing step that separates XML nodes/JSON objects with newlines so
> that you *can* analyze the data with Spark in a "line-oriented format".
> Your preprocessor wouldn't have to parse/deserialize the massive document;
> it would just have to track open/closed tags/braces to know when to insert
> a newline.
>
> Then you'd just open the line-delimited result and deserialize the
> individual objects/nodes with map().
>
> Nick
>
> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> Has anyone got a working example of a Spark application that analyzes
>> data in a non-line-oriented format, such as XML or JSON? I'd like to do
>> this without re-inventing the wheel...anyone care to share? Thanks!
>>
>> Diana
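[For the archive: the brace-tracking preprocessor Nick describes can be sketched in a few lines of Python. This is not code from the thread, just a minimal illustration for the JSON case; it assumes the document is a sequence (or array) of top-level objects, and it tracks string/escape state so braces inside string values don't confuse the depth counter. The function names `split_json_objects` and `to_line_delimited` are invented for the example.]

```python
import io
import json

def split_json_objects(stream):
    """Yield each complete top-level JSON object from a character stream
    by tracking brace depth, without parsing the whole document."""
    depth = 0
    in_string = False
    escaped = False
    buf = []
    while True:
        ch = stream.read(1)
        if not ch:
            break
        if escaped:
            escaped = False          # this char is escaped; take it literally
        elif ch == '\\':
            escaped = True           # next char is escaped
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == '{':
                if depth == 0:
                    buf = []         # start collecting a new object
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:       # object closed: emit it
                    buf.append(ch)
                    yield ''.join(buf)
                    continue
        if depth > 0:
            buf.append(ch)

def to_line_delimited(src_path, dst_path):
    """Rewrite a JSON document with one object per line, so Spark's
    textFile() can split it on newlines."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for obj in split_json_objects(src):
            dst.write(obj + '\n')
```

After the rewrite, the line-delimited file can be analyzed with the usual line-oriented Spark pattern, e.g. `sc.textFile("out.jsonlines").map(json.loads)` in PySpark, as suggested above. An XML version would be the same idea but tracking open/close tags instead of braces.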