I don't actually have any data. I'm writing a course that teaches students
how to do this sort of thing and am interested in looking at a variety of
real-life examples of people doing it. I'd love to see some working code
implementing the "obvious work-around" you mention... do you have any to
share? It's an approach that makes a lot of sense, and as I said, I'd love
not to have to reinvent the wheel if someone else has already written that
code. Thanks!

Diana


On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> There was a previous discussion about this here:
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>
> How big are the XML or JSON files you're looking to deal with?
>
> It may not be practical to deserialize the entire document at once. In
> that case, an obvious work-around would be to have some kind of
> pre-processing step that separates XML nodes/JSON objects with newlines so
> that you *can* analyze the data with Spark in a "line-oriented format".
> Your preprocessor wouldn't have to fully parse/deserialize the massive
> document; it would just have to track opening/closing tags/braces to know
> when to insert a newline.
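>
> Something like this might do it for the JSON case (a rough, untested
> sketch; it assumes the top-level values are objects, and the name
> split_objects is just made up for illustration):
>
>     def split_objects(instream, outstream):
>         # Write one JSON object per line by tracking brace depth.
>         # No real parsing: just enough state to ignore braces that
>         # appear inside quoted strings.
>         depth = 0
>         in_string = False
>         escaped = False
>         buf = []
>         while True:
>             chunk = instream.read(65536)  # chunked, so the whole file
>             if not chunk:                 # never has to fit in memory
>                 break
>             for ch in chunk:
>                 if in_string:
>                     if escaped:
>                         escaped = False
>                     elif ch == '\\':
>                         escaped = True
>                     elif ch == '"':
>                         in_string = False
>                 else:
>                     if ch == '"':
>                         in_string = True
>                     elif ch == '{':
>                         depth += 1
>                     elif ch == '}':
>                         depth -= 1
>                         if depth == 0:
>                             buf.append(ch)
>                             outstream.write(''.join(buf) + '\n')
>                             buf = []
>                             continue
>                 if depth > 0:
>                     # JSON strings can't hold raw newlines, so any
>                     # newline here is pretty-printing; flatten it.
>                     buf.append(' ' if ch in '\r\n' else ch)
>
> This also happens to work if the objects are wrapped in one big array,
> since brackets and commas at depth zero are simply dropped. An XML version
> would track open/close tags the same way.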
>
> Then you'd just read in the line-delimited result and deserialize the
> individual objects/nodes with map().
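>
> In PySpark, for instance, that last step could be as simple as this
> (assuming the pre-processed output was saved as objects.txt; sc is the
> SparkContext):
>
>     import json
>     records = sc.textFile("objects.txt").map(json.loads)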
>
> Nick
>
>
> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll
> <dcarr...@cloudera.com> wrote:
>
>> Has anyone got a working example of a Spark application that analyzes
>> data in a non-line-oriented format, such as XML or JSON? I'd like to do
>> this without reinventing the wheel... anyone care to share? Thanks!
>>
>> Diana
>>
>
>