Michael,

What an excellent example! Thank you for posting such a detailed
explanation and sample code. I see what you’re doing, and it looks like
it works very well as long as your source data has a well-known, fixed
structure.

I’m looking for a pattern that can be used to expose JSON data for
exploration via SQL, which means I can’t make any assumptions about the
structure of the data. It looks like it would be tough to extend this
approach for that purpose. You’d have to create a class with dynamically
defined attributes, right? I don’t know much about Scala or Java. Is that
even possible?
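
For instance (forgive the hand-waving, since I don’t know Scala), I
imagine each row would have to become something like a map instead of a
class with fixed fields:

   // Hypothetical sketch only: keep each JSON object as a Map, since
   // the fields aren't known ahead of time. Assumes lift-json's JObject
   // is imported and json_data is an RDD[JValue], as in Tobias's
   // snippet quoted below.
   val rows = json_data.collect { case obj: JObject => obj.values }
   // rows: RDD[Map[String, Any]] -- but could Spark SQL query that?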

But going back to the pattern you presented, I have a question. Say your
data does have a fixed structure, but some of the JSON values are lists.
How would you map that to a SchemaRDD? (I didn’t notice any list values in
the CandyCrush example.) Take the likes field from my original example:

{
   "name": "Nick",
   "location": {
      "x": 241.6,
      "y": -22.5
   },
   "likes": ["ice cream", "dogs", "Vanilla Ice"]
}

Would you register likes as a separate table and link it to the parent
table somehow?
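
For example (pardon the hand-wavy Scala), I could imagine flattening
likes into its own RDD, one row per like, keyed back to the parent by
name:

   // Hypothetical sketch: a child table with one row per element of
   // "likes", joinable to the parent table on name. Assumes json_data
   // is an RDD[JValue] and lift-json's
   // implicit val formats = DefaultFormats is in scope.
   case class PersonLike(name: String, like: String)

   val personLikes = json_data.flatMap { json =>
     val name = (json \ "name").extract[String]
     (json \ "likes").extract[List[String]].map(l => PersonLike(name, l))
   }
   personLikes.registerAsTable("person_likes")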

Nick

On Thu, May 22, 2014 at 2:56 AM, Michael Cutler <mich...@tumra.com> wrote:

> Hi Nick,
>
> Here is an illustrated example which extracts certain fields from Facebook
> messages, each one is a JSON object and they are serialised into files with
> one complete JSON object per line. Example of one such message:
> CandyCrush.json <https://gist.github.com/cotdp/131a1c9fc620ab7898c4>
>
> You need to define a case class which has all the fields you'll be able to
> query later in SQL, e.g.
>
>     case class CandyCrushInteraction(id: String, user: String, level: Int,
>                                      gender: String, language: String)
>
> The basic objective is to use Spark to convert the file from
> RDD[String] --parse JSON--> RDD[JValue] --extract fields--> RDD[CandyCrushInteraction]
>
>     // Produces an RDD[String]
>     val lines = sc.textFile("facebook-2014-05-19.json")
>
>     // Process the messages
>     val interactions = lines.map(line => {
>       // Parse the JSON into a JValue
>       parse(line)
>     }).filter(json => {
>       // Keep only 'Candy Crush Saga' Facebook App activity
>       (json \ "facebook" \ "application").extract[String] == "Candy Crush Saga"
>     }).map(json => {
>       // Extract the fields we want; we use compact() because they may not exist
>       val id = compact(json \ "facebook" \ "id")
>       val user = compact(json \ "facebook" \ "author" \ "hash")
>       val gender = compact(json \ "demographic" \ "gender")
>       val language = compact(json \ "language" \ "tag")
>
>       // Extract the 'level' using a regex, or default to zero
>       // ('pattern' is defined outside this snippet; see the full example below)
>       var level = 0
>       pattern.findAllIn(compact(json \ "interaction" \ "title")).matchData.foreach(m => {
>         level = m.group(1).toInt
>       })
>
>       // Results in an RDD[CandyCrushInteraction]
>       CandyCrushInteraction(id, user, level, gender, language)
>     })
>
>
> Now you can register the RDD[CandyCrushInteraction] as a table and query
> it in SQL.
>
>     interactions.registerAsTable("candy_crush_interaction")
>
>     // Game level by gender
>     sql("""SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level)
>            FROM candy_crush_interaction
>            WHERE level > 0
>            GROUP BY gender""").collect().foreach(println)
>
>     /* Returns:
>         ["male",14727,590,1,104.71705031574659]
>         ["female",15422,590,1,114.17202697445208]
>         ["mostly_male",2824,590,1,97.08852691218131]
>         ["mostly_female",1934,590,1,99.0517063081696]
>         ["unisex",2674,590,1,113.42071802543006]
>         [,11023,590,1,93.45677220357435]
>      */
>
>
> Full working example:
> CandyCrushSQL.scala <https://gist.github.com/cotdp/b5b8155bb85e254d2a3c>
>
> MC
>
>
> On 22 May 2014 04:43, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> That's a good idea. So you're saying create a SchemaRDD by applying a
>> function that deserializes the JSON and transforms it into a relational
>> structure, right?
>>
>> The end goal for my team would be to expose some JDBC endpoint for
>> analysts to query from, so once Shark is updated to use Spark SQL that
>> would become possible without having to resort to using Hive at all.
>>
>>
>> On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>
>>> Hi,
>>>
>>> as far as I understand, if you create an RDD with a relational
>>> structure from your JSON, you should be able to do much of that
>>> already today. For example, take lift-json's deserializer and do
>>> something like
>>>
>>>   val json_table: RDD[MyCaseClass] =
>>>     json_data.flatMap(json => json.extractOpt[MyCaseClass])
>>>
>>> then I guess you can use Spark SQL on that. (Something like your
>>> likes[2] query won't work, though, I guess.)
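>>>
>>> For concreteness (untested, and assuming lift-json's DefaultFormats
>>> is in scope so extractOpt can do its work), the case class for your
>>> example might look like:
>>>
>>>   implicit val formats = DefaultFormats
>>>   case class Location(x: Double, y: Double)
>>>   case class MyCaseClass(name: String, location: Location,
>>>     likes: List[String])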
>>>
>>> Regards
>>> Tobias
>>>
>>>
>>> On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>> > Looking forward to that update!
>>> >
>>> > Given a table of JSON objects like this one:
>>> >
>>> > {
>>> >    "name": "Nick",
>>> >    "location": {
>>> >       "x": 241.6,
>>> >       "y": -22.5
>>> >    },
>>> >    "likes": ["ice cream", "dogs", "Vanilla Ice"]
>>> > }
>>> >
>>> > It would be SUPER COOL if we could query that table in a way that is as
>>> > natural as follows:
>>> >
>>> > SELECT DISTINCT name
>>> > FROM json_table;
>>> >
>>> > SELECT MAX(location.x)
>>> > FROM json_table;
>>> >
>>> > SELECT likes[2] -- Ice Ice Baby
>>> > FROM json_table
>>> > WHERE name = "Nick";
>>> >
>>> > Of course, this is just a hand-wavy suggestion of how I’d like to be
>>> > able to query JSON (particularly that last example) using SQL. I’m
>>> > interested in seeing what y’all come up with.
>>> >
>>> > A large part of what my team does is make it easy for analysts to
>>> > explore and query JSON data using SQL. We have a fairly complex
>>> > home-grown process to do that and are looking to replace it with
>>> > something more out of the box. So if you’d like more input on how
>>> > users might use this feature, I’d be glad to chime in.
>>> >
>>> > Nick
>>> >
>>> >
>>> >
>>> > On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust
>>> > <mich...@databricks.com> wrote:
>>> >>
>>> >> You can already extract fields from JSON data using Hive UDFs. We
>>> >> have an intern working on better native support this summer. We will
>>> >> be sure to post updates once there is a working prototype.
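>>> >>
>>> >> For example, Hive's get_json_object UDF can pull individual fields
>>> >> out of a JSON string; assuming the raw JSON sits in a string column
>>> >> named json, something like:
>>> >>
>>> >>   SELECT get_json_object(json, '$.location.x') FROM json_table;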
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >> On Tue, May 20, 2014 at 6:46 PM, Nick Chammas
>>> >> <nicholas.cham...@gmail.com> wrote:
>>> >>>
>>> >>> The Apache Drill home page has an interesting heading: "Liberate
>>> >>> Nested Data".
>>> >>>
>>> >>> Is there any current or planned functionality in Spark SQL or Shark
>>> >>> to enable SQL-like querying of complex JSON?
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
