Hi Nick,

Here is an illustrated example which extracts certain fields from Facebook
messages. Each message is a JSON object, and they are serialised into files
with one complete JSON object per line. Example of one such message:
CandyCrush.json <https://gist.github.com/cotdp/131a1c9fc620ab7898c4>

You need to define a case class containing all the fields you'll want to
query later in SQL, e.g.

case class CandyCrushInteraction(id: String, user: String, level: Int,
gender: String, language: String)

The basic objective is to use Spark to convert the file from
RDD[String] --(parse JSON)--> RDD[JValue] --(extract fields)-->
RDD[CandyCrushInteraction].

    // Needed by extract[...] below (lift-json / json4s style)
    implicit val formats = DefaultFormats

    // Load the file; produces an RDD[String]
    val lines = sc.textFile("facebook-2014-05-19.json")

    // Process the messages
    val interactions = lines.map(line => {
      // Parse the JSON; produces an RDD[JValue]
      parse(line)
    }).filter(json => {
      // Keep only 'Candy Crush Saga' Facebook App activity
      (json \ "facebook" \ "application").extract[String] == "Candy Crush Saga"
    }).map(json => {
      // Extract the fields we want; we use compact() because they may not exist
      val id = compact(json \ "facebook" \ "id")
      val user = compact(json \ "facebook" \ "author" \ "hash")
      val gender = compact(json \ "demographic" \ "gender")
      val language = compact(json \ "language" \ "tag")
      // Extract the 'level' using a Regex ('pattern' is defined in the full
      // example linked below), or default to zero
      var level = 0
      pattern.findAllIn(compact(json \ "interaction" \ "title")).matchData.foreach(m => {
        level = m.group(1).toInt
      })
      // Produces an RDD[CandyCrushInteraction]
      CandyCrushInteraction(id, user, level, gender, language)
    })
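The level-extraction step above relies on a Regex named 'pattern' that lives
in the full example linked below. As a self-contained sketch of the same idea
(the exact pattern here is my assumption, not necessarily the one in the gist):

```scala
object LevelExtraction {
  // Hypothetical pattern matching e.g. "level 576"; the gist may differ
  val pattern = """[Ll]evel (\d+)""".r

  // Mirrors the loop above: take the last numeric match, default to zero
  def extractLevel(title: String): Int = {
    var level = 0
    pattern.findAllIn(title).matchData.foreach(m => level = m.group(1).toInt)
    level
  }

  def main(args: Array[String]): Unit = {
    println(extractLevel("\"John completed level 576 in Candy Crush Saga!\""))
    println(extractLevel("\"shared a photo\""))
  }
}
```

Titles with no match simply fall through and keep the default of zero, which
is why the SQL below filters on level > 0.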


Now you can register the RDD[CandyCrushInteraction] as a table and query it
in SQL.

    interactions.registerAsTable("candy_crush_interaction")

    // Game level by gender
    sql("SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level) " +
        "FROM candy_crush_interaction WHERE level > 0 GROUP BY gender")
      .collect().foreach(println)
    /* Returns:
        ["male",14727,590,1,104.71705031574659]
        ["female",15422,590,1,114.17202697445208]
        ["mostly_male",2824,590,1,97.08852691218131]
        ["mostly_female",1934,590,1,99.0517063081696]
        ["unisex",2674,590,1,113.42071802543006]
        [,11023,590,1,93.45677220357435]
     */
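For readers without a Spark shell handy, the grouping that query performs can
be sketched in plain Scala collections (the sample rows here are invented for
illustration, not real data):

```scala
object GroupByGenderSketch {
  case class CandyCrushInteraction(id: String, user: String, level: Int,
                                   gender: String, language: String)

  // Equivalent of: SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level)
  //                FROM candy_crush_interaction WHERE level > 0 GROUP BY gender
  def levelStatsByGender(rows: Seq[CandyCrushInteraction]): Map[String, (Int, Int, Int, Double)] =
    rows.filter(_.level > 0).groupBy(_.gender).map { case (gender, rs) =>
      val levels = rs.map(_.level)
      (gender, (levels.size, levels.max, levels.min,
                levels.sum.toDouble / levels.size))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for the parsed RDD
    val sample = Seq(
      CandyCrushInteraction("1", "a", 120, "male", "en"),
      CandyCrushInteraction("2", "b", 45, "female", "en"),
      CandyCrushInteraction("3", "c", 0, "male", "en"), // dropped: level == 0
      CandyCrushInteraction("4", "d", 300, "female", "en")
    )
    levelStatsByGender(sample).foreach(println)
  }
}
```

Spark SQL does the same thing, just distributed across the cluster instead of
on a single in-memory Seq.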


Full working example:
CandyCrushSQL.scala<https://gist.github.com/cotdp/b5b8155bb85e254d2a3c>

MC


*Michael Cutler*
Founder, CTO


*Mobile: +44 789 990 7847*
*Email:  mich...@tumra.com <mich...@tumra.com>*
*Web:    tumra.com <http://tumra.com/?utm_source=signature&utm_medium=email>*
*Visit us at our offices in Chiswick Park <http://goo.gl/maps/abBxq>*
*Registered in England & Wales, 07916412. VAT No. 130595328*


This email and any files transmitted with it are confidential and may also
be privileged. It is intended only for the person to whom it is addressed.
If you have received this email in error, please inform the sender immediately.
If you are not the intended recipient you must not use, disclose, copy,
print, distribute or rely on this email.


On 22 May 2014 04:43, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> That's a good idea. So you're saying create a SchemaRDD by applying a
> function that deserializes the JSON and transforms it into a relational
> structure, right?
>
> The end goal for my team would be to expose some JDBC endpoint for
> analysts to query from, so once Shark is updated to use Spark SQL that
> would become possible without having to resort to using Hive at all.
>
>
> On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
>> Hi,
>>
>> as far as I understand, if you create an RDD with a relational
>> structure from your JSON, you should be able to do much of that
>> already today. For example, take lift-json's deserializer and do
>> something like
>>
>>   val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
>> json.extractOpt[MyCaseClass])
>>
>> then I guess you can use Spark SQL on that. (Something like your
>> likes[2] query won't work, though, I guess.)
>>
>> Regards
>> Tobias
>>
>>
>> On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > Looking forward to that update!
>> >
>> > Given a table of JSON objects like this one:
>> >
>> > {
>> >    "name": "Nick",
>> >    "location": {
>> >       "x": 241.6,
>> >       "y": -22.5
>> >    },
>> >    "likes": ["ice cream", "dogs", "Vanilla Ice"]
>> > }
>> >
>> > It would be SUPER COOL if we could query that table in a way that is as
>> > natural as follows:
>> >
>> > SELECT DISTINCT name
>> > FROM json_table;
>> >
>> > SELECT MAX(location.x)
>> > FROM json_table;
>> >
>> > SELECT likes[2] -- Ice Ice Baby
>> > FROM json_table
>> > WHERE name = "Nick";
>> >
>> > Of course, this is just a hand-wavy suggestion of how I’d like to be
>> able to
>> > query JSON (particularly that last example) using SQL. I’m interested in
>> > seeing what y’all come up with.
>> >
>> > A large part of what my team does is make it easy for analysts to
>> explore
>> > and query JSON data using SQL. We have a fairly complex home-grown
>> process
>> > to do that and are looking to replace it with something more out of the
>> box.
>> > So if you’d like more input on how users might use this feature, I’d be
>> glad
>> > to chime in.
>> >
>> > Nick
>> >
>> >
>> >
>> > On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust <
>> mich...@databricks.com>
>> > wrote:
>> >>
>> >> You can already extract fields from json data using Hive UDFs.  We
>> have an
>> >> intern working on better native support this summer.  We will be
>> sure to
>> >> post updates once there is a working prototype.
>> >>
>> >> Michael
>> >>
>> >>
>> >> On Tue, May 20, 2014 at 6:46 PM, Nick Chammas <
>> nicholas.cham...@gmail.com>
>> >> wrote:
>> >>>
>> >>> The Apache Drill home page has an interesting heading: "Liberate
>> Nested
>> >>> Data".
>> >>>
>> >>> Is there any current or planned functionality in Spark SQL or Shark to
>> >>> enable SQL-like querying of complex JSON?
>> >>>
>> >>> Nick
>> >>>
>> >>>
>> >>> ________________________________
>> >>> View this message in context: Using Spark to analyze complex JSON
>> >>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>
>> >>
>> >
>>
>
>
