I am not 100% sure of the functionality in Catalyst; probably the easiest
way to see what it supports is to look at
SqlParser.scala <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala>
in Git.  Straight away I can see "LIKE", "RLIKE" and "REGEXP", so clearly
some of the basics are in there.

As the saying goes ... *"Use the source, Luke!
<http://blog.codinghorror.com/learn-to-read-the-source-luke/>"*   :o)
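As a rough sketch (plain Scala, not Spark code — the names and the 0.5
threshold are illustrative assumptions, not anything Catalyst provides):
RLIKE/REGEXP test whether a regex matches anywhere in the value, while a
Lucene-style fuzzy match like "it~0.5" would need a user-supplied
similarity function, since no such operator appears in the parser.

```scala
// Hedged sketch of the two predicates discussed in this thread.
object PredicateSketch {
  // RLIKE / REGEXP semantics: true when the regex matches anywhere in
  // the value (a "find", not a full-string match).
  def rlike(value: String, pattern: String): Boolean =
    pattern.r.findFirstIn(value).isDefined

  // A naive normalised edit-distance similarity, roughly what a
  // Lucene-style "it~0.5" fuzzy query asks for. Catalyst has no such
  // operator; something like this would have to be a user-defined function.
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }

  // Classic dynamic-programming edit distance.
  private def levenshtein(a: String, b: String): Int = {
    val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
                          dp(i - 1)(j - 1) + cost)
    }
    dp(a.length)(b.length)
  }

  def main(args: Array[String]): Unit = {
    // WHERE language RLIKE '^it' -- matches "it" and "it-IT" but not "fr"
    assert(rlike("it-IT", "^it"))
    assert(!rlike("fr", "^it"))
    // WHERE similarity(language, 'it') >= 0.5 -- a UDF-style fuzzy match
    assert(similarity("it", "it") == 1.0)
    assert(similarity("it", "is") >= 0.5)
    println("ok")
  }
}
```

The point of the second helper is only to show the shape of what "it~0.5"
would mean; wiring it into a query engine is a separate exercise.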


*Michael Cutler*
Founder, CTO


*Mobile: +44 789 990 7847*
*Email:   mich...@tumra.com <mich...@tumra.com>*
*Web:     tumra.com <http://tumra.com/?utm_source=signature&utm_medium=email>*
*Visit us at our offices in Chiswick Park <http://goo.gl/maps/abBxq>*
*Registered in England & Wales, 07916412. VAT No. 130595328*


This email and any files transmitted with it are confidential and may also
be privileged. It is intended only for the person to whom it is addressed.
If you have received this email in error, please inform the sender immediately.
If you are not the intended recipient you must not use, disclose, copy,
print, distribute or rely on this email.


On 22 May 2014 09:06, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Is there a way to query fields by similarity (like Lucene or using a
> similarity metric) to be able to query something like WHERE language LIKE
> "it~0.5" ?
>
> Best,
> Flavio
>
>
> On Thu, May 22, 2014 at 8:56 AM, Michael Cutler <mich...@tumra.com> wrote:
>
>> Hi Nick,
>>
>> Here is an illustrated example which extracts certain fields from
>> Facebook messages. Each message is a JSON object, and they are serialised
>> into files with one complete JSON object per line. Example of one such message:
>> CandyCrush.json <https://gist.github.com/cotdp/131a1c9fc620ab7898c4>
>>
>> You need to define a case class which has all the fields you'll be able
>> to query later in SQL, e.g.
>>
>> case class CandyCrushInteraction(id: String, user: String, level: Int,
>> gender: String, language: String)
>>
>> The basic objective is to use Spark to convert the file from
>> RDD[String] --parse JSON--> RDD[JValue] --extract fields--> RDD[CandyCrushInteraction]
>>
>>     // Produces an RDD[String]
>>     val lines = sc.textFile("facebook-2014-05-19.json")
>>
>>     // Process the messages
>>     val interactions = lines.map(line => {
>>       // Parse the JSON; each line becomes a JValue
>>       parse(line)
>>     }).filter(json => {
>>       // Keep only 'Candy Crush Saga' Facebook App activity
>>       (json \ "facebook" \ "application").extract[String] == "Candy Crush Saga"
>>     }).map(json => {
>>       // Extract the fields we want; we use compact() because they may not exist
>>       val id = compact(json \ "facebook" \ "id")
>>       val user = compact(json \ "facebook" \ "author" \ "hash")
>>       val gender = compact(json \ "demographic" \ "gender")
>>       val language = compact(json \ "language" \ "tag")
>>
>>       // Extract the 'level' using a RegEx ('pattern' is defined in the
>>       // full gist linked at the end), or default to zero
>>       var level = 0
>>       pattern.findAllIn( compact(json \ "interaction" \ "title") ).matchData.foreach(m => {
>>         level = m.group(1).toInt
>>       })
>>
>>       // Each element is a CandyCrushInteraction, so the result is an
>>       // RDD[CandyCrushInteraction]
>>       CandyCrushInteraction(id, user, level, gender, language)
>>     })
>>
>>
>> Now you can register the RDD[CandyCrushInteraction] as a table and query
>> it in SQL.
>>
>>     interactions.registerAsTable("candy_crush_interaction")
>>
>>     // Game level by Gender
>>     sql("SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level) FROM candy_crush_interaction WHERE level > 0 GROUP BY gender").collect().foreach(println)
>>
>>     /* Returns:
>>         ["male",14727,590,1,104.71705031574659]
>>         ["female",15422,590,1,114.17202697445208]
>>         ["mostly_male",2824,590,1,97.08852691218131]
>>         ["mostly_female",1934,590,1,99.0517063081696]
>>         ["unisex",2674,590,1,113.42071802543006]
>>         [,11023,590,1,93.45677220357435]
>>      */
>>
>>
>> Full working example:
>> CandyCrushSQL.scala <https://gist.github.com/cotdp/b5b8155bb85e254d2a3c>
>>
>> MC
>>
>>
>>
>> On 22 May 2014 04:43, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> That's a good idea. So you're saying create a SchemaRDD by applying a
>>> function that deserializes the JSON and transforms it into a relational
>>> structure, right?
>>>
>>> The end goal for my team would be to expose some JDBC endpoint for
>>> analysts to query from, so once Shark is updated to use Spark SQL that
>>> would become possible without having to resort to using Hive at all.
>>>
>>>
>>> On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>
>>>> Hi,
>>>>
>>>> as far as I understand, if you create an RDD with a relational
>>>> structure from your JSON, you should be able to do much of that
>>>> already today. For example, take lift-json's deserializer and do
>>>> something like
>>>>
>>>>   val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
>>>> json.extractOpt[MyCaseClass])
>>>>
>>>> then I guess you can use Spark SQL on that. (Something like your
>>>> likes[2] query won't work, though, I guess.)
>>>>
>>>> Regards
>>>> Tobias
>>>>
>>>>
>>>> On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>> > Looking forward to that update!
>>>> >
>>>> > Given a table of JSON objects like this one:
>>>> >
>>>> > {
>>>> >    "name": "Nick",
>>>> >    "location": {
>>>> >       "x": 241.6,
>>>> >       "y": -22.5
>>>> >    },
>>>> >    "likes": ["ice cream", "dogs", "Vanilla Ice"]
>>>> > }
>>>> >
>>>> > It would be SUPER COOL if we could query that table in a way that is
>>>> > as natural as follows:
>>>> >
>>>> > SELECT DISTINCT name
>>>> > FROM json_table;
>>>> >
>>>> > SELECT MAX(location.x)
>>>> > FROM json_table;
>>>> >
>>>> > SELECT likes[2] -- Ice Ice Baby
>>>> > FROM json_table
>>>> > WHERE name = "Nick";
>>>> >
>>>> > Of course, this is just a hand-wavy suggestion of how I’d like to be
>>>> able to
>>>> > query JSON (particularly that last example) using SQL. I’m interested
>>>> in
>>>> > seeing what y’all come up with.
>>>> >
>>>> > A large part of what my team does is make it easy for analysts to
>>>> explore
>>>> > and query JSON data using SQL. We have a fairly complex home-grown
>>>> process
>>>> > to do that and are looking to replace it with something more out of
>>>> the box.
>>>> > So if you’d like more input on how users might use this feature, I’d
>>>> be glad
>>>> > to chime in.
>>>> >
>>>> > Nick
>>>> >
>>>> >
>>>> >
>>>> > On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust <
>>>> mich...@databricks.com>
>>>> > wrote:
>>>> >>
>>>> >> You can already extract fields from json data using Hive UDFs.  We
>>>> have an
>>>> >> intern working on on better native support this summer.  We will be
>>>> sure to
>>>> >> post updates once there is a working prototype.
>>>> >>
>>>> >> Michael
>>>> >>
>>>> >>
>>>> >> On Tue, May 20, 2014 at 6:46 PM, Nick Chammas <
>>>> nicholas.cham...@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> The Apache Drill home page has an interesting heading: "Liberate
>>>> Nested
>>>> >>> Data".
>>>> >>>
>>>> >>> Is there any current or planned functionality in Spark SQL or Shark
>>>> to
>>>> >>> enable SQL-like querying of complex JSON?
>>>> >>>
>>>> >>> Nick
>>>> >>>
>>>> >>>
>>>> >>> ________________________________
>>>> >>> View this message in context: Using Spark to analyze complex JSON
>>>> >>> Sent from the Apache Spark User List mailing list archive at
>>>> Nabble.com.
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
