Adding a new data type is an enormous undertaking and very invasive. I don't think it is worth it in this case given there are clear, simple workarounds.
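One such workaround, new in Spark 2.1, is from_json. Here is a minimal sketch; the data frame "events", the string column "payload", and the schema are hypothetical stand-ins:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the JSON payloads; adjust to your data.
schema = StructType([
    StructField("lo", LongType()),
    StructField("hi", LongType()),
    StructField("response", StringType()),
])

# from_json parses the string column into a struct; rows that don't
# match the schema yield null instead of failing the query.
parsed = events.withColumn("parsed", from_json(col("payload"), schema))
parsed.select("parsed.lo", "parsed.hi", "parsed.response").show()

to_json goes the other way, serializing a struct column back into a JSON string.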
On Thu, Nov 17, 2016 at 12:24 PM, kant kodali <kanth...@gmail.com> wrote:

> Can we have a JSONType for Spark SQL?
>
> On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande <nathanla...@gmail.com> wrote:
>
>> If you are dealing with a bunch of different schemas in one field,
>> figuring out a strategy to deal with that will depend on your data and
>> does not really have anything to do with Spark, since mapping your JSON
>> payloads to tractable data structures will depend on business logic.
>>
>> The strategy of pulling the blob out into its own RDD and feeding it
>> into the JSON loader should work for any data source once you have your
>> data strategy figured out.
>>
>> On Wed, Nov 16, 2016 at 4:39 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> 1. I have a Cassandra table where one of the columns is a blob, and
>>> this blob contains a JSON-encoded string. However, not all the blobs in
>>> that column across the Cassandra table are the same (some blobs hold
>>> different JSON than others). What is the best way to approach that? Do
>>> we need to group all the JSON blobs that share the same structure (same
>>> keys) into their own individual data frames? For example, if I have 5
>>> JSON blobs with one structure and another 3 JSON blobs with a different
>>> structure, do I need to create two data frames? (Attached is a
>>> screenshot of 2 rows showing what my JSON looks like.)
>>> 2. In my case, toJSON on the RDD doesn't seem to help much. Attached is
>>> a screenshot; it looks like I got back the same data frame as my
>>> original one.
>>>
>>> Thanks much for these examples.
>>>
>>> On Wed, Nov 16, 2016 at 2:54 PM, Nathan Lande <nathanla...@gmail.com> wrote:
>>>
>>>> I'm looking forward to 2.1, but in the meantime you can pull the
>>>> specific column out into an RDD of JSON objects, pass this RDD into
>>>> read.json(), and then join the results back onto your initial DF.
>>>>
>>>> Here is an example of what we do to unpack headers from Avro log data:
>>>>
>>>> from pyspark.sql.functions import concat, lit, col, decode
>>>>
>>>> def jsonLoad(path):
>>>>     # load the raw Avro data into a data frame
>>>>     raw = (sqlContext.read
>>>>            .format('com.databricks.spark.avro')
>>>>            .load(path))
>>>>
>>>>     # build a JSON blob as a string column, adding the primary key
>>>>     # elements (hi and lo) so the result can be joined back later
>>>>     JSONBlob = concat(
>>>>         lit('{'),
>>>>         concat(lit('"lo":'), col('header.eventId.lo').cast('string'), lit(',')),
>>>>         concat(lit('"hi":'), col('header.eventId.hi').cast('string'), lit(',')),
>>>>         concat(lit('"response":'), decode('requestResponse.response', 'UTF-8')),
>>>>         lit('}'))
>>>>
>>>>     # extract the JSON blob as an RDD of strings
>>>>     rawJSONString = raw.select(JSONBlob).rdd.map(lambda x: str(x[0]))
>>>>
>>>>     # transform the JSON strings into a structured data frame
>>>>     structuredJSON = sqlContext.read.json(rawJSONString)
>>>>
>>>>     # join the structured JSON back onto the initial DF using the
>>>>     # hi and lo join keys, then drop the keys
>>>>     final = (raw.join(structuredJSON,
>>>>                       ((raw['header.eventId.lo'] == structuredJSON['lo']) &
>>>>                        (raw['header.eventId.hi'] == structuredJSON['hi'])),
>>>>                       'left_outer')
>>>>              .drop('hi')
>>>>              .drop('lo'))
>>>>
>>>>     # win
>>>>     return final
>>>>
>>>> On Wed, Nov 16, 2016 at 10:50 AM, Michael Armbrust
>>>> <mich...@databricks.com> wrote:
>>>>
>>>>> On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Maybe it sounds like you are looking for the from_json/to_json
>>>>>> functions, after en/decoding properly.
>>>>>
>>>>> Which are new built-in functions that will be released with Spark 2.1.
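For the Cassandra case in the thread, a minimal sketch of Nathan's
pull-out-the-blob approach; the keyspace, table, and "payload" column
names are hypothetical, and it assumes the spark-cassandra-connector is
on the classpath:

from pyspark.sql.functions import decode

# Hypothetical keyspace/table/column names; assumes the
# spark-cassandra-connector package is available.
raw = (sqlContext.read
       .format('org.apache.spark.sql.cassandra')
       .options(keyspace='ks', table='events')
       .load())

# Decode the blob column to text and pull it out as an RDD of JSON strings.
jsonStrings = raw.select(decode('payload', 'UTF-8')).rdd.map(lambda row: row[0])

# read.json infers one merged schema across the differing structures;
# keys missing from a given blob come back as null, so the 5-blob and
# 3-blob shapes in the question can live in a single data frame.
structured = sqlContext.read.json(jsonStrings)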