Here is an overview of how to work with complex JSON in Spark: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html (works in streaming and batch)
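As a plain-Python illustration of the "own the schema, keep an 'extras' field" approach discussed in this thread (useful on pre-2.1 versions where `from_json` is unavailable) — note the key set and the `parse_event` name are illustrative assumptions, not part of any Spark API; in practice the same function could back a UDF:

```python
import json

# Keys we commit to owning as real columns; everything else is
# preserved in an "extras" column for later analysis.
KNOWN_KEYS = {"expdCnt", "mfgAcctNum", "oUUID", "pgmUUID"}

def parse_event(raw):
    """Parse one JSON object: known keys become columns, and any
    leftover keys are stashed as a JSON string in 'extras'."""
    data = json.loads(raw)
    row = {k: data.pop(k, None) for k in KNOWN_KEYS}
    row["extras"] = json.dumps(data) if data else None
    return row

row = parse_event('{"mfgAcctNum": "531093", "expdCnt": "15", "surprise": true}')
```

The point of the extras column is that nothing is silently dropped when the incoming JSON drifts from the schema you own.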
On Tue, Jul 18, 2017 at 10:29 AM, Riccardo Ferrari <ferra...@gmail.com> wrote:

> What's against:
>
> df.rdd.map(...)
>
> or
>
> dataset.foreach()
>
> https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit
>
> Best,
>
> On Tue, Jul 18, 2017 at 6:46 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>
>> I've been wondering about this for a while.
>>
>> We wanted to do something similar for generically saving thousands of
>> individual homogeneous events into well-formed Parquet.
>>
>> Ultimately I couldn't find something I wanted to own and pushed back on
>> the requirements.
>>
>> It seems the canonical answer is that you need to 'own' the schema of the
>> JSON and parse it out manually into your dataframe. There's nothing
>> challenging about it, just verbose code. If your 'info' column has a
>> consistent schema then you'll be fine. For us it was 12 wildly diverging
>> schemas, and I didn't want to own the transforms.
>>
>> I also recommend persisting anything that isn't part of your schema in an
>> 'extras' field, so when you parse out your JSON, anything left over can be
>> dropped in there for later analysis.
>>
>> I can provide some sample code, but I think it's pretty straightforward /
>> you can google it.
>>
>> What you can't seem to do efficiently is dynamically generate a dataframe
>> from arbitrary JSON.
>>
>> On 18 July 2017 at 01:57, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>
>>> Tried the implicits - didn't work!
>>>
>>> from_json isn't supported in Spark 2.0.1; any alternate solution would
>>> be welcome, please.
>>>
>>> On Tue, Jul 18, 2017 at 12:18 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
>>>
>>>> You need to have the Spark implicits in scope.
>>>>
>>>> Richard Xin <richardxin...@yahoo.com.invalid> wrote on Tue., 18 July 2017 at 08:45:
>>>>
>>>>> I believe you could use JOLT (bazaarvoice/jolt
>>>>> <https://github.com/bazaarvoice/jolt>, a JSON-to-JSON transformation
>>>>> library written in Java) to flatten it to a JSON string and then go to
>>>>> a dataframe or dataset.
>>>>>
>>>>> On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>>>>
>>>>> Explode is not working in this scenario; it fails with an error that a
>>>>> string column cannot be used in explode, which expects an array or map.
>>>>>
>>>>> On Tue, Jul 18, 2017 at 11:39 AM, 刘虓 <ipf...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> have you tried to use explode?
>>>>>
>>>>> Chetan Khatri <chetan.opensou...@gmail.com> wrote on Tue., 18 July 2017 at 2:06 PM:
>>>>>
>>>>> Hello Spark Devs,
>>>>>
>>>>> Can you please guide me on how to flatten JSON to multiple columns in
>>>>> Spark?
>>>>>
>>>>> *Example:*
>>>>>
>>>>> Sr No | Title           | ISBN       | Info
>>>>> 1     | Calculus Theory | 1234567890 | (JSON below)
>>>>>
>>>>> [{"cert":[{
>>>>>     "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
>>>>>     "certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
>>>>>     "effDt":"2016-05-06T15:04:56.279Z",
>>>>>     "fileFmt":"rjrCsv",
>>>>>     "status":"live"}],
>>>>>   "expdCnt":"15",
>>>>>   "mfgAcctNum":"531093",
>>>>>   "oUUID":"23d07397-4fbe-4897-8a18-b79c9f64726c",
>>>>>   "pgmRole":["RETAILER"],
>>>>>   "pgmUUID":"1cb5dd63-817a-45bc-a15c-5660e4accd63",
>>>>>   "regUUID":"cc1bd898-657d-40dc-af5d-4bf1569a1cc4",
>>>>>   "rtlrsSbmtd":["009415da-c8cd-418d-869e-0a19601d79fa"]}]
>>>>>
>>>>> I want to get a single row with 11 columns.
>>>>>
>>>>> Thanks.
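For reference, the failure mode above is that the Info cell arrives as a JSON *string*, and `explode` only accepts array or map columns. A minimal plain-Python sketch of the manual flattening (field names are taken from the sample above; `flatten_info` is an illustrative name, not a Spark API — the same logic could be wrapped in a UDF):

```python
import json

# The "Info" cell from the sample: a JSON string holding an array with
# one object, whose "cert" key holds a further array of objects.
info = '''[{"cert":[{"authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
    "certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
    "effDt":"2016-05-06T15:04:56.279Z",
    "fileFmt":"rjrCsv","status":"live"}],
  "expdCnt":"15","mfgAcctNum":"531093",
  "oUUID":"23d07397-4fbe-4897-8a18-b79c9f64726c",
  "pgmRole":["RETAILER"],
  "pgmUUID":"1cb5dd63-817a-45bc-a15c-5660e4accd63",
  "regUUID":"cc1bd898-657d-40dc-af5d-4bf1569a1cc4",
  "rtlrsSbmtd":["009415da-c8cd-418d-869e-0a19601d79fa"]}]'''

def flatten_info(raw):
    """Parse the JSON string; emit one flat dict per (entry, cert) pair,
    merging the nested cert fields into the top-level fields."""
    rows = []
    for entry in json.loads(raw):           # outer array
        certs = entry.pop("cert", [{}])     # inner array of cert objects
        for cert in certs:                  # one output row per cert
            rows.append({**entry, **cert})
    return rows

rows = flatten_info(info)
```

With this sample it yields a single row whose keys are the five cert fields plus the seven remaining top-level fields.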