Re: confusing about Spark SQL json format

Femi Anthony Thu, 31 Mar 2016 10:02:47 -0700

I encountered a similar problem reading multi-line JSON files into Spark a
while back, and here's an article I wrote about how to solve it:


http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/

You may find it useful.
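
For the archive, here is a minimal sketch of one way to do the conversion. It is not necessarily the approach taken in the article above. It assumes the input is a standard JSON document whose top level is an array of records (or a single object), and the helper name to_json_lines is made up for illustration:

```python
import json


def to_json_lines(multi_line_json: str) -> str:
    """Convert a JSON document holding an array of records into
    newline-delimited JSON (one compact object per line)."""
    records = json.loads(multi_line_json)
    if isinstance(records, dict):
        # A single top-level object becomes a one-line file.
        records = [records]
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical pretty-printed input, using the records from this thread.
pretty = """
[
  {"name": "Yin",
   "address": {"city": "Columbus", "state": "Ohio"}},
  {"name": "Michael",
   "address": {"city": null, "state": "California"}}
]
"""

json_lines = to_json_lines(pretty)
print(json_lines)
```

The resulting text can be written to a file and loaded with sqlContext.read.json(file_path) as in the quoted thread. Note that this sketch reads the whole document into memory, so it only suits files that fit on one machine.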

Femi

On Thu, Mar 31, 2016 at 12:32 PM, <ross.cramb...@thomsonreuters.com> wrote:

> You are correct that it does not take the standard JSON file format. From
> the Spark Docs:
> "Note that the file that is offered as *a json file* is not a typical
> JSON file. Each line must contain a separate, self-contained valid JSON
> object. As a consequence, a regular multi-line JSON file will most often
> fail."
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Mar 31, 2016, at 5:30 AM, charles li <charles.up...@gmail.com> wrote:
>
> Hi Umesh, have you tried loading that JSON file on your machine? I tried
> it before, and here are the screenshots:
>
> <Screenshot 2016-03-31 5.27.30 PM.png>
> <Screenshot 2016-03-31 5.27.39 PM.png>
>
> On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
> wrote:
>
>> Hi Charles,
>> The definition of an object from www.json.org:
>>
>> An *object* is an unordered set of name/value pairs. An object begins
>> with { (left brace) and ends with } (right brace). Each name is followed
>> by : (colon) and the name/value pairs are separated by , (comma).
>>
>> It's pretty much the OOP paradigm, isn't it?
>>
>> Regards,
>> Umesh
>>
>> On Thu, Mar 31, 2016 at 2:34 PM, charles li <charles.up...@gmail.com>
>> wrote:
>>
>>> Hi Umesh, I think you've misunderstood the JSON definition.
>>>
>>> A JSON file holds only one top-level value.
>>>
>>> Take the file people.json, shown below:
>>>
>>>
>>> --------------------------------------------------------------------------------------------
>>>
>>> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>
>>>
>>> -----------------------------------------------------------------------------------------------
>>>
>>> It has two valid formats:
>>>
>>> 1.
>>>
>>>
>>> --------------------------------------------------------------------------------------------
>>>
>>> [ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>> ]
>>>
>>>
>>> -----------------------------------------------------------------------------------------------
>>>
>>> 2.
>>>
>>>
>>> --------------------------------------------------------------------------------------------
>>>
>>> {"name": ["Yin", "Michael"],
>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>> {"city":null, "state":"California"} ]
>>> }
>>>
>>> -----------------------------------------------------------------------------------------------
>>>
>>>
>>>
>>> On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> Look at the image below, which is from json.org:
>>>>
>>>> <image.png>
>>>>
>>>> The image above describes how the objects in the JSON below are formed:
>>>>
>>>> Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>>> Object 2=> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>>
>>>>
>>>> Note that "address" is also an object.
>>>>
>>>>
>>>>
>>>> On Thu, Mar 31, 2016 at 1:53 PM, charles li <charles.up...@gmail.com>
>>>> wrote:
>>>>
>>>>> As this post says, in Spark we can load a JSON file in the way shown
>>>>> below:
>>>>>
>>>>> *post* :
>>>>> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>>>>>
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------------------------------
>>>>> sqlContext.jsonFile(file_path)
>>>>> or
>>>>> sqlContext.read.json(file_path)
>>>>>
>>>>> -----------------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> and the *JSON file format* looks like the following, say *people.json*:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------------------
>>>>> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>>>
>>>>> -----------------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> And here is my *problem*:
>>>>>
>>>>> Is that the *standard JSON format*? According to http://www.json.org/,
>>>>> I don't think so. It's just a *collection of records* [ each a dict ],
>>>>> not a valid JSON document. According to the official JSON documentation,
>>>>> the standard JSON format of people.json should be:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------------------
>>>>> {"name": ["Yin", "Michael"],
>>>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>>>> {"city":null, "state":"California"} ]
>>>>> }
>>>>>
>>>>> -----------------------------------------------------------------------------------------------
>>>>>
>>>>> So why does Spark define the JSON format as a collection of records?
>>>>> It is inconvenient: if we have a large standard JSON file, we first
>>>>> need to reformat it so that Spark can read it correctly, which is
>>>>> inefficient, time-consuming, incompatible, and space-consuming.
>>>>>
>>>>>
>>>>> Many thanks,
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *--------------------------------------*
>>>>> a spark lover, a quant, a developer and a good man.
>>>>>
>>>>> http://github.com/litaotao
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
>
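
To make the distinction in the quoted thread concrete, here is a small sketch using only Python's standard json module (not Spark itself). It checks that the two-line people.json is not a single valid JSON document, while each line on its own is a valid object, which is the layout the Spark docs quoted above describe:

```python
import json

# The two-line people.json from the quoted thread.
people = (
    '{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}\n'
    '{"name":"Michael", "address":{"city":null, "state":"California"}}\n'
)

# Parsing the whole file as one JSON document fails, because a JSON
# text holds exactly one top-level value.
try:
    json.loads(people)
    whole_file_is_valid = True
except json.JSONDecodeError:
    whole_file_is_valid = False

# Parsing line by line succeeds: every line is a self-contained object.
records = [json.loads(line) for line in people.splitlines() if line.strip()]

print(whole_file_is_valid)
print(len(records))
```

So both sides of the thread have a point: every line is a valid JSON object, but the file as a whole is not a standard JSON document.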


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
