Re: confusing about Spark SQL json format

2016-03-31 Thread Femi Anthony
I encountered a similar problem reading multi-line JSON files into Spark a
while back, and here's an article I wrote about how to solve it:

http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/

You may find it useful.
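The short version of the workaround: read each file whole with wholeTextFiles, parse it
yourself, then hand Spark one JSON string per record. A minimal sketch, assuming an existing
SparkContext sc and SQLContext sqlContext and a hypothetical file name (the article may
differ in the details):

---
# Minimal sketch of the multi-line workaround (assumes sc and sqlContext
# already exist; "people_multiline.json" is a hypothetical path).
import json

whole_files = sc.wholeTextFiles("people_multiline.json")  # (path, entire file contents)
docs = whole_files.map(lambda kv: json.loads(kv[1]))       # parse each file as one JSON value

# Flatten a top-level JSON array into individual records
records = docs.flatMap(lambda d: d if isinstance(d, list) else [d])

# Hand Spark one JSON string per record so it can infer the schema
df = sqlContext.read.json(records.map(json.dumps))  # or sqlContext.jsonRDD(...) on older 1.x
df.printSchema()
---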

Femi

On Thu, Mar 31, 2016 at 12:32 PM,  wrote:

> You are correct that it does not take the standard JSON file format. From
> the Spark Docs:
> "Note that the file that is offered as *a json file* is not a typical
> JSON file. Each line must contain a separate, self-contained valid JSON
> object. As a consequence, a regular multi-line JSON file will most often
> fail."
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Mar 31, 2016, at 5:30 AM, charles li  wrote:
>
> Hi UMESH, have you tried to load that JSON file on your machine? I did
> try it before, and here are the screenshots:
>
> <Screenshot 2016-03-31 at 5.27.30 PM.png>
> <Screenshot 2016-03-31 at 5.27.39 PM.png>
>
>
>
>
> On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY 
> wrote:
>
>> Hi Charles,
>> The definition of an object from www.json.org:
>>
>> An *object* is an unordered set of name/value pairs. An object begins
>> with { (left brace) and ends with } (right brace). Each name is followed
>> by : (colon) and the name/value pairs are separated by , (comma).
>>
>> It's pretty much an OOP paradigm, isn't it?
>>
>> Regards,
>> Umesh
>>
>> On Thu, Mar 31, 2016 at 2:34 PM, charles li 
>> wrote:
>>
>>> Hi UMESH, I think you've misunderstood the JSON definition.
>>>
>>> There is only one top-level object in a valid JSON file.
>>>
>>>
>>> For the file people.json, as below:
>>>
>>>
>>> 
>>>
>>> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>
>>>
>>> ---
>>>
>>> It has two valid formats:
>>>
>>> 1.
>>>
>>>
>>> 
>>>
>>> [ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>> ]
>>>
>>>
>>> ---
>>>
>>> 2.
>>>
>>>
>>> 
>>>
>>> {"name": ["Yin", "Michael"],
>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>> {"city":null, "state":"California"} ]
>>> }
>>>
>>> ---
>>>
>>>
>>>
>>> On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
>>> wrote:
>>>
 Hi,
 Look at the image below, which is from json.org:

 

 The image above describes the object structure of the JSON below:

 Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
 Object 2=> {"name":"Michael", "address":{"city":null,
 "state":"California"}}


 Note that "address" is also an object.



 On Thu, Mar 31, 2016 at 1:53 PM, charles li 
 wrote:

> As this post says, in Spark we can load a JSON file in the way shown
> below:
>
> *post* :
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
> 
>
>
>
> ---
> sqlContext.jsonFile(file_path)
> or
> sqlContext.read.json(file_path)
>
> ---
>
>
> and the *json file format* looks like the one below, say *people.json*:
>
>
> {"name":"Yin",
> 

Re: confusing about Spark SQL json format

2016-03-31 Thread Ross.Cramblit
You are correct that it does not take the standard JSON file format. From the 
Spark Docs:
"Note that the file that is offered as a json file is not a typical JSON file. 
Each line must contain a separate, self-contained valid JSON object. As a 
consequence, a regular multi-line JSON file will most often fail."

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
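To see the difference in practice, a quick sketch assuming a Spark 1.x sqlContext (the file
names below are just placeholders, and the behaviour noted in the comments is the usual
outcome rather than a guarantee):

---
# people.json: one self-contained JSON object per line -> loads cleanly
good = sqlContext.read.json("people.json")
good.printSchema()   # name plus a nested address struct

# people_pretty.json: the same data pretty-printed across several lines ->
# usually comes back as a single _corrupt_record column instead of real fields
bad = sqlContext.read.json("people_pretty.json")
bad.printSchema()
---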

On Mar 31, 2016, at 5:30 AM, charles li wrote:

Hi UMESH, have you tried to load that JSON file on your machine? I did try it
before, and here are the screenshots:

<Screenshot 2016-03-31 at 5.27.30 PM.png>
<Screenshot 2016-03-31 at 5.27.39 PM.png>




On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY wrote:
Hi Charles,
The definition of an object from www.json.org:

An object is an unordered set of name/value pairs. An object begins with { 
(left brace) and ends with } (right brace). Each name is followed by : (colon) 
and the name/value pairs are separated by , (comma).

It's pretty much an OOP paradigm, isn't it?

Regards,
Umesh

On Thu, Mar 31, 2016 at 2:34 PM, charles li wrote:
Hi UMESH, I think you've misunderstood the JSON definition.

There is only one top-level object in a valid JSON file.


For the file people.json, as below:



{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}

---

It has two valid formats:

1.



[ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
{"name":"Michael", "address":{"city":null, "state":"California"}}
]

---

2.



{"name": ["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---



On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY wrote:
Hi,
Look at the image below, which is from json.org:



The image above describes the object structure of the JSON below:

Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
Object 2=> {"name":"Michael", "address":{"city":null, "state":"California"}}


Note that "address" is also an object.



On Thu, Mar 31, 2016 at 1:53 PM, charles li wrote:
As this post says, in Spark we can load a JSON file in the way shown below:

post : 
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html


---
sqlContext.jsonFile(file_path)
or
sqlContext.read.json(file_path)
---


and the json file format looks like the one below, say people.json:

{"name":"Yin",
 "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
---


and here comes my problem:

Is that the standard JSON format? According to
http://www.json.org/
 , I don't think so. It's just a collection of records [ each a dict ], not a valid
JSON format. As the JSON

Re: confusing about Spark SQL json format

2016-03-31 Thread UMESH CHAUDHARY
Hi Charles,
The definition of an object from www.json.org:

An *object* is an unordered set of name/value pairs. An object begins with {
 (left brace) and ends with } (right brace). Each name is followed by :
(colon) and the name/value pairs are separated by , (comma).

It's pretty much an OOP paradigm, isn't it?

Regards,
Umesh

On Thu, Mar 31, 2016 at 2:34 PM, charles li  wrote:

> Hi UMESH, I think you've misunderstood the JSON definition.
>
> There is only one top-level object in a valid JSON file.
>
>
> For the file people.json, as below:
>
>
> 
>
> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
>
> ---
>
> It has two valid formats:
>
> 1.
>
>
> 
>
> [ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
> {"name":"Michael", "address":{"city":null, "state":"California"}}
> ]
>
>
> ---
>
> 2.
>
>
> 
>
> {"name": ["Yin", "Michael"],
> "address":[ {"city":"Columbus","state":"Ohio"},
> {"city":null, "state":"California"} ]
> }
>
> ---
>
>
>
> On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
> wrote:
>
>> Hi,
>> Look at the image below, which is from json.org:
>>
>> [image: Inline image 1]
>>
>> The image above describes the object structure of the JSON below:
>>
>> Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>> Object 2=> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>
>>
>> Note that "address" is also an object.
>>
>>
>>
>> On Thu, Mar 31, 2016 at 1:53 PM, charles li 
>> wrote:
>>
>>> As this post says, in Spark we can load a JSON file in the way shown
>>> below:
>>>
>>> *post* :
>>> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>>>
>>>
>>>
>>> ---
>>> sqlContext.jsonFile(file_path)
>>> or
>>> sqlContext.read.json(file_path)
>>>
>>> ---
>>>
>>>
>>> and the *json file format* looks like the one below, say *people.json*:
>>>
>>>
>>> {"name":"Yin",
>>> "address":{"city":"Columbus","state":"Ohio"}}
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>
>>> ---
>>>
>>>
>>> and here comes my *problem*:
>>>
>>> Is that the *standard JSON format*? According to http://www.json.org/ ,
>>> I don't think so. It's just a *collection of records* [ each a dict ], not a
>>> valid JSON document. Per the official JSON doc, the standard form of
>>> people.json should be:
>>>
>>>
>>> {"name":
>>> ["Yin", "Michael"],
>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>> {"city":null, "state":"California"} ]
>>> }
>>>
>>> ---
>>>
>>> So why does Spark define the JSON format as a collection of records? It
>>> leads to some inconvenience: if we have a large standard JSON file, we
>>> first need to reformat it to make it readable in Spark, which is
>>> inefficient, time-consuming, incompatible and space-consuming.
>>>
>>>
>>> great thanks,
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> *--*
>>> a spark lover, a quant, a developer and a good man.
>>>
>>> http://github.com/litaotao
>>>
>>
>>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: confusing about Spark SQL json format

2016-03-31 Thread charles li
Hi UMESH, I think you've misunderstood the JSON definition.

There is only one top-level object in a valid JSON file.


For the file people.json, as below:



{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}

---

It has two valid formats:

1.



[ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
{"name":"Michael", "address":{"city":null, "state":"California"}}
]

---

2.



{"name": ["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---
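
Both layouts above are valid JSON, but they parse into quite different shapes, which matters
for how Spark would have to turn them into rows. A quick check with just the standard
library (plain Python, no Spark involved):

---
import json

# Format 1: a JSON array -> a list of record dicts, one per person
fmt1 = json.loads('[{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}},'
                  ' {"name":"Michael","address":{"city":null,"state":"California"}}]')
print(len(fmt1))       # 2 records

# Format 2: one object of parallel arrays -> a single dict with list values
fmt2 = json.loads('{"name":["Yin","Michael"],'
                  ' "address":[{"city":"Columbus","state":"Ohio"},'
                  ' {"city":null,"state":"California"}]}')
print(sorted(fmt2))    # ['address', 'name']

# Spark's reader expects one record per line instead, so format 1 only becomes
# rows after flattening, and format 2 would likely load as one wide row of arrays.
---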



On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
wrote:

> Hi,
> Look at the image below, which is from json.org:
>
> [image: Inline image 1]
>
> The image above describes the object structure of the JSON below:
>
> Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
> Object 2=> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
>
> Note that "address" is also an object.
>
>
>
> On Thu, Mar 31, 2016 at 1:53 PM, charles li 
> wrote:
>
>> As this post says, in Spark we can load a JSON file in the way shown
>> below:
>>
>> *post* :
>> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>>
>>
>>
>> ---
>> sqlContext.jsonFile(file_path)
>> or
>> sqlContext.read.json(file_path)
>>
>> ---
>>
>>
>> and the *json file format* looks like the one below, say *people.json*:
>>
>>
>> {"name":"Yin",
>> "address":{"city":"Columbus","state":"Ohio"}}
>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>
>> ---
>>
>>
>> and here comes my *problem*:
>>
>> Is that the *standard JSON format*? According to http://www.json.org/ ,
>> I don't think so. It's just a *collection of records* [ each a dict ], not a
>> valid JSON document. Per the official JSON doc, the standard form of
>> people.json should be:
>>
>>
>> {"name":
>> ["Yin", "Michael"],
>> "address":[ {"city":"Columbus","state":"Ohio"},
>> {"city":null, "state":"California"} ]
>> }
>>
>> ---
>>
>> So why does Spark define the JSON format as a collection of records? It
>> leads to some inconvenience: if we have a large standard JSON file, we
>> first need to reformat it to make it readable in Spark, which is
>> inefficient, time-consuming, incompatible and space-consuming.
>>
>>
>> great thanks,
>>
>>
>>
>>
>>
>>
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: confusing about Spark SQL json format

2016-03-31 Thread Hechem El Jed
Hello,

Actually, I ran into the same problem when I was implementing a decision
tree algorithm with Spark and parsing its output into a comprehensible JSON
format.

So, as you said, the standard JSON format is:
[{
"name": "Yin",
"address": {
"city": "Columbus",
"state": "Ohio"
}
}, {
"name": "Michael",
"address": {
"city": null,
"state": "California"
}
}]

However, I had to treat it as a list and index into it, e.g. data[0], to get:

{
"name": "Yin",
"address": {
"city": "Columbus",
"state": "Ohio"
}
}

and then use it for my visualizations.
Spark is still a bit tricky when dealing with input/output formats, so I guess
the solution for now is to create your own parser.
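
For what it's worth, such a parser can stay very small. A sketch using only the standard
library (the file name is just a placeholder, and the commented Spark line assumes sc and
sqlContext exist):

---
import json

# The whole file is one standard JSON document (here, an array of records)
with open("tree_output.json") as f:
    data = json.load(f)

first = data[0]                          # treat it as a list, e.g. data[0]
print(first["name"], first["address"])

# If the records need to go back into Spark, one JSON string per record works:
# df = sqlContext.read.json(sc.parallelize([json.dumps(r) for r in data]))
---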


Cheers,

*Hechem El Jed*
Software Engineer & Business Analyst
MY +601131094294
TN +216 24 937 021
[image: View my profile on LinkedIn]


Our environment is fragile, please do not print this email unless necessary.

On Thu, Mar 31, 2016 at 4:23 PM, charles li  wrote:

> As this post says, in Spark we can load a JSON file in the way shown
> below:
>
> *post* :
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>
>
>
> ---
> sqlContext.jsonFile(file_path)
> or
> sqlContext.read.json(file_path)
>
> ---
>
>
> and the *json file format* looks like the one below, say *people.json*:
>
>
> {"name":"Yin",
> "address":{"city":"Columbus","state":"Ohio"}}
> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
> ---
>
>
> and here comes my *problem*:
>
> Is that the *standard JSON format*? According to http://www.json.org/ , I
> don't think so. It's just a *collection of records* [ each a dict ], not a
> valid JSON document. Per the official JSON doc, the standard form of
> people.json should be:
>
>
> {"name":
> ["Yin", "Michael"],
> "address":[ {"city":"Columbus","state":"Ohio"},
> {"city":null, "state":"California"} ]
> }
>
> ---
>
> So why does Spark define the JSON format as a collection of records? It
> leads to some inconvenience: if we have a large standard JSON file, we
> first need to reformat it to make it readable in Spark, which is
> inefficient, time-consuming, incompatible and space-consuming.
>
>
> great thanks,
>
>
>
>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: confusing about Spark SQL json format

2016-03-31 Thread UMESH CHAUDHARY
Hi,
Look at the image below, which is from json.org:

[image: Inline image 1]

The image above describes the object structure of the JSON below:

Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
Object 2=> {"name":"Michael", "address":{"city":null, "state":"California"}}


Note that "address" is also an object.



On Thu, Mar 31, 2016 at 1:53 PM, charles li  wrote:

> As this post says, in Spark we can load a JSON file in the way shown
> below:
>
> *post* :
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>
>
>
> ---
> sqlContext.jsonFile(file_path)
> or
> sqlContext.read.json(file_path)
>
> ---
>
>
> and the *json file format* looks like the one below, say *people.json*:
>
>
> {"name":"Yin",
> "address":{"city":"Columbus","state":"Ohio"}}
> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
> ---
>
>
> and here comes my *problem*:
>
> Is that the *standard JSON format*? According to http://www.json.org/ , I
> don't think so. It's just a *collection of records* [ each a dict ], not a
> valid JSON document. Per the official JSON doc, the standard form of
> people.json should be:
>
>
> {"name":
> ["Yin", "Michael"],
> "address":[ {"city":"Columbus","state":"Ohio"},
> {"city":null, "state":"California"} ]
> }
>
> ---
>
> So why does Spark define the JSON format as a collection of records? It
> leads to some inconvenience: if we have a large standard JSON file, we
> first need to reformat it to make it readable in Spark, which is
> inefficient, time-consuming, incompatible and space-consuming.
>
>
> great thanks,
>
>
>
>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


confusing about Spark SQL json format

2016-03-31 Thread charles li
As this post says, in Spark we can load a JSON file in the way shown
below:

*post* :
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html


---
sqlContext.jsonFile(file_path)
or
sqlContext.read.json(file_path)
---


and the *json file format* looks like the one below, say *people.json*:

{"name":"Yin",
"address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
---


and here comes my *problem*:

Is that the *standard JSON format*? According to http://www.json.org/ , I
don't think so. It's just a *collection of records* [ each a dict ], not a valid
JSON document. Per the official JSON doc, the standard form of
people.json should be:

{"name":
["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---

So why does Spark define the JSON format as a collection of records? It
leads to some inconvenience: if we have a large standard JSON file, we
first need to reformat it to make it readable in Spark, which is
inefficient, time-consuming, incompatible and space-consuming.
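
To make that reformatting step concrete, it would look roughly like this (plain Python
sketch; file names are hypothetical):

---
# Convert a standard JSON array file into the one-record-per-line layout Spark
# expects.
import json

with open("people_standard.json") as src, open("people_lines.json", "w") as dst:
    for record in json.load(src):             # the whole file is one JSON array
        dst.write(json.dumps(record) + "\n")  # one self-contained object per line

# people_lines.json can then be read with sqlContext.read.json("people_lines.json")
---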


great thanks,






-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao