I encountered a similar problem reading multi-line JSON files into Spark a while back, and here's an article I wrote about how to solve it:
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/

You may find it useful.

Femi

On Thu, Mar 31, 2016 at 12:32 PM, <ross.cramb...@thomsonreuters.com> wrote:

> You are correct that it does not take the standard JSON file format. From
> the Spark Docs:
> "Note that the file that is offered as *a json file* is not a typical
> JSON file. Each line must contain a separate, self-contained valid JSON
> object. As a consequence, a regular multi-line JSON file will most often
> fail."
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Mar 31, 2016, at 5:30 AM, charles li <charles.up...@gmail.com> wrote:
>
> hi, UMESH, have you tried to load that json file on your machine? I did
> try it before, and here is the screenshot:
>
> <Screenshot 2016-03-31 5.27.30 PM.png>
> <Screenshot 2016-03-31 5.27.39 PM.png>
>
> On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
> wrote:
>
>> Hi Charles,
>> The definition of object from www.json.org:
>>
>> An *object* is an unordered set of name/value pairs. An object begins
>> with { (left brace) and ends with } (right brace). Each name is followed
>> by : (colon) and the name/value pairs are separated by , (comma).
>>
>> It's pretty much the OOP paradigm, isn't it?
>>
>> Regards,
>> Umesh
>>
>> On Thu, Mar 31, 2016 at 2:34 PM, charles li <charles.up...@gmail.com>
>> wrote:
>>
>>> hi, UMESH, I think you've misunderstood the json definition.
>>>
>>> there is only one object in a json file:
>>>
>>> for the file, people.json, as below:
>>>
>>> ------------------------------------------------------------
>>> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>> ------------------------------------------------------------
>>>
>>> it does have two valid formats:
>>>
>>> 1.
>>>
>>> ------------------------------------------------------------
>>> [ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>> ]
>>> ------------------------------------------------------------
>>>
>>> 2.
>>>
>>> ------------------------------------------------------------
>>> {"name": ["Yin", "Michael"],
>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>> {"city":null, "state":"California"} ]
>>> }
>>> ------------------------------------------------------------
>>>
>>> On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> Look at the image below, which is from json.org:
>>>>
>>>> <image.png>
>>>>
>>>> The above image describes how the objects in the JSON below are formed:
>>>>
>>>> Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>>> Object 2=> {"name":"Michael", "address":{"city":null,
>>>> "state":"California"}}
>>>>
>>>> Note that "address" is also an object.
>>>>
>>>> On Thu, Mar 31, 2016 at 1:53 PM, charles li <charles.up...@gmail.com>
>>>> wrote:
>>>>
>>>>> as this post says, in spark we can load a json file in the way shown
>>>>> below:
>>>>>
>>>>> *post*:
>>>>> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>>>>>
>>>>> ------------------------------------------------------------
>>>>> sqlContext.jsonFile(file_path)
>>>>> or
>>>>> sqlContext.read.json(file_path)
>>>>> ------------------------------------------------------------
>>>>>
>>>>> and the *json file format* looks like below, say *people.json*
>>>>>
>>>>> ------------------------------------------------------------
>>>>> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>>> ------------------------------------------------------------
>>>>>
>>>>> and here comes my *problem*:
>>>>>
>>>>> Is that the *standard json format*? according to http://www.json.org/,
>>>>> I don't think so. it's just a *collection of records* [ a dict ],
>>>>> not a valid json format.
>>>>> according to the official json doc, the standard json format
>>>>> of people.json should be:
>>>>>
>>>>> ------------------------------------------------------------
>>>>> {"name": ["Yin", "Michael"],
>>>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>>>> {"city":null, "state":"California"} ]
>>>>> }
>>>>> ------------------------------------------------------------
>>>>>
>>>>> So, why do we define the json format as a collection of records in
>>>>> spark? I mean, it leads to some inconvenience: if we have a large
>>>>> standard json file, we first need to reformat it to make it correctly
>>>>> readable in spark, which is inefficient, time-consuming, incompatible
>>>>> and space-consuming.
>>>>>
>>>>> great thanks,
>>>>>
>>>>> --
>>>>> *--------------------------------------*
>>>>> a spark lover, a quant, a developer and a good man.
>>>>>
>>>>> http://github.com/litaotao

--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.
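
P.S. For anyone landing on this thread later: a minimal sketch of one way to turn a regular multi-line JSON array into the one-object-per-line layout that the Spark docs quoted in this thread describe. This is not taken from the article linked at the top. It uses only Python's standard json module, the helper name to_json_lines is made up for illustration, and it assumes the whole file fits in memory.

```python
import json


def to_json_lines(multiline_json_text):
    """Convert a JSON document holding an array of records (or a single
    record) into one compact JSON object per line."""
    data = json.loads(multiline_json_text)
    # A lone top-level object is treated as a single record.
    records = data if isinstance(data, list) else [data]
    # json.dumps without an indent argument emits no raw newlines,
    # so each record stays on exactly one line.
    return "\n".join(json.dumps(record) for record in records)


# The array form of people.json from the thread, spread over several lines.
people = """
[ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
  {"name":"Michael", "address":{"city":null, "state":"California"}}
]
"""

json_lines = to_json_lines(people)
print(json_lines)
```

Written out to a file, the result should have the shape that sqlContext.read.json(file_path) from the original post expects, with each line being a separate, self-contained JSON object.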