Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-29 Thread 刘虓
Hi,
Besides your solution, you can use df.write.format('json').save('a.json')
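
For readers unfamiliar with the output shape: the JSON writer emits one JSON object per record, one per line (JSON Lines). A plain-Python sketch of that format, no Spark required, with hypothetical column names standing in for the real schema:

```python
import json

# Hypothetical records standing in for DataFrame rows
# (column names are illustrative, not the real schema).
records = [
    {"year": 2015, "month": 1, "carrier": "AA"},
    {"year": 2015, "month": 2, "carrier": "DL"},
]

# JSON Lines: one JSON object per line, which is the shape
# the DataFrame json writer produces per record.
lines = [json.dumps(r) for r in records]
print("\n".join(lines))
```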

2016-03-29 4:11 GMT+08:00, Russell Jurney wrote:

> To answer my own question, DataFrame.toJSON() does this, so there is no
> need to map and json.dumps():
>
>
> on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')
>
>
> Thanks!
>
> On Mon, Mar 28, 2016 at 12:54 PM, Russell Jurney wrote:
>
>> In PySpark, given a DataFrame, I am attempting to save it as JSON
>> Lines/ndjson. I run this code:
>>
>> json_lines = on_time_dataframe.map(lambda x: json.dumps(x))
>>
>> json_lines.saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')
>>
>>
>> This results in simple arrays of fields, instead of JSON objects:
>>
>> [...]
>>
>> What I actually want is JSON objects, with a field name for each field:
>>
>> {"year": "2015", "month": 1, ...}
>>
>>
>> How can I achieve this in PySpark?
>>
>> Thanks!
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>


Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
To answer my own question, DataFrame.toJSON() does this, so there is no
need to map and json.dumps():

on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')


Thanks!

On Mon, Mar 28, 2016 at 12:54 PM, Russell Jurney wrote:

> In PySpark, given a DataFrame, I am attempting to save it as JSON
> Lines/ndjson. I run this code:
>
> json_lines = on_time_dataframe.map(lambda x: json.dumps(x))
> json_lines.saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')
>
>
> This results in simple arrays of fields, instead of JSON objects:
>
> [...]
>
> What I actually want is JSON objects, with a field name for each field:
>
> {"year": "2015", "month": 1, ...}
>
>
> How can I achieve this in PySpark?
>
> Thanks!
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io


DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
In PySpark, given a DataFrame, I am attempting to save it as JSON
Lines/ndjson. I run this code:

json_lines = on_time_dataframe.map(lambda x: json.dumps(x))
json_lines.saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')


This results in simple arrays of fields, instead of JSON objects:

[2015, 1, 1, 1, 4, "2015-01-01", "AA", 19805, "AA", "N787AA", 1, 12478,
1247802, 31703, "JFK", "New York, NY", "NY", 36, "New York", 22, 12892,
1289203, 32575, "LAX", "Los Angeles, CA", "CA", 6, "California", 91, 900,
855, -5.0, 0.0, 0.0, -1, "0900-0959", 17.0, 912, 1230, 7.0, 1230, 1237,
7.0, 7.0, 0.0, 0, "1200-1259", 0.0, "", 0.0, 390.0, 402.0, 378.0, 1.0,
2475.0, 10, null, null, null, null, null, null, null, null, 0, null, null,
null, null, "", null, null, null, null, null, null, "", "", null, null,
null, null, null, null, "", "", null, null, null, null, null, "", "", "",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]

What I actually want is JSON objects, with a field name for each field:

{"year": "2015", "month": 1, ...}
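
The difference between the two outputs can be sketched without Spark: a Row behaves like a tuple of bare values, so json.dumps serializes it as an array; pairing each value with its column name first yields the keyed object. The column names below are hypothetical, for illustration only:

```python
import json

# A Row behaves like a tuple of values, so json.dumps emits a bare array.
row = (2015, 1, "AA", "JFK")
assert json.dumps(row) == '[2015, 1, "AA", "JFK"]'

# Zipping the values with their column names (hypothetical names here)
# gives the keyed JSON object instead.
columns = ["year", "month", "carrier", "origin"]
obj = json.dumps(dict(zip(columns, row)))
print(obj)
```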


How can I achieve this in PySpark?

Thanks!
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io