Re: DataFrame --> JSON objects, instead of un-named array of fields
Hi,

Besides your solution, you can use:

    df.write.format('json').save('a.json')

2016-03-29 4:11 GMT+08:00 Russell Jurney:

> To answer my own question, DataFrame.toJSON() does this, so there is no
> need to map and json.dumps():
>
>     on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')
>
> Thanks!
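[Editor's note: the writer call suggested above produces JSON Lines output — one JSON object per line in each part file. A minimal stdlib sketch of reading that format back; the sample records are illustrative, not taken from the real flight dataset:]

```python
import io
import json

# df.write.format('json').save(...) emits one JSON object per line
# (JSON Lines). StringIO stands in for an output part file here.
sample = io.StringIO(
    '{"year": 2015, "month": 1, "carrier": "AA"}\n'
    '{"year": 2015, "month": 1, "carrier": "DL"}\n'
)

# Each line parses independently with the stdlib json module.
records = [json.loads(line) for line in sample]
print(records[0]["carrier"])  # AA
```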
Re: DataFrame --> JSON objects, instead of un-named array of fields
To answer my own question, DataFrame.toJSON() does this, so there is no
need to map and json.dumps():

    on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')

Thanks!

On Mon, Mar 28, 2016 at 12:54 PM, Russell Jurney wrote:

> In PySpark, given a DataFrame, I am attempting to save it as JSON
> Lines/ndjson. How can I achieve this in PySpark?

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
DataFrame --> JSON objects, instead of un-named array of fields
In PySpark, given a DataFrame, I am attempting to save it as JSON
Lines/ndjson. I run this code:

    json_lines = on_time_dataframe.map(lambda x: json.dumps(x))
    json_lines.saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl')

This results in simple arrays of fields, instead of JSON objects:

    [2015, 1, 1, 1, 4, "2015-01-01", "AA", 19805, "AA", "N787AA", 1, 12478,
    1247802, 31703, "JFK", "New York, NY", "NY", 36, "New York", 22, 12892,
    1289203, 32575, "LAX", "Los Angeles, CA", "CA", 6, "California", 91, 900,
    855, -5.0, 0.0, 0.0, -1, "0900-0959", 17.0, 912, 1230, 7.0, 1230, 1237,
    7.0, 7.0, 0.0, 0, "1200-1259", 0.0, "", 0.0, 390.0, 402.0, 378.0, 1.0,
    2475.0, 10, null, null, null, null, null, null, null, null, 0, null, null,
    null, null, "", null, null, null, null, null, null, "", "", null, null,
    null, null, null, null, "", "", null, null, null, null, null, "", "", "",
    "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]

What I actually want is JSON objects, with a field name for each field:

    {"year": "2015", "month": 1, ...}

How can I achieve this in PySpark?

Thanks!
--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
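[Editor's note: the array output above follows from how json.dumps treats a PySpark Row: Row behaves like a tuple, and the stdlib encoder serializes tuples as bare JSON arrays, while dicts serialize as objects with field names. A stdlib sketch of the difference, using a namedtuple as a stand-in for Row so no Spark is needed; Flight and its fields are made up for illustration:]

```python
import json
from collections import namedtuple

# A namedtuple stands in for a PySpark Row: both are tuple-like with
# named fields, so json.dumps sees them as plain sequences.
Flight = namedtuple("Flight", ["year", "month", "carrier"])
row = Flight(2015, 1, "AA")

as_array = json.dumps(row)             # tuple -> bare JSON array
as_object = json.dumps(row._asdict())  # dict  -> JSON object with names

print(as_array)   # [2015, 1, "AA"]
print(as_object)  # {"year": 2015, "month": 1, "carrier": "AA"}
```

[In PySpark the analogous conversion would be row.asDict(), e.g. on_time_dataframe.rdd.map(lambda row: json.dumps(row.asDict())), though the toJSON() route in the answer above achieves the same result directly.]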