ARGH!! Looks like a formatting issue. Spark doesn’t like ‘pretty’ output.
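By default `spark.read.json` treats its input as JSON Lines: one complete JSON record per line of the file, which is why a pretty-printed record comes back as `_corrupt_record`. A minimal sketch of a workaround (plain Scala, no Spark required; naive in that it assumes no string value contains an embedded newline): collapse the pretty-printed file onto a single line before handing it to Spark.

```scala
import scala.io.Source
import java.io.{File, PrintWriter}

// Naive compaction: strip line breaks and indentation from a pretty-printed
// JSON file so the whole record sits on one line, as spark.read.json expects.
// Assumes no string values contain embedded newlines.
def compactJson(inPath: String, outPath: String): Unit = {
  val src = Source.fromFile(inPath)
  val oneLine =
    try src.getLines().map(_.trim).mkString("")
    finally src.close()
  val out = new PrintWriter(new File(outPath))
  try out.println(oneLine)
  finally out.close()
}
```

Alternatively, on Spark 2.0 you can avoid rewriting the file by reading it as a single RDD element, e.g. `spark.read.json(spark.sparkContext.wholeTextFiles(path).map(_._2))`; later releases (2.2+) added a `multiLine` option to the JSON reader for exactly this case.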
So then the entire record which defines the schema has to be a single line? Really?

On Nov 2, 2016, at 1:50 PM, Michael Segel <msegel_had...@hotmail.com> wrote:

This may be a silly mistake on my part…

Doing an example using Chicago's crime data. (There's a lot of it going around. ;-) The goal is to read a file containing a JSON record that describes the crime-data CSV for ingestion into a data frame, then output to a Parquet file. (Pretty simple, right?)

I ran this both in Zeppelin and in the spark-shell (2.0.1):

// Set up the environment
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

// Load the JSON from file:
val df = spark.read.json("~/datasets/Chicago_Crimes.json")
df.show()

The output:

df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|      "metadata": {|
|  "source": "CSV_...|
|  "table": "Chica...|
|  "compression": ...|
|                  },|
|      "columns": [{|
|  "col_name": "Id",|
|  "data_type": "I...|
|               }, {|
|  "col_name": "Ca...|
|  "data_type": "B...|
|               }, {|

I checked the JSON file against a JSONLint tool (two, actually); my JSON record is valid with no errors (see below). So what's happening? What am I missing?

The goal is to create an ingestion schema for each source. From this I can build the schema for the Parquet file or other data target.
Thx,
-Mike

My JSON record:

{
  "metadata": {
    "source": "CSV_FILE",
    "table": "Chicago_Crime",
    "compression": "SNAPPY"
  },
  "columns": [
    { "col_name": "Id", "data_type": "INT64" },
    { "col_name": "Case_No.", "data_type": "BYTE_ARRAY" },
    { "col_name": "Date", "data_type": "BYTE_ARRAY" },
    { "col_name": "Block", "data_type": "BYTE_ARRAY" },
    { "col_name": "IUCR", "data_type": "INT32" },
    { "col_name": "Primary_Type", "data_type": "BYTE_ARRAY" },
    { "col_name": "Description", "data_type": "BYTE_ARRAY" },
    { "col_name": "Location_Description", "data_type": "BYTE_ARRAY" },
    { "col_name": "Arrest", "data_type": "BOOLEAN" },
    { "col_name": "Domestic", "data_type": "BOOLEAN" },
    { "col_name": "Beat", "data_type": "BYTE_ARRAY" },
    { "col_name": "District", "data_type": "BYTE_ARRAY" },
    { "col_name": "Ward", "data_type": "BYTE_ARRAY" },
    { "col_name": "Community", "data_type": "BYTE_ARRAY" },
    { "col_name": "FBI_Code", "data_type": "BYTE_ARRAY" },
    { "col_name": "X_Coordinate", "data_type": "BYTE_ARRAY" },
    { "col_name": "Y_Coordinate", "data_type": "BYTE_ARRAY" },
    { "col_name": "Year", "data_type": "INT32" },
    { "col_name": "Updated_On", "data_type": "BYTE_ARRAY" },
    { "col_name": "Latitude", "data_type": "DOUBLE" },
    { "col_name": "Longitude", "data_type": "DOUBLE" },
    { "col_name": "Location", "data_type": "BYTE_ARRAY" }
  ]
}