ARGH!! Looks like a formatting issue. Spark doesn’t like ‘pretty’ output.
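By default `spark.read.json` treats its input as JSON Lines: one complete JSON record per line of the file, which is why a pretty-printed record comes back as `_corrupt_record`. A minimal sketch of a workaround (plain Scala, no Spark required; naive in that it assumes no string value contains an embedded newline): collapse the pretty-printed file onto a single line before handing it to Spark.

```scala
import scala.io.Source
import java.io.{File, PrintWriter}

// Naive compaction: strip line breaks and indentation from a pretty-printed
// JSON file so the whole record sits on one line, as spark.read.json expects.
// Assumes no string values contain embedded newlines.
def compactJson(inPath: String, outPath: String): Unit = {
  val src = Source.fromFile(inPath)
  val oneLine =
    try src.getLines().map(_.trim).mkString("")
    finally src.close()
  val out = new PrintWriter(new File(outPath))
  try out.println(oneLine)
  finally out.close()
}
```

Alternatively, on Spark 2.0 you can avoid rewriting the file by reading it as a single RDD element, e.g. `spark.read.json(spark.sparkContext.wholeTextFiles(path).map(_._2))`; later releases (2.2+) added a `multiLine` option to the JSON reader for exactly this case.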
So then the entire record which defines the schema has to be a single line? Really?

On Nov 2, 2016, at 1:50 PM, Michael Segel <msegel_had...@hotmail.com> wrote:

This may be a silly mistake on my part…

Doing an example using Chicago's crime data. (There's a lot of it going around. ;-) The goal is to read a file containing a JSON record that describes the crime-data CSV for ingestion into a data frame, then output to a Parquet file. (Pretty simple, right?)

I ran this both in Zeppelin and in the spark-shell (2.0.1):

// Set up the environment
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

// Load the JSON from file:
val df = spark.read.json("~/datasets/Chicago_Crimes.json")
df.show()

The output:

df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|      "metadata": {|
|  "source": "CSV_...|
|  "table": "Chica...|
|  "compression": ...|
|                  },|
|      "columns": [{|
|  "col_name": "Id",|
|  "data_type": "I...|
|               }, {|
|  "col_name": "Ca...|
|  "data_type": "B...|
|               }, {|

I checked the JSON file against a JSONLint tool (two, actually); my JSON record is valid with no errors (see below). So what's happening? What am I missing?

The goal is to create an ingestion schema for each source. From this I can build the schema for the Parquet file or other data target.
Thx,
-Mike

My JSON record:

{
  "metadata": {
    "source": "CSV_FILE",
    "table": "Chicago_Crime",
    "compression": "SNAPPY"
  },
  "columns": [
    { "col_name": "Id", "data_type": "INT64" },
    { "col_name": "Case_No.", "data_type": "BYTE_ARRAY" },
    { "col_name": "Date", "data_type": "BYTE_ARRAY" },
    { "col_name": "Block", "data_type": "BYTE_ARRAY" },
    { "col_name": "IUCR", "data_type": "INT32" },
    { "col_name": "Primary_Type", "data_type": "BYTE_ARRAY" },
    { "col_name": "Description", "data_type": "BYTE_ARRAY" },
    { "col_name": "Location_Description", "data_type": "BYTE_ARRAY" },
    { "col_name": "Arrest", "data_type": "BOOLEAN" },
    { "col_name": "Domestic", "data_type": "BOOLEAN" },
    { "col_name": "Beat", "data_type": "BYTE_ARRAY" },
    { "col_name": "District", "data_type": "BYTE_ARRAY" },
    { "col_name": "Ward", "data_type": "BYTE_ARRAY" },
    { "col_name": "Community", "data_type": "BYTE_ARRAY" },
    { "col_name": "FBI_Code", "data_type": "BYTE_ARRAY" },
    { "col_name": "X_Coordinate", "data_type": "BYTE_ARRAY" },
    { "col_name": "Y_Coordinate", "data_type": "BYTE_ARRAY" },
    { "col_name": "Year", "data_type": "INT32" },
    { "col_name": "Updated_On", "data_type": "BYTE_ARRAY" },
    { "col_name": "Latitude", "data_type": "DOUBLE" },
    { "col_name": "Longitude", "data_type": "DOUBLE" },
    { "col_name": "Location", "data_type": "BYTE_ARRAY" }
  ]
}