Not sure if this helps you, but you can infer the schema programmatically by providing a JSON schema file:
    import org.apache.hadoop.fs.{FSDataInputStream, Path}
    import org.apache.spark.sql.types.{DataType, StructType}

    val path: Path = new Path(schema_parquet)
    val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
    val inputStream: FSDataInputStream = fileSystem.open(path)
    val schema_json = Stream.cons(inputStream.readLine(), Stream.continually(inputStream.readLine))
    logger.debug("schema_json looks like " + schema_json.head)
    val mySchemaStructType = DataType.fromJson(schema_json.head).asInstanceOf[StructType]
    logger.debug("mySchemaStructType is " + mySchemaStructType)

where schema_parquet can be something like this:

    {"type": "struct", "fields": [
      {"name": "column0", "type": "string", "nullable": false},
      {"name": "column1", "type": "string", "nullable": false},
      {"name": "column2", "type": "string", "nullable": true},
      {"name": "column3", "type": "string", "nullable": false}]}

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2017-06-02 16:11 GMT+02:00 Aseem Bansal <asmbans...@gmail.com>:

> When we read files in Spark, it infers the schema. We have the option to
> not infer the schema. Is there a way to ask Spark to infer the schema
> again, just like when reading JSON?
>
> The reason we want this is that we have a problem in our data files. We
> have a JSON file containing this:
>
> {"a": NESTED_JSON_VALUE}
> {"a": "null"}
>
> It should have been empty JSON, but due to a bug it became "null" instead.
> Now, when we read the file, the column "a" is inferred as a String.
> Instead, what we want is to ask Spark to read the file with "a" as a
> String, filter the "null" rows out (or replace them with empty JSON), and
> then ask Spark to infer the schema of "a" after the fix, so we can access
> the nested JSON properly.
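
For the two-pass approach described in the question, something like the following sketch might work (assuming Spark 2.x with a SparkSession named `spark`; the file name "data.json" is a placeholder, not from the original thread). The first pass reads "a" as whatever Spark infers (a String, given the conflicting rows), the bogus literal "null" rows are filtered out, and then `spark.read.json` over a `Dataset[String]` re-infers the nested schema from the surviving JSON text:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("reinfer").getOrCreate()
    import spark.implicits._

    // First pass: "a" comes back as a String because the rows disagree.
    val raw = spark.read.json("data.json")

    // Drop the rows where the bug wrote the literal string "null".
    val cleaned = raw.filter($"a" =!= "null")

    // Second pass: hand the cleaned JSON strings back to the reader so it
    // can infer the nested schema of "a" properly this time.
    val reinferred = spark.read.json(cleaned.select($"a").as[String])
    reinferred.printSchema()

Note that `spark.read.json(Dataset[String])` triggers a full scan to infer the schema, so for large data it may be cheaper to infer once on a sample and pass the resulting StructType explicitly via `.schema(...)`.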