Not sure if this helps you, but you can infer the schema programmatically by providing a JSON schema file:
    import org.apache.hadoop.fs.{FSDataInputStream, Path}
    import org.apache.spark.sql.types.{DataType, StructType}

    val path: Path = new Path(schema_parquet)
    val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
    val inputStream: FSDataInputStream = fileSystem.open(path)
    val schema_json = Stream.cons(inputStream.readLine(), Stream.continually(inputStream.readLine))
    logger.debug("schema_json looks like " + schema_json.head)
    val mySchemaStructType = DataType.fromJson(schema_json.head).asInstanceOf[StructType]
    logger.debug("mySchemaStructType is " + mySchemaStructType)

where schema_parquet can be something like this:

    {"type": "struct", "fields": [
      {"name": "column0", "type": "string", "nullable": false},
      {"name": "column1", "type": "string", "nullable": false},
      {"name": "column2", "type": "string", "nullable": true},
      {"name": "column3", "type": "string", "nullable": false}]}

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2017-06-02 16:11 GMT+02:00 Aseem Bansal <asmbans...@gmail.com>:

> When we read files in Spark, it infers the schema. We have the option to
> not infer the schema. Is there a way to ask Spark to infer the schema
> again, just like when reading JSON?
>
> The reason we want this is that we have a problem in our data files. We
> have a JSON file containing this:
>
> {"a": NESTED_JSON_VALUE}
> {"a": "null"}
>
> It should have been empty JSON, but due to a bug it became "null" instead.
> Now, when we read the file, the column "a" is inferred as a String.
> Instead, what we want is to ask Spark to read the file with "a" as a
> String, filter the "null" rows out (or replace them with empty JSON), and
> then ask Spark to infer the schema of "a" after the fix, so we can access
> the nested JSON properly.
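
For the two-pass approach described in the question, something like the following sketch might work (assuming Spark 2.x with a SparkSession named `spark`; the file name "data.json" is a placeholder, not from the original thread). The first pass reads "a" as whatever Spark infers (a String, given the conflicting rows), the bogus literal "null" rows are filtered out, and then `spark.read.json` over a `Dataset[String]` re-infers the nested schema from the surviving JSON text:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("reinfer").getOrCreate()
    import spark.implicits._

    // First pass: "a" comes back as a String because the rows disagree.
    val raw = spark.read.json("data.json")

    // Drop the rows where the bug wrote the literal string "null".
    val cleaned = raw.filter($"a" =!= "null")

    // Second pass: hand the cleaned JSON strings back to the reader so it
    // can infer the nested schema of "a" properly this time.
    val reinferred = spark.read.json(cleaned.select($"a").as[String])
    reinferred.printSchema()

Note that `spark.read.json(Dataset[String])` triggers a full scan to infer the schema, so for large data it may be cheaper to infer once on a sample and pass the resulting StructType explicitly via `.schema(...)`.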