Hi all, I'm trying to read around 1,000,000 JSONL files stored in S3 using Spark. Once read, I need to write them to BigQuery. I have a schema that may not exactly match every record. How can I filter out the records that do not match the schema exactly?
For example, if my records were:

    {"x": 1, "y": 1}
    {"x": 2, "y": 2}
    {"x": 3, "y": 3}
    {"x": 4, "y": "4"}
    {"x": 5, "y": 5, "z": 5}

and my schema were:

    root
     |-- x: long (nullable = true)
     |-- y: long (nullable = true)

I need records 4 and 5 to be filtered out. Record 4 should be filtered out because y is a string instead of a long. Record 5 should be filtered out because z is not part of the schema.

I tried applying my schema on read, but it does not work as needed:

    StructType schema = new StructType()
        .add("x", DataTypes.LongType)
        .add("y", DataTypes.LongType);
    Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");

This gives me a dataset that has record 4 with y=null and record 5 with only x and y.

Any help is appreciated.

--
Thanks,
Shashank Rao