Hi all,
I'm trying to read around 1,000,000 JSONL files stored in S3 using Spark.
Once read, I need to write them to BigQuery.
I have a schema that may not exactly match every record.
How can I filter out records that don't match the schema exactly?
For example, if my records were:
{"x": 1, "y": 1}
{"x": 2, "y": 2}
{"x": 3, "y": 3}
{"x": 4, "y": "4"}
{"x": 5, "y": 5, "z": 5}
and if my schema were:
root
|-- x: long (nullable = true)
|-- y: long (nullable = true)
I need records 4 and 5 to be filtered out.
Record 4 should be filtered out since y is a string instead of a long.
Record 5 should be filtered out since z is not part of the schema.
I tried applying my schema on read, but it doesn't do what I need:

StructType schema = new StructType()
        .add("x", DataTypes.LongType)
        .add("y", DataTypes.LongType);
Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
This gives me a dataset where record 4 comes through with y = null, and
record 5 comes through with x and y (z is just dropped); neither record
gets filtered out.
Any help is appreciated.
--
Thanks,
Shashank Rao