Update: I got the first case (record 4) working by using the _corrupt_record
field:

schema = schema.add("_corrupt_record", DataTypes.StringType);
Dataset<Row> ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path");
ds = ds.filter(functions.col("_corrupt_record").isNull());
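
For anyone following along, here is a slightly fuller sketch of what I'm doing (untested outside my setup; the S3 path and app name are placeholders, and the BigQuery write is omitted). It also drops the helper column before writing, and it only handles record 4:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class FilterMalformedJson {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("filter-malformed-json").getOrCreate();

        // Declared schema plus the column PERMISSIVE mode uses for malformed records.
        StructType schema = new StructType()
                .add("x", DataTypes.LongType)
                .add("y", DataTypes.LongType)
                .add("_corrupt_record", DataTypes.StringType);

        Dataset<Row> ds = spark.read()
                .schema(schema)
                .option("mode", "PERMISSIVE")
                .json("s3a://my-bucket/path/*.jsonl");  // placeholder path

        // Rows that fail type coercion (e.g. y = "4") get their raw JSON put into
        // _corrupt_record, so keeping the nulls keeps only cleanly parsed rows.
        // Drop the helper column afterwards so it doesn't reach BigQuery.
        Dataset<Row> clean = ds
                .filter(functions.col("_corrupt_record").isNull())
                .drop("_corrupt_record");

        clean.show();
        // write "clean" to BigQuery here
    }
}

One caveat: if Spark complains that a query references only the internal corrupt record column (it restricts that since 2.3), caching ds before filtering seems to work around it.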

However, I haven't figured out how to ignore record 5.

Any help is appreciated.

On Mon, 3 Jul 2023 at 19:24, Shashank Rao <shashank93...@gmail.com> wrote:

> Hi all,
> I'm trying to read around 1,000,000 JSONL files stored in S3 using Spark.
> Once read, I need to write them to BigQuery.
> I have a schema that may not exactly match all of the records.
> How can I filter out records that do not match the schema exactly?
>
> Eg: if my records were:
> {"x": 1, "y": 1}
> {"x": 2, "y": 2}
> {"x": 3, "y": 3}
> {"x": 4, "y": "4"}
> {"x": 5, "y": 5, "z": 5}
>
> and if my schema were:
> root
>  |-- x: long (nullable = true)
>  |-- y: long (nullable = true)
>
> I need the records 4 and 5 to be filtered out.
> Record 4 should be filtered out since y is a string instead of long.
> Record 5 should be filtered out since z is not part of the schema.
>
> I tried applying my schema on read, but it does not work as needed:
>
> StructType schema = new StructType().add("x", DataTypes.LongType).add("y", DataTypes.LongType);
> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
>
> This gives me a dataset where record 4 appears with y = null and record 5
> appears with just its x and y values (z is silently dropped).
>
> Any help is appreciated.
>
> --
> Thanks,
> Shashank Rao
>


-- 
Regards,
Shashank Rao
