Have you tried the DROPMALFORMED option?
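Something like this (an untested sketch; it assumes the same x/y schema as
below, with spark being your SparkSession and the path a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
    .add("x", DataTypes.LongType)
    .add("y", DataTypes.LongType);

// DROPMALFORMED silently drops rows that fail to parse against the schema,
// which should take care of record 4 ({"x": 4, "y": "4"}).
Dataset<Row> ds = spark.read()
    .schema(schema)
    .option("mode", "DROPMALFORMED")
    .json("path/to/files");

Note that it won't help with record 5, though: as far as I know, extra
fields that aren't in the schema are silently ignored, not treated as
malformed.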
On Mon, Jul 3, 2023, 1:34 PM Shashank Rao <shashank93...@gmail.com> wrote:

> Update: Got it working by using the *_corrupt_record* field for the first
> case (record 4):
>
> schema = schema.add("_corrupt_record", DataTypes.StringType);
> Dataset<Row> ds = spark.read().schema(schema)
>     .option("mode", "PERMISSIVE").json("path");
> ds = ds.filter(functions.col("_corrupt_record").isNull());
>
> However, I haven't figured out how to ignore record 5.
>
> Any help is appreciated.
>
> On Mon, 3 Jul 2023 at 19:24, Shashank Rao <shashank93...@gmail.com> wrote:
>
>> Hi all,
>> I'm trying to read around 1,000,000 JSONL files present in S3 using
>> Spark. Once read, I need to write them to BigQuery.
>> I have a schema that may not be an exact match with all the records.
>> How can I filter out the records that don't match the schema exactly?
>>
>> E.g., if my records were:
>>
>> {"x": 1, "y": 1}
>> {"x": 2, "y": 2}
>> {"x": 3, "y": 3}
>> {"x": 4, "y": "4"}
>> {"x": 5, "y": 5, "z": 5}
>>
>> and my schema were:
>>
>> root
>>  |-- x: long (nullable = true)
>>  |-- y: long (nullable = true)
>>
>> then records 4 and 5 need to be filtered out.
>> Record 4 should be filtered out since y is a string instead of a long.
>> Record 5 should be filtered out since z is not part of the schema.
>>
>> I tried applying my schema on read, but it does not work as needed:
>>
>> StructType schema = new StructType().add("x", DataTypes.LongType)
>>     .add("y", DataTypes.LongType);
>> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
>>
>> This gives me a dataset where record 4 has y=null and record 5 has only
>> x and y.
>>
>> Any help is appreciated.
>>
>> --
>> Thanks,
>> Shashank Rao
>
>
> --
> Regards,
> Shashank Rao
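Regarding record 5: since extra fields are ignored rather than flagged, one
workaround might be to read the files as plain text first and drop lines
whose top-level keys go beyond the schema, using the json_object_keys SQL
function (available since Spark 3.1; I'm calling it through expr() because
I'm not sure the Java functions API exposes it directly). A rough, untested
sketch, reusing the x/y schema from above with a placeholder path. It only
catches *extra* keys, so you'd still combine it with the _corrupt_record
filter to handle record 4:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read raw JSONL lines; the resulting column is named "value".
Dataset<String> lines = spark.read().textFile("path/to/files");

// Keep only lines with no top-level keys other than x and y, then parse
// the survivors with the schema.
Dataset<Row> ds = lines
    .filter(expr(
        "size(array_except(json_object_keys(value), array('x','y'))) = 0"))
    .select(from_json(col("value"), schema).alias("parsed"))
    .select("parsed.*");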