Yes, dropmalformed does filter out record 4. However, record 5 is not filtered out.

On Tue, 4 Jul 2023 at 07:41, Vikas Kumar <vku...@etsy.com> wrote:
> Have you tried the dropmalformed option?
>
> On Mon, Jul 3, 2023, 1:34 PM Shashank Rao <shashank93...@gmail.com> wrote:
>
>> Update: Got it working by using the _corrupt_record field for the
>> first case (record 4):
>>
>> schema = schema.add("_corrupt_record", DataTypes.StringType);
>> Dataset<Row> ds = spark.read().schema(schema).option("mode",
>> "PERMISSIVE").json("path");
>> ds = ds.filter(functions.col("_corrupt_record").isNull());
>>
>> However, I haven't figured out how to ignore record 5.
>>
>> Any help is appreciated.
>>
>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao <shashank93...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>> I'm trying to read around 1,000,000 JSONL files present in S3 using
>>> Spark. Once read, I need to write them to BigQuery.
>>> I have a schema that may not be an exact match with all the records.
>>> How can I filter out records that don't match the schema exactly?
>>>
>>> E.g., if my records were:
>>> {"x": 1, "y": 1}
>>> {"x": 2, "y": 2}
>>> {"x": 3, "y": 3}
>>> {"x": 4, "y": "4"}
>>> {"x": 5, "y": 5, "z": 5}
>>>
>>> and if my schema were:
>>> root
>>>  |-- x: long (nullable = true)
>>>  |-- y: long (nullable = true)
>>>
>>> I need records 4 and 5 to be filtered out:
>>> record 4 because y is a string instead of a long, and
>>> record 5 because z is not part of the schema.
>>>
>>> I tried applying my schema on read, but it does not work as needed:
>>>
>>> StructType schema = new StructType().add("x",
>>> DataTypes.LongType).add("y", DataTypes.LongType);
>>> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
>>>
>>> This gives me a dataset that has record 4 with y=null and record 5 with
>>> x and y.
>>>
>>> Any help is appreciated.
>>>
>>> --
>>> Thanks,
>>> Shashank Rao
>>
>>
>> --
>> Regards,
>> Shashank Rao
>

--
Regards,
Shashank Rao
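Since dropmalformed handles record 4 but not record 5 (Spark silently ignores JSON fields that are absent from the schema), one option is to do the "exact key match" check before parsing. Below is a minimal, Spark-free sketch of that check for flat, single-line JSON records like the ones in the thread. `SchemaFilter`, `matchesSchema`, and the regex-based key extraction are hypothetical names and simplifications for illustration, not Spark API; in Spark one could apply an equivalent predicate (e.g. as a UDF) to the raw text lines before parsing with the schema.

```java
import java.util.*;
import java.util.regex.*;

public class SchemaFilter {
    // The schema from the thread: exactly the fields x and y, both longs.
    static final Set<String> SCHEMA_FIELDS = Set.of("x", "y");

    // Matches "key": value pairs in a flat, single-line JSON object.
    // Simplification: assumes no nested objects and no commas/braces
    // inside string values.
    static final Pattern PAIR = Pattern.compile("\"([^\"]+)\"\\s*:\\s*([^,}]+)");

    // True only when the record has exactly the schema's keys and every
    // value parses as a long (an unquoted integer literal).
    static boolean matchesSchema(String jsonLine) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(jsonLine);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2).trim());
        }
        // Rejects record 5: extra key z not in the schema.
        if (!pairs.keySet().equals(SCHEMA_FIELDS)) {
            return false;
        }
        // Rejects record 4: "4" is a quoted string, not a long.
        for (String v : pairs.values()) {
            try {
                Long.parseLong(v);
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] records = {
            "{\"x\": 1, \"y\": 1}",          // keeps
            "{\"x\": 2, \"y\": 2}",          // keeps
            "{\"x\": 3, \"y\": 3}",          // keeps
            "{\"x\": 4, \"y\": \"4\"}",      // drops: y is a string
            "{\"x\": 5, \"y\": 5, \"z\": 5}" // drops: extra field z
        };
        for (String r : records) {
            System.out.println(r + " -> " + matchesSchema(r));
        }
    }
}
```

The design point is to run the structural check on the raw line, where the extra field z is still visible, rather than after Spark has already projected the record onto the schema and discarded it.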