I think you can define the schema with an extra column z and then keep only the records where z is null, which drops record 5.
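
Here is a rough sketch of that idea in Java, combined with the _corrupt_record filter you already have for record 4. The class name StrictSchemaFilter, the method readStrict and the path argument are placeholders I made up for illustration. It assumes z is the only extra field you need to catch and that you know its name up front; a record carrying some other unknown field would still get through, and an explicit "z": null is treated the same as a missing z.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical helper class, not part of Spark.
public class StrictSchemaFilter {

    // Reads JSONL with a widened schema, then keeps only rows that
    // parsed cleanly and do not carry the extra field z.
    public static Dataset<Row> readStrict(SparkSession spark, String path) {
        StructType schema = new StructType()
                .add("x", DataTypes.LongType)
                .add("y", DataTypes.LongType)
                // Assumption: z is the only unwanted extra field.
                .add("z", DataTypes.LongType)
                // Receives the raw text of records that fail to parse.
                .add("_corrupt_record", DataTypes.StringType);

        Dataset<Row> raw = spark.read()
                .schema(schema)
                .option("mode", "PERMISSIVE")
                .json(path);

        return raw
                // Drops record 4 (y is a string) and record 5 (z is set).
                .filter(col("_corrupt_record").isNull().and(col("z").isNull()))
                // Restore the original two-column shape before writing out.
                .drop("_corrupt_record", "z");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("strict-schema-filter")
                .master("local[*]")
                .getOrCreate();
        // args[0] is the input path, e.g. a directory of JSONL files.
        readStrict(spark, args[0]).show();
        spark.stop();
    }
}
```

The result should have only the x and y columns, so it can go to the BigQuery writer unchanged.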

On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao <shashank93...@gmail.com> wrote:

> Yes, DROPMALFORMED does filter out record 4. However, record 5 is not
> filtered out.
>
> On Tue, 4 Jul 2023 at 07:41, Vikas Kumar <vku...@etsy.com> wrote:
>
>> Have you tried the DROPMALFORMED option?
>>
>> On Mon, Jul 3, 2023, 1:34 PM Shashank Rao <shashank93...@gmail.com>
>> wrote:
>>
>>> Update: Got it working by using the *_corrupt_record* field for the
>>> first case (record 4):
>>>
>>> schema = schema.add("_corrupt_record", DataTypes.StringType);
>>> Dataset<Row> ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path");
>>> ds = ds.filter(functions.col("_corrupt_record").isNull());
>>>
>>> However, I haven't figured out how to ignore record 5.
>>>
>>> Any help is appreciated.
>>>
>>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao <shashank93...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>> I'm trying to read around 1,000,000 JSONL files present in S3 using
>>>> Spark. Once read, I need to write them to BigQuery.
>>>> I have a schema that may not be an exact match with all the records.
>>>> How can I filter out records where there isn't an exact schema match?
>>>>
>>>> E.g., if my records were:
>>>> {"x": 1, "y": 1}
>>>> {"x": 2, "y": 2}
>>>> {"x": 3, "y": 3}
>>>> {"x": 4, "y": "4"}
>>>> {"x": 5, "y": 5, "z": 5}
>>>>
>>>> and if my schema were:
>>>> root
>>>>  |-- x: long (nullable = true)
>>>>  |-- y: long (nullable = true)
>>>>
>>>> I need records 4 and 5 to be filtered out.
>>>> Record 4 should be filtered out since y is a string instead of a long.
>>>> Record 5 should be filtered out since z is not part of the schema.
>>>>
>>>> I tried applying my schema on read, but it does not work as needed:
>>>>
>>>> StructType schema = new StructType().add("x", DataTypes.LongType).add("y", DataTypes.LongType);
>>>> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
>>>>
>>>> This gives me a dataset that has record 4 with y=null and record 5 with
>>>> only x and y.
>>>>
>>>> Any help is appreciated.
>>>>
>>>> --
>>>> Thanks,
>>>> Shashank Rao
>>>>
>>>
>>> --
>>> Regards,
>>> Shashank Rao
>>>
>>
>
> --
> Regards,
> Shashank Rao
>