Z is just an example. It could be anything. Basically, anything that's not in the schema should be filtered out.
On Tue, 4 Jul 2023, 13:27 Hill Liu, <hill....@gmail.com> wrote:

> I think you can define the schema with column z and filter out the records
> where z is not null.
>
> On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao <shashank93...@gmail.com> wrote:
>
>> Yes, dropMalformed does filter out record 4. However, record 5 is not
>> filtered.
>>
>> On Tue, 4 Jul 2023 at 07:41, Vikas Kumar <vku...@etsy.com> wrote:
>>
>>> Have you tried the dropMalformed option?
>>>
>>> On Mon, Jul 3, 2023, 1:34 PM Shashank Rao <shashank93...@gmail.com> wrote:
>>>
>>>> Update: Got it working by using the _corrupt_record field for the
>>>> first case (record 4):
>>>>
>>>> schema = schema.add("_corrupt_record", DataTypes.StringType);
>>>> Dataset<Row> ds = spark.read().schema(schema)
>>>>     .option("mode", "PERMISSIVE").json("path");
>>>> ds = ds.filter(functions.col("_corrupt_record").isNull());
>>>>
>>>> However, I haven't figured out how to ignore record 5.
>>>>
>>>> Any help is appreciated.
>>>>
>>>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao <shashank93...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm trying to read around 1,000,000 JSONL files present in S3 using
>>>>> Spark. Once read, I need to write them to BigQuery.
>>>>> I have a schema that may not be an exact match with all the records.
>>>>> How can I filter out records that aren't an exact schema match?
>>>>>
>>>>> E.g., if my records were:
>>>>> {"x": 1, "y": 1}
>>>>> {"x": 2, "y": 2}
>>>>> {"x": 3, "y": 3}
>>>>> {"x": 4, "y": "4"}
>>>>> {"x": 5, "y": 5, "z": 5}
>>>>>
>>>>> and my schema were:
>>>>> root
>>>>>  |-- x: long (nullable = true)
>>>>>  |-- y: long (nullable = true)
>>>>>
>>>>> I need records 4 and 5 to be filtered out.
>>>>> Record 4 should be filtered out since y is a string instead of a long.
>>>>> Record 5 should be filtered out since z is not part of the schema.
>>>>>
>>>>> I tried applying my schema on read, but it does not work as needed:
>>>>>
>>>>> StructType schema = new StructType()
>>>>>     .add("x", DataTypes.LongType)
>>>>>     .add("y", DataTypes.LongType);
>>>>> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
>>>>>
>>>>> This gives me a dataset that has record 4 with y=null and record 5
>>>>> with x and y.
>>>>>
>>>>> Any help is appreciated.
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Shashank Rao
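As far as I know, Spark's JSON reader has no built-in option that rejects records carrying fields *outside* the schema (PERMISSIVE/_corrupt_record and dropMalformed only catch type mismatches and unparseable lines), so one generic approach is to validate each record's key set and value types yourself, e.g. inside a filter over the raw JSONL lines after parsing them with a JSON library. Below is a minimal, stdlib-only sketch of that strict-match check; it operates on plain Java Maps standing in for parsed JSON objects (in a real pipeline a parser such as Jackson or Gson would produce these), and the `SCHEMA` map mirroring the x/y schema is a hypothetical example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StrictMatch {
    // Expected schema: field name -> required Java type (hypothetical example
    // mirroring the thread's schema of x: long, y: long).
    static final Map<String, Class<?>> SCHEMA = new LinkedHashMap<>();
    static {
        SCHEMA.put("x", Long.class);
        SCHEMA.put("y", Long.class);
    }

    // A record matches only if every key is declared in the schema (so
    // record 5, which carries the extra field z, fails) and every value has
    // the declared type (so record 4, where y is a String, fails too).
    static boolean matches(Map<String, Object> record) {
        for (Map.Entry<String, Object> e : record.entrySet()) {
            Class<?> expected = SCHEMA.get(e.getKey());
            if (expected == null) return false;                   // extra field
            if (!expected.isInstance(e.getValue())) return false; // wrong type
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> ok = new LinkedHashMap<>();
        ok.put("x", 1L); ok.put("y", 1L);

        Map<String, Object> wrongType = new LinkedHashMap<>();   // record 4
        wrongType.put("x", 4L); wrongType.put("y", "4");

        Map<String, Object> extraField = new LinkedHashMap<>();  // record 5
        extraField.put("x", 5L); extraField.put("y", 5L); extraField.put("z", 5L);

        System.out.println(matches(ok));         // true
        System.out.println(matches(wrongType));  // false
        System.out.println(matches(extraField)); // false
    }
}
```

Note this sketch deliberately does not require every schema field to be present (records 1–3 all carry both x and y); if missing fields should also be rejected, an extra check that the record's key set contains all of `SCHEMA.keySet()` would be needed.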