Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Z is just an example; it could be anything. Basically, any record containing a field that's not in the schema should be filtered out.

On Tue, 4 Jul 2023, 13:27 Hill Liu wrote:
> I think you can define the schema with column z and filter out records based on z being null.
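
Since the extra field can be anything, a per-column null check doesn't generalize. One generic approach (not one the thread settles on; a sketch assuming the spark session and schema from the original post, and a flat schema) is to read the files as plain text, compare each record's top-level keys against the schema's field names with Jackson, and parse only the survivors:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Top-level field names the schema allows.
Set<String> allowed = new HashSet<>(Arrays.asList(schema.fieldNames()));

// Read the raw lines; the S3 path is illustrative.
Dataset<String> lines = spark.read().textFile("s3a://bucket/prefix/");

Dataset<String> conforming = lines.filter((FilterFunction<String>) line -> {
    try {
        // A real job would reuse one ObjectMapper per partition via mapPartitions.
        JsonNode node = new ObjectMapper().readTree(line);
        Iterator<String> keys = node.fieldNames();
        while (keys.hasNext()) {
            if (!allowed.contains(keys.next())) {
                return false; // extra field (e.g. z): drop the record
            }
        }
        return true;
    } catch (Exception e) {
        return false; // unparseable JSON: drop the record
    }
});

// Parse only the lines that passed the key check.
Dataset<Row> ds = spark.read().schema(schema).json(conforming);

This rejects both malformed records and records with unknown fields in one pass, at the cost of parsing each line twice.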

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Hill Liu
I think you can define the schema with column z and filter out records based on z being null.

On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao wrote:
> Yes, DROPMALFORMED does filter out record 4. However, record 5 is not filtered.
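
In code, the suggestion would look roughly like this (a sketch assuming the spark session and schema from the original post, and reading the intent as: keep only the rows where z parsed as null, i.e. records that never had a z):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;

// Add the known extra column so Spark populates it when present.
StructType schemaWithZ = schema.add("z", DataTypes.LongType);

Dataset<Row> ds = spark.read().schema(schemaWithZ).json("path");

// Rows that match the original schema have no z, so it parses as null.
Dataset<Row> matching = ds.filter(col("z").isNull()).drop("z");

As Shashank notes above, this only works when the extra field's name is known in advance.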

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Yes, DROPMALFORMED does filter out record 4. However, record 5 is not filtered.

On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote:
> Have you tried the dropMalformed option?

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Vikas Kumar
Have you tried the dropMalformed option?

On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote:
> Update: Got it working by using the _corrupt_record field for the first case (record 4).
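
For reference, the option being suggested is the JSON reader's parse mode; a minimal sketch, assuming the spark session and schema from the original post:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// DROPMALFORMED silently discards records that fail to parse against the
// schema (record 4's case). Records with extra fields still parse fine,
// so they are kept (record 5's case).
Dataset<Row> ds = spark.read()
        .schema(schema)
        .option("mode", "DROPMALFORMED")
        .json("path");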

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Update: Got it working by using the _corrupt_record field for the first case (record 4):

schema = schema.add("_corrupt_record", DataTypes.StringType);
Dataset<Row> ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path");
ds = ...
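
The archive cuts the snippet off at the last line (the original also chained .collect(), whose result isn't a Dataset, so it is dropped here). A completed sketch of the same approach:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// PERMISSIVE mode keeps every record and stores the raw text of any record
// that failed to parse in the _corrupt_record column.
StructType withCorrupt = schema.add("_corrupt_record", DataTypes.StringType);

Dataset<Row> ds = spark.read()
        .schema(withCorrupt)
        .option("mode", "PERMISSIVE")
        .json("path")
        .cache(); // Spark disallows some queries over raw JSON that reference
                  // only _corrupt_record; caching first is the documented workaround.

// Keep records that parsed cleanly, then drop the helper column.
Dataset<Row> valid = ds.filter(ds.col("_corrupt_record").isNull()).drop("_corrupt_record");

Like DROPMALFORMED, this catches record 4 but not record 5, since a record with an extra field still parses cleanly.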

Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Hi all, I'm trying to read around 1,000,000 JSONL files from S3 using Spark. Once read, I need to write them to BigQuery. I have a schema that may not exactly match all of the records. How can I filter out the records that don't match the schema exactly? Eg: if my records were: {"x": 1, ...
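
The example records are cut off by the archive. From the rest of the thread, the two problem cases appear to be a record that fails to parse against the schema (record 4) and a record that parses but carries an extra field z (record 5). For context, a baseline read with an explicit schema might look like this (the path and everything beyond the field x are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("json-filter").getOrCreate();

// Expected schema; "x" comes from the truncated example above.
StructType schema = new StructType().add("x", DataTypes.LongType);

// With the default PERMISSIVE mode, unparseable records come back as all-null
// rows and extra fields such as "z" are silently ignored, so neither problem
// case is rejected by default.
Dataset<Row> ds = spark.read().schema(schema).json("s3a://bucket/prefix/");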