Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Z is just an example. It could be anything. Basically, anything that's not in the schema should be filtered out.

On Tue, 4 Jul 2023, 13:27 Hill Liu wrote:
> I think you can define the schema with column z and filter out records where z is null.
>
> On Tue, Jul 4, 2023 at 3:24 PM Shashank
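The idea in this thread — treating any field outside the declared schema as a reason to drop the whole record — can be sketched outside Spark in plain Python (the thread itself uses Spark's Java API; the record contents below are hypothetical):

```python
import json

SCHEMA_FIELDS = {"a", "b"}  # hypothetical declared schema

def keep(line: str) -> bool:
    """Keep a JSONL line only if every field it carries is in the schema."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False  # unparseable lines are dropped too
    return isinstance(record, dict) and set(record) <= SCHEMA_FIELDS

lines = [
    '{"a": 1, "b": 2}',           # matches the schema: kept
    '{"a": 1, "b": 2, "z": 3}',   # extra field z: dropped
]
kept = [line for line in lines if keep(line)]
print(kept)
```

A subset test (`<=`) rather than equality also admits records with *missing* schema fields; tighten it to `set(record) == SCHEMA_FIELDS` if those should be dropped as well.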

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Yes, dropmalformed does filter out record 4. However, record 5 is not filtered.

On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote:
> Have you tried the dropmalformed option?
>
> On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote:
>> Update: Got it working by using the *_corrupt_record* field
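The distinction this reply hits on — DROPMALFORMED removes lines that fail to parse against the schema, but a well-formed line carrying an extra field is not "malformed" — can be illustrated with a small Python check (the record contents are hypothetical stand-ins for record 4 and record 5):

```python
import json

def parses(line: str) -> bool:
    """Roughly the test DROPMALFORMED applies: does the line parse at all?"""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False

record4 = '{"a": 1, "b": '            # truncated JSON: DROPMALFORMED drops it
record5 = '{"a": 1, "b": 2, "z": 3}'  # valid JSON with an extra field: it survives

print(parses(record4), parses(record5))
```

This is why a second pass (an explicit key-set check, as discussed elsewhere in the thread) is needed to reject record 5.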

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
ds.filter(functions.col("_corrupt_record").isNull()).collect(); However, I haven't figured out how to ignore record 5. Any help is appreciated.

On Mon, 3 Jul 2023 at 19:24, Shashank Rao wrote:
> Hi all,
> I'm trying to read around 1,000,000 JSONL files present in S3 using Spark.

Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
…", DataTypes.LongType).add("b", DataTypes.LongType); Dataset ds = spark.read().schema(schema).json("path/to/file") This gives me a dataset that has record 4 with y=null and record 5 with x and y. Any help is appreciated. -- Thanks, Shashank Rao
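The behavior described here is Spark's default permissive schema read: fields missing from a record come back as null, and fields outside the schema are silently ignored, so both record 4 and record 5 survive the read. A plain-Python sketch of that projection, with hypothetical record contents and the two-field schema from the snippet:

```python
import json

SCHEMA = ("a", "b")  # the declared schema from the snippet

def apply_schema(line: str) -> dict:
    """Project a JSONL line onto the schema: missing fields become None,
    extra fields are silently dropped (roughly Spark's permissive read)."""
    record = json.loads(line)
    return {field: record.get(field) for field in SCHEMA}

record4 = '{"a": 4}'                  # missing b: read back with b=None
record5 = '{"a": 5, "b": 6, "z": 7}'  # extra z: silently discarded

print(apply_schema(record4))
print(apply_schema(record5))
```

Because the projection itself throws nothing away visibly, rejecting such records requires comparing the raw record's key set against the schema before (or instead of) relying on the schema read.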

Understanding Spark S3 Read Performance

2023-05-16 Thread Shashank Rao
Modifying the source data is not an option that I have. Hence, I cannot merge multiple small files into a single large file. Any help is appreciated. -- Thanks, Shashank Rao