Z is just an example. It could be any field. Basically, anything that's not
in the schema should be filtered out.
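
A minimal sketch of one way to enforce that strictly (untested, and the
Record POJO and paths here are just placeholders): read the files as text,
keep only the lines Jackson can bind to the expected shape with unknown
fields and scalar coercion disallowed, and let Spark parse the survivors:

import com.fasterxml.jackson.databind.MapperFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StrictJsonFilter {
  // POJO mirroring the schema: an unknown key (z) or a non-long y fails binding.
  public static class Record {
    public Long x;
    public Long y;
  }

  // FAIL_ON_UNKNOWN_PROPERTIES is on by default, which rejects record 5;
  // disabling scalar coercion makes the string "4" fail to bind to a Long,
  // which rejects record 4.
  private static final ObjectMapper MAPPER =
      new ObjectMapper().disable(MapperFeature.ALLOW_COERCION_OF_SCALARS);

  public static Dataset<Row> readStrict(SparkSession spark, String path) {
    Dataset<String> lines = spark.read().textFile(path);
    Dataset<String> valid = lines.filter((FilterFunction<String>) line -> {
      try {
        MAPPER.readValue(line, Record.class);
        return true;
      } catch (Exception e) {
        return false; // malformed or schema-violating line: drop it
      }
    });
    return spark.read().json(valid); // parse only the surviving lines
  }
}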

On Tue, 4 Jul 2023, 13:27 Hill Liu, <hill....@gmail.com> wrote:

> I think you can define the schema with column z and filter out the records
> where z is not null.
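>
> A sketch of that idea (assuming z is the only extra column and is a long):
>
> StructType schema = new StructType()
>     .add("x", DataTypes.LongType)
>     .add("y", DataTypes.LongType)
>     .add("z", DataTypes.LongType);
> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file")
>     .filter(functions.col("z").isNull())
>     .drop("z");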
>
> On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao <shashank93...@gmail.com>
> wrote:
>
>> Yes, dropMalformed does filter out record 4. However, record 5 is not
>> filtered out.
>>
>> On Tue, 4 Jul 2023 at 07:41, Vikas Kumar <vku...@etsy.com> wrote:
>>
>>> Have you tried the dropMalformed option?
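>>>
>>> Something like this (sketch, with the schema from your example):
>>>
>>> Dataset<Row> ds = spark.read().schema(schema)
>>>     .option("mode", "DROPMALFORMED")
>>>     .json("path/to/file");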
>>>
>>> On Mon, Jul 3, 2023, 1:34 PM Shashank Rao <shashank93...@gmail.com>
>>> wrote:
>>>
>>>> Update: Got it working by using the *_corrupt_record* field for the
>>>> first case (record 4):
>>>>
>>>> schema = schema.add("_corrupt_record", DataTypes.StringType);
>>>> Dataset<Row> ds = spark.read().schema(schema)
>>>>     .option("mode", "PERMISSIVE").json("path");
>>>> // Rows that failed to parse against the schema get the raw line in
>>>> // _corrupt_record, so keeping the nulls keeps the clean rows.
>>>> ds = ds.filter(functions.col("_corrupt_record").isNull());
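>>>>
>>>> (One caveat, if I'm reading the Spark docs right: since 2.3, a query
>>>> whose referenced columns include only the internal corrupt record
>>>> column, e.g. this filter followed directly by count(), throws an
>>>> AnalysisException; caching the parsed dataset first avoids that.)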
>>>>
>>>> However, I haven't figured out how to ignore record 5.
>>>>
>>>> Any help is appreciated.
>>>>
>>>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao <shashank93...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm trying to read around 1,000,000 JSONL files present in S3 using
>>>>> Spark. Once read, I need to write them to BigQuery.
>>>>> I have a schema that may not exactly match all of the records.
>>>>> How can I filter out the records that don't match the schema exactly?
>>>>>
>>>>> Eg: if my records were:
>>>>> {"x": 1, "y": 1}
>>>>> {"x": 2, "y": 2}
>>>>> {"x": 3, "y": 3}
>>>>> {"x": 4, "y": "4"}
>>>>> {"x": 5, "y": 5, "z": 5}
>>>>>
>>>>> and if my schema were:
>>>>> root
>>>>>  |-- x: long (nullable = true)
>>>>>  |-- y: long (nullable = true)
>>>>>
>>>>> I need the records 4 and 5 to be filtered out.
>>>>> Record 4 should be filtered out since y is a string instead of long.
>>>>> Record 5 should be filtered out since z is not part of the schema.
>>>>>
>>>>> I tried applying my schema on read, but it does not work as needed:
>>>>>
>>>>> StructType schema = new StructType()
>>>>>     .add("x", DataTypes.LongType)
>>>>>     .add("y", DataTypes.LongType);
>>>>> Dataset<Row> ds = spark.read().schema(schema).json("path/to/file");
>>>>>
>>>>> This gives me a dataset that has record 4 with y=null and record 5
>>>>> with x and y.
>>>>>
>>>>> Any help is appreciated.
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Shashank Rao
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Shashank Rao
>>>>
>>>
>>
>> --
>> Regards,
>> Shashank Rao
>>
>
