Hi Team,

I still need help understanding exactly how the read works here. Could someone clarify?

Thanks,
Sid

On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> Can somebody help?
>
> Thanks,
> Sid
>
> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a JSON dataset in S3, partitioned by edl_timestamp, with directory
>> names like the below:
>>
>> edl_timestamp=20220908000000
>>
>> Now, the problem is that during the first 10 days of data collection there
>> was a duplicate-column issue, due to which we couldn't read that data.
>>
>> The latest 10 days of data are fine. So I am trying something like the
>> below:
>>
>>
>> from pyspark.sql.functions import col
>>
>> (spark.read
>>     .option("multiline", "true")
>>     .json("path")
>>     .filter(col("edl_timestamp") > last_saved_timestamp))
>>
>> but I am getting the duplicate-column error that comes from the old
>> dataset. So I am trying to understand how Spark reads the data: does it
>> read the full dataset and then filter on the basis of the last saved
>> timestamp, or does it read only what is required? If the latter is true,
>> the read should have succeeded, since the latest data is correct.
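>>
>> For what it's worth, one way I am thinking of checking this is to look at
>> the physical plan (a minimal sketch; "path" and last_saved_timestamp are
>> placeholders, as above):
>>
>> from pyspark.sql.functions import col
>>
>> df = (spark.read
>>       .option("multiline", "true")
>>       .json("path")
>>       .filter(col("edl_timestamp") > last_saved_timestamp))
>>
>> # If Spark prunes partition directories, the physical plan lists the
>> # predicate under PartitionFilters instead of scanning every directory.
>> df.explain()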
>>
>> So I am just trying to understand the read behavior. Could anyone help
>> here? In the meantime, would the workaround sketched below avoid reading
>> the bad partitions at all?
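>>
>> A minimal sketch of that workaround (the bucket, prefix, and
>> valid_timestamps list are hypothetical): read only the good partition
>> directories, and set basePath so edl_timestamp is still recovered as a
>> partition column.
>>
>> base = "s3://my-bucket/dataset"        # hypothetical location
>> valid_timestamps = ["20220908000000"]  # e.g. the latest 10 days
>> good_paths = [f"{base}/edl_timestamp={ts}" for ts in valid_timestamps]
>>
>> df = (spark.read
>>       .option("multiline", "true")
>>       .option("basePath", base)        # keep the partition column
>>       .json(good_paths))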
>>
>> Thanks,
>> Sid
>>
>>
>>
