Hi Team,

Can somebody help?

Thanks,
Sid

On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:

> Hi,
>
> I already have a partitioned JSON dataset in S3, laid out like the below:
>
> edl_timestamp=20220908000000
>
> Now, the problem is that the earlier 10 days of collected data have a
> duplicate-column issue, because of which we couldn't read that data.
>
> Now the latest 10 days of data are proper. So, I am trying to do
> something like the below:
>
>
> spark.read.option("multiline", "true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)
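>
> For reference, a self-contained version of what I am running; the bucket
> path and the timestamp literal below are placeholders, not my real values:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>
> spark = SparkSession.builder.getOrCreate()
>
> # "s3://my-bucket/dataset/" and the timestamp literal are placeholders.
> # Read the whole partitioned dataset, then keep only the newer partitions.
> df = (
>     spark.read.option("multiline", "true")
>     .json("s3://my-bucket/dataset/")
>     .filter(col("edl_timestamp") > "20220908000000")
> )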
>
> but I am getting the duplicate-column error that comes from the old
> dataset. So, I am trying to understand how Spark reads the data. Does it
> read the full dataset and then filter on the basis of the last saved
> timestamp, or does it read only the partitions that are required? If the
> second case is true, then it should have read the data without errors,
> since the latest data is correct.
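>
> For example, I was wondering: if Spark first scans files just to infer
> the JSON schema, before the filter prunes anything, that would explain
> the error. Would passing an explicit schema, so that inference is
> skipped, avoid touching the old partitions? A sketch of what I mean
> (the field names here are made up, not my real columns):
>
> from pyspark.sql.functions import col
> from pyspark.sql.types import StructField, StructType, StringType
>
> # Hypothetical schema; my real columns differ. With an explicit schema,
> # Spark should skip JSON schema inference, so it would not need to open
> # files in the old partitions while planning the query.
> schema = StructType([
>     StructField("id", StringType(), True),
>     StructField("payload", StringType(), True),
> ])
>
> df = (
>     spark.read.schema(schema)
>     .option("multiline", "true")
>     .json("s3://my-bucket/dataset/")
>     .filter(col("edl_timestamp") > "20220908000000")
> )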
>
> So I am just trying to understand the read behaviour. Could anyone help here?
>
> Thanks,
> Sid
>
