Hi,

I already have a JSON dataset in S3, partitioned like the below:

edl_timestamp=20220908000000

Now, the problem is that for the earlier 10 days of data collection there was a
duplicate-column issue, because of which we couldn't read that data.

The latest 10 days of data are fine, so I am trying to do something like the
below:

spark.read.option("multiline", "true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)

but I am still getting the duplicate-column error coming from the old part of
the dataset. So I am trying to understand how Spark reads the data here: does
it read the full dataset and then filter on the last saved timestamp, or does
it read only the partitions the filter requires? If it is the second case, the
read should have succeeded, since the latest data is correct.
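
If Spark does scan everything up front (for example, to infer the schema), I
was wondering whether passing an explicit schema would avoid touching the old
files. Something like the below is what I have in mind (the field names, types,
and path are just placeholders, not my real dataset; last_saved_timestamp is
the same variable as above):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema -- my real dataset has different columns.
# The partition column edl_timestamp should still be picked up from the
# directory names even though it is not listed here.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", LongType(), True),
])

df = (spark.read
      .schema(schema)                        # no schema inference over the old files
      .option("multiline", "true")
      .json("s3://my-bucket/my-dataset/")    # placeholder path
      .filter(col("edl_timestamp") > last_saved_timestamp))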

So I am just trying to understand the behaviour. Could anyone help here?
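
The only other workaround I can think of is to point the read only at the newer
partition directories instead of the whole dataset. A rough sketch of what I
mean is below (the bucket name, prefix, and dates are made up for illustration):

# Only the partitions I know are clean -- values here are placeholders
new_partitions = [
    "s3://my-bucket/my-dataset/edl_timestamp=20220908000000",
    "s3://my-bucket/my-dataset/edl_timestamp=20220909000000",
]

df = (spark.read
      .option("multiline", "true")
      .option("basePath", "s3://my-bucket/my-dataset/")  # keep edl_timestamp as a column
      .json(new_partitions))

Is something like that the recommended way, or does the partition filter in my
original snippet already avoid reading the old partitions?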

Thanks,
Sid
