Hi, I already have a partitioned JSON dataset in S3, partitioned like the below:

edl_timestamp=20220908000000

The problem is that in the first 10 days of data collection there was a duplicate-column issue, because of which we couldn't read that data. The latest 10 days of data are fine. So I am trying to do something like this:

spark.read.option("multiline", "true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)

but I am still hitting the duplicate-column error that comes from the old dataset. So I am trying to understand how Spark reads the data. Does it read the full dataset and then filter on the last saved timestamp, or does it read only the partitions that are required? If the second case is true, the read should have succeeded, since the latest data is correct.

Could anyone help here? Thanks, Sid
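One workaround I have been considering (a sketch only, with hypothetical bucket names, partition values, and a made-up `partition_value` helper; in practice the partition list would come from an S3 listing): pre-filter the partition paths in Python before calling `spark.read`, so Spark never touches the broken files at all, not even during schema inference.

```python
# Assumed value for illustration only.
last_saved_timestamp = 20220908000000

# Hypothetical partition paths; in reality these would be listed from S3.
all_partitions = [
    "s3://bucket/dataset/edl_timestamp=20220901000000",
    "s3://bucket/dataset/edl_timestamp=20220908000000",
    "s3://bucket/dataset/edl_timestamp=20220915000000",
]

def partition_value(path):
    # Extract the numeric timestamp after "edl_timestamp=" in the path.
    return int(path.rsplit("edl_timestamp=", 1)[1])

# Keep only partitions strictly newer than the last saved timestamp.
new_paths = [p for p in all_partitions
             if partition_value(p) > last_saved_timestamp]

# Then pass only these paths to Spark, e.g.:
# spark.read.option("basePath", "s3://bucket/dataset") \
#     .option("multiline", "true").json(new_paths)
```

The idea is that `spark.read.json(...)` with a filter still has to infer a schema, and (depending on how Spark samples files for inference) that step can touch the old partitions and fail on the duplicate columns even though the filter itself prunes them; restricting the input paths (or supplying an explicit schema via `spark.read.schema(...)`) avoids that. I am not certain this is what is happening here, hence the question.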