Hi team,

Can somebody help with the question below?
Thanks,
Sid

On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:

> Hi,
>
> I already have a partitioned JSON dataset in S3, laid out like this:
>
> edl_timestamp=20220908000000
>
> The problem is that the first 10 days of collected data have a
> duplicate-column issue, which made them unreadable. The latest 10 days
> of data are fine. So I am trying something like this:
>
> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)
>
> but I am still hitting the duplicate-column error that only exists in
> the old data. So I am trying to understand how Spark reads the data:
> does it read the full dataset and then filter on the last saved
> timestamp, or does it read only the partitions the filter requires? If
> the second case were true, the read should have succeeded, since the
> latest data is correct.
>
> So I am just trying to understand. Could anyone help here?
>
> Thanks,
> Sid
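One likely culprit worth checking: spark.read.json() infers a schema by scanning the files it was pointed at *before* any .filter() runs, so even though the partition filter prunes which rows are read, schema inference can still touch the old corrupt partitions and trip over the duplicate column. A workaround is to avoid handing Spark the bad partitions at all, by listing the partition directories yourself and passing only the good ones to the reader. The sketch below does the partition selection in plain Python; the bucket paths, `last_saved_timestamp`, and `partitions_after` helper are all hypothetical names, not anything from the original thread.

```python
# Sketch, assuming Hive-style partition directories named
# "edl_timestamp=<value>" under a common base path.

def partitions_after(partition_dirs, last_saved_timestamp):
    """Keep only partition directories strictly newer than the checkpoint.

    partition_dirs: list of paths like
        's3://bucket/data/edl_timestamp=20220908000000'
    last_saved_timestamp: string like '20220901000000'
    """
    keep = []
    for d in partition_dirs:
        # Pull the partition value out of the directory name.
        value = d.rstrip("/").rsplit("edl_timestamp=", 1)[-1]
        if int(value) > int(last_saved_timestamp):
            keep.append(d)
    return keep


# Hypothetical layout: one old corrupt partition, one new clean one.
dirs = [
    "s3://bucket/data/edl_timestamp=20220830000000",  # old, duplicate-column issue
    "s3://bucket/data/edl_timestamp=20220908000000",  # new, clean
]
good = partitions_after(dirs, "20220901000000")
# good -> ["s3://bucket/data/edl_timestamp=20220908000000"]

# Then read just those paths, so Spark never lists or infers over the
# corrupt files (requires an active SparkSession named `spark`):
# df = (spark.read
#           .option("multiline", "true")
#           .option("basePath", "s3://bucket/data")  # keeps edl_timestamp as a column
#           .json(good))
```

Another option, if the duplicate columns only differ in the old files, is to pass an explicit schema via spark.read.schema(...), which skips inference entirely; whether that tolerates the duplicates depends on how they appear in the JSON, so it would need testing against a sample of the old data.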