Hi Team, I still need help understanding exactly how the read works.
Thanks,
Sid

On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> Can somebody help?
>
> Thanks,
> Sid
>
> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi,
>>
>> I already have a partitioned JSON dataset in S3, like the below:
>>
>> edl_timestamp=20220908000000
>>
>> Now, the problem is that in the earlier 10 days of data collection there
>> was a duplicate-columns issue, due to which we couldn't read the data.
>>
>> The latest 10 days of data are fine, so I am trying to do something like
>> the below:
>>
>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)
>>
>> but I am still getting the duplicate-column error that was present in the
>> old dataset. So I am trying to understand how Spark reads the data. Does
>> it read the full dataset and then filter on the basis of the last saved
>> timestamp, or does it read only what is required? If the second case is
>> true, it should have been able to read the data, since the latest data is
>> correct.
>>
>> So I am just trying to understand. Could anyone help here?
>>
>> Thanks,
>> Sid
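For context on the question above: when `spark.read.json("path")` is called without a user-supplied schema, Spark first scans the files to infer a schema, so the corrupt older partitions can fail the job even though a later `.filter` on the partition column would exclude them. One workaround is to prune the partition directories before Spark ever reads them and pass only the good paths to the reader. A minimal sketch of that path-selection step (the bucket layout and helper name are hypothetical; only the `edl_timestamp=...` directory pattern comes from the email):

```python
def select_partition_paths(paths, last_saved_timestamp):
    """Keep only paths whose edl_timestamp partition value is newer
    than last_saved_timestamp, so a reader never touches the older,
    corrupt partitions."""
    selected = []
    marker = "edl_timestamp="
    for p in paths:
        # Partition directories look like .../edl_timestamp=20220908000000
        idx = p.find(marker)
        if idx == -1:
            continue  # not a partition directory; skip it
        value = p[idx + len(marker):].strip("/").split("/")[0]
        if int(value) > int(last_saved_timestamp):
            selected.append(p)
    return selected

# Hypothetical example paths illustrating the layout from the email:
paths = [
    "s3://bucket/data/edl_timestamp=20220901000000",
    "s3://bucket/data/edl_timestamp=20220908000000",
    "s3://bucket/data/edl_timestamp=20220915000000",
]
good = select_partition_paths(paths, 20220908000000)
# good now holds only the partitions newer than the saved timestamp
```

The selected paths could then be passed to the reader, e.g. `spark.read.option("basePath", "s3://bucket/data").json(good)`, where `basePath` is the standard Spark file-source option that keeps the partition column derivable from the directory names. Alternatively, supplying an explicit schema via `.schema(...)` avoids the inference pass over the broken files, though it will not help if individual JSON records themselves contain duplicate keys.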