Yeah, I understood that now. Thanks for the explanation, Bjørn.
Sid

On Wed, Jul 6, 2022 at 1:46 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> Ehh.. What is "*duplicate column*"? I don't think Spark supports that.
>
> duplicate column = duplicate rows
>
>
> On Tue, Jul 5, 2022 at 10:13 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> "*but I am getting the issue of the duplicate column which was present
>> in the old dataset.*"
>>
>> So you have answered your own question!
>>
>> spark.read.option("multiline","true").json("path").filter(
>> col("edl_timestamp")>last_saved_timestamp)
>>
>> As you have figured out, Spark reads all the JSON files in "path" and
>> then filters.
>>
>> There are some file formats that can apply filters before the files are
>> read. The one that I know about is Parquet, as this link explains: Spark:
>> Understand the Basic of Pushed Filter and Partition Filter Using Parquet
>> File
>> <https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd>
>>
>> On Tue, Jul 5, 2022 at 9:21 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I still need help in understanding how reading works, exactly.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> Can somebody help?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I already have a partitioned JSON dataset in S3, like the below:
>>>>>
>>>>> edl_timestamp=20220908000000
>>>>>
>>>>> Now, the problem is that in the earlier 10 days of data collection
>>>>> there was a duplicate-columns issue, due to which we couldn't read the
>>>>> data.
>>>>>
>>>>> The latest 10 days of data are proper. So, I am trying to do
>>>>> something like the below:
>>>>>
>>>>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)
>>>>>
>>>>> but I am getting the issue of the duplicate column which was present
>>>>> in the old dataset. So, I am trying to understand how Spark reads the
>>>>> data. Does it read the full dataset and then filter on the basis of
>>>>> the last saved timestamp, or does it read only what is required? If
>>>>> the second case is true, then it should have been able to read the
>>>>> data, since the latest data is correct.
>>>>>
>>>>> So I am just trying to understand. Could anyone help here?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norway
>>
>> +47 480 94 297
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norway
>
> +47 480 94 297
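
The likely mechanics behind Bjørn's answer: for JSON, Spark infers the schema at spark.read time, which means it opens every file under "path" before the filter() is even known, so the corrupt old partitions break the read. A minimal PySpark sketch of one workaround, assuming a hypothetical layout s3://bucket/data/edl_timestamp=<value>/ (the bucket name, base_path, and cutoff value below are illustrative, not from the thread): list the partition directories up front and pass only the fresh ones to the reader, so the bad files are never opened.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumptions (illustrative, not from the thread): one directory per
    # partition value, and a fixed-width numeric cutoff so plain string
    # comparison of directory names is safe.
    base_path = "s3://bucket/data"
    last_saved_timestamp = "20220908000000"

    # List the partition directories with the Hadoop FileSystem API that
    # ships with Spark, keeping only those newer than the cutoff.
    hadoop_path = spark._jvm.org.apache.hadoop.fs.Path
    fs = hadoop_path(base_path).getFileSystem(spark._jsc.hadoopConfiguration())
    fresh_paths = [
        str(s.getPath())
        for s in fs.listStatus(hadoop_path(base_path))
        if s.isDirectory()
        and s.getPath().getName().split("=", 1)[-1] > last_saved_timestamp
    ]

    # "basePath" keeps edl_timestamp available as a partition column even
    # though only the selected leaf directories are passed to the reader.
    df = (spark.read
          .option("multiline", "true")
          .option("basePath", base_path)
          .json(fresh_paths))

An alternative on Spark 3.1+ is the modifiedAfter read option, which skips files before reading, but it keys on file modification time rather than the partition value, so the directory-listing approach above is the more direct fit here.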
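
And a short sketch of the Parquet behaviour from the linked article, again with a hypothetical path: when the data is Parquet and the predicate is on the partition column, pruning happens at file-listing time, which the physical plan makes visible.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .parquet("s3://bucket/parquet_data")  # hypothetical Parquet copy of the data
          .filter(col("edl_timestamp") > 20220908000000))

    # The FileScan node of the plan shows "PartitionFilters: [...]":
    # non-matching partition directories are skipped before any file is
    # opened. Predicates on ordinary data columns appear as "PushedFilters"
    # and are evaluated inside the Parquet reader instead.
    df.explain()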