"*but I am getting the issue of the duplicate column which was present in the old dataset.*"
So you have answered your own question!

spark.read.option("multiline","true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)

As you have figured out, Spark reads all the JSON files in "path" and then filters. Some file formats allow filters to be applied before the files are read; the one I know about is Parquet, as this link explains:

Spark: Understand the Basic of Pushed Filter and Partition Filter Using Parquet File
<https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd>

On Tue, 5 Jul 2022 at 21:21, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I still need help in understanding how reading works exactly.
>
> Thanks,
> Sid
>
> On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> Can somebody help?
>>
>> Thanks,
>> Sid
>>
>> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I already have a partitioned JSON dataset in S3 like the below:
>>>
>>> edl_timestamp=20220908000000
>>>
>>> Now, the problem is that in the earlier 10 days of data collection
>>> there was a duplicate-column issue due to which we couldn't read the
>>> data. The latest 10 days of data are proper. So I am trying to do
>>> something like the below:
>>>
>>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)
>>>
>>> but I am getting the duplicate-column issue which was present in the
>>> old dataset. So I am trying to understand how Spark reads the data.
>>> Does it read the full dataset and then filter on the basis of the
>>> last saved timestamp, or does it read only what is required? If the
>>> second case is true, then it should have read the data, since the
>>> latest data is correct.
>>>
>>> So I am just trying to understand. Could anyone help here?
>>>
>>> Thanks,
>>> Sid

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
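[A minimal sketch of one practical workaround implied by the reply above: since the dataset is laid out in Hive-style edl_timestamp=... partition directories, you can select only the directories newer than the saved timestamp yourself and hand just those paths to spark.read.json, so the broken old files are never opened. The bucket name, directory values, and the select_new_partitions helper are made up for illustration; the Spark calls are left as comments because they need a live session.]

```python
# Sketch: prune JSON partitions by path before Spark ever reads them.
# Assumes a Hive-style layout like s3://bucket/data/edl_timestamp=YYYYMMDDHHMMSS/
# (bucket and values are hypothetical).

def select_new_partitions(partition_dirs, last_saved_timestamp):
    """Keep only partition directories whose edl_timestamp value is
    strictly greater than last_saved_timestamp."""
    selected = []
    for d in partition_dirs:
        # The last path segment looks like "edl_timestamp=20220908000000".
        key, _, value = d.rstrip("/").rpartition("/")[2].partition("=")
        if key == "edl_timestamp" and int(value) > last_saved_timestamp:
            selected.append(d)
    return selected

dirs = [
    "s3://bucket/data/edl_timestamp=20220829000000/",  # old, broken files
    "s3://bucket/data/edl_timestamp=20220908000000/",  # new, clean files
]
paths = select_new_partitions(dirs, last_saved_timestamp=20220901000000)
# paths now holds only the new partition directory.

# Then read only those paths, so the files with duplicate columns are skipped:
# df = spark.read.option("multiline", "true").json(paths)
```

[With Parquet, by contrast, a filter on the partition column can be applied as a partition filter before any data files are opened, which is what the linked article describes.]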