"*but I am getting the issue of the duplicate column which was present in
the old dataset.*"

So you have answered your own question!

spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)

As you have figured out, Spark first reads all the JSON files under "path"
(at least to infer the schema) and only then applies the filter.
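One possible workaround, as a rough sketch only: pick the partition
directories yourself and hand Spark just those paths, so the old files with
duplicate columns are never opened. The helper names (partitions_after,
read_new_partitions), the bucket paths, and the way you obtain the list of
partition directories are all assumptions for illustration; they are not
from the original thread. It also assumes edl_timestamp values compare
correctly as integers.

```python
import re

# Matches partition directory names such as "edl_timestamp=20220908000000"
# (optionally with a trailing slash). Anything else is ignored.
_PARTITION = re.compile(r"edl_timestamp=(\d+)/?$")


def partitions_after(partition_dirs, last_saved_timestamp):
    """Return the partition directories whose edl_timestamp is > the cutoff."""
    cutoff = int(last_saved_timestamp)
    selected = []
    for d in partition_dirs:
        m = _PARTITION.search(d)
        if m and int(m.group(1)) > cutoff:
            selected.append(d)
    return sorted(selected)


def read_new_partitions(spark, base_path, partition_dirs, last_saved_timestamp):
    """Read only the selected partitions (requires a live SparkSession)."""
    paths = partitions_after(partition_dirs, last_saved_timestamp)
    if not paths:
        raise ValueError("no partitions newer than the cutoff")
    return (
        spark.read
        .option("multiline", "true")
        # basePath lets Spark keep edl_timestamp as a partition column
        # even though we pass individual partition directories.
        .option("basePath", base_path)
        .json(paths)
    )


# Hypothetical directory listing; in practice it would come from an S3 listing.
dirs = [
    "s3://bucket/data/edl_timestamp=20220901000000",
    "s3://bucket/data/edl_timestamp=20220908000000/",
    "s3://bucket/data/_SUCCESS",
]
new_dirs = partitions_after(dirs, 20220905000000)
```

With this approach Spark should infer the schema only from the files under
the selected directories, so the older, broken partitions stay out of the
picture.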

Some file formats allow filters to be applied before the files are read. The
one that I know about is Parquet, as this article explains: Spark: Understand
the Basic of Pushed Filter and Partition Filter Using Parquet File
<https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd>





On Tue, Jul 5, 2022 at 21:21, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I still need help understanding how exactly reading works.
>
> Thanks,
> Sid
>
> On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> Can somebody help?
>>
>> Thanks,
>> Sid
>>
>> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I already have a partitioned JSON dataset in s3 like the below:
>>>
>>> edl_timestamp=20220908000000
>>>
>>> Now, the problem is that in the first 10 days of data collection there
>>> was a duplicate-column issue, because of which we couldn't read the data.
>>>
>>> Now the latest 10 days of data are proper. So, I am trying to do
>>> something like the below:
>>>
>>>
>>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)
>>>
>>> but I am getting the issue of the duplicate column which was present in
>>> the old dataset. So, I am trying to understand how Spark reads the
>>> data. Does it read the full dataset and then filter on the basis of the
>>> last saved timestamp, or does it read only what is required? If the second
>>> case is true, then the read should have succeeded, since the latest data
>>> is correct.
>>>
>>> So just trying to understand. Could anyone help here?
>>>
>>> Thanks,
>>> Sid
>>>
>>>
>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
