Ehh.. What is a "*duplicate column*"? I don't think Spark supports that.

duplicate column = duplicate rows


tir. 5. jul. 2022 kl. 22:13 skrev Bjørn Jørgensen <bjornjorgen...@gmail.com
>:

> "*but I am getting the issue of the duplicate column which was present in
> the old dataset.*"
>
> So you have answered your question!
>
> spark.read.option("multiline","true").json("path").filter(
> col("edl_timestamp")>last_saved_timestamp)
>
> As you have figured out, Spark reads all the JSON files in "path" first and
> then filters the result.
>
> Some file formats support applying filters before the files are read. The
> one that I know about is Parquet, as this link explains: Spark:
> Understand the Basic of Pushed Filter and Partition Filter Using Parquet
> File
> <https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd>
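The difference can be illustrated with a minimal, Spark-free sketch (the `read_all_then_filter` and `prune_then_read` helpers below are hypothetical stand-ins, not Spark APIs): with plain JSON every partition's files are opened before the filter runs, so a corrupt old partition still breaks the read, whereas partition pruning compares the directory name first and never opens the filtered-out partitions.

```python
# Sketch (not Spark): why "read everything, then filter" still trips over
# corrupt old partitions, while pruning on the partition value never opens
# them. Partition names follow the layout from this thread:
# edl_timestamp=YYYYMMDDHHMMSS.

partitions = {
    "edl_timestamp=20220828000000": "corrupt",   # old data, duplicate columns
    "edl_timestamp=20220908000000": "good",      # new data
}

def load(partition):
    """Stand-in for opening one partition's JSON files."""
    if partitions[partition] == "corrupt":
        raise ValueError(f"duplicate column in {partition}")
    return partition

def read_all_then_filter(last_saved):
    # JSON behavior: every file is read, *then* the filter is applied.
    rows = [load(p) for p in partitions]          # fails on the corrupt one
    return [r for r in rows if r.split("=")[1] > last_saved]

def prune_then_read(last_saved):
    # Partition pruning: compare the directory name first, open only matches.
    keep = [p for p in partitions if p.split("=")[1] > last_saved]
    return [load(p) for p in keep]                # corrupt one never opened

last_saved = "20220901000000"
try:
    read_all_then_filter(last_saved)
except ValueError as e:
    print("read-all failed:", e)
print("pruned read:", prune_then_read(last_saved))
```

The string comparison works here only because the timestamps are fixed-width `YYYYMMDDHHMMSS` values, so lexicographic order matches chronological order.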
>
>
>
>
>
> tir. 5. jul. 2022 kl. 21:21 skrev Sid <flinkbyhe...@gmail.com>:
>
>> Hi Team,
>>
>> I still need help understanding exactly how the read works.
>>
>> Thanks,
>> Sid
>>
>> On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> Can somebody help?
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I already have a partitioned JSON dataset in s3 like the below:
>>>>
>>>> edl_timestamp=20220908000000
>>>>
>>>> Now, the problem is that the earlier 10 days of collected data have a
>>>> duplicate-column issue, because of which we couldn't read the data.
>>>>
>>>> Now the latest 10 days of data are proper. So, I am trying to do
>>>> something like the below:
>>>>
>>>>
>>>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)
>>>>
>>>> but I am getting the duplicate-column issue that was present in the old
>>>> dataset. So, I am trying to understand how Spark reads the data. Does it
>>>> read the full dataset and then filter on the basis of the last saved
>>>> timestamp, or does it read only what is required? If the second case
>>>> were true, it should have read the data, since the latest data is
>>>> correct.
>>>>
>>>> So just trying to understand. Could anyone help here?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>>
>>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


