Yeah, I understand it now.

Thanks for the explanation, Bjorn.

Sid

On Wed, Jul 6, 2022 at 1:46 AM Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> Ehh.. What is "*duplicate column*"? I don't think Spark supports that.
>
> Do you mean duplicate column = duplicate rows?
>
>
> On Tue, Jul 5, 2022 at 10:13 PM Bjørn Jørgensen <
> bjornjorgen...@gmail.com> wrote:
>
>> "*but I am getting the issue of the duplicate column which was present
>> in the old dataset.*"
>>
>> So you have answered your question!
>>
>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)
>>
>> As you have figured out, Spark reads all the JSON files in "path" first and
>> then filters.
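>>
>> As a rough sketch of one way around that (assuming PySpark, an existing
>> SparkSession named spark, and a hypothetical s3://bucket/path layout): list
>> the partition directories yourself, keep only the ones newer than your
>> checkpoint, and pass just those paths to the reader, so the old files with
>> the duplicate column should never be opened or used for schema inference:
>>
>> # Hypothetical values; the real partition list would come from listing
>> # the S3 prefix (for example with boto3).
>> last_saved_timestamp = "20220908000000"
>> all_partitions = ["20220907000000", "20220908000000", "20220909000000"]
>>
>> # Keep only partitions newer than the checkpoint and build their paths.
>> new_paths = [
>>     f"s3://bucket/path/edl_timestamp={p}"
>>     for p in all_partitions
>>     if p > last_saved_timestamp
>> ]
>>
>> # Spark only opens the listed directories; basePath keeps edl_timestamp
>> # as a column in the result.
>> df = (spark.read
>>       .option("multiline", "true")
>>       .option("basePath", "s3://bucket/path")
>>       .json(new_paths))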
>>
>> There are some file formats that support applying filters before the files
>> are read. The one I know of is Parquet, as this article explains: Spark:
>> Understand the Basic of Pushed Filter and Partition Filter Using Parquet
>> File
>> <https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd>
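>>
>> A small sketch of how to check that (the Parquet path is a hypothetical
>> placeholder): the FileScan node in the physical plan lists PartitionFilters
>> and PushedFilters, i.e. the predicates applied before rows are returned:
>>
>> from pyspark.sql.functions import col
>>
>> df = (spark.read
>>       .parquet("s3://bucket/path_parquet")
>>       .filter(col("edl_timestamp") > "20220908000000"))
>>
>> # The FileScan parquet node in the printed plan shows PartitionFilters /
>> # PushedFilters for this predicate.
>> df.explain()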
>>
>>
>> On Tue, Jul 5, 2022 at 9:21 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I still need help understanding how exactly the reading works.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Mon, Jun 20, 2022 at 2:23 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> Can somebody help?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Sun, Jun 19, 2022 at 3:51 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I already have a partitioned JSON dataset in S3, laid out like the below:
>>>>>
>>>>> edl_timestamp=20220908000000
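>>>>>
>>>>> (For context, this is the layout Spark produces when writing partitioned
>>>>> by that column; a minimal sketch, with a hypothetical bucket/path and an
>>>>> existing DataFrame df that has an edl_timestamp column:)
>>>>>
>>>>> # Each distinct edl_timestamp value becomes its own subdirectory,
>>>>> # e.g. edl_timestamp=20220908000000/.
>>>>> (df.write
>>>>>    .mode("append")
>>>>>    .partitionBy("edl_timestamp")
>>>>>    .json("s3://bucket/path"))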
>>>>>
>>>>> Now, the problem is that the earlier 10 days of collected data had a
>>>>> duplicate-column issue, due to which we couldn't read the data.
>>>>>
>>>>> The latest 10 days of data are fine, so I am trying to do something
>>>>> like the below:
>>>>>
>>>>>
>>>>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)
>>>>>
>>>>> but I am still getting the duplicate-column error from the old dataset.
>>>>> So, I am trying to understand how Spark reads the data. Does it read the
>>>>> full dataset and then filter on the basis of the last saved timestamp, or
>>>>> does it read only what is required? If the second case were true, it
>>>>> should have been able to read the data, since the latest data is correct.
>>>>>
>>>>> So I am just trying to understand. Could anyone help here?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>>
>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
