Hi Enrico,

Thanks for the insights.

Could you please help me understand with an example: which compressed file
formats are not split into partitions, so that the entire file lands on a
single partition and might lead to an OOM error?
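
To check my understanding, is it something like this? (a rough sketch; the
gzipped path is hypothetical)

    // gzip is not a splittable codec, so a single .csv.gz file is read
    // by one task into one partition, however large the file is
    val df = spark.read.option("header", "true").csv("/data/big.csv.gz")
    println(df.rdd.getNumPartitions)  // prints 1 for a single gzipped file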

Thanks,
Sid

On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack <i...@enrico.minack.dev>
wrote:

> The RAM and disk memory consumption depends on what you do with the data
> after reading it.
>
> Your particular action will read only the first 20 rows from the first
> partition and show them. So it uses hardly any RAM and no disk, no matter
> how large the CSV is.
>
> If you do a count instead of a show, it will iterate over each partition
> and return a count per partition, so hardly any RAM is needed here either.
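>
> For example, the count variant of your snippet would be:
>
>     spark.read.option("header", "true").csv(filepath).count()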
>
> If you do some real processing of the data, the required RAM and disk
> again depend on the shuffles involved and the intermediate results that
> need to be stored in RAM or on disk.
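>
> For example, a hypothetical aggregation (the column name and output path
> are placeholders):
>
>     spark.read.option("header", "true").csv(filepath)
>       .groupBy("some_column")      // forces a shuffle
>       .count()
>       .write.csv("/some/output/path")
>
> The groupBy forces a shuffle, so shuffle files go to local disk and the
> aggregation buffers need RAM.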
>
> Enrico
>
>
> On 22.06.22 at 14:54, Deepak Sharma wrote:
>
> It will spill to disk if everything can't be loaded into memory.
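>
> One place this trade-off is explicit is the storage level used when
> caching, e.g. (a minimal sketch, assuming df is the DataFrame read below):
>
>     import org.apache.spark.storage.StorageLevel
>     df.persist(StorageLevel.MEMORY_AND_DISK)  // keeps what fits in RAM, spills the rest to disk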
>
>
> On Wed, 22 Jun 2022 at 5:58 PM, Sid <flinkbyhe...@gmail.com> wrote:
>
>> I have a 150TB CSV file.
>>
>> I have a total of 100 TB RAM and 100TB disk. So If I do something like
>> this
>>
>> spark.read.option("header","true").csv(filepath).show(false)
>>
>> Will it lead to an OOM error since there isn't enough memory, or will it
>> spill data onto the disk and process it?
>>
>> Thanks,
>> Sid
>>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
