Hi Enrico,

Thanks for the insights.
Could you please help me understand, with one example, how a compressed file would not be split into partitions, so that the load lands on a single partition and might lead to an OOM error?

Thanks,
Sid

On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack <i...@enrico.minack.dev> wrote:

> The RAM and disk consumption depends on what you do with the data
> after reading it.
>
> Your particular action will read 20 lines from the first partition and
> show them. So it will not use any RAM or disk, no matter how large the CSV
> is.
>
> If you do a count instead of a show, it will iterate over each partition
> and return a count per partition, so no RAM is needed here either.
>
> If you do some real processing of the data, the RAM and disk required
> again depend on the shuffles involved and the intermediate results that
> need to be stored in RAM or on disk.
>
> Enrico
>
>
> On 22.06.22 at 14:54, Deepak Sharma wrote:
>
> It will spill to disk if everything can't be loaded in memory.
>
> On Wed, 22 Jun 2022 at 5:58 PM, Sid <flinkbyhe...@gmail.com> wrote:
>
>> I have a 150 TB CSV file.
>>
>> I have a total of 100 TB RAM and 100 TB disk. So if I do something like
>> this:
>>
>> spark.read.option("header","true").csv(filepath).show(false)
>>
>> will it lead to an OOM error since it doesn't have enough memory? Or
>> will it spill data onto the disk and process it?
>>
>> Thanks,
>> Sid
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
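To make the question concrete: whether Spark can split a file into partitions comes down to whether a reader can start decoding at an arbitrary byte offset. Below is a minimal Python sketch (plain `gzip`/`zlib` and file I/O, no Spark, file names made up for the illustration) of why a gzip-compressed CSV cannot be split the way a plain-text CSV can:

```python
import gzip
import os
import tempfile
import zlib

# Write the same CSV content as plain text and as gzip.
rows = "".join(f"id{i},value{i}\n" for i in range(10000)).encode()
tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "data.csv")
gz_path = os.path.join(tmp, "data.csv.gz")
with open(plain_path, "wb") as f:
    f.write(rows)
with gzip.open(gz_path, "wb") as f:
    f.write(rows)

# Plain text is splittable: a worker can seek to any offset, skip to the
# next newline, and start parsing complete records from there.
with open(plain_path, "rb") as f:
    f.seek(len(rows) // 2)
    f.readline()              # discard the partial line at the split point
    line = f.readline()
    mid_ok = line.endswith(b"\n")   # a complete record was recovered

# Gzip is not splittable: decompression must start at byte 0 of the
# stream, so decoding from the middle of the file fails outright.
with open(gz_path, "rb") as f:
    data = f.read()
try:
    zlib.decompress(data[len(data) // 2:], wbits=zlib.MAX_WBITS | 16)
    gz_ok = True
except zlib.error:
    gz_ok = False

print(mid_ok, gz_ok)  # → True False
```

This is why `spark.read.csv("data.csv.gz")` yields a single partition for the whole file, however large it is: one task has to decompress and parse everything, which is what can push a single executor toward OOM (or at least make it the bottleneck). Splittable inputs (uncompressed text, bzip2, or block-compressed columnar formats like Parquet/ORC with snappy) avoid this.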