If with "won't affect the performance" you mean "parquet is splittable though it uses snappy", then yes. Splittable files allow for optimal parallelization, which "won't affect performance".

When Spark writes data, it already splits the data into multiple files (here, parquet files). Even if each individual file were not splittable, your data has already been split. Splittable parquet files allow for more granularity (further splitting of your data) in case those files are big.
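For illustration, a minimal PySpark sketch of both levels of splitting (the path and partition count are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # Spark already splits the output: one parquet file per partition.
    df.repartition(8).write.option("compression", "snappy").parquet("/tmp/out")

    # On read, big files can additionally be split at row group
    # boundaries, so parallelism is not capped by the file count.
    print(spark.read.parquet("/tmp/out").rdd.getNumPartitions())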

Enrico


On 14.09.22 at 21:57, Sid wrote:
Okay, so you mean to say that parquet compresses the denormalized data using snappy, so it won't affect the performance.

Using snappy on its own, however, would affect the performance.

Am I correct?

On Thu, 15 Sep 2022, 01:08 Amit Joshi, <mailtojoshia...@gmail.com> wrote:

    Hi Sid,

    Snappy itself is not splittable. But a format that contains the
    actual data, like parquet (which is basically divided into row
    groups), can be compressed using snappy.
    This works because the blocks (pages of the parquet format)
    inside the parquet file can be compressed independently with
    snappy.
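
    As a rough illustration, that structure can be inspected with
    pyarrow (the file name here is hypothetical):

        import pyarrow.parquet as pq

        # Each row group is an independently readable unit; each column
        # chunk inside it carries its own snappy-compressed pages.
        pf = pq.ParquetFile("part-00000.snappy.parquet")
        print(pf.metadata.num_row_groups)
        print(pf.metadata.row_group(0).column(0).compression)  # SNAPPY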

    Thanks
    Amit

    On Wed, Sep 14, 2022 at 8:14 PM Sid <flinkbyhe...@gmail.com> wrote:

        Hello experts,

        I know that Gzip and snappy files are not splittable, i.e. the
        data won't be distributed into multiple blocks; rather, Spark
        would try to load the data into a single partition/block.
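
        For example (the path is hypothetical), a whole-file gzip loads
        as a single partition no matter how big it is:

            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()
            # One gzip stream => one read task, regardless of file size.
            print(spark.read.text("/data/big.txt.gz").rdd.getNumPartitions())  # 1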

        So, my question is: when I write the parquet data via Spark,
        it gets stored at the destination as something like
        part*.snappy.parquet.

        So, when I read this data, will it affect my performance?

        Please help me out if there is any gap in my understanding.

        Thanks,
        Sid
