If with "won't affect the performance" you mean "parquet is splittable though it uses snappy", then yes. Splittable files allow for optimal parallelization, which "won't affect performance".

When Spark writes data, it already splits the data into multiple files (here, parquet files). Even if each individual file were not splittable, your data has already been split. Splittable parquet files allow for more granularity (further splitting of your data) in case those files are big.
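For illustration, a minimal PySpark sketch of both levels of splitting (the path and partition count are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # Spark already splits the output: one parquet file per partition.
    df.repartition(8).write.option("compression", "snappy").parquet("/tmp/out")

    # On read, big files can additionally be split at row group
    # boundaries, so parallelism is not capped by the file count.
    print(spark.read.parquet("/tmp/out").rdd.getNumPartitions())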

Enrico


On 14.09.22 at 21:57, Sid wrote:
Okay, so you mean to say that parquet compresses the denormalized data using snappy, so it won't affect the performance.

Using snappy on its own, however, would affect the performance.

Am I correct?

On Thu, 15 Sep 2022, 01:08 Amit Joshi, <mailtojoshia...@gmail.com> wrote:

    Hi Sid,

    Snappy itself is not splittable. But a format that contains the
    actual data, like parquet (which is basically divided into row
    groups), can be compressed using snappy.
    This works because the blocks (pages of the parquet format)
    inside the parquet file can be compressed independently with
    snappy.
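
    As a rough illustration, that structure can be inspected with
    pyarrow (the file name here is hypothetical):

        import pyarrow.parquet as pq

        # Each row group is an independently readable unit; each column
        # chunk inside it carries its own snappy-compressed pages.
        pf = pq.ParquetFile("part-00000.snappy.parquet")
        print(pf.metadata.num_row_groups)
        print(pf.metadata.row_group(0).column(0).compression)  # SNAPPY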

    Thanks
    Amit

    On Wed, Sep 14, 2022 at 8:14 PM Sid <flinkbyhe...@gmail.com> wrote:

        Hello experts,

        I know that Gzip and snappy files are not splittable, i.e. the
        data won't be distributed into multiple blocks; rather, Spark
        would try to load the data into a single partition/block.
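
        For example (the path is hypothetical), a whole-file gzip loads
        as a single partition no matter how big it is:

            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()
            # One gzip stream => one read task, regardless of file size.
            print(spark.read.text("/data/big.txt.gz").rdd.getNumPartitions())  # 1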

        So, my question is: when I write the parquet data via Spark,
        it gets stored at the destination as something like
        part*.snappy.parquet.

        So, when I read this data, will it affect my performance?

        Please help me out if there is any gap in my understanding.

        Thanks,
        Sid
