If by "won't affect the performance" you mean "parquet is splittable even
though it uses snappy", then yes. Splittable files allow for optimal
parallelization, which "won't affect performance".
When Spark writes data, it already splits the data into multiple files
(here, parquet files). So even if each individual file were not
splittable, your data has been split already. Splittable parquet files
allow for more granularity (further splitting of your data) in case
those files are big.
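A minimal PySpark sketch of what I mean (the output path, partition
count, and app name are made up for illustration): a single write
already produces multiple snappy-compressed part files.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-split-demo").getOrCreate()

    # Small example DataFrame; in practice this would be your denormalized data.
    df = spark.range(0, 1000000)

    # Snappy is Spark's default parquet codec; set explicitly here for clarity.
    # repartition(8) forces eight write tasks, so the job produces eight
    # part-*.snappy.parquet files under the (hypothetical) output path.
    (df.repartition(8)
       .write.mode("overwrite")
       .option("compression", "snappy")
       .parquet("/tmp/demo_parquet"))

    # Reading back: each file (and each row group inside a big file) can be
    # scanned by a separate task, so parallelism is preserved.
    print(spark.read.parquet("/tmp/demo_parquet").rdd.getNumPartitions())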
Enrico
On 14.09.22 21:57, Sid wrote:
Okay, so you mean to say that parquet compresses the denormalized data
using snappy, so it won't affect the performance.
Only using snappy on its own (without a splittable format) would affect
the performance.
Am I correct?
On Thu, 15 Sep 2022, 01:08 Amit Joshi, <joshia...@gmail.com> wrote:
Hi Sid,
Snappy itself is not splittable. But a format that contains the
actual data, like parquet (which is basically divided into row
groups), can be compressed using snappy.
This works because the blocks (pages of the parquet format) inside a
parquet file are compressed independently with snappy.
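You can verify this with a small sketch using pyarrow, assuming you
have it installed; the path below is a hypothetical location of one of
the part files Spark wrote. It prints the codec recorded for each
column chunk of each row group:

    import glob
    import pyarrow.parquet as pq

    # Pick one of the part files Spark wrote (hypothetical path).
    path = glob.glob("/tmp/demo_parquet/part-*.snappy.parquet")[0]

    md = pq.ParquetFile(path).metadata
    print("row groups:", md.num_row_groups)

    # Each column chunk in each row group records its own codec, which
    # shows that snappy is applied inside the parquet structure, not to
    # the file as a whole -- this is why the file stays splittable.
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            cc = md.row_group(rg).column(col)
            print(rg, cc.path_in_schema, cc.compression)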
Thanks
Amit
On Wed, Sep 14, 2022 at 8:14 PM Sid <flinkbyhe...@gmail.com> wrote:
Hello experts,
I know that gzip and snappy files are not splittable, i.e. the data
won't be distributed into multiple blocks; rather, it would all be
loaded into a single partition/block.
So, my question is: when I write the parquet data via Spark, it
gets stored at the destination with file names like
part*.snappy.parquet.
So, when I read this data, will it affect my performance?
Please help me if there is any gap in my understanding.
Thanks,
Sid