Hi Gang,

Even with Parquet v2.0, I see that Spark is still writing Parquet files with RLE encoding for INT and DOUBLE columns. Is there anything additional that we need to configure?
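For context, this is roughly how we are invoking the write. It's a minimal sketch: the dataset, object name, and output path are made up, and I'm assuming that Spark forwards DataFrameWriter options into the Parquet Hadoop configuration (the option keys themselves come from the parquet-hadoop README that you linked).

import org.apache.spark.sql.SparkSession

object ParquetV2WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-v2-encoding-test")
      .getOrCreate()

    // Illustrative data; any DataFrame with INT and DOUBLE columns will do.
    val df = spark.range(0, 1000000)
      .selectExpr("cast(id as int) as id", "cast(id as double) * 0.1 as value")

    df.write
      // parquet-mr writer version; our understanding is that this should
      // enable the v2 encodings such as DELTA_BINARY_PACKED.
      .option("parquet.writer.version", "PARQUET_2_0")
      // We also tried disabling dictionary encoding to force the
      // non-dictionary value writers.
      .option("parquet.enable.dictionary", "false")
      .mode("overwrite")
      .parquet("/tmp/parquet_v2_test")

    spark.stop()
  }
}

We get the same result when setting the same keys on spark.sparkContext.hadoopConfiguration before the write, and we are checking the encodings actually used via the column-chunk metadata reported by parquet-tools.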
On Tue, Feb 27, 2024 at 5:37 PM Gang Wu <[email protected]> wrote:

> Hi Ridha,
>
> DELTA_BINARY_PACKED is enabled for parquet v2 in the parquet-mr
> implementation. Have you tried to set `parquet.writer.version` [1] to
> PARQUET_2_0 in the Spark job? I'm not sure if this helps.
>
> [1] https://github.com/apache/parquet-mr/blob/86f90f57b7858ea1eede7bb8b6946c649d74f7e1/parquet-hadoop/README.md?plain=1#L130
>
> Best,
> Gang
>
> On Wed, Feb 28, 2024 at 1:39 AM Ridha Khan <[email protected]> wrote:
>
> > Hi Team,
> >
> > Hope you're all doing well.
> > This is a query regarding the Parquet encoding used by Spark.
> >
> > We are interested in reducing the Parquet file size to as small as
> > possible. Looking at the nature of our data, DELTA_BINARY_PACKED seems
> > to be a good option.
> > However, with dictionary encoding disabled, the DefaultV1ValuesWriter
> > class defaults to the PlainValuesWriter.
> >
> > Is there a way to create a custom Parquet writer which can be used by
> > Spark?
> > Appreciate your help on this.
> >
> > Thanks,
> > Ridha

--
Take Care,
Rajesh Mahindra
