Hi Gang,

Even with Parquet v2.0, I see that Spark is still writing Parquet files with RLE encoding for INT and DOUBLE columns. Is there anything additional that we need to configure?
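For context, this is roughly how we are invoking the write. It's a minimal sketch: the dataset, object name, and output path are made up, and I'm assuming that Spark forwards DataFrameWriter options into the Parquet Hadoop configuration (the option keys themselves come from the parquet-hadoop README that you linked).

import org.apache.spark.sql.SparkSession

object ParquetV2WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-v2-encoding-test")
      .getOrCreate()

    // Illustrative data; any DataFrame with INT and DOUBLE columns will do.
    val df = spark.range(0, 1000000)
      .selectExpr("cast(id as int) as id", "cast(id as double) * 0.1 as value")

    df.write
      // parquet-mr writer version; our understanding is that this should
      // enable the v2 encodings such as DELTA_BINARY_PACKED.
      .option("parquet.writer.version", "PARQUET_2_0")
      // We also tried disabling dictionary encoding to force the
      // non-dictionary value writers.
      .option("parquet.enable.dictionary", "false")
      .mode("overwrite")
      .parquet("/tmp/parquet_v2_test")

    spark.stop()
  }
}

We get the same result when setting the same keys on spark.sparkContext.hadoopConfiguration before the write, and we are checking the encodings actually used via the column-chunk metadata reported by parquet-tools.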
On Tue, Feb 27, 2024 at 5:37 PM Gang Wu <[email protected]> wrote:

> Hi Ridha,
>
> DELTA_BINARY_PACKED is enabled for parquet v2 in the parquet-mr
> implementation. Have you tried to set `parquet.writer.version` [1] to
> PARQUET_2_0 in the Spark job? I'm not sure if this helps.
>
> [1] https://github.com/apache/parquet-mr/blob/86f90f57b7858ea1eede7bb8b6946c649d74f7e1/parquet-hadoop/README.md?plain=1#L130
>
> Best,
> Gang
>
> On Wed, Feb 28, 2024 at 1:39 AM Ridha Khan <[email protected]> wrote:
>
> > Hi Team,
> >
> > Hope you're all doing well.
> > This is a query regarding the Parquet encoding used by Spark.
> >
> > We are interested in reducing the Parquet file size to as small as
> > possible. Looking at the nature of our data, DELTA_BINARY_PACKED seems
> > to be a good option.
> > However, with dictionary encoding disabled, the DefaultV1ValuesWriter
> > class defaults to the PlainValuesWriter.
> >
> > Is there a way to create a custom Parquet writer which can be used by
> > Spark?
> > Appreciate your help on this.
> >
> > Thanks,
> > Ridha

--
Take Care,
Rajesh Mahindra
