Hi Ridha,

> We are interested in reducing the parquet file size to as small as possible.

Have you already tried basic techniques like co-locating similar data in parquet files? This is usually done by partitioning on some sort of deviceId/userId/processId/appId and then sorting on that same id + eventTime inside of each partition.
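A rough, untested sketch of what I mean (the deviceId/eventTime column names and the paths are just placeholders for whatever your schema actually has):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-parquet").getOrCreate()

val events = spark.read.parquet("/data/events")     // placeholder input path

events
  .repartition(events("deviceId"))                  // co-locate rows for the same device in the same files
  .sortWithinPartitions("deviceId", "eventTime")    // long sorted runs give RLE/dictionary/delta much more to work with
  .write
  .mode("overwrite")
  .parquet("/data/events_compacted")                // placeholder output path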
Max.

On Wed, Feb 28, 2024 at 1:06 AM Rajesh Mahindra <[email protected]> wrote:

> Hi Gang,
>
> Even with Parquet v 2.0, I see that spark is writing parquet files with RLE
> encoding for INT and DOUBLE. Anything additional that we need to configure?
>
> On Tue, Feb 27, 2024 at 5:37 PM Gang Wu <[email protected]> wrote:
>
> > Hi Ridha,
> >
> > DELTA_BINARY_PACKED is enabled for parquet v2 in the parquet-mr
> > implementation. Have you tried to set `parquet.writer.version` [1] to
> > PARQUET_2_0 in the Spark job? I'm not sure if this helps.
> >
> > [1]
> > https://github.com/apache/parquet-mr/blob/86f90f57b7858ea1eede7bb8b6946c649d74f7e1/parquet-hadoop/README.md?plain=1#L130
> >
> > Best,
> > Gang
> >
> > On Wed, Feb 28, 2024 at 1:39 AM Ridha Khan <[email protected]> wrote:
> >
> > > Hi Team,
> > >
> > > Hope you're all doing well.
> > > This is a query regarding the Parquet Encoding used by spark.
> > >
> > > We are interested in reducing the parquet file size to as small as
> > > possible. Looking at the nature of our data, DELTA_BINARY_PACKED seems to
> > > be a good option.
> > > However, with dictionary disabled, the DefaultV1ValuesWriter class
> > > defaults to the PlainValuesWriter.
> > >
> > > Is there a way to create a custom parquet writer which can be used by
> > > Spark?
> > > Appreciate your help on this.
> > >
> > > Thanks,
> > > Ridha
>
> --
> Take Care,
> Rajesh Mahindra
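P.S. On Gang's suggestion about `parquet.writer.version` quoted above, a minimal, untested sketch of requesting the v2 writer from a Spark job would look roughly like this (paths are placeholders, and I haven't checked which Spark versions pass the per-write option through to parquet-mr):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-v2-writer").getOrCreate()

// Ask parquet-mr for the v2 writer via the Hadoop configuration used by write jobs.
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "PARQUET_2_0")

val df = spark.read.parquet("/data/events")          // placeholder input path

df.write
  .option("parquet.writer.version", "PARQUET_2_0")   // per-write option; pass-through can depend on the Spark version
  .mode("overwrite")
  .parquet("/data/events_v2")                        // placeholder output path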
