metadata.getFileMetadata().getCreatedBy() shows
"parquet-mr version 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"

Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>
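
For reference, here is a rough sketch of how that created_by value can be
checked against CorruptStatistics (the file path is a placeholder, and I
haven't double-checked the shouldIgnoreStatistics signature against my local
build):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.CorruptStatistics;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

Configuration conf = new Configuration();
// The path is a placeholder for one of the generated files.
ParquetMetadata metadata = ParquetFileReader.readFooter(
    conf, new Path("/path/to/part-00000.parquet"), ParquetMetadataConverter.NO_FILTER);
String createdBy = metadata.getFileMetadata().getCreatedBy();
System.out.println("created_by = " + createdBy);
// deviceId is a BINARY (string) column, so check whether its stats get dropped.
System.out.println("ignore binary stats? "
    + CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY));

If that prints true for BINARY, the deviceId statistics would be dropped by the
reader even if they are present in the footer.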

On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <l...@cloudera.com> wrote:

> Can you check the value of ParquetMetaData.created_by? Once you have that,
> you should see if it gets filtered by the code in CorruptStatistics.java.
>
> On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > Data was written with Spark but I'm using the parquet APIs directly for
> > reads. I checked the stats in the footer with the following code.
> >
> > ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
> >         ParquetMetadataConverter.NO_FILTER);
> > ColumnPath deviceId = ColumnPath.get("deviceId");
> > metadata.getBlocks().forEach(b -> {
> >     if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
> >         System.out.println("\nBlockSize = " + b.getTotalByteSize());
> >         System.out.println("ComprSize = " + b.getCompressedSize());
> >         System.out.println("Num Rows  = " + b.getRowCount());
> >         b.getColumns().forEach(c -> {
> >             if (c.getPath().equals(deviceId)) {
> >                 Comparable max = c.getStatistics().genericGetMax();
> >                 Comparable min = c.getStatistics().genericGetMin();
> >                 System.out.println("\t" + c.getPath() + " [" + min + ", " + max + "]");
> >             }
> >         });
> >     }
> > });
> >
> >
> > Thanks,
> > Pradeep
> >
> > On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <l...@cloudera.com> wrote:
> >
> > > Hi Pradeep,
> > >
> > > I don't have any experience with using Parquet APIs through Spark. That
> > > being said, there are currently several issues around column statistics,
> > > both in the format and in the parquet-mr implementation (PARQUET-686,
> > > PARQUET-839, PARQUET-840).
> > >
> > > However, in your case and depending on the versions involved, you might
> > > also hit PARQUET-251, which can cause statistics for some files to be
> > > ignored. In this context it may be worth having a look at this file:
> > > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
> > >
> > > How did you check that the statistics are not written to the footer? If
> > > you used parquet-mr, they may be there but be ignored.
> > >
> > > Cheers, Lars
> > >
> > > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > > > Bumping the thread to see if I get any responses.
> > > >
> > > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > I generated a bunch of parquet files using Spark and
> > > > > ParquetThriftOutputFormat. The Thrift model has a column called
> > > > > "deviceId", which is a string column. It also has a "timestamp"
> > > > > column of int64. After the files have been generated, I inspected
> > > > > the file footers and noticed that only the "timestamp" field has
> > > > > min/max statistics. My primary filter will be deviceId, and the
> > > > > data is partitioned and sorted by deviceId, but since the
> > > > > statistics data is missing, it's not able to prune blocks from
> > > > > being read. Am I missing some configuration setting that allows
> > > > > it to generate the stats data? The following code is how an
> > > > > RDD[Thrift] is being saved to parquet. The configuration is the
> > > > > default configuration.
> > > > >
> > > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
> > > > >   def saveAsParquet(output: String,
> > > > >                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
> > > > >     val job = Job.getInstance(conf)
> > > > >     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
> > > > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > > >     val r = rdd.map[(Void, T)](x => (null, x))
> > > > >       .saveAsNewAPIHadoopFile(
> > > > >         output,
> > > > >         classOf[Void],
> > > > >         clazz,
> > > > >         classOf[ParquetThriftOutputFormat[T]],
> > > > >         job.getConfiguration)
> > > > >   }
> > > > > }
> > > > >
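> > > > > For what it's worth, this is roughly how I plan to push the deviceId
> > > > > filter down on the read side (the device id value is just a
> > > > > placeholder; I haven't verified this end to end):
> > > > >
> > > > > import org.apache.hadoop.conf.Configuration;
> > > > > import org.apache.parquet.filter2.predicate.FilterApi;
> > > > > import org.apache.parquet.filter2.predicate.FilterPredicate;
> > > > > import org.apache.parquet.hadoop.ParquetInputFormat;
> > > > > import org.apache.parquet.io.api.Binary;
> > > > >
> > > > > Configuration readConf = new Configuration();
> > > > > // Row groups whose deviceId min/max range excludes this value should
> > > > > // be skipped, provided the statistics are written and not ignored.
> > > > > FilterPredicate onlyOneDevice = FilterApi.eq(
> > > > >     FilterApi.binaryColumn("deviceId"), Binary.fromString("some-device-id"));
> > > > > ParquetInputFormat.setFilterPredicate(readConf, onlyOneDevice);
> > > > >
> > > > > Whether row groups actually get skipped still depends on the deviceId
> > > > > statistics being present in the footer and not being discarded as corrupt.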
> > > > >
> > > > > Thanks,
> > > > > Pradeep
> > > > >
> > > >
> > >
> >
>
