Hi Pradeep,

I don't have any experience with using the Parquet APIs through Spark. That being said, there are currently several open issues around column statistics, both in the format itself and in the parquet-mr implementation (PARQUET-686, PARQUET-839, PARQUET-840).
However, in your case and depending on the versions involved, you might also hit PARQUET-251, which can cause the statistics of some files to be ignored on read. In this context it may be worth having a look at this file:

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java

How did you check that the statistics are not written to the footer? If you used parquet-mr, they may be there but be ignored. I've appended a small sketch below the quoted thread that dumps the footer statistics with plain parquet-mr, independent of Spark.

Cheers,
Lars

On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> Bumping the thread to see if I get any responses.
>
> On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > Hi folks,
> >
> > I generated a bunch of Parquet files using Spark and
> > ParquetThriftOutputFormat. The Thrift model has a column called
> > "deviceId", which is a string column. It also has a "timestamp" column
> > of int64. After the files had been generated, I inspected the file
> > footers and noticed that only the "timestamp" field has min/max
> > statistics. My primary filter will be deviceId; the data is partitioned
> > and sorted by deviceId, but since the statistics are missing, the reader
> > is not able to prune blocks. Am I missing some configuration setting
> > that makes it generate the stats? The following code shows how an
> > RDD[Thrift] is being saved to Parquet. The configuration is the default
> > configuration.
> >
> > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
> >   def saveAsParquet(output: String,
> >                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
> >     val job = Job.getInstance(conf)
> >     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
> >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> >     rdd.map[(Void, T)](x => (null, x))
> >       .saveAsNewAPIHadoopFile(
> >         output,
> >         classOf[Void],
> >         clazz,
> >         classOf[ParquetThriftOutputFormat[T]],
> >         job.getConfiguration)
> >   }
> > }
> >
> > Thanks,
> > Pradeep
> >
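P.S. Here is roughly what I had in mind for checking the footers directly. It is an untested sketch against the parquet-mr API (ParquetFileReader.readFooter and ColumnChunkMetaData.getStatistics); the object name DumpFooterStats and the command-line argument are just placeholders.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    import scala.collection.JavaConverters._

    object DumpFooterStats {
      def main(args: Array[String]): Unit = {
        // Read only the footer metadata of the given Parquet file.
        val footer = ParquetFileReader.readFooter(new Configuration(), new Path(args(0)))
        for (block  <- footer.getBlocks.asScala;
             column <- block.getColumns.asScala) {
          val stats = column.getStatistics
          // If no statistics were written for this chunk, stats is null or empty.
          val desc =
            if (stats == null || stats.isEmpty) "no statistics" else stats.toString
          println(s"${column.getPath}: $desc")
        }
      }
    }

If "deviceId" prints "no statistics" here, the statistics really were not written. If min/max values are printed, they are present in the footer and are most likely being dropped on the read path by the PARQUET-251 check in CorruptStatistics.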