Bumping the thread to see if I get any responses.

On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> Hi folks,
>
> I generated a bunch of Parquet files using Spark and
> ParquetThriftOutputFormat. The Thrift model has a column called "deviceId",
> which is a string column. It also has a "timestamp" column of int64. After
> the files were generated, I inspected the file footers and noticed that
> only the "timestamp" field has min/max statistics. My primary filter will
> be deviceId, and the data is partitioned and sorted by deviceId, but since
> the statistics are missing, the reader is not able to prune blocks. Am I
> missing a configuration setting that would make it generate the stats?
> The following code shows how an RDD[Thrift] is being saved to Parquet,
> using the default configuration.
>
> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
>   def saveAsParquet(output: String,
>                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
>     val job = Job.getInstance(conf)
>     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
>     ParquetThriftOutputFormat.setThriftClass(job, clazz)
>     rdd.map[(Void, T)](x => (null, x))
>       .saveAsNewAPIHadoopFile(
>         output,
>         classOf[Void],
>         clazz,
>         classOf[ParquetThriftOutputFormat[T]],
>         job.getConfiguration)
>   }
> }
>
> Thanks,
> Pradeep
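For anyone who wants to reproduce the footer check mentioned above, something
like the following minimal sketch works; it assumes parquet-hadoop 1.8.x
(org.apache.parquet package names and the ParquetFileReader.readFooter call
are the assumptions here):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.collection.JavaConverters._

    object InspectFooter {
      def main(args: Array[String]): Unit = {
        // Read the footer of a single Parquet file given as args(0)
        // and dump the per-column-chunk min/max statistics.
        val footer = ParquetFileReader.readFooter(new Configuration(), new Path(args(0)))
        for {
          block <- footer.getBlocks.asScala   // one entry per row group
          col   <- block.getColumns.asScala   // one entry per column chunk
        } {
          val stats = col.getStatistics
          if (stats == null || stats.isEmpty)
            println(s"${col.getPath}: no min/max statistics")
          else
            println(s"${col.getPath}: min=${stats.genericGetMin} max=${stats.genericGetMax}")
        }
      }
    }

Run against one of the generated files, this should print min/max values for
"timestamp" but "no min/max statistics" for "deviceId", matching the behavior
described above.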