Bumping the thread to see if I get any responses.

On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Hi folks,
>
> I generated a bunch of Parquet files using Spark and
> ParquetThriftOutputFormat. The Thrift model has a string column called
> "deviceId" and an int64 column called "timestamp". After the files were
> generated, I inspected the file footers and noticed that only the
> "timestamp" field has min/max statistics. My primary filter will be on
> deviceId, and the data is partitioned and sorted by deviceId, but since the
> statistics are missing, the reader is not able to prune blocks (row groups)
> from being read. Am I missing a configuration setting that enables the
> statistics to be generated? The following code shows how an RDD[Thrift] is
> saved to Parquet, with the default configuration.
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.mapreduce.Job
> import org.apache.parquet.hadoop.thrift.ParquetThriftOutputFormat
> import org.apache.spark.rdd.RDD
> import org.apache.thrift.{TBase, TFieldIdEnum}
>
> import scala.reflect.{ClassTag, classTag}
>
> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
>   def saveAsParquet(output: String,
>                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
>     val job = Job.getInstance(conf)
>     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
>     // Tell the output format which Thrift class defines the schema.
>     ParquetThriftOutputFormat.setThriftClass(job, clazz)
>     // Parquet's Hadoop output formats take (Void, T) pairs; the key is unused.
>     rdd.map[(Void, T)](x => (null, x))
>       .saveAsNewAPIHadoopFile(
>         output,
>         classOf[Void],
>         clazz,
>         classOf[ParquetThriftOutputFormat[T]],
>         job.getConfiguration)
>   }
> }
>
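> For reference, here is a minimal sketch of the kind of footer inspection
> described above, assuming parquet-mr's ParquetFileReader API (the object
> name and the path argument are placeholders, not part of the job above):
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.hadoop.ParquetFileReader
>
> import scala.collection.JavaConverters._
>
> object DumpStats {
>   def main(args: Array[String]): Unit = {
>     // Reads only the footer; no row data is touched.
>     val footer = ParquetFileReader.readFooter(new Configuration(), new Path(args(0)))
>     for {
>       block  <- footer.getBlocks.asScala
>       column <- block.getColumns.asScala
>     } {
>       val stats = column.getStatistics
>       val summary =
>         if (stats == null || stats.isEmpty) "no min/max statistics"
>         else s"min=${stats.genericGetMin}, max=${stats.genericGetMax}"
>       // For the files described above, "deviceId" reports no min/max
>       // statistics while "timestamp" reports its min and max.
>       println(s"${column.getPath}: $summary")
>     }
>   }
> }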
>
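> And on the read side, a sketch of the predicate pushdown that depends on
> those statistics, assuming parquet-mr's filter2 API (the "device-123" value
> and the read-side job are placeholders):
>
> import org.apache.hadoop.mapreduce.Job
> import org.apache.parquet.filter2.predicate.FilterApi
> import org.apache.parquet.hadoop.ParquetInputFormat
> import org.apache.parquet.io.api.Binary
>
> val readJob = Job.getInstance()
> // Row groups whose deviceId min/max range excludes the value can be
> // skipped entirely, but only when the footer actually carries statistics.
> val deviceFilter = FilterApi.eq(
>   FilterApi.binaryColumn("deviceId"),
>   Binary.fromString("device-123"))
> ParquetInputFormat.setFilterPredicate(readJob.getConfiguration, deviceFilter)
>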
> Thanks,
> Pradeep
>
