Can you check the value of ParquetMetaData.created_by? Once you have that, you can check whether statistics from that writer version get filtered out by the code in CorruptStatistics.java.
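For reference, the writer string is available via metadata.getFileMetaData().getCreatedBy(), and parquet-mr's own filtering lives in org.apache.parquet.CorruptStatistics. Below is a rough, self-contained sketch of the kind of check that class applies: parse the version out of created_by and flag parquet-mr writers older than 1.8.0, the fix version for PARQUET-251. The class and method names here are hypothetical, and the cutoff is an approximation of the real logic, not a copy of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CreatedByCheck {
    // created_by typically looks like: "parquet-mr version 1.6.0 (build abc123)"
    private static final Pattern VERSION =
        Pattern.compile("parquet-mr version (\\d+)\\.(\\d+)\\.(\\d+)");

    // Returns true if created_by identifies a parquet-mr writer older than
    // 1.8.0, whose statistics a reader may choose to ignore (PARQUET-251).
    public static boolean mayHaveCorruptStatistics(String createdBy) {
        if (createdBy == null) {
            return true; // unknown writer: err on the side of caution
        }
        Matcher m = VERSION.matcher(createdBy);
        if (!m.find()) {
            return true; // unparseable version string
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major < 1 || (major == 1 && minor < 8);
    }

    public static void main(String[] args) {
        System.out.println(mayHaveCorruptStatistics("parquet-mr version 1.6.0 (build abc)"));
        System.out.println(mayHaveCorruptStatistics("parquet-mr version 1.8.1 (build abc)"));
    }
}
```

If created_by falls on the "old" side of a check like this, the stats may well be present in the footer but silently ignored at read time.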
On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> Data was written with Spark, but I'm using the Parquet APIs directly for
> reads. I checked the stats in the footer with the following code:
>
> ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
>     ParquetMetadataConverter.NO_FILTER);
> ColumnPath deviceId = ColumnPath.get("deviceId");
> metadata.getBlocks().forEach(b -> {
>     if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
>         System.out.println("\nBlockSize = " + b.getTotalByteSize());
>         System.out.println("ComprSize = " + b.getCompressedSize());
>         System.out.println("Num Rows = " + b.getRowCount());
>         b.getColumns().forEach(c -> {
>             if (c.getPath().equals(deviceId)) {
>                 Comparable max = c.getStatistics().genericGetMax();
>                 Comparable min = c.getStatistics().genericGetMin();
>                 System.out.println("\t" + c.getPath() + " [" + min + ", " + max + "]");
>             }
>         });
>     }
> });
>
> Thanks,
> Pradeep
>
> On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <l...@cloudera.com> wrote:
>
> > Hi Pradeep,
> >
> > I don't have any experience with using Parquet APIs through Spark. That
> > being said, there are currently several issues around column statistics,
> > both in the format and in the parquet-mr implementation (PARQUET-686,
> > PARQUET-839, PARQUET-840).
> >
> > However, in your case and depending on the versions involved, you might
> > also be hitting PARQUET-251, which can cause statistics in some files to
> > be ignored. In this context it may be worth having a look at this file:
> > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
> >
> > How did you check that the statistics are not written to the footer? If
> > you used parquet-mr, they may be there but be ignored.
> >
> > Cheers, Lars
> >
> > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > wrote:
> >
> > > Bumping the thread to see if I get any responses.
> > >
> > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I generated a bunch of Parquet files using Spark and
> > > > ParquetThriftOutputFormat. The Thrift model has a column called
> > > > "deviceId", which is a string column. It also has a "timestamp" column
> > > > of int64. After the files were generated, I inspected the file footers
> > > > and noticed that only the "timestamp" field has min/max statistics. My
> > > > primary filter will be deviceId, and the data is partitioned and sorted
> > > > by deviceId, but since the statistics are missing, blocks can't be
> > > > pruned from the read. Am I missing a configuration setting that would
> > > > make it generate the stats? The following code shows how an RDD[Thrift]
> > > > is being saved to Parquet. The configuration is the default
> > > > configuration.
> > > >
> > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
> > > >   def saveAsParquet(output: String,
> > > >                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
> > > >     val job = Job.getInstance(conf)
> > > >     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
> > > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > >     rdd.map[(Void, T)](x => (null, x))
> > > >       .saveAsNewAPIHadoopFile(
> > > >         output,
> > > >         classOf[Void],
> > > >         clazz,
> > > >         classOf[ParquetThriftOutputFormat[T]],
> > > >         job.getConfiguration)
> > > >   }
> > > > }
> > > >
> > > > Thanks,
> > > > Pradeep
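For context on the format-level statistics issues Lars mentions (PARQUET-686 and friends): min/max for binary columns were historically computed with signed byte comparison, which disagrees with the unsigned lexicographic order readers expect for UTF-8 strings whenever a byte >= 0x80 is involved. A small self-contained illustration of the two orderings (the class and method names are mine, not parquet-mr's):

```java
public class ByteOrderDemo {
    // Signed comparison, roughly what the old binary stats ordering used.
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return Byte.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length);
    }

    // Unsigned (lexicographic) comparison, the order expected for UTF-8 strings.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] z = {0x7A};              // "z"
        byte[] e = {(byte) 0xC3, (byte) 0xA9}; // "é" in UTF-8
        // Unsigned: "z" < "é". Signed: 0xC3 reads as -61, so "é" < "z".
        System.out.println(compareUnsigned(z, e) < 0); // true
        System.out.println(compareSigned(z, e) < 0);   // false
    }
}
```

Because the two orders disagree, min/max written under one ordering can exclude rows a reader filtering under the other ordering would need, which is why readers conservatively drop such stats rather than prune on them.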