Re: Missing min/max statistics in file footer

2017-02-10 Thread Julien Le Dem
When the reader ignores the stats, you should see a warning in the logs.
If you have a local build you can easily modify the logic to verify:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L347
 


Re: Missing min/max statistics in file footer

2017-02-10 Thread Lars Volker
In that case I don't see why reading the stats shouldn't work, assuming
they are in the file in the first place. I don't know why writing them
would fail, so unless someone else can help you, you may have to debug the
code that writes them.


Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
metadata.getFileMetadata().createdBy() shows this "parquet-mr version
1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"

Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
PARQUET-869 



Re: Missing min/max statistics in file footer

2017-02-10 Thread Lars Volker
Can you check the value of ParquetMetaData.created_by? Once you have that,
you should see if it gets filtered by the code in CorruptStatistics.java.
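
A minimal way to run that check directly, sketched below; it assumes the static
helper CorruptStatistics.shouldIgnoreStatistics(createdBy, columnType) in
parquet-column and simply feeds it the created_by string quoted earlier in this
thread:

import org.apache.parquet.CorruptStatistics;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

public class CreatedByCheck {
  public static void main(String[] args) {
    // created_by string from the file footer (value reported earlier in the thread)
    String createdBy = "parquet-mr version 1.9.1-SNAPSHOT "
        + "(build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)";
    // true means the reader drops min/max stats for BINARY columns
    // (such as a string deviceId) because of PARQUET-251
    boolean ignored = CorruptStatistics.shouldIgnoreStatistics(
        createdBy, PrimitiveTypeName.BINARY);
    System.out.println("BINARY stats ignored for this writer: " + ignored);
  }
}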



Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
Data was written with Spark but I'm using the parquet APIs directly for
reads. I checked the stats in the footer with the following code.

ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
    ParquetMetadataConverter.NO_FILTER);
ColumnPath deviceId = ColumnPath.get("deviceId");
metadata.getBlocks().forEach(b -> {
  if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
    System.out.println("\nBlockSize = " + b.getTotalByteSize());
    System.out.println("ComprSize = " + b.getCompressedSize());
    System.out.println("Num Rows  = " + b.getRowCount());
    b.getColumns().forEach(c -> {
      if (c.getPath().equals(deviceId)) {
        Comparable max = c.getStatistics().genericGetMax();
        Comparable min = c.getStatistics().genericGetMin();
        System.out.println("\t" + c.getPath() + " [" + min + ", " + max + "]");
      }
    });
  }
});


Thanks,
Pradeep



Re: Missing min/max statistics in file footer

2017-02-10 Thread Lars Volker
Hi Pradeep,

I don't have any experience with using Parquet APIs through Spark. That
being said, there are currently several issues around column statistics,
both in the format and in the parquet-mr implementation (PARQUET-686,
PARQUET-839, PARQUET-840).

However, in your case and depending on the versions involved, you might
also hit PARQUET-251, which can cause statistics for some files to be
ignored. In this context it may be worth having a look at this file:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java

How did you check that the statistics are not written to the footer? If you
used parquet-mr, they may be there but ignored by the reader.

Cheers, Lars



Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
Bumping the thread to see if I get any responses.

On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota 
wrote:

> Hi folks,
>
> I generated a bunch of Parquet files using Spark and
> ParquetThriftOutputFormat. The Thrift model has a column called "deviceId"
> which is a string column. It also has a "timestamp" column of int64. After
> the files have been generated, I inspected the file footers and noticed
> that only the "timestamp" field has min/max statistics. My primary filter
> will be deviceId, the data is partitioned and sorted by deviceId, but since
> the statistics data is missing, it's not able to prune blocks from being
> read. Am I missing some configuration setting that allows it to generate
> the stats data? The following code shows how an RDD[Thrift] is being saved
> to Parquet. The configuration is the default.
>
> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
>   def saveAsParquet(output: String,
>                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
>     val job = Job.getInstance(conf)
>     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
>     ParquetThriftOutputFormat.setThriftClass(job, clazz)
>     val r = rdd.map[(Void, T)](x => (null, x))
>       .saveAsNewAPIHadoopFile(
>         output,
>         classOf[Void],
>         clazz,
>         classOf[ParquetThriftOutputFormat[T]],
>         job.getConfiguration)
>   }
> }
>
>
> Thanks,
> Pradeep
>


[jira] [Commented] (PARQUET-678) Allow for custom compression codecs

2017-02-10 Thread Uwe L. Korn (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860944#comment-15860944 ]

Uwe L. Korn commented on PARQUET-678:
-

[~cotton] A patch would be very welcome; I can help with that on the C++ side
once we have a Java patch available.

> Allow for custom compression codecs
> ---
>
> Key: PARQUET-678
> URL: https://issues.apache.org/jira/browse/PARQUET-678
> Project: Parquet
>  Issue Type: Wish
>Reporter: Steven Anton
>Priority: Minor
>
> I understand that the list of accepted compression codecs is explicitly 
> limited to uncompressed, snappy, gzip, and lzo. (See 
> parquet.hadoop.metadata.CompressionCodecName.java) Is there a reason for 
> this? Or is there an easy workaround? On the surface it seems like an 
> unnecessary restriction.
> I ask because I have written a custom codec to implement encryption and I'm 
> unable to use it with Parquet, which is a real shame because it is the main 
> storage format I was hoping to use.
> Other thoughts on how to implement encryption in Parquet with this limitation?





[jira] [Commented] (PARQUET-678) Allow for custom compression codecs

2017-02-10 Thread Uwe L. Korn (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860940#comment-15860940 ]

Uwe L. Korn commented on PARQUET-678:
-

Adding them to parquet-cpp and parquet-format is easy; the only thing that
looks a bit harder from my side is adding them to Hadoop as codecs so they can
be used in parquet-mr. At least for Zstd, this seems to be done already:
https://issues.apache.org/jira/browse/HADOOP-13578
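
For context, "adding it to Hadoop as a codec" means implementing
org.apache.hadoop.io.compress.CompressionCodec. The skeleton below is only a
hypothetical sketch (class name, package, and the unimplemented bodies are
placeholders, not a working codec), and on its own it would still be blocked by
the CompressionCodecName restriction described in this issue:

package com.example.codec; // hypothetical package

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;

// Hypothetical skeleton of a custom Hadoop codec; the actual transformation
// (compression, encryption, ...) would live in the stream wrappers.
public class CustomCodec implements CompressionCodec {

  @Override
  public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
    // wrap 'out' in a stream that applies the custom transformation
    throw new UnsupportedOperationException("not sketched here");
  }

  @Override
  public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
      throws IOException {
    return createOutputStream(out);
  }

  @Override
  public Class<? extends Compressor> getCompressorType() {
    return null; // a real codec returns its Compressor implementation
  }

  @Override
  public Compressor createCompressor() {
    return null;
  }

  @Override
  public CompressionInputStream createInputStream(InputStream in) throws IOException {
    // wrap 'in' in a stream that reverses the transformation
    throw new UnsupportedOperationException("not sketched here");
  }

  @Override
  public CompressionInputStream createInputStream(InputStream in, Decompressor decompressor)
      throws IOException {
    return createInputStream(in);
  }

  @Override
  public Class<? extends Decompressor> getDecompressorType() {
    return null;
  }

  @Override
  public Decompressor createDecompressor() {
    return null;
  }

  @Override
  public String getDefaultExtension() {
    return ".custom";
  }
}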



