Hi,

This works. By "efficiently" it sounded like you wanted to read just the count from the file. With this call you are reading the entire metadata and then keeping just the row count. Depending on what you are doing, this could be sufficient.
The code to read a footer is here:
https://github.com/apache/incubator-parquet-mr/blob/17864dfc0711d52d5af330469a1c2bd76128d46e/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java#L270

The _metadata file follows the same format but combines the footers for all the files in the directory. The file starts with a magic number, not the thrift bytes right away, which could be your problem.

You could create a thrift file as follows:

namespace java parquet.format.rowcount

struct RowGroup {
  3: required i64 num_rows
}

struct FileMetaData {
  4: required list<RowGroup> row_groups
}

and use this trimmed-down version of the metadata to read only what you need (thrift will skip the rest). You would use code similar to this:
https://github.com/apache/incubator-parquet-format/blob/52a6a213f8649a955fa9a09a4395a02f0e43785f/src/main/java/parquet/format/Util.java
instead of the call here:
https://github.com/apache/incubator-parquet-mr/blob/17864dfc0711d52d5af330469a1c2bd76128d46e/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java#L296
There is a rough sketch of this at the end of this message, after the quoted thread.

On Jul 29, 2014, at 9:16 AM, Brandon Amos <[email protected]> wrote:

> Hi Julien,
>
> Is the code snippet below the best way to read the total number of rows in my
> Parquet store?
>
> Thanks,
> Brandon.
>
> val metadataFooters: Seq[Footer] = ParquetFileReader
>   .readFooters(conf, new Path(hdfsPath))
> val rows = metadataFooters.foldLeft(0L) {
>   (res: Long, footer: Footer) =>
>     res + footer.getParquetMetadata.getBlocks.foldLeft(0L) {
>       (inner_res: Long, block: BlockMetaData) =>
>         inner_res + block.getRowCount
>     }
> }
>
> On Jul 26, 2014, at 7:08 PM, Brandon Amos <[email protected]> wrote:
>
> Hi Julien,
>
> Thanks for the response.
> Can you give further details on how I can go from having a byte array of the
> metadata file to extracting the `num_rows` field?
>
> The Thrift schema at [1] doesn't provide the schema of the entire metadata
> file, and I can't get any of the Util functions in [2] to read the metadata
> file either.
> The `readFileMetaData` function outputs the following error when I try
> passing an InputStream of `_metadata`:
>
> java.io.IOException: can not read class parquet.format.FileMetaData: Required
> field 'version' was not found in serialized data! Struct:
> FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
>   at parquet.format.Util.read(Util.java:50)
>   at parquet.format.Util.readFileMetaData(Util.java:34)
>
> Thanks,
> Brandon.
>
> [1]: https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
> [2]: https://github.com/apache/incubator-parquet-format/blob/master/src/main/java/parquet/format/Util.java
>
> On Jul 26, 2014, at 6:50 PM, JULIEN LE DEM <[email protected]> wrote:
>
> Hi Brandon,
> You could probably make a copy of the thrift definition and keep only the
> fields you need.
> If you use the generated classes to read the metadata, thrift will skip all
> the other fields.
> Julien
>
> On Jul 26, 2014, at 12:16 AM, Brandon Amos wrote:
>
> Hi Parquet team,
>
> I apologize for the simple question, but I'm using Parquet on HDFS in
> a Scala/Spark application and am having trouble efficiently
> obtaining the number of rows in my Parquet data stores without
> loading and counting.
>
> The README at https://github.com/apache/incubator-parquet-format
> has great information about the format of the metadata,
> and I want to extract the `num_rows` field from the
> `FileMetaData` Thrift object.
> However, the `_metadata` file contained in Parquet databases
> contains many Thrift objects and other information
> in addition to the `FileMetaData` object that I want to extract.
>
> Can anybody give recommendations on how I can most efficiently
> extract the `num_rows` field?
>
> Thanks,
> Brandon.
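
Here is a rough, untested sketch of what I mean. The class and method names are made up for illustration, and it assumes you have compiled the trimmed thrift file above into Java classes in the parquet.format.rowcount package (e.g. with the Thrift compiler). It reads the footer the same way readFooter does: the serialized FileMetaData sits at the end of the file, followed by a 4-byte little-endian footer length and the 4-byte magic, all encoded with thrift's compact protocol.

import java.io.ByteArrayInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.transport.TIOStreamTransport;

// Hypothetical helper, not part of parquet-mr; names are made up for illustration.
public class MetadataRowCount {
  private static final int MAGIC_LENGTH = 4;        // "PAR1" at the very end of the file
  private static final int FOOTER_LENGTH_SIZE = 4;  // little-endian i32 just before the magic

  public static long readNumRows(Configuration conf, Path metadataFile) throws Exception {
    FileSystem fs = metadataFile.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(metadataFile);
    long fileLength = status.getLen();
    FSDataInputStream in = fs.open(metadataFile);
    try {
      // the footer length is stored just before the trailing magic
      byte[] lengthBytes = new byte[FOOTER_LENGTH_SIZE];
      in.readFully(fileLength - MAGIC_LENGTH - FOOTER_LENGTH_SIZE, lengthBytes);
      int footerLength = (lengthBytes[0] & 0xff)
          | (lengthBytes[1] & 0xff) << 8
          | (lengthBytes[2] & 0xff) << 16
          | (lengthBytes[3] & 0xff) << 24;
      // the serialized FileMetaData sits immediately before the footer length
      byte[] footerBytes = new byte[footerLength];
      in.readFully(fileLength - MAGIC_LENGTH - FOOTER_LENGTH_SIZE - footerLength, footerBytes);
      // deserialize with the trimmed, thrift-generated struct; thrift skips the fields we left out
      parquet.format.rowcount.FileMetaData metadata = new parquet.format.rowcount.FileMetaData();
      metadata.read(new TCompactProtocol(
          new TIOStreamTransport(new ByteArrayInputStream(footerBytes))));
      long numRows = 0;
      for (parquet.format.rowcount.RowGroup rowGroup : metadata.getRow_groups()) {
        numRows += rowGroup.getNum_rows();
      }
      return numRows;
    } finally {
      in.close();
    }
  }
}

Compared to readFooters, this only deserializes the row_groups/num_rows fields you kept in the trimmed definition; thrift skips everything else in the footer.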
