Hi Julien,

Thanks for the response. Can you give further details on how I can go from having a byte array of the metadata file to extracting the `num_rows` field?
The Thrift schema at [1] doesn't provide the schema of the entire metadata file, and I can't get any of the Util functions in [2] to read the metadata file either. The `readFileMetaData` function outputs the following error when I try passing an InputStream of `_metadata`.

    java.io.IOException: can not read class parquet.format.FileMetaData: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
        at parquet.format.Util.read(Util.java:50)
        at parquet.format.Util.readFileMetaData(Util.java:34)

Thanks,
Brandon.

[1]: https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
[2]: https://github.com/apache/incubator-parquet-format/blob/master/src/main/java/parquet/format/Util.java

On Jul 26, 2014, at 6:50 PM, JULIEN LE DEM wrote:

Hi Brandon,

You could probably make a copy of the Thrift definition and keep only the fields you need. If you use the generated classes to read the metadata, Thrift will skip all the other fields.

Julien

On Jul 26, 2014, at 12:16 AM, Brandon Amos wrote:

Hi Parquet team,

I apologize for the simple question, but I'm using Parquet on HDFS in a Scala/Spark application and am having trouble efficiently obtaining the number of rows in my Parquet data stores without loading and counting.

The README at https://github.com/apache/incubator-parquet-format has great information about the format of the metadata, and I want to extract the `num_rows` field from the `FileMetaData` Thrift object. However, the `_metadata` file contained in Parquet databases contains many Thrift objects and other information in addition to the `FileMetaData` object that I want to extract.

Can anybody give recommendations on how I can most efficiently extract the `num_rows` field?

Thanks,
Brandon.
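
A possible reason for the "Required field 'version' was not found" error at the top of this thread is that the stream was positioned at the start of the file. The format README describes a Parquet file as the 4-byte magic `PAR1`, the data, the serialized `FileMetaData`, a 4-byte little-endian footer length, and a closing `PAR1`, so the Thrift struct sits at the end of the file rather than at byte 0. The Java sketch below only locates the footer inside a byte array. It assumes `_metadata` follows the same layout as a regular Parquet file, and `ParquetFooterLocator` and `locateFooter` are hypothetical names, not part of parquet-format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of parquet-format): finds the Thrift footer in a Parquet file's bytes. */
final class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns {offset, length} of the serialized FileMetaData within fileBytes. */
    static int[] locateFooter(byte[] fileBytes) {
        int n = fileBytes.length;
        // Smallest possible file: leading magic + 4-byte footer length + trailing magic.
        if (n < MAGIC.length + 4 + MAGIC.length) {
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        // The file must end with the PAR1 magic.
        for (int i = 0; i < MAGIC.length; i++) {
            if (fileBytes[n - MAGIC.length + i] != MAGIC[i]) {
                throw new IllegalArgumentException("missing trailing PAR1 magic");
            }
        }
        // The footer length is a little-endian int just before the trailing magic.
        int lengthPos = n - MAGIC.length - 4;
        int footerLength = ByteBuffer.wrap(fileBytes, lengthPos, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        int footerOffset = lengthPos - footerLength;
        // The footer cannot overlap the leading magic.
        if (footerLength < 0 || footerOffset < MAGIC.length) {
            throw new IllegalArgumentException("corrupt footer length: " + footerLength);
        }
        return new int[] {footerOffset, footerLength};
    }
}
```

With the offset and length in hand, wrapping the same array as `new ByteArrayInputStream(bytes, offset, length)` and passing that stream to `readFileMetaData` from [2] should let Thrift start decoding at the struct itself. This is untested against a real `_metadata` file, and whether its `num_rows` covers the whole data store is worth verifying separately.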
