Hi Julien,
Is the code snippet below the best way to read the total number of rows in my
Parquet store?
Thanks,
Brandon.
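// Assumed imports for the snippet (pre-Apache `parquet.*` package names).
import scala.collection.JavaConversions._ // lets foldLeft work on java.util.List
import org.apache.hadoop.fs.Path
import parquet.hadoop.{Footer, ParquetFileReader}
import parquet.hadoop.metadata.BlockMetaData

val metadataFooters: Seq[Footer] = ParquetFileReader
  .readFooters(conf, new Path(hdfsPath))
// Sum the row counts of every row group (block) in every footer.
val rows = metadataFooters.foldLeft(0L) {
  (res: Long, footer: Footer) =>
    res + footer.getParquetMetadata.getBlocks.foldLeft(0L) {
      (inner_res: Long, block: BlockMetaData) =>
        inner_res + block.getRowCount
    }
}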
On Jul 26, 2014, at 7:08 PM, Brandon Amos wrote:
Hi Julien,
Thanks for the response.
Can you give further details on how I can go from having a byte array of the
metadata file to extracting the `num_rows` field?
The Thrift schema at [1] doesn't provide the schema of the entire metadata
file, and I can't get any of the Util functions in [2] to read the metadata
file either.
The `readFileMetaData` function throws the following error when I pass it
an InputStream of `_metadata`:
java.io.IOException: can not read class parquet.format.FileMetaData: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
    at parquet.format.Util.read(Util.java:50)
    at parquet.format.Util.readFileMetaData(Util.java:34)
Thanks,
Brandon.
[1]: https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
[2]: https://github.com/apache/incubator-parquet-format/blob/master/src/main/java/parquet/format/Util.java
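A likely cause of the error quoted in this message: a Parquet file does not start with the Thrift-serialized `FileMetaData`. It starts with the 4-byte magic `PAR1` and ends with the footer, a 4-byte little-endian footer length, and a second `PAR1`. Reading from the start of the stream therefore hands Thrift the wrong bytes. The sketch below, which assumes the whole `_metadata` file is already in a byte array, only locates the footer. The names `ParquetFooter` and `footerRange` are hypothetical helpers, not part of any Parquet library.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper: finds the serialized FileMetaData inside a Parquet file.
object ParquetFooter {
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** Returns (offset, length) of the Thrift-serialized footer inside `bytes`. */
  def footerRange(bytes: Array[Byte]): (Int, Int) = {
    // The file tail is: [footer][4-byte little-endian footer length]["PAR1"].
    val tailLen = 4 + Magic.length
    require(bytes.length >= Magic.length + tailLen, "too short to be a Parquet file")
    require(bytes.takeRight(Magic.length).sameElements(Magic), "missing trailing PAR1 magic")

    // Read the footer length stored just before the trailing magic.
    val footerLen = ByteBuffer
      .wrap(bytes, bytes.length - tailLen, 4)
      .order(ByteOrder.LITTLE_ENDIAN)
      .getInt

    // The footer must fit between the leading magic and the tail.
    val offset = bytes.length - tailLen - footerLen
    require(footerLen >= 0 && offset >= Magic.length, s"corrupt footer length: $footerLen")
    (offset, footerLen)
  }
}
```

With the range in hand, something like `Util.readFileMetaData(new java.io.ByteArrayInputStream(bytes, offset, length))` should see only the footer bytes, and the Thrift-generated getter for `num_rows` (assumed here to be `getNum_rows`) would then return the count. This has not been tested against a real `_metadata` file.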
On Jul 26, 2014, at 6:50 PM, JULIEN LE DEM wrote:
Hi Brandon,
You could probably make a copy of the Thrift definition and keep only the
fields you need.
If you use the classes generated from that trimmed definition to read the
metadata, Thrift will skip all the other fields.
Julien
On Jul 26, 2014, at 12:16 AM, Brandon Amos wrote:
Hi Parquet team,
I apologize for the simple question, but I'm using Parquet on HDFS in
a Scala/Spark application and am having trouble efficiently
obtaining the number of rows in my Parquet data stores without
loading and counting.
The README at https://github.com/apache/incubator-parquet-format
has great information about the format of the metadata,
and I want to extract the `num_rows` field from the
`FileMetaData` Thrift object.
However, the `_metadata` file contained in Parquet data stores
contains many Thrift objects and other information
in addition to the `FileMetaData` object that I want to extract.
Can anybody give recommendations on how I can most efficiently
extract the `num_rows` field?
Thanks,
Brandon.