[
https://issues.apache.org/jira/browse/PARQUET-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787369#comment-17787369
]
ASF GitHub Bot commented on PARQUET-2374:
-----------------------------------------
parthchandra commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816986727
@steveloughran I did look into leveraging Hadoop IO statistics, but my first
attempt did not work well and I thought a simpler initial implementation
would be more useful. Once we move to the Hadoop vector IO API, I'll take
another stab at it.
> What would be good is if these stats were set up to
>
> take maps of key-value rather than a fixed enum
The fixed enum here is simply the Parquet file reader declaring the values it
knows about. This implementation is not really collecting or aggregating
anything; it simply records the times and counts and passes them on.
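To make the split concrete, here is a minimal sketch of that shape: a reader-side enum naming the known values, a callback interface the reader reports through, and a trivial caller-side recorder. All names (`ReadMetric`, `ParquetMetricsCallback`, `SummingCallback`) are illustrative assumptions, not the actual parquet-mr API.

```java
import java.util.EnumMap;
import java.util.Map;

public class MetricsCallbackSketch {
    // Hypothetical enum: the fixed set of values the reader knows about.
    enum ReadMetric { READ_TIME_NANOS, READ_BYTES, SEEK_COUNT }

    // Hypothetical callback: the reader only reports raw values;
    // aggregation is left entirely to the caller.
    interface ParquetMetricsCallback {
        void record(ReadMetric metric, long value);
    }

    // A trivial caller-side implementation that sums values per metric.
    static class SummingCallback implements ParquetMetricsCallback {
        final Map<ReadMetric, Long> totals = new EnumMap<>(ReadMetric.class);

        @Override
        public void record(ReadMetric metric, long value) {
            totals.merge(metric, value, Long::sum);
        }
    }

    public static void main(String[] args) {
        SummingCallback cb = new SummingCallback();
        cb.record(ReadMetric.READ_BYTES, 4096);
        cb.record(ReadMetric.READ_BYTES, 1024);
        System.out.println(cb.totals.get(ReadMetric.READ_BYTES)); // prints 5120
    }
}
```

The point of the enum is discoverability, not aggregation policy: the reader advertises what it can report, and the callback owner decides what to do with each value.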
> collect those min/mean/max as well as counts.
The implementation of the Parquet metrics callback will do that. If the
execution engine is Spark, for example, it can simply take the values and feed
them into its own metrics collection subsystem, which then computes the
min/max/mean.
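For illustration, an engine-side callback implementation could compute min/mean/max from the raw values on its own, along these lines (the `Stat` class is a made-up stand-in for whatever the engine's metrics subsystem provides):

```java
public class AggregatingCallbackSketch {
    // Hypothetical engine-side accumulator: tracks count, sum, min, max
    // over the raw values a reader-side callback passes in.
    static class Stat {
        long count = 0, sum = 0;
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;

        void add(long v) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        double mean() {
            return count == 0 ? 0.0 : (double) sum / count;
        }
    }

    public static void main(String[] args) {
        Stat readTimeMillis = new Stat();
        for (long v : new long[] {120, 80, 100}) {
            readTimeMillis.add(v);
        }
        System.out.println(
            readTimeMillis.min + " " + readTimeMillis.mean() + " " + readTimeMillis.max);
        // prints 80 100.0 120
    }
}
```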
> somehow provided a plugin point where we could add something to add any of
> the parquet reader/writer stats to the thread context - trying to collect stats
> from inside wrapped-many-times-over streams and iterators is way too complex. I
> know, I have a branch of parquet where I tried that...
Hmm, that will take some work. I wanted to measure streaming decompression
time (where the `decompress` call simply returns a stream that is decompressed
as it is read), but found it required too many breaking changes to implement.
A standard mechanism like `IOStatistics`, where such a stream is an
`IOStatisticsSource`, would be perfect for that.
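A rough sketch of that idea: a stream wrapper that times the reads it forwards and exposes the totals through a statistics accessor, so streaming decompression cost can be observed without changing call sites. The `StatsSource` interface below is a self-contained stand-in modeled loosely on Hadoop's `IOStatisticsSource` pattern, not the real Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

public class StatStreamSketch {
    // Stand-in for Hadoop's IOStatistics/IOStatisticsSource pattern:
    // anything that can hand back a snapshot of named counters.
    interface StatsSource {
        Map<String, Long> getStats();
    }

    // Wraps any InputStream (e.g. a decompressing one) and measures the
    // bytes and wall time spent in the reads it forwards.
    static class TimedStream extends FilterInputStream implements StatsSource {
        private long bytesRead = 0;
        private long readNanos = 0;

        TimedStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            long t0 = System.nanoTime();
            int n = super.read(b, off, len);
            readNanos += System.nanoTime() - t0;
            if (n > 0) {
                bytesRead += n;
            }
            return n;
        }

        @Override
        public Map<String, Long> getStats() {
            return Map.of("stream_read_bytes", bytesRead,
                          "stream_read_nanos", readNanos);
        }
    }

    public static void main(String[] args) throws IOException {
        TimedStream s = new TimedStream(new ByteArrayInputStream(new byte[128]));
        byte[] buf = new byte[64];
        while (s.read(buf, 0, buf.length) != -1) {
            // drain the stream
        }
        System.out.println(s.getStats().get("stream_read_bytes")); // prints 128
    }
}
```

The attraction of this shape is exactly what the comment describes: the caller never has to reach inside wrapped-many-times-over streams, it just asks the outermost stream for its statistics.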
> Add metrics support for parquet file reader
> -------------------------------------------
>
> Key: PARQUET-2374
> URL: https://issues.apache.org/jira/browse/PARQUET-2374
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.1
> Reporter: Parth Chandra
> Priority: Major
>
> ParquetFileReader is used by many engines, Hadoop and Spark among them. These
> engines report various metrics to measure performance in different
> environments, and it is usually useful to be able to get low-level metrics out
> of the file readers and writers.
> It would be very useful to allow a simple interface to report the metrics.
> Callers can then implement the interface to record the metrics in any
> subsystem they choose.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)