[
https://issues.apache.org/jira/browse/PARQUET-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788940#comment-17788940
]
ASF GitHub Bot commented on PARQUET-2374:
-----------------------------------------
parthchandra commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1823714212
> > For the object stores, things to measure are
> >
> > * time to open() and close() a file
> > * time for a read after a backwards seek
> > * time for a read after a forwards seek.
> > * how many reads actually took place
> > * for vector IO, whatever gets picked up there
> > * were errors reported and retried, or throttling events
> > * number of underlying GET requests
>
> CMIW, it seems that these stats can be collected solely at the input
> stream level.
Yes, they are best collected by the file system client API. However, it would
be nice to be able to hook all of these metrics up together. Then we could, for
instance, show a single Spark scan operator that displays stats for the
operator, the parquet reader, and the input stream in one place.
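The "one place" idea above could be sketched as a single metrics sink shared across the layers, with each layer recording under its own scope. All names here (`MetricsSink`, `CombinedMetrics`, `ScopedSink`) are hypothetical, not actual parquet-mr or Spark APIs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical shared sink; every layer records into the same instance. */
interface MetricsSink {
    void add(String name, long delta);
}

/** Collects everything in one map so a scan operator can display it all. */
class CombinedMetrics implements MetricsSink {
    final Map<String, AtomicLong> values = new ConcurrentHashMap<>();

    @Override
    public void add(String name, long delta) {
        values.computeIfAbsent(name, k -> new AtomicLong()).addAndGet(delta);
    }

    long get(String name) {
        AtomicLong v = values.get(name);
        return v == null ? 0L : v.get();
    }
}

/** Wrapper that prefixes metric names with the layer that emitted them. */
class ScopedSink implements MetricsSink {
    private final String scope;
    private final MetricsSink delegate;

    ScopedSink(String scope, MetricsSink delegate) {
        this.scope = scope;
        this.delegate = delegate;
    }

    @Override
    public void add(String name, long delta) {
        delegate.add(scope + "." + name, delta);
    }
}
```

The scan operator would own the `CombinedMetrics`, hand `new ScopedSink("reader", metrics)` to the parquet reader and `new ScopedSink("stream", metrics)` to the input stream, then read the whole map back for display.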
> Add metrics support for parquet file reader
> -------------------------------------------
>
> Key: PARQUET-2374
> URL: https://issues.apache.org/jira/browse/PARQUET-2374
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.1
> Reporter: Parth Chandra
> Priority: Major
>
> ParquetFileReader is used by many engines, Hadoop and Spark among them. These
> engines report various metrics to measure performance in different
> environments, and it is usually useful to be able to get low-level metrics out
> of the file readers and writers.
> It would be very useful to provide a simple interface for reporting the
> metrics. Callers can then implement the interface to record the metrics in any
> subsystem they choose.
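The description's "simple interface" could look something like the sketch below. The names (`ParquetMetricsCallback`, `MapMetricsCallback`) are illustrative assumptions for this issue, not the API that was eventually merged; default no-op methods let implementors record only what they care about.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical callback a caller would pass to the file reader.
 * Every method defaults to a no-op, so implementations stay small.
 */
interface ParquetMetricsCallback {
    default void addCounter(String name, long count) {}
    default void addTimer(String name, long nanos) {}
}

/** Example implementation recording into a plain map, e.g. to bridge into an engine's metric system. */
class MapMetricsCallback implements ParquetMetricsCallback {
    final Map<String, Long> metrics = new HashMap<>();

    @Override
    public synchronized void addCounter(String name, long count) {
        metrics.merge(name, count, Long::sum);
    }

    @Override
    public synchronized void addTimer(String name, long nanos) {
        metrics.merge(name, nanos, Long::sum);
    }
}
```

The reader would then call, say, `callback.addTimer("parquet.open.time", elapsedNanos)` at the appropriate points, and the engine decides where those numbers end up.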
--
This message was sent by Atlassian Jira
(v8.20.10#820010)