[ https://issues.apache.org/jira/browse/PARQUET-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787346#comment-17787346 ]
ASF GitHub Bot commented on PARQUET-2374:
-----------------------------------------
steveloughran commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816906884
it'd be really nice if somehow there was a way to push hadoop stream IOStats
here, especially the counters, min, max and mean maps:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/iostatistics.html
It's really interesting for the s3, azure and gcs clients, where we collect
stream-specific stuff, including things like: bytes discarded in seek, time for
GET, whether we did a HEAD first, and more. These are collected at the thread
level, but also include stats from helper threads such as those doing async
stream draining, vector IO...
It'd take a move to hadoop 3.3.1+ to embrace the API, but if there was a way
for something to publish stats to your metric collector, then maybe something
could be done.
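Something along these lines would do the bridging; a minimal sketch assuming
hadoop 3.3.1+ on the classpath, where the MetricsReporter interface is made up,
just a stand-in for whatever callback parquet ends up exposing:
```
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsSupport;
import org.apache.hadoop.fs.statistics.MeanStatistic;

public class StreamStatsPublisher {

  /** Hypothetical sink for the numbers; not a real parquet-mr API. */
  public interface MetricsReporter {
    void report(String name, long value);
  }

  /**
   * Pull the per-stream IOStatistics, if the stream exposes any, and
   * forward counters, minimums, maximums and means to the reporter.
   */
  public static void publish(FSDataInputStream in, MetricsReporter reporter) {
    // Returns null for streams which do not implement IOStatisticsSource.
    IOStatistics stats = IOStatisticsSupport.retrieveIOStatistics(in);
    if (stats == null) {
      return;
    }
    for (Map.Entry<String, Long> e : stats.counters().entrySet()) {
      reporter.report(e.getKey(), e.getValue());
    }
    for (Map.Entry<String, Long> e : stats.minimums().entrySet()) {
      reporter.report(e.getKey() + ".min", e.getValue());
    }
    for (Map.Entry<String, Long> e : stats.maximums().entrySet()) {
      reporter.report(e.getKey() + ".max", e.getValue());
    }
    for (Map.Entry<String, MeanStatistic> e : stats.meanStatistics().entrySet()) {
      MeanStatistic m = e.getValue();
      if (m.getSamples() > 0) {
        // mean reported as a truncated long: sum / samples
        reporter.report(e.getKey() + ".mean", m.getSum() / m.getSamples());
      }
    }
  }
}
```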
Tip: you can enable a dump of a filesystem's aggregate stats at process
shutdown for azure and s3a:
```
fs.iostatistics.logging.level=info
```
```
2023-11-17 18:30:28,634 [shutdown-hook-0] INFO  statistics.IOStatisticsLogging (IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics:
counters=((action_http_head_request=3)
(audit_request_execution=15)
(audit_span_creation=12)
(object_list_request=12)
(object_metadata_request=3)
(op_get_file_status=1)
(op_glob_status=1)
(op_list_status=9)
(store_io_request=15));
gauges=();
minimums=((action_http_head_request.min=22)
(object_list_request.min=25)
(op_get_file_status.min=1)
(op_glob_status.min=9)
(op_list_status.min=25));
maximums=((action_http_head_request.max=41)
(object_list_request.max=398)
(op_get_file_status.max=1)
(op_glob_status.max=9)
(op_list_status.max=408));
means=((action_http_head_request.mean=(samples=3, sum=87, mean=29.0000))
(object_list_request.mean=(samples=12, sum=708, mean=59.0000))
(op_get_file_status.mean=(samples=1, sum=1, mean=1.0000))
(op_glob_status.mean=(samples=1, sum=9, mean=9.0000))
(op_list_status.mean=(samples=9, sum=814, mean=90.4444)));
```
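If a shutdown hook isn't the right place, the same dump can be pulled on
demand; a sketch assuming hadoop 3.3.1+ and an SLF4J logger, nothing
parquet-specific in it:
```
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.statistics.IOStatisticsLogging;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StatsDump {
  private static final Logger LOG = LoggerFactory.getLogger(StatsDump.class);

  public static void dump(FSDataInputStream in) {
    // ioStatisticsSourceToString is robust: it returns an empty string
    // if the stream doesn't expose IOStatistics at all.
    LOG.info("stream statistics: {}",
        IOStatisticsLogging.ioStatisticsSourceToString(in));
  }
}
```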
> Add metrics support for parquet file reader
> -------------------------------------------
>
> Key: PARQUET-2374
> URL: https://issues.apache.org/jira/browse/PARQUET-2374
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.1
> Reporter: Parth Chandra
> Priority: Major
>
> ParquetFileReader is used by many engines, Hadoop and Spark among them. These
> engines report various metrics to measure performance in different
> environments, and it is usually useful to be able to get low-level metrics out
> of the file readers and writers.
> It would be very useful to provide a simple interface for reporting these
> metrics. Callers can then implement the interface to record the metrics in any
> subsystem they choose.
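To make the idea concrete, here is a rough sketch of what such a reporting
interface could look like. The names are illustrative only, not the API
proposed in the linked PR:
```
// Purely illustrative: one possible shape for a caller-supplied metrics hook.
// None of these names come from parquet-mr; they only show the idea of the
// reader pushing low-level counters and timings to the calling engine.
public interface ParquetReaderMetricsCallback {

  /** Increment a named counter, e.g. pages read or bytes fetched. */
  void addCounter(String name, long delta);

  /** Record the duration of a named operation, in nanoseconds. */
  void addDuration(String name, long durationNanos);
}
```
An engine such as Spark could back this with its own metrics registry or
accumulators, while a no-op default implementation keeps existing callers
unchanged.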