[ https://issues.apache.org/jira/browse/HADOOP-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222766#comment-16222766 ]

Steve Loughran commented on HADOOP-14973:
-----------------------------------------

Sorry: I apologise to Sean and anyone else who took my comment as anything 
other than lighthearted. Alongside discouraging any attempts to bypass 
metrics2, I was just trying to point out that the cloud infrastructure itself 
has lots of logging, though it splits things up in different ways. You can get 
a view of what load a bucket is getting, and with User-Agent settings you can 
see which jobs are generating it, but you don't see how a specific job 
interacts with different buckets: how, say, layout problems in bucket 1 are 
slowing reads down so much that writing the output to bucket 2 suffers. You 
really need both, and for a lot of the query work you need the stats assigned 
to a specific job, not just collected on a process somewhere on a VM which may 
no longer exist by the time you get round to wondering why things were slow.
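
As an aside, the User-Agent tagging I mention is just client-side 
configuration. A minimal sketch, assuming the deployed Hadoop version supports 
the fs.s3a.user.agent.prefix option (the bucket and prefix values here are 
made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TaggedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Prefix the User-Agent header on every S3 request, so the S3
    // server access logs can attribute bucket load to this job.
    conf.set("fs.s3a.user.agent.prefix", "bulkdata-landsat-filter");
    FileSystem fs = new Path("s3a://mybucket/").getFileSystem(conf);
    System.out.println("Tagged client for " + fs.getUri());
  }
}
{code}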

Here are some stats collected in the committer, aggregated across all nodes in 
a job (here: 1) and then saved into the _SUCCESS file. They're per process, so 
if you run >1 executor in a process, in parallel or sequentially, you get 
duplicate numbers and the totals go wrong. And I'm not yet properly propagating 
the summary of input/output streams to FileSystem.StreamStatistics.

{code}
2017-10-27 19:22:26,412 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  
s3.S3AOperations (Logging.scala:logInfo(54)) - Success data at 
s3a://hwdev-steve-ireland-new/cloud-integration/DELAY_LISTING_ME/S3ACommitBulkDataSuite/bulkdata/output/landsat/orc/filtered/_SUCCESS
 : SuccessData{committer='PartitionedStagingCommitter', 
hostname='HW13176.cotham.uk', description='Task committer 
attempt_20171027192111_0000_m_000000_0', date='Fri Oct 27 19:22:25 BST 2017', 
filenames=[/cloud-integration/DELAY_LISTING_ME/S3ACommitBulkDataSuite/bulkdata/output/landsat/orc/filtered/part-00000-07c507f1-5fb9-4f7b-b9ef-cdddcaf9eb29-c000-local-1509128457802.snappy.orc]}
2017-10-27 19:22:26,412 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  
s3.S3AOperations (Logging.scala:logInfo(54)) - Metrics:
  S3guard_metadatastore_put_path_latency50thPercentileLatency = 0
  S3guard_metadatastore_put_path_latency75thPercentileLatency = 0
  S3guard_metadatastore_put_path_latency90thPercentileLatency = 0
  S3guard_metadatastore_put_path_latency95thPercentileLatency = 0
  S3guard_metadatastore_put_path_latency99thPercentileLatency = 0
  S3guard_metadatastore_put_path_latencyNumOps = 0
  S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz) = 0
  S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz) = 0
  S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz) = 0
  S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz) = 0
  S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz) = 0
  S3guard_metadatastore_throttle_rateNumEvents = 0
  committer_bytes_committed = 2005091
  committer_bytes_uploaded = 2005091
  committer_commits_aborted = 0
  committer_commits_completed = 1
  committer_commits_created = 1
  committer_commits_failed = 0
  committer_commits_reverted = 0
  committer_jobs_completed = 1
  committer_jobs_failed = 0
  committer_tasks_completed = 1
  committer_tasks_failed = 0
  directories_created = 1
  directories_deleted = 0
  fake_directories_deleted = 11
  files_copied = 0
  files_copied_bytes = 0
  files_created = 0
  files_deleted = 117
  ignored_errors = 2
  object_continue_list_requests = 0
  object_copy_requests = 0
  object_delete_requests = 3
  object_list_requests = 6
  object_metadata_requests = 10
  object_multipart_aborted = 0
  object_put_bytes = 2005091
  object_put_bytes_pending = 0
  object_put_requests = 2
  object_put_requests_active = 0
  object_put_requests_completed = 2
  op_copy_from_local_file = 0
  op_exists = 1
  op_get_file_status = 4
  op_glob_status = 0
  op_is_directory = 0
  op_is_file = 0
  op_list_files = 0
  op_list_located_status = 0
  op_list_status = 0
  op_mkdirs = 0
  op_rename = 0
  s3guard_metadatastore_initialization = 0
  s3guard_metadatastore_put_path_request = 1
  s3guard_metadatastore_retry = 0
  s3guard_metadatastore_throttled = 0
  store_io_throttled = 0
  stream_aborted = 0
  stream_backward_seek_operations = 0
  stream_bytes_backwards_on_seek = 0
  stream_bytes_discarded_in_abort = 0
  stream_bytes_read = 0
  stream_bytes_read_in_close = 0
  stream_bytes_skipped_on_seek = 0
  stream_close_operations = 0
  stream_closed = 0
  stream_forward_seek_operations = 0
  stream_opened = 0
  stream_read_exceptions = 0
  stream_read_fully_operations = 0
  stream_read_operations = 0
  stream_read_operations_incomplete = 0
  stream_seek_operations = 0
  stream_write_block_uploads = 0
  stream_write_block_uploads_aborted = 0
  stream_write_block_uploads_active = 0
  stream_write_block_uploads_committed = 0
  stream_write_block_uploads_data_pending = 0
  stream_write_block_uploads_pending = 0
  stream_write_failures = 0
  stream_write_total_data = 0
  stream_write_total_time = 0
{code}

The idea here is to show the summary stats of a job, so you can say "it took X 
minutes, but you got a lot of 503 throttle events back, and we were doing too 
many aborted reads": the kind of thing you can maybe tune in your app.
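
Reading those numbers back in downstream tooling is straightforward as long as 
the _SUCCESS payload stays JSON-serialized as above. A hypothetical sketch; the 
"metrics" field name is an assumption, and the counter names are copied from 
the dump:
{code}
import java.io.InputStream;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessFileCheck {
  public static void main(String[] args) throws Exception {
    Path success = new Path(args[0]);  // e.g. .../output/_SUCCESS
    FileSystem fs = success.getFileSystem(new Configuration());
    try (InputStream in = fs.open(success)) {
      JsonNode root = new ObjectMapper().readTree(in);
      // "metrics" is an assumed name for the counter map.
      JsonNode metrics = root.path("metrics");
      long throttled = metrics.path("s3guard_metadatastore_throttled").asLong(0);
      long failed = metrics.path("committer_commits_failed").asLong(0);
      System.out.printf("throttle events=%d, failed commits=%d%n",
          throttled, failed);
    }
  }
}
{code}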

And this is the end of the test, with a comparison against all the values at 
the start of the run; only the changed values are shown:
{code}
2017-10-27 19:24:05,770 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  
commit.S3ACommitBulkDataSuite (Logging.scala:logInfo(54)) - S3 Statistics 
  [object_put_requests_completed = 117]
  [op_get_file_status = 666]
  [object_list_requests = 363]
  [object_metadata_requests = 915]
  [op_exists = 10]
  [object_multipart_initiated = 113]
  [op_list_status = 113]
  [object_put_bytes = 7843270]
  [object_put_requests = 117]
  [object_delete_requests = 117]
  [op_list_files = 1]
{code}
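
For anyone wanting to reproduce that kind of diff, here's a rough sketch 
against the public FileSystem.getStorageStatistics() API; the class and method 
names below, other than the Hadoop ones, are mine:
{code}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.StorageStatistics;
import org.apache.hadoop.fs.StorageStatistics.LongStatistic;

public class StatsDiff {

  /** Copy every long statistic of a filesystem into a map. */
  public static Map<String, Long> snapshot(FileSystem fs) {
    Map<String, Long> values = new HashMap<>();
    StorageStatistics stats = fs.getStorageStatistics();
    for (Iterator<LongStatistic> it = stats.getLongStatistics(); it.hasNext();) {
      LongStatistic s = it.next();
      values.put(s.getName(), s.getValue());
    }
    return values;
  }

  /** Print only the statistics which changed since the snapshot. */
  public static void logChanged(Map<String, Long> before, FileSystem fs) {
    for (Map.Entry<String, Long> e : snapshot(fs).entrySet()) {
      long old = before.getOrDefault(e.getKey(), 0L);
      if (e.getValue() != old) {
        System.out.printf("  [%s = %d]%n", e.getKey(), e.getValue() - old);
      }
    }
  }
}
{code}
Take a snapshot before the run, then call logChanged() at the end; that is 
essentially what the test log above is showing.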

Where this is weak is that it's across all s3a instances in a process. What we 
really want is per-thread values which you can then tie back to a specific 
piece of work, either under the hood via a _SUCCESS file, or explicitly, with 
the execution frameworks asking the filesystems for them (more specifically: 
asking the committers to return values) and then pulling them in. For that we 
need:

* per-thread stat collection (sketched below)
* proper collection of stats from IO streams &c
* committers to collect and return them
* for S3: hooking up to the metrics which the SDK itself collects
* wasb to do the same
* the query engines to collect and aggregate
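
To be clear, none of that exists yet. As an illustration only, per-thread 
collection could be as simple as thread-local counter maps which a committer 
drains at task commit; everything in this sketch is hypothetical:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

public class ThreadLocalStats {

  // One counter map per thread, so counters can be tied back to the
  // task that thread was running.
  private static final ThreadLocal<Map<String, LongAdder>> STATS =
      ThreadLocal.withInitial(HashMap::new);

  /** Increment a named counter for the current thread only. */
  public static void increment(String name, long delta) {
    STATS.get().computeIfAbsent(name, k -> new LongAdder()).add(delta);
  }

  /** Return and clear this thread's counters, e.g. at task commit. */
  public static Map<String, Long> drain() {
    Map<String, LongAdder> current = STATS.get();
    Map<String, Long> out = new HashMap<>();
    current.forEach((k, v) -> out.put(k, v.sum()));
    current.clear();
    return out;
  }
}
{code}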

I think that's fairly ambitious, but it's where I want things to aim: the 
object store clients collecting data which can be aggregated across the 
cluster and tied back to the work itself.


> [s3a] Log StorageStatistics
> ---------------------------
>
>                 Key: HADOOP-14973
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14973
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1, 2.8.1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>
> S3A is currently storing much more detailed metrics via StorageStatistics 
> than are logged in a MapReduce job. Eventually, it would be nice to get 
> Spark, MapReduce and other workloads to retrieve and store these metrics, but 
> it may be some time before they all do that. I'd like to consider having S3A 
> publish the metrics itself in some form. This is tricky, as S3A has no daemon 
> but lives inside various other processes.
> Perhaps writing to a log file at some configurable interval and on close() 
> would be the best we could do. Other ideas would be welcome.


