[ https://issues.apache.org/jira/browse/NUTCH-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-2909:
----------------------------------------
    Summary: Establish a metrics naming convention  (was: Standardize Nutch Metrics Counters)

> Establish a metrics naming convention
> -------------------------------------
>
>                 Key: NUTCH-2909
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2909
>             Project: Nutch
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 1.18
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.19
>
>
> I revisited Nutch metrics counters and put some
> [metrics documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics]
> together for others to consult should they wish.
> I thought a comprehensive collection of all Nutch Counters would be useful,
> so I put together a
> [metrics table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable].
> One (unintended) outcome was that it highlighted the variability in counter
> group names and metric names. For example
> *Metric Group*:
> * _CleaningJobStatus_ - upper camel case
> * _CrawlDB filter_ - inconsistent use of capitalization and space separated
> * N/A - the DomainStatistics counters don't belong to a metric group
> * _injector_ - lowercase, named after the encapsulating Class
> * _WebGraph.outlinks_ - inconsistent use of capitalization and period
> separated
> The *Metric Names* are much the same: pretty much all over the place.
> I am keen to bring some convention to the Nutch metrics definitions, but
> this is not all plain sailing. I understand that existing users may rely on
> the above metrics as they are, and changing the values would have impacts
> downstream.
> *PROPOSAL*
> I would like to discuss introducing a naming convention which follows some
> simple principles motivated by a
> [Datadog employee's response on SO|https://stackoverflow.com/a/18131221].
> As a take on that post, I want to propose the following
> {quote}
> 1. With regard to the *Metric Group*, the highest level of hierarchy is the
> product line or the process, i.e., _*nutch*_. The highest level of hierarchy
> is always lowercase.
> 2. The next level of hierarchy is the sub-component/tool, e.g.,
> *_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*,
> *_nutch.SitemapProcessor_*, etc. This constituent is exactly the name of the
> enclosing Class. This way it is really simple to trace the metric back to
> the Class in which it was defined.
> 3. The third level of the hierarchy is the metric group, which is a general
> grouping of functionality for the metric being defined, e.g.,
> *_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with
> words separated by underscores. If no obvious metric group exists, simply
> provide the enclosing Class in lowercase, e.g.,
> *_nutch.Injector.injector.urls_filtered_*
> 4. With regard to the *Metric Name*, the last level of hierarchy is the
> thing being measured, e.g., *_urls_filtered_*,
> *_above_exception_threshold_in_queue_*, etc. Everything is lowercase with
> words separated by underscores, the same as #3 above.
> Example complete metrics
> * *_nutch.Injector.injector.urls_filtered_*
> * *_nutch.ResolverThread.update_host_db.checked_hosts_*
> * *_nutch.WebGraph.outlinks.added_links_*
> {quote}
> It would be greatly appreciated if folks could chime in on the details of
> the proposal. I'm sure there are several areas which could be improved.
> I will mention that my specific driver for cleaning this up is that I would
> like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler
> subsystem is integrated with the rest of the subsystems I am responsible
> for. We use Splunk for that kind of thing. I intend to do that by
> integrating the
> [Java statsd client|https://github.com/DataDog/java-dogstatsd-client], but I
> feel that comes after we clean up metrics and establish a metrics naming
> convention.
> Thanks for any input.
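
To make the four levels of the proposal concrete, here is a minimal sketch of how such a name might be assembled in Java. Nothing here exists in Nutch: the class `MetricNames` and its methods `snakeCase` and `of` are hypothetical, and the rule used to split camel-case words is an assumption rather than part of the proposal.

```java
import java.util.Locale;

/** Hypothetical helper illustrating the proposed convention; not a Nutch API. */
final class MetricNames {

    /** Level 1: the product line, always lowercase. */
    private static final String ROOT = "nutch";

    private MetricNames() {}

    /**
     * Lowercases the input and joins its words with single underscores.
     * Splitting at lower-to-upper camel-case boundaries is an assumption.
     */
    static String snakeCase(String raw) {
        return raw.trim()
            .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // e.g. "JobStatus" -> "Job_Status"
            .toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9]+", "_")            // spaces, periods, etc. -> "_"
            .replaceAll("^_+|_+$", "");               // trim stray underscores
    }

    /**
     * Builds nutch.<Class>.<metric_group>.<metric_name>.
     *
     * @param className simple name of the enclosing Class, kept exactly as is (level 2)
     * @param group     metric group, or null/blank if no obvious group exists (level 3)
     * @param name      the thing being measured (level 4)
     */
    static String of(String className, String group, String name) {
        String groupPart = (group == null || group.trim().isEmpty())
            ? className.toLowerCase(Locale.ROOT)      // fall back to the Class in lowercase
            : snakeCase(group);
        return ROOT + "." + className + "." + groupPart + "." + snakeCase(name);
    }
}
```

With this sketch, `MetricNames.of("Injector", null, "urls_filtered")` gives `nutch.Injector.injector.urls_filtered`, and an existing free-form counter name such as "added links" under `WebGraph`/`outlinks` is normalized to `nutch.WebGraph.outlinks.added_links`.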
-- This message was sent by Atlassian Jira (v8.20.1#820001)