[ 
https://issues.apache.org/jira/browse/NUTCH-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2909:
----------------------------------------
    Summary: Establish a metrics naming convention  (was: Standardize Nutch 
Metrics Counters)

> Establish a metrics naming convention
> -------------------------------------
>
>                 Key: NUTCH-2909
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2909
>             Project: Nutch
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 1.18
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.19
>
>
> I revisited Nutch metrics counters and put some [metrics 
> documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics] 
> together for others to consult should they wish.
> I thought a comprehensive collection of all Nutch Counters would be useful so 
> I put together a [metrics 
> table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable].
> One of the (unintended) outcomes was that it highlighted the variability 
> in counter group names and metric names. For example:
> *Metric Group*:
> * _CleaningJobStatus_ - upper camel case
> * _CrawlDB filter_ - inconsistent use of capitalization and space separated
> * N/A - the DomainStatistics counters don't belong to a metric group
> * _injector_ - lowercase named after the encapsulating Class
> * _WebGraph.outlinks_ - inconsistent use of capitalization and period 
> separated
> The *Metric Names* are basically the same... pretty much all over the place.
> I am keen to bring some convention to the Nutch metrics definitions but this 
> is not all plain sailing. I do understand that existing users may rely upon 
> the above metrics as they are and changing the names would have impacts 
> downstream.
> *PROPOSAL*
> I would like to discuss introducing a naming convention which follows some 
> simple principles motivated by a [Datadog employee's response on 
> SO|https://stackoverflow.com/a/18131221].
> As a take on that post, I want to propose the following
> {quote}
> 1. With regard to the *Metric Group*, the highest level of hierarchy is the 
> product line or the process, i.e., _*nutch*_. The highest level of hierarchy 
> is always lowercase.
> 2. The next level of hierarchy is the sub-component/tool, e.g., 
> *_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*, 
> *_nutch.SitemapProcessor_*, etc. This constituent is exactly the name of the 
> enclosing Class. This way it is really simple to trace the metric back to the 
> Class in which it was defined.
> 3. The third level of the hierarchy is the metric group, which is a general 
> grouping of functionality for the metric being defined, e.g., 
> *_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with 
> words separated by underscores. If no obvious metric group exists, simply 
> provide the name of the enclosing Class in lowercase, e.g., 
> *_nutch.Injector.injector.urls_filtered_*.
> 4. With regard to the *Metric Name*, the last level of hierarchy is the 
> thing being measured, e.g., *_urls_filtered_*, 
> *_above_exception_threshold_in_queue_*, etc. Everything is lowercase with 
> words separated by underscores, the same as #3 above.
> Example complete metrics
> * *_nutch.Injector.injector.urls_filtered_*
> * *_nutch.ResolverThread.update_host_db.checked_hosts_*
> * *_nutch.WebGraph.outlinks.added_links_*
> {quote}
> It would be greatly appreciated if folks could chime in on the details of the 
> proposal. I'm sure there are several areas which could be improved. 
> I will mention that my specific driver for cleaning this up is that I would 
> like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler 
> subsystem will be integrated with all the rest of the subsystems I am 
> responsible for. We use Splunk for that kind of thing. I intend to do that by 
> integrating the [Java statsd 
> client|https://github.com/DataDog/java-dogstatsd-client], but I feel that 
> should come after we clean up metrics and establish a metrics naming convention.
> Thanks for any input. 
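
To make the four-level convention in the proposal concrete, here is a minimal Java sketch of how a fully qualified metric name could be assembled. The `MetricNames` class and its `normalize` and `metricName` methods are hypothetical names chosen for illustration; they are not part of Nutch, and the normalization rule (lowercase, any run of non-alphanumeric characters becomes one underscore) is an assumption about how "lowercase with words separated by underscores" might be enforced.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Nutch) illustrating the proposed naming convention. */
final class MetricNames {

  /** Level 1: the product line, always lowercase. */
  static final String PRODUCT = "nutch";

  private MetricNames() {}

  /**
   * Assumed normalization: lowercase the label and collapse every run of
   * non-alphanumeric characters into a single underscore. Camel case is not
   * split, so "CleaningJobStatus" becomes "cleaningjobstatus".
   */
  static String normalize(String label) {
    return label.trim()
        .toLowerCase(Locale.ROOT)
        .replaceAll("[^a-z0-9]+", "_")
        .replaceAll("^_+|_+$", "");
  }

  /**
   * Builds product.Class.metric_group.metric_name.
   *
   * @param className simple name of the enclosing class, kept exactly as is (level 2)
   * @param group metric group (level 3); if null or blank, the class name in lowercase is used
   * @param name the thing being measured (level 4)
   */
  static String metricName(String className, String group, String name) {
    String effectiveGroup = (group == null || group.trim().isEmpty()) ? className : group;
    return String.join(".", PRODUCT, className, normalize(effectiveGroup), normalize(name));
  }
}
```

Under these assumptions, `MetricNames.metricName("Injector", null, "urls_filtered")` falls back to the class name for the group, and `MetricNames.metricName("WebGraph", "outlinks", "added links")` turns the space-separated legacy name into an underscore-separated one. Whether legacy names should be rewritten automatically like this or mapped by hand is an open question for the discussion.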



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
