Hi Mike,

Sorry, I got swamped with work and didn’t get a chance to reply.

I misunderstood what you were trying to do. I thought you were just looking to 
create custom metrics rather than surface the existing Hadoop OutputFormat counters.

I’m not familiar enough with the Hadoop APIs, but I think it would require a 
change to the SparkHadoopWriter class 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala), 
since it creates the JobContext needed to read the counters. It could then 
publish those counters to the Spark metrics system.
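
Just to sketch the idea (untested, and the names here are made up): once the 
writer has a handle on the job's Counters, it could wrap them in a Source so the 
existing Dropwizard registry and sinks pick them up, roughly:

    package org.apache.spark.metrics.source

    import scala.collection.JavaConverters._
    import com.codahale.metrics.{Gauge, MetricRegistry}
    import org.apache.hadoop.mapreduce.Counters

    // Hypothetical: expose Hadoop output format counters as Dropwizard gauges.
    // How SparkHadoopWriter actually gets at the Counters object is the part
    // I'm hand-waving over.
    class HadoopCountersSource(counters: Counters) extends Source {
      override val sourceName = "HadoopOutputFormatCounters"
      override val metricRegistry = new MetricRegistry

      for (group <- counters.asScala; counter <- group.asScala) {
        metricRegistry.register(
          MetricRegistry.name(group.getName, counter.getName),
          new Gauge[Long] { override def getValue: Long = counter.getValue })
      }
    }

Something like SparkEnv.get.metricsSystem.registerSource(new HadoopCountersSource(counters)) 
after the write commits would then make them visible to whatever sinks are configured.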

I would suggest going ahead and submitting a JIRA request if there isn’t one 
already.

Thanks,
Silvio

From: Mike Sukmanowsky <mike.sukmanow...@gmail.com>
Date: Friday, March 25, 2016 at 10:48 AM
To: Silvio Fiorito <silvio.fior...@granturing.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark Metrics Framework?

Pinging again - any thoughts?

On Wed, 23 Mar 2016 at 09:17 Mike Sukmanowsky <mike.sukmanow...@gmail.com> wrote:
Thanks Ted and Silvio. I think I'll need a bit more hand-holding here, sorry. 
The way we use ES Hadoop is from PySpark via 
org.elasticsearch.hadoop.mr.EsOutputFormat in a saveAsNewAPIHadoopFile call. 
Given the Hadoop interop, I wouldn't assume that the EsOutputFormat class 
(https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/mr/EsOutputFormat.java) 
could be modified to define a new Source and register it via 
MetricsSystem.createMetricsSystem. This actually feels like a good feature 
request for Spark: "Support Hadoop Counters in Input/OutputFormats as Spark 
metrics", but I wanted some feedback first to see if that makes sense.

That said, some of the custom RDD classes 
(https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/core/main/scala/org/elasticsearch/spark/rdd) 
could probably be modified to register a new Source when they read from or 
write to Elasticsearch.

On Tue, 22 Mar 2016 at 15:17 Silvio Fiorito <silvio.fior...@granturing.com> wrote:
Hi Mike,

It’s been a while since I worked on a custom Source, but I think all you need to 
do is put your Source in the org.apache.spark package.
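
From memory, something along these lines (the class and metric names are made up):

    package org.apache.spark.metrics.source  // puts us inside the spark package

    import com.codahale.metrics.MetricRegistry

    // Hypothetical application-level source; register it once per driver/executor.
    class MyAppSource extends Source {
      override val sourceName = "myApp"
      override val metricRegistry = new MetricRegistry

      val recordsWritten = metricRegistry.counter(MetricRegistry.name("recordsWritten"))
    }

    // Somewhere in your job:
    //   val source = new MyAppSource
    //   SparkEnv.get.metricsSystem.registerSource(source)
    //   source.recordsWritten.inc()

The configured sinks should then report it along with the built-in sources.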

Thanks,
Silvio

From: Mike Sukmanowsky <mike.sukmanow...@gmail.com>
Date: Tuesday, March 22, 2016 at 3:13 PM
To: Silvio Fiorito <silvio.fior...@granturing.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark Metrics Framework?

The Source class is private 
(https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/metrics/source/Source.scala#L22-L25) 
to the spark package, and any new Sources added to the metrics registry must be 
of type Source 
(https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L144-L152). 
So unless I'm mistaken, we can't define a custom source. I linked to 1.4.1 
code, but the same is true in 1.6.1.

On Mon, 21 Mar 2016 at 12:05 Silvio Fiorito <silvio.fior...@granturing.com> wrote:
You could use the metric sources and sinks described here: 
http://spark.apache.org/docs/latest/monitoring.html#metrics

If you want to push the metrics to another system you can define a custom sink. 
You can also extend the metrics by defining a custom source.
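
A custom sink would be similar: Spark instantiates it reflectively, so (from 
memory, treat this as a sketch with made-up names) it looks roughly like the 
class below, and you'd point metrics.properties at it with a line such as 
*.sink.logging.class=org.apache.spark.metrics.sink.LoggingSink:

    package org.apache.spark.metrics.sink  // Sink is also package-private

    import java.util.Properties
    import java.util.concurrent.TimeUnit

    import com.codahale.metrics.{MetricRegistry, Slf4jReporter}
    import org.apache.spark.SecurityManager

    // Hypothetical sink that periodically logs every metric via SLF4J.
    // Spark constructs sinks with (Properties, MetricRegistry, SecurityManager).
    class LoggingSink(
        val property: Properties,
        val registry: MetricRegistry,
        securityMgr: SecurityManager) extends Sink {

      private val reporter = Slf4jReporter.forRegistry(registry)
        .convertRatesTo(TimeUnit.SECONDS)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build()

      override def start(): Unit = reporter.start(10, TimeUnit.SECONDS)
      override def stop(): Unit = reporter.stop()
      override def report(): Unit = reporter.report()
    }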

From: Mike Sukmanowsky <mike.sukmanow...@gmail.com>
Date: Monday, March 21, 2016 at 11:54 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Spark Metrics Framework?

We make extensive use of the elasticsearch-hadoop library for Hadoop/Spark. In 
trying to troubleshoot our Spark applications, it'd be very handy to have 
access to some of the many metrics 
(https://www.elastic.co/guide/en/elasticsearch/hadoop/current/metrics.html) 
that the library makes available when running in MapReduce mode. The library's 
author noted (https://discuss.elastic.co/t/access-es-hadoop-stats-from-spark/44913) 
that Spark doesn't offer a similar metrics API through which these metrics 
could be reported or aggregated.

Are there any plans to bring a metrics framework similar to Hadoop's Counter 
system to Spark, or is there an alternative way for us to grab the metrics 
exposed when using the Hadoop APIs to load/save RDDs?

Thanks,
Mike
