Re: Question on Metrics Server to Alibaba team

John Fang Tue, 22 Mar 2016 23:57:00 -0700


-----Original Message-----
From: John Fang [mailto:[email protected]]
Sent: March 23, 2016 14:39
To: [email protected]; 'Bobby Evans'
Subject: Re: Question on Metrics Server to Alibaba team


@Bobby Evans: The JStorm code has been through a lot of testing over the past
few years, especially for HA and scalability. We have done a lot of
optimization on metrics. The performance is better than Flink's in my tests. In
my personal opinion, the monitoring in JStorm offers a great deal of
information, and it can tell us where the bottleneck is when we run a topology.
The performance bottleneck may be in serialization, deserialization, Netty, the
executor, and so on. Of course, there are also other good monitoring systems in
the world, so I hope we can choose the better one before phase 2. I will start
studying Atlas. If it is better, I would be pleased to redesign the monitoring
on top of Atlas.
For my part, I think we had better make the monitoring a plugin.


Regards
       John Fang
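
To make the "monitoring as a plugin" suggestion concrete, here is a minimal Java sketch of what such a contract could look like. Everything in it (MetricsSink, InMemorySink, MetricsRegistry, the metric name) is hypothetical and chosen for illustration only; it is not the Storm IMetricsConsumer or JStorm MetricUploader API, whose actual signatures are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical plugin contract; not the Storm or JStorm API. */
interface MetricsSink {
    void report(String metricName, long timestampMs, double value);
}

/** Example backend that keeps only the latest value per metric in memory. */
class InMemorySink implements MetricsSink {
    final Map<String, Double> latest = new ConcurrentHashMap<>();

    @Override
    public void report(String metricName, long timestampMs, double value) {
        latest.put(metricName, value);
    }
}

/** Fans each data point out to whichever sinks were registered. */
class MetricsRegistry {
    private final List<MetricsSink> sinks = new ArrayList<>();

    void register(MetricsSink sink) {
        sinks.add(sink);
    }

    void publish(String metricName, long timestampMs, double value) {
        for (MetricsSink sink : sinks) {
            sink.report(metricName, timestampMs, value);
        }
    }
}

class PluginDemo {
    public static void main(String[] args) {
        InMemorySink sink = new InMemorySink();
        MetricsRegistry registry = new MetricsRegistry();
        registry.register(sink);
        // "bolt.execute.count" is an illustrative metric name only.
        registry.publish("bolt.execute.count", System.currentTimeMillis(), 42.0);
        System.out.println(sink.latest);
    }
}
```

With a contract like this, replacing the default backend with Atlas, HBase, or anything else would only mean registering a different MetricsSink implementation; the code that produces metrics would not change.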


-----Original Message-----
From: Bobby Evans [mailto:[email protected]]
Sent: March 22, 2016 22:36
To: [email protected]
Subject: Re: Question on Metrics Server to Alibaba team

My personal opinion is that we should not reinvent the wheel (aka distributed 
fault tolerant metrics) ourselves.  The local file blobstore with nimbus HA was 
a big enough pain to write and it is relatively simple in comparison.
If the JStorm code is simple and offers everything we need in terms of HA and 
scalability then I would be OK with it, but if it doesn't I would lean towards 
a different compatible open source solution. 

https://github.com/Netflix/atlas
looks very promising as a default option.  It is actively maintained by a group 
that I think has some of the best monitoring in the world.  And it is both java 
and apache compatible.  It has no histogram support that I could find, but that 
I don't see as being super critical.  The biggest drawback is there is little 
documentation on how to use it, to really be able to evaluate it for our needs. 
- Bobby 

    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <[email protected]> wrote:
 

 Harsha,

That's why I think new metric feature of JStorm looks promising.

According to design doc on https://issues.apache.org/jira/browse/STORM-1329,
there's no distinction between topology stat (which Apache Storm includes to 
worker heartbeat) and built-in metrics (which should be handled with separate 
consumer, as you stated).
All metrics are passed to Nimbus and Nimbus cached metrics, which implies we 
can treat all metrics as same, and we can also provide built-in metrics 
(including custom metrics) to users via REST API, too.

I thought about standalone metrics server process which handles whole metric 
works (maybe TopologyMaster + Nimbus on design doc), but if current 
implementation of metric feature on JStorm can take care of what I'm assuming, 
I guess it's great enough.

Since I don't know about TopologyMaster, I just wonder whether there are any 
SPOFs (including soft ones) and how metrics work if a component that is a SPOF 
goes down.
Since Cody gives digging point to take a look at, we can evaluate that feature 
before phase 2.

Thanks,
Jungtaek Lim (HeartSaVioR)
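
The concern in Harsha's message quoted below, that metrics collection should be asynchronous rather than a node in the topology DAG, can be illustrated with a small Java sketch. It assumes a bounded in-memory queue drained by a background thread; the class name AsyncMetricsReporter, its methods, and the string data-point format are all hypothetical and are not taken from Storm or JStorm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical asynchronous reporter; names are illustrative, not Storm API. */
class AsyncMetricsReporter implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Consumer<List<String>> backend;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread worker;
    private volatile boolean running = true;

    AsyncMetricsReporter(int capacity, Consumer<List<String>> backend) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.backend = backend;
        this.worker = new Thread(this::drainLoop, "metrics-reporter");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Called from the processing path: never blocks, drops when the queue is full. */
    void record(String dataPoint) {
        if (!queue.offer(dataPoint)) {
            dropped.incrementAndGet();
        }
    }

    long droppedCount() {
        return dropped.get();
    }

    /** Background loop: batches whatever is queued and hands it to the backend. */
    private void drainLoop() {
        List<String> batch = new ArrayList<>();
        while (running || !queue.isEmpty()) {
            try {
                String first = queue.poll(50, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                queue.drainTo(batch);
                backend.accept(new ArrayList<>(batch));
                batch.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Signals the worker to stop and waits until already-queued points are flushed. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The design choice here is that a slow backend (Ganglia, Graphite, and so on) costs dropped data points, counted in droppedCount(), instead of back pressure on the tuple-processing path. Whether dropping is acceptable is a policy decision this sketch does not settle.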

On Tue, Mar 22, 2016 at 1:36 AM, Harsha <[email protected]> wrote:

> One of the goals of this work, which can probably be addressed in a
> separate JIRA, is how the topology metrics reporter works. Today it's a
> bolt that's part of a topology graph, which means it's another node in the
> topology DAG that needs to be tuned for better performance. Some of our
> users took performance hits by deploying a topology metrics reporter
> that sends metrics to Ganglia. Ideally this collection should be
> asynchronous and not be a node in the topology DAG.
>
> Shipping a default metrics server along with a pluggable option for
> users who want to use Graphite or other timeline servers should be the
> goal.
>
> --Harsha
>
>
> On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > @Cody - The design looks good. Does the design allow to aggregate 
> > metrics at the task/executor level? Basically, number of distinct 
> > metrics is proportional to the number of distinct tasks, did you 
> > ever run into such a use case?
> >
> >
> > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere 
> > <[email protected]>
> > wrote:
> >
> > > Also, you can read the code from our latest release JStorm 2.1.1.
> > >
> > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere 
> > > <[email protected]>
> > > wrote:
> > >
> > > > @Jungtaek,
> > > > > We did some tests on codahale metrics; compared to
> > > > > meters/histograms, counters are quite fast. So we mainly focused
> > > > > on the optimization of meters and histograms (they are indeed
> > > > > very slow), including double sampling, changing the clock from ns
> > > > > (System.nanoTime) to ms, etc.
> > > > > You can take a look at the
> > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of
> > > > > our sequence-split-merge example code, as the client code entry
> > > > > to metrics.
> > > > > After that, you may dig into the TopologyMaster class, which is
> > > > > still part of a topology, then into TopologyMetricsRunnable, which
> > > > > is part of the nimbus server, and finally into the MetricUploader
> > > > > plugin, which is where the metrics interface with our "metrics
> > > > > server". There are still some nits in the code, but I think that
> > > > > should be no big problem.
> > > >
> > > > > I'd also like to point out that our "metrics server" is not
> > > > > strictly a real metrics server: since most of the duty lies on the
> > > > > nimbus server and topology master, it's more appropriate to call it
> > > > > metrics storage. The main reason for this is that we don't want to
> > > > > make a heavy-weight metrics server out of JStorm, and this makes it
> > > > > very easy for us to maintain (we have teams that specifically
> > > > > maintain HBase/OTS in Alibaba since they're so commonly used in
> > > > > production).
> > > >
> > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim 
> > > > <[email protected]>
> > > wrote:
> > > >
> > > > >> Thanks Cody and Bobby for the explanation.
> > > > >>
> > > > >> Cody,
> > > > >> I took a look at the design doc and it looks promising, especially
> > > > >> that it doesn't do sampling when the metric type is 'counter'. As
> > > > >> far as I heard (I didn't try it), it becomes a huge performance hit
> > > > >> in Apache Storm when we change the sample rate to 1.0.
> > > > >> Could you point me to the entry point of the metric feature in
> > > > >> JStorm to dig into?
> > > > >>
> > > > >> And just out of curiosity, did you consider extracting the metric
> > > > >> feature (which is done with TopologyMasters and Nimbuses) into a
> > > > >> separate component?
> > > > >> I understood your mention of 'metrics server' as a separate
> > > > >> component, but after seeing the design doc, the feature seems to be
> > > > >> implemented on Nimbus.
> > > >>
> > > >> Thanks,
> > > >> Jungtaek Lim (HeartSaVioR)
> > > >>
> > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > >> <[email protected]> wrote:
> > > >>
> > > > >> > JStorm has provided a MetricUploader interface, which is similar
> > > > >> > to IMetricsConsumer in Storm, and the underlying implementation
> > > > >> > is pluggable: you can use HBase, any other KV store that supports
> > > > >> > timeline queries, or even a database (maybe if it's a small
> > > > >> > cluster). We provide model classes in jstorm-core; as to what
> > > > >> > kinds of metrics data need to be stored, it's totally up to the
> > > > >> > detailed implementation. Our internal implementation uses OTS,
> > > > >> > which is a product of aliyun
> > > > >> > (https://www.aliyun.com/product/ots/), but it's easy to adapt to
> > > > >> > other implementations.
> > > >> >
> > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > >> > <[email protected]> wrote:
> > > > >> >
> > > > >> > > Yes, we originally wanted to try to use the Hadoop Timeline
> > > > >> > > Server for storm metrics feedback to nimbus + UI + history like
> > > > >> > > server, but it was not stable at the time, so we stopped. For
> > > > >> > > the sake of playing nicely with the rest of the big data
> > > > >> > > ecosystem I would like to see us support it as an option for
> > > > >> > > metrics collection/query, but not until the timeline server v2
> > > > >> > > is ready and released. For me the important thing is that we
> > > > >> > > have a decent time series DB that comes with storm by default
> > > > >> > > and is pluggable, so we can replace it with something else that
> > > > >> > > has similar capabilities in the future.
> > > > >> > >  - Bobby
> > > >> > >
> > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere < 
> > > >> > >[email protected]> wrote:
> > > >> > >
> > > >> > >
> > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm absolutely
> > > > >> > > OK to discuss this in advance.
> > > >> > >
> > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > [email protected]
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes it's already in production.
> > > > >> > > > The implementation basically follows the design document
> > > > >> > > > in https://issues.apache.org/jira/browse/STORM-1329; you can
> > > > >> > > > take a look first and feel free to ask questions.
> > > >> > > >
> > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> [email protected]
> > > >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > > >> > > >> I have some work to do on metrics, so I'm looking through the
> > > > >> > > >> pull requests that address metrics.
> > > > >> > > >> At #753 <https://github.com/apache/storm/pull/753> I found
> > > > >> > > >> that Cody said we (maybe it means the Alibaba team) are
> > > > >> > > >> currently working on a Metrics Server.
> > > > >> > > >> (I also found a comment which said there was some talk a while
> > > > >> > > >> ago about integrating the Hadoop timeline server. It seems no
> > > > >> > > >> one came up with a result, and I prefer to avoid a big
> > > > >> > > >> dependency, so I'm in favor of the Metrics Server for now.)
> > > > >> > > >>
> > > > >> > > >> I think that would improve the metrics feature of Storm a lot,
> > > > >> > > >> so I'd like to see how the work is going. Of course, that is
> > > > >> > > >> only if there's no issue for you to work transparently. I just
> > > > >> > > >> would like to prevent duplication of work, and would like to
> > > > >> > > >> help if needed and possible.
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Abhishek Agarwal
>
