Re: Question on Metrics Server to Alibaba team

Cody Innowhere Wed, 23 Mar 2016 04:59:57 -0700

If we don't rely on any external system, our metrics system is still
available, but it will store metrics meta/data in RocksDB on the nimbus
servers. There will be limits, though. For example, we cannot keep metrics
data for the whole topology lifecycle: RocksDB is only a KV store, so it may
not support efficient scan operations, and too much data on local disk may
bring extra IO overhead. So we may have to keep only the latest 1 hour of m1
data, 6 hours of m10 data, and so on (this is not implemented in JStorm yet,
but it is quite easy to do).
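
To make the retention idea concrete, here is a minimal sketch in Python. It
is not JStorm code. It assumes that m1 and m10 are 1-minute and 10-minute
aggregation windows, and it uses a plain in-memory sorted list as a stand-in
for the local KV store. The class name LocalMetricStore, its methods and the
key layout are all hypothetical.

```python
import bisect

# Retention per aggregation window, in seconds (1 hour of m1, 6 hours of m10).
RETENTION_SECONDS = {"m1": 1 * 3600, "m10": 6 * 3600}


class LocalMetricStore:
    """Hypothetical in-memory stand-in for a local KV store such as RocksDB."""

    def __init__(self):
        # One list of (timestamp, value) points per (metric, window) pair,
        # kept sorted by timestamp.
        self._series = {}

    def put(self, metric, window, ts, value):
        points = self._series.setdefault((metric, window), [])
        bisect.insort(points, (ts, value))

    def prune(self, now):
        """Drop every point older than the retention limit of its window."""
        removed = 0
        for (_metric, window), points in self._series.items():
            cutoff = now - RETENTION_SECONDS[window]
            # Points are sorted, so everything before idx has ts < cutoff.
            idx = bisect.bisect_left(points, (cutoff,))
            removed += idx
            del points[:idx]
        return removed

    def query(self, metric, window):
        return list(self._series.get((metric, window), []))
```

A periodic task on nimbus could call prune() with the current time so that
the local store stays bounded no matter how long a topology runs.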


TopologyMaster is merely a channel for registering/computing/uploading
metrics to nimbus, so if a TM goes down, the topology metrics will be
unavailable for a while until it gets brought up somewhere else (for a
normal failover case, this should be very fast), while supervisor/nimbus
metrics are unaffected as they're sent to nimbus via the thrift interface.
As soon as the TM is back, the topology metrics will be available again.

Currently JStorm syncs metrics meta between multiple nimbus servers, but
metrics data is not synced. So on a nimbus failure we may lose some
metrics data.


On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <[email protected]> wrote:

> John,
>
> My concern is H/A of metrics on Storm by default. (I'm not 100% sure Bobby
> pointed out the same thing.)
>
> Apache Storm is used by a wide variety of users, so we can't assume that
> they know external systems (including the Hadoop ecosystem, in my personal
> opinion) and can operate them smoothly.
> It reminds me how important it is to keep the default setup in mind.
>
> Therefore, I'm curious whether the new metrics feature of JStorm can work
> smoothly without an external system (HBase / OTS). I'd love to see it
> support H/A without other systems; otherwise users have to tolerate loss of
> metrics in some scenarios.
>
> I guess these may be valid questions on H/A (as far as my understanding of
> the design doc is right): How do metrics work when TopologyMaster is down?
> And how do metrics work when a Nimbus failover occurs?
>
> Personally I don't mind losing metrics for short durations (I just want to
> check the availability of H/A), but a failure shouldn't mess up the whole
> metrics.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <[email protected]> wrote:
>
> > @Bobby Evans The JStorm code has been through a lot of testing over the
> > past few years, especially for HA and scalability. We have done a lot of
> > optimization on metrics. The performance is better than Flink in my tests.
> > In my personal opinion, the metrics in JStorm offer a great deal of
> > information, and they can tell us where the bottleneck is when we run a
> > topology. The performance bottleneck may be serialization,
> > deserialization, netty, the executor and so on. Of course, there are also
> > other good monitoring systems in the world, so I hope we can choose the
> > better monitoring solution before phase 2. I will start studying Atlas.
> > If it is better, I would be glad to redesign the metrics on top of Atlas.
> > -----Original Message-----
> > From: Bobby Evans [mailto:[email protected]]
> > Sent: March 22, 2016 22:36
> > To: [email protected]
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > My personal opinion is that we should not reinvent the wheel (aka
> > distributed fault tolerant metrics) ourselves.  The local file blobstore
> > with nimbus HA was a big enough pain to write and it is relatively simple
> > in comparison.
> > If the JStorm code is simple and offers everything we need in terms of HA
> > and scalability then I would be OK with it, but if it doesn't I would lean
> > towards a different compatible open source solution.
> >
> > https://github.com/Netflix/atlas
> > looks very promising as a default option.  It is actively maintained by a
> > group that I think has some of the best monitoring in the world.  And it is
> > both java and apache compatible.  It has no histogram support that I could
> > find, but that I don't see as being super critical.  The biggest drawback
> > is there is little documentation on how to use it, to really be able to
> > evaluate it for our needs. - Bobby
> >
> >     On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <[email protected]>
> > wrote:
> >
> >
> >  Harsha,
> >
> > That's why I think the new metric feature of JStorm looks promising.
> >
> > According to the design doc on
> > https://issues.apache.org/jira/browse/STORM-1329,
> > there's no distinction between topology stats (which Apache Storm includes
> > in the worker heartbeat) and built-in metrics (which should be handled with
> > a separate consumer, as you stated).
> > All metrics are passed to Nimbus and Nimbus caches them, which implies
> > we can treat all metrics the same, and we can also provide built-in metrics
> > (including custom metrics) to users via REST API, too.
> >
> > I thought about a standalone metrics server process which handles all the
> > metric work (maybe TopologyMaster + Nimbus in the design doc), but if the
> > current implementation of the metric feature in JStorm can take care of
> > what I'm assuming, I guess it's good enough.
> >
> > Since I don't know about TopologyMaster, I just wonder whether there are
> > any SPOFs (including soft ones) and how metrics work when such a component
> > goes down.
> > Since Cody gave us a starting point to dig into, we can evaluate that
> > feature before phase 2.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <[email protected]> wrote:
> >
> > > One of the goals of this work, which can probably be addressed in a
> > > separate jira, is how the topology metrics reporter works. Today it's a
> > > bolt that's part of a topology graph, which means it's another node in
> > > the Topology DAG that needs to be tuned for better performance. Some of
> > > our users took performance hits by deploying a topology metrics reporter
> > > that sends metrics to Ganglia. Ideally this collection should be
> > > asynchronous and not be a node in the topology DAG.
> > >
> > > Shipping a default metrics server along with a pluggable option for
> > > users who want to use Graphite or other timeline servers should be the
> > > goal.
> > >
> > > --Harsha
> > >
> > >
> > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > @Cody - The design looks good. Does the design allow to aggregate
> > > > metrics at the task/executor level? Basically, number of distinct
> > > > metrics is proportional to the number of distinct tasks, did you
> > > > ever run into such a use case?
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > <[email protected]>
> > > > wrote:
> > > >
> > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > >
> > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > @Jungtaek,
> > > > > > > We did some tests on codahale metrics; compared to
> > > > > > > meters/histograms, counters are quite fast. So we mainly focused
> > > > > > > on optimizing meters and histograms (they are indeed very slow),
> > > > > > > including double sampling, changing the clock from ns
> > > > > > > (System.nanoTime) to ms, etc.
> > > > > > > You can take a look at the
> > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" class of
> > > > > > > our sequence-split-merge example code as the client code entry
> > > > > > > point to metrics.
> > > > > > > After that, you may dig into the TopologyMaster class, which is
> > > > > > > still part of a topology, then into TopologyMetricsRunnable,
> > > > > > > which is part of the nimbus server, and finally into the
> > > > > > > MetricUploader plugin, which is where the metrics interface with
> > > > > > > our "metrics server". There are still some nits in the code,
> > > > > > > but I think that should be no big problem.
> > > > > >
> > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > strictly a real metrics server: since most of the duty lies on
> > > > > > > the nimbus server and topology master, it's more appropriate to
> > > > > > > call it metrics storage. The main reason for this is that we
> > > > > > > don't want to make a heavy-weight metrics server out of JStorm,
> > > > > > > and this makes it very easy for us to maintain (we have teams
> > > > > > > that specifically maintain HBase/OTS in Alibaba since they're so
> > > > > > > commonly used in production).
> > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > >>
> > > > > >> Cody,
> > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > >> especially that it doesn't do sampling when the metric type is
> > > > > > >> 'counter'. As far as I've heard (I didn't try it), it becomes a
> > > > > > >> huge performance hit in Apache Storm when we change the sample
> > > > > > >> rate to 1.0.
> > > > > > >> Could you point me to the entry point of the metric feature in
> > > > > > >> JStorm to dig into?
> > > > > > >>
> > > > > > >> And just out of curiosity, did you consider extracting the
> > > > > > >> metric feature (which is done with TopologyMasters and Nimbuses)
> > > > > > >> into a separate component?
> > > > > > >> I understood your mention of 'metrics server' as a separate
> > > > > > >> component, but after seeing the design doc, the feature seems to
> > > > > > >> be implemented on Nimbus.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >>
> > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > > > >> <[email protected]> wrote:
> > > > > >>
> > > > > > >> > JStorm provides a MetricUploader interface, which is similar
> > > > > > >> > to IMetricsConsumer in Storm, and the underlying
> > > > > > >> > implementation is pluggable: you can use HBase, any other KV
> > > > > > >> > store that supports timeline queries, or even a database
> > > > > > >> > (maybe if it's a small cluster). We provide model classes in
> > > > > > >> > jstorm-core; as to what kinds of metrics data need to be
> > > > > > >> > stored, it's totally up to the detailed implementation. Our
> > > > > > >> > internal implementation uses OTS, which is a product of
> > > > > > >> > aliyun (https://www.aliyun.com/product/ots/), but it's easy
> > > > > > >> > to adapt to other implementations.
> > > > > >> >
> > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > > > >> > <[email protected]> wrote:
> > > > > >> >
> > > > > > >> > > Yes, we originally wanted to try to use the Hadoop Timeline
> > > > > > >> > > Server for storm metrics feedback to nimbus + UI + history
> > > > > > >> > > like server. But it was not stable at the time, so we
> > > > > > >> > > stopped.  For the sake of playing nicely with the rest of
> > > > > > >> > > the big data ecosystem I would like to see us support it as
> > > > > > >> > > an option for metrics collection/query, but not until the
> > > > > > >> > > timeline server v2 is ready and released.  For me the
> > > > > > >> > > important thing is that we have a decent time series DB
> > > > > > >> > > that comes with storm by default and is pluggable so we can
> > > > > > >> > > replace it with something else that has similar
> > > > > > >> > > capabilities in the future.
> > > > > > >> > >  - Bobby
> > > > > >> > >
> > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere <
> > > > > >> > >[email protected]> wrote:
> > > > > >> > >
> > > > > >> > >
> > > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm
> > > > > > >> > > absolutely ok to discuss this in advance.
> > > > > >> > >
> > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere
> > > > > > >> > > <[email protected]> wrote:
> > > > > >> > >
> > > > > >> > > > Yes it's already in production.
> > > > > > >> > > > The implementation basically follows the design document
> > > > > > >> > > > in https://issues.apache.org/jira/browse/STORM-1329; you
> > > > > > >> > > > can take a look first and feel free to ask questions.
> > > > > >> > > >
> > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim
> > > > > > >> > > > <[email protected]> wrote:
> > > > > >> > > >
> > > > > >> > > >> Hi,
> > > > > >> > > >>
> > > > > > >> > > >> I have some work to do on metrics, so I've been looking
> > > > > > >> > > >> through the pull requests that address metrics.
> > > > > > >> > > >> At #753 <https://github.com/apache/storm/pull/753> I found
> > > > > > >> > > >> that Cody said we (maybe it means the Alibaba team) are
> > > > > > >> > > >> currently working on Metrics Server.
> > > > > > >> > > >> (I also found a comment which said there was some talk a
> > > > > > >> > > >> while ago about integrating Hadoop timeline server. It
> > > > > > >> > > >> seems no one came up with a result, and I prefer to avoid
> > > > > > >> > > >> a big dependency, so I'm in favor of Metrics Server for
> > > > > > >> > > >> now.)
> > > > > > >> > > >>
> > > > > > >> > > >> I think that would improve the metrics feature of Storm a
> > > > > > >> > > >> lot, so I'd like to see how the work is going. Of course,
> > > > > > >> > > >> that's only if there's no issue for you with working
> > > > > > >> > > >> transparently. I just would like to prevent duplication of
> > > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > >> > > >>
> > > > > >> > > >> Thanks,
> > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >> > > >>
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
> > >
> >
> >
> >
> >
>
