I did a little bit of research and although there are many solutions out there, the top two I shortlisted to based on many criteria (community, plugins, eco-system, features) are zabbix and nagios, and I list below the pros and cons from my perspective.
Given that our needs are not too complex, I find my self leaning slightly towards zabbix. # nagios pros - mature and battle tested - massive plugin collection - highly customizable - Configuration can be stored in a revision system (config files) - No database, simpler # nagios cons - steep learning curve (everything is a config file) - The web interface is limited - because of the huge eco-system some plugins are abandoned or not well maintained - community is a bit fragmented with many forks out there (e.g. icinga, etc ...) # zabbix pros - Powerful web interface - Excellent template system that makes complex flows - Easy to deploy and configure - Easy to learn and use - More functionality OOTB # zabbix cons - More complex architecture with a database backend - Less customizable - Smaller community and eco-system, plugins not as numerous On Sat, Aug 25, 2018 at 12:30 PM Jacques Le Roux <jacques.le.r...@les7arts.com> wrote: > > Mind you, I already asked when the infra stopped providing it. Then Daniel > (Gruno) told me I could use the free tool his company provided. I used that > for months, but it's now discontinued. So I had to find the one which suited > me best and I explained that at the start of this thread. > > The problem for me is how to share the burden and especially have more brains > around it. It's years I handle the demos alone... > > Jacques > > Le 25/08/2018 à 11:07, Pierre Smits a écrit : > > Since we're talking about our demo instances on the infrastructure of the > > ASF I suggest getting in touch with INFRA and work out a solution with them > > that favours both parties. They surely will have monitoring solutions in > > place and can advice on what is achievable. > > > > > > > > Best regards, > > > > Pierre Smits > > > > Apache Trafodion <https://trafodion.apache.org>, Vice President > > Apache Directory <https://directory.apache.org>, PMC Member > > Apache Incubator <https://incubator.apache.org>, committer > > *Apache OFBiz <https://ofbiz.apache.org>, contributor (without privileges) > > since 2008* > > Apache Steve <https://steve.apache.org>, committer > > > > > > On Fri, Aug 24, 2018 at 4:36 PM Jacques Le Roux < > > jacques.le.r...@les7arts.com> wrote: > > > >> Agreed, I have used VisualVMin the past, it's a simple and efficient tool > >> > >> I have planned to make a VOTE about options if needed. Let's see if it > >> will be necessary (consensus being preferred) > >> > >> Jacques > >> > >> > >> Le 24/08/2018 à 16:06, Girish Vasmatkar a écrit : > >>> Speaking of monitoring tools and if we don't want to go for third party > >>> tools, we can also use VisualVM that comes bundled with Oracle JDK. It > >> can > >>> connect to the remote VM (OFBiz process) and start displaying various > >>> information. > >>> > >>> Very minimal configuration is needed in the form of VM argument to allow > >>> for remote monitoring. Also, to enable further analysis of what went > >> wrong, > >>> why JVM crashed etc, we should also dump heap as the JVM shuts down. > >>> > >>> Too many ways and too many options. Probably need to reach a unanimous > >>> decision, IMO. > >>> > >>> Thanks and Best regards, > >>> Girish Vasmatkar > >>> > >>> On Fri, Aug 24, 2018 at 4:56 PM Jacques Le Roux < > >>> jacques.le.r...@les7arts.com> wrote: > >>> > >>>> Thanks Michael, > >>>> > >>>> Best idea so far! > >>>> > >>>> Jacques > >>>> > >>>> > >>>> Le 24/08/2018 à 11:08, Michael Brohl a écrit : > >>>>> We are monitoring our OFBiz instances with JMX and self hosted Zabbix > >>>> [1]. > >>>>> Zabbix gives you a nice overview about the system health and metrics > >>>> like memory consumption etc. It also sends out warnings (Email, SMS or > >>>> else) > >>>>> if metrics are exceeded (like CPU load or memory consumption) as well > >> as > >>>> the system is not accessible. > >>>>> Looks like this: [2] > >>>>> > >>>>> There is no programming needed, just some configuration for JMX and > >>>> Zabbix. > >>>>> [1] https://www.zabbix.com/ > >>>>> [2] > >>>> https://www.ecomify.de/wp-content/uploads/2018/08/Zabbix_Monitoring.png > >>>>> If we want to see why the demos crash, it might be useful. If we only > >>>> want to monitor if the system is up, a simple cron job which sends a > >> mail > >>>>> might be enough... > >>>>> > >>>>> Regards, > >>>>> > >>>>> Michael Brohl > >>>>> ecomify GmbH > >>>>> www.ecomify.de > >>>>> > >>>>> > >>>>> Am 24.08.18 um 10:07 schrieb Taher Alkhateeb: > >>>>>> Okay all neat ideas, I'm not sure if the energy you will put into > >>>> something > >>>>>> like this is equal to the value produced but if you want to make this > >>>>>> happen I would be happy to assist. > >>>>>> > >>>>>> How much time will it take to make something like this happen? I ask > >>>>>> because it seems Jacques ia getting annoyed with these crashes and > >> we'd > >>>>>> like to help him out. > >>>>>> > >>>>>> On Fri, Aug 24, 2018, 10:59 AM Girish Vasmatkar < > >>>>>> girish.vasmat...@hotwaxsystems.com> wrote: > >>>>>> > >>>>>>> Hi Taher > >>>>>>> > >>>>>>> Please see my reply below in-line. > >>>>>>> > >>>>>>> On Fri, Aug 24, 2018 at 12:22 PM Taher Alkhateeb < > >>>>>>> slidingfilame...@gmail.com> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hi Girish, inline... > >>>>>>>> > >>>>>>>> On Thu, Aug 23, 2018, 7:25 PM Girish Vasmatkar < > >>>>>>>> girish.vasmat...@hotwaxsystems.com> wrote: > >>>>>>>> > >>>>>>>>> I had earlier replied to this thread but looks like the email did > >> not > >>>>>>> go > >>>>>>>>> through. I had leaned towards using the tool (only just) instead of > >>>> may > >>>>>>>> be > >>>>>>>>> having a CRON job or an alternative. > >>>>>>>>> > >>>>>>>>> What I feel now is that may be we can use JMX here and try to use > >>>>>>> various > >>>>>>>>> in build MBeans that provide CPU usage for the system and also for > >>>> the > >>>>>>>> JVM > >>>>>>>>> process we are concerned about that is OFBiz instance. We should > >> also > >>>>>>> be > >>>>>>>>> able to get the memory usage of the JVM and if reaches a particular > >>>>>>>>> threshold we can be notified. > >>>>>>>>> > >>>>>>>> Do you have a PoC for all of this? > >>>>>>>> > >>>>>>> GV : I can have one ready; and there is going to be much doing > >>>> involved. > >>>>>>>>> In addition, I think we already add a shutdown hook to the JVM > >>>>>>>> process... I > >>>>>>>>> am not sure and have not used it much but may be we can use it to > >>>> send > >>>>>>>> some > >>>>>>>>> notifications? Of course, it is applicable for graceful exits of > >> JVM > >>>>>>> only > >>>>>>>>> and if you just happen to kill the process it won't be of much > >> help. > >>>>>>>> The shutdown hook is used for shutting down. I'm not sure what is > >> the > >>>>>>>> purpose of mentioning it here? > >>>>>>>> > >>>>>>> GV : The reason I mentioned shutdown hook was it can be used to > >>>> send > >>>>>>> notification (may be email) or anything per our needs indicating that > >>>> the > >>>>>>> demo process was shut down. Per my understanding, shutdown hook > >>>> gets > >>>>>>> called whenever JVM shuts down gracefully. Graceful word is very > >>>> important > >>>>>>> here because we won't be able to do much if someone just kills the > >>>> process. > >>>>>>> The only thing a shutdown hook will add to this is that we will be > >>>> notified > >>>>>>> then and there. > >>>>>>> > >>>>>>>>> Hope it makes sense and correct me if I am wrong. > >>>>>>>> Well I'm struggling a bit. I didn't understand exactly what needs to > >>>> be > >>>>>>>> done? I see mixed topics about JMX, Mbeans, Memory monitors and > >>>> shutdown > >>>>>>>> hooks. First this seems to be more like coding than a tool, and > >>>> second I > >>>>>>>> have no idea how you want to implement this? > >>>>>>>> > >>>>>>> GV: Yes, it would mostly be coding rather than being a > >> substitute > >>>> for > >>>>>>> the tool. My idea was that to have a timer service run within the JVM > >>>> and > >>>>>>> it access various MBeans for the CPU usage and Memory usages just for > >>>> our > >>>>>>> monitoring purpose and raise an alert if it reaches a threshold. It > >> was > >>>>>>> just to have a glance over how JVM is performing. The disadvantage? > >> The > >>>>>>> service will run in OFBiz JVM and there will be considerable amount > >> of > >>>>>>> coding involved. > >>>>>>> > >>>>>>>> My idea for example is simple: create a cronjob that checks the > >> system > >>>>>>>> periodically and if the demo process stopped, restart it (or maybe > >>>>>>> rebuild > >>>>>>>> and restart). To go with your suggestion we need to perhaps first > >>>>>>>> understand it. > >>>>>>>> > >>>>>>> GV: There is nothing wrong with creating a CRON job, per se. The > >>>> only > >>>>>>> reason why I introduced MBeans in the mix was to be able to sort of > >>>> having > >>>>>>> OFBiz monitor itself within it's realm, hence use of MBeans. I > >> believe > >>>> a > >>>>>>> CRON will be able to do it as well. I probably did not get that we > >>>> probably > >>>>>>> want something that take some action after the JVM has crashed and > >> not > >>>>>>> having something that monitors the process and alerts concerned > >> parties > >>>>>>> that the process is occupying more than say 2 GB or it's CPU usage > >> has > >>>>>>> spiked above 80%. > >>>>>>> > >>>>>>> All in all, I feel we should choose the solution based on what we > >> want > >>>> to > >>>>>>> do and whether we want to take it further as well. I do not know what > >>>> the > >>>>>>> tool does now or whether it can build the system again and restart it > >>>>>>> automatically. I also do not know what measures we take in such an > >>>> event. I > >>>>>>> agree CRON will be simplest of them all, but if the tool provides all > >>>> of > >>>>>>> these (be able to take corrective measures) and not just send > >>>>>>> notifications, then it can also be worth it's salt. Yes, CRON will be > >>>> more > >>>>>>> technical way of achieving :) > >>>>>>> > >>>>>>> Thanks and Best regards, > >>>>>>> Girish Vasmatkar > >>>>>>> HotWax Systems > >>>>>>> > >>>>>>>>> Best regards, > >>>>>>>>> Girish Vasmatkar > >>>>>>>>> HotWax Systems > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Thu, Aug 23, 2018 at 8:48 PM Jacques Le Roux < > >>>>>>>>> jacques.le.r...@les7arts.com> wrote: > >>>>>>>>> > >>>>>>>>>> Le 23/08/2018 à 14:04, Taher Alkhateeb a écrit : > >>>>>>>>>>> I'm not sure why you're hanging this on me, > >>>>>>>>>> Because you answered to the bait ;) > >>>>>>>>>> > >>>>>>>>>>> but sure I'm willing to > >>>>>>>>>>> help. > >>>>>>>>>> Thanks, much appreciated! > >>>>>>>>>> > >>>>>>>>>>> Can I get some information on how the crashes are happening and > >>>>>>>>>>> how you're getting notified, and I will take it from there. > >>>>>>>>>> I think after a crash it's mostly to use dumps there (we have > >>>> several > >>>>>>>>> from > >>>>>>>>>> the recent pas) but I'm not sure they will help, and it takes time > >>>> to > >>>>>>>>>> analyse. > >>>>>>>>>> > >>>>>>>>>> In the past I took the time to analyse some of them and it was > >>>>>>>>>> interesting. For instance in 2010 I found a bug in a Java version > >> we > >>>>>>>> were > >>>>>>>>>> using and it > >>>>>>>>>> helped me in a custom project I was also doing then: > >>>>>>>>>> https://markmail.org/message/byu2ivjn7wckayzz > >>>>>>>>>> > >>>>>>>>>> Lastly it was mostly lack of memory, despite having 8GB now. I > >>>>>>> created > >>>>>>>>>> https://issues.apache.org/jira/browse/INFRA-16780 for that, but > >> not > >>>>>>>> sure > >>>>>>>>>> it was > >>>>>>>>>> the reason. At least we have less issues since. > >>>>>>>>>> > >>>>>>>>>> Before (months ago) the Infra was monitoring our demos and > >> alerting > >>>>>>> us > >>>>>>>> by > >>>>>>>>>> mail (you just had to subscribe). Unfortunately we are on our own > >>>> for > >>>>>>>>> that > >>>>>>>>>> now, too much projects in the ASF... > >>>>>>>>>> As as I said initially in this thread I'm currently using > >>>>>>>> montastic.com > >>>>>>>>>> for the email alerts. > >>>>>>>>>> My idea when I started this thread was that it all depends on me, > >>>> and > >>>>>>>>>> that's bad. So I wanted people to be aware, you are much welcome. > >>>>>>>>>> > >>>>>>>>>> Jacques > >>>>>>>>>>> On Thu, Aug 23, 2018 at 2:29 PM Jacques Le Roux > >>>>>>>>>>> <jacques.le.r...@les7arts.com> wrote: > >>>>>>>>>>>> Yes we can, will you? > >>>>>>>>>>>> > >>>>>>>>>>>> Jacques > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Le 22/08/2018 à 19:29, Taher Alkhateeb a écrit : > >>>>>>>>>>>>> Well, we can ask Infra for help, we can check available > >>>>>>> solutions, > >>>>>>>> we > >>>>>>>>>>>>> can create a CRON script that checks things periodically, there > >>>>>>> are > >>>>>>>>>>>>> multiple ways to go about this. > >>>>>>>>>>>>> > >>>>>>>>>>>>> My personal preference is for a simple CRON script that takes > >>>>>>> care > >>>>>>>> of > >>>>>>>>>> this. > >>>>>>>>>>>>> On Wed, Aug 22, 2018 at 8:25 PM Jacques Le Roux > >>>>>>>>>>>>> <jacques.le.r...@les7arts.com> wrote: > >>>>>>>>>>>>>> So you prefer that I'm the only one to take care of the demos > >>>>>>> and > >>>>>>>>> act > >>>>>>>>>> on alerts? > >>>>>>>>>>>>>> Jacques > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Le 22/08/2018 à 18:53, Taher Alkhateeb a écrit : > >>>>>>>>>>>>>>> I prefer not to include any tools without proper analysis and > >>>>>>>>>>>>>>> discussion first. Less is more. > >>>>>>>>>>>>>>> On Wed, Aug 22, 2018 at 5:31 PM Jacques Le Roux > >>>>>>>>>>>>>>> <jacques.le.r...@les7arts.com> wrote: > >>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Should I consider no answers as a lazy consensus and should > >> I > >>>>>>>> send > >>>>>>>>>> (rare) alerts to this ML? > >>>>>>>>>>>>>>>> Without any answers I'll consider it a lazy consensus in 2 > >>>>>>> days. > >>>>>>>>>>>>>>>> Jacques > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Le 17/08/2018 à 12:22, Jacques Le Roux a écrit : > >>>>>>>>>>>>>>>>> Le 13/08/2018 à 18:21, Jacques Le Roux a écrit : > >>>>>>>>>>>>>>>>>> Le 12/08/2018 à 11:26, Jacques Le Roux a écrit : > >>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> This morning I noticed the old demo was down and > >> restarted > >>>>>>> it > >>>>>>>>>> after cleaning things. > >>>>>>>>>>>>>>>>>>> Previously (still some weeks ago) Daniel Gruno's (from > >>>>>>> Infra > >>>>>>>>>> team) company was kindly providing us a mean to monitor our demos > >>>> but > >>>>>>>> it > >>>>>>>>>> seems that > >>>>>>>>>>>>>>>>>>> this mean is no longer available > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I have asked about it and will let you know about it... > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Have a good weekend > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Jadques > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Daniel confirmed it's terminated. I turned to UpTimeRobot > >>>>>>>> which > >>>>>>>>>> is free and seems as well good :) > >>>>>>>>>>>>>>>>>> Jacques > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> This thread started on user ML but I don't want to bother > >>>>>>>>> everyone > >>>>>>>>>> with technical details. > >>>>>>>>>>>>>>>>> I used my own @a.o email to create the monitoring. > >>>>>>> UpTimeRobot > >>>>>>>> is > >>>>>>>>>> certainly the best free monitoring tool, with some possibilities > >>>>>>> others > >>>>>>>>>> don't give. > >>>>>>>>>>>>>>>>> But the free version has an inconvenient. You can only > >> check > >>>>>>>>> every > >>>>>>>>>> 5 mins and when the instances restart it takes more than 5 mins > >>>> each. > >>>>>>>>>>>>>>>>> So everyday I get a down an up alerts for each. I have > >>>>>>> switched > >>>>>>>>> to > >>>>>>>>>> montastic.com. > >>>>>>>>>>>>>>>>> I was wondering if we don't want to share that here. > >>>>>>>>>>>>>>>>> We could then have these alerts here and any committer, > >> using > >>>>>>>> the > >>>>>>>>>> info inhttps://svn.apache.org/repos/asf/ofbiz/tools/demo-backup > >>>>>>> could > >>>>>>>>>> handle issues. > >>>>>>>>>>>>>>>>> It seems better, isn'it? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Jacques > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >> >