Hey Merto,

Any luck getting the patch running on your cluster?

In case you're interested, there's now a JIRA for this:
https://issues.apache.org/jira/browse/HADOOP-8052

Varun

On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor <rez...@hortonworks.com> wrote:

> Your general procedure sounds correct (i.e. dropping your newly built .jar
> into $HD_HOME/lib/), but to make sure it's getting picked up, you should
> explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH environment
> variable; here's mine, as an example:
>
> export HADOOP_CLASSPATH=".:./build/*.jar"
>
> About your second point, you certainly need to copy this newly patched
> .jar to every node in your cluster, because my patch changes the value of a
> couple metrics emitted TO gmetad (FROM all the nodes in the cluster), so
> without copying it over to every node in the cluster, gmetad will still
> likely receive some bad metrics.
>
> Varun
>
> On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek <masmer...@gmail.com> wrote:
>
>> I will need your help. Please confirm if the following procedure is right.
>> I have a dev environment where I pimp my scheduler (no hadoop running) and
>> a small cluster environment where the changes (jars) are deployed with some
>> scripts, however I have never compiled the whole hadoop from source so I
>> do not know if I am doing it right. I've done it as follows:
>>
>> a) apply a patch
>> b) cd $HD_HOME; ant
>> c) copy $HD_HOME/*build*/patched-core-hadoop.jar -> cluster:/$HD_HOME/*lib*
>> d) run $HD_HOME/bin/start-all.sh
>>
>> Is this enough? When I tried to test "hadoop dfs -ls /" I could see that a
>> new jar was not loaded and instead a jar from
>> $HD_HOME/*share*/hadoop-20.205.0.jar was taken..
>> Should I copy the entire hadoop folder to all nodes and reconfigure the
>> entire cluster for the new build, or is it enough if I configure it just on
>> the node where gmetad will run?
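
One way to check which jar a class is really loaded from, as asked above, is to print the class's code source. This is only a minimal sketch: the class name WhichJar and the helper locationOf are made up for illustration, and the default class name is taken from the SampleStat.java path mentioned in this thread. It assumes you run it with the same classpath the Hadoop daemons use, which is not something the sketch can verify for you.

```java
import java.security.CodeSource;

final class WhichJar {
    // Returns the jar or directory a class was loaded from, or a placeholder
    // when the JVM exposes no code source (typical for core JDK classes).
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(no code source: bootstrap class?)"
                : src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Class to inspect; the default is the class touched by the Hadoop-side patch.
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.metrics2.util.SampleStat";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

If the printed location is the jar under $HD_HOME/share rather than the patched one under $HD_HOME/lib, the classpath order is the thing to look at.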
>>
>> On 8 February 2012 06:33, Varun Kapoor <rez...@hortonworks.com> wrote:
>>
>>> I'm so sorry, Merto - like a silly goose, I attached the 2 patches to my
>>> reply, and of course the mailing list did not accept the attachment.
>>>
>>> I plan on opening JIRAs for this tomorrow, but till then, here are links to
>>> the 2 patches (from my Dropbox account):
>>>
>>> - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch
>>> - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch
>>>
>>> Here's hoping this works for you,
>>>
>>> Varun
>>>
>>> On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek <masmer...@gmail.com> wrote:
>>>
>>>> Varun, have I missed your link to the patches? I have tried to search them
>>>> on jira but I did not find them.. Can you repost the link for these two
>>>> patches?
>>>>
>>>> Thank you..
>>>>
>>>> On 7 February 2012 20:36, Varun Kapoor <rez...@hortonworks.com> wrote:
>>>>
>>>>> I'm sorry to hear that gmetad cores continuously for you guys. Since I'm
>>>>> not seeing that behavior, I'm going to just put out the 2 possible patches
>>>>> you could apply and wait to hear back from you. :)
>>>>>
>>>>> Option 1
>>>>>
>>>>> * Apply gmetadBufferOverflow.Hadoop.patch to the relevant file (
>>>>> http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markup
>>>>> in my setup) in your Hadoop sources and rebuild Hadoop.
>>>>>
>>>>> Option 2
>>>>>
>>>>> * Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c and
>>>>> rebuild gmetad.
>>>>>
>>>>> Only 1 of these 2 fixes is required, and it would help me if you could
>>>>> first try Option 1 and let me know if that fixes things for you.
>>>>>
>>>>> Varun
>>>>>
>>>>> On Mon, Feb 6, 2012 at 10:36 PM, mete <efk...@gmail.com> wrote:
>>>>>
>>>>>> Same with Merto's situation here, it always overflows a short time after the
>>>>>> restart. Without the hadoop metrics enabled everything is smooth.
>>>>>> Regards
>>>>>>
>>>>>> Mete
>>>>>>
>>>>>> On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek <masmer...@gmail.com> wrote:
>>>>>>
>>>>>>> I have tried to run it but it keeps crashing..
>>>>>>>
>>>>>>>> - When you start gmetad and Hadoop is not emitting metrics, everything
>>>>>>>> is peachy.
>>>>>>>
>>>>>>> Right, running just ganglia without running hadoop jobs seems stable for at
>>>>>>> least a day..
>>>>>>>
>>>>>>>> - When you start Hadoop (and it thus starts emitting metrics), gmetad
>>>>>>>> cores.
>>>>>>>
>>>>>>> True, with the following error: *** stack smashing detected ***: gmetad
>>>>>>> terminated \n Segmentation fault
>>>>>>>
>>>>>>>> - On my MacBookPro, it's a SIGABRT due to a buffer overflow.
>>>>>>>>
>>>>>>>> I believe this is happening for everyone. What I would like for you to try
>>>>>>>> out are the following 2 scenarios:
>>>>>>>>
>>>>>>>> - Once gmetad cores, if you start it up again, does it core again? Does
>>>>>>>> this process repeat ad infinitum?
>>>>>>>> - On my MBP, the core is a one-time thing, and restarting gmetad
>>>>>>>> after the first core makes things run perfectly smoothly.
>>>>>>>> - I know others are saying this core occurs continuously, but they
>>>>>>>> were all using ganglia-3.1.x, and I'm interested in how ganglia-3.2.0
>>>>>>>> behaves for you.
>>>>>>>
>>>>>>> It cores every time I run it. The difference is just that sometimes a
>>>>>>> segmentation fault appears instantly, and sometimes it appears after a
>>>>>>> random time... let's say after a minute of running gmetad and collecting
>>>>>>> data.
>>>>>>>
>>>>>>>> - If you start Hadoop first (so gmetad is not running when the
>>>>>>>> first batch of Hadoop metrics are emitted) and THEN start gmetad after a
>>>>>>>> few seconds, do you still see gmetad coring?
>>>>>>>
>>>>>>> Yes
>>>>>>>
>>>>>>>> - On my MBP, this sequence works perfectly fine, and there are no
>>>>>>>> gmetad cores whatsoever.
>>>>>>>
>>>>>>> I have tested this scenario with 2 working nodes, so two gmond plus the head
>>>>>>> gmond on the server where gmetad is located. I have checked and all of them
>>>>>>> are versioned 3.2.0.
>>>>>>>
>>>>>>> Hope it helps..
>>>>>>>
>>>>>>>> Bear in mind that this only addresses the gmetad coring issue - the
>>>>>>>> warnings emitted about '4.9E-324' being out of range will continue, but I
>>>>>>>> know what's causing that as well (and hope that my patch fixes it for
>>>>>>>> free).
>>>>>>>>
>>>>>>>> Varun
>>>>>>>>
>>>>>>>> On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek <masmer...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes I am encountering the same problems and like Mete said, a few seconds
>>>>>>>>> after restarting a segmentation fault appears.. here is my conf..
>>>>>>>>> <http://pastebin.com/VgBjp08d>
>>>>>>>>>
>>>>>>>>> And here is some info from /var/log/messages (ubuntu server 10.10):
>>>>>>>>>
>>>>>>>>>> kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb
>>>>>>>>>> sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
>>>>>>>>>
>>>>>>>>> When I compiled gmetad I used the following command:
>>>>>>>>>
>>>>>>>>>> ./configure --with-gmetad --sysconfdir=/etc/ganglia
>>>>>>>>>> CPPFLAGS="-I/usr/local/rrdtool-1.4.7/include"
>>>>>>>>>> CFLAGS="-I/usr/local/rrdtool-1.4.7/include"
>>>>>>>>>> LDFLAGS="-L/usr/local/rrdtool-1.4.7/lib"
>>>>>>>>>
>>>>>>>>> The same was tried with rrdtool 1.4.5. My current ganglia version is 3.2.0
>>>>>>>>> and like Mete I tried it with version 3.1.7 but without success..
>>>>>>>>>
>>>>>>>>> Hope we will sort out a solution soon..
>>>>>>>>> thank you
>>>>>>>>>
>>>>>>>>> On 6 February 2012 20:09, mete <efk...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> i also face this issue when using GangliaContext31 and hadoop-1.0.0, and
>>>>>>>>>> ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows as
>>>>>>>>>> soon as i restart the gmetad.
>>>>>>>>>> Regards
>>>>>>>>>> Mete
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 6, 2012 at 7:42 PM, Vitthal "Suhas" Gogate <gog...@hortonworks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I assume you have seen the following information on Hadoop twiki,
>>>>>>>>>>> http://wiki.apache.org/hadoop/GangliaMetrics
>>>>>>>>>>>
>>>>>>>>>>> So do you use GangliaContext31 in hadoop-metrics2.properties?
>>>>>>>>>>>
>>>>>>>>>>> We use Ganglia 3.2 with Hadoop 20.205 and it works fine (I remember seeing
>>>>>>>>>>> gmetad sometimes go down due to a buffer overflow problem when hadoop starts
>>>>>>>>>>> pumping in the metrics.. but restarting works.. let me know if you face
>>>>>>>>>>> the same problem?)
>>>>>>>>>>>
>>>>>>>>>>> --Suhas
>>>>>>>>>>>
>>>>>>>>>>> Additionally, the Ganglia protocol changed significantly between Ganglia 3.0
>>>>>>>>>>> and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
>>>>>>>>>>> clients). This caused Hadoop to not work with Ganglia 3.1; there is a patch
>>>>>>>>>>> available for this, HADOOP-4675. As of November 2010, this patch has been
>>>>>>>>>>> rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
>>>>>>>>>>> protocol in place of the 3.0, substitute
>>>>>>>>>>> org.apache.hadoop.metrics.ganglia.GangliaContext31 for
>>>>>>>>>>> org.apache.hadoop.metrics.ganglia.GangliaContext in the
>>>>>>>>>>> hadoop-metrics.properties lines above.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek <masmer...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I spent a lot of time trying to figure it out, however i did not find a solution.
>>>>>>>>>>>> Problems from the logs pointed me to some bugs in the rrdupdate tool, however
>>>>>>>>>>>> i tried to solve it with different versions of ganglia and rrdtool but the
>>>>>>>>>>>> error is the same. Segmentation fault appears after the following lines, if
>>>>>>>>>>>> I run gmetad in debug mode...
>>>>>>>>>>>>
>>>>>>>>>>>> "Created rrd
>>>>>>>>>>>> /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd"
>>>>>>>>>>>> "Created rrd
>>>>>>>>>>>> /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd"
>>>>>>>>>>>>
>>>>>>>>>>>> which I suppose are generated from MetricsSystemImpl.java (Is there any way
>>>>>>>>>>>> just to disable these two metrics?)
>>>>>>>>>>>>
>>>>>>>>>>>> From the /var/log/messages there are a lot of errors:
>>>>>>>>>>>>
>>>>>>>>>>>> "xxx gmetad[15217]: RRD_update
>>>>>>>>>>>> (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd):
>>>>>>>>>>>> converting '4.9E-324' to float: Numerical result out of range"
>>>>>>>>>>>> "xxx gmetad[15217]: RRD_update
>>>>>>>>>>>> (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
>>>>>>>>>>>> converting '4.9E-324' to float: Numerical result out of range"
>>>>>>>>>>>>
>>>>>>>>>>>> so probably there are some conversion issues? Where should I look for the
>>>>>>>>>>>> solution? Would you rather suggest to use ganglia 3.0.x with the old
>>>>>>>>>>>> protocol and leave the version >3.1 for further releases?
>>>>>>>>>>>>
>>>>>>>>>>>> any help is really appreciated...
>>>>>>>>>>>>
>>>>>>>>>>>> On 1 February 2012 04:04, Merto Mertek <masmer...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I would be glad to hear that too.. I've set up the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hadoop 0.20.205
>>>>>>>>>>>>> Ganglia Front 3.1.7
>>>>>>>>>>>>> Ganglia Back *(gmetad)* 3.1.7
>>>>>>>>>>>>> RRDTool <http://www.rrdtool.org/> 1.4.5 -> i had some troubles
>>>>>>>>>>>>> installing 1.4.4
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ganglia works only when hadoop is not running, so metrics are not
>>>>>>>>>>>>> published to the gmetad node (conf with new hadoop-metrics2.properties). When
>>>>>>>>>>>>> hadoop is started, a segmentation fault appears in the gmetad daemon:
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo gmetad -d 2
>>>>>>>>>>>>> .......
>>>>>>>>>>>>> Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
>>>>>>>>>>>>> Updating host xxx, metric bytes_in
>>>>>>>>>>>>> Updating host xxx, metric bytes_out
>>>>>>>>>>>>> Updating host xxx, metric metricssystem.MetricsSystem.publish_max_time
>>>>>>>>>>>>> Created rrd
>>>>>>>>>>>>> /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
>>>>>>>>>>>>> Segmentation fault
>>>>>>>>>>>>>
>>>>>>>>>>>>> And some info from the apache log <http://pastebin.com/nrqKRtKJ>..
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can someone suggest a ganglia version that is tested with hadoop 0.20.205?
>>>>>>>>>>>>> I will try to sort it out, however it seems a not so trivial problem..
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2 December 2011 12:32, praveenesh kumar <praveen...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> or Do I have to apply some hadoop patch for this ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Praveenesh

--
http://www.hadoopsummit.org/
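
A note on the '4.9E-324' in the RRD_update warnings quoted above: that string is exactly how Java prints Double.MIN_VALUE, the smallest positive double. The sketch below is plain Java, independent of Hadoop and Ganglia, and only illustrates why a single-precision consumer cannot represent that value. The class name MinValueDemo and the helper fitsInFloat are made up for illustration, and whether this is the exact code path inside SampleStat.java or gmetad is an assumption, not something the thread confirms.

```java
final class MinValueDemo {
    // Hypothetical helper: true if the double survives narrowing to float
    // as a finite value that has not underflowed to zero.
    static boolean fitsInFloat(double d) {
        float f = (float) d;
        return d == 0.0 || (f != 0.0f && !Float.isInfinite(f));
    }

    public static void main(String[] args) {
        // Smallest positive double, printed the way it shows up in the gmetad log.
        System.out.println(Double.toString(Double.MIN_VALUE)); // 4.9E-324
        // Narrowing to single precision underflows to zero.
        System.out.println((float) Double.MIN_VALUE);          // 0.0
        System.out.println(fitsInFloat(Double.MIN_VALUE));     // false
        // The float-range counterpart survives the same narrowing.
        System.out.println(fitsInFloat(Float.MIN_VALUE));      // true
    }
}
```

This suggests that a metric emitted with a double-range sentinel as its initial min or max can land outside what a float-based parser accepts, which would match the "Numerical result out of range" text in the log.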