It seems that the large leveldb size is what causes the problem. What is the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it is not short enough, so there are too many entities in the timeline store. By the way, it takes a long time (hours) for the ATS to discard old entities, and that operation also blocks the other operations. The patch https://issues.apache.org/jira/browse/YARN-3448 is a great performance improvement. We just backported it and it works well.
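For reference, here is a minimal sketch of how the retention settings might look in yarn-site.xml. The property names are the standard timeline-service ones; the one-day value for ttl-ms is only an illustrative assumption and not a recommendation for this cluster (the default is 604800000 ms, i.e. 7 days).

```xml
<!-- Sketch only: the ttl-ms value is an assumed example, tune it for your cluster -->
<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <!-- 1 day = 24 * 60 * 60 * 1000 ms (assumed example; default is 604800000) -->
  <value>86400000</value>
</property>
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <!-- interval between deletion cycles; 300000 ms is the value already reported in this thread -->
  <value>300000</value>
</property>
```

A shorter TTL only limits how much data is retained. The deletion pass itself can still be slow on an already large store, which is the part YARN-3448 improves.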
2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <garlanaganarasi...@huawei.com>:

> Hi Krzysiek,
>
> *There are currently 8 Spark Streaming jobs constantly running, each 3
> with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to
> ATS. How could I check what precisely is doing what or how to get some
> logs about it, I don't know...*
>
> Not sure about the applications being run, and if you have already tried
> disabling the "Spark History Server doing the puts to ATS" then not sure if
> the apps are sending it out. AFAIK Spark history server had not integrated
> with ATS (SPARK-1537). So most probably it's the applications which are
> pumping in the data. I think you need to check with the applications
> themselves.
>
> *2. Is 8 concurrent Spark Streaming jobs really that high for
> Timelineserver? I have just a small cluster, how other larger companies are
> handling much larger load?*
>
> It has not been used at large scale by us, but according to YARN-2556 (ATS
> Performance Test Tool), it states that "On a 36 node cluster, this
> results in ~830 concurrent containers (e.g maps), each firing 10KB of
> payload, 20 times." The only difference is that the data in your
> system is already overloaded, hence the cost of querying (which currently
> happens during each insertion) is very high.
>
> Maybe people from other companies who have used or supported ATSV1 might be
> able to tell the ATSV1 scale better!
>
> Regards,
> + Naga
> ------------------------------
> *From:* Krzysztof Zarzycki [k.zarzy...@gmail.com]
> *Sent:* Thursday, November 05, 2015 19:51
> *To:* user@hadoop.apache.org
> *Subject:* Re: YARN timelineserver process taking 600% CPU
>
> Thanks Naga for your input (I'm sorry for the late response, I was out for
> some time).
>
> So you believe that Spark is actually doing the PUTs? There are currently
> 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x
> 10 s. I believe these are the jobs that publish to ATS.
> How could I check what precisely is doing what or how to get some logs
> about it, I don't know...
> I thought maybe it is the Spark History Server doing the puts, but it seems
> it is not, as I disabled it and the load hasn't gone down. So it seems these
> are indeed the jobs themselves.
>
> Now I have the following problems:
> 1. The most important: How can I at least *work around* this issue? Maybe
> I will somehow disable Spark usage of the YARN timelineserver? What are the
> consequences? Is it only the history of finished Spark jobs not being saved?
> If yes, that doesn't hurt that much. Probably this is a question for the
> Spark group...
> 2. Is 8 concurrent Spark Streaming jobs really that high for
> Timelineserver? I have just a small cluster, how other larger companies are
> handling much larger load?
>
> Thanks for helping me with this!
> Krzysiek
>
> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <naganarasimha...@gmail.com>:
>
>> Hi Krzysiek,
>> Oops, my mistake, 3 GB seems to be a little on the higher side.
>> From the jstack it seems there was no major activity other than puts;
>> around 16 concurrent puts were happening, each of which tries to get
>> the timeline entity, hence hitting the native call.
>>
>> From the logs it seems like a lot of ACL validations are happening, and
>> from the URL it seems like it's for PUTEntites.
>> Approximately from 09:30:16 to 09:44:26 about 9213 checks have happened,
>> and if all of these are for puts then roughly 10 put calls/s are
>> happening from the *spark* side. This, I feel, is not the right usage of
>> ATS. Can you check what is being published from Spark to ATS at this high
>> rate?
>>
>> Besides, some improvements regarding the timeline metrics are available in
>> trunk as part of YARN-3360, which could have been useful in analyzing your
>> issue.
>>
>> + Naga
>>
>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com>
>> wrote:
>>
>>> Hi Naga,
>>> Sorry, but it's not 3 MB, but 3 GB in leveldb-timeline-store (du shows
>>> numbers in kB). Does that seem reasonable as well?
>>> There are new .sst files generated each minute.
>>> There are now 26850 files in the leveldb-timeline-store directory. New
>>> files are generated each minute. Some are also being deleted.
>>>
>>> I started the timeline server today to gather logs and jstack; it was
>>> running for ~20 minutes. I attach the tar bz2 archive with those logs.
>>>
>>> Thank you for helping me debug this.
>>> Krzysiek
>>>
>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <naganarasimha...@gmail.com>:
>>>
>>>> Hi Krzysiek,
>>>> It seems the size is around 3 MB, which seems to be fine.
>>>> Could you try enabling debug logging and share the logs of ATS/AHS, and
>>>> also, if possible, the jstack output for the AHS process?
>>>>
>>>> + Naga
>>>>
>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote:
>>>>
>>>>> Hi Naga,
>>>>> I see the following size:
>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>> 36 /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>
>>>>> The timeline service has been restarted multiple times as I was
>>>>> looking for the issue with it. But it was installed about 2 months ago.
>>>>> Just a few applications (1? 2?) have been started since its last start.
>>>>> The ResourceManager interface has 261 entries.
>>>>>
>>>>> As in the yarn-site.xml that I attached, the variable you're asking for
>>>>> has the following value:
>>>>>
>>>>> <property>
>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>   <value>300000</value>
>>>>> </property>
>>>>>
>>>>> Ah, one more thing: when I looked with jstack to see what the process
>>>>> is doing, I saw threads spending time in NATIVE in the leveldbjni
>>>>> library. So I *think* it is related to the leveldb store.
>>>>>
>>>>> Please ask if any more information is needed.
>>>>> Any help is appreciated! Thanks
>>>>> Krzysiek
>>>>>
>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <garlanaganarasi...@huawei.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What's the size of the store files?
>>>>>> Since when has it been running? How many applications have been run
>>>>>> since it was started?
>>>>>> What's the value of
>>>>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>>>>
>>>>>> + Naga
>>>>>> ------------------------------
>>>>>> *From:* Krzysztof Zarzycki [k.zarzy...@gmail.com]
>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>
>>>>>> Hi there Hadoopers,
>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>> version 2.7.1 (HDP 2.3).
>>>>>> The timelineserver process (more precisely, the
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>> class) takes over 600% of CPU, generating enormous load on my master
>>>>>> node.
>>>>>> I can't guess why it happens.
>>>>>>
>>>>>> First, I ran the timelineserver using Java 8 and thought that this was
>>>>>> the issue. But no, I have now started the timelineserver with Java 7
>>>>>> and the problem is still the same.
>>>>>>
>>>>>> My cluster is tiny- it consists of:
>>>>>> - 2 HDFS nodes
>>>>>> - 2 HBase RegionServers
>>>>>> - 2 Kafkas
>>>>>> - 2 Spark nodes
>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>> TOTAL.
>>>>>>
>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>> please write.
>>>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>>>> server.
>>>>>>
>>>>>> And here is a command of timeline that I see from ps :
>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.home.dir=
>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -classpath
>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>
>>>>>> Thanks!
>>>>>> Krzysztof