Re: YARN timelineserver process taking 600% CPU

郭士伟 Sat, 05 Dec 2015 23:16:06 -0800

It seems that the large leveldb size is what causes the problem. What is
the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it's not short
enough, so there are too many entities in the timeline store.
By the way, it takes a long time (hours) when the ATS discards old
entities, and that operation also blocks the other operations. The
patch https://issues.apache.org/jira/browse/YARN-3448 is a great
performance improvement. We just backported it and it works well.
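
To see which TTL settings are actually in effect, a small script along these lines may help. It is only a sketch: the property names are the ones discussed in this thread plus 'yarn.timeline-service.ttl-enable', the fallback values are the stock defaults as I remember them (please verify against yarn-default.xml for your version), and the SAMPLE text is a made-up stand-in for your real yarn-site.xml.

```python
import xml.etree.ElementTree as ET

# Fallbacks used when a property is absent from yarn-site.xml.
# Assumed stock defaults; check yarn-default.xml for your Hadoop version.
TTL_DEFAULTS = {
    "yarn.timeline-service.ttl-enable": "true",
    "yarn.timeline-service.ttl-ms": "604800000",
    "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms": "300000",
}


def timeline_ttl_settings(xml_text):
    """Return the TTL-related settings, falling back to the assumed defaults."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in TTL_DEFAULTS:
            found[name] = (prop.findtext("value") or "").strip()
    return {key: found.get(key, default) for key, default in TTL_DEFAULTS.items()}


def ms_to_days(ms):
    """Convert a millisecond value (string or int) to days."""
    return int(ms) / (1000 * 60 * 60 * 24)


# Hypothetical yarn-site.xml content that sets only the purge interval.
SAMPLE = """<configuration>
  <property>
    <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
    <value>300000</value>
  </property>
</configuration>"""

settings = timeline_ttl_settings(SAMPLE)
for key, value in settings.items():
    print(key, "=", value)
print("entity retention in days:", ms_to_days(settings["yarn.timeline-service.ttl-ms"]))
```

If the retention turns out to be the 7-day default, lowering 'yarn.timeline-service.ttl-ms' should shrink the store over time, though the first purge may be slow, as noted above.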


2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
garlanaganarasi...@huawei.com>:

> Hi Krzysiek,
>
>
>
> *There are currently 8 Spark Streaming jobs constantly running: 3 with a
> 1-second batch and 5 with a 10-second batch. I believe these are the jobs
> that publish to ATS.  How could I check precisely what is doing what, or how
> to get some logs about it? I don't know...*
>
> I'm not sure about the applications being run, and if you have already tried
> disabling the Spark History Server (suspected of doing the puts to ATS), I
> can't say for certain whether the apps are sending the data. AFAIK the Spark
> History Server had not been integrated with ATS (SPARK-1537), so most
> probably it's the applications that are pumping in the data. I think you
> need to check the applications themselves.
>
>
> *2. Are 8 concurrent Spark Streaming jobs really that high a load for the
> Timelineserver? I have just a small cluster; how are other, larger companies
> handling much larger loads?*
>
> It hasn't been used at large scale by us, but YARN-2556 (ATS
> Performance Test Tool) states that "On a 36 node cluster, this
> results in ~830 concurrent containers (e.g maps), each firing 10KB of
> payload, 20 times." The only difference is that the store in your
> system is already overloaded with data, hence the cost of querying (which
> currently happens during each insertion) is very high.
>
> Maybe folks from other companies who have used or supported ATSv1 will be
> able to describe ATSv1's scale better!
>
>
> Regards,
>
> + Naga
> ------------------------------
> *From:* Krzysztof Zarzycki [k.zarzy...@gmail.com]
> *Sent:* Thursday, November 05, 2015 19:51
> *To:* user@hadoop.apache.org
> *Subject:* Re: YARN timelineserver process taking 600% CPU
>
> Thanks Naga for your input,  (I'm sorry for a late response, I was out for
> some time).
>
> So you believe that Spark is actually doing the PUTs? There are currently
> 8 Spark Streaming jobs constantly running: 3 with a 1-second batch and 5
> with a 10-second batch. I believe these are the jobs that publish to ATS.
> How could I check precisely what is doing what, or how to get some logs
> about it? I don't know...
> I thought maybe it was the Spark History Server doing the puts, but it seems
> it is not, as I disabled it and the load hasn't gone down. So it seems it is
> indeed the jobs themselves.
>
> Now I have the following problems:
> 1. The most important: How can I at least *work around* this issue? Maybe
> I can somehow disable Spark's usage of the YARN timelineserver? What are the
> consequences? Is it only that the history of finished Spark jobs is not
> saved? If yes, that doesn't hurt that much. Probably this is a question for
> the Spark group...
> 2. Are 8 concurrent Spark Streaming jobs really that high a load for the
> Timelineserver? I have just a small cluster; how are other, larger companies
> handling much larger loads?
>
> Thanks for helping me with this!
> Krzysiek
>
> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <naganarasimha...@gmail.com
> >:
>
>> Hi Krzysiek,
>> Oops, my mistake; 3 GB seems to be a little on the higher side.
>> From the jstack it seems there was no major activity other than puts:
>> around 16 concurrent puts were happening, each of which tries to get
>> the timeline entity and hence hits the native call.
>>
>> From the logs it seems a lot of ACL validations are happening, and from
>> the URL it seems they are for PUTEntites.
>> From approximately 09:30:16 to 09:44:26, about 9213 checks happened,
>> and if all of these are for puts, then roughly 10 put calls/s are
>> coming from the *spark* side. This, I feel, is not the right usage of ATS.
>> Can you check what is being published from Spark to ATS at this high rate?
>>
>> Besides, some improvements regarding the timeline metrics are available in
>> trunk as part of YARN-3360, which could have been useful in analyzing your
>> issue.
>>
>> + Naga
>>
>>
>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com>
>> wrote:
>>
>>> Hi Naga,
>>> Sorry, it's not 3 MB but 3 GB in leveldb-timeline-store (du shows
>>> numbers in kB). Does that seem reasonable as well?
>>> There are now 26850 files in the leveldb-timeline-store directory. New
>>> .sst files are generated each minute. Some are also being deleted.
>>>
>>> I started the timeline server today to gather logs and jstack; it was
>>> running for ~20 minutes. I attach the tar bz2 archive with those logs.
>>>
>>> Thank you for helping me debug this.
>>> Krzysiek
>>>
>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>> naganarasimha...@gmail.com>:
>>>
>>>> Hi Krzysiek,
>>>> It seems the size is around 3 MB, which seems fine.
>>>> Could you try enabling debug logging and share the logs of ATS/AHS and
>>>> also, if possible, the jstack output for the AHS process?
>>>>
>>>> + Naga
>>>>
>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>> k.zarzy...@gmail.com> wrote:
>>>>
>>>>> Hi Naga,
>>>>> I see the following size:
>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>
>>>>> The timeline service has been restarted multiple times as I was
>>>>> looking for the issue with it, but it was installed about 2 months ago.
>>>>> Just a few applications (1? 2?) have been started since its last start.
>>>>> The ResourceManager interface has 261 entries.
>>>>>
>>>>> As in yarn-site.xml that I attached, the variable you're asking for
>>>>> has the following value:
>>>>>
>>>>> <property>
>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>   <value>300000</value>
>>>>> </property>
>>>>>
>>>>>
>>>>> Ah, one more thing: when I looked with jstack to see what the process
>>>>> is doing, I saw threads spending time in NATIVE in the leveldbjni
>>>>> library. So I *think* it is related to the leveldb store.
>>>>>
>>>>> Please ask if any more information is needed.
>>>>> Any help is appreciated! Thanks
>>>>> Krzysiek
>>>>>
>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>> garlanaganarasi...@huawei.com>:
>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> What's the size of the store files?
>>>>>> Since when has it been running? How many applications have been run
>>>>>> since it was started?
>>>>>> What's the value of
>>>>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>>>>
>>>>>> + Naga
>>>>>> ------------------------------
>>>>>> *From:* Krzysztof Zarzycki [k.zarzy...@gmail.com]
>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>
>>>>>> Hi there Hadoopers,
>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>> version 2.7.1 (HDP 2.3).
>>>>>> The timelineserver process (more precisely, the
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>> class) takes over 600% of CPU, generating enormous load on my master
>>>>>> node. I can't guess why it happens.
>>>>>>
>>>>>> At first I ran the timelineserver using Java 8 and thought that this
>>>>>> was the issue. But no: I have now started the timelineserver with
>>>>>> Java 7 and the problem is still the same.
>>>>>>
>>>>>> My cluster is tiny- it consists of:
>>>>>> - 2 HDFS nodes
>>>>>> - 2 HBase RegionServers
>>>>>> - 2 Kafkas
>>>>>> - 2 Spark nodes
>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second in
>>>>>> total.
>>>>>>
>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>> please write.
>>>>>> I also attach yarn-site.xml, grepped for options related to the
>>>>>> timeline server.
>>>>>>
>>>>>> And here is the timelineserver command line that I see in ps:
>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log 
>>>>>> -Dyarn.home.dir=
>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -classpath
>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Krzysztof
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
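
For reference, the two figures quoted in the thread above (the put rate estimated from the ACL checks, and the store size reported by du) can be rechecked with a few lines. This is just a sketch of the arithmetic; it assumes that every ACL check corresponds to one put and that du reported sizes in 1 KiB blocks.

```python
from datetime import datetime


def rate_per_second(start, end, count, fmt="%H:%M:%S"):
    """Average events per second between two same-day timestamps."""
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return count / elapsed.total_seconds()


# 9213 ACL checks between 09:30:16 and 09:44:26 (850 seconds).
put_rate = rate_per_second("09:30:16", "09:44:26", 9213)
print(round(put_rate, 1))  # 10.8

# du reported 3307772 blocks of 1 KiB for leveldb-timeline-store.ldb.
store_gib = 3307772 / (1024 * 1024)
print(round(store_gib, 2))  # 3.15
```

So the "roughly 10 put calls/s" and "3 GB" estimates both hold up.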
