So our curl job helped mask the issue, but eventually it too started 
backing up, unable to keep up with the slowdown.  The issue appears to 
surface every time garbage collection occurs.

We are using the junit plugin v1.9 
(https://wiki.jenkins-ci.org/display/JENKINS/JUnit+Plugin) to provide graphs 
and convenient links to test failures. This plugin appears to be at the 
heart of the slowdown, but it isn't clear to us WHY these chains of reads 
are happening in the first place.  Also note that we save 500 builds for 
each job and would save more if it were feasible.
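
For a rough sense of how much result data that retention policy leaves on 
disk per job, something like this works (job1 is a placeholder; adjust for 
your JENKINS_HOME):

# total size of the stored junit results for a single job
du -ch /var/lib/jenkins/jobs/job1/builds/*/junitResult.xml | tail -n 1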

So, let's manually trigger a garbage collection and then visit two 
interstitial build pages (e.g. https://jenkins/jobs/job1/1234/ and 
https://jenkins/jobs/job2/3456).
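
For repeatability, the sequence is roughly the following (assuming the 
JDK's jcmd is installed on the master; <jenkins pid> and the URLs are the 
same placeholders used elsewhere in this post):

# force a full GC in the Jenkins JVM
jcmd <jenkins pid> GC.run

# then load the two interstitial build pages
curl -sk https://jenkins/jobs/job1/1234/ > /dev/null
curl -sk https://jenkins/jobs/job2/3456 > /dev/null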

Now let's run the following strace against the pid of the Jenkins java 
process on our CentOS install -

strace -p <jenkins pid> -f -s1000 -e trace='open' 2>&1 | grep 'jobs'

OUTPUT:
[pid 1000] 16:45:21 open("/var/lib/jenkins/jobs/job1/builds/1234/junitResult.xml", O_RDONLY) = 487
[pid 2000] 16:45:24 open("/var/lib/jenkins/jobs/job2/builds/3456/junitResult.xml", O_RDONLY) = 483
[pid 1000] 16:45:26 open("/var/lib/jenkins/jobs/job1/builds/1233/junitResult.xml", O_RDONLY) = 487
[pid 2000] 16:45:28 open("/var/lib/jenkins/jobs/job2/builds/3455/junitResult.xml", O_RDONLY) = 483

It continues like this through all 500 saved builds of each job, spinning 
up a separate java thread per job, in a cycle lasting roughly 41 minutes 
(500 builds at ~5 seconds each).  Remember, this was triggered by visiting 
a single interstitial build page for each job.  What is it doing?
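
To watch it grind through the scan, a snapshot of per-thread CPU in the JVM 
is enough (our assumption, based on the strace below, is that each busy 
thread corresponds to one job's build history being re-read):

# one-shot snapshot of the busiest threads inside the Jenkins process
top -H -b -n 1 -p <jenkins pid> | head -n 20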

It appears to be opening each ~16MB junitResult.xml file and parsing it... 
but for what purpose?
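
A quick way to confirm it really is these result files being walked (and 
not something else) is to count how many the process has open at any moment:

# number of junitResult.xml files currently open in the Jenkins process
lsof -p <jenkins pid> 2>/dev/null | grep -c junitResult.xml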

strace of the pid that is opening each ~16MB junitResult.xml -
[pid 21366] 16:50:46 fstat(177, {st_mode=S_IFREG|0644, st_size=18788419, ...}) = 0
[pid 21366] 16:50:46 lseek(177, 0, SEEK_CUR) = 10403840
[pid 21366] 16:50:46 lseek(177, 0, SEEK_END) = 18788419
[pid 21366] 16:50:46 lseek(177, 10403840, SEEK_SET) = 10403840
[pid 21366] 16:50:46 read(177, "8K of XML DATA

[pid 21366] 16:50:46 fstat(177, {st_mode=S_IFREG|0644, st_size=18788419, ...}) = 0
[pid 21366] 16:50:46 lseek(177, 0, SEEK_CUR) = 10412032
[pid 21366] 16:50:46 lseek(177, 0, SEEK_END) = 18788419
[pid 21366] 16:50:46 lseek(177, 10412032, SEEK_SET) = 10412032
[pid 21366] 16:50:46 read(177, "NEXT 8K OF XML FILE
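
For comparison, if a raw sequential read of one of these files comes back 
in well under a second, that would point at XML parsing rather than disk 
I/O as the bottleneck (the path below is one of the placeholder builds from 
the strace output):

# time a plain read of a single stored result file
time cat /var/lib/jenkins/jobs/job1/builds/1234/junitResult.xml > /dev/null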

We have temporarily eased the pain by reducing the number of saved builds 
from 500 to 50, but we would love to find a real fix or at least better 
understand what is happening.
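
If anyone wants to reproduce the before/after comparison, the retention 
change can be sanity-checked on disk; assuming the standard "Discard old 
builds" option, it shows up as numToKeep in each job's config.xml:

# confirm which jobs now keep only 50 builds
grep -H '<numToKeep>' /var/lib/jenkins/jobs/*/config.xml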

Thanks!

Kevin*2



On Wednesday, October 14, 2015 at 8:40:16 AM UTC-7, brother kevin wrote:
>
> To follow up on this thread for anyone who is experiencing similar issues 
> with junit processing and memory usage -
>
> Firstly, we noticed that a specific set of pages was taking the longest 
> to load - the interstitial build-number pages that show changes, SCM 
> revisions, and a Test Result link (e.g. https://jenkins/job/myjob/43980/).
>
> Secondly, we noticed that the longer the job ran without someone actively 
> loading one of the interstitial build pages, the worse the problem became. 
>  For example, attempting to load one of these pages after a weekend with no 
> activity could take upwards of a minute.
>
> Finally, once a single user loaded the most recent interstitial build 
> page, all of the other pages became responsive for all users.
>
> Our workaround for the time being is a cron job that pre-loads these 
> interstitial pages using curl.  We were pleasantly surprised this worked at 
> all and we are anxious to better understand what is causing this new 
> blocking behavior.
>
> 0 * * * * jenkins lastBuild=`ls -ltr /var/lib/jenkins/jobs/myjob/builds | tail -1 | awk '{print $9}'`; curl -k https://jenkins/job/myjob/$lastBuild/ > /tmp/$lastBuild.output
>
>
>
> On Tuesday, September 29, 2015 at 11:09:46 AM UTC-7, brother kevin wrote:
>>
>> The problem returned after changing the max builds back to 500 for the 
>> job in question.  We kept upping the -Xmx value, and Jenkins continued to 
>> run out of memory; the web UI would become unusable, with most requests 
>> timing out.
>>
>> We eventually settled on -Xmx16384M, and the Jenkins Monitoring plugin 
>> showed the junit XML processing cresting the 16GB watermark.
>>
>> We upgraded from v1.8 of the junit plugin to the latest version, v1.9, 
>> and experienced the same excessive memory usage.
>>
>> The junit.xml file includes 51,000 tests and is about 17MB in size. 
>> Is that indicative of a job that has too many tests within it?
>>
>> Please advise; we have turned off junit processing on the job in 
>> question, and our memory usage has dropped to below 4GB.
>>
>> Thank you!
>>
