Re: How do I track down a painfully long pause in a small web app?

dennis zhuang Wed, 24 Sep 2014 01:46:39 -0700

You can use

    jstat -gcutil <pid> 2000


to print the GC statistics every 2 seconds,
http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html

It the long pause is from GC, the columns FGCT/FGC values would be large.

If you think it's a swap issue, you may want to use

   vmstat 1 10000

watch out the si/so columns.

What's your jvm arguments? Too small heap memory size may be the issue.

2014-09-24 9:47 GMT+08:00 larry google groups <lawrencecloj...@gmail.com>:

> I'm guessing that strace is showing me userland threads? When I quit
> strace I see:
>
> ^CProcess 19363 detached
> Process 19364 detached
> Process 19365 detached
> Process 19366 detached
> Process 19367 detached
> Process 19368 detached
> Process 19369 detached
> Process 19370 detached
> Process 19371 detached
> Process 19372 detached
> Process 19377 detached
> Process 19378 detached
> Process 19379 detached
> Process 19380 detached
> Process 19381 detached
> Process 19382 detached
> Process 19383 detached
> Process 19384 detached
> Process 19385 detached
> Process 19386 detached
> Process 19387 detached
> Process 19388 detached
> Process 19389 detached
> Process 19390 detached
> Process 19391 detached
> Process 19392 detached
> Process 19393 detached
> Process 19394 detached
> Process 19395 detached
> Process 19396 detached
> Process 19397 detached
> Process 19398 detached
> Process 19399 detached
> Process 19400 detached
> Process 19401 detached
> Process 19402 detached
> Process 19403 detached
> Process 19404 detached
> Process 19405 detached
> Process 19406 detached
> Process 19407 detached
> Process 19408 detached
> Process 19409 detached
> Process 19410 detached
> Process 19606 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  90.06   40.973072        1363     30059     10449 futex
>   4.23    1.926411         819      2353           epoll_wait
>   3.02    1.373282      114440        12         6 restart_syscall
>   1.24    0.563107       93851         6           accept
>   0.99    0.449988          12     36909           gettimeofday
>   0.35    0.156992           5     29410           clock_gettime
>   0.05    0.021064          67       316           recvfrom
>   0.02    0.010338          30       347           write
>   0.01    0.005117          24       209           sendto
>   0.01    0.004369          24       180           poll
>   0.01    0.002683          24       112        22 read
>   0.01    0.002563          24       108         6 epoll_ctl
>   0.00    0.001618          14       112           open
>   0.00    0.001189           5       230           fcntl
>   0.00    0.001132           8       142           mprotect
>   0.00    0.000969           8       118           close
>   0.00    0.000806          38        21           writev
>   0.00    0.000685           6       109           ioctl
>   0.00    0.000655           6       110           fstat
>   0.00    0.000229          13        17           mmap
>   0.00    0.000216          36         6           shutdown
>   0.00    0.000197          33         6           dup2
>   0.00    0.000092          46         2           madvise
>   0.00    0.000061           5        12           setsockopt
>   0.00    0.000057          14         4           munmap
>   0.00    0.000056           5        12           getsockname
>   0.00    0.000035           4         8           rt_sigprocmask
>   0.00    0.000018           5         4           sched_getaffinity
>   0.00    0.000010           5         2           clone
>   0.00    0.000009           9         1           rt_sigreturn
>   0.00    0.000009           5         2           uname
>   0.00    0.000009           5         2           set_robust_list
>   0.00    0.000008           4         2           gettid
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   45.497046                100943     10483 total
>
>
>
>
>
>
> On Tuesday, September 23, 2014 9:44:52 PM UTC-4, larry google groups wrote:
>>
>> I am intrigued by this article, as the problem sounds the same as mine:
>>
>> http://corner.squareup.com/2014/09/logging-can-be-tricky.html
>>
>> "No significant amount of resources appeared to be in use — disk I/O,
>> network I/O, CPU, and memory all looked fairly tame. Furthermore, the bulk
>> of queries being served were all performing as expected. "
>>
>> So I tried to follow their example regarding strace. But I have never
>> worked with strace before. I used grep to find the PID and then I:
>>
>>  sudo strace -c -f -p 19363
>>
>> and I got:
>>
>> Process 19363 attached with 45 threads
>>
>> Then I ran our "health check" which is like a series of functional tests
>> that ping our actual app (a live environment rather than a test
>> environment). I got nothing out of strace except these 2 lines appeared:
>>
>> Process 20973 attached
>> Process 20974 attached
>>
>> What does this mean? I had the impression that the JVM ran in 1 process?
>> Does strace show me userland threads (like htop does) or are these child
>> processes?
>>
>>
>>
>>
>>
>> On Monday, September 15, 2014 12:15:14 AM UTC-4, larry google groups
>> wrote:
>>>
>>>
>>> I have an embarrassing problem. I convinced my boss that I could use
>>> Clojure to build a RESTful API. I was successful in so far as that went,
>>> but now I face the issue that every once in a while, the program pauses,
>>> for a painfully long time -- sometimes 30 seconds, which causes some
>>> requests to the API to timeout. We are still in testing, so there is no
>>> real load on the app, just the frontenders, writing Javascript and making
>>> Ajax calls to the service.
>>>
>>> The service seems like a basic Clojure web app. I use Jetty as the
>>> webserver, and the libraries in use are:
>>>
>>> Ring
>>>
>>> Compojure
>>>
>>> Liberator
>>>
>>> Monger
>>>
>>> Timbre
>>>
>>> Lamina
>>>
>>> Dire
>>>
>>> When someone complains about the pauses, I will go test the service, and
>>> I can hit with 40 requests in 10 seconds and it has great performance. The
>>> pauses actually seem to come after periods of inactivity, which made me
>>> think that this had something to do with garbage collection, except that
>>> the pauses are so extreme -- like I said, sometimes as much as 30 seconds,
>>> causing requests to timeout. When the app does finally start to respond it
>>> again, it goes very fast, and responds to those pending request very fast.
>>>
>>> But I have to find a way to fix these pauses.
>>>
>>> Right now I packaged the app as an Uberjar and put it on the server,
>>> spun it up on port 24000 and proxied it through Apache. I put a script in
>>> /etc/init.d to start the app using  start-stop-daemon.
>>>
>>> Possible things that could be going wrong:
>>>
>>> Maybe Jetty needs more threads, or maybe less threads? How would I test
>>> that?
>>>
>>> Maybe the link to MongoDB sometimes dies? (Mongo is on another server at
>>> Amazon) How would I test that?
>>>
>>> Maybe it is garbage collection? How would I test that?
>>>
>>> Maybe I have some code that somehow blocks the whole app? Seems unlikely
>>> but I'm trying to keep an open mind.
>>>
>>> Maybe the thread pool managed by Lamina sometimes gets overwhelmed? How
>>> would I test that?
>>>
>>> Maybe when Timbre writes to the log file it causes things to pause? (But
>>> I believe Timbre does this in its own thread?) How do I test that?
>>>
>>> This is a small app: only about 1,100 lines of code.
>>>
>>> I don't have much experience debugging problems on the JVM, so I welcome
>>> any suggestions.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>  --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
庄晓丹
Email:        killme2...@gmail.com xzhu...@avos.com
Site:           http://fnil.net
Twitter:      @killme2008

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: How do I track down a painfully long pause in a small web app?

Reply via email to