You can use
jstat -gcutil <pid> 2000
to print the GC statistics every 2 seconds,
http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
It the long pause is from GC, the columns FGCT/FGC values would be large.
If you think it's a swap issue, you may want to use
vmstat 1 10000
watch out the si/so columns.
What's your jvm arguments? Too small heap memory size may be the issue.
2014-09-24 9:47 GMT+08:00 larry google groups <[email protected]>:
> I'm guessing that strace is showing me userland threads? When I quit
> strace I see:
>
> ^CProcess 19363 detached
> Process 19364 detached
> Process 19365 detached
> Process 19366 detached
> Process 19367 detached
> Process 19368 detached
> Process 19369 detached
> Process 19370 detached
> Process 19371 detached
> Process 19372 detached
> Process 19377 detached
> Process 19378 detached
> Process 19379 detached
> Process 19380 detached
> Process 19381 detached
> Process 19382 detached
> Process 19383 detached
> Process 19384 detached
> Process 19385 detached
> Process 19386 detached
> Process 19387 detached
> Process 19388 detached
> Process 19389 detached
> Process 19390 detached
> Process 19391 detached
> Process 19392 detached
> Process 19393 detached
> Process 19394 detached
> Process 19395 detached
> Process 19396 detached
> Process 19397 detached
> Process 19398 detached
> Process 19399 detached
> Process 19400 detached
> Process 19401 detached
> Process 19402 detached
> Process 19403 detached
> Process 19404 detached
> Process 19405 detached
> Process 19406 detached
> Process 19407 detached
> Process 19408 detached
> Process 19409 detached
> Process 19410 detached
> Process 19606 detached
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 90.06 40.973072 1363 30059 10449 futex
> 4.23 1.926411 819 2353 epoll_wait
> 3.02 1.373282 114440 12 6 restart_syscall
> 1.24 0.563107 93851 6 accept
> 0.99 0.449988 12 36909 gettimeofday
> 0.35 0.156992 5 29410 clock_gettime
> 0.05 0.021064 67 316 recvfrom
> 0.02 0.010338 30 347 write
> 0.01 0.005117 24 209 sendto
> 0.01 0.004369 24 180 poll
> 0.01 0.002683 24 112 22 read
> 0.01 0.002563 24 108 6 epoll_ctl
> 0.00 0.001618 14 112 open
> 0.00 0.001189 5 230 fcntl
> 0.00 0.001132 8 142 mprotect
> 0.00 0.000969 8 118 close
> 0.00 0.000806 38 21 writev
> 0.00 0.000685 6 109 ioctl
> 0.00 0.000655 6 110 fstat
> 0.00 0.000229 13 17 mmap
> 0.00 0.000216 36 6 shutdown
> 0.00 0.000197 33 6 dup2
> 0.00 0.000092 46 2 madvise
> 0.00 0.000061 5 12 setsockopt
> 0.00 0.000057 14 4 munmap
> 0.00 0.000056 5 12 getsockname
> 0.00 0.000035 4 8 rt_sigprocmask
> 0.00 0.000018 5 4 sched_getaffinity
> 0.00 0.000010 5 2 clone
> 0.00 0.000009 9 1 rt_sigreturn
> 0.00 0.000009 5 2 uname
> 0.00 0.000009 5 2 set_robust_list
> 0.00 0.000008 4 2 gettid
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 45.497046 100943 10483 total
>
>
>
>
>
>
> On Tuesday, September 23, 2014 9:44:52 PM UTC-4, larry google groups wrote:
>>
>> I am intrigued by this article, as the problem sounds the same as mine:
>>
>> http://corner.squareup.com/2014/09/logging-can-be-tricky.html
>>
>> "No significant amount of resources appeared to be in use — disk I/O,
>> network I/O, CPU, and memory all looked fairly tame. Furthermore, the bulk
>> of queries being served were all performing as expected. "
>>
>> So I tried to follow their example regarding strace. But I have never
>> worked with strace before. I used grep to find the PID and then I:
>>
>> sudo strace -c -f -p 19363
>>
>> and I got:
>>
>> Process 19363 attached with 45 threads
>>
>> Then I ran our "health check" which is like a series of functional tests
>> that ping our actual app (a live environment rather than a test
>> environment). I got nothing out of strace except these 2 lines appeared:
>>
>> Process 20973 attached
>> Process 20974 attached
>>
>> What does this mean? I had the impression that the JVM ran in 1 process?
>> Does strace show me userland threads (like htop does) or are these child
>> processes?
>>
>>
>>
>>
>>
>> On Monday, September 15, 2014 12:15:14 AM UTC-4, larry google groups
>> wrote:
>>>
>>>
>>> I have an embarrassing problem. I convinced my boss that I could use
>>> Clojure to build a RESTful API. I was successful in so far as that went,
>>> but now I face the issue that every once in a while, the program pauses,
>>> for a painfully long time -- sometimes 30 seconds, which causes some
>>> requests to the API to timeout. We are still in testing, so there is no
>>> real load on the app, just the frontenders, writing Javascript and making
>>> Ajax calls to the service.
>>>
>>> The service seems like a basic Clojure web app. I use Jetty as the
>>> webserver, and the libraries in use are:
>>>
>>> Ring
>>>
>>> Compojure
>>>
>>> Liberator
>>>
>>> Monger
>>>
>>> Timbre
>>>
>>> Lamina
>>>
>>> Dire
>>>
>>> When someone complains about the pauses, I will go test the service, and
>>> I can hit with 40 requests in 10 seconds and it has great performance. The
>>> pauses actually seem to come after periods of inactivity, which made me
>>> think that this had something to do with garbage collection, except that
>>> the pauses are so extreme -- like I said, sometimes as much as 30 seconds,
>>> causing requests to timeout. When the app does finally start to respond it
>>> again, it goes very fast, and responds to those pending request very fast.
>>>
>>> But I have to find a way to fix these pauses.
>>>
>>> Right now I packaged the app as an Uberjar and put it on the server,
>>> spun it up on port 24000 and proxied it through Apache. I put a script in
>>> /etc/init.d to start the app using start-stop-daemon.
>>>
>>> Possible things that could be going wrong:
>>>
>>> Maybe Jetty needs more threads, or maybe less threads? How would I test
>>> that?
>>>
>>> Maybe the link to MongoDB sometimes dies? (Mongo is on another server at
>>> Amazon) How would I test that?
>>>
>>> Maybe it is garbage collection? How would I test that?
>>>
>>> Maybe I have some code that somehow blocks the whole app? Seems unlikely
>>> but I'm trying to keep an open mind.
>>>
>>> Maybe the thread pool managed by Lamina sometimes gets overwhelmed? How
>>> would I test that?
>>>
>>> Maybe when Timbre writes to the log file it causes things to pause? (But
>>> I believe Timbre does this in its own thread?) How do I test that?
>>>
>>> This is a small app: only about 1,100 lines of code.
>>>
>>> I don't have much experience debugging problems on the JVM, so I welcome
>>> any suggestions.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
--
庄晓丹
Email: [email protected] [email protected]
Site: http://fnil.net
Twitter: @killme2008
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.