Re: How do I track down a painfully long pause in a small web app?

larry google groups Tue, 23 Sep 2014 18:47:48 -0700

I'm guessing that strace is showing me userland threads? When I quit strace 
I see:


^CProcess 19363 detached
Process 19364 detached
Process 19365 detached
Process 19366 detached
Process 19367 detached
Process 19368 detached
Process 19369 detached
Process 19370 detached
Process 19371 detached
Process 19372 detached
Process 19377 detached
Process 19378 detached
Process 19379 detached
Process 19380 detached
Process 19381 detached
Process 19382 detached
Process 19383 detached
Process 19384 detached
Process 19385 detached
Process 19386 detached
Process 19387 detached
Process 19388 detached
Process 19389 detached
Process 19390 detached
Process 19391 detached
Process 19392 detached
Process 19393 detached
Process 19394 detached
Process 19395 detached
Process 19396 detached
Process 19397 detached
Process 19398 detached
Process 19399 detached
Process 19400 detached
Process 19401 detached
Process 19402 detached
Process 19403 detached
Process 19404 detached
Process 19405 detached
Process 19406 detached
Process 19407 detached
Process 19408 detached
Process 19409 detached
Process 19410 detached
Process 19606 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.06   40.973072        1363     30059     10449 futex
  4.23    1.926411         819      2353           epoll_wait
  3.02    1.373282      114440        12         6 restart_syscall
  1.24    0.563107       93851         6           accept
  0.99    0.449988          12     36909           gettimeofday
  0.35    0.156992           5     29410           clock_gettime
  0.05    0.021064          67       316           recvfrom
  0.02    0.010338          30       347           write
  0.01    0.005117          24       209           sendto
  0.01    0.004369          24       180           poll
  0.01    0.002683          24       112        22 read
  0.01    0.002563          24       108         6 epoll_ctl
  0.00    0.001618          14       112           open
  0.00    0.001189           5       230           fcntl
  0.00    0.001132           8       142           mprotect
  0.00    0.000969           8       118           close
  0.00    0.000806          38        21           writev
  0.00    0.000685           6       109           ioctl
  0.00    0.000655           6       110           fstat
  0.00    0.000229          13        17           mmap
  0.00    0.000216          36         6           shutdown
  0.00    0.000197          33         6           dup2
  0.00    0.000092          46         2           madvise
  0.00    0.000061           5        12           setsockopt
  0.00    0.000057          14         4           munmap
  0.00    0.000056           5        12           getsockname
  0.00    0.000035           4         8           rt_sigprocmask
  0.00    0.000018           5         4           sched_getaffinity
  0.00    0.000010           5         2           clone
  0.00    0.000009           9         1           rt_sigreturn
  0.00    0.000009           5         2           uname
  0.00    0.000009           5         2           set_robust_list
  0.00    0.000008           4         2           gettid
------ ----------- ----------- --------- --------- ----------------
100.00   45.497046                100943     10483 total






On Tuesday, September 23, 2014 9:44:52 PM UTC-4, larry google groups wrote:
>
> I am intrigued by this article, as the problem sounds the same as mine:
>
> http://corner.squareup.com/2014/09/logging-can-be-tricky.html
>
> "No significant amount of resources appeared to be in use — disk I/O, 
> network I/O, CPU, and memory all looked fairly tame. Furthermore, the bulk 
> of queries being served were all performing as expected. "
>
> So I tried to follow their example regarding strace. But I have never 
> worked with strace before. I used grep to find the PID and then I:
>
>  sudo strace -c -f -p 19363
>
> and I got:
>
> Process 19363 attached with 45 threads
>
> Then I ran our "health check" which is like a series of functional tests 
> that ping our actual app (a live environment rather than a test 
> environment). I got nothing out of strace except these 2 lines appeared:
>
> Process 20973 attached
> Process 20974 attached
>
> What does this mean? I had the impression that the JVM ran in 1 process? 
> Does strace show me userland threads (like htop does) or are these child 
> processes? 
>
>
>
>
>
> On Monday, September 15, 2014 12:15:14 AM UTC-4, larry google groups wrote:
>>
>>
>> I have an embarrassing problem. I convinced my boss that I could use 
>> Clojure to build a RESTful API. I was successful in so far as that went, 
>> but now I face the issue that every once in a while, the program pauses, 
>> for a painfully long time -- sometimes 30 seconds, which causes some 
>> requests to the API to timeout. We are still in testing, so there is no 
>> real load on the app, just the frontenders, writing Javascript and making 
>> Ajax calls to the service. 
>>
>> The service seems like a basic Clojure web app. I use Jetty as the 
>> webserver, and the libraries in use are: 
>>
>> Ring
>>
>> Compojure
>>
>> Liberator
>>
>> Monger 
>>
>> Timbre
>>
>> Lamina
>>
>> Dire
>>
>> When someone complains about the pauses, I will go test the service, and 
>> I can hit with 40 requests in 10 seconds and it has great performance. The 
>> pauses actually seem to come after periods of inactivity, which made me 
>> think that this had something to do with garbage collection, except that 
>> the pauses are so extreme -- like I said, sometimes as much as 30 seconds, 
>> causing requests to timeout. When the app does finally start to respond it 
>> again, it goes very fast, and responds to those pending request very fast. 
>>
>> But I have to find a way to fix these pauses. 
>>
>> Right now I packaged the app as an Uberjar and put it on the server, spun 
>> it up on port 24000 and proxied it through Apache. I put a script in 
>> /etc/init.d to start the app using  start-stop-daemon.
>>
>> Possible things that could be going wrong:
>>
>> Maybe Jetty needs more threads, or maybe less threads? How would I test 
>> that?
>>
>> Maybe the link to MongoDB sometimes dies? (Mongo is on another server at 
>> Amazon) How would I test that? 
>>
>> Maybe it is garbage collection? How would I test that? 
>>
>> Maybe I have some code that somehow blocks the whole app? Seems unlikely 
>> but I'm trying to keep an open mind. 
>>
>> Maybe the thread pool managed by Lamina sometimes gets overwhelmed? How 
>> would I test that? 
>>
>> Maybe when Timbre writes to the log file it causes things to pause? (But 
>> I believe Timbre does this in its own thread?) How do I test that? 
>>
>> This is a small app: only about 1,100 lines of code.
>>
>> I don't have much experience debugging problems on the JVM, so I welcome 
>> any suggestions. 
>>
>>
>>
>>
>>
>>
>>
>>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: How do I track down a painfully long pause in a small web app?

Reply via email to