What's strange is that the crawler runs just fine for up to several hours. At some point, though, CPU utilization slowly begins to creep higher. Eventually everything locks up and the program hangs. 'top' shows the processes connected to the queue database at or near 100%, and the program ceases output (I have debugging messages built in to show current activity: "[timestamp] : crawling [URL]"). At some point--anywhere from 30 minutes to several hours later--CPU utilization drops back to normal and the program resumes operation as if nothing had happened. This goes on for up to several hours, then the utilization issue repeats. So it's a very odd issue I've run into.
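While the backends are pinned, it might be worth looking at more than 'select * from pg_stat_activity' to catch what each one is doing. A minimal sketch against the 9.2 pg_stat_activity view (the database name 'crawlq' here is an assumption--substitute the queue database's actual name):

```sql
-- Show each backend attached to the queue database, its state
-- ('active', 'idle', 'idle in transaction'), and the last statement it ran.
-- 'crawlq' is a hypothetical database name -- substitute your own.
SELECT pid, state, waiting, query_start, query
FROM pg_stat_activity
WHERE datname = 'crawlq'
ORDER BY query_start;
```

An 'idle in transaction' state here, as opposed to plain 'idle', would point at the application holding transactions open between queue operations.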
On 4/2/13, David Noel <david.i.n...@gmail.com> wrote:
> I'm running into a strange issue whereby my postgres processes are
> slowly creeping to 100% CPU utilization. I'm running
> postgresql-server-9.2.3, FreeBSD 8.3-RELEASE-p6, and using the
> postgresql-9.2-1002.jdbc4 driver.
>
> I'm not sure what information here is relevant, so I'll give
> everything I can as concisely as I can.
>
> The hardware is a Dell PowerEdge 1420, dual Xeon Noconas, 3.2 GHz,
> 16 GB RAM. The disks are four Kingston HyperX SATA3's attached to a
> HighPoint RocketRAID 2721 controller, ZFS, RAID10.
>
> There are two databases: one a queue, and one containing tables storing
> crawled data.
>
> The application is a webcrawler. The application pulls URLs from the
> queue and marks them active in a single transaction. It then feeds the
> URLs to the crawler threads, which crawl the URL, populate the tables
> with data, and signal the main thread to update the queue database,
> marking the item as inactive and rescheduling it with a new
> "NextCrawlDate".
>
> The processes that hang are the postgres processes that interact with
> the queue database.
>
> 'select * from pg_stat_activity' shows that the queries are not
> waiting, and are in the idle state.
>
> Essentially, the query in question is responsible for returning one URL
> from the union of the list of URLs whose crawl session has timed out
> and the list of URLs next in line to be crawled according to the
> schedule (NextCrawlDate). The query is:
>
>   select * from
>     ((select * from "crawlq" where "Active" = 'true' AND "TimeoutDate" <= now())
>      UNION
>      (select * from "crawlq" where "Active" = 'false')) as RS
>   order by "NextCrawlDate" asc limit 1
>
> Beyond this I don't know what useful debugging information to include.
> I'll take a guess and start with some vmstat output.
>
> Under normal conditions (with the crawler running) vmstat shows the
> following:
>
>  procs    memory      page                     disks     faults        cpu
>  r  b w   avm    fre    flt re pi po    fr sr da0 da1  in     sy    cs us sy id
>  2 0 0  119G  8450M  1143  0  1  0   900  0   0   0 128   6700  8632 32  4 65
>  2 0 0  119G  8444M  1937  0  0  0   100  0   4   4 280 112206  7683 36  5 59
>  1 0 0  119G  8443M   427  0  0  0  1377  0  90  90 222 115889  9020 35  7 58
>  1 0 0  119G  8442M  1798  0  0  0    18  0   1   1 153 104954  7071 30  4 66
>  3 0 0  119G  8443M   528  0  0  0   681  0  10  10 293 125170 14523 40 15 46
>  1 0 0  119G  8432M 15227  0  0  4  2850  0   3   3 205  83830  6663 55 12 33
>  8 0 0  119G  8433M  3306  0  0  0   445  0   2   2 249 113792  7068 34  5 61
> 10 0 0  118G  8461M  1190  0  0  0  9909  0  72  73 451 120263 10036 56  9 35
> 14 0 0  118G  8456M  5887  0  0  0  1202  0   2   2 272 130795  9506 44 12 44
>  9 0 0  118G  8444M  7750  0  0  0  1070  0   9   9 298  87643  9584 80 13  7
>  3 0 0  118G  8442M  1335  0  0  0   648  0   5   5 189 143691  9234 36  6 58
>  1 0 0  118G  8442M   689  0  1  1   472  0   2   2 206 153868  8635 32  7 61
>  1 0 0  118G  8441M   203  0  0  0  1124  0  75  75 191 142598  8909 31 10 60
>  2 0 0  118G  8440M  9508  0  0  0   684  0   8   8 231 132785 10247 47 13 41
>  4 0 0  118G  8456M  4046  0  0  0  5469  0  11  11 299 143119 12475 54 22 24
>  4 0 0  117G  8490M  1076  0  0  0  9858  0  16  16 291 140701 14849 58 25 17
>  1 0 0  116G  8524M   344  0  0  0  8936  0   4   4 234 149103 12137 45 15 40
>  2 0 0  114G  8586M   715  0  0  5 17719  0  73  75 322 151002 11430 34 10 56
>  5 0 0  112G  8648M  2773  0  0  0 16997  0   6   6 225 118339  8700 30 10 61
>  1 0 0  110G  8705M  4429  0  0  0 15763  0   7   7 423 139590 10354 40 11 49
>  1 0 0  108G  8760M  1443  0  0  0 14519  0   7   7 405 139806 10214 37  5 58
>  1 0 0  104G  8863M   333  0  0  0 26537  0   5   5 284 107770  9947 34  6 60
>  1 0 0  104G  8859M  1331  0  0  0  1700  0 114 114 464 103248 12113 40  9 51
>  1 0 0  104G  8854M  1708  0  0  0   272  0   6   6 279  99817  9470 40  5 55
>  9 0 0  104G  8850M  3653  0  0  0  4809  0  28  28 346 160041 54071 42 32 26
> 12 3 0  105G  8845M 20576  0  0  0 18344  0   7   7 383  95019 32533 46 53  1
> 20 0 0  114G  8721M 46913  0  0  0  2941  0  11  11 461  77480  9794 72 28  0
> 12 1 0  110G  8759M 25109  0  0  0 35881  0  70  70 413  72631 10161 76 24  0
>  2 0 0  110G  8716M 12993  0  1  1   265  0   8   8 292  83085 10073 61 30  9
>  3 0 0  110G  8716M  2144  0  0  0    45  0   3   3 183 100994  7410 39 20 41
>
> ...and when postgres goes bonkers:
>
>  procs    memory      page                     disks     faults        cpu
>  r  b w   avm    fre    flt re pi po    fr sr da0 da1  in     sy    cs us sy id
>  2 0 0   98G    10G  1091  0  1  0   849  0   0   0 114   2641  8582 30  4 66
>  2 0 0   98G    10G    20  0  0  0     0  0   0   0 197  20500 10454 46  2 53
>  2 0 0   98G    10G    59  0  0  0     0  0   0   0 284  23715 11180 46  3 51
>  3 0 0   98G    10G    17  0  0  0   652  0  69  70 288  21968 11571 46  4 50
>  2 0 0   98G    10G    56  0  0  0     0  0   0   0 242  22120 10251 47  2 50
>  2 0 0   98G    10G    16  0  0  0     0  0   0   0 207  20750  9982 48  1 51
>  4 0 0   98G    10G    61  0  0  0    97  0   0   0 205  21024 10395 49  2 50
>  2 0 0   98G    10G    24  0  0  0     0  0   0   0 215  21167 10378 48  1 50
>  4 0 0   98G    10G    57  0  0  0   346  0  46  46 272  22766 11314 47  3 50
>  2 0 0   98G    10G    19  0  0  0     0  0   0   0 205  20594 10340 48  1 50
>  2 0 0   98G    10G    53  0  0  0     0  0   0   0 250  22282 10526 48  2 50
>  2 0 0   98G    10G    27  0  0  0     0  0   0   0 260  27824 12090 51  4 44
>  2 0 0   98G    10G   543  0  0  0   429  0   0   0 267  28757 11817 46  4 50
>  2 0 0   98G    10G    23  0  0  0   260  0  44  44 278  33147 12624 51  5 43
>  2 0 0   98G    10G    55  0  0  0     0  0   0   0 243  26885 11081 47  3 50
>  2 0 0   98G    10G    16  0  0  0     0  0   0   0 207  20854 10348 48  1 51
>  2 0 0   98G    10G    55  0  0  0     9  0   1   1 222  22714 10766 48  1 51
>  2 0 0   98G    10G    23  0  0  0     0  0   0   0 200  20919 10392 47  2 51
>  2 0 0   98G    10G    55  0  0  0   517  0  88  87 247  20874 11032 46  3 51
>  4 0 0   98G    10G    20  0  0  0     0  0   0   0 253  30263 11705 50  3 47
>  2 0 0   98G    10G  1133  0  0  0     4  0   0   0 212  26791 11186 51  2 47
>  2 0 0   98G    10G    19  0  0  0     4  0   0   0 202  22043 10752 48  1 51
>  2 0 0   98G    10G    59  0  0  0     0  0   0   0 208  21416 10506 48  1 51
>  4 0 0   98G    10G    18  0  0  0     4  0   0   0 214  22207 10598 48  2 50
>  1 0 0   98G    10G    57  0  0  0     9  0   0   0 203  21102 10245 50  1 49
>  2 0 0   98G    10G    25  0  0  0     2  0   0   0 206  21531 10225 47  3 50
>  2 0 0   98G    10G    53  0  0  0     0  0   0   0 217  21083 10519 46  1 53
>  2 0 0   98G    10G    24  0  0  0    15  0   1   1 199  22009 10620 49  3 48
>  2 0 0   98G    10G    55  0  0  0   138  0  44  44 233  21647 10862 48  2 50
>  1 0 0   98G    10G    55  0  0  0     6  0   0   0 203  23002 10653 49  2 49
>
> /etc/rc.conf:
>   postgresql_enable="YES"
>   postgresql_data="/zdb/pgsql/data"
>   postgresql_flags="-s -m smart"
>
> /boot/loader.conf:
>   kern.ipc.semmns="1024"
>   kern.ipc.semmni="128"
>   kern.ipc.shmall="1048576"
>   kern.ipc.shmseg="2048"
>   kern.ipc.shmmax="2147483647"
>   kern.ipc.shmmni="2048"
>   kern.maxusers="1024"
>   kern.maxswzone="335544320"
>
> postgresql.conf, all standard/default except for:
>   max_connections = 256
>
> Any thoughts? What other information can I provide?
>
> Regards,
>
> -David

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general