Re: [PHP-DB] Real Killer App!
on 3/15/03 7:55 PM, Nicholas Fitzgerald at [EMAIL PROTECTED] appended the following bits to my mbox:

> Spider still dies, but now it's finally given me an error:
>
> FATAL: erealloc(): unable to allocate 11 bytes.
>
> This is interesting, as I'm not using erealloc() anywhere in the script. When I went to php.net to check it out, all I got was a memory management page with this and some other memory type functions listed.

Those are functions used by the actual PHP developers to communicate with the Zend engine (which itself communicates with the OS/CPU). You don't call those functions and can't affect them.

This sounds like a very obscure bug with PHP on Windows. I can't remember if you are running the latest version of PHP (4.3.1 or 4.3.2 RC 1). If not, please try it using that version of PHP to see if you can still reproduce the problem. If so, please file a bug report at http://bugs.php.net/

A quick search didn't pull up anything related to your issue. The only recent entry related to erealloc is for 4.2.3 on Windows:

http://bugs.php.net/bug.php?id=20913

But it doesn't necessarily seem applicable. Unless you're running out of memory? Or maybe the server can't handle all the connections? If so, you could try sleeping for a second every 10 times through the loop, etc., to slow down the process and maybe keep it from dying.

Hope that helps.

Sincerely,

Paul Burney
http://paulburney.com/

<?php
if ($your_php_version < 4.1.2) {
    upgrade_now(); // to avoid major security problems
}
?>
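A minimal sketch of that throttling idea (the loop shape and process_url() are placeholders, not from the actual script):

<?php
// Hypothetical main loop: pause for a second after every
// 10th URL so the server and its connections get a breather.
$i = 0;
$sql = mysql_query("SELECT * FROM search");
while ($rslt = mysql_fetch_array($sql)) {
    process_url($rslt["url"]); // placeholder for the real indexing work
    $i++;
    if ($i % 10 == 0) {
        sleep(1);
    }
}
?>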
Re: [PHP-DB] Real Killer App!
That sounds about right, I think. I'm using 4.3.1, so I really am beginning to think it's a bug of some kind. I'm definitely not running into memory problems; this server has 1.5 gig and isn't coming anywhere close to using all of it, even when everything on the box is jumping. It's most likely not a connection problem either, but that is something to look closer at. In the meantime I'm going to try that delaying tactic you mentioned and see what kind of results I get.

Thanks for the info!

Nick

Paul Burney wrote:

> Those are functions used by the actual PHP developers to communicate with the Zend engine (which itself communicates with the OS/CPU). You don't call those functions and can't affect them.
>
> This sounds like a very obscure bug with PHP on Windows. <snip>
Re: [PHP-DB] Real Killer App!
Well, I've gotten a long way on this, and here's the results up to now:

On Red Hat 8.0 (Apache 2.0.40, PHP 4.2.2, MySQL 3.23.54a; PII 366, 30gig hdd, 256meg mem): Everything works flawlessly. Have spidered several HUGE sites. Goes fast, goes accurate.

On Windows 2000 SP3 (Apache 2.0.44, PHP 4.3.1, MySQL 3.23.55 max-nt; dual PIII 1gig, 160gig raid, 1.5gig mem): Spider still dies, but now it's finally given me an error:

FATAL: erealloc(): unable to allocate 11 bytes.

This is interesting, as I'm not using erealloc() anywhere in the script. When I went to php.net to check it out, all I got was a memory management page with this and some other memory type functions listed. No information at all about what to do with them, how to use them, where to use them, when to use them, or anything. If anyone out there has any info on this function and/or the error it gave me, please pass it along!

Nick
Re: [PHP-DB] Real Killer App!
I would compare the php.ini settings on the 2 systems. phpinfo() will be able to give you the current state of the variables PHP is using. There could be something strange on the Windows system that you'll notice as different from the Linux box's phpinfo() (hopefully).

-VolVE

----- Original Message -----
From: Nicholas Fitzgerald [EMAIL PROTECTED]
To: PHP Database List [EMAIL PROTECTED]
Sent: Saturday, March 15, 2003 19:55
Subject: Re: [PHP-DB] Real Killer App!

> Well, I've gotten a long way on this, and here's the results up to now: <snip>
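If wading through two full phpinfo() pages is a pain, a quick sketch like this prints a short list to diff between the two boxes (the choice of directives is just a guess at the relevant ones):

<?php
// Dump a few settings worth comparing between Windows and Linux.
$keys = array("memory_limit", "max_execution_time",
              "default_socket_timeout", "output_buffering");
foreach ($keys as $k) {
    echo $k . " = " . ini_get($k) . "\n";
}
?>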
Re: [PHP-DB] Real Killer App!
At 01:58 3/14/2003, Nicholas Fitzgerald wrote:

> As you guys know I've been going around in circles with this spider app problem for a couple days.
> How would you do it?

http://www.hotscripts.com/PHP/Scripts_and_Programs/Search_Engines/more3.html

Start Here to Find It Fast!© - http://www.US-Webmasters.com/best-start-page/
Re: [PHP-DB] Real Killer App!
I've already looked at all of these, well most of them anyway. The only ones I haven't looked at are the ones that just do real-time searches. Nothing of what I've seen is as functional as what I've designed, and for the most part built. Which is why I built it. This spider issue is the only thing that remains to be done.

I'm currently using mnoGoSearch on an existing search engine I have, but I had to do a lot of work to get it to A: act like one would expect a search engine to act, and B: integrate into the site the way I wanted it. A lot of it was stuff I shouldn't have had to do. Besides, the spider is slow as hell, and only works on Linux, unless I want to pay $100's for the Windows version. Not that that's necessarily a problem, but I would like to have that option. The spider I've written, except for this problem I'm having, is much faster on Windows than mnoGoSearch is on Linux! As soon as I hit the send button here I'm going to be installing on a Linux server and see if I have the same problem.

Nick

W. D. wrote:

> At 01:58 3/14/2003, Nicholas Fitzgerald wrote:
>> As you guys know I've been going around in circles with this spider app problem for a couple days.
>> How would you do it?
>
> http://www.hotscripts.com/PHP/Scripts_and_Programs/Search_Engines/more3.html <snip>
Re: [PHP-DB] Real Killer App!
Have you tried adding a flush() statement in at certain points? Perhaps put one in after every page is processed. Technically this is designed for browsers, but perhaps it will help here. It will most likely slow things down, but if it works, then you can adjust it from there (like every 10 pages or so).

As another long shot, you may try using clearstatcache() also. You don't want to cache any files you are processing, but PHP may be doing this anyway.

On Thursday, March 13, 2003, at 02:59 PM, Nicholas Fitzgerald wrote:

> Actually, that was the first thing I thought of. So I set set_time_limit(0); right at the top of the script. Is it possible there is another setting for this in php.ini that I also need to deal with? I have other scripts running and I don't want to have an unlimited execution time on all of them.
>
> Nick

--
Brent Baisley
Systems Architect
Landover Associates, Inc.
Search Advisory Services for Advanced Technology Environments
p: 212.759.6400/800.759.0577
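A sketch of where those calls might go, assuming a per-page loop roughly like this (index_page() is a stand-in for the real work):

<?php
while ($rslt = mysql_fetch_array($sql)) {
    index_page($rslt["url"]); // stand-in for fetch/parse/store
    flush();                  // push any pending output now
    clearstatcache();         // drop PHP's cached file stat info
}
?>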
Re: [PHP-DB] Real Killer App!
I did try a flush table on the table, even down to one for every read or write, with no luck. Haven't tried the clearstatcache() though; sounds like an idea whose time has come.

Nick

Brent Baisley wrote:

> Have you tried adding a flush() statement in at certain points? Perhaps put one in after every page is processed. <snip>
Re: [PHP-DB] Real Killer App!
on 3/12/03 5:45 PM, Nicholas Fitzgerald at [EMAIL PROTECTED] appended the following bits to my mbox:

> is that entire prog as it now exists. Notice I have NOT configured it as yet to go into the next level. I did this on purpose so I wouldn't have to kill it in the middle of operation and potentially screw stuff up. The way it is now, it looks at all the records in the database, updates them if necessary, then extracts all the links and puts them into the database for crawling on the next run through. Once I get this working I'll put a big loop in it so it keeps going until there's nothing left to look at. Meanwhile, if anyone sees anything in here that could be the cause of this problem please let me know!

I don't think I've found the problem, but I thought I'd point out a couple things:

> // Open the database and start looking at URLs
> $sql = mysql_query("SELECT * FROM search");
> while($rslt = mysql_fetch_array($sql)){
>     $url = $rslt["url"];

The above line gets all the data from the table and then starts looping through...

> // Put the stuff in the search database
> $puts = mysql_query("SELECT * FROM search WHERE url='$url'");
> $site = mysql_fetch_array($puts);
> $nurl = $site["url"];
> $ncrc = $site["checksum"];
> $ndate = $site["date"];
> if($ndate <= $daycheck || $ncrc != $checksum){

That line does the same query again for this particular URL to set variables in the $site array, though you already have this info in the $rslt array. You could potentially save hundreds of queries there.

> // Get the page title
> $temp = stristr($read,"<title>");
> <snip>
> $tchn = ($tend - $tpos);
> $title = strip_tags(substr($read, ($tpos+7),$tchn));

Aside: Interesting way of doing things. I usually just preg_match these things, but I like this too.

> // Kill any trailing slashes
> if(substr($link,(strlen($link)-1)) == "/"){
>     $link = substr($link,0,(strlen($link)-1));
> }

Why are you killing the trailing slashes? That's going to cause fopen double the work to get to the pages. That is, first it will request the page without the slash, then get a redirect response with the slash, and then request the page again.

> // Put the new URL in the search database
> $chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
> $curec = mysql_fetch_array($chk);
> if(!$curec){
>     echo "Adding: $link\n";
>     $putup = mysql_query("INSERT INTO search SET url='$link'");
> }
> else{
>     continue;
> }

You might want to give a different variable name to the new link, or encapsulate the above in a function, so your $link variables don't clobber each other.

> indicate where the chokepoint might be. It seems to be when the DB reaches a certain size, but 300 or so records should be a piece of cake for it. As far as the debug backtrace, there really isn't anything there that stands out. It's not an issue with a variable, something is going wrong in the execution either of php, or a sql query. I'm not finding any errors in the mysql error log, or anywhere else.

What url is it dying on? You could probably echo each $url to the terminal to watch its progression and see where it is stopping.

I've had problems with apache using custom php error docs where the error doc contained a php generated image that wasn't found. Each image that failed would generate another PHP error, which cascaded until the server basically died.

KIND OF BROADER ASIDE REGARDING SEARCH ENGINE PROBLEMS:

I've also had recursion problems because php allows any characters to be appended after the request. For example, let's say you have an examples.php file and for some reason you have a relative link in examples.php to examples/somefile.html.

If the examples directory doesn't exist, apache will serve examples.php to the user using the request of examples/somefile.html. A recursive search engine (that isn't too smart, i.e., infoseek and excite for colleges), will keep requesting things like:

http://example.com/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/somefile.html

As far as apache is concerned, it is fulfilling the request with the examples.php file, and php just sees a really long query_string starting with /examples. I'm sure that isn't your problem, but I've been bit by it a few times.

END OF ASIDE

Hope some of that ramble helps. Please try to see if it is dying on a particular URL so we can be of further assistance.

Sincerely,

Paul Burney
http://paulburney.com/

Q: Tired of creating admin interfaces to your MySQL web applications?
A: Use MySTRI instead. Version 3.1 now available.
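For what it's worth, a rough sketch of those two suggestions together: reusing the already-fetched row instead of re-querying, and grabbing the title with preg_match (column names are taken from the posted code; the rest is assumed):

<?php
$sql = mysql_query("SELECT * FROM search");
while ($rslt = mysql_fetch_array($sql)) {
    // Reuse the row we already have instead of re-querying by URL.
    $url   = $rslt["url"];
    $ncrc  = $rslt["checksum"];
    $ndate = $rslt["date"];

    // ... fetch the page into $read here ...

    // Title extraction via preg_match instead of stristr/substr math.
    if (preg_match("/<title>(.*?)<\/title>/is", $read, $m)) {
        $title = trim(strip_tags($m[1]));
    }
}
?>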
Re: [PHP-DB] Real Killer App!
Ok, here's something else I've just noticed about this problem. I noticed that when this thing gets to a certain point, somewhere in the 4th run through, it hits a certain url, then the next, at which point it seems to pause for several seconds, then it goes back and hits that first certain url. It looks something like this:

Updating: http://www.domain.com/certainurl.html
Updating: http://www.domain.com/nexturl.html
Updating: http://www.domain.com/certainurl.html

It only does this in one place, though not necessarily the same place, depending on where I start. Then, afterwards, it goes through a few more pages, and that's where it dies. Another thing is, if I start from scratch with a site that has 69 pages in it, it goes right through and indexes it fine without any problems. I have two other sites I'm using for testing, and they both have over 1000 pages. On both of them it dies after putting about 240-250 records in the database, usually right at 244. This is true of both sites, and it happens no matter where I start or how I configure the initial URL. I keep going back to a memory problem with either PHP or MySQL, or could I be looking in the wrong place? Could the problem be with Apache?

I also wanted to thank you guys. I've gotten a few really good suggestions about the code, and although they haven't solved this specific problem, they have helped me improve the overall performance of the app.

Nick

Paul Burney wrote:

> I don't think I've found the problem, but I thought I'd point out a couple things: <snip>
Re: [PHP-DB] Real Killer App!
I missed the beginning of this whole thing, so excuse me if this has been covered. Have you looked at how much time elapses before it dies? If the same amount of time elapses each run before it dies, then that's your problem. I don't know what you have your maximum script execution run time set to, but you may need to change that number.

On Thursday, March 13, 2003, at 02:36 PM, Nicholas Fitzgerald wrote:

> It only does this in one place, though not necessarily the same place, depending on where I start. Then, afterwards, it goes through a few more pages, and that's where it dies. <snip>

--
Brent Baisley
Systems Architect
Landover Associates, Inc.
Search Advisory Services for Advanced Technology Environments
p: 212.759.6400/800.759.0577
Re: [PHP-DB] Real Killer App!
Actually, that was the first thing I thought of. So I set set_time_limit(0); right at the top of the script. Is it possible there is another setting for this in php.ini that I also need to deal with? I have other scripts running and I don't want to have an unlimited execution time on all of them.

Nick

Brent Baisley wrote:

> I missed the beginning of this whole thing, so excuse me if this has been covered. Have you looked at how much time elapses before it dies? <snip>
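set_time_limit(0) only affects the script that calls it, so the other scripts keep whatever max_execution_time php.ini gives them. One caveat worth knowing: the call is ignored when safe_mode is on. A quick sanity check:

<?php
set_time_limit(0); // per-script; no effect on other scripts
// After the call, PHP should report 0 (unlimited) here:
echo "max_execution_time = " . ini_get("max_execution_time") . "\n";
?>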
Re: [PHP-DB] Real Killer App!
As you guys know I've been going around in circles with this spider app problem for a couple days. I think I finally found where the screwup is, and I'm sure you'll be interested to hear about it.

I had been testing with three sites because they were fairly diverse and gave me a lot of different stuff to stress the app. One was a site with about 70 pages made up almost entirely of links that include a query string. The other two were sites mostly of just text with lots of link lists and stuff, and that used a lot of SSI, and stuff like that. These two sites had about 1000 pages or so each, but only indexed about 200 or so of them then died.

I decided to try something different. I indexed a site that I have on my server. Since it's hitting it at 100Mbps, things went pretty fast. It ran for a while and indexed over 1600 pages. Then, not only did it die, I got a BSOD! Yep, that's right, w2k handed me a hard ntoskrnl crash. Isn't that special?

At this point I'm not sure if it's a bug in Windows, in PHP 4.3.1, or MySQL 3.23.55. In any case, I'm moving the app to a Linux box tomorrow and will be very interested to see the results of that. It's a standard Red Hat 8 setup, PHP 4.2.2 and MySQL 3.23.54a. I hate to have to release this as a Linux-only release, but if that's what it takes, that's what it takes! In the meantime, is there anything in PHP or MySQL that is critical for Windows that might not be addressed in the standard docs? The hardware is good, and not anywhere near being taxed. The OS seems stable; I haven't had a lot of trouble with w2k when asking it to do much more complex stuff than this.

Anyway, if this doesn't work on the Linux box, I'm going to have to start from scratch on this spider. How would you do it?

Nick
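Nothing jumps out as Windows-only, but for a long-running spider these php.ini directives are the usual suspects to double-check on the w2k box (the values below are examples, not recommendations):

; php.ini excerpt - settings worth checking for a long-running script
memory_limit = 32M           ; raise if the process really is running out
max_execution_time = 0       ; or keep using set_time_limit(0) per script
default_socket_timeout = 60  ; don't hang forever on a dead remote host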
RE: [PHP-DB] Real Killer App!
> I'm having a heck of a time trying to write a little web crawler for my intranet. I've got everything functionally working, it seems like, but there is a very strange problem I can't nail down. If I put in an entry and start the crawler, it goes great through the first loop. It gets the url, gets the page info, puts it in the database, and then parses all of the links out and puts them raw into the database. On the second loop it picks up all the new stuff and does the same thing. By the time the second loop is completed I'll have just over 300 items in the database. On the third loop is where the problem starts. Once it gets into the third loop, it starts to slow down a lot. Then, after a while, if I'm running from the command line, it'll just go to a command prompt. If I'm running in a browser, it returns a "document contains no data" error. This is with PHP 4.3.1 on a Win2000 server. Haven't tried it on a Linux box yet, but I'd rather run it on the Windows server since it's bigger and has plenty of cpu, memory, and raid space. It's almost like the thing is getting confused when it starts to get more than 300 entries in the database. Any ideas out there as to what would cause this kind of problem?
>
> Nick

Can you post some code? Are your script timeouts set appropriately? Does memory/CPU usage increase dramatically, or are there any other symptoms of where it is choking? What DB is it updating? What does the database tell you is happening when it starts choking? What do debug messages tell you wrt finding the bottleneck? Does it happen always, no matter what start point is used? Are you using recursive functions?

Sorry, lots of questions but no answers... :)

Cheers
Rich
Re: [PHP-DB] Real Killer App!
Rich Gray wrote:

>> I'm having a heck of a time trying to write a little web crawler for my intranet. <snip>
>
> Can you post some code? Are your script timeouts set appropriately? Does memory/CPU usage increase dramatically, or are there any other symptoms of where it is choking? What DB is it updating? What does the database tell you is happening when it starts choking? What do debug messages tell you wrt finding the bottleneck? Does it happen always, no matter what start point is used? Are you using recursive functions?
>
> Sorry, lots of questions but no answers... :)
>
> Cheers
> Rich

Recognizing that this script would take a long time to run, I'm using set_time_limit(0) in it so a timeout doesn't become an issue. The server has 1.5 gig of memory and is a dual processor 1GHz PIII. I have never seen it get over 15% cpu usage, even while this is going on, and it never gets anywhere near full memory usage. The tax on the system itself is actually negligible. There are no symptoms that I can find to indicate where the chokepoint might be. It seems to be when the DB reaches a certain size, but 300 or so records should be a piece of cake for it. As far as the debug backtrace, there really isn't anything there that stands out. It's not an issue with a variable; something is going wrong in the execution either of PHP, or a SQL query. I'm not finding any errors in the mysql error log, or anywhere else.

Basically the prog is in two parts. First, it goes and gets the current contents of the DB, one record at a time, and checks it. If it meets the criteria it is then indexed or reindexed. If it is indexed, then it goes to the second part. This is where it strips any links from the page and puts them in the DB for indexing, if they're not already there. When it dies, this is where it dies. I'll get the "UPDATING: title/url" message that comes up when it does an update, but at that point, where it is going into strip links, it dies right there.

Nick
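Since there's no backtrace to go on, it might be worth numbering each URL as it's processed so the last line printed before the death points at the culprit. A sketch (the loop shape is assumed):

<?php
$n = 0;
while ($rslt = mysql_fetch_array($sql)) {
    $n++;
    echo "[$n] " . $rslt["url"] . "\n";
    flush(); // make sure the line is out before any crash
    // ... existing indexing and link-stripping work here ...
}
?>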
Re: [PHP-DB] Real Killer App!
Just a guess, but do you have an index on the table that you are using to store the URLs that still need to be parsed? This table is going to get huge! And if you do not delete the URL that you just parsed from the list, it will grow even faster. And if you do not have an index on that table and you are doing a table scan to see if the new URL is in it or not, this is going to take longer and longer to complete every time you process another URL. This is because this temp table of URLs to process will always get larger, and will rarely go down in size, because you add about 5+ new URLs for every one that you process.

But then again, we don't know for sure on anything without seeing 'some' code. So far we have not seen any, so everything is total speculation and guessing. I would be interested in seeing the code that handles the processing of the URLs once you cull them from a web page.

Jim Hunter

---Original Message---
From: Nicholas Fitzgerald
Date: Wednesday, March 12, 2003 10:15:52 AM
To: [EMAIL PROTECTED]
Subject: Re: [PHP-DB] Real Killer App!

> Rich Gray wrote:
>> Can you post some code? Are your script timeouts set appropriately? <snip>
>
> Recognizing that this script would take a long time to run, I'm using set_time_limit(0) in it so a timeout doesn't become an issue. <snip>
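If the url column isn't indexed yet, a one-off statement like this should make the existence check stop scanning the whole table (table and column names are from the posted code; the prefix length is only needed if url is a TEXT column):

<?php
// One-off: index the url column so
// SELECT * FROM search WHERE url = '...' can use an index.
mysql_query("ALTER TABLE search ADD INDEX idx_url (url(100))");
?>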
Re: [PHP-DB] Real Killer App!
Could it be that a certain web server sees you connect, you request a file, and that file happens to take forever to load, leaving your script hanging until memory runs out or something else? Do you have timeouts set properly to stop the HTTP GET/POST if nothing is happening on that connection?

Peter

On Tue, 11 Mar 2003, Nicholas Fitzgerald wrote:

> I'm having a heck of a time trying to write a little web crawler for my intranet. I've got everything functionally working, it seems like, but there is a very strange problem I can't nail down. <snip>

---------------------------------------------------------------------------
Peter Beckman                                                   Internet Guy
[EMAIL PROTECTED]                                 http://www.purplecow.com/
---------------------------------------------------------------------------
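fopen() over HTTP has no timeout argument of its own, but in PHP 4.3 socket-based streams honor default_socket_timeout, so something like this sketch caps how long one slow page can stall the spider (the URL is an example):

<?php
ini_set("default_socket_timeout", 15); // seconds per fetch
$fp = @fopen("http://example.com/somepage.html", "r");
if (!$fp) {
    echo "Unreachable or timed out, skipping\n";
} else {
    // ... read and index the page as usual ...
    fclose($fp);
}
?>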
Re: [PHP-DB] Real Killer App!
    ){
        continue;
    }

    // Kill off any dot dots ../../
    $ddotcheck = substr_count($bpath,"../");
    if($ddotcheck != ""){
        $lpos = strrpos($bpath,"..");
        $bpath = substr($bpath,$lpos);
    }

    // Comparative analysis
    if($bpath != "" && substr($bpath,0,1) != "/"){
        if(strrpos($tpath,".") === false){
            $bpath = $tpath . "/" . $bpath;
        }
        if(strrpos($tpath,".")){
            $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
            $bpath = $ttmp . $bpath;
            if(substr($bpath,0,1) != "/"){
                $bpath = "/" . $bpath;
            }
        }
    }
    if($bhost == ""){
        $link = $tschm . "://" . $thost . $bpath;
    }

    // Kill any trailing slashes
    if(substr($link,(strlen($link)-1)) == "/"){
        $link = substr($link,0,(strlen($link)-1));
    }

    // If there is a query string put it back on
    if($bqury != ""){
        $link = $link . "?" . $bqury;
    }
    if($link == ""){
        continue;
    }

    // Put the new URL in the search database
    $chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
    $curec = mysql_fetch_array($chk);
    if(!$curec){
        echo "Adding: $link\n";
        $putup = mysql_query("INSERT INTO search SET url='$link'");
    }
    else{
        continue;
    }
  }
}

echo "\n\n# The Spider is Finished, You Can Now Close This Console #\n";
?>

Jim Hunter wrote:

> Just a guess, but do you have an index on the table that you are using to store the URLs that still need to be parsed? This table is going to get huge! And if you do not have an index on that table and you are doing a table scan to see if the new URL is in it or not, this is going to take longer and longer to complete every time you process another URL. <snip>
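As an aside on the relative-link handling in the code above: parse_url() can shoulder most of that string surgery. A simplified sketch (not a full RFC resolver; assumes the base URL includes a path; function and variable names are made up for illustration):

<?php
// Resolve a relative href against the page it was found on.
function resolve_link($base, $href) {
    $h = parse_url($href);
    if (isset($h["scheme"])) {
        return $href; // already absolute
    }
    $b = parse_url($base);
    $root = $b["scheme"] . "://" . $b["host"];
    if (substr($href, 0, 1) == "/") {
        return $root . $href; // root-relative
    }
    // Strip the filename off the base path, then collapse any "xx/../".
    $dir  = substr($b["path"], 0, strrpos($b["path"], "/") + 1);
    $path = $dir . $href;
    do {
        $prev = $path;
        $path = preg_replace('#[^/]+/\.\./#', '', $path, 1);
    } while ($path != $prev);
    return $root . $path;
}

echo resolve_link("http://www.domain.com/a/b/page.html", "../c/d.html");
// prints: http://www.domain.com/a/c/d.html
?>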