Re: [PHP-DB] Real Killer App!
on 3/15/03 7:55 PM, Nicholas Fitzgerald at [EMAIL PROTECTED] appended the following bits to my mbox:

> Spider still dies, but now it's finally given me an error:
>
> FATAL: erealloc(): unable to allocate 11 bytes.
>
> This is interesting, as I'm not using erealloc() anywhere in the script. When I went to php.net to check it out, all I got was a memory management page with this and some other memory type functions listed.

Those are functions used by the actual PHP developers to communicate with the Zend engine (which itself communicates with the OS/CPU). You don't call those functions and can't affect them.

This sounds like a very obscure bug with PHP on Windows. I can't remember if you are running the latest version of PHP (4.3.1 or 4.3.2 RC 1). If not, please try it using that version of PHP to see if you can still reproduce the problem. If so, please file a bug report at http://bugs.php.net/

A quick search didn't pull up anything related to your issue. The only recent entry related to erealloc is for 4.2.3 on Windows:

http://bugs.php.net/bug.php?id=20913

But it doesn't necessarily seem applicable. Unless you're running out of memory? Or maybe the server can't handle all the connections? If so, you could try sleeping for a second every 10 times through the loop, etc., to slow down the process and maybe keep it from dying.

Hope that helps.

Sincerely,

Paul Burney
http://paulburney.com/

<?php
if ($your_php_version < 4.1.2) {
    upgrade_now(); // to avoid major security problems
}
?>
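A minimal sketch of that throttling idea (the loop shape and process_url() are placeholders, not from the actual script):

<?php
// Hypothetical main loop: pause for a second after every
// 10th URL so the server and its connections get a breather.
$i = 0;
$sql = mysql_query("SELECT * FROM search");
while ($rslt = mysql_fetch_array($sql)) {
    process_url($rslt["url"]); // placeholder for the real indexing work
    $i++;
    if ($i % 10 == 0) {
        sleep(1);
    }
}
?>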
Re: [PHP-DB] Real Killer App!
That sounds about right, I think. I'm using 4.3.1, so I really am beginning to think it's a bug of some kind. I'm definitely not running into memory problems; this server has 1.5 gig and isn't coming anywhere close to using all of it, even when everything on the box is jumping. It's most likely not a connection problem either, but that is something to look closer at. In the meantime I'm going to try that delaying tactic you mentioned and see what kind of results I get.

Thanks for the info!

Nick

Paul Burney wrote:

> Those are functions used by the actual PHP developers to communicate with the Zend engine (which itself communicates with the OS/CPU). You don't call those functions and can't affect them.
>
> This sounds like a very obscure bug with PHP on Windows. <snip>
Re: [PHP-DB] Real Killer App!
Well, I've gotten a long way on this, and here's the results up to now:

On Red Hat 8.0 (Apache 2.0.40, PHP 4.2.2, MySQL 3.23.54a; PII 366, 30gig hdd, 256meg mem): Everything works flawlessly. Have spidered several HUGE sites. Goes fast, goes accurate.

On Windows 2000 SP3 (Apache 2.0.44, PHP 4.3.1, MySQL 3.23.55 max-nt; dual PIII 1gig, 160gig raid, 1.5gig mem): Spider still dies, but now it's finally given me an error:

FATAL: erealloc(): unable to allocate 11 bytes.

This is interesting, as I'm not using erealloc() anywhere in the script. When I went to php.net to check it out, all I got was a memory management page with this and some other memory type functions listed. No information at all about what to do with them, how to use them, where to use them, when to use them, or anything. If anyone out there has any info on this function and/or the error it gave me, please pass it along!

Nick
Re: [PHP-DB] Real Killer App!
I would compare the php.ini settings on the 2 systems. phpinfo() will be able to give you the current state of the variables PHP is using. There could be something strange on the Windows system that you'll notice as different from the Linux box's phpinfo() (hopefully).

-VolVE

----- Original Message -----
From: Nicholas Fitzgerald [EMAIL PROTECTED]
To: PHP Database List [EMAIL PROTECTED]
Sent: Saturday, March 15, 2003 19:55
Subject: Re: [PHP-DB] Real Killer App!

> Well, I've gotten a long way on this, and here's the results up to now: <snip>
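If wading through two full phpinfo() pages is a pain, a quick sketch like this prints a short list to diff between the two boxes (the choice of directives is just a guess at the relevant ones):

<?php
// Dump a few settings worth comparing between Windows and Linux.
$keys = array("memory_limit", "max_execution_time",
              "default_socket_timeout", "output_buffering");
foreach ($keys as $k) {
    echo $k . " = " . ini_get($k) . "\n";
}
?>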
Re: [PHP-DB] Real Killer App!
At 01:58 3/14/2003, Nicholas Fitzgerald wrote:

> As you guys know I've been going around in circles with this spider app problem for a couple days.
> How would you do it?

http://www.hotscripts.com/PHP/Scripts_and_Programs/Search_Engines/more3.html

Start Here to Find It Fast!© - http://www.US-Webmasters.com/best-start-page/
Re: [PHP-DB] Real Killer App!
I've already looked at all of these, well most of them anyway. The only ones I haven't looked at are the ones that just do real-time searches. Nothing of what I've seen is as functional as what I've designed, and for the most part built. Which is why I built it. This spider issue is the only thing that remains to be done.

I'm currently using mnoGoSearch on an existing search engine I have, but I had to do a lot of work to get it to A: act like one would expect a search engine to act, and B: integrate into the site the way I wanted it. A lot of it was stuff I shouldn't have had to do. Besides, the spider is slow as hell, and only works on Linux, unless I want to pay $100's for the Windows version. Not that that's necessarily a problem, but I would like to have that option. The spider I've written, except for this problem I'm having, is much faster on Windows than mnoGoSearch is on Linux! As soon as I hit the send button here I'm going to be installing on a Linux server and see if I have the same problem.

Nick

W. D. wrote:

> At 01:58 3/14/2003, Nicholas Fitzgerald wrote:
>> As you guys know I've been going around in circles with this spider app problem for a couple days.
>> How would you do it?
>
> http://www.hotscripts.com/PHP/Scripts_and_Programs/Search_Engines/more3.html <snip>
Re: [PHP-DB] Real Killer App!
Have you tried adding a flush() statement in at certain points? Perhaps put one in after every page is processed. Technically this is designed for browsers, but perhaps it will help here. It will most likely slow things down, but if it works, then you can adjust it from there (like every 10 pages or so).

As another long shot, you may try using clearstatcache() also. You don't want to cache any files you are processing, but PHP may be doing this anyway.

On Thursday, March 13, 2003, at 02:59 PM, Nicholas Fitzgerald wrote:

> Actually, that was the first thing I thought of. So I set set_time_limit(0); right at the top of the script. Is it possible there is another setting for this in php.ini that I also need to deal with? I have other scripts running and I don't want to have an unlimited execution time on all of them.
>
> Nick

--
Brent Baisley
Systems Architect
Landover Associates, Inc.
Search Advisory Services for Advanced Technology Environments
p: 212.759.6400/800.759.0577
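A sketch of where those calls might go, assuming a per-page loop roughly like this (index_page() is a stand-in for the real work):

<?php
while ($rslt = mysql_fetch_array($sql)) {
    index_page($rslt["url"]); // stand-in for fetch/parse/store
    flush();                  // push any pending output now
    clearstatcache();         // drop PHP's cached file stat info
}
?>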
Re: [PHP-DB] Real Killer App!
I did try a flush table on the table, even down to one for every read or write, with no luck. Haven't tried the clearstatcache() though; sounds like an idea whose time has come.

Nick

Brent Baisley wrote:

> Have you tried adding a flush() statement in at certain points? Perhaps put one in after every page is processed. <snip>
Re: [PHP-DB] Real Killer App!
on 3/12/03 5:45 PM, Nicholas Fitzgerald at [EMAIL PROTECTED] appended the following bits to my mbox:

> is that entire prog as it now exists. Notice I have NOT configured it as yet to go into the next level. I did this on purpose so I wouldn't have to kill it in the middle of operation and potentially screw stuff up. The way it is now, it looks at all the records in the database, updates them if necessary, then extracts all the links and puts them into the database for crawling on the next run through. Once I get this working I'll put a big loop in it so it keeps going until there's nothing left to look at. Meanwhile, if anyone sees anything in here that could be the cause of this problem please let me know!

I don't think I've found the problem, but I thought I'd point out a couple things:

> // Open the database and start looking at URLs
> $sql = mysql_query("SELECT * FROM search");
> while($rslt = mysql_fetch_array($sql)){
>     $url = $rslt["url"];

The above line gets all the data from the table and then starts looping through...

> // Put the stuff in the search database
> $puts = mysql_query("SELECT * FROM search WHERE url='$url'");
> $site = mysql_fetch_array($puts);
> $nurl = $site["url"];
> $ncrc = $site["checksum"];
> $ndate = $site["date"];
> if($ndate <= $daycheck || $ncrc != $checksum){

That line does the same query again for this particular URL to set variables in the $site array, though you already have this info in the $rslt array. You could potentially save hundreds of queries there.

> // Get the page title
> $temp = stristr($read,"<title>");
> <snip>
> $tchn = ($tend - $tpos);
> $title = strip_tags(substr($read, ($tpos+7),$tchn));

Aside: Interesting way of doing things. I usually just preg_match these things, but I like this too.

> // Kill any trailing slashes
> if(substr($link,(strlen($link)-1)) == "/"){
>     $link = substr($link,0,(strlen($link)-1));
> }

Why are you killing the trailing slashes? That's going to cause fopen double the work to get to the pages. That is, first it will request the page without the slash, then get a redirect response with the slash, and then request the page again.

> // Put the new URL in the search database
> $chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
> $curec = mysql_fetch_array($chk);
> if(!$curec){
>     echo "Adding: $link\n";
>     $putup = mysql_query("INSERT INTO search SET url='$link'");
> }
> else{
>     continue;
> }

You might want to give a different variable name to the new link, or encapsulate the above in a function, so your $link variables don't clobber each other.

> indicate where the chokepoint might be. It seems to be when the DB reaches a certain size, but 300 or so records should be a piece of cake for it. As far as the debug backtrace, there really isn't anything there that stands out. It's not an issue with a variable, something is going wrong in the execution either of php, or a sql query. I'm not finding any errors in the mysql error log, or anywhere else.

What url is it dying on? You could probably echo each $url to the terminal to watch its progression and see where it is stopping.

I've had problems with apache using custom php error docs where the error doc contained a php generated image that wasn't found. Each image that failed would generate another PHP error, which cascaded until the server basically died.

KIND OF BROADER ASIDE REGARDING SEARCH ENGINE PROBLEMS:

I've also had recursion problems because php allows any characters to be appended after the request. For example, let's say you have an examples.php file and for some reason you have a relative link in examples.php to examples/somefile.html.

If the examples directory doesn't exist, apache will serve examples.php to the user using the request of examples/somefile.html. A recursive search engine (that isn't too smart, i.e., infoseek and excite for colleges), will keep requesting things like:

http://example.com/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/somefile.html

As far as apache is concerned, it is fulfilling the request with the examples.php file, and php just sees a really long query_string starting with /examples. I'm sure that isn't your problem, but I've been bit by it a few times.

END OF ASIDE

Hope some of that ramble helps. Please try to see if it is dying on a particular URL so we can be of further assistance.

Sincerely,

Paul Burney
http://paulburney.com/

Q: Tired of creating admin interfaces to your MySQL web applications?
A: Use MySTRI instead. Version 3.1 now available.
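For what it's worth, a rough sketch of those two suggestions together: reusing the already-fetched row instead of re-querying, and grabbing the title with preg_match (column names are taken from the posted code; the rest is assumed):

<?php
$sql = mysql_query("SELECT * FROM search");
while ($rslt = mysql_fetch_array($sql)) {
    // Reuse the row we already have instead of re-querying by URL.
    $url   = $rslt["url"];
    $ncrc  = $rslt["checksum"];
    $ndate = $rslt["date"];

    // ... fetch the page into $read here ...

    // Title extraction via preg_match instead of stristr/substr math.
    if (preg_match("/<title>(.*?)<\/title>/is", $read, $m)) {
        $title = trim(strip_tags($m[1]));
    }
}
?>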
Re: [PHP-DB] Real Killer App!
Ok, here's something else I've just noticed about this problem. I noticed that when this thing gets to a certain point, somewhere in the 4th run through, it hits a certain url, then the next, at which point it seems to pause for several seconds, then it goes back and hits that first certain url. It looks something like this:

Updating: http://www.domain.com/certainurl.html
Updating: http://www.domain.com/nexturl.html
Updating: http://www.domain.com/certainurl.html

It only does this in one place, though not necessarily the same place, depending on where I start. Then, afterwards, it goes through a few more pages, and that's where it dies. Another thing is, if I start from scratch with a site that has 69 pages in it, it goes right through and indexes it fine without any problems. I have two other sites I'm using for testing, and they both have over 1000 pages. On both of them it dies after putting about 240-250 records in the database, usually right at 244. This is true of both sites, and it happens no matter where I start or how I configure the initial URL. I keep going back to a memory problem with either PHP or MySQL, or could I be looking in the wrong place? Could the problem be with Apache?

I also wanted to thank you guys. I've gotten a few really good suggestions about the code, and although they haven't solved this specific problem, they have helped me improve the overall performance of the app.

Nick

Paul Burney wrote:

> I don't think I've found the problem, but I thought I'd point out a couple things: <snip>
Re: [PHP-DB] Real Killer App!
I missed the beginning of this whole thing, so excuse me if this has been covered. Have you looked at how much time elapses before it dies? If the same amount of time elapses each run before it dies, then that's your problem. I don't know what you have your maximum script execution run time set to, but you may need to change that number.

On Thursday, March 13, 2003, at 02:36 PM, Nicholas Fitzgerald wrote:

> It only does this in one place, though not necessarily the same place, depending on where I start. Then, afterwards, it goes through a few more pages, and that's where it dies. <snip>

--
Brent Baisley
Systems Architect
Landover Associates, Inc.
Search Advisory Services for Advanced Technology Environments
p: 212.759.6400/800.759.0577
Re: [PHP-DB] Real Killer App!
Actually, that was the first thing I thought of. So I set set_time_limit(0); right at the top of the script. Is it possible there is another setting for this in php.ini that I also need to deal with? I have other scripts running and I don't want to have an unlimited execution time on all of them.

Nick

Brent Baisley wrote:

> I missed the beginning of this whole thing, so excuse me if this has been covered. Have you looked at how much time elapses before it dies? <snip>
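set_time_limit(0) only affects the script that calls it, so the other scripts keep whatever max_execution_time php.ini gives them. One caveat worth knowing: the call is ignored when safe_mode is on. A quick sanity check:

<?php
set_time_limit(0); // per-script; no effect on other scripts
// After the call, PHP should report 0 (unlimited) here:
echo "max_execution_time = " . ini_get("max_execution_time") . "\n";
?>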
Re: [PHP-DB] Real Killer App!
As you guys know I've been going around in circles with this spider app problem for a couple days. I think I finally found where the screwup is, and I'm sure you'll be interested to hear about it.

I had been testing with three sites because they were fairly diverse and gave me a lot of different stuff to stress the app. One was a site with about 70 pages made up almost entirely of links that include a query string. The other two were sites mostly of just text with lots of link lists and stuff, and that used a lot of SSI, and stuff like that. These two sites had about 1000 pages or so each, but only indexed about 200 or so of them then died.

I decided to try something different. I indexed a site that I have on my server. Since it's hitting it at 100Mbps, things went pretty fast. It ran for a while and indexed over 1600 pages. Then, not only did it die, I got a BSOD! Yep, that's right, w2k handed me a hard ntoskrnl crash. Isn't that special?

At this point I'm not sure if it's a bug in Windows, in PHP 4.3.1, or MySQL 3.23.55. In any case, I'm moving the app to a Linux box tomorrow and will be very interested to see the results of that. It's a standard Red Hat 8 setup, PHP 4.2.2 and MySQL 3.23.54a. I hate to have to release this as a Linux-only release, but if that's what it takes, that's what it takes! In the meantime, is there anything in PHP or MySQL that is critical for Windows that might not be addressed in the standard docs? The hardware is good, and not anywhere near being taxed. The OS seems stable; I haven't had a lot of trouble with w2k when asking it to do much more complex stuff than this.

Anyway, if this doesn't work on the Linux box, I'm going to have to start from scratch on this spider. How would you do it?

Nick
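Nothing jumps out as Windows-only, but for a long-running spider these php.ini directives are the usual suspects to double-check on the w2k box (the values below are examples, not recommendations):

; php.ini excerpt - settings worth checking for a long-running script
memory_limit = 32M           ; raise if the process really is running out
max_execution_time = 0       ; or keep using set_time_limit(0) per script
default_socket_timeout = 60  ; don't hang forever on a dead remote host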
RE: [PHP-DB] Real Killer App!
> I'm having a heck of a time trying to write a little web crawler for my intranet. I've got everything functionally working, it seems like, but there is a very strange problem I can't nail down. If I put in an entry and start the crawler, it goes great through the first loop. It gets the url, gets the page info, puts it in the database, and then parses all of the links out and puts them raw into the database. On the second loop it picks up all the new stuff and does the same thing. By the time the second loop is completed I'll have just over 300 items in the database. On the third loop is where the problem starts. Once it gets into the third loop, it starts to slow down a lot. Then, after a while, if I'm running from the command line, it'll just go to a command prompt. If I'm running in a browser, it returns a "document contains no data" error. This is with PHP 4.3.1 on a Win2000 server. Haven't tried it on a Linux box yet, but I'd rather run it on the Windows server since it's bigger and has plenty of cpu, memory, and raid space. It's almost like the thing is getting confused when it starts to get more than 300 entries in the database. Any ideas out there as to what would cause this kind of problem?
>
> Nick

Can you post some code? Are your script timeouts set appropriately? Does memory/CPU usage increase dramatically, or are there any other symptoms of where it is choking? What DB is it updating? What does the database tell you is happening when it starts choking? What do debug messages tell you wrt finding the bottleneck? Does it happen always, no matter what start point is used? Are you using recursive functions?

Sorry, lots of questions but no answers... :)

Cheers
Rich
Re: [PHP-DB] Real Killer App!
Rich Gray wrote:

>> I'm having a heck of a time trying to write a little web crawler for my intranet. <snip>
>
> Can you post some code? Are your script timeouts set appropriately? Does memory/CPU usage increase dramatically, or are there any other symptoms of where it is choking? What DB is it updating? What does the database tell you is happening when it starts choking? What do debug messages tell you wrt finding the bottleneck? Does it happen always, no matter what start point is used? Are you using recursive functions?
>
> Sorry, lots of questions but no answers... :)
>
> Cheers
> Rich

Recognizing that this script would take a long time to run, I'm using set_time_limit(0) in it so a timeout doesn't become an issue. The server has 1.5 gig of memory and is a dual processor 1GHz PIII. I have never seen it get over 15% cpu usage, even while this is going on, and it never gets anywhere near full memory usage. The tax on the system itself is actually negligible. There are no symptoms that I can find to indicate where the chokepoint might be. It seems to be when the DB reaches a certain size, but 300 or so records should be a piece of cake for it. As far as the debug backtrace, there really isn't anything there that stands out. It's not an issue with a variable; something is going wrong in the execution either of PHP, or a SQL query. I'm not finding any errors in the mysql error log, or anywhere else.

Basically the prog is in two parts. First, it goes and gets the current contents of the DB, one record at a time, and checks it. If it meets the criteria it is then indexed or reindexed. If it is indexed, then it goes to the second part. This is where it strips any links from the page and puts them in the DB for indexing, if they're not already there. When it dies, this is where it dies. I'll get the "UPDATING: title/url" message that comes up when it does an update, but at that point, where it is going into strip links, it dies right there.

Nick
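Since there's no backtrace to go on, it might be worth numbering each URL as it's processed so the last line printed before the death points at the culprit. A sketch (the loop shape is assumed):

<?php
$n = 0;
while ($rslt = mysql_fetch_array($sql)) {
    $n++;
    echo "[$n] " . $rslt["url"] . "\n";
    flush(); // make sure the line is out before any crash
    // ... existing indexing and link-stripping work here ...
}
?>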
Re: [PHP-DB] Real Killer App!
Just a guess, but do you have an index on the table that you are using to store the URLs that still need to be parsed? This table is going to get huge! And if you do not delete the URL that you just parsed from the list, it will grow even faster. And if you do not have an index on that table and you are doing a table scan to see if the new URL is in it or not, this is going to take longer and longer to complete every time you process another URL. This is because this temp table of URLs to process will always get larger, and will rarely go down in size, because you add about 5+ new URLs for every one that you process.

But then again, we don't know for sure on anything without seeing 'some' code. So far we have not seen any, so everything is total speculation and guessing. I would be interested in seeing the code that handles the processing of the URLs once you cull them from a web page.

Jim Hunter

---Original Message---
From: Nicholas Fitzgerald
Date: Wednesday, March 12, 2003 10:15:52 AM
To: [EMAIL PROTECTED]
Subject: Re: [PHP-DB] Real Killer App!

> Rich Gray wrote:
>> Can you post some code? Are your script timeouts set appropriately? <snip>
>
> Recognizing that this script would take a long time to run, I'm using set_time_limit(0) in it so a timeout doesn't become an issue. <snip>
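If the url column isn't indexed yet, a one-off statement like this should make the existence check stop scanning the whole table (table and column names are from the posted code; the prefix length is only needed if url is a TEXT column):

<?php
// One-off: index the url column so
// SELECT * FROM search WHERE url = '...' can use an index.
mysql_query("ALTER TABLE search ADD INDEX idx_url (url(100))");
?>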
Re: [PHP-DB] Real Killer App!
Could it be that a certain web server sees you connect, you request a file, and that file happens to take forever to load, leaving your script hanging until memory runs out or something else? Do you have timeouts set properly to stop the HTTP GET/POST if nothing is happening on that connection?

Peter

On Tue, 11 Mar 2003, Nicholas Fitzgerald wrote:

> I'm having a heck of a time trying to write a little web crawler for my intranet. I've got everything functionally working, it seems like, but there is a very strange problem I can't nail down. <snip>

---------------------------------------------------------------------------
Peter Beckman                                                   Internet Guy
[EMAIL PROTECTED]                                 http://www.purplecow.com/
---------------------------------------------------------------------------
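fopen() over HTTP has no timeout argument of its own, but in PHP 4.3 socket-based streams honor default_socket_timeout, so something like this sketch caps how long one slow page can stall the spider (the URL is an example):

<?php
ini_set("default_socket_timeout", 15); // seconds per fetch
$fp = @fopen("http://example.com/somepage.html", "r");
if (!$fp) {
    echo "Unreachable or timed out, skipping\n";
} else {
    // ... read and index the page as usual ...
    fclose($fp);
}
?>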
Re: [PHP-DB] Real Killer App!
    ){
        continue;
    }

    // Kill off any dot dots ../../
    $ddotcheck = substr_count($bpath,"../");
    if($ddotcheck != ""){
        $lpos = strrpos($bpath,"..");
        $bpath = substr($bpath,$lpos);
    }

    // Comparative analysis
    if($bpath != "" && substr($bpath,0,1) != "/"){
        if(strrpos($tpath,".") === false){
            $bpath = $tpath . "/" . $bpath;
        }
        if(strrpos($tpath,".")){
            $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
            $bpath = $ttmp . $bpath;
            if(substr($bpath,0,1) != "/"){
                $bpath = "/" . $bpath;
            }
        }
    }
    if($bhost == ""){
        $link = $tschm . "://" . $thost . $bpath;
    }

    // Kill any trailing slashes
    if(substr($link,(strlen($link)-1)) == "/"){
        $link = substr($link,0,(strlen($link)-1));
    }

    // If there is a query string put it back on
    if($bqury != ""){
        $link = $link . "?" . $bqury;
    }
    if($link == ""){
        continue;
    }

    // Put the new URL in the search database
    $chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
    $curec = mysql_fetch_array($chk);
    if(!$curec){
        echo "Adding: $link\n";
        $putup = mysql_query("INSERT INTO search SET url='$link'");
    }
    else{
        continue;
    }
  }
}

echo "\n\n# The Spider is Finished, You Can Now Close This Console #\n";
?>

Jim Hunter wrote:

> Just a guess, but do you have an index on the table that you are using to store the URLs that still need to be parsed? This table is going to get huge! And if you do not have an index on that table and you are doing a table scan to see if the new URL is in it or not, this is going to take longer and longer to complete every time you process another URL. <snip>
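As an aside on the relative-link handling in the code above: parse_url() can shoulder most of that string surgery. A simplified sketch (not a full RFC resolver; assumes the base URL includes a path; function and variable names are made up for illustration):

<?php
// Resolve a relative href against the page it was found on.
function resolve_link($base, $href) {
    $h = parse_url($href);
    if (isset($h["scheme"])) {
        return $href; // already absolute
    }
    $b = parse_url($base);
    $root = $b["scheme"] . "://" . $b["host"];
    if (substr($href, 0, 1) == "/") {
        return $root . $href; // root-relative
    }
    // Strip the filename off the base path, then collapse any "xx/../".
    $dir  = substr($b["path"], 0, strrpos($b["path"], "/") + 1);
    $path = $dir . $href;
    do {
        $prev = $path;
        $path = preg_replace('#[^/]+/\.\./#', '', $path, 1);
    } while ($path != $prev);
    return $root . $path;
}

echo resolve_link("http://www.domain.com/a/b/page.html", "../c/d.html");
// prints: http://www.domain.com/a/c/d.html
?>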