Well, my first suggestion is: don't use ColdFusion for this -- it's not really the best tool for the job.

But, assuming you won't easily be dissuaded from that path ;) .....

Not sure why you need to keep a hash of the file contents, but that is extra work on every page and may add up for big chunks of HTML.

You might find that a simple regex to find links is faster than all the string manipulation you are doing.
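
For what it's worth, here is a rough, untested sketch of the regex idea. extractLinks is a name I made up, and it assumes the links you care about sit in href attributes and are already absolute (which resolveurl="yes" should give you). Your current code picks up any http:// string, so this is a narrower match, not a drop-in replacement.

```cfml
<cfscript>
// Sketch only: extractLinks is a hypothetical helper, not part of the original CFC.
// Walks the HTML with one regex and collects absolute http:// links from href attributes.
function extractLinks(html) {
    var links = arrayNew(1);
    var pos = 1;
    var m = "";
    while (true) {
        // Subexpression 1 captures the URL inside href="..." or href='...'
        m = reFindNoCase("href\s*=\s*[""'](http://[^""'\s>]+)", html, pos, true);
        if (m.pos[1] eq 0) break;              // no further matches
        arrayAppend(links, mid(html, m.pos[2], m.len[2]));
        pos = m.pos[1] + m.len[1];             // resume just past this match
    }
    return links;
}
</cfscript>
```

You would call it as links = extractLinks(cfhttp.fileContent) and then loop from 1 to arrayLen(links).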

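Also, I didn't look too carefully, but I'm not sure your code fully deals with circular references in a web site. The structKeyExists check on request.tree only catches exact URL matches, so pages that link to each other under slightly different URLs (a query string, a #anchor) could cause this to just go and go and go, no? Same goes for links outside the site: "contains request.base" matches the base anywhere in the link, not just at the start. And what causes it to stop crawling? You pass level to the recursive call, but I don't see a cfargument for it or anything that checks it.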
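
A minimal sketch of the kind of guard I mean, untested. shouldCrawl, maxLevel and the visited struct are all names I invented for illustration, and the depth cap is an assumption about what LEVEL was meant for. It does not normalize URLs, so near-duplicate links still count as different pages.

```cfml
<cfscript>
// Sketch only: shouldCrawl, visited and maxLevel are hypothetical names.
// Returns true only for on-site links that have not been seen and are within the depth cap.
function shouldCrawl(link, level, baseURL, visited, maxLevel) {
    if (level gt maxLevel) return false;                 // depth cap stops runaway recursion
    if (findNoCase(baseURL, link) neq 1) return false;   // link must START with the base URL
    if (structKeyExists(visited, link)) return false;    // already fetched or queued
    visited[link] = true;   // mark before fetching, so A -> B -> A cannot loop
    return true;
}
</cfscript>
```

Structs are passed by reference, so the mark made inside the function is visible to the caller. In your loop that might look like: if (shouldCrawl(local.link, arguments.level + 1, request.base, request.visited, 3)) get(...), with request.visited set to structNew() before the first call and level declared as a cfargument.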

At some level, doing LOTS of CFHTTP calls like this is just going to be slow, especially if the site is not on your local network.
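
Before changing anything, it may be worth timing one page to see where the time actually goes. A rough sketch, untested; the URL and the 10-second timeout are placeholder values I picked, not anything from your code.

```cfml
<!--- Sketch only: pageURL and the timeout value are placeholders. --->
<cfset pageURL = "http://www.example.com/">

<!--- Time the network fetch --->
<cfset started = getTickCount()>
<cfhttp url="#pageURL#" method="get" resolveurl="yes" timeout="10">
<cfset fetchMs = getTickCount() - started>

<!--- Time the hash of the returned content --->
<cfset started = getTickCount()>
<cfset pageHash = hash(cfhttp.fileContent)>
<cfset hashMs = getTickCount() - started>

<cfoutput>fetch: #fetchMs# ms, hash: #hashMs# ms</cfoutput>
```

If the fetch number dominates, no amount of string-handling cleanup will help much, and the only real fix is making fewer requests.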



[EMAIL PROTECTED] wrote:




I'm trying to write a CFC that will spider a website and create an
inventory of all the pages/files on the website.  It's a fairly simple
program but awfully slow.  I create a page list in a structure called
request.tree.  Here is the function:


      <cffunction name="get">
            <cfargument name="incomingURL" type="string">
            <cfset var local=structNew()>

            <cfhttp url="#arguments.incomingURL#" method="get" resolveurl="yes"/>
            <cfscript>
                  local.fileContent=cfhttp.fileContent;
                  request.tree[arguments.incomingURL]=structnew();
                  request.tree[arguments.incomingURL].linksArray=arraynew(1);
                  request.tree[arguments.incomingURL].hash=hash(local.fileContent);
                  local.startLink=findnocase('http://',local.fileContent,1);
                  while (local.startLink)
                        {
                        local.endlink=min(findnocase('>',local.fileContent,local.startLink),findnocase(' ',local.fileContent,local.startLink));
                        local.link=trim(mid(local.fileContent,local.startLink,local.endlink-local.startLink));
                        local.link=replace(local.link,chr(34),'',"ALL");
                        local.link=replace(local.link,'>','',"ALL");
                        local.link=replace(local.link,chr(32),'',"ALL");

                        arrayappend(request.tree[arguments.incomingURL].linksArray,local.link);
                        if ( local.link contains request.base and not structkeyexists(request.tree,local.link) )
                              {
                              get(incomingURL=local.link,level=arguments.level+1);
                              }

                        local.startLink=findnocase('http://',local.fileContent,local.endlink);
                        }
            </cfscript>

            <cfreturn />
      </cffunction>


Unfortunately, it's painfully slow even for fairly simple sites.  Can
anybody make any suggestions?

Jason Cronk
[EMAIL PROTECTED]




----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to 
cfcdev@cfczone.org with the words 'unsubscribe cfcdev' as the subject of the 
email.

CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting 
(www.cfxhosting.com).

CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm

An archive of the CFCDev list is available at 
www.mail-archive.com/cfcdev@cfczone.org



