Well, my first suggestion is not to use ColdFusion for this -- it's not
really the best tool for the job.
But, assuming you won't easily be dissuaded from that path ;) .....
I'm not sure why you need to keep a hash of the file contents, but
hashing is likely slow for big chunks of HTML.
You might find that a simple regex to find links is faster than all the
string manipulation you are doing.
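To make the regex idea concrete, here is a rough sketch rather than a
drop-in replacement. extractLinks is a made-up helper name, and the
pattern only catches double-quoted href attributes, so single-quoted or
unquoted links would need more work. It leans on REFindNoCase with
returnsubexpressions set to true:

```cfml
<cffunction name="extractLinks" returntype="array" output="false">
    <cfargument name="html" type="string" required="true">
    <cfset var links = arrayNew(1)>
    <cfset var pattern = 'href\s*=\s*"([^"]+)"'>
    <cfset var pos = 1>
    <cfset var match = "">
    <cfscript>
        // match.pos[1]/len[1] describe the whole match,
        // match.pos[2]/len[2] describe the captured URL.
        match = reFindNoCase(pattern, arguments.html, pos, true);
        while (match.pos[1] gt 0)
        {
            arrayAppend(links, mid(arguments.html, match.pos[2], match.len[2]));
            // Resume the search just past the current match.
            pos = match.pos[1] + match.len[1];
            match = reFindNoCase(pattern, arguments.html, pos, true);
        }
    </cfscript>
    <cfreturn links>
</cffunction>
```

Each search starts where the previous match ended, so the page is
walked once instead of being searched and trimmed piece by piece.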
Also, I didn't look too carefully, but it doesn't seem like your code
really deals with circular references in a web site -- pages that link
to each other could cause this to just go and go and go, no? Same goes
for links outside the site -- what causes it to stop crawling? (I don't
see what you do with the LEVEL argument, for instance.)
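For the stopping question, one possible shape is a small guard that is
checked before each recursive get(). shouldCrawl and maxLevel are names
I'm inventing here (the default of 3 is arbitrary); request.base and
request.tree are taken from your code:

```cfml
<cffunction name="shouldCrawl" returntype="boolean" output="false">
    <cfargument name="link" type="string" required="true">
    <cfargument name="level" type="numeric" required="true">
    <!--- maxLevel is a hypothetical depth limit, not part of the original code --->
    <cfargument name="maxLevel" type="numeric" required="false" default="3">
    <cfset var ok = true>
    <cfscript>
        if (arguments.level gte arguments.maxLevel)
        {
            // Depth limit reached.
            ok = false;
        }
        else if (findNoCase(request.base, arguments.link) neq 1)
        {
            // Link does not start with the site's base URL, so it is off-site.
            ok = false;
        }
        else if (structKeyExists(request.tree, arguments.link))
        {
            // Page already recorded; skipping it is what breaks link cycles.
            ok = false;
        }
    </cfscript>
    <cfreturn ok>
</cffunction>
```

The recursive call would then only run when
shouldCrawl(local.link, arguments.level) is true, which assumes get()
actually declares level as an argument.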
At some level, doing LOTS of CFHTTP calls like this is just going to be
slow, especially if the site is not on your local network.
[EMAIL PROTECTED] wrote:
I'm trying to write a CFC that will spider a website and create an
inventory of all the pages/files on the website. It's a fairly simple
program but awfully slow. I create a page list in a structure called
request.tree. Here is the function:
<cffunction name="get">
    <cfargument name="incomingURL" type="string">
    <cfset var local=structNew()>
    <cfhttp url="#arguments.incomingURL#" method="get" resolveurl="yes"/>
    <cfscript>
        local.fileContent=cfhttp.fileContent;
        request.tree[arguments.incomingURL] = structnew();
        request.tree[arguments.incomingURL].linksArray=arraynew(1);
        request.tree[arguments.incomingURL].hash=hash(local.fileContent);
        local.startLink = findnocase('http://',local.fileContent,1);
        while (local.startLink)
        {
            local.endlink=min(findnocase('>',local.fileContent,local.startLink),
                              findnocase(' ',local.fileContent,local.startLink));
            local.link=trim(mid(local.fileContent,local.startLink,local.endlink-local.startLink));
            local.link=replace(local.link,chr(34),'',"ALL");
            local.link=replace(local.link,'>','',"ALL");
            local.link=replace(local.link,chr(32),'',"ALL");
            arrayappend(request.tree[arguments.incomingURL].linksArray,local.link);
            if ( local.link contains request.base and not structkeyexists(request.tree,local.link) )
            {
                get(incomingURL=local.link,level=arguments.level+1);
            }
            local.startLink=findnocase('http://',local.fileContent,local.endlink);
        }
    </cfscript>
    <cfreturn />
</cffunction>
Unfortunately, it's painfully slow even for fairly simple sites. Can
anybody make any suggestions?
Jason Cronk
[EMAIL PROTECTED]
----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to
cfcdev@cfczone.org with the words 'unsubscribe cfcdev' as the subject of the
email.
CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting
(www.cfxhosting.com).
CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm
An archive of the CFCDev list is available at
www.mail-archive.com/cfcdev@cfczone.org
----------------------------------------------------------