You know... I played around with that yesterday.

I'm having two problems. First, my CF server runs on Linux, so I don't
think it can spider my PDFs, right? What I've done instead is use
pdf2text to convert my PDFs to text for indexing. Second, I can't find
any decent directions for a beginner to get that server working on
Linux. There are lots for Windows.

If anybody could point me toward some good directions, and if it would
work for my PDFs... that would be brilliant!

--
Jillian

> -----Original Message-----
> From: Jeff Garza [mailto:[EMAIL PROTECTED]]
> Sent: March 13, 2003 10:09 AM
> To: CF-Talk
> Subject: Re: Search Indexing
> 
> 
> Jillian,
> 
> Have you tried using the Verity K2 Spider on your site?  It's 
> a full web spider that will follow all the links on your site 
> and index the content regardless of it being static or dynamic.
> 
> Jeff
> 
> ----- Original Message -----
> From: "Jillian Carroll" <[EMAIL PROTECTED]>
> To: "CF-Talk" <[EMAIL PROTECTED]>
> Sent: Thursday, March 13, 2003 9:01 AM
> Subject: Search Indexing
> 
> 
> Good Morning!
> 
> I know I'm close here... but I'm having a few problems trying 
> to get my search engine to index my static pages.  The code 
> below works... but I need to evolve it to do two more things, 
> and I can't find anything through Google or the Macromedia forums.
> 
> My questions: How can I evolve my code so it will not only 
> index the directory specified, but it will index all of the 
> subdirectories of that directory?  AND how can I index more 
> than one directory?  Is there a way to do that without just 
> duplicating my entire loop?
> 
> My code is pasted below.  Thank you!!
> 
> --
> Jillian
> 
> *** *** ***
> 
> <!---Form Processing--->
> <cfif isdefined("Form.Action")>
> <cfswitch expression="#Form.Action#">
> <cfcase value="Update">
> <cfquery name="delete_existing" datasource="#DSN#">
> DELETE FROM indexedpages
> </cfquery>
> 
> <!--- This variable will hold all the files found --->
> <cfset FileList = "">
> 
> <!--- Note that filter is optional for "mixed" sites --->
> <!--- This loop collects all the files into one list --->
> <cfloop list="#Form.DirList#" index="Dir">
> 
> <cfdirectory action="list"
> directory="#Dir#"
> name="IndexList">
> 
> <cfoutput query="IndexList">
> <cfif IndexList.type is "file">
> <cfset FileList = listappend(FileList, Dir & IndexList.name)>
> </cfif>
> </cfoutput>
> </cfloop>
> 
> <!---This loop reads/parses and inserts each file--->
> <cfloop list="#FileList#" index="File">
> <cffile action="read" file="#File#" variable="ParseMe">
> <cfset Title="Untitled">
> 
> <!---Fetch Title--->
> <cfif REFindNoCase("<h[0-9]",ParseMe,1) is not 0>
> <cfset start=find(">",ParseMe,REFindNoCase("<h[0-9]",ParseMe,1)) + 1>
> <cfset end=find("<",ParseMe,start)>
> <!--- Keep "Untitled" if the heading is empty or never closes --->
> <cfif end gt start>
> <cfset length = end - start>
> <cfset title=mid(ParseMe,start,length)>
> </cfif>
> </cfif>
> 
> <!---Remove Common tags--->
> <cfset ParseMe = REReplaceNoCase(ParseMe,"<[^>]*>","","all")>
> 
> <!---Remove Noise Words--->
> <cfset NoiseWords = "a,an,and,at,as,are,all,be,but,by,can,do,for,get,got,here,I,if,it,is,in,like,may,not,our,or,of,on,that,then,the,they,there,to,which,we,you,your">
> 
> <cfloop list="#NoiseWords#" index="Noise">
> <cfset ParseMe = REReplaceNoCase(ParseMe,"[[:space:]]#Noise#[[:space:]]"," ","all")>
> </cfloop>
> 
> <!---Remove Extra Space--->
> <cfloop from="1" to="10" index="loop">
> <cfset ParseMe = REReplaceNoCase(ParseMe,"[[:space:]]+[[:space:]]"," ","all")>
> </cfloop>
> 
> <!---Resolve Web Root--->
> <cfset WebPath = ReplaceNoCase(File,Form.SitePathRoot,Form.SiteWebRoot,"all")>
> <cfset WebPath = ReplaceNoCase(WebPath,"\","/","all")>
> 
> <!---Insert Into Dbase--->
> <cftry>
> <cfquery name="insert_pages" datasource="#DSN#">
> INSERT INTO indexedpages
> (
> webpath,
> filepath,
> title,
> contents
> )
> VALUES (
> '#WebPath#',
> '#File#',
> '#Title#',
> '#ParseMe#'
> )
> </cfquery>
> 
> <cfcatch type="Database">
> <cfoutput>
> Database error for:
> <br />title:<b>#Title#</b>
> <br />file:<b>#File#</b>
> <br />url:<b>#WebPath#</b>
> <br />
> </cfoutput>
> </cfcatch>
> </cftry>
> <!---End of File loop--->
> </cfloop>
> 
> <!---Update the collection--->
> <cfquery datasource="#DSN#" name="getContents">
> SELECT id,
> webpath,
> filepath,
> title,
> contents
> FROM indexedpages
> </cfquery>
> 
> <cftry>
> <cfindex
> action="update"
> collection="epipages"
> query="getContents"
> key="ID"
> title="title"
> type="Custom"
> body="Contents"
> custom1="WebPath">
> <!--- Only reached when cfindex did not throw --->
> <br />Collection has been updated successfully.
> 
> <cfcatch type="any">
> <br />Sorry, an error occurred while trying to update your collection.
> Check that the collection exists.
> </cfcatch>
> 
> </cftry>
> <!---End of Update Case--->
> </cfcase>
> </cfswitch>
> </cfif>
> 
> <!--- Action Form --->
> <cfoutput>
> <form action="#ThisFileName#" method="post">
> <table border="0" cellpadding="4" cellspacing="0">
> 
> <tr>
> <td>List of Directories to Index:</td>
> <td><input type="text" size="50" name="DirList" value="/data/aliases/epi/pdfshadow/"></td>
> </tr>
> <tr>
> <td>
> File Path to Site Root:
> <br />(Trailing slashes are required)
> </td>
> <td valign="top"><input type="text" size="50" name="SitePathRoot" value="/data/aliases/epi/pdfshadow/"></td>
> </tr>
> <tr>
> <td>
> URL Path to Site Root:
> <br />(e.g. http://www.yoursite.com/)
> <br />(Trailing slashes are required)
> </td>
> <td valign="top"><input type="text" size="50" name="SiteWebRoot" value="http://epidev.lights.com/"></td>
> </tr>
> <tr>
> <td colspan="2"><input type="submit" name="action" value="Update"></td>
> </tr>
> </table>
> </form>
> </cfoutput>
> 
> 
> 
> 