Re: XML verity indexing problem

2004-09-03 Thread Doug James
Judging by your paths, I am going to guess that you are on Windows. You
might want to take a look at the Verity Spider. Just a thought.

Doug





Re: XML verity indexing problem

2004-09-03 Thread Andrew Dixon
I'm not 100% sure about this, but I think the version of Verity
shipped with CF is so old that it does not understand that your page
with the extension .xml is rendered like an HTML page in the browser,
and therefore it doesn't know how to parse it. I had a similar problem
with PDF files recently. The Verity that ships with CFMX can only read
PDFs up to those created with version 4 of Acrobat, so when I asked
Verity to index about 300 PDFs created with Acrobat 6 it simply
ignored them. The solution I came up with was to use a third-party
Java API (PDFBox) to extract the text from each PDF file and then pass
that text to Verity. Worked a treat. Maybe you will need to do
something similar: extract the text from your XML pages and pass that
to Verity along with the filename of the page.
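
A minimal sketch of that idea for a single XML file, not a tested
solution. It assumes the collection name still arrives as
Form.IndexColl, as in the original post; the file name example.xml is
hypothetical. The tag stripping is a crude regular expression that
ignores comments, CDATA sections and entities.

```cfm
<!--- Hypothetical path to one XML page; replace with a real file. --->
<cfset xmlFile = "c:\InetPub\wwwroot\www.vsarts.org\example.xml">

<!--- Read the raw file from disk. --->
<cffile action="read" file="#xmlFile#" variable="rawXml">

<!--- Crude text extraction: replace every tag with a space. --->
<cfset plainText = REReplace(rawXml, "<[^>]*>", " ", "all")>

<!--- Add the extracted text as a custom record keyed by the file path. --->
<cfindex collection="#Form.IndexColl#"
    action="update"
    type="custom"
    key="#xmlFile#"
    title="#GetFileFromPath(xmlFile)#"
    body="#plainText#">
```

With type="custom" the key is whatever the search results should
return for the record, so a URL could be stored there instead of the
file path.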

Also, I'm not sure Verity can index CFM pages either, as it would need
to call the CFM and get the returned page. I'm pretty sure that Verity
only reads files. So it will index your HTML pages, as it simply reads
the file, removes the HTML tags and indexes the remaining text. Verity
can be very good, but it needs some tweaking.
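
One way to work around that, sketched here under assumptions rather
than tested: request the CFM page over HTTP so that ColdFusion renders
it, then index the returned text as a custom record. The URL
example.cfm is hypothetical, and Form.IndexColl is taken from the
original post.

```cfm
<!--- Hypothetical URL of one CFM page; replace with a real page. --->
<cfset pageUrl = "http://www.vsarts.org/example.cfm">

<!--- Request the page so the server executes it and returns the generated HTML. --->
<cfhttp url="#pageUrl#" method="get" timeout="30"></cfhttp>

<!--- Strip the tags from the returned markup, leaving only the text. --->
<cfset pageText = REReplace(cfhttp.FileContent, "<[^>]*>", " ", "all")>

<!--- Index the rendered text, keyed by the URL so search results can link to it. --->
<cfindex collection="#Form.IndexColl#"
    action="update"
    type="custom"
    key="#pageUrl#"
    title="#pageUrl#"
    body="#pageText#">
```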

Also, if you have a good budget, and I mean good, you could look at
using Verity Ultraseek, as that will spider a site and index pages,
PDFs, etc., and all you do is give it the base URL. You can connect to
it with CF as a web service; it is pretty easy. But like I say,
Ultraseek is not cheap.

Hope that helps.

Andrew.





XML verity indexing problem

2004-09-02 Thread Charles Chen
Hi All. I'm a fairly new CF developer. I'm having weirdness indexing a website that is mostly a collection of XML pages. I'm really stumped. Can someone help me?

If I create the Verity collection via the ColdFusion Administrator and add .XML to the extension types, it will properly create and index the site and the XML files will show up in my search.

However, I need to use the programmatic method rather than the Administrator because content on the site changes frequently and I want to schedule a CFM page to index the collection automatically at intervals. So, I programmed CFM pages to create a collection and then index it. In the cfindex, I have specified .xml as an extension type in addition to the standard HTML, CFM, etc. However, when I do a search, the indexed collection only returns HTML files, even though XML and CFM have been declared as extensions to index as well. I tried using action="" and action="". I even tried action="" followed by action="", but no dice. Any idea why this is the case? I've cut and pasted my code for creating a collection and for indexing it.

Thank you for any generous help!

- Charles

Create collection action page (takes values from form):

collection="#Form.CollectionName#"
path="c:\cfusionmx\verity\collections\">
The collection #Form.CollectionName# is created.

collection="#Form.CollectionName#">
The collection #Form.CollectionName# is repaired.

collection="#Form.CollectionName#">
The collection #Form.CollectionName# is optimized.

collection="#Form.CollectionName#">
Collection deleted.

Index collection page (takes collection name value from form):

	   action="">
	   extensions=".htm, .html, .cfm, .cfml, .xml"
	   key="c:\InetPub\wwwroot\www.vsarts.org\"
	   type="path"
	   urlpath="http://www.vsarts.org"
	   recurse="Yes"
	   language="English">

	   The collection #Form.IndexColl# has been indexed.
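
A minimal sketch of the scheduled re-index described in this message,
not a confirmed fix for the missing XML results. The collection name
"vsarts", the file name reindex.cfm, the task name, and the start
date, time and interval are all hypothetical. action="refresh" is an
assumption as well, since the action values did not survive in the
post; the other cfindex attributes are copied from the listing above.

```cfm
<!--- reindex.cfm (hypothetical file name): clears and rebuilds the index from disk. --->
<cfset collName = "vsarts"> <!--- assumed name of an existing collection --->

<cfindex collection="#collName#"
    action="refresh"
    type="path"
    key="c:\InetPub\wwwroot\www.vsarts.org\"
    urlpath="http://www.vsarts.org"
    extensions=".htm, .html, .cfm, .cfml, .xml"
    recurse="Yes"
    language="English">

<cfoutput>The collection #collName# has been indexed.</cfoutput>

<!--- Run once from a separate setup page: asks the ColdFusion scheduler
      to request reindex.cfm every hour (interval is in seconds). --->
<cfschedule action="update"
    task="ReindexVsarts"
    operation="HTTPRequest"
    url="http://www.vsarts.org/reindex.cfm"
    startdate="09/03/2004"
    starttime="02:00 AM"
    interval="3600">
```

The re-index page takes no form input here, so the scheduler can call
it with a plain HTTP request.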
	