I'm not 100% sure about this, but I think the version of Verity
shipped with CF is so old that it doesn't understand that your page
with the extension .xml is rendered like an HTML page in the browser,
and therefore doesn't know how to parse it. I had a similar problem
with PDF files recently. The Verity shipped with CFMX can only read
PDFs created with Acrobat version 4 or earlier, so when I asked Verity
to index about 300 PDFs created with Acrobat 6 it simply ignored them.
The solution I came up with was to use a third-party Java API (pdfbox)
to extract the text from each PDF file and then pass that text to
Verity. Worked a treat. Maybe you will need to do something similar:
extract the text from your XML pages and pass that to Verity with the
filename of the page.
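
To give a rough idea of the extraction step for XML, here is a minimal
sketch in Java (CFMX runs on Java, so a class like this could be called
from CF). It uses only the XML parser bundled with the JDK. The class
name XmlTextExtractor and the method extractText are made up for
illustration; this is not the code I used for the PDFs, and it assumes
your pages are well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: strips the markup from an XML page so that only
// the plain text is left to hand to the search engine.
final class XmlTextExtractor {

    // Parses the XML source and returns its text with tags removed and
    // runs of whitespace collapsed to single spaces.
    static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            // Malformed XML (or a parser setup problem) ends up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Walks the DOM tree depth-first and appends every text or CDATA
    // node, followed by a space so words in adjacent elements stay apart.
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }
}
```

The string it returns is what you would pass to Verity, with the
filename of the page as the key. Note that the sketch puts a space
between every text node, so a word split by inline markup would be
indexed as two words; that was an acceptable trade-off for me, but it
may not be for you.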

Also, I'm not sure Verity can index CFM pages either, as it would need
to call the CFM and get the returned page. I'm pretty sure that Verity
only reads files. So it will index your HTML pages, as it simply reads
the file, removes the HTML tags and indexes the remaining text. Verity
can be very good, but it needs some tweaking.

Also, if you have a good budget, and I mean good, you could look at
using Verity Ultraseek, as that will spider a site and index pages,
PDFs, etc., and all you do is give it the base URL. You can connect to
it from CF as a web service; it is pretty easy. But like I say,
Ultraseek is not cheap.

Hope that helps.

Andrew.

----- Original Message -----
From: Charles Chen <[EMAIL PROTECTED]>
Date: Thu, 02 Sep 2004 18:27:38 -0400
Subject: XML verity indexing problem
To: CF-Talk <[EMAIL PROTECTED]>

Hi All. I'm a fairly new CF developer. I'm having weirdness indexing a
website that is mostly a collection of XML pages. I'm really stumped.
Can someone help me?

If I create the Verity collection via the ColdFusion Administrator and
add .XML to the extension types, it will properly create and index the
site and the XML files will show up in my search.

However, I need to use the programmatic method rather than the
administrator because content on the site changes frequently and I
want to schedule a CFM page to index the collection automatically at
intervals. So, I programmed CFM pages to create a collection and then
index it. In the cfindex, I have specified .xml as an extension type
in addition to the standard HTML, CFM, etc. However, when I do a
search the indexed collection only returns HTML files, even though XML
and CFM have been declared as extensions to index as well. I tried
using action="" and action="". I even tried
action="" followed by action="", but no dice. Any idea why
this is the case? I've cut and pasted my code for creating a
collection and for indexing it.

Thank you for any generous help!

- Charles

Create collection action page (takes values from form):

<cfoutput>
<!--- Dispatch on the action chosen in the form --->
<cfswitch expression="#Form.collectionaction#">
<cfcase value="Create">
<cfcollection action="create" collection="#Form.CollectionName#"
path="c:\cfusionmx\verity\collections\">
<p>The collection #Form.CollectionName# is created.
</cfcase>

<cfcase value="Repair">
<cfcollection action="repair" collection="#Form.CollectionName#">
<p>The collection #Form.CollectionName# is repaired.
</cfcase>

<cfcase value="Optimize">
<cfcollection action="optimize" collection="#Form.CollectionName#">
<p>The collection #Form.CollectionName# is optimized.
</cfcase>

<cfcase value="Delete">
<cfcollection action="delete" collection="#Form.CollectionName#">
<p>Collection deleted.
</cfcase>
</cfswitch>
</cfoutput>

Index collection page (takes collection name value from form):

<cfindex collection="#Form.IndexColl#"
    action=""
    extensions=".htm, .html, .cfm, .cfml, .xml"
    key="c:\InetPub\wwwroot\www.vsarts.org\"
    type="path"
    urlpath="http://www.vsarts.org"
    recurse="Yes"
    language="English">

<cfoutput>
    The collection #Form.IndexColl# has been indexed.
</cfoutput>