Re: XML verity indexing problem
Judging by your paths I am going to guess that you are on Windows. You might want to take a look at the verity spider. Just a thought.

Doug
Re: XML verity indexing problem
I'm not 100% sure about this, but I think the version of Verity shipped with CF is so old that it doesn't understand that your page with the .xml extension is rendered like an HTML page in the browser, and it therefore doesn't know how to parse it.

I had a similar problem with PDF files recently. The Verity shipped with CFMX can only read PDFs up to those created with version 4 of Acrobat, so when I asked Verity to index about 300 PDFs created with Acrobat 6, it simply ignored them. The solution I came up with was to use a third-party Java API (pdfbox) to extract the text from each PDF file and then pass that text to Verity. Worked like a treat. Maybe you will need to do something similar: extract the text from your XML pages and pass that to Verity along with the filename of the page.

Also, I'm not sure Verity can index CFM pages either, as it would need to call the CFM and get the returned page. I'm pretty sure that Verity only reads files. So it will index your HTML pages, as it simply reads the file, removes the HTML tags, and indexes the remaining text.

Verity can be very good, but it needs some tweaking. Also, if you have a good budget, and I mean good, you could look at using Verity Ultraseek, as that will spider a site and index pages, PDFs, etc., and all you do is give it the base URL. You can connect to it with CF as a web service; it is pretty easy. But like I say, Ultraseek is not cheap.

Hope that helps.

Andrew.

- Original Message -
From: Charles Chen <[EMAIL PROTECTED]>
Date: Thu, 02 Sep 2004 18:27:38 -0400
Subject: XML verity indexing problem
To: CF-Talk <[EMAIL PROTECTED]>
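Andrew's suggestion of extracting the text yourself and handing it to Verity could be sketched roughly as below. This is only an illustration under several assumptions: the query and its column names (docs, docKey, docTitle, docBody) are made up for the example, the tag-stripping regular expression is a crude stand-in for real XML text extraction, and cfdirectory as used here lists a single directory without recursing. The site path, URL, and Form.IndexColl come from the original post.

```cfml
<!--- Sketch only: the query name, column names, and regex are illustrative assumptions --->
<cfset siteRoot = "c:\InetPub\wwwroot\www.vsarts.org\">

<!--- List the XML files in the site root (this call does not recurse into subfolders) --->
<cfdirectory action="list" directory="#siteRoot#" filter="*.xml" name="xmlFiles">

<!--- Build a query with one row per XML file: a key, a title, and the extracted text --->
<cfset docs = QueryNew("docKey,docTitle,docBody")>
<cfloop query="xmlFiles">
    <cffile action="read" file="#siteRoot##xmlFiles.name#" variable="rawXml">
    <cfset QueryAddRow(docs)>
    <cfset QuerySetCell(docs, "docKey", "http://www.vsarts.org/" & xmlFiles.name)>
    <cfset QuerySetCell(docs, "docTitle", xmlFiles.name)>
    <!--- Crude text extraction: replace anything that looks like a tag with a space --->
    <cfset QuerySetCell(docs, "docBody", REReplace(rawXml, "<[^>]*>", " ", "all"))>
</cfloop>

<!--- Hand the extracted text to Verity as custom (query-based) documents --->
<cfindex collection="#Form.IndexColl#"
         action="update"
         type="custom"
         query="docs"
         key="docKey"
         title="docTitle"
         body="docBody">
```

With type="custom", Verity indexes the text in the query rather than reading the files itself, so the file extension no longer matters. The key column is what a search returns for each hit, which is why the sketch stores the page URL there.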
XML verity indexing problem
Hi All. I'm a fairly new CF developer. I'm having weirdness indexing a website that is mostly a collection of XML pages. I'm really stumped. Can someone help me?

If I create the Verity collection via the ColdFusion Administrator and add .XML to the extension types, it properly creates and indexes the site, and the XML files show up in my search.

However, I need to use the programmatic method rather than the Administrator because content on the site changes frequently and I want to schedule a CFM page to index the collection automatically at intervals. So, I programmed CFM pages to create a collection and then index it. In the cfindex, I have specified .xml as an extension type in addition to the standard HTML, CFM, etc. However, when I do a search, the indexed collection only returns HTML files, even though XML and CFM have been declared as extensions to index as well. I tried using action="" and action="". I even tried action="" followed by action="", but no dice. Any idea why this is the case? I've cut and pasted my code for creating a collection and for indexing it.

Thank you for any generous help!

- Charles

Create collection action page (takes values from form):

collection="#Form.CollectionName#"
path="c:\cfusionmx\verity\collections\">
The collection #Form.CollectionName# is created.

collection="#Form.CollectionName#">
The collection #Form.CollectionName# is repaired.

collection="#Form.CollectionName#">
The collection #Form.CollectionName# is optimized.

collection="#Form.CollectionName#">
Collection deleted.

Index collection page (takes collection name value from form):

action="">
extensions=".htm, .html, .cfm, .cfml, .xml"
key="c:\InetPub\wwwroot\www.vsarts.org\"
type="path"
urlpath="http://www.vsarts.org"
recurse="Yes"
language="English">
The collection #Form.IndexColl# has been indexed.
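For readers following along, here is a rough sketch of what a scheduled indexing page along the lines described above might look like. The opening tags and the action values did not survive in the archived message, so the cfindex tag name, the collection attribute, and action="refresh" are assumptions inferred from the remaining attributes. The collection name, task name, page URL, start date, and interval are hypothetical placeholders. The other attribute values are copied from the post.

```cfml
<!--- Sketch only. A scheduled request carries no form fields, so the collection
      name is hard-coded here; "vsarts" is a hypothetical name. --->
<cfset collectionName = "vsarts">

<!--- Assumed reconstruction of the index call; action="refresh" is a guess --->
<cfindex collection="#collectionName#"
         action="refresh"
         type="path"
         key="c:\InetPub\wwwroot\www.vsarts.org\"
         urlpath="http://www.vsarts.org"
         extensions=".htm, .html, .cfm, .cfml, .xml"
         recurse="Yes"
         language="English">

<!--- Run once from a separate setup page to register the task.
      Task name, URL, start date, and interval are all placeholders. --->
<cfschedule action="update"
            task="ReindexSite"
            operation="HTTPRequest"
            url="http://www.vsarts.org/indexcollection.cfm"
            startDate="9/3/2004"
            startTime="2:00 AM"
            interval="daily">
```

The cfschedule call only registers the task; the page it points at is the one that contains the cfindex call. This sketch does not by itself explain why the .xml files were skipped in the original code.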