Re: Moving Verity collections from Win to Linux (PDF/DOC problem)
Hi Dave,

While I had already read those links, they were indeed among the most helpful. However, it's still hard to anticipate the pitfalls of Lucene/CFMX spidering just by going through these tutorials. Therefore, to eliminate a lot of the unknowns, I'm going to avoid Lucene for the time being and try to hack a Verity solution together.

Thanks,
Jamie

On Fri, 13 Feb 2004 18:14:09 -0500, in cf-talk you wrote:
> Perhaps these links might help in your quest?
>
> Searching with Lucene and MX:
> Part 1: http://www.sys-con.com/coldfusion/article.cfm?id=629
> Part 2: http://www.sys-con.com/coldfusion/article.cfm?id=639
>
> Extracting text from a PDF (from Matt Liotta's 1/12 blog entries):
> http://devilm.com/mt/mt-tb.cgi/60
Re: Moving Verity collections from Win to Linux (PDF/DOC problem)
It would appear you are going the free route. If that is not true, don't forget about Google:
http://www.google.com/services/

On Fri, 2004-02-13 at 13:56, Jamie Jackson wrote:
> I've been tasked with estimating the LOE of making a CFMX/Linux site searchable. The site needs to be spidered (as opposed to a *regular* Verity index), and PDFs and DOCs need to be indexed as well.
>
> Issue: AFAIK, Verity still can't directly index DOCs and PDFs.
>
> The options, as I see them, are:
> 1. Copy the site to a Win box (running CF5), do the VK2 spidering/indexing there, then move the collection to the CFMX/Linux box.
> 2. Stick with _CFMX_/Linux/VK2, and run toText routines on problem file types.
> 3. Go with Lucene.
>
> Seeing that MM/Verity isn't addressing the PDF/DOC issue (or are they?), it seems that the best long-term solution would be #3 (Lucene), but it's a big unknown for me. I don't have much of a clue as to how long it would take me (a Java novice) to set up a spider/index/search for the first time, or what deficiencies I'd be left with once it had been set up.
>
> #2 seems okay, but it could get complicated when it comes to crawling to the text alternatives. I'm also unsure what becomes of metadata (e.g., titles) when doing these conversions.
>
> However, the solution that falls best within my current skillset is #1, as I've done several Win/VK2/CF5 spiders.
>
> Here's the question: Is this solution as straightforward as it seems? I know there are several steps, but having done the aforementioned spiders, I would guess it would take me two days to knock this out (leaving me with a somewhat less than automatic process for future updates, which I could automate later). Are there any GOTCHAs here?
>
> Thanks,
> Jamie
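Option 2 in the quoted message (running toText routines on the problem file types) could start from something like the sketch below. It only plans the work: it walks a site directory and pairs each PDF/DOC with the path of a text alternative for the spider to crawl. The function name, the extension list, and the ".txt" naming scheme are assumptions made for illustration, and the conversion step itself is deliberately left out, since the thread does not say which toText tool would be used.

```python
import os

# Extensions assumed (for illustration only) to need conversion before indexing.
NEEDS_CONVERSION = {".pdf", ".doc"}


def plan_conversions(site_root):
    """Return (source_path, text_alternative_path) pairs under site_root.

    Hypothetical helper: it does not convert anything, it only lists
    which files a toText routine would have to process.
    """
    plan = []
    for dirpath, _dirnames, filenames in os.walk(site_root):
        for name in sorted(filenames):
            _base, ext = os.path.splitext(name)
            if ext.lower() in NEEDS_CONVERSION:
                source = os.path.join(dirpath, name)
                # Keep the original extension in the output name so that
                # report.pdf and report.doc do not collide on report.txt.
                plan.append((source, source + ".txt"))
    return plan
```

Calling `plan_conversions("/path/to/site")` returns the list of pairs to feed to whatever converter is chosen. Carrying metadata such as titles across the conversion, which the message flags as an open question, is not handled here.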
Re: Moving Verity collections from Win to Linux (PDF/DOC problem)
On Fri, 2004-02-13 at 13:56, Jamie Jackson wrote:
> I've been tasked with estimating the LOE of making a CFMX/Linux site searchable. The site needs to be spidered (as opposed to a *regular* Verity index), and PDFs and DOCs need to be indexed as well.
>
> Issue: AFAIK, Verity still can't directly index DOCs and PDFs.
>
> The options, as I see them, are:
> 1. Copy the site to a Win box (running CF5), do the VK2 spidering/indexing there, then move the collection to the CFMX/Linux box.
> 2. Stick with _CFMX_/Linux/VK2, and run toText routines on problem file types.
> 3. Go with Lucene.
>
> Seeing that MM/Verity isn't addressing the PDF/DOC issue (or are they?), it seems that the best long-term solution would be #3 (Lucene), but it's a big unknown for me. I don't have much of a clue as to how long it would take me (a Java novice) to set up a spider/index/search for the first time, or what deficiencies I'd be left with once it had been set up.
>
> #2 seems okay, but it could get complicated when it comes to crawling to the text alternatives. I'm also unsure what becomes of metadata (e.g., titles) when doing these conversions.
>
> However, the solution that falls best within my current skillset is #1, as I've done several Win/VK2/CF5 spiders.
>
> Here's the question: Is this solution as straightforward as it seems? I know there are several steps, but having done the aforementioned spiders, I would guess it would take me two days to knock this out (leaving me with a somewhat less than automatic process for future updates, which I could automate later). Are there any GOTCHAs here?

Perhaps these links might help in your quest?

Searching with Lucene and MX:
Part 1: http://www.sys-con.com/coldfusion/article.cfm?id=629
Part 2: http://www.sys-con.com/coldfusion/article.cfm?id=639

Extracting text from a PDF (from Matt Liotta's 1/12 blog entries):
http://devilm.com/mt/mt-tb.cgi/60

Regards,
Dave.
Re: Moving Verity collections from Win to Linux (PDF/DOC problem)
On 13 Feb 2004 14:46:18 -0800, in cf-talk you wrote:
> It would appear you are going the free route, if that is not true don't forget about google.
> http://www.google.com/services/

Hmm, I had forgotten about Google. If I can do what I need with robots.txt (with respect to filtering), this might be a viable solution for this project.

Does anyone know roughly how often Google re-indexes a site (that is, how long a stale index might live)?

This is a noteworthy solution, but I'd still appreciate any information anybody has on the other solutions I mentioned.

Thanks,
Jamie
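Whether robots.txt can express the filtering needed is easy to try out before committing to a crawler-based service. The sketch below uses Python's standard `urllib.robotparser` to check which URLs a given rule set would let a crawler fetch. The rules, paths, and host name are invented for illustration and are not taken from the site discussed in the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the paths are made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_crawlable(ROBOTS_TXT, "http://www.example.com/docs/report.pdf")` is allowed and anything under `/admin/` is not. Note that robots.txt only tells well-behaved crawlers what to skip. It does not control how often they return.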
RE: Moving Verity Collections
Can you export the registry entry under the Allaire key that represents the Verity collections, then import it on the new machine?

John Cesta
http://www.cybersmarts.net - ColdFusion ASP and ActiveState PERL Hosting
www.serverautomationtools.com

-----Original Message-----
From: Morgan, Thomas J. [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 09, 2000 8:02 AM
To: '[EMAIL PROTECTED]'
Subject: Moving Verity Collections

I am upgrading our web server and need to move some Verity collections from the old server to the new one. Any suggestions on the procedure?

Thanks.

Thomas J. Morgan
Information Delivery Systems
Research Triangle Institute
3040 Cornwallis Road
RTP, NC 27709
(919)541-7414
[EMAIL PROTECTED]
Http:\\ids.rti.org

--
Archives: http://www.mail-archive.com/cf-talk@houseoffusion.com/
To Unsubscribe visit http://www.houseoffusion.com/index.cfm?sidebar=listsbody=lists/cf_talk or send a message to [EMAIL PROTECTED] with 'unsubscribe' in the body.
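If the collections end up in a different directory on the new server, the exported registry file would presumably need its paths adjusted before import. The sketch below is one way to do that on the text of an exported .reg file. It assumes the collection locations are stored as ordinary string values. The value name and paths in the usage note are placeholders, not the real ColdFusion entries, which the thread does not spell out.

```python
def rewrite_reg_paths(reg_text, old_root, new_root):
    """Replace old_root with new_root inside the text of an exported .reg file.

    Hypothetical helper. String values in .reg files write each backslash
    as a double backslash, so both roots are escaped the same way before
    the substitution.
    """
    old_escaped = old_root.replace("\\", "\\\\")
    new_escaped = new_root.replace("\\", "\\\\")
    return reg_text.replace(old_escaped, new_escaped)
```

For example, given a line such as `"Path"="C:\\old\\verity\\docs"` (a made-up value), calling `rewrite_reg_paths(line, r"C:\old\verity", r"D:\new\verity")` would yield `"Path"="D:\\new\\verity\\docs"`. The collection directories themselves still have to be copied across separately.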