It would appear you're going the free route; if that's not the case, don't
forget about Google:
http://www.google.com/services/

On Fri, 2004-02-13 at 13:56, Jamie Jackson wrote:
> I've been tasked with estimating the LOE of making a CFMX/Linux site
> searchable. The site needs to be spidered (as opposed to a *regular*
> Verity index), and PDFs and DOCs need to be indexed as well.
>
> Issue: AFAIK, Verity still can't directly index DOCs and PDFs.
>
> The options, as I see them, are:
> 1. Copy site to a Win box (running CF5), and do the VK2
> spidering/indexing there, then move the collection to the CFMX/Linux
> box.
> 2. Stick with _CFMX_/Linux/VK2, and run "toText" routines on problem
> file types.
> 3. Go with Lucene.
>
> Seeing that MM/Verity isn't addressing the PDF/DOC issue (or are
> they?), it seems that the best long-term solution would be #3
> (Lucene), but it's a big unknown for me. I don't have much of a clue
> as to how long it would take me (a Java novice) to set up a
> spider/index/search for the first time, and what potential
> deficiencies I'd be left with once it had been set up.
>
> #2 seems okay, but it could get complicated when it comes to pointing
> the crawler at the text alternatives. I'm also unsure what becomes of
> metadata (e.g., titles) during these conversions.
>
> However, the solution that falls best within my current skillset is
> #1, as I've done several Win/VK2/CF5 spiders. Here's the question: Is
> this solution as straightforward as it seems? I know there are several
> steps, but having done the aforementioned spiders, I would guess it
> would take me two days to knock this out (leaving me with a somewhat
> less than automatic process for future updates... which I could
> automate later). Are there any GOTCHAs here?
>
> Thanks,
> Jamie
>
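For option #2, here's a rough sketch of how the "toText" step might hang together: convert each problem file to a plain-text sibling the spider can reach, and carry the title across as the first line so that bit of metadata isn't lost. The converter commands (pdftotext from xpdf, antiword for DOCs) and the title-line convention are my assumptions, not anything Verity or CFMX gives you.

```python
import os

# Hypothetical mapping of "problem" file types to command-line
# converters; these tool names are assumptions on my part.
CONVERTERS = {
    ".pdf": ["pdftotext", "-q"],   # pdftotext -q in.pdf out.txt
    ".doc": ["antiword"],          # antiword in.doc > out.txt
}

def text_alternative(path):
    """Return (command, txt_path) for a convertible file, else None."""
    root, ext = os.path.splitext(path)
    tool = CONVERTERS.get(ext.lower())
    if tool is None:
        return None                # HTML etc. gets indexed as-is
    return tool + [path], root + ".txt"

def with_title(title, body):
    """Prepend the original title so at least that metadata survives
    the conversion to plain text (a convention, not a Verity feature)."""
    return (title.strip() + "\n\n" + body) if title else body
```

You'd run something like this over the site tree before spidering, then point the crawler (or a plain Verity file index) at the generated .txt files alongside the originals.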