And is there something smart you have done to make this go fast, both on the infrastructure or the system it selves?
Best regards, Gard Arneson Haugen
Email : [EMAIL PROTECTED]
Mobile: +47 93 05 01 91 Fax : +47 21 95 51 99
Magenta News AS - Møllergata 8, 0179 Oslo
Will Allen wrote:
I have an application that I run monthly that indexes 40 million documents into 6 indexes, then uses a multisearcher. The advantage for me is that I can have multiple writers indexing 1/6 of that total data reducing the time it takes to index by about 5X.
-----Original Message----- From: Luke Shannon [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 10, 2004 2:39 PM To: Lucene Users List Subject: Re: Acedemic Question About Indexing
Don't worry, regardless of what I learn in this forum I am telling my company to get me a copy of that bad boy when it comes out (which as far as I am concerned can't be soon enough). I will pay for grama's myself.
I think I have reviewed the code you are referring to and have something similar working in my own indexer (using the "uid"). All is well.
My stupid question for the day is why would you ever want multiple indexes running if you can build one smart indexer that does everything as efficiently as possible? Does the answer to this question move me to multi threaded indexing territory?
Thanks,
Luke
----- Original Message ----- From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Acedemic Question About Indexing
Uh, I hate to market it, but.... it's in the book. But you don't have to wait for it, as there already is a Lucene demo that does what you described. I am not sure if the demo always recreates the index or whether it deletes and re-adds only the new and modified files, but if it's the former, you would only need to modify the demo a little bit to check the timestamps of File objects and compare them to those stored in the index (if they are being stored - if not, you should add a field to hold that data)
Otis
--- Luke Shannon <[EMAIL PROTECTED]> wrote:
I am working on debugging an existing Lucene implementation.
Before I started, I built a demo to understand Lucene. In my demo I indexed the entire content hierarhcy all at once, and than optimize this index and used it for queries. It was time consuming but very simply.
The code I am currently trying to fix indexes the content hierarchy by folder creating a seperate index for each one. Thus it ends up with a bunch of indexes. I still don't understand how this works (I am assuming they get merged someone that I have tracked down yet) but I have noticed it doesn't always index the right folder. This results in the users reporting "inconsistant" behavior in searching after they make a change to a document. To keep things simiple I would like to remove all the logic that figures out which folder to index and just do them all (usually less than 1000 files) so I end up with one index.
Would indexing time be the only area I would be losing out in, or is there something more to the approach of creating multiple indexes and merging them.
What is a good approach I can take to indexing a content hierarchy composed primarily of pdf, xsl, doc and xml where any of these documents can be changed several times a day?
Thanks,
Luke
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]