Re: Similar Document Search

2003-08-20 Thread Peter Becker
Hi all, it seems there are quite a few people looking for similar features, i.e. (a) document identity and (b) forward indexing. So far we hacked (a) by using a wrapper implementing equals/hashCode based on a unique field, but of course that assumes maintaining a unique field in the index. (b)
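
The mail does not show the wrapper itself; the following is only a sketch of how such a wrapper might look, assuming a stored field named "uid" (the field name is made up here) holds the unique key:

    import org.apache.lucene.document.Document;

    // Wraps a Lucene Document so that equality is defined by a unique
    // stored field rather than by object identity.
    public class DocumentKey {
        private final Document doc;

        public DocumentKey(Document doc) { this.doc = doc; }

        public Document getDocument() { return doc; }

        public boolean equals(Object other) {
            return other instanceof DocumentKey
                && doc.get("uid").equals(((DocumentKey) other).doc.get("uid"));
        }

        public int hashCode() {
            return doc.get("uid").hashCode();
        }
    }

This only works if every document in the index actually carries the "uid" field, which is exactly the maintenance burden Peter mentions.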

Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Doug Cutting
Leo Galambos wrote: Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course). It depends. If you have 10 machines, each with a single disk, that you use for indexing in parallel,

Re: Will failed optimize corrupt an index?

2003-08-20 Thread Doug Cutting
The index should be fine. Lucene index updates are atomic. Doug Dan Quaroni wrote: My index grew about 7 gigs larger than I projected it would, and it ran out of disk space during optimize. Does lucene have transactions or anything that would prevent this from corrupting an index, or do I need

Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Leo Galambos
Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course). What strategy do you use in "nutch"? THX -g- Doug Cutting wrote: As the index grows, disk i/o becomes the bottleneck.

Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Doug Cutting
As the index grows, disk i/o becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, optimize each index and
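
The mail does not include code; a minimal sketch of the final merge step Doug describes, written against the Lucene 1.3 API and with made-up directory paths, could look like this:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeSubIndexes {
        public static void main(String[] args) throws Exception {
            // Sub-indexes built (and optimized) in parallel, e.g. one per CPU/disk.
            Directory[] parts = new Directory[] {
                FSDirectory.getDirectory("/index/part1", false),
                FSDirectory.getDirectory("/index/part2", false)
            };

            // Merge them into one final index.
            IndexWriter writer =
                new IndexWriter("/index/merged", new StandardAnalyzer(), true);
            writer.addIndexes(parts);
            writer.close();
        }
    }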

RE: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Dan Quaroni
Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3ms/doc.

Re: Lucene Index on NFS Server

2003-08-20 Thread Doug Cutting
I don't know the details of how lock files are unreliable over NFS, only that they are. The window of vulnerability, when the lock file is used, is when one JVM is opening all of the files in an index, and another is completing an update at the same time. If the updating machine removes some

Re: Searching while optimizing

2003-08-20 Thread Doug Cutting
That is an old FAQ item. Lucene has been thread safe for a while now. Doug Steve Rajavuori wrote: This seems to contradict an item from the Lucene FAQ: << 41. Can I modify the index while performing ongoing searches ? Yes and no. At the time of writing this FAQ (June 2001), Lucene is not thread

Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Dan Quaroni
Hey there. What's the fastest way to do a batch index with lucene 1.3-rc1 on a dual or quad-processor box? The files I'm indexing are very easy to divide among multiple threads. Here's what I've done at this point: Each thread has its own IndexWriter writing to its own RAMDirectory. Ever
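
The rest of Dan's setup is cut off by the archive; a rough sketch of the per-thread part he describes, with invented field names, might be:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    // Each indexing thread owns a private IndexWriter over a private
    // RAMDirectory, so no synchronization is needed while indexing.
    public class RamIndexerThread extends Thread {
        private final RAMDirectory dir = new RAMDirectory();
        private final String[] texts;   // this thread's share of the input

        public RamIndexerThread(String[] texts) { this.texts = texts; }

        public RAMDirectory getDirectory() { return dir; }

        public void run() {
            try {
                IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
                for (int i = 0; i < texts.length; i++) {
                    Document doc = new Document();
                    doc.add(Field.Text("contents", texts[i]));
                    writer.addDocument(doc);
                }
                writer.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

The in-memory indexes can then be folded into an on-disk index with IndexWriter.addIndexes, as in the merging sketch above.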

Re: Question on Lucene when indexing big pdf files

2003-08-20 Thread Yang Sun
Hi, When I use Luke to look at my index, it seems all right. The content in the index looks fine; all the contents are extracted from the pdf file. I copy text from the pdf file content (namely the "content" field) and search for that keyword, but I cannot find the keyword either. I think there is nothing wron

RE: Similar Document Search

2003-08-20 Thread Gregor Heinrich
Hello Terry, Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not really completely fixed yet, the

Re: Question on Lucene when indexing big pdf files

2003-08-20 Thread Ben Litchfield
> "cisco". I use Luke and my searcher program as the searching client, > it seems no problem. Can anyone help me? Or any comments on this When you use luke to look at your index does it show the correct contents for those documents? Ben -

Re: Question on Lucene when indexing big pdf files

2003-08-20 Thread Damien Lust
I don't know if it can help you, but here is my code to extract text from a pdf doc: /** * Extracts text from a pdf document * * @param in The InputStream representing the pdf file. * @return The text in the file */ public String extractText(InputStream in) { String s = n
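
Damien's method is truncated by the archive; for reference only, a comparable extraction method written against the later Apache PDFBox 1.x API (not his original code, and not the library version current in 2003) might look like:

    import java.io.InputStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    /** Extracts text from a pdf document given as an InputStream. */
    public String extractText(InputStream in) {
        PDDocument document = null;
        try {
            document = PDDocument.load(in);
            return new PDFTextStripper().getText(document);
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        } finally {
            if (document != null) {
                try { document.close(); } catch (Exception ignored) { }
            }
        }
    }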

updating a document

2003-08-20 Thread Lars Hammer
Hello I'm trying to update a document in my index. As far as I can tell from the FAQ and other places in the documentation, the only way to do this is by deleting the document and adding it again. Now, I want to be able to add the document anew but keep from having to re-parse the original file a
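
For what it's worth, the usual delete-then-re-add cycle looks roughly like the sketch below; the "path" field used as the document key is an assumption, not something Lars mentions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateDocument {
        public static void update(String indexDir, String path, Document newVersion)
                throws Exception {
            // 1. Delete the old copy, identified by a unique key field.
            IndexReader reader = IndexReader.open(indexDir);
            reader.delete(new Term("path", path));
            reader.close();

            // 2. Add the new copy. Rebuilding it from stored fields instead of
            //    re-parsing the original file only works if every field was stored.
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            writer.addDocument(newVersion);
            writer.close();
        }
    }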

Question on Lucene when indexing big pdf files

2003-08-20 Thread Yang Sun
Hi, I am a newbie to Lucene. I want to index all my hard disk contents for searching; this includes html files, pdf files, word files, etc. But I have encountered a problem when I try to index pdf files, and I need your help. My environment is lucene-1.3-rc (lucene-1.2 has also been tried),