Hi all,
It seems there are quite a few people looking for similar features, i.e.
(a) document identity and (b) forward indexing. So far we hacked (a) by
using a wrapper implementing equals/hashCode based on a unique field,
but of course that assumes maintaining a unique field in the index. (b)
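For (a), here is a minimal sketch of the wrapper idea, assuming every document carries a unique field. The class name DocumentIdentity, the field name "uid" and the plain Map standing in for a Lucene Document are all hypothetical choices for illustration, not the code the poster actually used:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Wraps a document so that identity is decided by one unique field only. */
final class DocumentIdentity {
    private final String key;                 // value of the unique field
    private final Map<String, String> fields; // stand-in for a Lucene Document (assumption)

    DocumentIdentity(String uniqueField, Map<String, String> fields) {
        // Fail early: the scheme only works if the unique field is present.
        this.key = Objects.requireNonNull(fields.get(uniqueField),
                "document has no value for unique field " + uniqueField);
        this.fields = Collections.unmodifiableMap(new HashMap<>(fields));
    }

    String key() {
        return key;
    }

    Map<String, String> fields() {
        return fields;
    }

    @Override
    public boolean equals(Object other) {
        // Two wrappers are equal when their unique field values match,
        // regardless of any other field contents.
        return other instanceof DocumentIdentity
                && key.equals(((DocumentIdentity) other).key);
    }

    @Override
    public int hashCode() {
        return key.hashCode();
    }
}
```

Two wrappers built from documents with the same "uid" collapse to a single entry in a HashSet, so the approach is only as reliable as the uniqueness of that field in the index.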
Leo Galambos wrote:
Isn't it better for Dan to skip the optimization phase before merging? I
am not sure, but he could save some time on this (if he has enough file
handles for that, of course).
It depends. If you have 10 machines, each with a single disk, that you
use for indexing in parallel,
The index should be fine. Lucene index updates are atomic.
Doug
Dan Quaroni wrote:
My index grew about 7 gigs larger than I projected it would, and it ran out
of disk space during optimize. Does Lucene have transactions or anything
that would prevent this from corrupting an index, or do I need
Isn't it better for Dan to skip the optimization phase before merging? I
am not sure, but he could save some time on this (if he has enough file
handles for that, of course). What strategy do you use in "nutch"?
THX
-g-
Doug Cutting wrote:
As the index grows, disk i/o becomes the bottleneck. The default
indexing parameters do a pretty good job of optimizing this. But if you
have lots of CPUs and lots of disks, you might try building several
indexes in parallel, each containing a subset of the documents, optimize
each index and
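As a rough illustration of the "several indexes in parallel, each containing a subset of the documents" idea, here is a sketch in plain Java. It does not use the Lucene API: the in-memory term-to-document-id maps are a hypothetical stand-in for the per-subset indexes, and the names ParallelIndexSketch, buildPartial and build are made up for this example. It only shows the shape of the approach (partition, build independently, merge), not how Lucene performs it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelIndexSketch {

    /** Builds a partial index over docs[from, to): term -> ids of documents containing it. */
    static Map<String, TreeSet<Integer>> buildPartial(List<String> docs, int from, int to) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int id = from; id < to; id++) {
            for (String term : docs.get(id).toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    /** Splits docs into contiguous subsets, indexes each on its own thread, then merges. */
    static Map<String, TreeSet<Integer>> build(List<String> docs, int partitions) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            int chunk = (docs.size() + partitions - 1) / partitions; // ceiling division
            List<Future<Map<String, TreeSet<Integer>>>> parts = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                final int from = Math.min(docs.size(), p * chunk);
                final int to = Math.min(docs.size(), from + chunk);
                Callable<Map<String, TreeSet<Integer>>> task = () -> buildPartial(docs, from, to);
                parts.add(pool.submit(task));
            }
            // Merge step: combine the posting sets of every partial index.
            Map<String, TreeSet<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, TreeSet<Integer>>> part : parts) {
                for (Map.Entry<String, TreeSet<Integer>> e : part.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The partial builds share no state, which is what lets them run on separate CPUs (and, in the real setting, separate disks); the cost then concentrates in the single merge at the end.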
Looks like I spoke too soon... As the index gets larger, time to merge
becomes prohibitively high. It appears to increase linearly.
Oh well. I guess I'll just have to go with about 3ms/doc.
I don't know the details of how lock files are unreliable over NFS, only
that they are. The window of vulnerability, when the lock file is used,
is when one JVM is opening all of the files in an index, and another is
completing an update at the same time. If the updating machine removes
some
That is an old FAQ item. Lucene has been thread safe for a while now.
Doug
Steve Rajavuori wrote:
This seems to contradict an item from the Lucene FAQ:
<<
41. Can I modify the index while performing ongoing searches ?
Yes and no. At the time of writing this FAQ (June 2001), Lucene is not
thread
Hey there. What's the fastest way to do a batch index with Lucene 1.3-rc1
on a dual or quad-processor box? The files I'm indexing are very easy to
split among multiple threads.
Here's what I've done at this point:
Each thread has its own IndexWriter writing to its own RAMDirectory. Ever
Hi,
When I use Luke to look at my index, it seems all right. The content in the index
looks fine; all of it was extracted from the PDF file. I copied the PDF file content
(namely the "content" field) and searched for the keyword, but I could not find the keyword
either. I think there is nothing wron
Hello Terry,
Lucene can do forward indexing, as Mark Rosen outlines in his Master's
thesis: http://citeseer.nj.nec.com/rosen03email.html.
We use a similar approach for (probabilistic) latent semantic analysis and
vector space searches. However, the solution is not really completely fixed
yet, the
> "cisco". I use Luke and my searcher program as the searching client,
> it seems no problem. Can anyone help me? Or any comments on this
When you use Luke to look at your index, does it show the correct contents
for those documents?
Ben
I don't know if this will help, but here is my code to extract
text from a PDF document:
/**
 * Extracts text from a pdf document
 *
 * @param in The InputStream representing the pdf file.
 * @return The text in the file
 */
public String extractText(InputStream in)
{
    String s = n
Hello
I'm trying to update a document in my index. As far as I can tell from the FAQ and
other documentation, the only way to do this is by deleting the document and
adding it again.
Now, I want to be able to add the document anew without having to re-parse the
original file a
Hi,
I am a newbie to Lucene. I want to index the entire contents of my hard disk for
searching, including HTML, PDF, Word and other files. But I have encountered
a problem when trying to index PDF files, and I need your help.
My environment is lucene-1.3-rc (lucene-1.2 has also been tried),