Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Doug Cutting
Yonik Seeley wrote:
6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.
This is a promising idea for handling a high update volume because it
avoids all of the search nodes having to do the analysis phase.
A clever way to do this is to take advantage of Lucene's index file 
structure.  Indexes are directories of files.  As the index changes 
through additions and deletions most files in the index stay the same. 
So you can efficiently synchronize multiple copies of an index by only 
copying the files that change.

The way I did this for Technorati was to:
1. On the index master, periodically checkpoint the index.  Every minute 
or so the IndexWriter is closed and a 'cp -lr index index.DATE' command 
is executed from Java, where DATE is the current date and time.  This 
efficiently makes a copy of the index when it's in a consistent state by 
constructing a tree of hard links.  If Lucene re-writes any files (e.g., 
the segments file) a new inode is created and the copy is unchanged.

2. From a crontab on each search slave, periodically poll for new 
checkpoints.  When a new index.DATE is found, use 'cp -lr index 
index.DATE' to prepare a copy, then use 'rsync -W --delete 
master:index.DATE index.DATE' to get the incremental index changes. 
Then atomically install the updated index with a symbolic link (ln -fsn 
index.DATE index).

3. In Java on the slave, re-open 'index' when its version changes. 
This is best done in a separate thread that periodically checks the 
index version (see the sketch below).  When it changes, the new version is opened and a few 
typical queries are performed on it to pre-load Lucene's caches.  Then, 
in a synchronized block, the Searcher variable used in production is 
updated.

4. In a crontab on the master, periodically remove the oldest checkpoint 
indexes.

Technorati's Lucene index is updated this way every minute.  A 
mergeFactor of 2 is used on the master in order to minimize the number 
of segments in production.  The master has a hot spare.
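To make step 3 concrete, here is a rough Java sketch (exception handling 
omitted; the warm-up query, the "contents" field and the 'searcherLock' 
guard object are hypothetical, not part of any published code):

  // Background thread body: poll the index version, swap in a warmed searcher.
  IndexSearcher current = new IndexSearcher("index");
  long version = IndexReader.getCurrentVersion("index");
  while (true) {
    Thread.sleep(60 * 1000);                                  // poll once a minute
    long latest = IndexReader.getCurrentVersion("index");
    if (latest != version) {
      IndexSearcher fresh = new IndexSearcher("index");       // follows the new symlink
      fresh.search(new TermQuery(new Term("contents", "a"))); // typical query to warm caches
      synchronized (searcherLock) {                           // swap the production searcher
        current = fresh;
        version = latest;
      }
    }
  }

The old searcher is left for the garbage collector to close once any 
in-flight queries finish with it.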

Doug


Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Daniel Naber wrote:
After fixing this I can reproduce the problem with a local index that 
contains about 220.000 documents (700MB). Fetching the first document 
takes for example 30ms, fetching the last one takes >100ms. Of course I 
tested this with a query that returns many results (about 50.000). 
Actually it happens even with the default sorting, no need to sort by some 
specific field.
In part this is due to the fact that Hits first searches for the 
top-scoring 100 documents.  Then, if you ask for a hit after that, it 
must re-query.  In part this is also due to the fact that maintaining a 
queue of the top 50k hits is more expensive than maintaining a queue of 
the top 100 hits, so the second query is slower.  And in part this could 
be caused by other things, such as that the highest ranking document 
might tend to be cached and not require disk io.

One could perform profiling to determine which is the largest factor. 
Of these, only the first is really fixable: if you know you'll need hit 
50k then you could tell this to Hits and have it perform only a single 
query.  But the algorithmic cost of keeping the queue of the top 50k is 
the same as collecting all the hits and sorting them.  So, in part, 
getting hits 49,990 through 50,000 is inherently slower than getting 
hits 0-10.  We can minimize that, but not eliminate it.
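If paging that deep is a known requirement, one way to avoid the re-query 
is to skip Hits and use the lower-level TopDocs API, asking for everything 
up front.  A minimal sketch (the 50,000 cutoff mirrors the example above):

  // One search that keeps a queue of the top 50,000 instead of the top 100.
  TopDocs top = searcher.search(query, null, 50000);
  ScoreDoc[] scoreDocs = top.scoreDocs;
  for (int i = 49990; i < 50000 && i < scoreDocs.length; i++) {
    Document d = searcher.doc(scoreDocs[i].doc);   // fetch only the page being shown
  }

The queue cost described above is still paid, but only once.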

Doug


Re: Custom filters & document numbers

2005-03-01 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
Does this happen frequently?  Like Stanislav has been asking... what sort of
operations on the index cause the document number to change for any given
document?
Documents are only re-numbered after there have been deletions.  Once 
there have been deletions, renumbering may be triggered by any document 
addition or index optimization.  Once an index is optimized, no 
renumbering will be performed until more deletions are made.

If the document numbers change frequently, is there a
straightforward way to modify Lucene to keep the document numbers the same for
the life of the document?  I'd like to have mappings in my sql database that
point to the document numbers that Lucene search returns in its Hits objects.
If you require a persistent document id that survives deletions, then 
add it as a field to your documents.
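For example (the "dbId" field name is just illustrative):

  // At index time: store your database key as a keyword field.
  doc.add(Field.Keyword("dbId", String.valueOf(rowId)));

  // At search time: read it back from each hit instead of using doc numbers.
  String dbId = hits.doc(i).get("dbId");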

Doug


Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Stanislav Jordanov wrote:
startTs = System.currentTimeMillis();
dummyMethod(hits.doc(nHits - nHits));
stopTs = System.currentTimeMillis();
System.out.println("Last doc accessed in " + (stopTs -
startTs)
+ "ms");
'nHits - nHits' always equals zero.  So you're actually printing the 
first document, not the last.  The last document would be accessed with 
'hits.doc(nHits)'.  Accessing the last document should not be much 
slower (or faster) than accessing the first.

200+ milliseconds to access a document does seem slow.  Where is your 
index stored?  On a local hard drive?

Doug


Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-03-01 Thread Doug Cutting
Kevin A. Burton wrote:
BTW.. can you define "a bit"...
Merriam-Webster says:
  a bit : SOMEWHAT, RATHER
Is "a bit" 5%?  10%?  Benchmarks would be ncie but I'm not that picky.  
If you want benchmarks, make benchmarks.
I just want to see what performance hits/benefits I could see by 
tweaking the values.
This parameter determines the amount of computation required per query 
term, regardless of the number of documents that contain that term.  In 
particular, it is the maximum number of other terms that must be scanned 
before a term is located and its frequency and position information may 
be processed.  In a large index with user-entered query terms, query 
processing time is likely to be dominated not by term lookup but rather 
by the processing of frequency and positional data.  In a small index or 
when many uncommon query terms are generated (e.g., by wildcard queries) 
term lookup may become a dominant cost.  Benchmarking your application 
is the best way to determine this.

There is no single percentage answer.  There are cases where 99% of the 
query processing is in term lookup and there are cases where 1% of the 
query processing is in term lookup.  Chances are that, with a large 
index and user-entered query terms, only a small percentage of the time 
is spent in term lookup and thus increasing this value somewhat will not 
affect overall performance much.

If you need something more precise than "much" or "a bit", measure it.
Doug


Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-28 Thread Doug Cutting
Chris Hostetter wrote:
 1) If making it mutable requires changes to other classes to propagate
it, then why is it now an instance variable instead of a static?
(Presumably making it an instance variable allows subclasses to
override the value, but if other classes have internal expectations
of the value, that doesn't seem safe)
It's an instance variable because it can vary from instance to instance. 
 This value is specified when an index segment is written, and 
subsequently read from disk and used when reading that segment.  It's an 
instance variable in both the writing and reading code.  The thing 
that's lacking is a way to pass in alternate values to the writing code.

The reason that other classes are involved is that the reading and 
writing code are in non-public classes.  We don't want to expose the 
implementation too much by making these public, but would rather expose 
these as getter/setter methods on the relevant public API.

 2) Should it be configurable through a get/set method, or through a
system property?
(which rehashes the instance/global question)
That's indeed the question.  My guess is that a system property would 
probably be sufficient for most, but perhaps not for all.  Similarly 
with a static setter/getter.  But a getter/setter on IndexWriter would 
make everyone happy.

 3) Is it important that a writer updating an existing index use the same
value as the writer that initially created the index?  If so, should
there really be a "preferedIndexInterval" variable which is mutable,
and a "currentIndexInterval" which is set to the value of the index
currently being updated.  Such that preferedIndexInterval is used when
making an index from scratch and currentIndexInterval is used when
adding segments to an existing index?
It's used whenever an index segment is created.  Index segments are 
created when documents are added and when index segments are merged to 
form larger index segments.  Merging happens frequently while indexing. 
 Optimization merges all segments.

The value can vary in each segment.
The default value is probably good for all but folks with very large 
indexes, who may wish to increase the default somewhat.  Also folks with 
smaller indexes and very high query volumes may wish to decrease the 
default.  It's a classic time/memory tradeoff.  Higher values use less 
memory and make searches a bit slower, smaller values use more memory 
and make searches a bit faster.

Unless there are objections I will add this as:
  IndexWriter.setTermIndexInterval()
  IndexWriter.getTermIndexInterval()
Both will be marked "Expert".
Further discussion should move to the lucene-dev list.
Doug


Re: Boost doesn't work

2005-02-28 Thread Doug Cutting
Claude Libois wrote:
The explanation given by the IndexSearcher indicates that the boost of my
title is
1.0 where it should be 10.0.
I really don't understand what's wrong.
You're seeing the boost for the query term, not the boost for the 
document's field.  The boost for the field in the document is multiplied 
by its lengthNorm.  This product is displayed in explanations as the 
"fieldNorm".

Doug


Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-25 Thread Doug Cutting
Kevin A. Burton wrote:
What's the desired pattern for using TermInfosWriter.indexInterval?
There isn't one.  It is not a part of the public API.  It is an 
unsupported internal feature.

Do I have to compile my own version of Lucene to change this?
Yes.
The last 
API was public static final but this is not public nor static.
It was never public.  It used to be static and final, but is now an 
instance variable.

I'm wondering if we should just make this a value that can be set at 
runtime.  Considering the memory savings for larger installs this 
can/will be important.
The place to put getter/setters would be IndexWriter, since that's the 
public home of all other index parameters.  Some changes to 
DocumentWriter and SegmentMerger would be required to pass this value 
through to TermInfosWriter from IndexWriter.

Doug


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
Is this setting incompatible with older indexes burned with the lower 
value?
Prior to 1.4, yes.  After 1.4, no.
What happens after 1.4?  Can I take indexes burned with 256 (a greater 
value) in 1.3 and open them up correctly with 1.4?
Not without hacking things.  If your 1.3 indexes were generated with 256 
then you can modify your version of Lucene 1.4+ to use 256 instead of 
128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today).

Prior to 1.4 this was a constant, hardwired into the index format.  In 
1.4 and later each index segment stores this value as a parameter.  So 
once 1.4 has re-written your index you'll no longer need a modified version.

Doug


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.
It looks like you're using a pre-1.4 version of Lucene.  Since 1.4 this 
is no longer called TermInfosWriter.INDEX_INTERVAL, but rather 
TermInfosWriter.indexInterval.

Is this setting incompatible with older indexes burned with the lower 
value?
Prior to 1.4, yes.  After 1.4, no.
Doug


Re: Javadoc error?

2005-02-23 Thread Doug Cutting
Mark Woon wrote:
The javadoc for Field.setBoost() claims:
"The boost is multiplied by |Document.getBoost()| 
 
of the document containing this field. If a document has multiple fields 
with the same name, all such values are multiplied together."

However, from what I can tell from IndexSearcher.explain(), multiple 
fields with the same name have their boost values added together.  It 
might very well be that I'm misinterpreting what I'm seeing from 
explain(), but if I'm not, then either the javadoc is wrong or there's a 
bug somewhere...

Does anyone know which way it's actually supposed to work?
Boosts for multiple fields with the same name in a document are 
multiplied together at index time to form the boost for that field of 
that document.  At search time, if multiple query terms from the same 
field match the same document, then that document's field boost is 
multiplied into the score for both terms, and these scores are then 
added.  If boost(field,doc) is the boost, and raw(term,doc) is the raw, 
unboosted score (I'm simplifying things) then the score for a two term 
query is something like:

  boosted(q,d) =
boost(t1.field,d)*raw(t1,d) + boost(t2.field,d)*raw(t2,d)
which, when t1 and t2 are in the same field, is equivalent to:
  boosted(q,d) = boost(field,d)*(raw(t1,d) + raw(t2,d))
The explain() feature prints things in the first form, where the boosts 
appear in separate components of a sum.

Does that help?
Doug


Re: Iterate through all the document ids in the index?

2005-02-21 Thread Doug Cutting
William Lee wrote:
is there a simple and
fast way to get a list of document IDs through the lucene index?  

I can use a loop to iterate from 0 to IndexReader.maxDoc and
check whether the document id is valid through
IndexReader.document(i), but this would imply that I have to
retrieve the documents fields.
Use IndexReader.isDeleted() to check if each id is valid.  This is quite 
fast.
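A sketch of the loop:

  IndexReader reader = IndexReader.open("index");
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (!reader.isDeleted(i)) {
      // i is a valid document id; no fields are fetched
    }
  }
  reader.close();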

Doug


Re: Concurrent searching & re-indexing

2005-02-17 Thread Doug Cutting
Paul Mellor wrote:
I've read from various sources on the Internet that it is perfectly safe to
simultaneously search a Lucene index that is being updated from another
Thread, as long as all write access to the index is synchronized.  But does
this apply only to updating the index (i.e. deleting and adding documents),
or to a complete re-indexing (i.e. create a new IndexWriter with the
'create' argument true and then re-add all the documents)?
[ ...]
java.io.IOException: couldn't delete _a.f1
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
[...]
This is running on Windows 2000.
On Windows one cannot delete a file while it is still open.  So, no, on 
Windows one cannot remove an index entirely while an IndexReader or 
Searcher is still open on it, since it is simply impossible to remove 
all the files in the index.

We might attempt to patch this by keeping a list of such files and 
attempting to delete them later (as is done when updating an index).  But 
this could cause problems, as a new index will eventually try to use 
these same file names again, and it would then conflict with the open 
IndexReader.  This is not a problem when updating an existing index, 
since filenames (except for a few which are not kept open, like 
"segments") are never reused in the lifetime of an index.  So, in order 
for such a fix to work we would need to switch to globally unique 
segment names, e.g., long random strings, rather than increasing integers.

In the meantime, the safe way to rebuild an index from scratch while 
other processes are reading it is simply to delete all of its documents, 
then start adding new ones.
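A rough sketch of that approach (exception handling omitted; the analyzer 
is whatever your application already uses):

  // Delete every existing document, then re-add the new ones in place.
  IndexReader reader = IndexReader.open("index");
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (!reader.isDeleted(i)) {
      reader.delete(i);
    }
  }
  reader.close();

  IndexWriter writer = new IndexWriter("index", analyzer, false);  // false: don't re-create
  // add the new documents here with writer.addDocument(doc)
  writer.optimize();
  writer.close();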

Doug


Re: Opening up one large index takes 940M or memory?

2005-02-15 Thread Doug Cutting
Kevin A. Burton wrote:
1.  Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing "target" index which I assume will re-use the same 
settings as before?  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new empty 
directory and I'll try that tonight.
You need to re-write the entire index using a modified 
TermInfosWriter.java.  Optimize rewrites the entire index but is 
destructive.  Merging into a new empty directory is a non-destructive 
way to do this.

2. This isn't destructive is it?  I mean I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128 right?
Yes, you can go back if you re-optimize or re-merge again.
Also, there's no need to CC my personal email address.
Doug


Re: Lucene "cuts" the search results ?

2005-02-15 Thread Doug Cutting
markharw00d wrote:
The highlighter uses a number of "pluggable" services, one of which is the
choice of "Fragmenter" implementation. This interface is for classes which
decide the boundaries where to cut the original text into snippets. The 
default
implementation used simply breaks up text into evenly sized chunks. A more
intelligent implementation could be made to detect sentence boundaries.
Also note that paragraph boundaries alone would help a lot and are 
easier to reliably detect.
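A sketch of such a Fragmenter, assuming the contrib highlighter's 
Fragmenter interface (start(String) / isNewFragment(Token), the one 
SimpleFragmenter implements); it treats a blank line as a paragraph break:

  public class ParagraphFragmenter implements Fragmenter {
    private String text;
    private int lastOffset;

    public void start(String originalText) {
      text = originalText;
      lastOffset = 0;
    }

    public boolean isNewFragment(Token token) {
      // New fragment if a blank line falls between the previous token and this one.
      int brk = text.indexOf("\n\n", lastOffset);
      boolean isNew = brk >= 0 && brk < token.startOffset();
      lastOffset = token.endOffset();
      return isNew;
    }
  }

It would be installed with highlighter.setTextFragmenter(new ParagraphFragmenter()).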

Doug


Re: new segment for each document

2005-02-10 Thread Doug Cutting
Daniel Naber wrote:
On Thursday 10 February 2005 22:27, Ravi wrote:
I tried setting the minMergeFactor on the writer to one. But
it did not work.
I think there's an off-by-one bug so two is the smallest value that works 
as expected.
You can simply create a new IndexWriter for each add and then close it. 
 IndexWriter is pretty lightweight, so this shouldn't have too much 
overhead.
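A minimal sketch:

  // One writer per added document; close() flushes the new segment.
  IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
  writer.addDocument(doc);
  writer.close();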

Doug


Re: Reconstruct segments file?

2005-02-07 Thread Doug Cutting
Ian Soboroff wrote:
Speaking of Counter, I have a dumb question.  If the segments are
named using an integer counter which is incremented, what is the point
in converting that counter into a string for the segment filename?
Why not just name the segments e.g. "1.frq", etc.?
The names are prefixed with an underscore, since it turns out that some 
filesystems have trouble (DOS?) with certain all-digit names.  Other 
than that, they are integers, just with a large radix.

Doug


Re: Reconstruct segments file?

2005-02-04 Thread Doug Cutting
Ian Soboroff wrote:
I've looked over the file formats web page, and poked at a known-good
segments file from a separate, similar index using od(1) and such.  I
guess what I'm not sure how to do is to recover the SegSize from the
segment I have.
The SegSize should be the same as the length in bytes of any of the 
.f[0-9]+ files in the segment.  If your segment is in compound format 
then you can use IndexReader.main() in the current SVN version to list 
the files and sizes in the .cfs file, including its contained .f[0-9]+ 
files.

Doug


Re: Disk space used by optimize

2005-01-31 Thread Doug Cutting
Yura Smolsky wrote:
There is a big difference when you use the compound index format or
multiple files. I have tested it on a big index (45 GB). When I used
the compound format, optimize took 3 times more space, because the *.cfs needs
to be unpacked.
Now I use the non-compound file format. It needs about twice as much
disk space.
Perhaps we should add something to the javadocs noting this?
Doug


Re: Sort Performance Problems across large dataset

2005-01-27 Thread Doug Cutting
Peter Hollas wrote:
Currently we can issue a simple search query and expect a response back 
in about 0.2 seconds (~3,000 results) with the Lucene index that we have 
built. Lucene gives a much more predictable and faster average query 
time than using standard fulltext indexing with mySQL. This however 
returns results in score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst querying. 
This solution works fine, but is unacceptable since a query that returns 
thousands of results can take upwards of 30 seconds to sort them.
Are you using a Lucene Sort?  If you reuse the same IndexReader (or 
IndexSearcher) then perhaps the first query specifying a Sort will take 
30 seconds (although that's much slower than I'd expect), but subsequent 
searches that sort on the same field should be nearly as fast as results 
sorted by score.
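A sketch, assuming the species name is indexed as a keyword field called 
"species":

  // Open once and keep it; the sort cache is built per IndexReader.
  IndexSearcher searcher = new IndexSearcher("index");
  Sort bySpecies = new Sort("species");
  Hits hits = searcher.search(query, bySpecies);

The first sorted search pays the cost of filling the field cache; later 
sorted searches on the same field against the same searcher reuse it.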

Doug


Re: Opening up one large index takes 940M or memory?

2005-01-27 Thread Doug Cutting
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?
You can increase TermInfosWriter.indexInterval.  You'll need to re-write 
the .tii file for this to take effect.  The simplest way to do this is 
to use IndexWriter.addIndexes(), adding your index to a new, empty, 
directory.  This will of course take a while for a 60GB index...
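The merge itself looks roughly like this (directory names are illustrative; 
the interval change only takes effect if the modified TermInfosWriter is on 
the classpath when this runs):

  Directory src = FSDirectory.getDirectory("old-index", false);
  Directory dest = FSDirectory.getDirectory("new-index", true);
  IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(), true);
  writer.addIndexes(new Directory[] { src });   // rewrites all segments, including .tii
  writer.close();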

Doubling TermInfosWriter.indexInterval should halve the Term memory usage 
and double the time required to look up terms in the dictionary.  With 
an index this large the latter is probably not an issue, since 
processing term frequency and proximity data probably overwhelmingly 
dominate search performance.

Perhaps we should make this public by adding an IndexWriter method?
Also, you can list the size of your .tii file by using the main() from 
CompoundFileReader.

Doug


Re: ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Doug Cutting
Ryan Aslett wrote:
What I found was that for queries with one term (First Name), the large
index beat the multiple indexes hands down (280 Queries/per second vs
170 Q/s).
But for queries with multiple terms (Address), the multiple indexes beat
out the Large index. (26 Q/s vs 16 Q/s)
Btw, I'm running these on a 2 proc box with 16GB of ram.
So what I'm trying to determine is whether there are some equations out there
that can help me find the sweet spot for splitting my indexes.
What appears to be the bottleneck, CPU or i/o?  Is your test system 
multi-threaded?  I.e., is it attempting to execute many queries in 
parallel?  If you're CPU-bound then a single index should be fastest. 
Are you using compound format?  If you're i/o-bound, the non-compound 
format may be somewhat faster, as it permits more parallel i/o.  Is the 
index data on multiple drives?  If you're i/o bound then it should be 
faster to use multiple drives.  To permit even more parallel i/o over 
multiple drives you might consider using a pool of IndexReaders.  That 
way, with, e.g., striped data, each could be simultaneously reading 
different portions of the same file.

Doug


Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Doug Cutting
David Spencer wrote:
Isn't "ZipDirectory" the thing to search for?
I think it's actually URLDirectory:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02453.html
Doug


Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Doug Cutting
Miles Barr wrote:
You'll have to implement org.apache.lucene.store.Directory to load the
index from the JAR file. Take a look at FSDirectory and RAMDirectory for
some more details.
Then you have to either load the JAR file with java.util.jar.JarFile to get
to the files or you can use Classloader#getResourceAsStream to get to
them.
The problem is that a jar file entry becomes an InputStream, but 
InputStream is not random access, and Lucene requires random access.  So 
you need to extract the index either to disk or RAM in order to get 
random access.  I think folks have posted code for this to the list 
previously.
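A rough sketch of the extraction approach (the jar name and the "index/" 
entry prefix are hypothetical; exception handling omitted):

  File tmp = new File(System.getProperty("java.io.tmpdir"), "unpacked-index");
  tmp.mkdirs();
  JarFile jar = new JarFile("app.jar");
  for (Enumeration e = jar.entries(); e.hasMoreElements();) {
    JarEntry entry = (JarEntry) e.nextElement();
    if (entry.isDirectory() || !entry.getName().startsWith("index/")) continue;
    InputStream in = jar.getInputStream(entry);
    OutputStream out = new FileOutputStream(
        new File(tmp, entry.getName().substring("index/".length())));
    byte[] buf = new byte[4096];
    for (int n; (n = in.read(buf)) != -1;) out.write(buf, 0, n);
    out.close();
    in.close();
  }
  IndexSearcher searcher = new IndexSearcher(tmp.getPath());   // random access restored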

Doug


Re: stop words and index size

2005-01-14 Thread Doug Cutting
David Spencer wrote:
Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.
The unstopped version is indeed bigger and slower to build, but it's 
only slower to search when folks search on stop words.  One approach to 
minimizing stopwords in searches (used by, e.g. Nutch & Google) is to 
index all stop words but remove them from queries unless they're (a) in 
a phrase or (b) explicitly required with a "+".  (It might be nice if 
Lucene included a query parser that had this feature.)

Nutch also optimizes phrase searches involving a few very common stop 
words (e.g., "the", "a", "to") by indexing these as bigrams and 
converting phrases involving them to bigram phrases.  So, if someone 
searches for "to be or not to be" then this turns into a search for 
"to-be be or not-to to-be" which is considerably faster since it 
involves rarer terms.  But the more words you bigram the bigger the 
index gets and the slower updates get, so you probably can't afford to 
do this for your full stop list.  (It might be nice if Lucene included 
support for this technique too!)

Doug


Re: How do I unlock?

2005-01-11 Thread Doug Cutting
Joseph Ottinger wrote:
As one for whom the question's come up recently, I'd say that locks need
to be terminated gracefully, instead. I've noticed a number of cases where
the locks get abandoned in exceptional conditions, which is almost exactly
what you don't want.
The problem is that this is hard to do from Java.  A typical approach is 
to put the process id in the lock file, then, if that process is dead, 
ignore the lock file.  But Java does not let one know process ids.  Java 
1.4 provides a FileLock mechanism which should mostly solve this, but 
Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that 
feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to 
do a better job of automatically unlocking indexes when processes die.
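For reference, the JDK 1.4 facility in question looks roughly like this (a 
sketch of the mechanism, not of Lucene's own locking):

  // An OS-level lock that is released automatically when the holding process dies.
  RandomAccessFile raf = new RandomAccessFile(new File("some.lock"), "rw");
  FileLock lock = raf.getChannel().tryLock();
  if (lock == null) {
    // another live process holds the lock
  } else {
    try {
      // guarded work
    } finally {
      lock.release();
      raf.close();
    }
  }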

Doug


Re: multi-threaded thru-put in lucene

2005-01-06 Thread Doug Cutting
John Wang wrote:
Is the operation IndexSearcher.search I/O or CPU bound if I am doing
100's of searches on the same query?
CPU bound.
Doug


Re: multi-threaded thru-put in lucene

2005-01-06 Thread Doug Cutting
John Wang wrote:
1 thread: 445 ms.
2 threads: 870 ms.
5 threads: 2200 ms.
Pretty much the same numbers you'd get if you are running them sequentially.
Any ideas? Am I doing something wrong?
If you're performing compute-bound work on a single-processor machine 
then threading should give you no better performance than sequential, 
perhaps a bit worse.  If you're performing io-bound work on a 
single-disk machine then threading should again provide no improvement. 
 If the task is evenly compute and i/o bound then you could achieve at 
best a 2x speedup on a single CPU system with a single disk.

If you're compute-bound on an N-CPU system then threading should 
optimally be able to provide a factor of N speedup.

Java's scheduling of compute-bound threads when no threads call 
Thread.sleep() can also be very unfair.

Doug


Re: 1.4.3 breaks 1.4.1 QueryParser functionality

2005-01-05 Thread Doug Cutting
Bill Janssen wrote:
Sure, if I wanted to ship different code for each micro-release of
Lucene (which, you might guess, I don't).  That signature doesn't
compile with 1.4.1.
Bill, most folks bundle appropriate versions of required jars with their 
applications to avoid this sort of problem.  How are you deploying 
things?  Are you not bundling a compatible version of the lucene jar 
with each release of your application?  If not, why not?

I'm not trying to be difficult, just trying to understand.
Thanks,
Doug


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Doug Cutting wrote:
You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).
Much simpler would be to build a SpanNearQuery, call getSpans(), then 
loop, counting how many times Spans.next() returns true.
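A sketch of the counting loop (field and terms are illustrative):

  SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("contents", "computer")),
    new SpanTermQuery(new Term("contents", "dog"))
  };
  SpanNearQuery near = new SpanNearQuery(clauses, 50, false);  // slop 50, any order
  Spans spans = near.getSpans(reader);
  int occurrences = 0;
  while (spans.next()) {
    occurrences++;           // one count per matching span, not per document
  }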

Doug


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Andrew Cunningham wrote:
"computer dog"~50 looks like what I'm after - now is there someway I can 
call this and pull
out the number of total occurances, not just the number of documents 
hits? (say if computer
and dog occur near each other several times in the same document).
You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).

Doug


Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
Steve Rajavuori wrote:
There are around 20 million documents in the orphaned segments, so it would
take a very long time to update the index. Is there an "unsafe" way to edit
the segments file to add these back? It seems like the missing piece of
information I need to do this is the correct segment size -- where can I
find that?
Do the CFS and non-CFS segment names correspond?  If so, then it 
probably crashed after the segment was complete, but perhaps before it 
was packed into a CFS file.  So I'd trust the non-CFS stuff first.  And 
it's easy to see the size of a non-CFS segment: it's just the number of 
bytes in each of the .f* files.

Doug


Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
Steve Rajavuori wrote:
1) First of all, there are both CFS files and standard (non-compound) files
in this directory, and all of them have recent update dates, so I assume
they are all being used. My code never explicitly sets the compound file
flag, so I don't know how this happened.
This can happen if your application crashed while the index was being 
updated.  In this case these were never entered into the segments file 
and may be partially written.

2) Is there a way to force all files into compound mode? For example, if I
set the compound setting, then call optimize, will that recreate everything
into the CFS format?
It should.  Except, on Windows not all old CFS files will be deleted 
immediately; they may instead be listed in the 'deletable' file for a while.

3) There are several other large .CFS files in this directory that I think
have somehow become detached from the index. They have recent update dates
-- however, the last time I ran optimize these were not touched, and they
are not being updated now. I know these segments have valid data, because
now when I search I am missing large chunks of data -- which I assume is in
these detached segments. So my thought is to edit the 'segments' file to
make Lucene recognize these again -- but I need to know the correct segment
size in order to do this. So how do I determine what the correct segment
size should be?
These could also be the result of crashes.  In this case they may be 
partially written.

The safest approach is to remove files not mentioned in the segments 
file and update the index with the missing documents.  How does your 
application recover if it crashes during an update?

Doug


Re: To Sort or not to Sort

2004-12-16 Thread Doug Cutting
Scott Smith wrote:
1.	Simply use the built-in lucene sort functionality, cache the hit
list and then page through the list.  Adv: looks pretty 
straightforward, I write less code.  Dis: for searches that return a large
number of hits (having a search return several hundred to a few thousand
hits is not uncommon), Lucene is sorting a lot of entries that don't
really need to be sorted (because the user will never look at them) and
sorting tends to be expensive.
2.	The other solution uses a priority heap to collect the top N (or
next N) entries.  I still have to walk the entire hit list, but keeping
entries in a priority heap means I can determine the N entries I need
with a few comparisons and minimal sorting.  I don't have to sort a
bunch of entries whose order I don't care about.  Additionally, I don't
have to have all of the entries in memory at one time.  The big
disadvantage with this is that I have to write more code.  However, it
may be worth it if the performance difference is large enough. 
Lucene's built-in sorting code already performs the optimization you 
describe as (2).  So don't bother re-inventing it!

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
"Doug Cutting" and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) documents #3 and #4 are both equally relevant to "Doug Cutting"
If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1
...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good)
However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
 relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
 "Doug Cutting"
That's right.  Thanks for the nice description of the issue.
I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x & y.
And I am not convinced that, with the changes Chuck describes, one can 
be any more confident of x and y.

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.
Right, but the question is, would a single score threshold be effective 
for all queries, or would one need a separate score threshold for each 
query?  My hunch is that the latter is better, regardless of the scoring 
algorithm.

Also, just because Lucene's default scoring does not guarantee scores 
between zero and one does not necessarily mean that these scores are 
less "meaningful".

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize.  The pure vector space model implements a cosine in the strictly positive sector of the coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., "0.8" means something about the result quality independent of the query).
I question whether such scores are more meaningful.  Yes, such scores 
would be guaranteed to be between zero and one, but would 0.8 really be 
meaningful?  I don't think so.  Do you have pointers to research which 
demonstrates this?  E.g., when such a scoring method is used, that 
thresholding by score is useful across queries?

Doug


Re: java.io.FileNotFoundException: ... (No such file or directory)

2004-12-08 Thread Doug Cutting
Justin Swanhart wrote:
The indexes are located on a NFS mountpoint. Could this be the
problem?
Yes.  Lucene's lock mechanism is designed to keep this from happening, 
but the sort of lock files that FSDirectory uses are known to be broken 
with NFS.

Doug


Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-06 Thread Doug Cutting
Chuck Williams wrote:
I've got about 30k documents and have 3 indexing scenarios:
1.   Full indexing and optimize
2.   Incremental indexing and optimize
3.   Parallel incremental indexing without optimize
Search performance is critical.  For both cases 1 and 2, I'd like the
fastest possible indexing time.  For case 3, I'd like minimal pauses and
no noticeable degradation in search performance.
 

Based on reading the code (including the javadocs comments), I'm
thinking of values along these lines:
mergeFactor:  1000 during Full indexing, and during optimize (for both
cases 1 and 2); 10 during incremental indexing (cases 2 and 3)
1000 is too big of a mergeFactor for any practical purpose.
I don't see a point in using different mergeFactors in cases 1 and 2. 
If you're going to optimize before you search, then you want the fastest 
batch indexing mode.  I would use something like 50 for both cases 1 and 2.

For case 3, where unoptimized search performance is very important, I 
would use something smaller than 10.  For Technorati's blog search, 
which incrementally maintains a Lucene index with millions of documents, 
I used a mergeFactor of 2 in order to maximize search performance. 
Indexing performance on a single CPU is still adequate to keep up with 
the rate of change of today's blogosphere.

minMergeDocs:  1000 during Full indexing, 10 during incremental indexing
I see no reason to lower this when indexing incrementally.  1000 is a 
good value for high performance indexing when RAM is plentiful and 
documents are not too large.

maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
incremental indexing
1000 seems low to me, as it will result in too many segments, slowing 
search.  Here one should select the largest value that can be merged in 
the maximum time delay permitted in your application between a new 
document arriving and it appearing in search results.  So how up-to-date 
must your index be?  If it's okay for it to occasionally be a few 
minutes out of date, then you can probably safely increase this to at 
least tens or hundreds of thousands, perhaps even millions.  When 
incrementally indexing, the most recently added segments stay cached in 
RAM by the filesystem.  So, on a system with a gigabyte of RAM that's 
dedicated to incremental indexing, you might safely set maxMergeDocs to 
account for a few hundred megabytes of index without encountering slow, 
i/o-bound merges.

Since mergeFactor is used in both addDocument() and optimize(), I'm
thinking of using two different values in case 2:  10 during the
incremental indexing, and then 1000 during the optimize.  Is changing
the value like this going to cause a problem?
It should not cause problems to use different mergeFactors at different 
times.

Doug


Re: root site query results

2004-12-06 Thread Doug Cutting
In web search, link information helps greatly.  (This was Google's big 
discovery.)  There are lots more links that point to 
http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and 
many (if not most) of these links have the term "slashdot", while links 
to http://www.slashdot.org/xxx/yyy are somewhat less likely to contain 
the term "slashdot".

As Erik hinted, Nutch uses this information.  It keeps a database of 
links that point to each page, indexes their anchor text along with the 
page, and boosts highly linked pages more than lesser linked pages.

Doug
Chris Fraschetti wrote:
My lucene implementation works great; it's basically an index of many
web crawls. The main thing my users complain about is that, say, a search for
"slashdot" will return
http://www.slashdot.org/soem_dir/somepage.asp as the top result
because the factors I have scoring it determine it as so... but
obviously, in true search engine fashion, I would like
http://www.slashdot.org/ to be the very top result. I've added a
boost to queries that match the hostname field, which helped a little,
but obviously not a proper solution. Does anyone out there in the
search engine world have a good scheme for determining root websites
and applying a huge boost to them in one fashion or another, mainly so
they appear before any sub-pages (assuming the query is in reference
to that site)?


Re: Too many open files issue

2004-11-26 Thread Doug Cutting
John Wang wrote:
In the Lucene code, I don't see where the reader specified when
creating a field is closed. That holds on to the file.
I am looking at DocumentWriter.invertDocument()
It is closed in a finally clause on line 170, when the TokenStream is 
closed.

Doug


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Doug Cutting
Hoss wrote:
The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)
+1
DateFilter could be deprecated, and replaced with the more generally and 
appropriately named RangeFilter.  Should we also deprecate DateField, in 
preference for DateTools?

Doug


Re: document ID and performance

2004-11-16 Thread Doug Cutting
Yan Pujante wrote:
I want to run a very fast search that simply returns the matching 
document id. Is there any way to associate the document id returned in 
the hit collector to the internal document ID stored in the index ? 
Anybody has any idea how to do that ? Ideally you would want to be able 
to write something like this:

document.add(Field.ID(documentID));
and then in the HitCollector API:
collect(String documentID, float score) with the documentID being the 
one you stored (but which would be returned very efficiently)
Have a look at:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
In your HitCollector, access an array, from the field cache, that maps 
Lucene ids to your ids.
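A sketch (the "dbId" field name is illustrative; exception handling omitted):

  final IndexReader reader = IndexReader.open("index");
  IndexSearcher searcher = new IndexSearcher(reader);
  // one array, filled once per reader: Lucene doc number -> your id
  final String[] ids = FieldCache.DEFAULT.getStrings(reader, "dbId");
  searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
      String externalId = ids[doc];   // no document fetch needed
      // record (externalId, score)
    }
  });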

Doug


Re: Backup strategies

2004-11-16 Thread Doug Cutting
Christoph Kiehl wrote:
I'm curious about your strategy to backup indexes based on FSDirectory. 
If I do a file based copy I suspect I will get corrupted data because of 
concurrent write access.
My current favorite is to create an empty index and use 
IndexWriter.addIndexes() to copy the current index state. But I'm not 
sure about the performance of this solution.

How do you make your backups?
A safe way to backup is to have your indexing process, when it knows the 
index is stable (e.g., just after calling IndexWriter.close()), make a 
checkpoint copy of the index by running a shell command like "cp -lpr 
index index.YYYMMDDHHmmSS".  This is very fast and requires little disk 
space, since it creates only a new directory of hard links.  Then you 
can separately back this up and subsequently remove it.

This is also a useful way to replicate indexes.  On the master indexing 
server periodically perform "cp -lpr" as above.  Then search slaves can 
use rsync to pull down the latest version of the index.  If a very small 
mergefactor is used (e.g., 2) then the index will have only a few 
segments, so that searches are fast.  On the slave, periodically find 
the latest index.YYYMMDDHHmmSS, use "cp -lpr index/ index.YYYMMDDHHmmSS" 
and 'rsync --delete master:index.YYYMMDDHHmmSS index.YYYMMDDHHmmSS' to 
efficiently get a local copy, and finally "ln -fsn index.YYYMMDDHHmmSS 
index" to publish the new version of the index.

Doug


Re: Search speed

2004-11-02 Thread Doug Cutting
Jeff Munson wrote:
Single word searches return pretty fast, but when I try phrases,
searching seems to slow considerably. [ ... ]
However, if I use this query, contents:"all parts including picture tube
guaranteed", it returns hits in 2890 millseconds.  Other phrases take
longer as well.  
You could use an analyzer that inserts bigrams for common terms.  Nutch 
does this.  So, if you declare that "all" and "including" are common 
terms, then this could be tokenized as the following tokens:

0 - all all.parts
1 - parts parts.including
2 - including including.picture
3 - picture
4 - tube
5 - guaranteed
Two tokens at a position indicate that the second has a position 
increment of zero.

Then your phrase search could be converted to:
  "all.parts parts.including including.picture picture tube guaranteed"
which should be much faster, since it has replaced common terms with 
rare terms.

This approach does make the index larger, and hence makes indexing 
somewhat slower.  So you don't want to declare too many words as common, 
but a handful can make a big difference if they're used frequently in 
queries.

Doug


Re: sorting and score ordering

2004-10-13 Thread Doug Cutting
Paul Elschot wrote:
Along with that, is there a simple way to assign a new scorer to the
searcher? So I can use the same lucene algorithm for my hits, but
tweak it a little to fit my needs?

There is no one-to-one relationship between a searcher and a scorer.
But you can use a different Similarity implementation with each Searcher.
Doug


Re: locking problems

2004-10-08 Thread Doug Cutting
Aad Nales wrote:
1. can I have one or multiple searchers open when I open a writer?
2. can I have one or multiple readers open when I open a writer?
Yes, with one caveat: if you've called the IndexReader methods delete(), 
undelete() or setNorm() then you may not open an IndexWriter until 
you've closed that IndexReader instance.

In general, only a single object may modify an index at once, but many 
may access it simultaneously in a read-only manner, including while it 
is modified.  Indexes are modified by either an IndexWriter or by the 
IndexReader methods delete(), undelete() and setNorm().

Typically an application which modifies and searches simultaneously 
should keep the following open:

  1. A single IndexReader instance used for all searches, perhaps 
opened via an IndexSearcher.  Periodically, as the index changes, this 
is discarded, and replaced with a new instance.

  2. Either:
 a. An IndexReader to delete documents; or
 b. An IndexWriter to add documents.
So an updating thread might open (2a), delete old documents, close it, 
then open (2b), add new documents, perhaps optimize, then close.  At this 
point, when the index has been updated (1) can be discarded and replaced 
with a new instance.  Typically the old instance of (1) is not 
explicitly closed, rather the garbage collector closes it when the last 
thread searching it completes.
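In outline (the "id" field name and the analyzer are illustrative):

  // (2a) delete the old copies of changed documents
  IndexReader deleter = IndexReader.open("index");
  deleter.delete(new Term("id", changedId));
  deleter.close();

  // (2b) add the new copies
  IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
  writer.addDocument(newDoc);
  writer.close();

  // (1) open a replacement searcher; leave the old one to the garbage collector
  IndexSearcher fresh = new IndexSearcher("index");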

Doug


Re: Sort regeneration in multithreaded server

2004-10-08 Thread Doug Cutting
Stephen Halsey wrote:
I was wondering if anyone could help with a problem (or should that be
"challenge"?) I'm having using Sort in Lucene over a large number of records
in multi-threaded server program on a continually updated index.
I am using lucene-1.4-rc3.
A number of bugs with the sorting code have been fixed since that 
release.  Can you please try with 1.4.2 and see if you still have the 
problem?  Thanks.

Doug


Re: multifield-boolean vs singlefield-enum query performance

2004-10-07 Thread Doug Cutting
Tea Yu wrote:
For the following implementations:
1) storing boolean strings in fields X and Y separately
2) storing the same info in a field XY as 4 enums: X, Y, B, N meaning only X
is True, only Y is True, both are True or both are False
Is there significant performance gain when we substitute "X:T OR Y:T" by
"XY:B", while significant loss in "X:T" by "XY:X OR XY:B"?  Or are they
negligible?
As with most performance questions, it's best to try both and measure! 
It depends on the size of your index, the relative frequencies of X and 
Y, etc.

Doug


new release: 1.4.2

2004-10-01 Thread Doug Cutting
There's a new release of Lucene, 1.4.2, which mostly fixes bugs in 
1.4.1.  Details are at http://jakarta.apache.org/lucene/.

Doug


Re: removing duplicate Documents from Hits

2004-10-01 Thread Doug Cutting
Timm, Andy (ETW) wrote:
Hello, I've searched on previous posts on this topic but couldn't find an answer.  I want to query my index (which are a number of 'flattened' Oracle tables) for some criteria, then return Hits such that there are no Documents that duplicate a particular field.  In the case where table A has a one-to-many relationship to table B, I get one Document for each (A1-B1, A1-B2, A1-B3...).  My index needs to have each of these records as 'B' is a searchable field in the index.  However, after the query is executed, I want my resulting Hits on be unique on 'A'.  I'm only returning the Oracle object ID, so once I've seen it once I don't need it again.  It looks like some sort of custom Filter is in order.
I'd suggest a HitCollector that uses a FieldCache of the "A" values to 
check for duplicates, and collect only the best document id for each 
value of "A".  This would use a bit of RAM, but be very fast.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
Doug


Re: problem with get/setBoost of document fields

2004-09-29 Thread Doug Cutting
Bastian Grimm [Eastbeam GmbH] wrote:
that works... but I have to do this setNorm() for each document that 
has been indexed up to now, right? There are around 1 million docs in 
the index... I don't think it's a good idea to perform a search and do it 
for every doc (and every field of the doc...).
Is there any possibility to do something like setNorm(alldocs, 
"fieldX", 2.0f) - a global boost for a named field for every doc?
setNorm() is quite fast.  Calling it 1M times will not take long.
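For example (exception handling omitted):

  IndexReader reader = IndexReader.open("index");
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (!reader.isDeleted(i)) {
      reader.setNorm(i, "fieldX", 2.0f);   // replaces the stored norm for this field
    }
  }
  reader.close();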
a last question: Lucene creates some .f[1-9] files after setNorm() has 
finished. Do these files remain in this folder all the time? I tried to 
optimize and so on but nothing happened.
If you add or remove documents and optimize then these will go away.
Doug


Re: Shouldnt IndexWriter.flushRamSegments() be public? or at least protected?

2004-09-28 Thread Doug Cutting
Christian Rodriguez wrote:
Now the problem I have is that I don't have a way to force a flush of
the IndexWriter without closing it, and I need to do that before
committing a transaction or I would get random errors. Shouldn't that
function be public, in case the user wants to force a flush at some
point that is not when the IndexWriter is closed? If not I am forced
to create a new IndexWriter and close it EVERY TIME I commit a
transaction (which in my application is very often).
Opening and closing IndexWriters should be a lightweight operation. 
Have you tried this and found it to be too slow?  A flush() would have 
to do just about the same work.

Doug


Re: Document contents split among different Fields

2004-09-23 Thread Doug Cutting
Greg Langmead wrote:
Am I right in saying that the design of Token's support for highlighting
really only supports having the entire document stored as one monolithic
"contents" Field?
No, I don't think so.
Has anyone tackled indexing multiple content Fields
before that could shed some light?
Do you need highlights from all fields?  If so, then you can use:
  TextFragment[] getBestTextFragments(TokenStream, ...);
with a TokenStream for each field, then select the highest scoring 
fragments across all fields.  Would that work for you?

Doug


Re: demo HTML parser question

2004-09-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
We were originally attempting to use the demo html parser (Lucene 1.2), but as
you know, it's for a demo.  I think it's threaded to optimize on time, to allow
the calling thread to grab the title or top message even though it's not done
parsing the entire html document.
That's almost right.  I originally wrote it that way to avoid having to 
ever buffer the entire text of the document.  The document is indexed 
while it is parsed.  But, as observed, this has lots of problems and was 
probably a bad idea.

Could someone provide a patch that removes the multi-threading?  We'd 
simply use a StringBuffer in HTMLParser.jj to collect the text.  Calls 
to pipeOut.write() would be replaced with text.append().  Then have the 
HTMLParser's constructor parse the page before returning, rather than 
spawn a thread, and getReader() would return a StringReader.  The public 
API of HTMLParser need not change at all and lots of complex threading 
code would be thrown away.  Anyone interested in coding this?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: problem with get/setBoost of document fields

2004-09-23 Thread Doug Cutting
You can change field boosts without re-indexing.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte)
Doug
Bastian Grimm [Eastbeam GmbH] wrote:
thanks for your reply, eric.
So am I right that it's not possible to change the boost without 
reindexing all files? That's not good... or is it OK to only change the 
boosts and optimize the index for the changes to take effect?

if not, will i be able to boost those fields in the searcher?
thanks, bastian
-
The boost is not thrown away, but rather combined with the length 
normalization factor during indexing.  So while your actual boost value 
is not stored directly in the index, it is taken into consideration for 
scoring appropriately.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Running OutOfMemory while optimizing and searching

2004-09-17 Thread Doug Cutting
John Z wrote:
We have indexes of around 1 million docs and around 25 searchable fields.
We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory footprint is around 60-70 MB.
 
Then when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory kind of goes up. 
 
Doug, 
 
Your calculation below on what is taken up by the searcher, does it take into account the .tii file being read into memory  or am I not making any sense ? 
 
1 byte * Number of searchable fields in your index * Number of docs in 
your index
plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query
You make perfect sense.  The formula above does not include the .tii. 
My mistake: I forgot that.  By default, every 128th Term in the index is 
read into memory, to permit random access to terms.  These are stored in 
the .tii file, compressed.  So it is not surprising that they require 7x 
the size of the .tii file in memory.
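If the norms are loaded lazily, one field at a time as it is first searched (which would match the growth you see when new fields are queried), the numbers line up roughly as follows; all figures are taken from this thread, not measured here:

  long docs      = 1000000L;             // ~1M documents
  int  fields    = 25;                   // searchable fields
  long tiiBytes  = 9L * 1024 * 1024;     // ~9 MB .tii file
  long termIndex = 7 * tiiBytes;         // ~7x the .tii size at startup   -> ~63 MB
  long norms     = docs * fields;        // 1 byte/field/doc as fields get searched -> ~24 MB more
  System.out.println("startup ~" + (termIndex >> 20) + " MB, plus up to ~"
      + (norms >> 20) + " MB of norms");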

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Similarity score computation documentation

2004-09-14 Thread Doug Cutting
Your analysis sounds correct.
At base, a weight is a normalized tf*idf.  So a document weight is:
  docTf * idf * docNorm
and a query weight is:
  queryTf * idf * queryNorm
where queryTf is always one.
So the product of these is (docTf * idf * docNorm) * (idf * queryNorm), 
which indeed contains idf twice.  I think the best documentation fix 
would be to add another idf(t) clause at the end of the formula, next to 
queryNorm(q), so this is clear.  Does that sound right to you?

Doug
Ken McCracken wrote:
Hi,
I was looking through the score computation when running search, and
think there may be a discrepancy between what is _documented_ in the
org.apache.lucene.search.Similarity class overview Javadocs, and what
actually occurs in the code.
I believe the problem is only with the documentation.
I'm pretty sure that there should be an idf^2 in the sum.  Look at
org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
can see that first sumOfSquaredWeights() is called, followed by
normalize(), during search.  Further, the resulting value stored in
the field "value" is set as the "weightValue" on the TermScorer.
If we look at what happens to TermWeight, sumOfSquaredWeights() sets
"queryWeight" to idf * boost.  During normalize(), "queryWeight" is
multiplied by the query norm, and "value" is set to queryWeight * idf
== idf * boost * query norm * idf == idf^2 * boost * query norm.  This
becomes the "weightValue" in the TermScorer that is then used to
multiply with the appropriate tf, etc., values.
The remaining terms in the Similarity description are properly
appended.  I also see that the queryNorm effectively "cancels out"
(dimensionally, since it is a 1/ square root of a sum of squares of
idfs) one of the idfs, so the formula still ends up being roughly a
TF-IDF formula.  But the idf^2 should still be there, along with the
expansion of queryNorm.
Am I mistaken, or is the documentation off?
Thanks for your help,
-Ken
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
 it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).
Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
>50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".
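That check is cheap to make with document frequencies (the field name is an assumption):

  IndexReader reader = IndexReader.open("index");
  int enteredFreq   = reader.docFreq(new Term("contents", "purser"));
  int candidateFreq = reader.docFreq(new Term("contents", "parser"));
  if (candidateFreq > enteredFreq) {
    // the suggestion is more common than what the user typed -- worth showing
  }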

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote:
I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.
Sloppy PhraseQuery's are slower than BooleanQueries, but not horribly 
slower.  The problem is that they don't handle the case where phrase 
elements are missing altogether, while a BooleanQuery does.  So what you 
really need is maybe a variation of a sloppy PhraseQuery that scores 
matches that do not contain all of the terms...
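For reference, a sketch of the BooleanQuery-with-boosts formulation being discussed (the field name, gram size and boost values are illustrative; the two boolean flags on add() are the 1.4-era required/prohibited arguments):

  String word = "recursize";
  BooleanQuery query = new BooleanQuery();
  for (int i = 0; i + 3 <= word.length(); i++) {
    TermQuery gram = new TermQuery(new Term("gram3", word.substring(i, i + 3)));
    if (i == 0 || i + 3 == word.length()) {
      gram.setBoost(2.0f);             // weight the leading and trailing grams higher
    }
    query.add(gram, false, false);     // optional clause: missing grams are allowed
  }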

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).
I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread Doug Cutting
Daniel Naber wrote:
On Thursday 09 September 2004 18:52, Doug Cutting wrote:

I have not been
able to construct a two-word query that returns a page without both
words in either the content, the title, the url or in a single anchor.
Can you?

Like this one?
konvens leitseite 

Leitseite is only in the title of the first match (www.gldv.org), konvens 
is only in the body.
Good job finding that!  I guess I should fix Nutch's BasicQueryFilter.
Thanks,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Doug Cutting
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is 
that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it works 
for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3) worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
Hi all,
 I had a similar problem: I have a database of documents with 24 
fields, an average content of 7K, and 16M+ records.

 I had to split the job into slabs of 1M each and merge the 
resulting indexes; submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and I still had OutOfMemory exceptions. The solution I came up with 
was to create a temp directory after every 200K documents and merge 
them together. That got the first production run done; updates 
are now being handled incrementally.

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:
 

Hi all
Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered?
Anomalously, therefore, if this is the case, reducing the heap space can improve performance and get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
I am facing an out of memory problem using Lucene 1.4.1.
Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.
Regards
Daniel

Well, it seems not to be files; it looks more like those SegmentTermEnum objects accumulating in memory.
I've seen some discussion on these objects in the developer newsgroup that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote:
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:
Not that I know of.
- ignore(penalize?) terms that are rare
I think this one is easy to threshold: ignore matching terms that are 
rarer than the term entered.

- ignore(penalize?) terms that are common
This, in effect, falls out of the previous criterion.  A term that is 
very common will not have any matching terms that are more common.  As 
an optimization, you could avoid even looking for matching terms when a 
term is very common.

- terms that are closer (string distance) to the term entered are better
This is the meaty one.
- terms that start w/ the same 'n' chars as the users term are better
Perhaps.  Are folks really better at spelling the beginning of words?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 
Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Doug Cutting
Bill Janssen wrote:
I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appears.  That is,
(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
Your proposal is certainly an improvement.
It's interesting to note that in Nutch I implemented something 
different.  There, a search for "cutting lucene" expands to something like:

 (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
 (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
 (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)
So a page with "cutting" in the body and "lucene" in anchor text won't 
match: the body, anchor or url must contain all query terms.  A single 
authority (content, url or anchor) must vouch for all attributes.

Note that Nutch also boosts matches where the terms are close together. 
 Using "~2147483647" permits them to be anywhere in the document, but 
boosts more when they're closer and in-order.  (The "~4" in anchor 
matches is to prohibit matches across different anchors.  Each anchor is 
separated by a Token.positionIncrement() of 4.)

But perhaps this is not a feature.  Perhaps Nutch should instead expand 
this to:

 +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
 +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
 url:"cutting lucene"~2147483647^4.0
 anchor:"cutting lucene"~4^2.0
 content:"cutting lucene"~2147483647
That would, e.g., permit a match with only "lucene" in an anchor and 
"cutting" in the content, which the earlier formulation would not.

Can anyone tell whether Google has this requirement?  I have not been 
able to construct a two-word query that returns a page without both 
words in either the content, the title, the url or in a single anchor. 
Can you?

If you're interested, the Nutch query expansion code in question is:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup
To play with it you can download Nutch and use the command:
  bin/nutch net.nutch.searcher.Query
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.
But, inspired by that message, couldn't MultiFieldQueryParser just be a 
subclass of QueryParser that overrides getFieldQuery()?
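A sketch of that subclass (the protected getFieldQuery(String, Analyzer, String) signature shown here is the 1.4-era one and is an assumption; it changed in later releases):

  public class AllFieldsQueryParser extends QueryParser {
    private final String[] fields;

    public AllFieldsQueryParser(String[] fields, Analyzer analyzer) {
      super(fields[0], analyzer);
      this.fields = fields;
    }

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText)
        throws ParseException {
      // note: this expands every term, even explicitly qualified ones; a fuller
      // version would first check whether 'field' is the default field
      BooleanQuery combined = new BooleanQuery();
      for (int i = 0; i < fields.length; i++) {
        Query part = super.getFieldQuery(fields[i], analyzer, queryText);
        if (part != null) {
          combined.add(part, false, false);   // optional clause
        }
      }
      return combined;
    }
  }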

Cheers,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote:
I've seen throughout the list mentions of millions of documents.. 8
million, 20 million, etc etc.. but can lucene potentially handle
billions of documents and still efficiently search through them?
Lucene can currently handle up to 2^31 documents in a single index.  To 
a large degree this is limited by Java ints and arrays (which are 
accessed by ints).  There are also a few places where the file format 
limits things to 2^32.

On typical PC hardware, 2-3 word searches of an index with 10M 
documents, each with around 10k of text, require around 1 second, 
including index i/o time.  Performance is more-or-less linear, so that a 
100M document index might require nearly 10 seconds per search.  Thus, 
as indexes grow folks tend to distribute searches in parallel to many 
smaller indexes.  That's what Nutch and Google 
(http://www.computer.org/micro/mi2003/m2022.pdf) do.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: PDF->Text Performance comparison

2004-09-08 Thread Doug Cutting
Ben Litchfield wrote:
PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org
Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java 
applications, with Lucene integration"?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote:
It looks like Document.java uses its own implementation of a LinkedList..
Why not use a HashMap to enable O(1) lookup... right now field lookup is 
O(N) which is certainly no fun.

Was this benchmarked?  Perhaps theres the assumption that since 
documents often have few fields the object overhead and hashcode 
overhead would have been less this way.
I have never benchmarked this but would be surprised if it makes a 
measureable difference in any real application.  A linked list is used 
because it naturally supports multiple entries with the same key.  A 
home-grown linked list was used because, when Lucene was first written, 
java.util.LinkedList did not exist.

Please feel free to benchmark this against a HashMap of LinkedList of 
Field.  This would be slower to construct, which may offset any 
increased access speed.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Possible to remove duplicate documents in sort API?

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote:
My problem is that I have two machines... one for searching, one for 
indexing.

The searcher has an existing index.
The indexer found an UPDATED document and then adds it to a new index 
and pushes that new index over to the searcher.

The searcher then reloads and when someone performs a search BOTH 
documents could show up (including the stale document).

I can't do a delete() on the searcher because the indexer doesn't have 
the entire index as the searcher.
I can think of a couple ways to fix this.
If the indexer box kept copies of the indexes that it has already sent 
to the searcher, then it can mark updated documents as deleted in these 
old indexes.  Then you can, with the new index, also distribute new .del 
files for the old indexes.

Alternately, you could, on the searcher box, before you open the new 
index, open an IndexReader on all of the existing indexes and mark all 
new documents as deleted in the old indexes.  This shouldn't take more 
than a few seconds.

IndexReader.delete() just sets a bit in a bit vector that is written to 
file by IndexReader.close().  So it's quite fast.
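A sketch of the second option (the unique-id field name and value are assumptions; delete(Term) is the 1.4-era call):

  IndexReader oldIndex = IndexReader.open("old-index");
  oldIndex.delete(new Term("docId", "12345"));   // one call per updated document
  oldIndex.close();                              // writes the updated .del file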

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: telling one version of the index from another?

2004-09-07 Thread Doug Cutting
Bill Janssen wrote:
Hi.
Hey, Bill.  It's been a long time!
I've got a Lucene application that's been in use for about two years.
Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4.
The indices seem to behave differently under each version.  I'd like
to add code to my application that checks the current user's index
version against the version of Lucene that they are using, and
automatically re-indexes their files if necessary.  However, I can't
figure out how to tell the version, from the index files.
Prior to 1.4, there were no format numbers in the index.  These are 
being added, file-by-file, as we change file formats.  As you've 
discovered, there is currently no public API to obtain the format number 
of an index.  Also, the formats of different files are revved at 
different times, so there may not be a single format number for the 
entire index.  (Perhaps we should remedy this, by, e.g., always revving 
the "segments" version whenever any file changes format.)

The documentation on the file formats, at
http://jakarta.apache.org/lucene/docs/fileformats.html, directs me to
the "segments" file.  However, when I look at a version 1.3 segments
file, it seems to bear little relationship to the format described in
fileformats.html. 
Have a look at the version of fileformats.html that shipped with 1.3. 
You can find this by browsing CVS, looking for the 1.3-final tag.  But 
let me do it for you:

http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/docs/fileformats.html?rev=1.15
According to CVS tags, that describes both the 1.3 and 1.2 index file 
formats.

But the part of fileformats.html dealing with the
segments file contains no "compatibility notes", so I assume it hasn't
changed since 1.3. 
I wrote the bit about "compatibility notes" when I first documented file 
formats, and then promptly forgot about it.  So, until someone 
contributes them, there are no compatibility notes.  Sorry.

Even if it had, what's the idea of using -1 as the
format number for 1.4?
The idea is to promptly break 1.3 and 1.2 code which tries to read the 
index.  Those versions of Lucene don't check format numbers (because 
there were none).  Positive values would give unpredictable errors.  A 
negative value causes an immediate failure.

So, anyone know a way to tell the difference between the various
versions of the index files?  Crufty hacks welcome :-).
The first four bytes of the "segments" file will mostly do the trick. 
If it is zero or positive, then the index is a 1.2 or 1.3 index.  If it 
is -2, then it's a 1.4-final or later index.
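A sketch of that check, using the 1.4-era store API (Directory.openFile() and Lucene's own InputStream, which reads the same big-endian ints the index files are written with):

  Directory dir = FSDirectory.getDirectory("index", false);
  org.apache.lucene.store.InputStream in = dir.openFile("segments");
  int first = in.readInt();
  in.close();
  if (first < 0) {
    // a 1.4-or-later index: the value is a (negative) format number
  } else {
    // a 1.2/1.3-style index: the value is the segment counter
  }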

There was a change in formats between 1.2 and 1.3, with no format number 
change.  This was in 1.3 RC1 (note #12 in CHANGES.txt).  The semantics 
of each byte in norm files (.f[0-9]) changed.  In 1.2 each byte 
represented 0.0-255.0 on a linear scale.  In 1.3 and later they're 
eight-bit floats (three-bit mantissa, five-bit exponent, no sign bit). 
The net result is that if you use a 1.2 index with 1.3 or later then the 
correct documents will be returned, but scores and rankings will be wacky.

With the exception of this last bit, 1.4 should be able to correctly 
handle indexes from earlier releases.  Please report if this is not the 
case.

Cheers,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: speeding up queries (MySQL faster)

2004-08-22 Thread Doug Cutting
Yonik Seeley wrote:
Setup info & Stats:
- 4.3M documents, 12 keyword fields per document, 11
 [ ... ]
"field1:4 AND field2:188453 AND field3:1"
field1:4  done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1  done alone selects around 1K records
The whole query normally selects less than 50 records
Only the first 10 are returned (or whatever range
the client selects).
The "field1:4" clause is probably dominating the cost of query 
execution.  Clauses which match large portions of the collection are 
slow to evaluate.  If there are not too many different such clauses then 
you can optimize this by re-using a Filter in place of such clauses, 
typically a QueryFilter.

For example, Nutch automatically translates such clauses into 
QueryFilters.  See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup
Note that this only converts clauses whose boost is zero.  Since filters 
do not affect ranking we can only safely convert clauses which do not 
contribute to the score, i.e, those whose boost is zero.  Scores might 
still be different in the filtered results because of 
Similarity.coord().  But, in Nutch, Similarity.coord() is overidden to 
always return 1.0, so that the replacement of clauses with filters does 
not alter the final scores at all.
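A sketch of that rewrite for the example query (create the filter once and reuse it across searches so its cached bits are reused per reader):

  Filter common = new QueryFilter(new TermQuery(new Term("field1", "4")));

  BooleanQuery rest = new BooleanQuery();
  rest.add(new TermQuery(new Term("field2", "188453")), true, false);   // required
  rest.add(new TermQuery(new Term("field3", "1")), true, false);        // required

  Hits hits = searcher.search(rest, common);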

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Debian build problem with 1.4.1

2004-08-20 Thread Doug Cutting
I can successfully use gcc 3.4.0 with Lucene as follows:
ant jar jar-demo
gcj -O3 build/lucene-1.5-rc1-dev.jar build/lucene-demos-1.5-rc1-dev.jar 
-o indexer --main=org.apache.lucene.demo.IndexHTML

./indexer -create docs
It runs pretty snappy too!  However I don't know if there's much mileage 
in packaging Lucene as a native library.  It's easy enough for folks to 
compile Lucene this way, and applications built this way are pretty 
small.  The big thing to install is libgcj.

Doug
Jeff Breidenbach wrote:
Ok, Lucene 1.4.1 has been uploaded to Debian. Hopefully it will have
enough time to percolate before the sarge release.
Now that that is taken care of, I'm curious about the status of gcj
compilation. Packaging Lucene as a native library might be useful for
projects such as PyLucene, and it is also advantageous for license
reasons i.e. avoiding the non-free JVM dependency. What's the current
gcj compilation recipe? The best I could find on Google (below) seems
a little bit stale.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg04131.html
Cheers,
Jeff

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NegativeArraySizeException when creating a new IndexSearcher

2004-08-20 Thread Doug Cutting
Looks to me like you're using an older version of Lucene on your Linux 
box.  The code is back-compatible, it will read old indexes, but Lucene 
1.3 cannot read indexes created by Lucene 1.4, and will fail in the way 
you describe.

Doug
Sven wrote:
Hi!
I have a problem to port a Lucene based knowledgebase from Windows to Linux.
On Windows it works fine whereas I get a NegativeArraySizeException on Linux
when I try to initialise a new IndexSearcher to search the index. Deleting
and rebuilding the index didn't help. I checked permissions, file path and
lock_dir but as far as I can say they seem to be all right. As I couldn't
find another one with the same problem I guess I've overlooked sth, but I've
run out of ideas. I use lucene-1.4-rc2 and tomcat 5.0.18. Can someone help
me please with this or has an idea?
Kind regards,
Sven
java.lang.NegativeArraySizeException
 at
org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:106)
 at org.apache.lucene.index.TermInfosReader.(TermInfosReader.java:82)
 at org.apache.lucene.index.SegmentReader.(SegmentReader.java:141)
 at org.apache.lucene.index.SegmentReader.(SegmentReader.java:120)
 at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
 at org.apache.lucene.store.Lock$With.run(Lock.java:148)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:99)
 at org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:75)
 at
com.sykon.knowledgebase.action.ListQueryResultAction.act(ListQueryResultActi
on.java:134)
 at
org.apache.cocoon.components.treeprocessor.sitemap.ActTypeNode.invoke(ActTyp
eNode.java:159)
 at
org.apache.cocoon.components.treeprocessor.sitemap.ActionSetNode.call(Action
SetNode.java:121)
 at
org.apache.cocoon.components.treeprocessor.sitemap.ActSetNode.invoke(ActSetN
ode.java:98)
 at
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invo
keNodes(AbstractParentProcessingNode.java:84)
 at
org.apache.cocoon.components.treeprocessor.sitemap.PreparableMatchNode.invok
e(PreparableMatchNode.java:165)
 at
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invo
keNodes(AbstractParentProcessingNode.java:107)
 at
org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(Pipel
ineNode.java:162)
 at
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invo
keNodes(AbstractParentProcessingNode.java:107)
 at
org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(Pipe
linesNode.java:136)
 at
org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcess
or.java:371)
 at
org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcess
or.java:312)
 at
org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNod
e.java:133)
 at
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invo
keNodes(AbstractParentProcessingNode.java:84)
 at
org.apache.cocoon.components.treeprocessor.sitemap.PreparableMatchNode.invok
e(PreparableMatchNode.java:165)
 at
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invo
keNodes(AbstractParentProcessingNode.java:107)
 at
org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(Pipel
ineNode.java:162)
 at
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invo
keNodes(AbstractParentProcessingNode.java:107)
 at
org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(Pipe
linesNode.java:136)
 at
org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcess
or.java:371)
 at
org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcess
or.java:312)
 at org.apache.cocoon.Cocoon.process(Cocoon.java:656)
 at org.apache.cocoon.servlet.CocoonServlet.service(CocoonServlet.java:1112)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:856)
 at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:284)
 at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:204)
 at
org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.
java:742)
 at
org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDis
patcher.java:506)
 at
org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatch
er.java:443)
 at
org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher
.java:359)
 at
org.apache.jasper.runtime.PageContextImpl.doForward(PageContextImpl.java:712
)
 at
org.apache.jasper.runtime.PageContextImpl.forward(PageContextImpl.java:682)
 at
org.apache.jsp.knowlegebase.controller_jsp._jspService(controller_jsp.java:8
44)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:133)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:856)
 at
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:3
11)
 at org.apache.jasper.serv

Re: Split an existing index into smaller segments without a re-index?

2004-08-04 Thread Doug Cutting
Kevin A. Burton wrote:
Is it possible to take an existing index (say 1G) and break it up into a 
number of smaller indexes (say 10 100M indexes)...

I don't think theres currently an API for this but its certainly 
possible (I think).
Yes, it is theoretically possible but not yet implemented.
An easy way to implement it would be to subclass FilterIndexReader to 
return a subset of documents, then use IndexWriter.addIndexes() to write 
out each subset as a new index.  Subsets could be ranges of document 
numbers, and one could use TermPositions.skipTo() to accelerate the 
TermPositions subset implementation, but this still wouldn't be quite as 
fast as an index splitter that only reads each TermPositions once.  If 
we added a lower-level index writing API then one could use that to 
implement this...
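An outline of the FilterIndexReader route (untested, and the range bookkeeping is illustrative): hide everything outside a document-number range by reporting it as deleted, then let addIndexes() copy out what remains.

  class SubsetReader extends FilterIndexReader {
    private final int start, end;                 // keep docs in [start, end)
    SubsetReader(IndexReader in, int start, int end) {
      super(in);
      this.start = start;
      this.end = end;
    }
    public boolean hasDeletions() { return true; }
    public boolean isDeleted(int n) {
      return n < start || n >= end || in.isDeleted(n);
    }
    public int numDocs() {
      int count = 0;
      for (int i = start; i < end; i++)
        if (!in.isDeleted(i)) count++;
      return count;
    }
  }

  IndexReader big = IndexReader.open("big-index");
  IndexWriter part = new IndexWriter("part1", new StandardAnalyzer(), true);
  part.addIndexes(new IndexReader[] { new SubsetReader(big, 0, 100000) });
  part.close();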

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Negative Boost

2004-08-04 Thread Doug Cutting
Terry Steichen wrote:
But if, in the future, I or someone else took on this task of enhancing QueryParser, I'd like to be assured that the underlying Lucene engine will accept and support negative boosting.  Is that the case?
Lucene will multiply negative boosts into scores just like positive 
ones.  I've never been convinced that it makes much sense to use 
negative boosts in a scoring formula such as Lucene's, but there's 
nothing stopping you from using them.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Hit & Score [ Between ]

2004-08-04 Thread Doug Cutting
You could instead use a HitCollector to gather only documents with 
scores in that range.
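For example (the 0.5-0.8 range comes from the question quoted below; note the collector sees raw scores, not the normalized values Hits returns):

  final List matchingDocs = new ArrayList();
  searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
      if (score > 0.5f && score < 0.8f) {
        matchingDocs.add(new Integer(doc));
      }
    }
  });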

Doug
Karthik N S wrote:
Hi 

Apologies
If I want to get all the hits with scores between 0.5f and 0.8f, 
I usually use:

  Query query = QueryParser.parse(srchkey, Fields, analyzer);
  Hits hits = searcher.search(query);

  for (int i = 0; i < hits.length(); i++) {
    Document docs = hits.doc(i);
    float score = hits.score(i);

    if ((score > 0.5f) && (score < 0.8f)) {
      System.out.println(" FileName  : " + docs.get("filename"));
    }
  }

Is there any other way to Do this ,
Please Advise me..
Thx.

  WITH WARM REGARDS 
  HAVE A NICE DAY 
  [ N.S.KARTHIK] 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Caching of TermDocs

2004-07-27 Thread Doug Cutting
John Patterson wrote:
I would like to hold a significant amount of the index in memory but use the
disk index as a spill over.  Obviously the best situation is to hold in
memory only the information that is likely to be used again soon.  It seems
that caching TermDocs would allow popular search terms to be searched more
efficiently while the less common terms would need to be read from disk.
The operating system already caches recent disk i/o.  So what you'd save 
primarily would be the overhead of parsing the data.  However the parsed 
form, a sequence of docNo and freq ints, is nearly eight times as large 
as its compressed size in the index.  So your cache would consume a lot 
of memory.

Whether this provides much overall speedup depends on the distribution 
of common terms in your query traffic.  If you have a few terms that are 
searched very frequently then it might pay off.  In my experience with 
general-purpose search engines this is not usually the case: folks seem 
to use rarer words in queries than they do in ordinary text.  But in 
some search applications perhaps the traffic is more skewed.  Only some 
experiments would tell for sure.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: over 300 GB to index: feasability and performance issue

2004-07-26 Thread Doug Cutting
Vincent Le Maout wrote:
I have to index a huge, huge amount of data: about 10 million documents
making up about 300 GB. Is there any technical limitation in Lucene that
could prevent me from processing such amount (I mean, of course, apart
from the external limits induced by the hardware: RAM, disks, the system,
whatever) ?
Lucene is in theory able to support up to 2B documents in a single 
index.  Folks have successfully built indexes with several hundred 
million documents.  10 million should not be a problem.

If possible, does anyone have an idea of the amount of resource
needed: RAM, CPU time, size of indexes, access time on such a collection ?
if not, is it possible to extrapolate an estimation from previous 
benchmarks ?
For simple 2-3 term queries, with average sized documents (~10k of text) 
you should get decent performance (1 second / query) on a 10M document 
index.  An index typically requires around 35% of the plain text size.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Boosting documents

2004-07-26 Thread Doug Cutting
Rob Clews wrote:
I want to do the same, set a boost for a field containing a date that
lowers as the date is further from now, is there any way I could do
this?
You could implement Similarity.idf(Term, Searcher) to, when 
Term.field().equals("date"), return a value that is greater for more 
recent dates.
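A sketch of that override (the date format and boost formula are made up for illustration; only the "date" field name comes from the question):

  public class RecencySimilarity extends DefaultSimilarity {
    public float idf(Term term, Searcher searcher) throws IOException {
      if ("date".equals(term.field())) {
        // assume terms like "20040726": newer years get a larger factor
        int year = Integer.parseInt(term.text().substring(0, 4));
        return 1.0f + (year - 2000) * 0.1f;
      }
      return super.idf(term, searcher);
    }
  }
  // install it with searcher.setSimilarity(new RecencySimilarity());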

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Logic of score method in hits class

2004-07-26 Thread Doug Cutting
Lucene scores are not percentages.  They really only make sense compared 
to other scores for the same query.  If you like percentages, you can 
divide all scores by the first score and multiply by 100.
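For example:

  Hits hits = searcher.search(query);
  float top = hits.length() > 0 ? hits.score(0) : 1.0f;
  for (int i = 0; i < hits.length(); i++) {
    int percent = Math.round(100.0f * hits.score(i) / top);
    System.out.println(percent + "%  " + hits.doc(i).get("title"));   // "title" is illustrative
  }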

Doug
lingaraju wrote:
Dear  All
How the score method works(logic) in Hits class
Even for a 100% match, the score returned is only 69%.

Thanks and regards
Raju
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Weighting database fields

2004-07-21 Thread Doug Cutting
Ernesto De Santis wrote:
If a field has a boost value set at index time, and at search time
the query has another boost value for this field, what happens?
Which value is used for the boost?
The two boosts are both multiplied into the score.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Sort: 1.4-rc3 vs. 1.4-final

2004-07-21 Thread Doug Cutting
The key in the WeakHashMap should be the IndexReader, not the Entry.  I 
think this should become a two-level cache, a WeakHashMap of HashMaps, 
the WeakHashMap keyed by IndexReader, the HashMap keyed by Entry.  I 
think the Entry class can also be changed to not include an IndexReader 
field.  Does this make sense?  Would someone like to construct a patch 
and submit it to the developer list?
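A sketch of the two-level structure (names are illustrative, and buildComparator() is a hypothetical stand-in for the existing comparator-building code):

  Map readers = new WeakHashMap();            // IndexReader -> (Entry -> cached comparator)

  synchronized (readers) {
    Map perReader = (Map) readers.get(reader);
    if (perReader == null) {
      perReader = new HashMap();
      readers.put(reader, perReader);         // weak key: freed with the reader
    }
    Object comparator = perReader.get(entry); // entry = field name + sort type
    if (comparator == null) {
      comparator = buildComparator(reader, entry);
      perReader.put(entry, comparator);
    }
  }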

Doug
Aviran wrote:
I think I found the problem
FieldCacheImpl uses WeakHashMap to store the cached objects, but since there
is no other reference to this cache it is getting released.
Switching to HashMap solves it.
The only problem is that I don't see anywhere where the cached object will
get released if you open a new IndexReader.
Aviran
-Original Message-
From: Greg Gershman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 21, 2004 13:13 PM
To: Lucene Users List
Subject: RE: Sort: 1.4-rc3 vs. 1.4-final

I've done a bit more snooping around; it seems that in
FieldSortedHitQueue.getCachedComparator(line 153), calls to lookup a stored
comparator in the cache always return null.  This occurs even for the
built-in sort types (I tested it on integers and my code for longs).  The
comparators don't even appear to be being stored in the HashMap to begin
with.
Any ideas?
Greg
 

--- Aviran <[EMAIL PROTECTED]> wrote:
Since I had to implement sorting in lucene 1.2 I had
to write my own sorting
using something similar to a lucene's contribution
called SortField.
Yesterday I did some tests, trying to use lucene 1.4
Sort objects and I
realized that my old implementation works 40% faster
than Lucene's
implementation. My guess is that you are right and
there is a problem with
the cache although I couldn't find what that is yet.
Aviran
-Original Message-
From: Greg Gershman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 21, 2004 9:22 AM
To: [EMAIL PROTECTED]
Subject: Sort: 1.4-rc3 vs. 1.4-final
When rc3 came out, I modified the classes used for
Sorting to, in addition to Integer, Float and
String-based sort keys, use Long values.  All I did
was add extra statements in 2 classes (SortField and
FieldSortedHitQueue) that made a special case for
longs, and created a LongSortedHitQueue identical to
the IntegerSortedHitQueue, only using longs.
This worked as expected; Long values converted to
strings and stored in Field.Keyword type fields
would
be sorted according to Long order.  The initial
query
would take a while, to build the sorted array, but
subsequent queries would take little to no time at
all.
I went back to look at 1.4 final, and noticed the
Sort implementation has
changed quite a bit.  I tried the same type of
modifications to the existing
source files, but was unable to achieve similiar
results.
Each subsequent query seems to take a significant
amount of time, as if the Sorted array is being
rebuilt each time.  Also, I tried sorting on an
Integer fields and got similar results, which leads
me
to believe there might be a caching problem
somewhere.
Has anyone else seen this in 1.4-final?  Also, I
would
like it if Long sorted fields could become a part of
the API; it makes sorting by date a breeze.
Thanks!
Greg Gershman


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Very slow IndexReader.open() performance

2004-07-20 Thread Doug Cutting
Optimization should not require huge amounts of memory.  Can you tell a 
bit more about your configuration:  What JVM?  What OS?  How many 
fields?  What mergeFactor have you used?

Also, please attach the output of 'ls -l' of your index directory, as 
well as the stack trace you see when OutOfMemory is thrown.

Thanks,
Doug
Mark Florence wrote:
Hi -- We have a large index (~4m documents, ~14gb) that we haven't been
able to optimize for some time, because the JVM throws OutOfMemory, after
climbing to the maximum we can throw at it, 2gb. 

In fact, the OutOfMemory condition occurred most recently during a segment 
merge operation. maxMergeDocs was set to the default, and we seem to have
gotten around this problem by setting it to some lower value, currently
100,000. The index is highly interactive so I took the hint from earlier
posts to set it to this value.

Good news! No more OutOfMemory conditions.
Bad news: now, calling IndexReader.open() is taking 20+ seconds, and it 
is killing performance.

I followed the design pattern in another earlier post from Doug. I take a
batch of deletes, open an IndexReader, perform the deletes, then close it.
Then I take a batch of adds, open an IndexWriter, perform the adds, then
close it. Then I get a new IndexSearcher for searching.
But because the index is so interactive, this sequence repeats itself all
the time. 

My question is, is there a better way? Performance was fine when I could
optimize. Can I hold onto singleton a IndexReader/IndexWriter/IndexSearcher
to avoid the overhead of the open?
Any help would be most gratefully received.
Mark Florence, CTO, AIRS
[EMAIL PROTECTED]
800-897-7714x1703
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Post-sorted inverted index?

2004-07-20 Thread Doug Cutting
You can define a subclass of FilterIndexReader that re-sorts documents 
in TermPositions(Term) and document(int), then use 
IndexWriter.addIndexes() to write this in Lucene's standard format.  I 
have done this in Nutch, with the (as yet unused) IndexOptimizer.

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/indexer/IndexOptimizer.java?view=markup
Doug
Aphinyanaphongs, Yindalon wrote:
I gather from reading the documentation that the scores for each document hit are computed at query time.  I have an application that, due to the complexity of the function, cannot compute scores at query time.  Would it be possible for me to store the documents in pre-sorted order in the inverted index? (i.e. after the initial index is created, to have a post processing step to sort and reindex the final documents).
 
For example:
Document A - score 0.2
Document B - score 0.4
Document C - score 0.6
 
Thus for the word 'the', the stored order in the index would be C,B,A.
 
Thanks!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: release & migration plan

2004-07-15 Thread Doug Cutting
fp235-5 wrote:
I am looking at the code to implement setIndexInterval() in IndexWriter. I'd
like to have your opinion on the best way to do it.
Currently the creation of an instance of TermInfosWriter requires the following
steps:
...
IndexWriter.addDocument(Document)
IndexWriter.addDocument(Document, Analyser)
DocumentWriter.addDocument(String, Document)
DocumentWriter.writePostings(Posting[],String)
TermInfosWriter.
To give a different value to indexInterval in TermInfosWriter, we need to add a
variable holding this value into IndexWriter and DocumentWriter and modify the
constructors for DocumentWriter and TermInfosWriter. (quite heavy changes)
I think this is the best approach.  I would replace other parameters in 
these constructors which can be derived from an IndexWriter with the 
IndexWriter.  That way, if we add more parameters like this, they can 
also be passed in through the IndexWriter.

All of the parameters to the DocumentWriter constructor are fields of 
IndexWriter.  So one can instead simply pass a single parameter, an 
IndexWriter, then access its directory, analyzer, similarity and 
maxFieldLength in the DocumentWriter constructor.  A public 
getDirectory() method would also need to be added to IndexWriter for 
this to work.

Similarly, two of SegmentMerger's constructor parameters could be 
replaced with an IndexWriter, the directory and boolean useCompoundFile.

In SegmentMerger I would replace the directory parameter with IndexWriter.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Token or not Token, PerFieldAnalyzer

2004-07-15 Thread Doug Cutting
Florian Sauvin wrote:
Everywhere in the documentation (and it seems logical) you say to use
the same analyzer for indexing and querying... how is this handled on
not tokenized fields?
Imperfectly.
The QueryParser knows nothing about the index, so it does not know which 
fields were tokenized and which were not.  Moreover, even the index does 
not know this, since you can freely intermix tokenized and untokenized 
values in a single field.

In my case, I have certain fields on which I want the tokenization and
anlysis and everything to happen... but on other fields, I just want to
index the content as it is (no alterations at all) and not analyze at
query time... is that possible?
It is very possible.  A good way to handle this is to use 
PerFieldAnalyzerWrapper.
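For example (the field names are illustrative, and WhitespaceAnalyzer stands in for whatever minimal analysis the untokenized field needs):

  PerFieldAnalyzerWrapper analyzer =
      new PerFieldAnalyzerWrapper(new StandardAnalyzer());
  analyzer.addAnalyzer("partNumber", new WhitespaceAnalyzer());

  Query q = QueryParser.parse("contents:disk AND partNumber:XJ400", "contents", analyzer);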

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Scoring without normalization!

2004-07-15 Thread Doug Cutting
Have you looked at:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
in particular, at:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float)
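One way to follow those pointers is a Similarity that returns 1.0 for both factors (a sketch; note that lengthNorm is baked into the index at indexing time, so it only takes full effect for documents indexed with this Similarity installed):

  public class NoNormSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
    public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
  }

  // install it on both the IndexWriter and the Searcher:
  writer.setSimilarity(new NoNormSimilarity());
  searcher.setSimilarity(new NoNormSimilarity());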
Doug
Jones G wrote:
Sadly, I am still running into problems
Explain shows the following after the modification.
Rank: 1 ID: 11285358Score: 5.5740864E8
5.5740864E8 = product of:
  8.3611296E8 = sum of:
8.3611296E8 = product of:
  6.6889037E9 = weight(title:iron in 1235940), product of:
0.12621856 = queryWeight(title:iron), product of:
  7.0507255 = idf(docFreq=10816)
  0.017901499 = queryNorm
5.2994613E10 = fieldWeight(title:iron in 1235940), product of:
  1.0 = tf(termFreq(title:iron)=1)
  7.0507255 = idf(docFreq=10816)
  7.5161928E9 = fieldNorm(field=title, doc=1235940)
  0.125 = coord(1/8)
2.7106019E-8 = product of:
  1.08424075E-7 = sum of:
5.7318403E-9 = weight(abstract:an in 1235940), product of:
  0.03711049 = queryWeight(abstract:an), product of:
2.073038 = idf(docFreq=1569960)
0.017901499 = queryNorm
  1.5445337E-7 = fieldWeight(abstract:an in 1235940), product of:
1.0 = tf(termFreq(abstract:an)=1)
2.073038 = idf(docFreq=1569960)
7.4505806E-8 = fieldNorm(field=abstract, doc=1235940)
1.0269223E-7 = weight(abstract:iron in 1235940), product of:
  0.111071706 = queryWeight(abstract:iron), product of:
6.2046037 = idf(docFreq=25209)
0.017901499 = queryNorm
  9.24558E-7 = fieldWeight(abstract:iron in 1235940), product of:
2.0 = tf(termFreq(abstract:iron)=4)
6.2046037 = idf(docFreq=25209)
7.4505806E-8 = fieldNorm(field=abstract, doc=1235940)
  0.25 = coord(2/8)
  0.667 = coord(2/3)
Rank: 2 ID: 8157438 Score: 2.7870432E8
2.7870432E8 = product of:
  8.3611296E8 = product of:
6.6889037E9 = weight(title:iron in 159395), product of:
  0.12621856 = queryWeight(title:iron), product of:
7.0507255 = idf(docFreq=10816)
0.017901499 = queryNorm
  5.2994613E10 = fieldWeight(title:iron in 159395), product of:
1.0 = tf(termFreq(title:iron)=1)
7.0507255 = idf(docFreq=10816)
7.5161928E9 = fieldNorm(field=title, doc=159395)
0.125 = coord(1/8)
  0.3334 = coord(1/3)
Rank: 3 ID: 10543103Score: 2.7870432E8
2.7870432E8 = product of:
  8.3611296E8 = product of:
6.6889037E9 = weight(title:iron in 553967), product of:
  0.12621856 = queryWeight(title:iron), product of:
7.0507255 = idf(docFreq=10816)
0.017901499 = queryNorm
  5.2994613E10 = fieldWeight(title:iron in 553967), product of:
1.0 = tf(termFreq(title:iron)=1)
7.0507255 = idf(docFreq=10816)
7.5161928E9 = fieldNorm(field=title, doc=553967)
0.125 = coord(1/8)
  0.3334 = coord(1/3)
Rank: 4 ID: 8753559 Score: 2.7870432E8
2.7870432E8 = product of:
  8.3611296E8 = product of:
6.6889037E9 = weight(title:iron in 2563152), product of:
  0.12621856 = queryWeight(title:iron), product of:
7.0507255 = idf(docFreq=10816)
0.017901499 = queryNorm
  5.2994613E10 = fieldWeight(title:iron in 2563152), product of:
1.0 = tf(termFreq(title:iron)=1)
7.0507255 = idf(docFreq=10816)
7.5161928E9 = fieldNorm(field=title, doc=2563152)
0.125 = coord(1/8)
  0.3334 = coord(1/3)
I would like to get rid of all normalizations and just have TF and IDF.
What am I missing?
On Thu, 15 Jul 2004 Anson Lau wrote :
If you don't mind hacking the source:
In Hits.java
In method "getMoreDocs()"

   // Comment out the following
   //float scoreNorm = 1.0f;
   //if (length > 0 && scoreDocs[0].score > 1.0f) {
   //  scoreNorm = 1.0f / scoreDocs[0].score;
   //}
   // And just set scoreNorm to 1:
   float scoreNorm = 1.0f;
I don't know if u can do it without going to the src.
Anson
-Original Message-
From: Jones G [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 6:52 AM
To: [EMAIL PROTECTED]
Subject: Scoring without normalization!
How do I remove document normalization from scoring in Lucene? I just want
to stick to TF IDF.
Thanks.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Pool of IndexReaders or Pool of Searchers?

2004-07-13 Thread Doug Cutting
Whether this will make a difference depends on the size of the index. 
If your index is relatively small, then this patch will help more.  If 
your index is large, it will help less.

Aviran wrote:
Try to compile this code changes into lucene
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html
In our tests it improved performance by ~100%
Aviran
-Original Message-
From: Anson Lau [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 13, 2004 3:40 AM
To: 'Lucene Users List'
Subject: RE: Pool of IndexReaders or Pool of Searchers?

Don't have a formal report but I can give you a bit more details:
What are we testing: a search app powered by lucene, search app is a web app
built on Struts.
Index: 1.8 million database records
Hardware: Dual P4 2.8 HT, 4GB RAM, RAID 5 SCSI HDs.
Directory type: FSDirectory
Load test app: A load test app which has 15 threads, each firing 1 search
request (http request) per second to the search app.
All the http requests are search request, but keep in mind all the overhead
of jsp, struts, etc.
We thought especially with multiple CPU systems using a small pool of index
searcher may improve concurrency.
Under lucene 1.4:
Using 1 static index searcher: 12 request per second.
Using a pool of 4 index search: same 12 request per second
Under lucene 1.3:
I can't remember the exact numbers, but pooling index searcher did make a
noticeable difference.
Hope that's useful.
Anson

-Original Message-
From: Vince Taluskie [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 13, 2004 3:50 PM
To: Lucene Users List
Subject: Re: Pool of IndexReaders or Pool of Searchers?
Can you supply details on the config tested?
Vince
Anson Lau wrote:

Hi,
When I did some load testing on a Lucene-powered search app, using a 
pool of index searchers didn't give me any more searches per second than 
just using a singleton index searcher.

Anson
Quoting [EMAIL PROTECTED]:


Hi,
I have multiple threads reading an index.  Should they all be using
the same IndexReader and using a pool of IndexSearchers?  Or
should they be
using a pool of IndexReaders?
Basically, one reader or many?
Thanks.
  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-13 Thread Doug Cutting
John Wang wrote:
   On the same thought, how about the org.apache.lucene.analysis.Token
class. Can we make it non-final?
Sure, if you make a case for why it should be non-final.
What would your subclasses do?  Which methods would you override?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-13 Thread Doug Cutting
Kevin A. Burton wrote:
Doug Cutting wrote:
Field and Document are not designed to be extensible. They are 
persisted in such a way that added methods are not available when the 
field is restored. In other words, when a field is read, it always 
constructs an instance of Field, not a subclass.
That's fine... I think that's acceptable behavior. I don't think anyone 
would assume that inner vars are restored or that the field is serialized.
You'd be surprised what people would assume!
The bottom line is that making this non-final would not enable you to do 
anything that you cannot do today, and it would permit lots of folks to 
try things that won't work, get confused, and complain.  Part of the job 
of an API is to steer you towards best practices and away from dead 
ends.  If there were important functionality that would be enabled by 
making this non-final then I'd be for it.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-13 Thread Doug Cutting
Aviran wrote:
I changed the Lucene 1.4 final source code and yes this is the source
version I changed.
Note that this patch won't produce a speedup on earlier releases, 
since there was another multi-thread bottleneck higher up the stack that 
was only recently removed, revealing this lower-level bottleneck.

The other patch was:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg07873.html
Both are required to see the speedup.
Also, is there any reason folks cannot use 1.4 final now?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:
I use Lucene 1.4 final
Here is the thread dump for one blocked thread (If you want a full thread
dump for all threads I can do that too)
Thanks.  I think I get the point.  I recently removed a synchronization 
point higher in the stack, so that now this one shows up!

Whether or not you submit a patch, please file a bug report in Bugzilla 
with your proposed change, so that we don't lose track of this issue.

Thanks,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:
First let me explain what I found out. I'm running Lucene on a 4 CPU server.
While doing some stress tests I've noticed (by doing full thread dump) that
searching threads are blocked on the method: public FieldInfo fieldInfo(int
fieldNumber).  This causes significant CPU idle time. 
What version of Lucene are you running?  Also, can you please send the 
stack traces of the blocked threads, or at least a description of them? 
 I'd be interested to see what context this happens in.  In particular, 
which IndexReader and Searcher/Scorer/Weight methods does it happen under?

I noticed that the class org.apache.lucene.index.FieldInfos uses private
class members Vector byNumber and Hashtable byName, both of which are
synchronized objects. By changing the Vector byNumber to ArrayList byNumber
I was able to get 110% improvement in performance (number of searches per
second).
That's impressive!  Good job finding a bottleneck!
My question is: do the fields byNumber and byName have to be synchronized,
and what can happen if I change them to ArrayList and HashMap, which are not
synchronized?  Can this corrupt the index or the integrity of the results?
I think that is a safe change.  FieldInfos is only modified by 
DocumentWriter and SegmentMerger, and there is no possibility of other 
threads accessing those instances.  Please submit a patch to the 
developer mailing list.
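For anyone following along, the change under discussion amounts to something like
the following inside org.apache.lucene.index.FieldInfos; this is a sketch of the
idea rather than the committed patch, so the exact declarations may differ in your
source tree.

  // Before: synchronized collections, so every lookup pays for a monitor.
  //   private Vector byNumber = new Vector();
  //   private Hashtable byName = new Hashtable();
  // After: unsynchronized java.util equivalents, safe here because FieldInfos
  // is only modified by DocumentWriter and SegmentMerger.
  private ArrayList byNumber = new ArrayList();
  private HashMap byName = new HashMap();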

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
What I would really like to see are some best practices, or some advice from
users who work with really large indices: how they handle this situation, why
they don't have to care about it, or maybe why I am completely missing the
point ;-))
Many folks with really large indexes just don't permit things like 
wildcard and range searches.  For example, Google supports no wildcards 
and has only recently added limited numeric range searching.  Yahoo! 
supports neither.
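If ruling those query types out entirely isn't an option, the other common
approach is to raise the clause limit deliberately and handle the overflow; the
sketch assumes a release with the 1.4 clause limit (BooleanQuery.setMaxClauseCount
and BooleanQuery.TooManyClauses), and searcher and query are presumed to already
exist.

  // Raise the clause limit (the default is 1024) and fail gracefully if a
  // wildcard or range query still expands past it.  Each clause costs RAM and
  // time, so narrowing the query is usually better than raising this further.
  BooleanQuery.setMaxClauseCount(4096);
  try {
    Hits hits = searcher.search(query);
    // render hits here
  } catch (BooleanQuery.TooManyClauses e) {
    // ask the user to narrow the wildcard or range
  }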

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Doug Cutting wrote:
The calls would look like:
new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
Stored could be implemented as the nested class:
public final class Stored {
  private Stored() {}
  public static final Stored YES = new Stored();
  public static final Stored NO = new Stored();
}
Actually, while we're at it, Indexed and Tokenized are confounded.  A 
single parameter covering both would be better, something like:

public final class Index {
  private Index() {}
  public static final Index NO = new Index();
  public static final Index TOKENIZED = new Index();
  public static final Index UN_TOKENIZED = new Index();
}
then calls would simply look like:
new Field("name", "value", Store.YES, Index.TOKENIZED);
BTW, I think Stored would be better named Store too.
BooleanQuery's required and prohibited flags could get the same 
treatment, with the addition of a nested class like:

public final class Occur {
  private Occur() {}
  public static final Occur MUST_NOT = new Occur();
  public static final Occur SHOULD = new Occur();
  public static final Occur MUST = new Occur();
}
and adding a boolean clause would look like:
booleanQuery.add(new TermQuery(...), Occur.MUST);
Then we can deprecate the old methods.
Comments?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote:
So I added a few constants to my class:
new Field( "name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED );
which IMO is a lot easier to maintain.
Why not add these constants to Field.java:
   public static final boolean STORED = true;
   public static final boolean NOT_STORED = false;
   public static final boolean INDEXED = true;
   public static final boolean NOT_INDEXED = false;
   public static final boolean TOKENIZED = true;
   public static final boolean NOT_TOKENIZED = false;
Of course you still have to remember the order but this becomes a lot 
easier to maintain.
It would be best to get the compiler to check the order.
If we change this, why not use type-safe enumerations:
http://www.javapractices.com/Topic1.cjp
The calls would look like:
new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
Stored could be implemented as the nested class:
public final class Stored {
  private Stored() {}
  public static final Stored YES = new Stored();
  public static final Stored NO = new Stored();
}
and the compiler would check the order of arguments.
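To make the compile-time check concrete, here is an illustration using the
hypothetical constants and nested classes from this thread (none of this is a
released API):

  // With boolean constants, swapping two arguments still compiles and is
  // silently wrong:
  new Field("name", "value", INDEXED, NOT_STORED, TOKENIZED);
  // With the type-safe classes, each position accepts only its own type:
  new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);    // compiles
  // new Field("name", "value", Indexed.NO, Stored.YES, Tokenized.YES); // compile error
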
How's that?
Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote:
I was going to create a new IDField class which just calls super( name, 
value, false, true, false) but noticed I was prevented because 
Field.java is final?
You don't need to subclass to do this, just a static method somewhere.
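A sketch of such a static helper, mirroring the constructor call quoted below;
the class and method names here are placeholders, not an existing API.

  import org.apache.lucene.document.Field;

  public class Fields {
    // Same arguments as the proposed IDField: store=false, index=true, token=false.
    public static Field idField(String name, String value) {
      return new Field(name, value, false, true, false);
    }
  }
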
Why is this?  I can't see any harm in making it non-final...
Field and Document are not designed to be extensible.  They are 
persisted in such a way that added methods are not available when the 
field is restored.  In other words, when a field is read, it always 
constructs an instance of Field, not a subclass.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Doug Cutting
Armbrust, Daniel C. wrote:
The problem I ran into the other day with the new lock location is that Person A
had started an index, ran into problems, erased the index and asked me to look at
it.  I tried to rebuild the index (in the same place on a Solaris machine) and
found out that:
A) her locks still existed,
B) I didn't have a clue where it put the locks on the Solaris machine (since no
full path was given with the error - has this been fixed?), and
C) I didn't have permission to remove her locks.
I think these problems have been fixed.  When an index is created, all 
old locks are first removed.  And when a lock cannot be obtained, its 
full pathname is printed.  Can you replicate this with 1.4-final?

I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index.
Changing the lock location is risky.  Code which writes an index would 
not be required to alter the lock location, but code which reads it 
would be.  This can easily lead to uncoordinated access.

So it is best if the default lock location works well in most cases.  We 
try to use a temporary directory writable by all users, and attempt to 
handle situations like those you describe above.  Please tell me if you 
continue to have problems with locking.
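For a read-only index in the meantime, the lock directory can usually be pointed
somewhere writable with a system property.  The property name below is quoted from
memory, so confirm it against the FSDirectory source in your release; it is read
when FSDirectory is first loaded, so pass it on the command line rather than
setting it from code afterwards:

  java -Dorg.apache.lucene.lockDir=/somewhere/writable -cp <your classpath> YourSearchApp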

Thanks,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-09 Thread Doug Cutting
Kevin A. Burton wrote:
With the typical handful of fields, one should never see more than 
hundreds of files.

We only have 13 fields... Though to be honest I'm worried that even if I 
COULD do the optimize, it would run out of file handles.
Optimization doesn't open all files at once.  The most files that are 
ever opened by an IndexWriter is just:

4 + (5 + numIndexedFields) * (mergeFactor-1)
This includes during optimization.
However, when searching, an IndexReader must keep most files open.  In 
particular, the maximum number of files an unoptimized, non-compound 
IndexReader can have open is:

(5 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

A compound IndexReader, on the other hand, should open at most, just:
(mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs))
An optimized, non-compound IndexReader will open just (5 + 
numIndexedFields) files.

And an optimized, compound IndexReader should only keep one file open.
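To put the 13 fields mentioned above into those formulas, here is a throwaway
calculation; mergeFactor=10, minMergeDocs=10 and the 40 million document count are
illustrative defaults rather than values taken from this thread.

  // Rough file-handle estimates using the formulas above.  Only the field
  // count (13) comes from this thread; the rest are assumed defaults.
  public class FileHandleEstimate {
    public static void main(String[] args) {
      int fields = 13, mergeFactor = 10, minMergeDocs = 10;
      long numDocs = 40000000L;
      double levels = Math.log((double) numDocs / minMergeDocs) / Math.log(mergeFactor);

      System.out.println("IndexWriter (incl. optimize): "
          + (4 + (5 + fields) * (mergeFactor - 1)));                          // 166
      System.out.println("Unoptimized non-compound IndexReader: "
          + (long) Math.ceil((5 + fields) * (mergeFactor - 1) * levels));     // ~1070
      System.out.println("Unoptimized compound IndexReader: "
          + (long) Math.ceil((mergeFactor - 1) * levels));                    // ~60
      System.out.println("Optimized non-compound IndexReader: " + (5 + fields));  // 18
      System.out.println("Optimized compound IndexReader: 1");
    }
  }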
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

