[jira] Created: (LUCENE-665) temporary file access denied on Windows

2006-08-25 Thread Doron Cohen (JIRA)
temporary file access denied on Windows
---

 Key: LUCENE-665
 URL: http://issues.apache.org/jira/browse/LUCENE-665
 Project: Lucene - Java
  Issue Type: Bug
  Components: Store
Affects Versions: 2.0.0
 Environment: Windows
Reporter: Doron Cohen
 Attachments: FSDirectory_Retry_Logic.patch, Test_Output.txt, 
TestInterleavedAddAndRemoves.java

When interleaving adds and removes there is frequent opening/closing of readers 
and writers. 

I tried to measure performance in such a scenario (for issue 565), but the 
performance test failed - the indexing process crashed consistently with file 
access denied errors - cannot create a lock file in 
lockFile.createNewFile(), and cannot rename file.

This is related to:
- issue 516 (a closed issue: TestFSDirectory fails on Windows) - 
http://issues.apache.org/jira/browse/LUCENE-516 
- user list questions due to file errors:
  - 
http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
  - 
http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
- discussion on lock-less commits 
http://www.nabble.com/Lock-less-commits-tf2126935.html

My test setup is: XP (SP1), Java 1.5 - both Sun and IBM SDKs. 

I noticed that the problem is more frequent when the locks are created on one 
disk and the index on another. Both are NTFS with the Windows indexing service 
enabled. I suspect this indexing service might be related - keeping files busy 
for a while - but I don't know for sure.

After experimenting with it, I conclude that these problems - at least in my 
scenario - are due to a temporary situation: the FS, or the OS, is 
*temporarily* holding references to files or folders, preventing them from 
being renamed or deleted, and preventing new files from being created in 
certain directories. 

So I added retry logic to FSDirectory for cases where the error was related 
to Access Denied. This is the same approach brought up in 
http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
 - there, in addition to the retry, gc() is invoked (I did not call gc()). It 
is based on the *hope* that an access-denied situation will vanish after a 
small delay, and the retry will succeed.

I modified FSDirectory this way for Access Denied errors when creating a new 
file and when renaming a file.

This worked fine for me. The performance test that failed before now manages 
to complete. There should be no performance implications from this 
modification, because only the cases that would otherwise wrongly fail now 
delay some extra milliseconds and retry.
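
For illustration, a minimal sketch of this kind of retry loop (not the patch 
itself - the constant names and retry bound are hypothetical, and only the 
createNewFile() case is shown; renameFile() gets the same treatment):

  import java.io.File;
  import java.io.IOException;

  public class RetrySketch {
    // Hypothetical names; the patch defines its own delay constant (30ms).
    private static final int RETRY_DELAY_MILLIS = 30;
    private static final int MAX_RETRIES = 10;

    // Retry createNewFile() on IOException, assuming the failure is a
    // *temporary* access-denied condition that clears after a short delay.
    public static boolean createNewFileWithRetry(File f) throws IOException {
      IOException lastError = null;
      for (int i = 0; i < MAX_RETRIES; i++) {
        try {
          return f.createNewFile();
        } catch (IOException e) {
          lastError = e;                     // remember and retry
        }
        try {
          Thread.sleep(RETRY_DELAY_MILLIS);  // hope the condition vanishes
        } catch (InterruptedException ie) {
          throw new IOException(ie.toString());
        }
      }
      throw lastError;                       // still failing - give up
    }
  }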

I am attaching a patch - FSDirectory_Retry_Logic.patch - with these changes 
to FSDirectory. 
All "ant test" tests pass with this patch.

Also attaching a test case that demonstrates the problem - at least on my 
machine. There are two test cases in that test file - one that works in the 
system temp directory (like most Lucene tests) and one that creates the index 
on a different disk. The latter case can only run if the path (D: , tmp) is 
valid.

It would be great if people who have experienced these problems could try out 
this patch and comment on whether it made any difference for them. 

If it turns out useful for others as well, including this patch in the code 
might help relieve some of those frustrating user cases.

A comment on the state of the proposed patch: 
- It is not ready-to-deploy code - it has some debug printing, showing the 
cases where the retry logic actually took place. 
- I am not sure whether the current 30ms is the right delay... why not 50ms? 
10ms? It is currently defined by a constant.
- Should a call to gc() be added? (I think not.)
- Should the retry also be attempted on non-access-denied exceptions? (I 
think not.)
- I feel it is somewhat voodoo programming, but though I don't like it, it 
seems to work... 

Attached files:
1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without 
the patch and passes with the patch.
2. FSDirectory_Retry_Logic.patch
3. Test_Output.txt - output of the test with the patch, on my XP machine. 
Only the createNewFile() case had to be bypassed in this test, but in another 
program I also saw renameFile() being bypassed.

- Doron


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-25 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Doron Cohen updated LUCENE-565:
---

Attachment: TestBufferedDeletesPerf.java
perf-test-res.JPG
perfres.log

I ran a performance test for interleaved adds and removes, comparing 
IndexModifier with NewIndexModifier. 

A few setups were tested, with a few combinations of the number of 
consecutive adds before a delete takes place, maxBufferedDocs, and the total 
number of test iterations, where each iteration does the consecutive adds and 
then the deletes.

Each setup ran in this order - original IndexModifier, new one, original, new 
one - and the best time of the two runs was used.

Results indicate that NewIndexModifier is far faster for most setups. 

Attached are the performance test, the performance results, and the log of 
the run. The performance test is written as a JUnit test, and it fails if the 
original IndexModifier is faster than the new one by more than 1 second (a 
difference smaller than 1 second is considered noise). 
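
As a rough sketch (hypothetical names, not the attached test), the pass/fail 
criterion amounts to an assertion like:

  import junit.framework.Assert;

  class PerfAssertions {
    // Fail only when the original modifier beats the new one by more than
    // one second; smaller differences are treated as measurement noise.
    static final long NOISE_MILLIS = 1000;

    static void assertNewNotSlower(long origMillis, long newMillis) {
      Assert.assertTrue(
          "new code slower by more than 1 sec: orig=" + origMillis
              + "ms, new=" + newMillis + "ms",
          newMillis - origMillis <= NOISE_MILLIS);
    }
  }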

Test was run on XP (SP1) with IBM JDK 1.5.

The test at first failed with access denied errors, due to what seems to be 
an XP issue. So in order to run this test on XP (and probably other Windows 
platforms), the patch from http://issues.apache.org/jira/browse/LUCENE-665 
should be applied first.

It is interesting to note that in addition to the performance gain, 
NewIndexModifier seems less sensitive to the access denied XP problems, 
because it closes/reopens readers and writers less frequently - and indeed, 
at least in my runs, these errors had to be bypassed (by the retry patch) 
only for the current IndexModifier. 

- Doron



 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: IndexWriter.java, IndexWriter.July09.patch, 
 IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, 
 NewIndexWriter.July18.patch, perf-test-res.JPG, perfres.log, 
 TestBufferedDeletesPerf.java, TestWriterDelete.java


 Today, applications have to open/close an IndexWriter and open/close an
 IndexReader directly or indirectly (via IndexModifier) in order to handle a
 mix of inserts and deletes. This performs well when inserts and deletes
 come in fairly large batches. However, the performance can degrade
 dramatically when inserts and deletes are interleaved in small batches.
 This is because the ramDirectory is flushed to disk whenever an IndexWriter
 is closed, causing a lot of small segments to be created on disk, which
 eventually need to be merged.
 We would like to propose a small API change to eliminate this problem. We
 are aware that this kind of change has come up in discussions before. See
 http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
 . The difference this time is that we have implemented the change and
 tested its performance, as described below.
 API Changes
 ---
 We propose adding a deleteDocuments(Term term) method to IndexWriter.
 Using this method, inserts and deletes can be interleaved using the same
 IndexWriter.
 Note that, with this change it would be very easy to add another method to
 IndexWriter for updating documents, allowing applications to avoid a
 separate delete and insert to update a document.
 Also note that this change can co-exist with the existing APIs for deleting
 documents using an IndexReader. But if our proposal is accepted, we think
 those APIs should probably be deprecated.
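 A minimal usage sketch of the proposal (deleteDocuments(Term) is the
 proposed addition; everything else is existing Lucene API, and the index
 path is illustrative):

   import org.apache.lucene.analysis.WhitespaceAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.Term;

   public class DeleteViaWriter {
     public static void main(String[] args) throws Exception {
       IndexWriter writer =
           new IndexWriter("/tmp/index", new WhitespaceAnalyzer(), true);
       Document doc = new Document();
       doc.add(new Field("id", "42", Field.Store.YES,
                         Field.Index.UN_TOKENIZED));
       writer.addDocument(doc);
       // Proposed method: buffers the term; the actual deletes are
       // deferred until buffered documents are flushed or the writer
       // is closed.
       writer.deleteDocuments(new Term("id", "42"));
       writer.close();
     }
   }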
 Coding Changes
 --
 Coding changes are localized to IndexWriter. Internally, the new
 deleteDocuments() method works by buffering the terms to be deleted.
 Deletes are deferred until the ramDirectory is flushed to disk, either
 because it becomes full or because the IndexWriter is closed. Using Java
 synchronization, care is taken to ensure that an interleaved sequence of
 inserts and deletes for the same document are properly serialized.
 We have attached a modified version of IndexWriter in Release 1.9.1 with
 these changes. Only a few hundred lines of coding changes are needed. All
 changes are marked with the comment CHANGE. We have also attached a modified version
 of an example from Chapter 2.2 of Lucene in Action.
 Performance Results
 ---
 To test the performance of our proposed changes, we ran some experiments
 using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz
 Intel Xeon server running Linux. The disk storage was configured as a RAID0
 array with 5 drives. Before indexes were built, the input documents were
 parsed to remove the HTML from them (i.e., only the text was indexed). This
 was done to minimize the impact of parsing on performance. A simple
 WhitespaceAnalyzer was used during index build.
 We experimented with three workloads:
   - Insert only. 1.6M documents were inserted and the final
     index size was 2.3GB.
   - Insert/delete (big batches). The same documents were
     inserted, but 25% were deleted. 1000 documents were
     deleted for every 4000 inserted.
   - Insert/delete (small batches). In this case, 5 documents
     were deleted for every 20 inserted.

                                current      current        new
 Workload                       IndexWriter  IndexModifier  IndexWriter
 ----------------------------------------------------------------------
 Insert only                    116 min      119 min        116 min
 Insert/delete (big batches)    --           135 min        125 min
 Insert/delete (small batches)  --           338 min        134 min

 As the experiments show, with the proposed changes, 

[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-08-27 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12430919 ] 

Doron Cohen commented on LUCENE-665:


 just to confirm, is it the COMMIT lock that's throwing these 
 unhandled exceptions (not the WRITE lock)? 
 If so, lockless commits would fix this. 

In my tests so far, these errors appeared only for commit locks. However, I 
consider this a coincidence - as far as I can understand, there is nothing 
special about commit locks compared to write locks - in particular, they both 
use createNewFile(). So I agree that lockless commits would prevent this, 
which is good, but we cannot count on it not happening for write locks as 
well. 

Also, the more I think about it, the more I like lock-less commits; still, 
they would take a while to get into Lucene, while this simple fix can help 
now.

Last, even with lock-less commits there would still be calls to 
createNewFile() for the write lock, and there would be intensive calls to 
renameFile() and other file IO operations. Having safety code like the retry 
logic, invoked only in the rare cases of these unexpected errors, would 
reduce some nasty failures and make more users happy.

 Can you provide more details on the exceptions you're seeing? 
 Especially on the cannot rename file exception? 

Here is one from my run log; it occurs at the call to optimize(), after the 
end of all the add-remove iterations -

[junit] java.io.IOException: Cannot rename C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deleteable.new to C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable
[junit] at org.apache.lucene.store.FSDirectory.doRenameFile(FSDirectory.java:328)
[junit] at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:280)
[junit] at org.apache.lucene.index.IndexWriter.writeDeleteableFiles(IndexWriter.java:967)
[junit] at org.apache.lucene.index.IndexWriter.deleteSegments(IndexWriter.java:911)
[junit] at org.apache.lucene.index.IndexWriter.commitChanges(IndexWriter.java:872)
[junit] at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:823)
[junit] at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:798)
[junit] at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:614)
[junit] at org.apache.lucene.index.IndexModifier.optimize(IndexModifier.java:304)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.doOptimize(TestBufferedDeletesPerf.java:266)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.measureInterleavedAddRemove(TestBufferedDeletesPerf.java:218)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.doTestBufferedDeletesPerf(TestBufferedDeletesPerf.java:144)
[junit] at org.apache.lucene.index.TestBufferedDeletesPerf.testBufferedDeletesPerfCase7(TestBufferedDeletesPerf.java:134)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:585)
[junit] at junit.framework.TestCase.runTest(TestCase.java:154)
[junit] at junit.framework.TestCase.runBare(TestCase.java:127)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:118)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:208)
[junit] at junit.framework.TestSuite.run(TestSuite.java:203)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:297)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:672)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:567)
[junit] Caused by: java.io.FileNotFoundException: C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable (Access is denied)
[junit] at java.io.FileOutputStream.open(Native Method)
[junit] at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
[junit] at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
[junit] at org.apache.lucene.store.FSDirectory.doRenameFile(FSDirectory.java:312)
[junit] ... 27 more

This exception, btw, is from the performance test for 
interleaved adds and removes - issue 565 - so the IndexWriter line numbers 
here relate to applying the recent patch from issue 565 (though the same 
errors are obtained with 

[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-08-28 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12431100 ] 

Doron Cohen commented on LUCENE-665:


 obtain() is supposed to return success or failure immediately. 
 I'd be tempted to override obtain(timeout) for FS locks and keep the retry 
 logic there. 

Right, that is the right place for the retry. This way changes are limited to 
FSDirectory, and obtain() remains unchanged. 

I am testing this now and will submit an updated patch, where:
- UNEXPECTED_ERROR_RETRY_DELAY is set to 100ms.
- the timeout in obtain(timeout) is always respected (even in the presence of 
those unexpected IO errors).
- IOExceptions bubble up as discussed.
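
A minimal sketch of the idea (illustrative, not the patch itself): keep 
obtain() untouched and put the retry in obtain(timeout), so the overall 
timeout is still respected and the last IOException bubbles up:

  import java.io.IOException;

  abstract class RetryingLockSketch {
    static final long UNEXPECTED_ERROR_RETRY_DELAY = 100; // ms

    // The concrete lock's immediate attempt; may throw IOException on a
    // temporary "Access is denied" condition.
    protected abstract boolean doObtain() throws IOException;

    public boolean obtain(long lockWaitTimeout) throws IOException {
      long deadline = System.currentTimeMillis() + lockWaitTimeout;
      IOException lastError = null;
      while (true) {
        try {
          if (doObtain()) {
            return true;                  // lock acquired
          }
          lastError = null;               // lock simply busy - no IO error
        } catch (IOException e) {
          lastError = e;                  // unexpected IO error - retry
        }
        if (System.currentTimeMillis() >= deadline) {
          break;                          // timeout always respected
        }
        try {
          Thread.sleep(UNEXPECTED_ERROR_RETRY_DELAY);
        } catch (InterruptedException ie) {
          throw new IOException(ie.toString());
        }
      }
      if (lastError != null) {
        throw lastError;                  // bubble up as discussed
      }
      return false;                       // timed out, lock still held
    }
  }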


[jira] Commented: (LUCENE-635) [PATCH] Decouple locking implementation from Directory implementation

2006-08-29 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-635?page=comments#action_12431341 ] 

Doron Cohen commented on LUCENE-635:


While updating my patch for 665 according to the changes here, I noticed 
something - I may be wrong here - but it seems to me that until this change, 
all the actual FS access operations were performed by FSDirectory, using the 
Directory API. 

The new SimpleFSLock and SimpleFSLockFactory also access the FS directly, not 
through the FSDirectory API.

The Directory abstraction in Lucene allows developing Lucene-in-RAM, 
Lucene-in-DB, etc. It is a nice feature. 

Guess we can say: well, now the abstraction is made of two interfaces - Lock 
and Directory - just make sure you use 'matching' implementations of them. 
This seems weaker than before.

Or, we can limit all file access to go through FSDirectory - 
one possibility is to add a Directory object to LockFactory (as a class 
member); SimpleFSLockFactory can require that Directory object to be an 
FSDirectory (cast, and fail otherwise); also, FSDirectory would need to be 
extended with createSingleFile(), mkdirs() and isDirectory(). A sketch 
follows.
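
A rough sketch of this suggestion (hypothetical names; createSingleFile() is 
one of the proposed FSDirectory extensions, not existing API):

  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // A LockFactory variant that operates in the realm of a Directory and
  // routes all actual store access through it.
  abstract class DirectoryLockFactorySketch {
    protected Directory dir;

    void setDirectory(Directory dir) {
      this.dir = dir;
    }
  }

  class SimpleFSLockFactorySketch extends DirectoryLockFactorySketch {
    void setDirectory(Directory dir) {
      // Require a 'matching' implementation: cast, and fail otherwise.
      if (!(dir instanceof FSDirectory)) {
        throw new IllegalArgumentException("requires an FSDirectory");
      }
      super.setDirectory(dir);
      // Lock files would then be created via the proposed extensions,
      // e.g. ((FSDirectory) dir).createSingleFile(lockName).
    }
  }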

 [PATCH] Decouple locking implementation from Directory implementation
 -

 Key: LUCENE-635
 URL: http://issues.apache.org/jira/browse/LUCENE-635
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.0
Reporter: Michael McCandless
 Assigned To: Yonik Seeley
Priority: Minor
 Fix For: 2.0.1

 Attachments: LUCENE-635-Aug27.patch, LUCENE-635-Aug3.patch, 
 patch-Jul26.tar


 This is a spinoff of http://issues.apache.org/jira/browse/LUCENE-305.
 I've opened this new issue to capture that its scope is wider than
 LUCENE-305's.
 This is a patch originally created by Jeff Patterson (see above link)
 and then modified as described here:
   http://issues.apache.org/jira/browse/LUCENE-305#action_12418493
 with some small additional changes:
   * For each FSDirectory.getDirectory(), I made a corresponding
 version that also accepts a LockFactory instance.  So, you can
 construct an FSDirectory with your own LockFactory.
   * Cascaded defaulting for FSDirectory's LockFactory implementation:
 if you pass in a LockFactory instance, it's used; else if
 setDisableLocks was called, we use NoLockFactory; else, if the
 system property org.apache.lucene.store.FSDirectoryLockFactoryClass
 is defined, we use that; finally, we'll use the original locking
 implementation (SimpleFSLockFactory).
 The gist is that all locking code has been moved out of *Directory and
 into subclasses of a new abstract LockFactory class.  You can now set
 the LockFactory of a Directory to change how it does locking.  For
 example, you can create an FSDirectory but set its locking to
 SingleInstanceLockFactory (if you know all writing/reading will take
 place within a single JVM).
 The changes pass all unit tests (on Ubuntu Linux Sun Java 1.5 and
 Windows XP Sun Java 1.4), and I added another TestCase to test the
 LockFactory code.
 Note that LockFactory defaults are not changed: FSDirectory defaults
 to SimpleFSLockFactory and RAMDirectory defaults to
 SingleInstanceLockFactory.
 Next step (separate issue) is to create a LockFactory that uses the OS
 native locks (through java.nio).



[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-29 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12431354 ] 

Doron Cohen commented on LUCENE-565:


Is it that results that were returned before are suddenly (say, after 
updates) not returned anymore (indicating something bad happened to the 
existing index)?

Or is it that the search does not reflect recent changes? 

I don't remember how often Solr closes and re-opens the writer/modifier... 
With this patch a delete does not immediately cause a flush to disk - so 
flushes are controlled by closing the NewIndexModifier (and re-opening, since 
there is no flush() method) and by the limits for max-buffered-docs and 
max-buffered-deletes. If this seems relevant to your case, what limits are in 
effect?


[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-29 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12431419 ] 

Doron Cohen commented on LUCENE-565:


Just to make sure on the scenario - are you 
(1) using NewIndexModifier at all, or 
(2) just letting Solr use this IndexWriter (with the code changes introduced 
to enable NewIndexModifier) instead of Lucene's svn-head (or a certain 
release) IndexModifier? 

As is, Solr would not use NewIndexModifier or IndexModifier at all. 

For case (2) above, the buffered-deletes logic is not in effect at all. 
 
I wonder if it is possible to re-create this with a simple stand-alone Lucene 
(test) program rather than with Solr - it would be easier to analyze.


[jira] Updated: (LUCENE-665) temporary file access denied on Windows

2006-08-30 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
---

Attachment: FSDirs_Retry_Logic_3.patch

I am attaching an updated patch - FSDirs_Retry_Logic_3.patch.

In this update: 
- merged with the code changes from issue 635 (decouple locking from 
directory).
- modified per the recommendations in the comments above:
  - do not rely on specific exception message text.
  - override lock.obtain(timeout) and handle unexpected exceptions there.
  - do not modify the logic of obtain() (no changes to this method).
- UNEXPECTED_ERROR_RETRY_DELAY set to 100ms.
- debug prints commented out.

All "ant test" tests pass.
My stress IO test passes as well.





[jira] Commented: (LUCENE-635) [PATCH] Decouple locking implementation from Directory implementation

2006-08-30 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-635?page=comments#action_12431666 ] 

Doron Cohen commented on LUCENE-635:


 We could (as you're suggesting) indeed extend FSDirectory so that it 
 provided the low level methods required by a locking implementation, 
 and then alter SimpleFSLockFactory/NativeFSLockFactory (or make a new 
 LockFactory) so that all underlying IO is through the FSDirectory instead.

Yes, this is exactly (and only) what I am suggesting to consider - to include 
a Directory member within the LockFactory, so that it is clear that any 
LockFactory implementation operates in the realm of a Directory 
(implementation) and uses it for all actual store accesses.



[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-08-31 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12431801 ] 

Doron Cohen commented on LUCENE-665:


I think I know which software is causing/exposing this behavior in my 
environment: the SVN client I am using - TortoiseSVN. 

I tried the following sequence:
 1) Ran with TortoiseSVN installed - the test generates these access denied 
errors (and bypasses them). 
 2) Uninstalled TortoiseSVN (+ reboot), ran the test - it passed with no 
access denied errors. 
 3) Installed TortoiseSVN again (+ reboot), ran the test - same access denied 
errors again. 

I am using the most recent stable TortoiseSVN version - 1.3.5 build 6804 - 
32 bit, for svn-1.3.2, downloaded from http://tortoisesvn.tigris.org/.

There is an interesting discussion thread about this type of error on Windows 
platforms in the svn forums - http://svn.haxx.se/dev/archive-2003-10/0136.shtml. 
In that case it was svn itself that suffered from these errors.

It says: "...Windows allows applications to tag along to see when a file has 
been written - they will wait for it to close and then do whatever they do, 
usually opening a file descriptor or handle. This would prevent that file 
from being renamed for a brief period..."

TortoiseSVN is a shell extension integrated into Windows Explorer. As such, 
it probably exhibits the tag-along behavior described above.

(BTW, it is a great svn client in my opinion.)

Here is another excerpt from that discussion thread - 

 sleep(1) would work, I suppose. ;~) 

 Most of the time, but not all the time. The only way I've made it work 
 well on all the machines I've tried it on is to put it into a sleep(1) 
 and retry loop of at *least* 20 or so attempts. Anything less and it 
 still fails on some machines. That implies it is very dependent on 
 machine speed or something, which means sleep times/retry times are just 
 guessing games at best. 

 If I could just get it recreated outside of Subversion and prove it's a 
 Microsoft problem...although it probably still wouldn't get fixed for 
 months at least. 

We don't know that this is a bug in TortoiseSVN.
We cannot tell whether there are other such tag-along applications on users' 
machines.
One cannot seriously expect this Win32 behavior to be fixed.

I guess the question is - is it worth it for Lucene to attempt to at least 
reduce the chances of failure in this case? (I say yes :-)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-31 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Doron Cohen updated LUCENE-565:
---

Attachment: perf-test-res2.JPG

Updated performance test results - perf-test-res2.JPG - on average, the new 
code is *9* times faster!

What changed? In the previous test I forgot to set max-buffered-deletes. 

After fixing that, I removed the test cases with a max-buffer of 5,000 and 
up, because they consumed too much memory, and added more practical (I think) 
cases of 2000 and 3000. 

Here is a textual summary of the data in the attached image:

max buf add/del      10     10    100   1000   2000   3000
iterations            1     10    100    100    200    300
adds/iteration       10     10     10     10     10     10
dels/iteration        5      5      5      5      5      5
orig time (sec)    0.13   0.86   9.57   8.88  22.74  44.01
new time (sec)     0.20   0.95   1.74   1.30   2.16   3.08
Improvement (sec) -0.07  -0.09   7.83   7.58  20.58  40.94
Improvement (%)    -55%   -11%    82%    85%    90%    93%

Note: in the first two cases the new code is slower (by 55% and 11%), but 
these are very short test cases - the absolute difference there is less than 
100ms, compared to the other cases, where the difference is measured in 
seconds and tens of seconds.


[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-01 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12432216 ] 

Doron Cohen commented on LUCENE-565:


I agree - I also suspected it might change the merge behavior (and also 
remembered the repeated attempts it took to get that simple IndexWriter 
buffered-docs patch correct... :-). 

Guess I just wanted to get a feeling for whether there is interest in 
including this patch before I delve into it too much - and the perf test was 
meant to show myself whether it really helps. I was a bit surprised that it 
is 9 times faster in an interleaved add/delete scenario. Guess this by itself 
now justifies delving into this patch, analyzing the merge behavior as you 
suggest - will do - I think ideally this patch should not modify the merge 
behavior.

About the test - I was trying to test what I thought is a realistic usage 
scenario (max-buf, etc.) - I have a fixed version of the perf test that is 
easier to modify for different scenarios - I can upload it here if there is 
interest.




[jira] Updated: (LUCENE-665) temporary file access denied on Windows

2006-09-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
---

Attachment: FSWinDirectory.patch

The attached patch - FSWinDirectory.patch - implements the retry logic for FS 
operations in a separate, non-default directory class, as discussed above. 

By default this new class is not used. Applications can start using it by 
replacing the IMPL class in FSDirectory with the new class, FSWinDirectory. 

There are two ways to do this - by setting a system property (this is the 
original mechanism), or by calling FSDirectory's new static method 
setFSDirImplClass(name); a sketch follows.
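
A sketch of the two selection mechanisms (illustrative only - the system 
property key is assumed from the FSDirectory sources of that era, and the 
package of FSWinDirectory is assumed to be org.apache.lucene.store):

  public class SelectFSWinDirectory {
    public static void main(String[] args) {
      // 1) System property (the original mechanism) - must be set before
      //    the first FSDirectory.getDirectory() call.
      System.setProperty("org.apache.lucene.FSDirectory.class",
                         "org.apache.lucene.store.FSWinDirectory");

      // 2) Or the new static method the patch adds to FSDirectory:
      // org.apache.lucene.store.FSDirectory.setFSDirImplClass(
      //     "org.apache.lucene.store.FSWinDirectory");
    }
  }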

There are 3 new classes in this patch: 
- FSWinDirectory (extends FSDirectory)
- SimpleFSWinLockFactory (extends SimpleFSLockFactory)
- TestWinLockFactory (extends TestLockFactory)

A few simple modifications were required in FSDirectory, SimpleFSLockFactory 
and TestLockFactory in order to allow inheritance.

Tests:
- ant test passes with the new code.
- I modified my copy of build-common.xml to set a system property so that the 
new FSWinDirectory class was always in effect, and ran the tests - all 
passed. 
- My stress test TestInterleavedAddAndRemoves fails in my environment by 
default and passes when FSWinDirectory is in effect.


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-22 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436980 ] 

Doron Cohen commented on LUCENE-675:


A few things that would be nice to have in this performance package/framework: 

- indexing only: overall time.
- indexing only: time changes as the index grows (it might be the case that 
indexing performance starts to misbehave from a certain size or so).
- search, single user, while indexing.
- search only, single user.
- search only, concurrent users.
- short queries.
- long queries.
- wild card queries.
- range queries.
- queries with rare words.
- queries with common words.
- tokenization/analysis only (the indexing measurements above include 
tokenization, but it would be important to be able to prove to oneself that 
tokenization/analysis time is not hurt by a recent change).

- parametric control over:
  - location of test input data.
  - location of output index.
  - location of output log/results.
  - total collection size (total number of bytes/characters read from the 
collection).
  - document (average) size (bytes/chars) - the test can break input data and 
recompose it into documents of the desired size.
  - implicit iteration size - merge-factor, max-buffered-docs.
  - explicit iteration size - how often the perf test calls
  - long queries text.
  - short queries text.
  - which parts of the test framework capabilities to run.
  - number of users / threads.
  - queries pace - how many queries are fired in, say, a minute.

Additional points:
- It would help if all test run parameters were maintained in a properties (or 
XML config) file, so one can easily modify the test input/output without having 
to recompile the code (see the sketch after this list).
- Output should allow easy creation of graphs or so - perhaps best would be to 
have a result object, so others can easily extend with additional output 
formats.
- index size as part of output.
- number of index files as part of output (?)
- an indexing input module that can loop over the input collection. This allows 
testing the indexing of a collection larger than the actual input collection 
being used.
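As a sketch of the properties-file point above (all property names here are
made up for illustration):

  import java.io.FileInputStream;
  import java.io.IOException;
  import java.util.Properties;

  class BenchmarkConfigSketch {
    public static void main(String[] args) throws IOException {
      Properties p = new Properties();
      p.load(new FileInputStream("benchmark.properties"));
      String inputDir = p.getProperty("benchmark.input.dir", "work/input");
      String indexDir = p.getProperty("benchmark.index.dir", "work/index");
      int mergeFactor = Integer.parseInt(p.getProperty("index.merge.factor", "10"));
      int maxBuffered = Integer.parseInt(p.getProperty("index.max.buffered.docs", "10"));
      System.out.println("indexing " + inputDir + " into " + indexDir
          + " (mergeFactor=" + mergeFactor + ", maxBufferedDocs=" + maxBuffered + ")");
      // ... build the index and record timings here ...
    }
  }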



 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: LuceneBenchmark.java


 We need an objective way to measure the performance of Lucene, both indexing 
 and querying, on a known corpus. This issue is intended to collect comments 
 and patches implementing a suite of such benchmarking tests.
 Regarding the corpus: one of the widely used and freely available corpora is 
 the original Reuters collection, available from 
 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
 or 
 http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
  I propose to use this corpus as a base for benchmarks. The benchmarking 
 suite could automatically retrieve it from known locations, and cache it 
 locally.




[jira] Updated: (LUCENE-665) temporary file access denied on Windows

2006-09-27 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
---

Attachment: FSWinDirectory_26_Sep_06.patch

Updated the patch according to review comments by Hoss, plus:
- protect currMillis usage from system clock modifications.
- all Win-specific code in a single Java file with two inner classes, for 
cleaner javadocs (now waitForRetry() is private).
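The core retry idea, as a minimal sketch (names and numbers are illustrative;
counting attempts instead of comparing wall-clock times is one simple way to be
robust to system clock changes):

  import java.io.File;
  import java.io.IOException;

  class RetrySketch {
    private static final int RETRY_DELAY_MS = 30; // defined by a constant, as in the patch
    private static final int MAX_RETRIES = 10;

    static void renameWithRetry(File from, File to) throws IOException {
      for (int attempt = 0; ; attempt++) {
        if (from.renameTo(to))
          return;                                // success
        if (attempt >= MAX_RETRIES)
          throw new IOException("cannot rename " + from + " to " + to);
        try {
          Thread.sleep(RETRY_DELAY_MS);          // hope the transient hold clears
        } catch (InterruptedException ie) {
          throw new IOException(ie.toString());
        }
      }
    }
  }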

Tested as with the previous patch: 
- ant test passes with the new code. 
- For testing, I modified build-common.xml to set a system property so that the new 
FSWinDirectory class was always in effect and ran the tests - all passed. 
- my stress test TestInterleavedAddAndRemoves fails in my env by default and 
passes when FSWinDirectory is in effect. 



[jira] Updated: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-09-28 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-664?page=all ]

Doron Cohen updated LUCENE-664:
---

Attachment: boosts_plus_scoring_formula.patch

(1) added a section in Scoring.xml for Search Results Boosts, on ways to 
boost in Lucene, at search time and at indexing time. 

(2) updated the presentation of the scoring formula in Similarity.java, to:
- closely reflect the scoring code/process.
- distinguish between indexing-time factors and search-time factors, and
- point to differences between a scoring notion (e.g. tf, idf) and the way it is 
computed.

As a result, the scoring formula is presented differently in Similarity.java and 
in Scoring.html. I can update this if there are no objections to the updated 
formula presentation.


 [PATCH] small fixes to the new scoring.html doc
 ---

 Key: LUCENE-664
 URL: http://issues.apache.org/jira/browse/LUCENE-664
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Affects Versions: 2.0.1
Reporter: Michael McCandless
 Attachments: boosts_plus_scoring_formula.patch, lucene.uxf, 
 scoring-small-fixes.patch, scoring-small-fixes2.patch, 
 scoring-small-fixes3.patch


 This is an awesome initiative.  We need more docs that cleanly explain the 
 inner workings of Lucene in general... thanks Grant & Steve & others!
 I have a few small initial proposed fixes, largely just adding some more 
 description around the components of the formula.  But also a couple typos, 
 another link out to Wikipedia, a missing closing ), etc.  I've only made it 
 through the Understanding the Scoring Formula section so far.




[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-09-29 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12438854 ] 

Doron Cohen commented on LUCENE-664:


Hi Grant, 

For part 1, I am ok with having it after the scoring formula.

For part 2, my motivation was to make it more clear:
- what's inside the sum and what's outside (as you said).
- what's decided at indexing time and what's still controllable at search 
time.
- how boosts and encoding/decoding play in.
- what's fixed and what can be modified by subclassing, say, 
DefaultSimilarity.

So {indexBoost, searchBoost, normalizer} were the tools to clear this up, and 
also to make the formula shorter and easier to read at a glance.

Naturally, after delving so deep into it, it is now clear to me; but you are right, 
it would be good to hear from others how they like this part.

Thanks,
Doron




[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-10-02 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12439370 ] 

Doron Cohen commented on LUCENE-664:


Two quick questions:

 I think 'norm' is a good term for the product of lengthNorm(d) and 
 field boost. That's what it is called consistently in the code and API. 
 This quantity is represented in two places, but seems like a logical 
 candidate for the sort of factoring done here. 

Norm would also include the doc boost, right?
So this means replacing *indexBoost* with *norm* ?

 This could be placed to the left of the sigma, since it does not depend on t. 

I think that norm depends on the field name, and there may be terms of more than 
one field in the query, no?




[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-10-06 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12440592 ] 

Doron Cohen commented on LUCENE-664:


Going to work on this now, according to comments by Doug and Grant.

Will give the include idea a try - a client-side iframe as Chris suggested - and 
see how it works. Iframes don't rely on JavaScript (which might be turned off for 
some users). There are downsides to iframes too - possible scrollbars etc. - so I 
need to see how it looks, and to check whether it is possible to somehow also 
include it in Scoring.html; otherwise I guess we just link to it from there.

http://www.boutell.com/newfaq/creating/include.html






[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-10-07 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12440648 ] 

Doron Cohen commented on LUCENE-664:


I played with including the formula from a separate file, a client-side include. 

=== Summary === 
I think the include is not going to work well enough and hence is not worth 
investing in.
So, bottom line, I give up on include for now, and so I will make the changes 
in Similarity.java.

=== Details === 
I know of 3 ways to do this: JavaScript, iframe, object/embed. 

An iframe can work for both the javadocs and the xdocs - I think that embed would 
also work, though I did not try it.

Both iframe and embed have an appearance problem: you have to decide on the 
size of the frame shown, in pixels. If you set too large an area, blank space 
will remain. If you set too small an area, scrollbars show up. If the 
user changes the text size, the required area changes but not the 
allocated area, so scrollbars or blank space keep appearing and disappearing. 
Very ugly. 

An iframe also has an issue with inner-link navigation: once you navigate to an 
anchor in the iframe part (this works), the back action for some reason 
does not work (in both Firefox and IE).

The JavaScript approach should not have these issues, because the imported text 
becomes part of the embedding page (the imported text is dynamically 
generated). I saw that the javadocs themselves use JavaScript (at least in 1.5), so 
I feel better about using it. However, to use JavaScript you have to put some 
JavaScript code in the HTML header, as well as an onload event in the BODY tag. 
I didn't find a way to do this with javadocs. 

(Another tricky issue with JavaScript is that outgoing links from the imported 
text have the base address of the embedding page. So references going out from 
the embedded text would have to differ between Similarity.html and Scoring.html 
(which are in separate directories). But I think this can be resolved by 
passing a 'base' param to the include() function.)




[jira] Updated: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-10-07 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-664?page=all ]

Doron Cohen updated LUCENE-664:
---

Attachment: scoring_formula_2.patch

I am attaching scoring_formula_2.patch - a modified scoring formula as suggested. 

Additional changes here:
- order of the explanation parts: the detailed norm part moved to the end; tf and idf 
moved to the start, so most of the stuff is visible at first glance.
- links in the formula go to the appropriate explanation bullet.
- the formula itself is framed (border=1) for easier orientation within all the 
other text.





[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-10-10 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12441194 ] 

Doron Cohen commented on LUCENE-664:


One comment for Scoring.html:

The last sentence in the Score Boosting paragraph says:

  At scoring (search) time, this norm is brought into the score 
  of the document as indexBoost, as shown by the formula in Similarity.

To fix this, we should replace indexBoost by norm(t,d).





[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-10-10 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12441280 ] 

Doron Cohen commented on LUCENE-664:


I just noticed that the link to TermScorer in Understanding the Scoring 
Formula is broken because TermScorer has package visibility. 

This can be fixed by instead saying ..., especially the scorer for TermQuery, and 
linking to TermQuery.





[jira] Commented: (LUCENE-678) [PATCH] LockFactory implementation based on OS native locks (java.nio.*)

2006-10-18 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-678?page=comments#action_12443304 ] 

Doron Cohen commented on LUCENE-678:


The patch added a call to writer.close() in TestLockFactory - 
testFSDirectoryTwoCreates().
This is just before the 2nd attempt to create an index writer with override.
This line should probably be removed, as it cancels the second part of that 
test case, right?

 [PATCH] LockFactory implementation based on OS native locks (java.nio.*)
 

 Key: LUCENE-678
 URL: http://issues.apache.org/jira/browse/LUCENE-678
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.1
Reporter: Michael McCandless
 Assigned To: Yonik Seeley
Priority: Minor
 Fix For: 2.0.1

 Attachments: LUCENE-678-patch.txt


 The current default locking for FSDirectory is SimpleFSLockFactory.
 It uses java.io.File.createNewFile for its locking, which has this
 spooky warning in Sun's javadocs: "Note: this method should not be used
 for file-locking, as the resulting protocol cannot be made to work reliably.
 The FileLock facility should be used instead."
 So, this patch provides a LockFactory implementation based on FileLock
 (using java.nio.*).
 All unit tests pass with this patch, on OS X (10.4.8), Linux (Ubuntu
 6.06), and Windows XP SP2.
 Another benefit of native locks is the OS automatically frees them if
 the JVM exits before Lucene can free its locks.  Many people seem to
 hit this (old lock files still on disk) now.
 I've created this new class:
   org.apache.lucene.store.NativeFSLockFactory
 and added a couple test cases to the existing TestLockFactory.
 I've left SimpleFSLockFactory as the default locking for FSDirectory
 for now.  I think we should get some usage / experience with
 NativeFSLockFactory and then later on make it the default locking
 implementation?
 I also tested changing FSDirectory's default locking to
 NativeFSLockFactory and all unit tests still pass (on the above
 platforms).
 One important note about locking over NFS: some NFS servers and/or
 clients do not support it, or, it's a configuration option or mode
 that must be explicitly enabled.  When it's misconfigured it's able to
 take a long time (35 seconds in my case) before throwing an exception.
 To handle this, I acquire & release a random test lock on creating the
 NativeFSLockFactory to verify locking is configured properly.
 A few other small changes in the patch:
 - Added a failure reason to Lock.java so that in
   obtain(lockWaitTimeout), if there is a persistent IOException
   in trying to obtain the lock, this can be messaged & included in
   the "Lock obtain timed out" that's raised.
 - Corrected javadoc in SimpleFSLockFactory: it previously said the
   wrong system property for overriding lock class via system
   properties
 - Fixed unhandled IOException when opening an IndexWriter for
   create, if the locks dir does not exist (just added a
   lockDir.exists() check in the clearAllLocks method of
   SimpleFSLockFactory & NativeFSLockFactory).
 - Fixed a few small unrelated issues with TestLockFactory, and
   also fixed tests to accept NativeFSLockFactory as the default
   locking implementation for FSDirectory.
 - Fixed a typo in javadoc in FieldsReader.java
 - Added some more javadoc for the LockFactory.setLockPrefix




[jira] Commented: (LUCENE-678) [PATCH] LockFactory implementation based on OS native locks (java.nio.*)

2006-10-18 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-678?page=comments#action_1244 ] 

Doron Cohen commented on LUCENE-678:


Michael, I must be misunderstanding something then...

 That test case is verifying that the 2nd index writer indeed removes 
 any leftover lockfiles created by the first one. 

Can there be any leftovers once the first writer was closed?

 It did not intend to test the case (but previously it was)..

Could you explain why the change?

Thanks,
Doron




[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-25 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_12444742 ] 

Doron Cohen commented on LUCENE-686:


An example of how current Lucene code relies on not having to close resources, 
in PhraseQuery:

  scorer(IndexReader reader) {
    ...
    for (int i = 0; i < terms.size(); i++) {
      TermPositions p = reader.termPositions((Term)terms.elementAt(i));
      if (p == null)
        return null;    // <-- change would be required here
      tps[i] = p;
    }

If close() has to be respected, this code would need to change to close all 
TermPositions that were obtained before the one that was not found.
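A sketch of what such a change might look like (the helper and its context are 
ours, for illustration):

  import java.io.IOException;
  import java.util.Vector;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  class CloseOnFailureSketch {
    static TermPositions[] openAll(IndexReader reader, Vector terms) throws IOException {
      TermPositions[] tps = new TermPositions[terms.size()];
      for (int i = 0; i < terms.size(); i++) {
        TermPositions p = reader.termPositions((Term) terms.elementAt(i));
        if (p == null) {
          for (int j = 0; j < i; j++)
            tps[j].close();   // close the ones obtained before the miss
          return null;
        }
        tps[i] = p;
      }
      return tps;
    }
  }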

 Resources not always reclaimed in scorers after each search
 ---

 Key: LUCENE-686
 URL: http://issues.apache.org/jira/browse/LUCENE-686
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
 Environment: All
Reporter: Ning Li
 Attachments: ScorerResourceGC.patch


 Resources are not always reclaimed in scorers after each search.
 For example, close() is not always called for term docs in TermScorer.
 A test will be attached to show when resources are not reclaimed.




[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-25 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12444744 ] 

Doron Cohen commented on LUCENE-697:


I can reproduce this by uncommenting this line. 

Interesting to notice that:

(1) the sequence next() next() skip() skip() next() next() (!= instead of 
==) passes the tests.
This is expected, because the problem is in the initialization (at least in 
PhraseQuery). 

(2) the sequence next() skip() next() ((..0x01)!=0 instead of 
(..0x02)==0) does not pass.
This is surprising to me, because seemingly there are no initialization issues 
here.
But I think the cause is that, at least in PhraseQuery, it is not just an 
initialization issue.

Yonik, this is unassigned - are you working on a fix for this?

 Scorer.skipTo affects sloppyPhrase scoring
 --

 Key: LUCENE-697
 URL: http://issues.apache.org/jira/browse/LUCENE-697
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.0.0
Reporter: Yonik Seeley

 If you mix skipTo() and next(), you get different scores than what is 
 returned to a hit collector.




[jira] Assigned: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-25 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen reassigned LUCENE-697:
--

Assignee: Doron Cohen




[jira] Commented: (LUCENE-569) NearSpans skipTo bug

2006-10-27 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-569?page=comments#action_12445284 ] 

Doron Cohen commented on LUCENE-569:


It seems that having assert() in NearSpansOrdered.java now requires Java 
1.5 in order to compile Lucene. This would require 1.5 for running Lucene. Do 
we want to include this now?

 NearSpans skipTo bug
 

 Key: LUCENE-569
 URL: http://issues.apache.org/jira/browse/LUCENE-569
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Hoss Man
 Assigned To: Hoss Man
 Attachments: common-build.assertions.patch, LUCENE-569.ubber.patch, 
 NearSpans20060903.patch, NearSpansOrdered.java, NearSpansUnordered.java, 
 SpanNearQuery20060622.patch, SpanScorer.explain.testcase.patch, 
 TestNearSpans.java


 NearSpans appears to have a bug in skipTo that causes it to skip over some 
 matching documents completely.  I discovered this bug while investigating 
 problems with SpanWeight.explain, but as far as I can tell the Bug is not 
 specific to Explanations ... it seems like it could potentially result in 
 incorrect matching in some situations where a SpanNearQuery is nested in 
 another query such that skipTo will be used ... I tried to create a high level 
 test case to exploit the bug when searching, but i could not.  TestCase 
 exploiting the class using NearSpan and SpanScorer will follow...




[jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-27 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen updated LUCENE-697:
---

Attachment: sloppy_phrase_skipTo.patch

This was tricky, for me anyhow, but I think I found it.

The difference in scoring between using next() and using skipTo() (or a 
combination of the two) is caused by two (valid) orders of the sorted 
PhrasePositions. 

Currently PhrasePositions sorting is defined by doc and position, where 
position already considers the offset of the term within the (phrase) query. 

If, however, two TermPositions have the same doc and the same position, the sort 
makes no decision, and either order is a valid sort (by the current sort 
definition). The difference between using next() and skipTo() in this regard is 
that skipTo() always calls sort(), sorting the entire set, while next() only calls 
sort() at initialization and then maintains the sorting as part of the scoring 
process. 

This would be clearer with the following example - taken from Yonik's test case 
that is failing now:
   - Doc1: w1 w3 w2 w3 zz
   - Query: "w3 w2"~2
When starting to score this doc, both PhrasePositions pp(w3) and pp(w2) have 
doc(w2)=doc(w3)=1.
Note that for the second w3 that matches, we would have pos(w2)=2+1=3 and 
pos(w3)=3+0=3. 

So, after scoring the first "w3 w2" match in doc1, if the sort result places 
pp(w2) at the top, we would also score the second "w3 w2" match. However, if 
pp(w3) is placed by the sort at the top (== smallest), we would not also score 
the second "w3 w2" match. 

Current behavior is inconsistent: skip() would take both while next() won't, 
and I think it is possible to create a case where it would be the other way 
around. So definitely the behavior should be made consistent. 

The next question to be asked is: do we want to sum (or max) the frequency over 
both (or more) cases? I think yes - sum. 

To fix this I am changing the PhrasePositions comparison, so that in case positions 
are equal, the actual position in the document (ignoring the offset in the query 
phrase) is considered. 

In addition, I added missing calls to clear the priority queue before starting 
to sort and to mark that no more initialization is required when skipTo() is 
called. 
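A sketch of the comparison change (field names here are illustrative, not the 
actual PhrasePositions members):

  class PhrasePositionsSketch {
    int doc;          // current document
    int position;     // position plus the term's offset within the query phrase
    int truePosition; // actual position in the document, ignoring the query offset

    boolean lessThan(PhrasePositionsSketch other) {
      if (doc != other.doc)
        return doc < other.doc;
      if (position != other.position)
        return position < other.position;
      return truePosition < other.truePosition; // the added tie-break
    }
  }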

I tested with the sequence that Yonik added:
- skip skip next next skip skip 
And also with the sequences:
- skip skip skip skip skip skip
- next next next next next next 
- skip next skip next skip next 
- next skip next skip next skip
- next next skip skip next next
The latter 5 cases are now commented out, the first case is in effect.

This scoring code still does not feel natural to me, so (actually, as always) 
comments will be appreciated.

- Doron




[jira] Commented: (LUCENE-569) NearSpans skipTo bug

2006-10-27 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-569?page=comments#action_12445294 ] 

Doron Cohen commented on LUCENE-569:


Chris Hostetter wrote:

 Really? ... the build.xml currently sets the javac -source and -target to
 1.4 so if that were true i would expect it to have failed, and the
 documentation for J2SE 1.4.2 indicates that assertion support exists in
 1.4.
 
 while writing this i attempted an ant clean test using Java 1.4 and
 everything seemed to work fine.

You are right, Chris - my mistake. Compilation passed for me with 1.5 but failed 
with 1.4, so I assumed this was the case; but apparently for 1.4 I had 1.3 set for 
source compatibility (in Eclipse). I changed it to 1.4 and it passed with no 
problems.

Sorry for this noise,
Doron




[jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-27 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen updated LUCENE-697:
---

Lucene Fields: [Patch Available]  (was: [New])




[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-10-29 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12445507 ] 

Doron Cohen commented on LUCENE-665:


Michael, I am not able to reproduce this with native locks (I did not try with 
lock-less commits).
Which brings me to think that native locks should be made the default? 

There is another thing that bothers me with locks, in NFS or other shared-fs 
situations:
Locks are maintained in a specified folder, but a lock file name is derived 
from the full path of the index dir - actually the canonical name of this dir. 
So, if the same index is accessed by two machines, the drive / mount / fs 
root of that index dir must be named the same on all the machines on which 
Lucene is invoked to access/maintain that index. 

The documentation for File.getCanonicalPath() says that it is system dependent. 
So I am not even sure how it can be guaranteed that Lucene used on Linux and 
Lucene used on Windows (say) that access the same index would be able to lock 
on the same index. And for two Windows machines, an admin would have to verify 
that the index fs (samba/afs/nfs) mounts with the same drive letter. 

This seems like a limitation on one hand, and also a source of possible 
problems on the other, when users misconfigure their mount names. 

I may be missing something trivial here, because it seems too wrong to be 
true... I'll let the list comment on that...
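To illustrate the concern (the name derivation below is a simplification, not 
Lucene's exact scheme):

  import java.io.File;
  import java.io.IOException;

  class LockNameSketch {
    static String lockName(File indexDir) throws IOException {
      String canonical = indexDir.getCanonicalPath(); // system dependent
      return "lucene-" + Integer.toHexString(canonical.hashCode()) + ".lock";
    }
    // lockName(new File("Z:/index")) and lockName(new File("Y:/index")) yield
    // different lock files even when both mounts point at the same shared index.
  }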


[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-10-30 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12445724 ] 

Doron Cohen commented on LUCENE-665:


 Odd that just by using native locking, it stopped your issues.

Agree. I did not expect that to happen, since indeed I saw in the past 
exceptions on renameFile, though most exceptions were in lock activity. So I 
ran it many times, with an antivirus scan, etc. But it always passes. Therefore 
I would not object to closing this issue - if I cannot test it I cannot fix it. 
But for the same reason, I would like to see native locks becoming the default.

 setLockPrefix() 

I'll take this one to a separate thread on the dev list.


 temporary file access denied on Windows
 ---

 Key: LUCENE-665
 URL: http://issues.apache.org/jira/browse/LUCENE-665
 Project: Lucene - Java
  Issue Type: Bug
  Components: Store
Affects Versions: 2.0.0
 Environment: Windows
Reporter: Doron Cohen
 Attachments: FSDirectory_Retry_Logic.patch, 
 FSDirs_Retry_Logic_3.patch, FSWinDirectory.patch, 
 FSWinDirectory_26_Sep_06.patch, Test_Output.txt, 
 TestInterleavedAddAndRemoves.java


 When interleaving adds and removes there is frequent opening/closing of 
 readers and writers. 
 I tried to measure performance in such a scenario (for issue 565), but the 
 performance test failed  - the indexing process crashed consistently with 
 file access denied errors - cannot create a lock file in 
 lockFile.createNewFile() and cannot rename file.
 This is related to:
 - issue 516 (a closed issue: TestFSDirectory fails on Windows) - 
 http://issues.apache.org/jira/browse/LUCENE-516 
 - user list questions due to file errors:
   - 
 http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
   - 
 http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
 - discussion on lock-less commits 
 http://www.nabble.com/Lock-less-commits-tf2126935.html
 My test setup is: XP (SP1), JAVA 1.5 - both SUN and IBM SDKs. 
 I noticed that the problem is more frequent when locks are created on one 
 disk and the index on another. Both are NTFS with Windows indexing service 
 enabled. I suspect this indexing service might be related - keeping files 
 busy for a while, but don't know for sure.
 After experimenting with it I conclude that these problems - at least in my 
 scenario - are due to a temporary situation - the FS, or the OS, is 
 *temporarily* holding references to files or folders, preventing from 
 renaming them, deleting them, or creating new files in certain directories. 
 So I added to FSDirectory a retry logic in cases the error was related to 
 Access Denied. This is the same approach brought in 
 http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
  - there, in addition to the retry, gc() is invoked (I did not gc()). This is 
 based on the *hope* that an access-denied situation would vanish after a small 
 delay, and the retry would succeed.
 I modified FSDirectory this way for Access Denied errors during creating 
 new files and renaming a file.
 This worked fine for me. The performance test that failed before, now managed 
 to complete. There should be no performance implications due to this 
 modification, because only the cases that would otherwise wrongly fail are 
 now delaying some extra millis and retry.
 I am attaching here a patch - FSDirectory_Retry_Logic.patch - that has these 
 changes to FSDirectory. 
 All ant test tests pass with this patch.
 Also attaching a test case that demostrates the problem - at least on my 
 machine. There two tests cases in that test file - one that works in system 
 temp (like most Lucene tests) and one that creates the index in a different 
 disk. The latter case can only run if the path (D: , tmp) is valid.
 It would be great if people that experienced these problems could try out 
 this patch and comment whether it made any difference for them. 
 If it turns out useful for others as well, including this patch in the code 
 might help to relieve some of those frustration user cases.
 A comment on the state of the proposed patch: 
 - It is not ready-to-deploy code - it has some debug printing, showing 
 the cases where the retry logic actually took place. 
 - I am not sure if the current 30ms is the right delay... why not 50ms? 10ms? 
 This is currently defined by a constant.
 - Should a call to gc() be added? (I think not.)
 - Should the retry be attempted also on non-access-denied exceptions? (I 
 think not.)
 - I feel it is somewhat voodoo programming - though I don't like it, it 
 seems to work... 
 Attached files:
 1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without 
 the patch and passes with the patch.
 2. FSDirectory_Retry_Logic.patch
 3. Test_Output.txt

[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-30 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12445795 ] 

Doron Cohen commented on LUCENE-697:


An updated version of this patch - sloppy_phrase_skipTo.patch2.

I modified QueryUtils.java (a test util) to test all the sequences, not just 
one. It is also now quite easy to add a new sequence to be tested, if needed.

Other changes in this patch remain:
- PhraseQueue: this is the fix.
- ExactPhraseScorer: added a call to clear the queue - not a must, but cleaner 
this way.
- PhraseScorer: added a mark that init was done at skip - again not a must, 
just cleaner this way.

All 'ant test' tests pass.

- Doron

 Scorer.skipTo affects sloppyPhrase scoring
 --

 Key: LUCENE-697
 URL: http://issues.apache.org/jira/browse/LUCENE-697
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.0.0
Reporter: Yonik Seeley
 Assigned To: Doron Cohen
 Attachments: sloppy_phrase_skipTo.patch, sloppy_phrase_skipTo.patch2


 If you mix skipTo() and next(), you get different scores than what is 
 returned to a hit collector.
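
For context, the invariant behind this report can be checked roughly like this 
(my own sketch against the Lucene 2.0-era Scorer API, not the actual QueryUtils 
code; scorer1 and scorer2 are assumed to be two fresh Scorers for the same 
query and reader):

  import java.io.IOException;
  import org.apache.lucene.search.Scorer;

  public class SkipToConsistencyCheck {
    // Walk scorer1 with next(); position scorer2 on the same docs via
    // skipTo(); the scores must agree no matter how each doc was reached.
    public static void check(Scorer scorer1, Scorer scorer2) throws IOException {
      while (scorer1.next()) {
        int doc = scorer1.doc();
        if (!scorer2.skipTo(doc) || scorer2.doc() != doc) {
          throw new AssertionError("skipTo() missed doc " + doc);
        }
        if (Math.abs(scorer1.score() - scorer2.score()) > 1e-6f) {
          throw new AssertionError("score mismatch at doc " + doc);
        }
      }
    }
  }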




[jira] Created: (LUCENE-706) Index File Format - Example for frequency file .frq is wrong

2006-11-02 Thread Doron Cohen (JIRA)
Index File Format - Example for frequency file .frq is wrong


 Key: LUCENE-706
 URL: http://issues.apache.org/jira/browse/LUCENE-706
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
 Environment: not applicable
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Trivial


Reported by Johan Stuyts - 
http://www.nabble.com/Possible-documentation-error--p7012445.html - 

Frequency file example says: 

 For example, the TermFreqs for a term which occurs once in document seven 
and three times in document eleven would be the following sequence of VInts: 
 15, 22, 3 

It should be: 

 For example, the TermFreqs for a term which occurs once in document seven 
and three times in document eleven would be the following sequence of VInts: 
 15, 8, 3 
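
For reference, here is the arithmetic behind these numbers, following the 
encoding described in the file formats document (each TermFreq entry starts 
with the doc gap shifted left by one bit, with the low bit set when freq == 1; 
the frequency follows as a VInt only when freq > 1):

  doc 7, freq 1:   (7 << 1) | 1 = 15            (no frequency value follows)
  doc 11, freq 3:  gap = 11 - 7 = 4, (4 << 1) = 8, followed by 3

giving 15, 8, 3. The erroneous 22 looks like (11 << 1) = 22, i.e. the absolute 
doc number used instead of the gap.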







[jira] Updated: (LUCENE-706) Index File Format - Example for frequency file .frq is wrong

2006-11-02 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-706?page=all ]

Doron Cohen updated LUCENE-706:
---

Lucene Fields: [New, Patch Available]  (was: [New])

 Index File Format - Example for frequency file .frq is wrong
 

 Key: LUCENE-706
 URL: http://issues.apache.org/jira/browse/LUCENE-706
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
 Environment: not applicable
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Trivial
 Attachments: file-format-frq-example.patch






[jira] Commented: (LUCENE-706) Index File Format - Example for frequency file .frq is wrong

2006-11-03 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-706?page=comments#action_12447049 ] 

Doron Cohen commented on LUCENE-706:


Right, sorry - I copied that hex data from an .frq of an index with a different 
example, where the frequencies were 1 in doc 6 and 3 in doc 10, so there you 
would get 2 * 6 + 1 = 13.

For the correct example of freq 1 in doc 7 and 3 in doc 11, the .frq content is 
0F 08 03, as it should be. 

(Meaning that the documentation should still be fixed... ;-)


 Index File Format - Example for frequency file .frq is wrong
 

 Key: LUCENE-706
 URL: http://issues.apache.org/jira/browse/LUCENE-706
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
 Environment: not applicable
Reporter: Doron Cohen
 Assigned To: Grant Ingersoll
Priority: Trivial
 Attachments: file-format-frq-example.patch






[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-07 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
---

Attachment: timedata.zip

I tried it and it is working nicely! 
The 1st run downloaded the documents from the Web before starting to index; 
the 2nd run started right off, as the input docs were already in place - great. 

It seems the only output is what is printed to stdout, right? 

I got something like this: 

 [echo] Working Directory: work
 [java] Testing 4 different permutations.
 [java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 --
 [java] # source=work\reuters-out, [EMAIL PROTECTED]:\devoss\lucene\java\trunk\contrib\benchmark\work\index
 [java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true
 [java] # Query data: R-reopen, W-warmup, T-retrieve, N-no
 [java] # qd-0110 R W NT [body:salomon]
 [java] # qd-0111 R W T [body:salomon]
 [java] # qd-0100 R NW NT [body:salomon]
...
 [java] # qd-14011 NR W T [body:fo*]
 [java] # qd-14000 NR NW NT [body:fo*]
 [java] # qd-14001 NR NW T [body:fo*]

 [java] Start Time: Sun Nov 05 22:41:38 PST 2006
 [java]  - processed 500, run id=0
 [java]  - processed 1000, run id=0
 [java]  - processed 1500, run id=0
 [java]  - processed 2000, run id=0
 [java] End Time: Sun Nov 05 22:41:48 PST 2006
 [java] warm = Warm Index Reader
 [java] srch = Search Index
 [java] trav = Traverse Hits list, optionally retrieving document

 [java] # testData id    operation     runCnt  recCnt  rec/s     avgFreeMem  avgTotalMem
 [java] td-00_100_100    addDocument   1       2000    472.0321  4493681     22611558
 [java] td-00_100_100    optimize      1       1       2.857143  4229488     22716416
 [java] td-00_100_100    qd-0110-warm  1       2000    4.0       4250992     22716416
 [java] td-00_100_100    qd-0110-srch  1       1       Infinity  4221288     22716416
...
 [java] td-00_100_100    qd-4110-srch  1       1       Infinity  3993624     22716416
 [java] td-00_100_100    qd-4110-trav  1       0       NaN       3993624     22716416
 [java] td-00_100_100    qd-4111-warm  1       2000    5.0       3853192     22716416
...
BUILD SUCCESSFUL
Total time: 1 minute 0 seconds


I think the Infinity and NaN values are caused by an op time too short for 
the rec/s division.
This can be avoided by modifying getRate() in TimeData:
  public double getRate() {
    double rps = (double) count * 1000.0 / (double) (elapsed > 0 ? elapsed : 1);
    return rps;
  }

I very much like the logic of loading test data from the Web, and the scaleUp 
and maximumDocumentsToIndex params are handy. 

It seems that all the test logic and some of its data (queries) are Java coded. 
I initially thought of a setting where we define tasks/jobs that are 
parameterized, like:

- createIndex(params)
- writeToIndex(params):
  - addDocs()
  - optimize()
- readFromIndex(params):
  - searchIndex()
  - fetchData()

...and compose a test by an XML file that says which of these simple jobs to 
run, with what params, in which order, serial/parallel, how long/often, etc. 
Then creating a different test is as easy as creating a different XML file 
that configures that test. 

On the other hand, chances are, I know, that the most useful cases are those 
already defined here - standard and micro-standard - so one can ask why bother 
defining these building blocks at all. I am not sure here, but thought I'd 
bring it up. 

About using the driver - it seems nice and clean to me. I don't know Digester, 
but it seems to read the config from the XML correctly.

Other comments:
1. I think there is a redundant call to params.showRunData(params.getId()) in 
runBenchmark(File,Options).
2. It seems rec/s would be computed a bit more accurately by aggregating 
elapsed times (instead of rates) in showRunData().
3. If TimeData is not found (only memData), I think an additional 0.0 should 
be printed.
4. Column alignment with tabs and floats is imperfect. :-)
5. It would be nice, I think, to also get a summary of the results by task - 
e.g. srch, optimize - something like:
 [java] # testData id  operation    runCnt  recCnt  rec/s     avgFreeMem  avgTotalMem
 [java]               warm          60      2000    42,628.8  8,235,758   23,048,192
 [java]               srch          120     1       571.4     8,300,613   23,048,192
 [java]               optimize      1       1       2.9       9,375,732   23,048,192
 [java]               trav          120     107     30,517.8  8,326,046   23,048,192
 [java]               addDocument   1       2000    441.8     7,310,929   22,206,872

Attached timedata.zip has the modified TimeData.java and 

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-12 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449117 ] 

Doron Cohen commented on LUCENE-675:


I looked at extending the benchmark with:
- different test scenarios, i.e. other sequences of operations.
- multithreaded tests, e.g. several queries in parallel.
- rate of events, e.g. 2 queries arriving per second, or one query per 
second in parallel with 20 new documents in a minute.
- different data sources (input documents, queries).

For this I made lots of changes to the benchmark code, using parts of it and 
rewriting other parts. 
I would like to submit this code in a few days - it is running already but some 
functionality is missing.

I would like to describe how it works to hopefully get early feedback. 

There are several basic tasks defined - all extending an (abstract) class 
PerfTask:
- AddDocTask
- OptimizeTask
- CreateIndexTask
etc. 

To further extend the benchmark 'framework', new tasks can be added. Each task 
must implement the abstract method: doLogic(). For instance, in AddDocTask this 
method (doLogic) would call indexWriter.addDocument().
There are also setup() and tearDown() methods for performing work that should 
not be timed for that task. 

A special TaskSequence task contains other tasks. It is either sequential or 
parallel, which tells whether it executes its child tasks serially or in 
parallel. TaskSequence also supports a rate: the pace at which its child tasks 
are fired can be controlled.
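
To make the task abstraction concrete, here is a minimal, self-contained 
sketch - PerfTask, doLogic(), setup() and tearDown() are the names described 
above, while the timing wrapper and the return-value convention are my own 
assumptions for illustration, not the actual patch code:

  public abstract class PerfTask {
    // work that should NOT be timed for this task
    public void setup() throws Exception {}
    public void tearDown() throws Exception {}

    // the timed work; e.g. AddDocTask.doLogic() would call
    // indexWriter.addDocument(); returns the number of records processed
    public abstract int doLogic() throws Exception;

    // run the task, timing only doLogic()
    public final long runAndReturnElapsedMillis() throws Exception {
      setup();
      long start = System.currentTimeMillis();
      doLogic();
      long elapsed = System.currentTimeMillis() - start;
      tearDown();
      return elapsed;
    }
  }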

With these tasks, it is possible to describe a performance test 'algorithm' in 
a simple syntax.
('algorithm' may be too big a word for this...?)

A test invocation takes two parameters: 
- test.properties - file with various config properties.
- test.alg   - file with the algorithm.

By convention, for each task class  OpNameTask,  the command  OpName  is 
valid in test.alg.

Adding a single document is done by:
AddDoc

Adding 3 documents:
   AddDoc
   AddDoc
   AddDoc

Or, alternatively:
   { AddDoc } : 3

So, '{' and '}' indicate a serial sequence of (child) tasks. 

To fire 100 queries in a row:
  { Search } : 100

To fire 100 queries in parallel:
  [ Search ] : 100

So, '[' and ']' indicate a parallel group of tasks. 

To fire 100 queries in a row, 2 queries per second (120 per minute):
  { Search } : 100 : 120

Similar, but in parallel:
  [ Search ] : 100 : 120

A sequence task can be named for identifying it in reports:
  { QueriesA Search } : 100 : 120

And there are tasks that create reports. 
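
Putting the constructs above together, a tiny end-to-end .alg might look like 
this (a sketch only - CreateIndex, AddDoc, Optimize and Search follow from the 
task classes mentioned above; the report task name on the last line is a 
placeholder, not necessarily the real one):

  CreateIndex
  { AddDoc } : 2000
  Optimize
  { QueriesA Search } : 100 : 120
  Report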

There are more tasks, and more to tell on the alg syntax, but this post is 
already long..

I find this quite powerful for perf testing.
What do you (and you) think?

- Doron


 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: benchmark.patch, BenchmarkingIndexer.pm, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip


 We need an objective way to measure the performance of Lucene, both indexing 
 and querying, on a known corpus. This issue is intended to collect comments 
 and patches implementing a suite of such benchmarking tests.
 Regarding the corpus: one of the widely used and freely available corpora is 
 the original Reuters collection, available from 
 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
 or 
 http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
  I propose to use this corpus as a base for benchmarks. The benchmarking 
 suite could automatically retrieve it from known locations, and cache it 
 locally.




[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-12 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
---

Attachment: tiny.alg
tiny.properties

I am attaching a sample tiny.* - the .alg and .properties files I currently use 
- I think they may help to understand how this works.

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: benchmark.patch, BenchmarkingIndexer.pm, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, 
 tiny.alg, tiny.properties






[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-13 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449419 ] 

Doron Cohen commented on LUCENE-675:


Sounds good.

In this case I will add my stuff under a new package: 
org.apache.lucene.benchmark2 (this package would have no dependencies on 
org.apache.lucene.benchmark). I will also add targets in build.xml, and add 
.properties and .alg files under conf.
Does that make sense?

Do you already know when you are going to commit it?

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: benchmark.patch, BenchmarkingIndexer.pm, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, 
 tiny.alg, tiny.properties






[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-14 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449779 ] 

Doron Cohen commented on LUCENE-675:


Good point on names with numbers - I'm renaming the package to taskBenchmark, 
as I think of it as task-sequence based, more than as properties based. 


 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: benchmark.patch, BenchmarkingIndexer.pm, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, 
 tiny.alg, tiny.properties






[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-15 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449947 ] 

Doron Cohen commented on LUCENE-675:


It would be nice to get some feedback on what I already have at this point for 
the task-based benchmark framework for Lucene.  

So I am packing it as a zip file. I will probably resubmit it as a patch once 
Grant commits the current benchmark code.
See attached taskBenchmark.zip.

To try out taskBenchmark, unzip it under contrib/benchmark, on top of Grant's 
benchmark.patch.
This makes 3 changes:

1. Replaces build.xml - the only change there is adding two targets: 
run-task-standard and run-task-micro-standard.

2. Adds 4 new files under conf:
 - task-standard.properties
 - task-standard.alg
 - task-micro-standard.properties
 - task-micro-standard.alg

3. Adds a src package 'taskBenchmark' side by side with the current 
'benchmark' package.

To try it out, go to contrib/benchmark and try 'ant run-task-standard' or 'ant 
run-task-micro-standard'. 

See inside the .alg files for how a test is specified.

The algorithm syntax and the entire package are documented in the package 
javadoc for taskBenchmark (package.html). 

Regards,
Doron

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: benchmark.patch, BenchmarkingIndexer.pm, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, 
 tiny.alg, tiny.properties






[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-11-16 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
---

Attachment: benchmark.byTask.patch

I am attaching benchmark.byTask.patch - to be applied in the contrib/benchmark 
directory. 

The root package of the byTask classes was modified to 
org.apache.lucene.benchmark.byTask, along the lines of Grant's suggestion - 
seems better because it keeps all benchmark classes under lucene.benchmark.

I added a sample .alg under conf and added some documentation. 

The entry point - documentation wise - is the package doc for 
org.apache.lucene.benchmark.byTask.
Thanks for any comments on this!

PS. Before submitting the patch file, I tried to apply it myself on a clean 
version of the code, just to make sure that it works. But I got errors like 
this -- Could not retrieve revision 0 of ...\byTask\.. -- for every file 
under a new folder. So I am not sure if it is just my (Windows) svn patch 
applying utility, or whether it is really impossible to apply a patch that 
creates files in (yet) nonexistent directories. I searched the Lucene and SVN 
mailing lists and went through the SVN book again, but nowhere could I find 
what the expected behavior is for applying a patch containing new directories. 
In fact, svn diff would not even show files that are new (again, this is the 
Windows svn 1.4.2 version; I used Tortoise SVN to create the patch). This is 
rather annoying and I might be misunderstanding something basic about SVN, but 
I thought it'd be better to share this experience here - it might save some 
time for others trying to apply this patch or other patches
 ...

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: http://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
 Attachments: benchmark.byTask.patch, benchmark.patch, 
 BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, 
 LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties






[jira] Commented: (LUCENE-717) src builds fail because of no lib directory

2006-11-27 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-717?page=comments#action_12453672 ] 

Doron Cohen commented on LUCENE-717:


That's because junit.jar is required for compiling and running the tests. 
(I guess we can't distribute junit.jar with Lucene.)

This is from common-build.xml:
  ##
  JUnit not found.
  Please make sure junit.jar is in ANT_HOME/lib, or made available
  to Ant using other mechanisms like -lib or CLASSPATH.
  ##
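
For example, assuming junit.jar was downloaded to /path/to/junit.jar (a 
placeholder path), the tests can be run without touching ANT_HOME/lib:

  ant -lib /path/to/junit.jar test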


 src builds fail because of no lib directory
 -

 Key: LUCENE-717
 URL: http://issues.apache.org/jira/browse/LUCENE-717
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: 2.0.0
Reporter: Hoss Man

 I just downloaded 
 http://mirrors.ibiblio.org/pub/mirrors/apache/lucene/java/lucene-2.0.0-src.tar.gz
 and noticed that you can't compile and run the tests from that src build 
 because it doesn't include the lib dir (and the build file won't attempt to 
 make it if it doesn't exist) ...
 [EMAIL PROTECTED]:~/tmp/l2$ tar -xzvf lucene-2.0.0-src.tar.gz
   ...
 [EMAIL PROTECTED]:~/tmp/l2$ cd lucene-2.0.0/
 [EMAIL PROTECTED]:~/tmp/l2/lucene-2.0.0$ ant test
   ...
 test:
 [mkdir] Created dir: /home/hossman/tmp/l2/lucene-2.0.0/build/test
 BUILD FAILED
 /home/hossman/tmp/l2/lucene-2.0.0/common-build.xml:169: 
 /home/hossman/tmp/l2/lucene-2.0.0/lib not found.
 (it's referenced in junit.classpath, but i'm not really sure why)




[jira] Updated: (LUCENE-717) src builds fail because of no lib directory

2006-11-27 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-717?page=all ]

Doron Cohen updated LUCENE-717:
---

Lucene Fields: [New, Patch Available]  (was: [New])

 src builds fail because of no lib directory
 -

 Key: LUCENE-717
 URL: http://issues.apache.org/jira/browse/LUCENE-717
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: 2.0.0
Reporter: Hoss Man
 Attachments: common-build.xml.patch.txt






[jira] Commented: (LUCENE-717) src builds fail because of no lib directory

2006-11-27 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-717?page=comments#action_12453738 ] 

Doron Cohen commented on LUCENE-717:


I'm ok with this...

 src builds fail because of no lib directory
 -

 Key: LUCENE-717
 URL: http://issues.apache.org/jira/browse/LUCENE-717
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: 2.0.0
Reporter: Hoss Man
 Attachments: common-build.xml.patch.txt






[jira] Commented: (LUCENE-708) Setup nightly build website links and docs

2006-11-29 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-708?page=comments#action_12454375 ] 

Doron Cohen commented on LUCENE-708:


Could "official" be the most recent release (currently 2.0)? 

So there would be:
 Official (2.0)
 Nightly
 1.9.1
 1.9
 1.4.3

This way everyday users would not have to get into trunk/svn details to 
understand which docs they are seeing. 


 Setup nightly build website links and docs
 --

 Key: LUCENE-708
 URL: http://issues.apache.org/jira/browse/LUCENE-708
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Grant Ingersoll
 Assigned To: Grant Ingersoll
Priority: Minor

 Per discussion on mailing list, we are going to setup a Nightly Build link on 
 the website linking to the docs (and javadocs) generated by the nightly build 
 process.  The build process may need to be modified to complete this task.
 Going forward, the main website will, for the most part, only be updated per 
 releases (I imagine exceptions will be made for News items and per 
 committer's discretion).  The Javadocs linked to from the main website will 
 always be for the latest release.




[jira] Created: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

2006-12-01 Thread Doron Cohen (JIRA)
Sloppy Phrase Scoring Misbehavior
-

 Key: LUCENE-736
 URL: http://issues.apache.org/jira/browse/LUCENE-736
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor


This is an extension of https://issues.apache.org/jira/browse/LUCENE-697

In addition to the abnormalities Yonik pointed out in 697, there seem to be 
other issues with sloppy phrase search and scoring.

1) A phrase with a repeated word would be detected in a document although it is 
not there.
I.e. document = A B D C E, query = "B C B" would not find this document (as 
expected), but query "B C B"~2 would find it. 
I think that no matter how large the slop is, this document should not be a 
match.

2) A document containing both orders of a query, symmetrically, would score 
differently for the query and for its reversed form.
I.e. document = A B C B A would score differently for queries "B C"~2 and 
"C B"~2, although it is symmetric to both.

I will attach test cases that show both these problems and the one reported by 
Yonik in 697. 
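
To make case 1 concrete, here is a minimal, self-contained sketch against the 
Lucene 2.0-era API (my own illustration, not the attached test code) that 
indexes the single document A B D C E and then runs the repeated-word sloppy 
phrase; with the bug it prints 1 match, while the expected count is 0:

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.store.RAMDirectory;

  public class SloppyRepeatedWordDemo {
    public static void main(String[] args) throws Exception {
      RAMDirectory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
      Document doc = new Document();
      doc.add(new Field("f", "A B D C E", Field.Store.YES, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      writer.close();

      PhraseQuery query = new PhraseQuery();   // "B C B"~2
      query.add(new Term("f", "B"));
      query.add(new Term("f", "C"));
      query.add(new Term("f", "B"));
      query.setSlop(2);

      IndexSearcher searcher = new IndexSearcher(dir);
      Hits hits = searcher.search(query);
      System.out.println("matches: " + hits.length()); // should be 0
      searcher.close();
    }
  }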




[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

2006-12-01 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
---

Attachment: sloppy_phrase_tests.patch.txt

sloppy_phrase_tests.patch.txt contains:

- two test cases added in TestPhraseQuery. 
These new tests currently fail. 

- skipTo() behavior tests that were originally in issue 697. 
This too currently fails.

 Sloppy Phrase Scoring Misbehavior
 -

 Key: LUCENE-736
 URL: http://issues.apache.org/jira/browse/LUCENE-736
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: sloppy_phrase_tests.patch.txt






[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

2006-12-01 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
---

Attachment: sloppy_phrase_java.patch.txt
perf-search-new.log
perf-search-orig.log

The attached sloppy_phrase_java.patch.txt fixes the failing new tests. 
It also includes the skipTo() bug fix from issue 697.

The fix does not guarantee that document A B C B A would score "A B C"~4 and 
"C B A"~4 the same. 
It does that for "B C"~2 and "C B"~2.
This is because a general fix for that (at least the one that I devised) would 
be too expensive.
Although this is an interesting case, I'd like to think it is not an important 
one.

This fix comes with a performance cost: about 15% degradation in CPU time for 
sloppy phrase scoring, as the attached perf logs show.
Here is the summary of these tests:

       Operation           runCnt  recsPerRun  rec/s   elapsedSec
Orig:  SearchSameRdr_3000    4        3000     216.1     55.52
New:   SearchSameRdr_3000    4        3000     187.8     63.91

I think that in a real-life scenario - real index, real documents, real 
queries - this extra CPU would be masked by IO, but I also believe we should 
refrain from slowing down search. So, unhappy with this degradation (anyone 
would be :-), I will look for other ways to fix this - ideas are welcome.

The perf test was done using the task benchmark framework (see issue 675). The 
logs also show the queries that were searched.

All tests pass with new code.

 Sloppy Phrase Scoring Misbehavior
 -

 Key: LUCENE-736
 URL: http://issues.apache.org/jira/browse/LUCENE-736
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: perf-search-new.log, perf-search-orig.log, 
 sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt






[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

2006-12-02 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
---

Attachment: sloppy_phrase.patch2.txt
res-search-orig2.log
res-search-new2.log

The change to fix case 2 was not the main cause of the performance degradation.

I agree with Yonik that case 2 is much more important than case 1.
So I modified the patch to fix case 2 but not case 1. 
I also extended the perf test to create the reversed form of the sloppy 
phrases (the slop is increased for the reversed cases so that the queries 
would still match docs).

The cost of this fix dropped from 15% more CPU time to about 3%. 
I feel ok with this.

      Operation           runCnt  recsPerRun  rec/s   elapsedSec  avgUsedMem  avgTotalMem
Orig  SearchSameRdr_6000    4        6000     194.2     123.59    8,032,732   11,333,632
New   SearchSameRdr_6000    4        6000     187.5     128.02    8,172,258   11,333,632

The attached sloppy_phrase.patch2.txt has the updated fix, including both the 
java and test parts. Some of the asserts in the new tests were commented out 
because the patch deliberately does not fix case 1 above.

Also attaching the updated perf test logs - res-search-orig2.log and 
res-search-new2.log.

I did not yet compare scoring of similar cases between sloppy phrases and near 
spans, as Paul suggested - perhaps next week - but I am not sure this should 
hold up progress on this issue.

 Sloppy Phrase Scoring Misbehavior
 -

 Key: LUCENE-736
 URL: http://issues.apache.org/jira/browse/LUCENE-736
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: perf-search-new.log, perf-search-orig.log, 
 res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, 
 sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt






[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

2006-12-02 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
---

Lucene Fields: [New, Patch Available]  (was: [New])

 Sloppy Phrase Scoring Misbehavior
 -

 Key: LUCENE-736
 URL: http://issues.apache.org/jira/browse/LUCENE-736
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: perf-search-new.log, perf-search-orig.log, 
 res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, 
 sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt






[jira] Commented: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

2006-12-04 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-736?page=comments#action_12455422 ] 

Doron Cohen commented on LUCENE-736:


There is a bug in my recent patch (sloppy_phrase.patch2.txt):
- for the case of a phrase with repetitions, some additional computation is 
required before starting each doc. 
- this does not affect the regular/common case of a phrase with no repetitions.
I extended the test to expose this and will commit an updated patch later today.

 Sloppy Phrase Scoring Misbehavior
 -

 Key: LUCENE-736
 URL: http://issues.apache.org/jira/browse/LUCENE-736
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: perf-search-new.log, perf-search-orig.log, 
 res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, 
 sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt






[jira] Created: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

2006-12-04 Thread Doron Cohen (JIRA)
read/write .del as d-gaps when the deleted bit vector is sufficiently sparse 
-

 Key: LUCENE-738
 URL: http://issues.apache.org/jira/browse/LUCENE-738
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.1
Reporter: Doron Cohen
 Assigned To: Doron Cohen


The .del file of a segment maintains info on deleted documents in that 
segment. The file exists only for segments having deleted docs, so it does not 
exist for newly created segments (e.g. those resulting from a merge). Each 
time an index reader that deleted any document is closed, the .del file is 
rewritten. In fact, since the lock-less commits change, a new (generation of) 
.del file is created on each such occasion.

For small indexes there is no real problem with the current situation. But for 
very large indexes, each time such an index reader is closed, creating such a 
new bit vector seems like unnecessary overhead in cases where the bit vector 
is sparse (just a few docs were deleted). For instance, for an index with a 
segment of 1M docs, the sequence {open reader; delete 1 doc from that segment; 
close reader;} would write a file of ~128KB. Repeat this sequence 8 times: 8 
new files with a total size of 1MB are written to disk.

Whether this is a bottleneck or not depends on the application's delete 
pattern, but for the case that deleted docs are sparse, writing just the 
d-gaps would save space and time. 

I have this (simple) change to BitVector running and am currently trying some 
performance tests to convince myself of the worthiness of this.
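
For illustration, here is a minimal, self-contained sketch of the d-gap idea 
(my own sketch, not the patch code - the real change would presumably go 
through Lucene's IndexOutput rather than a raw java.io.DataOutput):

  import java.io.DataOutput;
  import java.io.IOException;

  public class DGapSketch {
    // Write the deleted doc ids (sorted ascending) as: a count, then for
    // each id the gap from the previous id, encoded as a VInt - a few
    // bytes per deleted doc instead of one bit per doc in the segment.
    static void writeDGaps(DataOutput out, int[] sortedDeletedDocs)
        throws IOException {
      out.writeInt(sortedDeletedDocs.length);
      int last = 0;
      for (int doc : sortedDeletedDocs) {
        writeVInt(out, doc - last);
        last = doc;
      }
    }

    // Lucene's VInt format: 7 bits per byte, high bit set on all but the
    // last byte.
    static void writeVInt(DataOutput out, int i) throws IOException {
      while ((i & ~0x7F) != 0) {
        out.writeByte((i & 0x7F) | 0x80);
        i >>>= 7;
      }
      out.writeByte(i);
    }
  }

For the 1M-doc example above, 8 sparse deletions would then cost a handful of 
bytes per rewrite instead of ~128KB each.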






[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

2006-12-06 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-738?page=all ]

Doron Cohen updated LUCENE-738:
---

Attachment: del.dgap.patch.txt

Patch added: del.dgap.patch.txt for option (1) above - writing d-gaps for the 
ids of deleted docs.

The patch changes the index format, but is backwards compatible.

I still need to update the FileFormats document - will add that part of the 
patch later.


 read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
 

 Key: LUCENE-738
 URL: http://issues.apache.org/jira/browse/LUCENE-738
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.1
Reporter: Doron Cohen
 Assigned To: Doron Cohen
 Attachments: del.dgap.patch.txt






[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

2006-12-06 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-738?page=all ]

Doron Cohen updated LUCENE-738:
---

Lucene Fields: [Patch Available]  (was: [New])

 read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
 

 Key: LUCENE-738
 URL: http://issues.apache.org/jira/browse/LUCENE-738
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.1
Reporter: Doron Cohen
 Assigned To: Doron Cohen
 Attachments: del.dgap.patch.txt






[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

2006-12-07 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-738?page=all ]

Doron Cohen updated LUCENE-738:
---

Attachment: FileFormatDoc.patch.txt

FileFormat document updated to reflect this format change.

 read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
 

 Key: LUCENE-738
 URL: http://issues.apache.org/jira/browse/LUCENE-738
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.1
Reporter: Doron Cohen
 Assigned To: Doron Cohen
 Attachments: del.dgap.patch.txt, FileFormatDoc.patch.txt



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception

2006-12-11 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12457462 ] 

Doron Cohen commented on LUCENE-740:


In addition to the SnowballProgram bug fix, there are a few updates on 
snowball.tartarus.org compared to the Snowball stemmers in Lucene, and a 
Hungarian stemmer was added. Any reason not to update all the stemmers along 
with this fix?

 Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives 
 Index-OOB Exception
 --

 Key: LUCENE-740
 URL: http://issues.apache.org/jira/browse/LUCENE-740
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 1.9
 Environment: linux amd64
Reporter: Andreas Kohn
Priority: Minor
 Attachments: lucene-1.9.1-SnowballProgram.java


 (copied from mail to java-user)
 while playing with the various stemmers of Lucene(-1.9.1), I got an
 index out of bounds exception:
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at net.sf.snowball.TestApp.main(TestApp.java:56)
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out
 of range: 11
        at java.lang.StringBuffer.charAt(StringBuffer.java:303)
        at 
 net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270)
        at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122)
        at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997)
 This happens when executing
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 bla.txt contains just this word: 'spijsvertering'.
 After some debugging, and some tests with the original snowball
 distribution from snowball.tartarus.org, it seems that the attached
 change is needed to avoid the exception.
 (The change comes from tartarus' SnowballProgram.java)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception

2006-12-11 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-740?page=all ]

Doron Cohen updated LUCENE-740:
---

Attachment: snowball.patch.txt

Updated and new stemmers, plus the SnowballProgram fix from http://snowball.tartarus.org

 Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives 
 Index-OOB Exception
 --

 Key: LUCENE-740
 URL: http://issues.apache.org/jira/browse/LUCENE-740
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 1.9
 Environment: linux amd64
Reporter: Andreas Kohn
Priority: Minor
 Attachments: lucene-1.9.1-SnowballProgram.java, snowball.patch.txt


 (copied from mail to java-user)
 while playing with the various stemmers of Lucene(-1.9.1), I got an
 index out of bounds exception:
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at net.sf.snowball.TestApp.main(TestApp.java:56)
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out
 of range: 11
        at java.lang.StringBuffer.charAt(StringBuffer.java:303)
        at 
 net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270)
        at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122)
        at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997)
 This happens when executing
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 bla.txt contains just this word: 'spijsvertering'.
 After some debugging, and some tests with the original snowball
 distribution from snowball.tartarus.org, it seems that the attached
 change is needed to avoid the exception.
 (The change comes from tartarus' SnowballProgram.java)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception

2006-12-11 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12457605 ] 

Doron Cohen commented on LUCENE-740:


Attached snowball.patch.txt has the latest and greatest, plus a new test case in 
TestSnowball that demonstrates this Kp stemmer bug.

Lucene tests and contrib/snowball tests pass.
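
As a rough illustration of what such a test exercises (not the actual 
TestSnowball case - this standalone sketch assumes the usual 
setCurrent()/stem()/getCurrent() API of the generated stemmers):

  import net.sf.snowball.ext.KpStemmer;

  public class KpStemmerRepro {
    public static void main(String[] args) {
      KpStemmer stemmer = new KpStemmer();
      stemmer.setCurrent("spijsvertering");    // the word from the report
      stemmer.stem();                          // StringIndexOutOfBoundsException
                                               // in find_among_b() without the fix
      System.out.println(stemmer.getCurrent()); // stemmed form once fixed
    }
  }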


 Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives 
 Index-OOB Exception
 --

 Key: LUCENE-740
 URL: http://issues.apache.org/jira/browse/LUCENE-740
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 1.9
 Environment: linux amd64
Reporter: Andreas Kohn
Priority: Minor
 Attachments: lucene-1.9.1-SnowballProgram.java, snowball.patch.txt


 (copied from mail to java-user)
 while playing with the various stemmers of Lucene(-1.9.1), I got an
 index out of bounds exception:
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at net.sf.snowball.TestApp.main(TestApp.java:56)
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out
 of range: 11
        at java.lang.StringBuffer.charAt(StringBuffer.java:303)
        at 
 net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270)
        at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122)
        at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997)
 This happens when executing
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 bla.txt contains just this word: 'spijsvertering'.
 After some debugging, and some tests with the original snowball
 distribution from snowball.tartarus.org, it seems that the attached
 change is needed to avoid the exception.
 (The change comes from tartarus' SnowballProgram.java)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception

2006-12-12 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12457619 ] 

Doron Cohen commented on LUCENE-740:


Two comments:

1. Testing: There's only limited testing in Lucene's contrib for these stemmers 
- we could probably add a simple test for each stemmer.

2. Licensing: when attaching the patch I granted it for ASF inclusion. But this 
only covers my (minimal) changes to this code. The stemmers themselves fall 
under the Snowball license - http://snowball.tartarus.org/license.php


 Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives 
 Index-OOB Exception
 --

 Key: LUCENE-740
 URL: http://issues.apache.org/jira/browse/LUCENE-740
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 1.9
 Environment: linux amd64
Reporter: Andreas Kohn
Priority: Minor
 Attachments: lucene-1.9.1-SnowballProgram.java, snowball.patch.txt


 (copied from mail to java-user)
 while playing with the various stemmers of Lucene(-1.9.1), I got an
 index out of bounds exception:
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at net.sf.snowball.TestApp.main(TestApp.java:56)
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out
 of range: 11
        at java.lang.StringBuffer.charAt(StringBuffer.java:303)
        at 
 net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270)
        at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122)
        at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997)
 This happens when executing
 lucene-1.9.1> java -cp
 build/contrib/snowball/lucene-snowball-1.9.2-dev.jar
 net.sf.snowball.TestApp Kp bla.txt
 bla.txt contains just this word: 'spijsvertering'.
 After some debugging, and some tests with the original snowball
 distribution from snowball.tartarus.org, it seems that the attached
 change is needed to avoid the exception.
 (The change comes from tartarus' SnowballProgram.java)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
Maintain norms in a single file .nrm


 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Doron Cohen
Priority: Minor


Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
IO activity of compound indexes. But their file-descriptor footprint is much 
higher. 

By maintaining all field norms in a single .nrm file, we can bound the number 
of files used by non-compound indexes, and possibly allow more applications to 
use this format.
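
To make the file-count point concrete (illustrative numbers, not from the 
issue): the non-compound format keeps one .fN norms file per normed field per 
segment, so 20 segments with 8 such fields already mean 20 * 8 = 160 norm 
files, while one .nrm file per segment caps that at 20, independent of the 
number of fields.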

More details on the motivation for this in: 
http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
 (in particular 
http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen reassigned LUCENE-756:
--

Assignee: Doron Cohen

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor

 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: nrm.patch.txt

Attached patch - nrm.patch.txt - modifies field norms maintenance to use a 
single .nrm file.

The modification is backwards compatible - existing indexes with one file per 
norm are still read; the first merge would create a single .nrm file.

All tests pass.

No performance degradations were observed as a result of this change, but my 
tests so far were not very extensive.


 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Lucene Fields: [Patch Available]  (was: [New])

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Component/s: Index

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: (was: nrm.patch.txt)

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor

 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: nrm.patch.txt

Replacing the patch file (the previous file was garbage - svn stat instead of 
svn diff).

A few words on how this patch works: 
- a segment.nrm file was added.
- addDocument (DocumentWriter) still writes each norm to a separate file - but 
that's in memory.
- at merge, all norms are written to a single file.
- CFS now also maintains all norms in a single file.
- the IndexWriter merge decision now considers hasSeparateNorms() not only for 
CFS but also for non-compound indexes.
- SegmentReader.openNorms() still creates ready-to-use/load Norm objects (which 
would read the norms only when needed). But the Norm object is now assigned a 
normSeek value, which is nonzero if the norm file is segment.nrm (a sketch of 
this follows below).
- existing indexes, from before this change, are managed the same way as 
segments resulting from addDocument.

Tests:
- I verified that the (contrib) tests for FieldNormModifier and 
LengthNormModifier also pass.

Remaining:
- I might add a test.
- more benchmarking?
- update the fileFormat document.
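
A minimal sketch of the normSeek idea described above (illustrative only - it 
assumes a simple layout of a header followed by one maxDoc-byte block per 
field, whereas the real code would use the offsets implied by the actual file 
format):

  import java.io.IOException;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.IndexInput;

  class NrmSketch {
    // Read one field's norms out of the shared segment.nrm file.
    static byte[] readFieldNorms(Directory dir, String segment, int fieldIndex,
                                 int maxDoc, int headerBytes) throws IOException {
      IndexInput in = dir.openInput(segment + ".nrm");
      try {
        long normSeek = headerBytes + (long) fieldIndex * maxDoc; // per-field offset
        in.seek(normSeek);              // the patch defers this until bytes are read
        byte[] norms = new byte[maxDoc];
        in.readBytes(norms, 0, maxDoc);
        return norms;
      } finally {
        in.close();
      }
    }
  }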

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-21 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-756?page=comments#action_12460292 ] 

Doron Cohen commented on LUCENE-756:


 Does this mean a separate file outside the final .cfs files? 

Oh no - there's a single .nrm file inside the .cfs file (instead of multiple 
.fN files inside the .cfs file). 
As before, only .sN files (separate norm files) live outside the .cfs file.


 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-21 Thread Doron Cohen (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-756?page=comments#action_12460316 ] 

Doron Cohen commented on LUCENE-756:


Thanks for the comments, Doug. 
You're right of course; I will add both the header and the constant.
(That would be either today or in a week from now.)

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-22 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: nrm.patch.2.txt

nrm.patch.2.txt:

Updated as Doug suggested: 
- the .nrm extension is now maintained in a constant.
- the .nrm file now has a 4-byte header.

And the fileFormat document is updated.

Also, I checked again that the seeks for the various field norms are lazy - 
performed only when bytes are actually read with refill().




 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.2.txt, nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2007-01-03 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462069
 ] 

Doron Cohen commented on LUCENE-756:


I am updating the patch (nrm.patch.3.txt): 

- using a single constant for the norms file extension:
  static final String NORMS_EXTENSION = "nrm";
(This is more in line with the existing extension constants in the code.)
(As a side comment, there are various extension names (e.g. .cfs) in the code 
that are also candidates for factoring into constants, but this is a separate 
issue.)

- adding a test - TestNorms.
This test verifies that norm values assigned with field.setBoost() are 
preserved during the life cycle of an index, including adding documents, 
updating norm values (separate norms), addIndexes(), and optimize.

All tests pass.
On my side this is ready to go in.


 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: https://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.2.txt, nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2007-01-03 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-756:
---

Attachment: nrm.patch.3.txt

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: https://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.2.txt, nrm.patch.3.txt, nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2007-01-04 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462287
 ] 

Doron Cohen commented on LUCENE-675:


Grant, thanks for trying this out - I will update the patch shortly. 
I am using this for benchmarking - it is quite easy to add new stuff - and in 
fact I added some stuff lately but did not update here because I wasn't sure 
whether others were interested. 
I will verify what I have against svn head and pack it here as an updated patch.
Regards,
Doron

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: https://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
Priority: Minor
 Attachments: benchmark.byTask.patch, benchmark.patch, 
 BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, 
 LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties


 We need an objective way to measure the performance of Lucene, both indexing 
 and querying, on a known corpus. This issue is intended to collect comments 
 and patches implementing a suite of such benchmarking tests.
 Regarding the corpus: one of the widely used and freely available corpora is 
 the original Reuters collection, available from 
 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
 or 
 http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
  I propose to use this corpus as a base for benchmarks. The benchmarking 
 suite could automatically retrieve it from known locations, and cache it 
 locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2007-01-04 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-675:
---

Attachment: byTask.2.patch.txt

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: https://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
Priority: Minor
 Attachments: benchmark.byTask.patch, benchmark.patch, 
 BenchmarkingIndexer.pm, byTask.2.patch.txt, extract_reuters.plx, 
 LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, 
 tiny.alg, tiny.properties


 We need an objective way to measure the performance of Lucene, both indexing 
 and querying, on a known corpus. This issue is intended to collect comments 
 and patches implementing a suite of such benchmarking tests.
 Regarding the corpus: one of the widely used and freely available corpora is 
 the original Reuters collection, available from 
 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
 or 
 http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
  I propose to use this corpus as a base for benchmarks. The benchmarking 
 suite could automatically retrieve it from known locations, and cache it 
 locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2007-01-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462774
 ] 

Doron Cohen commented on LUCENE-756:


Thanks for committing this Yonik!

Seems the added test TestNorms was not committed..?

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: https://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: nrm.patch.2.txt, nrm.patch.3.txt, nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform about 50% of the 
 IO activity of compound indexes. But their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-140) docs out of order

2007-01-08 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463176
 ] 

Doron Cohen commented on LUCENE-140:


Amazed by this long-lasting bug report, I went down routes similar to Mike's, 
and I noticed 3 things: 

(1) the sequence of ops brought by Jason is wrong: 
 -a- Open an IndexReader (#1) over an existing index (this reader is used for 
searching while updating the index)
 -b- Using this reader (#1) do a search for the document(s) that you would like 
to update; obtain their document ID numbers
 -c- Create an IndexWriter and add several new documents to the index (for me, 
this writing is done in other threads) (*)
 -d- Close the IndexWriter (*)
 -e- Open another IndexReader (#2) over the index
 -f- Delete the previously found documents by their document ID numbers using 
reader #2
 -g- Close the #2 reader
 -h- Create another IndexWriter (#2) and re-add the updated documents
 -i- Close the IndexWriter #2
 -j- Close the original IndexReader (#1) and open a new reader for general 
searching

The problem here is that the docIDs found in (b) may be altered in step (d), 
and so step (f) would delete the wrong docs. In particular, it might attempt to 
delete ids that are out of range. This might expose exactly the BitVector 
problem, and would explain the whole thing, though I too cannot see how it 
explains the delete-by-term case.
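
To make (1) concrete, the problematic sequence compressed into code (an 
illustrative sketch, not code from the issue; the step letters are marked in 
comments):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.*;
  import org.apache.lucene.store.Directory;

  class StaleDocIdPitfall {
    static void update(Directory dir, Analyzer analyzer, Term term, Document newDoc)
        throws Exception {
      IndexReader r1 = IndexReader.open(dir);                // (a) reader #1
      TermDocs td = r1.termDocs(term);                       // (b) find the doc id
      int staleId = td.next() ? td.doc() : -1;

      IndexWriter w = new IndexWriter(dir, analyzer, false); // (c) add new docs
      w.addDocument(newDoc);
      w.close();                                             // (d) merges here may
                                                             //     renumber docs
      IndexReader r2 = IndexReader.open(dir);                // (e) reader #2
      r2.deleteDocument(staleId);                            // (f) stale id: wrong doc,
      r2.close();                                            //     or out of range ->
      r1.close();                                            //     the BitVector case
    }
  }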

(2) BitVector's silent ignoring of attempts to delete slightly-out-of-bound 
docs that fall in the higher byte - this is the problem that Mike fixed. I 
think the fix is okay - though some applications might now get exceptions they 
did not get in the past - but I believe this is for their own good. 
However, when I first ran into this I didn't notice that BitVector.size() would 
become wrong as a result - nice catch Mike!

I think, however, that the test Mike added does not expose the docs-out-of-order 
bug - I tried this test without the fix and it only fails on the gotException 
assert - if you comment out this assert, the test passes. 

The following test would expose the out-of-order bug - it fails with 
out-of-order before the fix, and succeeds once the fix is applied. 

  public void testOutOfOrder () throws IOException {
    String tempDir = System.getProperty("java.io.tmpdir");
    if (tempDir == null) {
      throw new IOException("java.io.tmpdir undefined, cannot run test: "
          + getName());
    }

    File indexDir = new File(tempDir, "lucenetestindexTemp");
    Directory dir = FSDirectory.getDirectory(indexDir, true);

    boolean create = true;
    int numDocs = 0;
    int maxDoc = 0;
    while (numDocs < 100) {
      IndexWriter iw = new IndexWriter(dir, anlzr, create);
      create = false;
      iw.setUseCompoundFile(false);
      for (int i = 0; i < 2; i++) {
        Document d = new Document();
        d.add(new Field("body", "body" + i, Store.NO, Index.UN_TOKENIZED));
        iw.addDocument(d);
      }
      iw.optimize();
      iw.close();
      IndexReader ir = IndexReader.open(dir);
      numDocs = ir.numDocs();
      maxDoc = ir.maxDoc();
      assertEquals(numDocs, maxDoc);
      for (int i = 7; i >= -1; i--) {
        try {
          ir.deleteDocument(maxDoc + i);
        } catch (ArrayIndexOutOfBoundsException e) {
        }
      }
      ir.close();
    }
  }

Mike, do you agree?

(3) maxDoc() computation in SegmentReader is based (on some paths) on 
RandomAccessFile.length(). IIRC I saw cases (in a previous project) where 
File.length() or RAF.length() (not sure which of the two) did not always 
reflect the real length if the system was very busy IO-wise, unless FD.sync() 
was called (with a performance hit). 

This post seems relevant - RAF.length over 2GB in NFS - 
http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657 

Not sure if this can be the case here, but at least we can discuss whether it 
is better to always store the length.




 docs out of order
 -

 Key: LUCENE-140
 URL: https://issues.apache.org/jira/browse/LUCENE-140
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: unspecified
 Environment: Operating System: Linux
 Platform: PC
Reporter: legez
 Assigned To: Michael McCandless
 Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar


 Hello,
   I can not find out why (and what) is happening all the time. I got an
 exception:
 java.lang.IllegalStateException: docs out of order
 at
 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
 at
 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
 at
 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
 at 

[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463483
 ] 

Doron Cohen commented on LUCENE-140:


Jed, is it possible that when re-creating the index, while IndexWriter is 
constructed with create=true, FSDirectory is opened with create=false?
I suspect so, because otherwise the old .del files would have been deleted. 
If so, newly created segments, which have the same names as segments from 
previous (bad) runs, would read the (bad) old .del file when opened. 
This would possibly expose the bug fixed by Michael. 
I may be over-speculating here, but if this is the case, it can also explain 
why changing the merge factor from 4 to 10 exposed the problem. 

In fact, let me speculate even further - if, when creating the index from 
scratch, the FSDirectory is indeed (mistakenly) opened with create=false, then 
as long as you always repeated the same sequence of adding and deleting docs, 
you were likely to suffer almost nothing from this mistake: segments created 
with the same names as (old) .del files simply see docs as deleted before the 
docs are actually deleted by the program. Search behaves wrongly, not finding 
these docs before they are actually deleted, but no exception is thrown when 
adding docs. However, once the merge factor was changed from 4 to 10, the 
matching between old .del files and new segments (with the same names) was 
broken, and the out-of-order exception appeared. 

...and if this is not the case, we would need to look for something else...
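
The suspected mismatch, in sketch form (a guess at the application code, not 
something taken from the issue):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  class CreateFlagMismatch {
    // Suspected bug: the directory is opened with create=false, keeping stale
    // files (including old .del files) from earlier runs...
    static IndexWriter recreateSuspect(String path, Analyzer a) throws Exception {
      Directory dir = FSDirectory.getDirectory(path, false);
      // ...while the writer believes it is building the index from scratch.
      return new IndexWriter(dir, a, true);
    }

    // Intended when truly re-creating: create=true wipes the old files first.
    static IndexWriter recreateIntended(String path, Analyzer a) throws Exception {
      Directory dir = FSDirectory.getDirectory(path, true);
      return new IndexWriter(dir, a, true);
    }
  }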

 docs out of order
 -

 Key: LUCENE-140
 URL: https://issues.apache.org/jira/browse/LUCENE-140
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: unspecified
 Environment: Operating System: Linux
 Platform: PC
Reporter: legez
 Assigned To: Michael McCandless
 Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, 
 indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch


 Hello,
   I can not find out why (and what) is happening all the time. I got an
 exception:
 java.lang.IllegalStateException: docs out of order
 at
 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
 at
 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
 at
 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
 at 
 org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
 at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
 at Optimize.main(Optimize.java:29)
 It happens in both 1.2 and 1.3rc1 (anyway, what happened to it? I can not find
 it either in the download or in the version list in this form). Everything
 seems OK. I can search through the index, but I can not optimize it. Even
 worse, after this exception, every time I add new documents and close
 IndexWriter a new segment is created! I think it has all documents added
 before, because of its size.
 My index is quite big: 500.000 docs, about 5gb of index directory.
 It is _repeatable_. I drop the index, reindex everything. Afterwards I add a
 few docs, try to optimize and receive the above exception.
 My documents' structure is:
   static Document indexIt(String id_strony, Reader reader, String 
 data_wydania,
 String id_wydania, String id_gazety, String data_wstawienia)
 {
 Document doc = new Document();
 doc.add(Field.Keyword("id", id_strony));
 doc.add(Field.Keyword("data_wydania", data_wydania));
 doc.add(Field.Keyword("id_wydania", id_wydania));
 doc.add(Field.Text("id_gazety", id_gazety));
 doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
 doc.add(Field.Text("tresc", reader));
 return doc;
 }
 Sincerely,
 legez

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2007-01-10 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-675:
---

Attachment: byTask.jre1.4.patch.txt

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: https://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
Priority: Minor
 Attachments: benchmark.byTask.patch, benchmark.patch, 
 BenchmarkingIndexer.pm, byTask.2.patch.txt, byTask.jre1.4.patch.txt, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, 
 taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties


 We need an objective way to measure the performance of Lucene, both indexing 
 and querying, on a known corpus. This issue is intended to collect comments 
 and patches implementing a suite of such benchmarking tests.
 Regarding the corpus: one of the widely used and freely available corpora is 
 the original Reuters collection, available from 
 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
 or 
 http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
  I propose to use this corpus as a base for benchmarks. The benchmarking 
 suite could automatically retrieve it from known locations, and cache it 
 locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2007-01-10 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463830
 ] 

Doron Cohen commented on LUCENE-675:


Oops... I had the impression that compiling with compliance level 1.4 is 
sufficient to prevent this, but I guess I need to read again what that 
compliance level setting guarantees exactly. 

Anyhow, there are 3 things that require 1.5:
 - Boolean.parseBoolean() --> Boolean.valueOf().booleanValue()
 - String.contains() --> indexOf()
 - Class.getSimpleName() --> ?

Modifying Class.getSimpleName() to Class.getName() would not be very nice - 
query prints and task-name prints would be quite ugly. To fix that I added a 
method simpleName(Class) to byTask.util.Format. I am attaching an updated patch 
- byTask.jre1.4.patch.txt - that includes this method and removes the Java 1.5 
dependency.

Thanks for catching this!
Doron
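
The 1.4-safe replacements, in sketch form (simpleName() here is a guess at the 
helper's shape, not necessarily the exact code in byTask.util.Format):

  public class Jre14Compat {
    static boolean parseBoolean(String s) {
      return Boolean.valueOf(s).booleanValue();   // was Boolean.parseBoolean(s)
    }
    static boolean contains(String s, String sub) {
      return s.indexOf(sub) >= 0;                 // was s.contains(sub)
    }
    static String simpleName(Class c) {           // was c.getSimpleName()
      String name = c.getName();
      // strip the package prefix, and the outer class for nested classes
      int cut = Math.max(name.lastIndexOf('.'), name.lastIndexOf('$'));
      return name.substring(cut + 1);
    }
  }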

 Lucene benchmark: objective performance test for Lucene
 ---

 Key: LUCENE-675
 URL: https://issues.apache.org/jira/browse/LUCENE-675
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Assigned To: Grant Ingersoll
Priority: Minor
 Attachments: benchmark.byTask.patch, benchmark.patch, 
 BenchmarkingIndexer.pm, byTask.2.patch.txt, byTask.jre1.4.patch.txt, 
 extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, 
 taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties


 We need an objective way to measure the performance of Lucene, both indexing 
 and querying, on a known corpus. This issue is intended to collect comments 
 and patches implementing a suite of such benchmarking tests.
 Regarding the corpus: one of the widely used and freely available corpora is 
 the original Reuters collection, available from 
 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
 or 
 http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
  I propose to use this corpus as a base for benchmarks. The benchmarking 
 suite could automatically retrieve it from known locations, and cache it 
 locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-741) Field norm modifier (CLI tool)

2007-01-11 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464105
 ] 

Doron Cohen commented on LUCENE-741:


I was looking at what it would take to make this work with the .nrm file as 
well. I expected there would be a test that currently fails, but there is none.
So I looked into the tests and the implementation and have a few questions:

(1) under contrib, FieldNormModifier and LengthNormModifier seem quite similar, 
right? 
The first one sets with:
  reader.setNorm(d, fieldName,
      sim.encodeNorm(sim.lengthNorm(fieldName, termCounts[d])));
The latter with:
  byte norm = sim.encodeNorm(sim.lengthNorm(fieldName, termCounts[d]));
  reader.setNorm(d, fieldName, norm);
Do we need to keep both?

(2) TestFieldNormModifier.testFieldWithNoNorm() calls resetNorms() for a field 
that does not exist. Some work is done by the modifier to collect the term 
frequencies, and then reader.setNorm is called but it does nothing, because 
there are no norms. And indeed the test verifies that there are still no norms 
for this field. This is confusing, I think. For some reason I assumed that 
calling resetNorms() for a field that has none would implicitly set omitNorms 
to false for that field and compute it - the inverse of killNorms(). Since this 
is not the case, perhaps resetNorms() should throw an exception in this case?

(3) I would feel safer about this feature if the test was more strict - 
something like TestNorms - have several fields, modify some, each in a unique 
way, remove some others, then at the end verify that all the values of each 
field's norms are exactly as expected. 

(4) For killNorms to work, you can first revert the index to not use .nrm, and 
then kill as before. The code knows to read .fN files, both for backwards 
compatibility and for reading segments created by DocumentWriter. The 
following steps will do this (a sketch follows after item (5)):
- read the norms using reader.norms(field)
- write them into .fN files
- remove the .nrm file
- modify segmentInfo to know that it has no .nrm file.

(5) It would be more efficient to optimize (and remove the .nrm file) only once 
in the application, so perhaps modify the public API to take an array of fields 
and operate on all of them?
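
For concreteness, a rough sketch of the steps in (4) (illustrative only - the 
names are made up, the layout is assumed, and real code would have to live in 
org.apache.lucene.index since SegmentInfo is package-private):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.IndexOutput;

  class RevertNrmSketch {
    // Revert one field's norms from the shared .nrm back to a per-field .fN file.
    static void revertField(Directory dir, IndexReader reader, String segment,
                            String field, int fieldNumber) throws IOException {
      byte[] norms = reader.norms(field);                   // step 1: read norms
      IndexOutput out = dir.createOutput(segment + ".f" + fieldNumber);
      try {
        out.writeBytes(norms, norms.length);                // step 2: write .fN
      } finally {
        out.close();
      }
      // step 3: dir.deleteFile(segment + ".nrm") once every field is written;
      // step 4: flip the segment's "has .nrm" flag in SegmentInfo (internal API).
    }
  }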


 Field norm modifier (CLI tool)
 --

 Key: LUCENE-741
 URL: https://issues.apache.org/jira/browse/LUCENE-741
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Otis Gospodnetic
 Assigned To: Otis Gospodnetic
Priority: Minor
 Attachments: LUCENE-741.patch, LUCENE-741.patch


 I took Chris' LengthNormModifier (contrib/misc) and modified it slightly, to 
 allow us to set fake norms on existing fields, effectively making it 
 equivalent to Field.Index.NO_NORMS.
 This is related to LUCENE-448 (NO_NORMS patch) and LUCENE-496 
 (LengthNormModifier contrib from Chris).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-741) Field norm modifier (CLI tool)

2007-01-11 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-741:
---

Attachment: for.nrm.patch

 Field norm modifier (CLI tool)
 --

 Key: LUCENE-741
 URL: https://issues.apache.org/jira/browse/LUCENE-741
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Otis Gospodnetic
 Assigned To: Otis Gospodnetic
Priority: Minor
 Attachments: for.nrm.patch, LUCENE-741.patch, LUCENE-741.patch


 I took Chris' LengthNormModifier (contrib/misc) and modified it slightly, to 
 allow us to set fake norms on existing fields, effectively making it 
 equivalent to Field.Index.NO_NORMS.
 This is related to LUCENE-448 (NO_NORMS patch) and LUCENE-496 
 (LengthNormModifier contrib from Chris).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-741) Field norm modifier (CLI tool)

2007-01-11 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-741:
---

Attachment: (was: for.nrm.patch)





[jira] Updated: (LUCENE-741) Field norm modifier (CLI tool)

2007-01-11 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-741:
---

Attachment: for.nrm.patch





[jira] Commented: (LUCENE-741) Field norm modifier (CLI tool)

2007-01-11 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464109
 ] 

Doron Cohen commented on LUCENE-741:


Attached for.nrm.patch was very noisy, so I replaced it with one created with 
   svn diff -x --ignore-eol-style contrib/miscellaneous
It is relative to trunk.

A test was added to TestFieldNormModifier.java - 
testModifiedNormValuesCombinedWithKill - that verifies exactly what the norm 
values are after modification.

FieldNormModifier was modified to handle the .nrm file as outlined above.






[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2007-01-12 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464250
 ] 

Doron Cohen commented on LUCENE-665:


Hi Michael,

Funny that I got this email with reply-to set to you rather than the list.
The funnier part is that I really wanted to reply to you directly rather than
to the list. Is JIRA a mind reader?

Yes, I would like to close the issue - I already said that in my Oct 30
post.

I would like to do this myself - should I close or resolve the issue?
Or perhaps first resolve and then close?  I think I read the life cycle of an
issue somewhere, but I cannot find it. I am also wondering whether it should
be resolved as Won't Fix or as Duplicate.

Thanks,
Doron



[jira] Resolved: (LUCENE-665) temporary file access denied on Windows

2007-01-12 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-665.


Resolution: Won't Fix

With lockless commits this is no longer reproducible. Although it seems that 
theoretically some cases should still reproduce it, practice suggests 
otherwise, and there is no sufficient justification to introduce retry logic 
(which is not a 100% solution anyhow).


 temporary file access denied on Windows
 ---

 Key: LUCENE-665
 URL: https://issues.apache.org/jira/browse/LUCENE-665
 Project: Lucene - Java
  Issue Type: Bug
  Components: Store
Affects Versions: 2.0.0
 Environment: Windows
Reporter: Doron Cohen
 Attachments: FSDirectory_Retry_Logic.patch, 
 FSDirs_Retry_Logic_3.patch, FSWinDirectory.patch, 
 FSWinDirectory_26_Sep_06.patch, Test_Output.txt, 
 TestInterleavedAddAndRemoves.java


 When interleaving adds and removes there is frequent opening/closing of 
 readers and writers. 
 I tried to measure performance in such a scenario (for issue 565), but the 
 performance test failed  - the indexing process crashed consistently with 
 file access denied errors - cannot create a lock file in 
 lockFile.createNewFile() and cannot rename file.
 This is related to:
 - issue 516 (a closed issue: TestFSDirectory fails on Windows) - 
 http://issues.apache.org/jira/browse/LUCENE-516 
 - user list questions due to file errors:
   - 
 http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
   - 
 http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
 - discussion on lock-less commits 
 http://www.nabble.com/Lock-less-commits-tf2126935.html
 My test setup is: XP (SP1), JAVA 1.5 - both SUN and IBM SDKs. 
 I noticed that the problem is more frequent when locks are created on one 
 disk and the index on another. Both are NTFS with Windows indexing service 
 enabled. I suspect this indexing service might be related - keeping files 
 busy for a while, but don't know for sure.
 After experimenting with it I conclude that these problems - at least in my 
 scenario - are due to a temporary situation - the FS, or the OS, is 
 *temporarily* holding references to files or folders, preventing them from 
 being renamed or deleted, and preventing new files from being created in 
 certain directories. 
 So I added retry logic to FSDirectory for cases where the error was related 
 to Access Denied. This is the same approach brought up in 
 http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
  - there, in addition to the retry, gc() is invoked (I did not gc()). This is 
 based on the *hope* that an access-denied situation would vanish after a small 
 delay, and the retry would succeed.
 I modified FSDirectory this way for Access Denied errors while creating new 
 files and renaming files.
 This worked fine for me. The performance test that failed before now manages 
 to complete. There should be no performance implications due to this 
 modification, because only the cases that would otherwise wrongly fail now 
 delay some extra milliseconds and retry.
 I am attaching here a patch - FSDirectory_Retry_Logic.patch - that has these 
 changes to FSDirectory. 
 All ant test tests pass with this patch.
 Also attaching a test case that demonstrates the problem - at least on my 
 machine. There are two test cases in that test file - one that works in system 
 temp (like most Lucene tests) and one that creates the index on a different 
 disk. The latter case can only run if the path (D: , tmp) is valid.
 It would be great if people who experienced these problems could try out 
 this patch and comment on whether it made any difference for them. 
 If it turns out useful for others as well, including this patch in the code 
 might help relieve some of those frustrating user cases.
 A comment on the state of the proposed patch: 
 - It is not ready-to-deploy code - it has some debug printing, showing 
 the cases where the retry logic actually took place. 
 - I am not sure if the current 30ms is the right delay... why not 50ms? 10ms? 
 This is currently defined by a constant.
 - Should a call to gc() be added? (I think not.)
 - Should the retry also be attempted on non-access-denied exceptions? (I 
 think not.)
 - I feel it is somewhat voodoo programming - though I don't like it, it 
 seems to work... 
 Attached files:
 1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without 
 the patch and passes with the patch.
 2. FSDirectory_Retry_Logic.patch
 3. Test_Output.txt - output of the test with the patch, on my XP. Only the 
 createNewFile() case had to be bypassed in this test, but for another program 
 I also saw the renameFile() being bypassed.
 - Doron


[jira] Commented: (LUCENE-771) Change default write lock file location to index directory (not java.io.tmpdir)

2007-01-12 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464395
 ] 

Doron Cohen commented on LUCENE-771:


Is that true? I thought that for previous format changes, the combination of 
(1) point-in-time index reading by readers, (2) backwards compatibility, and 
(3) locks meant this was not required. 

 Change default write lock file location to index directory (not 
 java.io.tmpdir)
 ---

 Key: LUCENE-771
 URL: https://issues.apache.org/jira/browse/LUCENE-771
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor
 Fix For: 2.1


 Now that readers are read-only, we no longer need to store lock files
 in a different global lock directory than the index directory.  This
 has been a source of confusion and caused problems to users in the
 past.
 Furthermore, once the write lock is stored in the index directory, it
 no longer needs the big digest prefix that was previously required
 to make sure lock files in the global lock directory, from different
 indexes, did not conflict.
 This way, all files related to an index will appear in a single
 directory.  And you can easily list that directory to see if a
 write.lock is present to check whether a writer is open on the
 index.
 Note that this change just affects how FSDirectory creates its default
 lockFactory if no lockFactory was specified.  It is still possible
 (just no longer the default) to pick a different directory to store
 your lock files by pre-instantiating your own LockFactory.
 As part of this I would like to remove LOCK_DIR and the no-argument
 constructor, in SimpleFSLockFactory and NativeFSLockFactory.  I don't
 think we should have the notion of a global default lock directory
 anymore.  This is actually an API change.  However, neither
 SimpleFSLockFactory nor NativeFSLockFactory have been released yet,
 so I think this API removal is allowed?
 Finally I want to deprecate (but not yet remove, because this has been
 in the API for many releases) the static LOCK_DIR that's in
 FSDirectory.  But it's now entirely unused.
 See here for discussion leading to this:
   http://www.gossamer-threads.com/lists/lucene/java-dev/43940
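To illustrate the pre-instantiated LockFactory option mentioned above, a 
sketch against the 2.1-era API (the paths are placeholders):

    import java.io.File;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.SimpleFSLockFactory;

    // Sketch only: keep lock files in a custom directory by passing your own
    // LockFactory; without it, write.lock is created in the index directory.
    public class CustomLockDir {
      public static void main(String[] args) throws Exception {
        SimpleFSLockFactory lockFactory =
            new SimpleFSLockFactory(new File("locks"));
        FSDirectory dir =
            FSDirectory.getDirectory(new File("index"), lockFactory);
        // ... open an IndexWriter/IndexReader on dir as usual ...
        dir.close();
      }
    }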




[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2007-01-16 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465260
 ] 

Doron Cohen commented on LUCENE-756:


Michael, I like this improvement!

(At first I considered adding such a FORMAT level but decided it was not 
worth it, aiming for backwards compatibility with pre-lockless indexes. Then I 
had to add that file check - the wrong trade-off indeed.)

Two minor comments:
- getHasMergedNorms() is private and the method now has no logic - I would 
remove the method and refer to hasMergedNorms directly.
- the term merged (in hasMergedNorms) is a little overloaded with other 
semantics (in Lucene), though I cannot think of a better short descriptive 
term.

Thanks for improving this,
Doron

 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: https://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: index.premergednorms.cfs.zip, 
 index.premergednorms.nocfs.zip, LUCENE-756-Jan16.patch, 
 LUCENE-756-Jan16.Take2.patch, nrm.patch.2.txt, nrm.patch.3.txt, nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing and perform 50% of the IO 
 activity of compound indexes, but their file-descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non-compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




[jira] Commented: (LUCENE-781) NPE in MultiReader.isCurrent() and getVersion()

2007-01-22 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466636
 ] 

Doron Cohen commented on LUCENE-781:


I checked - the fix works and the code seems right.

While we are looking at this, there are a few more IndexReader methods 
which are not implemented by MultiReader.

These three methods seem OK:
- document(int)
  would work because IndexReader delegates to document(int,FieldSelector), 
  which is implemented in MultiReader.
- termDocs(Term), 
- termPositions(Term)
  would both work because IndexReader's implementation goes to termDocs() or 
  termPositions(), both of which are implemented in MultiReader.

These three methods should probably be fixed:
- isOptimized() 
  would fail - similar to isCurrent().
- setNorm(int, String, float)
  would fail too, for a similar reason.
- directory()
  would not fail, but falls back to returning the directory of reader[0] - 
  is this correct behavior?
  This is because the MultiReader constructor calls super with reader[0]; 
  again, I am not sure this is correct (why allow creating a multi-reader 
  with no readers at all?).



 NPE in MultiReader.isCurrent() and getVersion()
 ---

 Key: LUCENE-781
 URL: https://issues.apache.org/jira/browse/LUCENE-781
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Daniel Naber
 Attachments: multireader.diff, multireader_test.diff


 I'm attaching a fix for the NPE in MultiReader.isCurrent() plus a test case. 
 For getVersion(), we should throw a better exception than NPE. I will commit 
 unless someone objects or has a better idea.




[jira] Commented: (LUCENE-781) NPE in MultiReader.isCurrent() and getVersion()

2007-01-23 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466852
 ] 

Doron Cohen commented on LUCENE-781:


I thought it would not break MultiReader, just do unnecessary work for that 
method...?

The same new test (using the readers[] constructor) would also fail in 
previous versions. 

I think the main difference is that for the MultiReader created inside 
IndexReader, (1) all readers share the same directory, and (2) it maintains a 
SegmentInfos read from that single directory. 

This is not the case for the other (but still valid (?)) usage of 
MultiReader, because there is no single directory (well, not necessarily) and 
hence no SegmentInfos for the MultiReader. 

So a possible fix would be (see the sketch below):
- define a boolean predicate in MultiReader, e.g. isWholeIndex 
- it would be true when constructed with a non-null dir and a non-null 
segmentInfos 
- base the operation upon it: 
  if isWholeIndex, call super.isCurrent(); otherwise do the (multi) logic in 
  the current fix.
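A sketch of that fix, as a fragment that would live inside MultiReader (the 
names isWholeIndex and subReaders are assumptions, not the committed code):

    // Sketch only, inside MultiReader:
    // private boolean isWholeIndex; // true when built from one dir + SegmentInfos
    public boolean isCurrent() throws java.io.IOException {
      if (isWholeIndex) {
        // single directory + SegmentInfos: the inherited version check applies
        return super.isCurrent();
      }
      // otherwise, current only if every sub-reader is current
      for (int i = 0; i < subReaders.length; i++) {
        if (!subReaders[i].isCurrent()) {
          return false;
        }
      }
      return true;
    }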





[jira] Commented: (LUCENE-781) NPE in MultiReader.isCurrent() and getVersion()

2007-01-26 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467953
 ] 

Doron Cohen commented on LUCENE-781:


One could write an application that groups readers into multiReaders in more 
than one level, i.e. r1,r2,r3 grouped into rr1, r4,r5,r6 grouped into rr2, and 
rr1,rr2 grouped into rrr. If rrr.isCurrent() throws unsupported, the 
application needs to ask the question recursively. 

I am not aware of such an application, so you could argue this is only 
theoretical; still, it demonstrates a strength of Lucene. Also, here too, as 
argued above, even if the answer is false (not current), the application would 
need to apply the same recursive logic to reopen the non-current reader and 
reconstruct the multi-reader. 

So I agree it is valid to throw unsupported.

It just feels a bit uncomfortable to throw unsupported for an existing API 
method with a well-defined meaning that is quite easy to implement (relying on 
the fact that it was anyhow never implemented correctly). 






[jira] Updated: (LUCENE-788) contrib/benchmark assumes Locale.US for parsing dates in Reuters collection

2007-01-27 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-788:
---

Attachment: 788_benchmark_parseDate_locale_Jan_27.patch

Locale.US passed to SimpleDateFormat constructor.

 contrib/benchmark assumes Locale.US for parsing dates in Reuters collection
 ---

 Key: LUCENE-788
 URL: https://issues.apache.org/jira/browse/LUCENE-788
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Fix For: 2.1

 Attachments: 788_benchmark_parseDate_locale_Jan_27.patch


 The SimpleDateFormat used for parsing dates in Reuters documents is 
 instantiated without specifying a locale, so it uses the default locale. If 
 that happens to be US it will work, but for any other locale a parse 
 exception is likely.
 This affects both StandardBenchmarker and ReutersDocMaker.
 The fix is trivial - specify Locale.US in SimpleDateFormat's constructor.
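For illustration (the date pattern here is an assumption about the Reuters 
format, shown only to make the failure mode concrete):

    import java.text.SimpleDateFormat;
    import java.util.Locale;

    public class LocaleDemo {
      public static void main(String[] args) throws Exception {
        // With a non-US default locale, month names like "FEB" fail to parse;
        // pinning Locale.US makes parsing independent of the JVM's default.
        SimpleDateFormat df =
            new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss.SS", Locale.US);
        System.out.println(df.parse("26-FEB-1987 15:01:01.79"));
      }
    }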




[jira] Updated: (LUCENE-788) contrib/benchmark assumes Locale.US for parsing dates in Reuters collection

2007-01-27 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-788:
---

Lucene Fields: [New, Patch Available]  (was: [New])





[jira] Updated: (LUCENE-804) build.xml: result of dist-src should support build-contrib

2007-02-15 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-804:
---

Attachment: 804.build.xml.patch

804.build.xml.patch removes downloadable jars from the src dist
and adds back the jars that are (currently) not downloadable. This 
allows the src dist to compile contribs (or even to re-dist).

Size effect: the src dist is reduced from 8.9 MB to 6.8 MB.


 build.xml: result of dist-src should support build-contrib
 --

 Key: LUCENE-804
 URL: https://issues.apache.org/jira/browse/LUCENE-804
 Project: Lucene - Java
  Issue Type: Task
  Components: Other
Affects Versions: 2.1
Reporter: Doron Cohen
Priority: Minor
 Fix For: 2.1

 Attachments: 804.build.xml.patch


 Currently the packed src distribution would fail to run ant build-contrib.
 It would be much nicer if that worked.
 In fact, it would be nicer still if you could even re-pack with it.
 For now I marked this for 2.1, although I am not yet sure whether this is a 
 stopper.




[jira] Updated: (LUCENE-804) build.xml: result of dist-src should support build-contrib

2007-02-15 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-804:
---

Lucene Fields: [New, Patch Available]  (was: [New])





[jira] Updated: (LUCENE-804) build.xml: result of dist-src should support build-contrib

2007-02-16 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-804:
---

Lucene Fields: [Patch Available]  (was: [Patch Available, New])
Affects Version/s: (was: 2.1)
Fix Version/s: (was: 2.1)
 Assignee: Doron Cohen

[1]  Modifying 'fix version' to not be 2.1, thereby clarifying that,
since releases are to be more frequent, this should not be 
regarded as a stopper for releasing 2.1.

[2]  I would like to commit this in a day or so (unless anyone 
points out a problem with it).






[jira] Created: (LUCENE-808) bufferDeleteTerm in IndexWriter might flush prematurely

2007-02-21 Thread Doron Cohen (JIRA)
bufferDeleteTerm in IndexWriter might flush prematurely
---

 Key: LUCENE-808
 URL: https://issues.apache.org/jira/browse/LUCENE-808
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.1
Reporter: Doron Cohen


Successive calls to remove-by-the-same-term would increment 
numBufferedDeleteTerms, although all but the first are no-ops if no docs were 
added in between. Hence deletes would be flushed too soon.

It is a minor problem and should be rare, but it seems cleaner to fix this. 

The attached patch also fixes TestIndexWriterDelete.testNonRAMDelete(), which 
somehow relied on this behavior. All tests pass.
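To make the failure mode concrete, a toy model of the counting logic (a 
hypothetical simplification, not IndexWriter's actual code):

    import java.util.HashMap;
    import java.util.Map;

    // Toy model: repeated deletes by the same term, with no adds in between,
    // should not inflate the buffered-delete count.
    public class BufferedDeleteDemo {
      private final Map bufferedDeleteTerms = new HashMap(); // term -> docCount
      private int numBufferedDeleteTerms = 0;
      private int docCount = 0; // docs added so far

      void addDocument() { docCount++; }

      void bufferDeleteTerm(String term) {
        Integer last = (Integer) bufferedDeleteTerms.get(term);
        // count the delete only if it is not a no-op repeat at the same point
        // in the add stream
        if (last == null || last.intValue() != docCount) {
          bufferedDeleteTerms.put(term, new Integer(docCount));
          numBufferedDeleteTerms++;
        }
      }

      public static void main(String[] args) {
        BufferedDeleteDemo w = new BufferedDeleteDemo();
        w.bufferDeleteTerm("id:1");
        w.bufferDeleteTerm("id:1"); // no-op repeat: not counted again
        System.out.println(w.numBufferedDeleteTerms); // prints 1
      }
    }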



