[jira] Commented: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method

2007-12-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550843
 ] 

Shai Erera commented on LUCENE-1088:


If you're adding a wouldBeInserted method, I'd also add an insertWithNoCheck that 
either calls put() (if there is room in the queue) or replaces the top() 
element, without re-evaluating the element (by calling lessThan). This will 
save unnecessary calls (the lessThan() method can be expensive). The method's 
documentation should note, though, that insertWithNoCheck assumes 
wouldBeInserted was called first.
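
A minimal sketch of how such a pair could sit on top of Lucene's 2.x
PriorityQueue API (the subclass name is made up, and the implementation is an
assumption drawn from the comment above, not the committed patch):

{code}
import org.apache.lucene.util.PriorityQueue;

// Hypothetical subclass sketching the idea above -- not the LUCENE-1088 patch.
public abstract class CheckingQueue extends PriorityQueue {
  private final int maxSize;

  protected CheckingQueue(int maxSize) {
    this.maxSize = maxSize;
    initialize(maxSize);
  }

  /** True if insert(element) would enter the queue, without inserting it. */
  public boolean wouldBeInserted(Object element) {
    return size() < maxSize || lessThan(top(), element);
  }

  /** Assumes the caller just saw wouldBeInserted(element) == true,
   *  so lessThan() is never re-evaluated here. */
  public void insertWithNoCheck(Object element) {
    if (size() < maxSize) {
      put(element);   // room left: plain put, no comparison
    } else {
      pop();          // drop the current smallest (known to lose to element)
      put(element);   // add the new element
    }
  }
}
{code}

Replacing the top element in place would be cheaper than pop()+put(), but the
heap array is private in the current PriorityQueue, so this sketch sticks to
the public API.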


> PriorityQueue 'wouldBeInserted' method
> --
>
> Key: LUCENE-1088
> URL: https://issues.apache.org/jira/browse/LUCENE-1088
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Other
>Reporter: Peter Keegan
>Assignee: Michael McCandless
> Attachments: LUCENE-1088.patch
>
>
> This is a request for a new method in PriorityQueue
> public boolean wouldBeInserted(Object element)
> // returns true if doc would be inserted, without inserting 
> This would allow an application to prevent duplicate entries from being added 
> to the queue.
> Here is a reference to the discussion behind this request:
> http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-944) Remove deprecated methods in BooleanQuery

2007-12-11 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-944:
-

Fix Version/s: (was: 2.3)
   3.0

> Remove deprecated methods in BooleanQuery
> -
>
> Key: LUCENE-944
> URL: https://issues.apache.org/jira/browse/LUCENE-944
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
> Attachments: BooleanQuery20070626.patch
>
>
> Remove deprecated methods setUseScorer14 and getUseScorer14 in BooleanQuery, 
> and adapt javadocs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-673) Exceptions when using Lucene over NFS

2007-12-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-673.
---

Resolution: Fixed

> Exceptions when using Lucene over NFS
> -
>
> Key: LUCENE-673
> URL: https://issues.apache.org/jira/browse/LUCENE-673
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.0.0
> Environment: NFS server/client
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.2
>
>
> I'm opening this issue to track details on the known problems with
> Lucene over NFS.
> The summary is: if you have one machine writing to an index stored on
> an NFS mount, and other machine(s) reading (and periodically
> re-opening the index) then sometimes on re-opening the index the
> reader will hit a FileNotFound exception.
> This has hit many users because this is a natural way to "scale up"
> your searching (single writer, multiple readers) across machines.  The
> best current workaround (I think?) is to take the approach Solr takes
> (either by actually using Solr or copying/modifying its approach) to
> take snapshots of the index and then have the readers open the
> snapshots instead of the "live" index being written to.
> I've been working on two patches for Lucene:
>   * A locking (LockFactory) implementation using native OS locks
>   * Lock-less commits
> (I'll open separate issues with the details for those).
> I have a simple stress test where one machine is constantly adding
> docs to an index over NFS, and another machine is constantly
> re-opening the index searcher over NFS.
> These tests have revealed new details (at least for me!) about the
> root cause of our NFS problems:
>   * Even when using native locks over NFS, Lucene still hits these
> exceptions!
> I was surprised by this because I had always thought (assumed?)
> the NFS problem was because the "simple" file-based locking was
> not correct over NFS, and that switching to native OS filesystem
> locking would resolve it, but it doesn't.
> I can reproduce the "FileNotFound" exceptions even when using NFS
> V4 (the latest NFS protocol), so this is not just a "your NFS
> server is too old" issue.
>   * Then, when running the same stress test with the lock-less
> changes, I don't hit any exceptions.  I've tested on NFS version
> 2, 3 and 4 (using the "nfsvers=N" mount option).
> I think this means that in fact (as Hoss at one point suggested I
> believe), the NFS problems are likely due to the cache coherence of
> the NFS file system (I think the "segments" file in particular)
> against the existence of the actual segment data files.
> In other words, even if you lock correctly, on the reader side it will
> sometimes see stale contents of the "segments" file which lead it to
> try to open a now deleted segment data file.
> So I think this is good news / bad news: the bad news is, native
> locking doesn't fix our problems with NFS (as at least I had expected
> it to).  But the good news is, it looks like (still need to do more
> thorough testing of this) the changes for lock-less commits do enable
> Lucene to work fine over NFS.
> [One quick side note in case it helps others: to get native locks
> working over NFS on Ubuntu/Debian Linux 6.06, I had to "apt-get
> install nfs-common" on the NFS client machines.  Before I did this I
> would hit "No locks available" IOExceptions on calling the "tryLock"
> method.  The default nfs server install on the server machine just
> worked because it runs in kernel mode and starts a lockd process.]
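
For context, a hedged sketch of the reader-side pattern that hits this failure
(IndexReader.open is Lucene's public API; the retry policy below is purely
illustrative, not something this issue prescribes):

{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Hedged sketch: a searcher box periodically re-opens an index on an NFS
// mount.  With stale client-side caching, the freshly read segments file
// can still reference a segment data file the writer already deleted,
// producing the FileNotFoundException described above.
public class NfsReopen {
  public static IndexReader reopenWithRetry(String indexPath) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt < 3; attempt++) {  // retry count is arbitrary
      try {
        return IndexReader.open(indexPath);  // reads segments, then segment files
      } catch (FileNotFoundException e) {
        // stale segments file referenced a deleted segment; wait and retry
        last = e;
        try { Thread.sleep(1000); } catch (InterruptedException ie) { break; }
      }
    }
    throw last;
  }
}
{code}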

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method

2007-12-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1088:
---

Attachment: LUCENE-1088.patch

Attached patch.  All tests pass. I plan to commit sometime tomorrow...

> PriorityQueue 'wouldBeInserted' method
> --
>
> Key: LUCENE-1088
> URL: https://issues.apache.org/jira/browse/LUCENE-1088
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Other
>Reporter: Peter Keegan
>Assignee: Michael McCandless
> Attachments: LUCENE-1088.patch
>
>
> This is a request for a new method in PriorityQueue
> public boolean wouldBeInserted(Object element)
> // returns true if doc would be inserted, without inserting 
> This would allow an application to prevent duplicate entries from being added 
> to the queue.
> Here is a reference to the discussion behind this request:
> http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550710
 ] 

Michael McCandless commented on LUCENE-753:
---

OK my results on Win XP now agree with Yonik's.

On UNIX & OS X, ChannelPread is a bit (2-14%) better, but on Windows
it's quite a bit (31-34%) slower.

Win Server 2003 R2 Enterprise x64 (Sun Java 1.6):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=68094, MB/sec=197.10654095808735

config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=72594, MB/sec=184.88818359644048

config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=98328, MB/sec=136.581360345

config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=1024 filelen=67108864
answer=110480725, ms=201563, MB/sec=66.58847506734867
{code}

Win XP Pro SP2, laptop (Sun Java 1.5):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=47449, MB/sec=282.8673481000653

config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=54899, MB/sec=244.4811890926975

config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=71683, MB/sec=187.237877878995

config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=1024 filelen=67108864
answer=110480725, ms=149475, MB/sec=89.79275999330991
{code}

Linux 2.6.22.1 (Sun Java 1.5):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=41162, MB/sec=326.0719304212623

config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=53114, MB/sec=252.69745829724744

config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=40226, MB/sec=333.65914582608264

config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=1024 filelen=67108864
answer=110480725, ms=59163, MB/sec=226.86092321214272
{code}

Mac OS X 10.4 (Sun Java 1.5):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=85894, MB/sec=156.25972477705076

config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=109939, MB/sec=122.08381738964336

config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 
filelen=67108864
answer=110480725, ms=75517, MB/sec=177.73180608339845

config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=1024 filelen=67108864
answer=110480725, ms=130156, MB/sec=103.12066136021389
{code}
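
For reference, the ChannelPread variant above boils down to a positional
FileChannel read; here is a minimal sketch of the core call (this is not
Yonik's FileReadTest, just the mechanism being measured):

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Minimal sketch of an NIO positional ("pread"-style) read: the explicit
// offset means no thread ever moves the channel's shared file pointer,
// so concurrent readers need no synchronization on the underlying file.
public class PreadSketch {
  public static int readAt(FileChannel ch, byte[] dest, long offset)
      throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(dest);
    int total = 0;
    while (buf.hasRemaining()) {
      int n = ch.read(buf, offset + total);  // does not touch ch.position()
      if (n < 0) break;                      // end of file
      total += n;
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(args[0], "r");
    byte[] dest = new byte[1024];
    System.out.println("read " + readAt(raf.getChannel(), dest, 0L) + " bytes");
    raf.close();
  }
}
{code}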


> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1044) Behavior on hard power shutdown

2007-12-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1044:
---

Attachment: LUCENE-1044.take5.patch


Initial patch attached:

  * Created new commit() method; deprecated public flush() method

  * Changed IndexWriter to not write segments_N when flushing, only
when syncing (added new private sync() for this).  The current
"policy" is to sync only after merges are committed.  When
autoCommit=false we do not sync until close() or commit() is
called

  * Added MockRAMDirectory.crash() to simulate a machine crash.  It
keeps track of unsynced files, and then in crash() it goes and
corrupts any unsynced files rather aggressively.

  * Added a new unit test, TestCrash, to crash the MockRAMDirectory at
various interesting times & make sure we can still load the
resulting index.

  * Added new Directory.sync() method.  In FSDirectory.sync, if I hit
an IOException when opening or sync'ing, I retry (currently after
waiting 5 msec, and retrying up to 5 times).  If it still fails
after that, the original exception is thrown and the new
segments_N will not be written (and, the previous commit will also
not be deleted).
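
Roughly, that retry loop could look like the following (the constants and
class name come from the description above and are illustrative, not copied
from the patch):

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Rough illustration only: retry fsync a few times before giving up,
// as described in the last bullet above.
public class RetrySync {
  private static final int MAX_TRIES = 5;
  private static final long WAIT_MSEC = 5;

  public static void sync(File file) throws IOException {
    IOException last = null;
    for (int i = 0; i < MAX_TRIES; i++) {
      RandomAccessFile raf = null;
      try {
        raf = new RandomAccessFile(file, "rw");
        raf.getFD().sync();   // force the file's bytes to stable storage
        return;               // success
      } catch (IOException e) {
        last = e;             // possibly transient: wait briefly and retry
        try {
          Thread.sleep(WAIT_MSEC);
        } catch (InterruptedException ie) {
          throw new IOException("interrupted while retrying sync");
        }
      } finally {
        if (raf != null) {
          try { raf.close(); } catch (IOException ignore) {}
        }
      }
    }
    throw last;  // still failing: the new segments_N must not be written
  }
}
{code}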

All tests now pass, but there is still a lot to do, e.g. at least:

  * Javadocs

  * Refactor syncing code so DirectoryIndexReader.doCommit can use it
as well.

  * Change format of segments_N to include a hash of its contents, at
the end.  I think this is now necessary in case we crash after
writing segments_N but before we can sync it, to ensure that
whoever next opens the reader can detect corruption in this
segments_N file.
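
A small sketch of that trailing-checksum idea; CRC32 is an assumption here,
since the hash isn't pinned down above:

{code}
import java.util.zip.CRC32;

// Assumed scheme: the writer appends a checksum of the preceding bytes to
// segments_N; a reader recomputes it and rejects the file on mismatch.
public class SegmentsChecksum {
  public static long checksumOf(byte[] contents, int len) {
    CRC32 crc = new CRC32();
    crc.update(contents, 0, len);
    return crc.getValue();
  }

  /** True if the last 8 bytes (a big-endian long) match the payload's CRC. */
  public static boolean isValid(byte[] fileBytes) {
    int payloadLen = fileBytes.length - 8;
    if (payloadLen < 0) return false;   // too short to hold a checksum
    long stored = 0;
    for (int i = 0; i < 8; i++) {
      stored = (stored << 8) | (fileBytes[payloadLen + i] & 0xFF);
    }
    return stored == checksumOf(fileBytes, payloadLen);
  }
}
{code}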



> Behavior on hard power shutdown
> ---
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 
> 1.5
>Reporter: venkat rangan
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: FSyncPerfTest.java, LUCENE-1044.patch, 
> LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch, 
> LUCENE-1044.take5.patch
>
>
> When indexing a large number of documents, upon a hard power failure  (e.g. 
> pull the power cord), the index seems to get corrupted. We start a Java 
> application as a Windows Service, and feed it documents. In some cases 
> (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the 
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes 
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes 
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct. 
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our 
> customer deployments to 1.9 or later version, but would be happy to back-port 
> a patch, if the patch is small enough and if this problem is already solved.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1087) MultiSearcher.explain returns incorrect score/explanation relating to docFreq

2007-12-11 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1087:
-

Description: 
Creating 2 different indexes, searching each individually and printing score 
details, then comparing to searching both indexes with MultiSearcher and 
printing score details.

The "docFreq" value printed isn't correct - the values it prints are as if each 
index was searched individually.

Code is like:
{code}
MultiSearcher multi = new MultiSearcher(searchables);
Hits hits = multi.search(query);
for(int i=0; i<hits.length(); i++) {
  Explanation expl = multi.explain(query, hits.id(i));
  System.out.println(expl.toString());
}
{code}

I raised this in the Lucene user mailing list and was advised to log a bug, 
email thread given below.

{noformat}
-----Original Message-----
From: Chris Hostetter
Sent: Friday, December 07, 2007 10:30 PM
To: [EMAIL PROTECTED]
Subject: Re: does the MultiSearcher class calculate IDF properly?


a quick glance at the code seems to indicate that MultiSearcher has code 
for calculating the docFreq across all of the Searchables when searching 
(or when the docFreq method is explicitly called) but that explain method 
just delegates to the Searchable that the specific docid came from.

if you compare that Explanation score you got with the score returned by 
a HitCollector (or TopDocs) they probably won't match.

So i would say "yes MultiSearcher calculates IDF properly, but 
MultiSearcher.explain is broken."  Please file a bug about this, i can't 
think of an easy way to fix it, but it certainly seems broken to me.


: Subject: does the MultiSearcher class calculate IDF properly?
: 
: I tried the following.  Creating 2 different indexes, search each
: individually and print score details and compare to searching both
: indexes with MultiSearcher and printing score details.
: 
: The "docFreq" value printed doesn't seem right - is this just a problem
: with using Explain together with the MultiSearcher?
: 
: 
: Code is like:
: MultiSearcher multi = new MultiSearcher(searchables);
: Hits hits = multi.search(query);
: for(int i=0; i<hits.length(); i++)
: {
:   Explanation expl = multi.explain(query, hits.id(i));
:   System.out.println(expl.toString());
: }
: 
: Output:
: id = 14 score = 0.071
: 0.07073946 = (MATCH) fieldWeight(contents:climate in 2), product of:
:   1.0 = tf(termFreq(contents:climate)=1)
:   1.8109303 = idf(docFreq=1)
:   0.0390625 = fieldNorm(field=contents, doc=2)
{noformat}

> MultiSearcher.explain returns incorrect score/explanation relating to docFreq
> -
>
> Key: LUCENE-1087
> URL: https://issues.apache.org/jira/browse/LUCENE-1087
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.2
> Environment: No special hardware required to reproduce the issue.
>Reporter: Yasoja Seneviratne
>Priority: Minor
>
> Creating 2 different indexes, searching each individually and printing score 
> details, then comparing to searching both indexes with MultiSearcher and 
> printing score details.
>  
> The "docFreq" value printed isn't correct - the values it prints are as if 
> each index was searched individually.
> Code is like:
> {code}
> MultiSearcher multi = new MultiSearcher(searchables);
> Hits hits = multi.search(query);
> for(int i=0; i<hits.length(); i++) {
>   Explanation expl = multi.explain(query, hits.id(i));
>   System.out.println(expl.toString());
> }
> {code}
> I raised this in the Lucene user mailing list and was advised to log a bug, 
> email thread given below.
> {noformat} 
> -----Original Message-----
> From: Chris Hostetter
> Sent: Friday, December 07, 2007 10:30 PM
> To: java-user
> Subject: Re: does the MultiSearcher class calculate IDF properly?
> a quick glance at the code seems to indicate that MultiSearcher has code 
> for calculating the docFreq across all of the Searchables when searching 
> (or when the docFreq method is explicitly called) but that explain method 
> just delegates to the Searchable that the specific docid came from.
> if you compare that Explanation score you got with the score returned by 
> a HitCollector (or TopDocs) they probably won't match.
> So i would say "yes MultiSearcher calculates IDF properly, but 
> MultiSearcher.explain is broken."  Please file a bug about this, i can't 
> think of an easy way to fix it, but it certainly seems broken to me.
> : Subject: does the MultiSearcher class calculate IDF properly?
> : 
> : I tried the following.  Creating 2 different indexes, search each
> : individually and print score details and compare to searching both
> : indexes with MultiSearcher and printing score details.
> : 
> : The "docFreq" value printed doesn't seem right - is this just a problem
> : with using Explain together with the MultiSearcher?
> : 
> : 
> : Code is like:
> : MultiSearcher multi = new MultiSearcher(searchables);
> : Hits hits = multi.search(query);
> : for(int i=0; i<hits.length(); i++)
> : {
> :   Explanation expl = multi.explain(query, hits.id(i));
> :   System.out.println(expl.toString());
> : }
> : 
> : 
> : Output:
> : id = 14 score = 0.071
> : 0.07073946 = (MATCH) fieldWeight(contents:climate in 2), product of:
> :   1.0 = tf(termFreq(contents:climate)=1)
> :   1.8109303 = idf(docFreq=1)
> :   0.0390625 = fieldNorm(field=contents, doc=2)
> {noformat} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550701
 ] 

Michael McCandless commented on LUCENE-753:
---

Thanks!  I'll re-run.

{quote}
Well, at least we've learned that printing out all the input params for 
benchmarking programs is good practice :)
{quote}

Yes indeed :)

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-753:


Attachment: FileReadTest.java

OK, uploading latest version of the test that should fix ChannelTransfer (it's 
also slightly optimized to not create a new object per call).

Well, at least we've learned that printing out all the input params for 
benchmarking programs is good practice :-)

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550687
 ] 

Michael McCandless commented on LUCENE-753:
---

Doh!!  Woops :)  I will rerun...

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550685
 ] 

Yonik Seeley commented on LUCENE-753:
-

I'll try fixing the transferTo test before anyone re-runs any tests.

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550682
 ] 

Yonik Seeley commented on LUCENE-753:
-

Mike, it looks like you are running with a bufsize of 6.5MB!
Apologies for my hard-to-use benchmark program :-(

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method

2007-12-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1088:
--

Assignee: Michael McCandless

> PriorityQueue 'wouldBeInserted' method
> --
>
> Key: LUCENE-1088
> URL: https://issues.apache.org/jira/browse/LUCENE-1088
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Other
>Reporter: Peter Keegan
>Assignee: Michael McCandless
>
> This is a request for a new method in PriorityQueue
> public boolean wouldBeInserted(Object element)
> // returns true if doc would be inserted, without inserting 
> This would allow an application to prevent duplicate entries from being added 
> to the queue.
> Here is a reference to the discussion behind this request:
> http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method

2007-12-11 Thread Peter Keegan (JIRA)
PriorityQueue 'wouldBeInserted' method
--

 Key: LUCENE-1088
 URL: https://issues.apache.org/jira/browse/LUCENE-1088
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Peter Keegan


This is a request for a new method in PriorityQueue

public boolean wouldBeInserted(Object element)
// returns true if doc would be inserted, without inserting 

This would allow an application to prevent duplicate entries from being added 
to the queue.
Here is a reference to the discussion behind this request:

http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550677
 ] 

Michael McCandless commented on LUCENE-753:
---

I also just ran a test with 4 threads, random access, on Linux 2.6.22.1:

  config: impl=ClassicFile serial=false nThreads=4 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-195110, ms=120856, MB/sec=444.22363142913883

  config: impl=ChannelFile serial=false nThreads=4 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-195110, ms=88272, MB/sec=608.2006887801341

  config: impl=ChannelPread serial=false nThreads=4 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-195110, ms=77672, MB/sec=691.2026367288084

  config: impl=ChannelTransfer serial=false nThreads=4 iterations=200 
bufsize=6518936 filelen=67108864
  answer=594875, ms=38390, MB/sec=1398.465517061735

ChannelTransfer got even faster (it scales better as threads are added).
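
For reference, the ChannelTransfer variant maps to FileChannel.transferTo,
which can hand bytes straight to a sink channel without an intermediate
user-space copy; a hedged sketch of the core loop (not the actual
FileReadTest code):

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Sketch of a transferTo-based read: bytes flow from the file channel to a
// sink channel, potentially without being copied into a user-space buffer.
public class TransferSketch {
  public static long transferAll(FileChannel src, WritableByteChannel sink)
      throws IOException {
    long pos = 0;
    long size = src.size();
    while (pos < size) {
      // transferTo may move fewer bytes than requested, so loop until done
      pos += src.transferTo(pos, size - pos, sink);
    }
    return pos;
  }

  public static void main(String[] args) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(args[0], "r");
    // a trivial sink that just discards (and counts) whatever it is given
    WritableByteChannel devNull = new WritableByteChannel() {
      public int write(ByteBuffer b) {
        int n = b.remaining();
        b.position(b.limit());  // pretend everything was consumed
        return n;
      }
      public boolean isOpen() { return true; }
      public void close() {}
    };
    System.out.println("transferred " + transferAll(raf.getChannel(), devNull)
        + " bytes");
    raf.close();
  }
}
{code}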


> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550675
 ] 

Michael McCandless commented on LUCENE-753:
---

I ran Yonik's most recent FileReadTest.java on the platforms below,
testing single-threaded random access for a fully cached 64 MB file.

I tested two Windows XP Pro machines and got opposite results from
Yonik.  Yonik, is your machine XP Home?

I'm showing ChannelTransfer to be much faster on all platforms except
Windows Server 2003 R2 Enterprise x64, where it's about the same as
ChannelPread and ChannelFile.

The ChannelTransfer test is giving the wrong checksum, but I think that's
just a bug in how the checksum is computed (it's using "len", which with
ChannelTransfer is just the chunk size written on each call to
write).  So I think the MB/sec is still correct.

Mac OS X 10.4 (Sun java 1.5)
  config: impl=ClassicFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=32565, MB/sec=412.15331797942576

  config: impl=ChannelFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=19512, MB/sec=687.8727347273473

  config: impl=ChannelPread serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=19492, MB/sec=688.5785347835009

  config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=147783, ms=16009, MB/sec=838.3892060715847

Linux 2.6.22.1 (Sun java 1.5)
  config: impl=ClassicFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=37879, MB/sec=354.33281765622115

  config: impl=ChannelFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=21845, MB/sec=614.4093751430535

  config: impl=ChannelPread serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=21902, MB/sec=612.8103734818737

  config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=147783, ms=15978, MB/sec=840.015821754913

Windows Server 2003 R2 Enterprise x64 (Sun java 1.6)

  config: impl=ClassicFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=32703, MB/sec=410.4141149130049

  config: impl=ChannelFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=23344, MB/sec=574.9559972583961

  config: impl=ChannelPread serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=23329, MB/sec=575.3256804835183

  config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=147783, ms=23422, MB/sec=573.0412774314747

Windows XP Pro SP2, laptop (Sun Java 1.5)

  config: impl=ClassicFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=71253, MB/sec=188.36782731955148

  config: impl=ChannelFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=57463, MB/sec=233.57243443607192

  config: impl=ChannelPread serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=58043, MB/sec=231.23844046655068

  config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=147783, ms=20039, MB/sec=669.7825640001995

Windows XP Pro SP2, older desktop (Sun Java 1.6)

  config: impl=ClassicFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=53047, MB/sec=253.01662299470283

  config: impl=ChannelFile serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=34047, MB/sec=394.2130819161747

  config: impl=ChannelPread serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=-44611, ms=34078, MB/sec=393.8544750278772

  config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 
bufsize=6518936 filelen=67108864
  answer=147783, ms=18781, MB/sec=714.6463340610192


> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Yonik Seeley
On Dec 11, 2007 1:21 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote:
> On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
> > For (1) - I can't explain it but I've run into documents with 0.0f scores.
> > For (2) - this is a simple logic - if the lowest score in the queue is 'x'
> > and you want to top docs only, then there's no point in attempting to
> > insert a document with score lower than 'x' (it will not be added).
>
> Sure. I didn't notice that score is passed as parameter and was surprised that
> subsequent calls to collect() are supposed to be guaranteed to have a lower
> score.

One is not guaranteed this... collect() generally goes in docid order,
and scores are unordered.

If you are only gathering the top 10 docs by score, you can compare
the current score to the lowest of the top 10 you currently have to
determine if you should bother inserting into the queue.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Timo Nentwig
On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
> For (1) - I can't explain it but I've run into documents with 0.0f scores.
> For (2) - this is a simple logic - if the lowest score in the queue is 'x'
> and you want to top docs only, then there's no point in attempting to
> insert a document with score lower than 'x' (it will not be added).

Sure. I didn't notice that score is passed as parameter and was surprised that 
subsequent calls to collect() are supposed to be guaranteed to have a lower 
score.

Ok, stupid question :)

> Maybe I didn't understand your question correctly though ...
>
> On Dec 11, 2007 2:25 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote:
> > On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> > > The current TopDocCollector only allocates a ScoreDoc when the given
> > > score causes a new ScoreDoc to be added into the queue, but it does
> >
> > I actually wrote my own HitCollector and now wonder about
> > TopDocCollector:
> >
> >  public void collect(int doc, float score) {
> >    if (score > 0.0f) {
> >      totalHits++;
> >      if (hq.size() < numHits || score >= minScore) {
> >        hq.insert(new ScoreDoc(doc, score));
> >        minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
> >      }
> >    }
> >  }
> >
> > 1) How can there be hits with score=0.0?
> > 2) I don't understand minScore: it inserts only documents having a higher
> > score than the lowest score already in the queue?
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Michael McCandless


Shai Erera wrote:


Hi,
I will open an issue and create the patch. One thing I'm not sure of is the
wouldBeInserted method you mentioned - in what context should it be used?

And ... lessThan shouldn't be public, it can stay protected.


Sorry, this is a method Peter suggested (see below) in order to add
de-duping logic on top of the PQ.

We should do it separately (it's unrelated).

Peter can you open an issue for that one?  Thanks!

Mike


On 12/11/07, Michael McCandless <[EMAIL PROTECTED]> wrote:


I think we can't make lessThan public since that would cause
subclasses to fail to compile (ie this breaks backwards compatibility)?

Adding "wouldBeInserted()" seems OK?

Mike

Peter Keegan wrote:


See my similar request from last March:
http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

Peter

On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote:


On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
Search using PriorityQueue":

Hi

Lucene's PQ implements two methods: put (assumes the PQ has room for the
object) and insert (checks whether the object can be inserted etc.). The
implementation of insert() requires the application that uses it to allocate
a new object every time it calls insert. Specifically, it cannot reuse the
objects that were removed from the PQ.


I've read this entire thread, and would like to add my comments about three
independent issues, which I think can and perhaps should be considered
separately:

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue
   he couldn't just subclass PriorityQueue in his application, but rather
   was forced to modify PriorityQueue itself. Why? Just because one field
   of that class - "heap" - was defined "private" instead of "protected".

   Is there a special reason for that? If not, can we make the trivial change
   to make PriorityQueue's fields protected, to allow Shai and others (see the
   next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside
   Lucene, is also a public class that advanced users often find useful when
   implementing features like new collectors, new queries, and so on.

   Unfortunately, my experience exactly matches Shai's: In the two occasions
   where I used a PriorityQueue, I found that I needed such an
   insertWithOverflow() method. If this feature is so useful (I can't believe
   that Shai and I are the only ones who found it useful), I think it would
   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
   inside Lucene.

   Just to make it sound more interesting, let me give you an example where
   I needed (and implemented) an "insertWithOverflow()": I was implementing a
   faceted search capability over Lucene. It calculated a count for each
   facet value, and then I used a PriorityQueue to find the 10 best values.
   The problem is that I also needed an "other" aggregator, which was supposed
   to aggregate (in various ways) all the facets except the 10 best ones. For
   that, I needed to know which facets dropped off the priority queue.

3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
   to be used in TopDocCollector. I have to admit I don't know how much
   of a benefit this will be in the "typical" case. But I do know that
   there's no such thing as a "typical" case...
   I can easily think of a quite typical "worst case" though: Consider a
   collection indexed in order of document age (a pretty typical scenario
   for a long-running index), and then you do a sorting query
   (TopFieldDocCollector), asking it to bring the 10 newest documents.
   In that case, each and every document will have a new DocScore created -
   which is the worst-case that Shai feared.
   It would be nice if Shai or someone else could provide a measurement in
   that case.

P.S. When looking now at PriorityQueue's code, I found two tiny
performance improvements that could be easily made to it - I wonder if
there's any reason not to do them:

 1. Insert can use heap[1] directly instead of calling top(). After all,
    this is done in an if() that already ensures that size>0.

 2. Regardless, top() could return heap[1] always, without any if(). After
    all, the heap array is initialized to all nulls, so when size==0, heap[1]
    is null anyway.



PriorityQueue change (added insertWithOverflow method)
------------------------------------------------------

/**
 * insertWithOverflow() is similar to insert(), except its return value:
 * it returns the object (if any) that was dropped off the heap because it
 * was full. This can be the given parameter (in case it is smaller than
 * the full heap's minimum, and couldn't be added) or another object that
 * was previously the smallest value in the heap and now has been replaced
 * by a larger one.
 */
 * large

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Shai Erera
Hi,
I will open an issue and create the patch. One thing I'm not sure of is the
wouldBeInserted method you mentioned - in what context should it be used?
And ... lessThan shouldn't be public, it can stay protected.


On 12/11/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
>
> I think we can't make lessThan public since that would cause
> subclasses to fail to compile (ie this breaks backwards compatibility)?
>
> Adding "wouldBeInserted()" seems OK?
>
> Mike
>
> Peter Keegan wrote:
>
> > See my similar request from last March:
> > http://www.nabble.com/FieldSortedHitQueue-enhancement-
> > to9733550.html#a9733550
> >
> > Peter
> >
> > On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]>
> > wrote:
> >
> >> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance
> >> Improvement for
> >> Search using PriorityQueue":
> >>> Hi
> >>>
> >>> Lucene's PQ implements two methods: put (assumes the PQ has room
> >>> for the
> >>> object) and insert (checks whether the object can be inserted
> >>> etc.). The
> >>> implementation of insert() requires the application that uses it to
> >> allocate
> >>> a new object every time it calls insert. Specifically, it cannot
> >>> reuse
> >> the
> >>> objects that were removed from the PQ.
> >>
> >> I've read this entire thread, and would like to add my comments about
> >> three
> >> independent issues, which I think can and perhaps should be
> >> considered
> >> separately:
> >>
> >> 1. When Shai wanted to add the insertWithOverflow() method to
> >> PriorityQueue
> >>   he couldn't just subclass PriorityQueue in his application, but
> >> rather
> >>   was forced to modify PriorityQueue itself. Why? Just because one
> >> field
> >>   of that class - "heap" - was defined "private" instead of
> >> "protected".
> >>
> >>   Is there a special reason for that? If not, can we make the trivial
> >> change
> >>   to make PriorityQueue's fields protected, to allow Shai and
> >> others (see
> >> the
> >>   next point) to add functionality on top of PriorityQueue?
> >>
> >> 2. PriorityQueue, in addition to being used in about a dozen
> >> places inside
> >>   Lucene, is also a public class that advanced users often find
> >> useful
> >> when
> >>   implementing features like new collectors, new queries, and so on.
> >>   Unfortunately, my experience exactly matches Shai's: In the two
> >> occasions
> >>   where I used a PriorityQueue, I found that I needed such an
> >>   insertWithOverflow() method. If this feature is so useful (I can't
> >> believe
> >>   that Shai and I are the only ones who found it useful), I think it
> >> would
> >>   be nice to add it to Lucene's PriorityQueue, even if it isn't
> >> (yet) used
> >>   inside Lucene.
> >>
> >>   Just to make it sound more interesting, let me give you an
> >> example where
> >>   I needed (and implemented) an "insertWithOverflow()": I was
> >> implementing
> >> a
> >>   faceted search capability over Lucene. It calculated a count for
> >> each
> >>   facet value, and then I used a PriorityQueue to find the 10 best
> >> values.
> >>   The problem is that I also needed an "other" aggregator, which was
> >> supposed
> >>   to aggregate (in various ways) all the facets except the 10 best
> >> ones.
> >> For
> >>   that, I needed to know which facets dropped off the priority queue.
> >>
> >> 3. Finally, Shai asked for this new
> >> PriorityQueue.insertWithOverflow()
> >>   to be used in TopDocCollector. I have to admit I don't know how
> >> much
> >>   of a benefit this will be in the "typical" case. But I do know that
> >>   there's no such thing as a "typical" case...
> >>   I can easily think of a quite typical "worst case" though:
> >> Consider a
> >>   collection indexed in order of document age (a pretty typical
> >> scenario
> >>   for a long-running index), and then you do a sorting query
> >>   (TopFieldDocCollector), asking it to bring the 10 newest documents.
> >>   In that case, each and every document will have a new DocScore
> >> created -
> >>   which is the worst-case that Shai feared.
> >>   It would be nice if Shai or someone else could provide a
> >> measurement in
> >>   that case.
> >>
> >> P.S. When looking now at PriorityQueue's code, I found two tiny
> >> performance improvements that could be easily made to it - I
> >> wonder if
> >> there's any reason not to do them:
> >>
> >>  1. Insert can use heap[1] directly instead of calling top().
> >> After all,
> >>this is done in an if() that already ensures that size>0.
> >>
> >>  2. Regardless, top() could return heap[1] always, without any if
> >> (). After
> >>all, the heap array is initialized to all nulls, so when size==0,
> >> heap[1]
> >>is null anyway.
> >>
> >>
> >>
> >>> PriorityQueue change (added insertWithOverflow method)
> >>> ------------------------------------------------------
> >>> /**
> >>>  * insertWithOverflow() is similar to insert(), except its return value:

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Michael McCandless


I think we can't make lessThan public since that would cause
subclasses to fail to compile (ie this breaks backwards compatibility)?

Adding "wouldBeInserted()" seems OK?

Mike

Peter Keegan wrote:


See my similar request from last March:
http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

Peter

On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote:


On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
Search using PriorityQueue":

Hi

Lucene's PQ implements two methods: put (assumes the PQ has room for the
object) and insert (checks whether the object can be inserted etc.). The
implementation of insert() requires the application that uses it to allocate
a new object every time it calls insert. Specifically, it cannot reuse the
objects that were removed from the PQ.


I've read this entire thread, and would like to add my comments about three
independent issues, which I think can and perhaps should be considered
separately:

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue
   he couldn't just subclass PriorityQueue in his application, but rather
   was forced to modify PriorityQueue itself. Why? Just because one field
   of that class - "heap" - was defined "private" instead of "protected".

   Is there a special reason for that? If not, can we make the trivial change
   to make PriorityQueue's fields protected, to allow Shai and others (see the
   next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside
   Lucene, is also a public class that advanced users often find useful when
   implementing features like new collectors, new queries, and so on.
   Unfortunately, my experience exactly matches Shai's: In the two occasions
   where I used a PriorityQueue, I found that I needed such an
   insertWithOverflow() method. If this feature is so useful (I can't believe
   that Shai and I are the only ones who found it useful), I think it would
   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
   inside Lucene.

   Just to make it sound more interesting, let me give you an example where
   I needed (and implemented) an "insertWithOverflow()": I was implementing a
   faceted search capability over Lucene. It calculated a count for each
   facet value, and then I used a PriorityQueue to find the 10 best values.
   The problem is that I also needed an "other" aggregator, which was supposed
   to aggregate (in various ways) all the facets except the 10 best ones. For
   that, I needed to know which facets dropped off the priority queue.

3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
   to be used in TopDocCollector. I have to admit I don't know how much
   of a benefit this will be in the "typical" case. But I do know that
   there's no such thing as a "typical" case...
   I can easily think of a quite typical "worst case" though: Consider a
   collection indexed in order of document age (a pretty typical scenario
   for a long-running index), and then you do a sorting query
   (TopFieldDocCollector), asking it to bring the 10 newest documents.
   In that case, each and every document will have a new DocScore created -
   which is the worst-case that Shai feared.
   It would be nice if Shai or someone else could provide a measurement in
   that case.

P.S. When looking now at PriorityQueue's code, I found two tiny
performance improvements that could be easily made to it - I wonder if
there's any reason not to do them:

 1. Insert can use heap[1] directly instead of calling top(). After all,
    this is done in an if() that already ensures that size>0.

 2. Regardless, top() could return heap[1] always, without any if(). After
    all, the heap array is initialized to all nulls, so when size==0, heap[1]
    is null anyway.



PriorityQueue change (added insertWithOverflow method)
------------------------------------------------------

/**
 * insertWithOverflow() is similar to insert(), except its return value:
 * it returns the object (if any) that was dropped off the heap because it
 * was full. This can be the given parameter (in case it is smaller than
 * the full heap's minimum, and couldn't be added) or another object that
 * was previously the smallest value in the heap and now has been replaced
 * by a larger one.
 */
public Object insertWithOverflow(Object element) {
    if (size < maxSize) {
        put(element);
        return null;
    } else if (size > 0 && !lessThan(element, top())) {
        Object ret = heap[1];
        heap[1] = element;
        adjustTop();
        return ret;
    } else {
        return element;
    }
}

[Very similar to insert(), only it returns the object that was kicked out of
the Queue,

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Michael McCandless


I agree that even though we don't see gains on the queries tested,
there are in theory cases where there could be a great many
allocations that would be saved.

I think we should do Shai's suggested option 1 (add the method and
change TDC to call it), change heap to be protected not private, plus
the 2 tiny performance gains Nadav suggests below?  Shai can you open
a Jira issue & attach a patch for these changes?  Thanks!

Mike

Nadav Har'El wrote:

On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
Search using PriorityQueue":

Hi

Lucene's PQ implements two methods: put (assumes the PQ has room for the
object) and insert (checks whether the object can be inserted etc.). The
implementation of insert() requires the application that uses it to allocate
a new object every time it calls insert. Specifically, it cannot reuse the
objects that were removed from the PQ.


I've read this entire thread, and would like to add my comments about three
independent issues, which I think can and perhaps should be considered
separately:

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue
   he couldn't just subclass PriorityQueue in his application, but rather
   was forced to modify PriorityQueue itself. Why? Just because one field
   of that class - "heap" - was defined "private" instead of "protected".

   Is there a special reason for that? If not, can we make the trivial change
   to make PriorityQueue's fields protected, to allow Shai and others (see the
   next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside
   Lucene, is also a public class that advanced users often find useful when
   implementing features like new collectors, new queries, and so on.
   Unfortunately, my experience exactly matches Shai's: In the two occasions
   where I used a PriorityQueue, I found that I needed such an
   insertWithOverflow() method. If this feature is so useful (I can't believe
   that Shai and I are the only ones who found it useful), I think it would
   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
   inside Lucene.

   Just to make it sound more interesting, let me give you an example where
   I needed (and implemented) an "insertWithOverflow()": I was implementing a
   faceted search capability over Lucene. It calculated a count for each
   facet value, and then I used a PriorityQueue to find the 10 best values.
   The problem is that I also needed an "other" aggregator, which was supposed
   to aggregate (in various ways) all the facets except the 10 best ones. For
   that, I needed to know which facets dropped off the priority queue.

3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
   to be used in TopDocCollector. I have to admit I don't know how much
   of a benefit this will be in the "typical" case. But I do know that
   there's no such thing as a "typical" case...
   I can easily think of a quite typical "worst case" though: Consider a
   collection indexed in order of document age (a pretty typical scenario
   for a long-running index), and then you do a sorting query
   (TopFieldDocCollector), asking it to bring the 10 newest documents.
   In that case, each and every document will have a new DocScore created -
   which is the worst-case that Shai feared.
   It would be nice if Shai or someone else could provide a measurement in
   that case.

P.S. When looking now at PriorityQueue's code, I found two tiny
performance improvements that could be easily made to it - I wonder if
there's any reason not to do them:

 1. Insert can use heap[1] directly instead of calling top(). After all,
    this is done in an if() that already ensures that size>0.

 2. Regardless, top() could return heap[1] always, without any if(). After
    all, the heap array is initialized to all nulls, so when size==0, heap[1]
    is null anyway.




PriorityQueue change (added insertWithOverflow method)
- 
--

/**
 * insertWithOverflow() is similar to insert(), except its  
return value:

it
 * returns the object (if any) that was dropped off the heap  
because it

was
 * full. This can be the given parameter (in case it is  
smaller than the
 * full heap's minimum, and couldn't be added) or another  
object that

was
 * previously the smallest value in the heap and now has been  
replaced

by a
 * larger one.
 */
public Object insertWithOverflow(Object element) {
if (size < maxSize) {
put(element);
return null;
} else if (size > 0 && !lessThan(element, top())) {
Object ret = heap[1];
heap[1] = element;
adjustTop();
return ret;
} else {
return element;

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Peter Keegan
See my similar request from last March:
http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

Peter

On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote:

> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
> Search using PriorityQueue":
> > Hi
> >
> > Lucene's PQ implements two methods: put (assumes the PQ has room for the
> > object) and insert (checks whether the object can be inserted etc.). The
> > implementation of insert() requires the application that uses it to
> allocate
> > a new object every time it calls insert. Specifically, it cannot reuse
> the
> > objects that were removed from the PQ.
>
> I've read this entire thread, and would like to add my comments about
> three
> independent issues, which I think that can and perhaps should be
> considered
> separately:
>
> 1. When Shai wanted to add the insertWithOverflow() method to
> PriorityQueue
>   he couldn't just subclass PriorityQueue in his application, but rather
>   was forced to modify PriorityQueue itself. Why? Just because one field
>   of that class - "heap" - was defined "private" instead of "protected".
>
>   Is there a special reason for that? If not, can we make the trivial
> change
>   to make PriorityQueue's fields protected, to allow Shai and others (see
> the
>   next point) to add functionality on top of PriorityQueue?
>
> 2. PriorityQueue, in addition to being used in about a dozen places inside
>   Lucene, is also a public class that advanced users often find useful
> when
>   implementing features like new collectors, new queries, and so on.
>   Unfortunately, my experience exactly matches Shai's: In the two
> occasions
>   where I used a PriorityQueue, I found that I needed such a
>   insertWithOverflow() method. If this feature is so useful (I can't
> believe
>   that Shai and I are the only ones who found it useful), I think it
> would
>   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
>   inside Lucene.
>
>   Just to make it sound more interesting, let me give you an example where
>   I needed (and implemented) an "insertWithOverflow()": I was implementing
> a
>   faceted search capability over Lucene. It calculated a count for each
>   facet value, and then I used a PriorityQueue to find the 10 best values.
>   The problem is that I also needed an "other" aggregator, which was
> supposed
>   to aggregate (in various ways) all the facets except the 10 best ones.
> For
>   that, I needed to know which facets dropped off the PriorityQueue.
>
> 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
>   to be used in TopDocCollector. I have to admit I don't know how much
>   of a benefit this will be in the "typical" case. But I do know that
>   there's no such thing as a "typical" case...
>   I can easily think of a quite typical "worst case" though: Consider a
>   collection indexed in order of document age (a pretty typical scenario
>   for a long-running index), and then you do a sorting query
>   (TopFieldDocCollector), asking it to bring the 10 newest documents.
>   In that case, each and every document will have a new DocScore created -
>   which is the worst-case that Shai feared.
>   It would be nice if Shai or someone else could provide a measurement in
>   that case.
>
> P.S. When looking now at PriorityQueue's code, I found two tiny
> performance improvements that could be easily made to it - I wonder if
> there's any reason not to do them:
>
>  1. Insert can use heap[1] directly instead of calling top(). After all,
>     this is done in an if() that already ensures that size>0.
>
>  2. Regardless, top() could return heap[1] always, without any if(). After
>     all, the heap array is initialized to all nulls, so when size==0,
>     heap[1] is null anyway.
>
>
>
> > PriorityQueue change (added insertWithOverflow method)
> >
> ---
> > /**
> >  * insertWithOverflow() is similar to insert(), except its return value: it
> >  * returns the object (if any) that was dropped off the heap because it was
> >  * full. This can be the given parameter (in case it is smaller than the
> >  * full heap's minimum, and couldn't be added) or another object that was
> >  * previously the smallest value in the heap and now has been replaced by a
> >  * larger one.
> >  */
> > public Object insertWithOverflow(Object element) {
> > if (size < maxSize) {
> > put(element);
> > return null;
> > } else if (size > 0 && !lessThan(element, top())) {
> > Object ret = heap[1];
> > heap[1] = element;
> > adjustTop();
> > return ret;
> > } else {
> > return element;
> > }
> > }
> > [Very similar to insert(), only it returns the object that was kicked out
> > of the Queue, or null]
>

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Nadav Har'El
On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for 
Search using PriorityQueue":
> Hi
> 
> Lucene's PQ implements two methods: put (assumes the PQ has room for the
> object) and insert (checks whether the object can be inserted etc.). The
> implementation of insert() requires the application that uses it to allocate
> a new object every time it calls insert. Specifically, it cannot reuse the
> objects that were removed from the PQ.

I've read this entire thread, and would like to add my comments about three
independent issues, which I think that can and perhaps should be considered
separately:

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue
   he couldn't just subclass PriorityQueue in his application, but rather
   was forced to modify PriorityQueue itself. Why? Just because one field
   of that class - "heap" - was defined "private" instead of "protected".

   Is there a special reason for that? If not, can we make the trivial change
   to make PriorityQueue's fields protected, to allow Shai and others (see the
   next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside
   Lucene, is also a public class that advanced users often find useful when
   implementing features like new collectors, new queries, and so on.
   Unfortunately, my experience exactly matches Shai's: In the two occasions
   where I used a PriorityQueue, I found that I needed such a 
   insertWithOverflow() method. If this feature is so useful (I can't believe
   that Shai and I are the only ones who found it useful), I think it would
   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
   inside Lucene.

   Just to make it sound more interesting, let me give you an example where
   I needed (and implemented) an "insertWithOverflow()": I was implementing a
   faceted search capability over Lucene. It calculated a count for each
   facet value, and then I used a PriorityQueue to find the 10 best values.
   The problem is that I also needed an "other" aggregator, which was supposed
   to aggregate (in various ways) all the facets except the 10 best ones. For
   that, I needed to know which facets dropped off the PriorityQueue.

3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
   to be used in TopDocCollector. I have to admit I don't know how much
   of a benefit this will be in the "typical" case. But I do know that
   there's no such thing as a "typical" case...
   I can easily think of a quite typical "worst case" though: Consider a
   collection indexed in order of document age (a pretty typical scenario
   for a long-running index), and then you do a sorting query
   (TopFieldDocCollector), asking it to bring the 10 newest documents.
   In that case, each and every document will have a new DocScore created -
   which is the worst-case that Shai feared.
   It would be nice if Shai or someone else could provide a measurement in
   that case.

P.S. When looking now at PriorityQueue's code, I found two tiny
performance improvements that could be easily made to it - I wonder if
there's any reason not to do them:

 1. Insert can use heap[1] directly instead of calling top(). After all,
    this is done in an if() that already ensures that size>0.

 2. Regardless, top() could return heap[1] always, without any if(). After
    all, the heap array is initialized to all nulls, so when size==0, heap[1]
    is null anyway. (Both changes are sketched below.)
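
A minimal sketch of both changes, assuming the 1-based "heap" array, the
"size" field and the boolean-returning insert() of this era's PriorityQueue
(an illustration, not a committed patch):

  // top() can return heap[1] unconditionally: the array is allocated with
  // all-null slots, so heap[1] is already null while size == 0.
  public final Object top() {
    return heap[1];
  }

  // insert() can read heap[1] directly in the branch that has already
  // established size > 0, instead of going through top().
  public boolean insert(Object element) {
    if (size < maxSize) {
      put(element);
      return true;
    } else if (size > 0 && !lessThan(element, heap[1])) {
      heap[1] = element;   // replace the minimum directly
      adjustTop();
      return true;
    } else {
      return false;
    }
  }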



> PriorityQueue change (added insertWithOverflow method)
> ---
> /**
>  * insertWithOverflow() is similar to insert(), except its return value: it
>  * returns the object (if any) that was dropped off the heap because it was
>  * full. This can be the given parameter (in case it is smaller than the
>  * full heap's minimum, and couldn't be added) or another object that was
>  * previously the smallest value in the heap and now has been replaced by a
>  * larger one.
>  */
> public Object insertWithOverflow(Object element) {
> if (size < maxSize) {
> put(element);
> return null;
> } else if (size > 0 && !lessThan(element, top())) {
> Object ret = heap[1];
> heap[1] = element;
> adjustTop();
> return ret;
> } else {
> return element;
> }
> }
> [Very similar to insert(), only it returns the object that was kicked out of
> the Queue, or null]
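
As a usage illustration for point 2 above, a hypothetical "other" aggregator
could consume the overflow like this (FacetCountQueue, facetCounts and
OtherAggregator are invented names, not code from this thread):

  PriorityQueue pq = new FacetCountQueue(10);    // subclass defining lessThan()
  OtherAggregator other = new OtherAggregator(); // sums everything outside the top 10
  for (int i = 0; i < facetCounts.length; i++) {
    Object dropped = pq.insertWithOverflow(facetCounts[i]);
    if (dropped != null) {
      // Whatever fell off the queue (or never got in) can no longer be
      // among the 10 best, so it belongs to the "other" bucket.
      other.add(dropped);
    }
  }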

-- 
Nadav Har'El|   Tuesday, Dec 11 2007, 3 Tevet 5768
IBM Haifa Research Lab  |-
|A professor is one who talks in someone
http://nadav.harel.org.il   |else's sleep.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Shai Erera
For (1) - I can't explain it, but I have run into documents with 0.0f scores.
For (2) - the logic is simple: if the lowest score in the queue is 'x' and you
want the top docs only, then there is no point in attempting to insert a
document whose score is lower than 'x' (it will not be added).
Maybe I didn't understand your question correctly though ...
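
To spell the invariant out (an annotated restatement of the collect() code
quoted below, not new code):

  // Once the queue holds numHits entries, minScore equals the score of
  // hq.top(), i.e. the smallest score still kept. A candidate below that
  // bound would be rejected by the queue anyway, so the minScore comparison
  // is a cheap pre-check that avoids allocating a ScoreDoc that cannot
  // survive:
  if (hq.size() < numHits        // queue not full yet: always insert
      || score >= minScore) {    // full: only candidates that can displace top()
    hq.insert(new ScoreDoc(doc, score));
    minScore = ((ScoreDoc) hq.top()).score; // re-read the new minimum
  }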

On Dec 11, 2007 2:25 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote:

> On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> > The current TopDocCollector only allocates a ScoreDoc when the given
> > score causes a new ScoreDoc to be added into the queue, but it does
>
> I actually wrote my own HitCollector and now wonder about TopDocCollector:
>
>   public void collect(int doc, float score) {
>     if (score > 0.0f) {
>       totalHits++;
>       if (hq.size() < numHits || score >= minScore) {
>         hq.insert(new ScoreDoc(doc, score));
>         minScore = ((ScoreDoc) hq.top()).score; // maintain minScore
>       }
>     }
>   }
>
> 1) How can there be hits with score=0.0?
> 2) I don't understand minScore: inserts only document having a higher
> score
> than the lowest score already in queue?
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Regards,

Shai Erera


Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-12-11 Thread Shai Erera
Hi

I attached two patch files (for "java" and "test"). Due to a problem in my
checkout project in Eclipse, I don't have them under "src".
I also added a test and modified two tests in TestStandardAnalyzer.

On Dec 10, 2007 11:44 PM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:

>
>[
> https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202]
>
> Grant Ingersoll commented on LUCENE-1068:
> -
>
> Hmmm, maybe there is a way in Eclipse to make the path relative to the
> working directory?  Otherwise, from the command line in the Lucene
> directory:  svn diff > StandardTokenizer-4.patch
>
> -Grant
>
>
>
> --
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
> > Invalid behavior of StandardTokenizerImpl
> > -
> >
> > Key: LUCENE-1068
> > URL: https://issues.apache.org/jira/browse/LUCENE-1068
> > Project: Lucene - Java
> >  Issue Type: Bug
> >  Components: Analysis
> >Reporter: Shai Erera
> >Assignee: Grant Ingersoll
> > Attachments: StandardTokenizerImpl-2.patch,
> StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch,
> standardTokenizerImpl.patch
> >
> >
> > The following code prints the output of StandardAnalyzer:
> > Analyzer analyzer = new StandardAnalyzer();
> > TokenStream ts = analyzer.tokenStream("content", new
> StringReader(""));
> > Token t;
> > while ((t = ts.next()) != null) {
> > System.out.println(t);
> > }
> > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
> > (which is correct in my opinion).
> > However, if you pass "www.abc.com." (notice the extra '.' at the end),
> > the output is (wwwabccom,0,12,type=<ACRONYM>).
> > I think the behavior in the second case is incorrect for several
> reasons:
> > 1. It recognizes the string incorrectly (no argument on that).
> > 2. It kind of prevents you from putting URLs at the end of a sentence,
> which is perfectly legal.
> > 3. An ACRONYM, at least to the best of my understanding, is of the form
> A.B.C. and not ABC.DEF.
> > I looked at StandardTokenizerImpl.jflex and I think the problem comes
> from this definition:
> > // acronyms: U.S.A., I.B.M., etc.
> > // use a post-filter to remove dots
> > ACRONYM=  {ALPHA} "." ({ALPHA} ".")+
> > Notice how the comment relates to acronym as U.S.A., I.B.M. and not
> something else. I changed the definition to
> > ACRONYM=  {LETTER} "." ({LETTER} ".")+
> > and it solved the problem.
> > This was also reported here:
> >
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> >
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Regards,

Shai Erera


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-12-11 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1068:
---

Attachment: StandardTokenizer-test-4.patch
StandardTokenizer-java-4.patch

Code files under the java and test packages. These should be applied under "src"

> Invalid behavior of StandardTokenizerImpl
> -
>
> Key: LUCENE-1068
> URL: https://issues.apache.org/jira/browse/LUCENE-1068
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Shai Erera
>Assignee: Grant Ingersoll
> Attachments: StandardTokenizer-java-4.patch, 
> StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch, 
> StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch, 
> standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
> Analyzer analyzer = new StandardAnalyzer();
> TokenStream ts = analyzer.tokenStream("content", new 
> StringReader(""));
> Token t;
> while ((t = ts.next()) != null) {
> System.out.println(t);
> }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) 
> (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the 
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument on that).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which 
> is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form 
> A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from 
> this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM=  {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment relates to acronym as U.S.A., I.B.M. and not something 
> else. I changed the definition to
> ACRONYM=  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
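
For anyone who wants to reproduce the report, a self-contained sketch built
on the analysis API quoted above (TokenizerDemo is an invented name; the
inputs come straight from the description):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class TokenizerDemo {
    public static void main(String[] args) throws Exception {
      Analyzer analyzer = new StandardAnalyzer();
      // Both inputs from the report: with and without the trailing dot.
      String[] inputs = { "www.abc.com", "www.abc.com." };
      for (int i = 0; i < inputs.length; i++) {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(inputs[i]));
        Token t;
        while ((t = ts.next()) != null) {
          System.out.println(inputs[i] + " -> " + t);
        }
      }
    }
  }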

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Caching FuzzyQuery

2007-12-11 Thread Timo Nentwig
Hi!

Actually, FuzzyQuery.rewrite() is pretty expensive, so why not introduce a 
caching decorator? A WeakHashMap with key==IndexReader and value==an LRU of 
rewritten BooleanQueries.
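
A minimal sketch of that idea, assuming the Query.rewrite(IndexReader)
contract of this era and that FuzzyQuery has usable equals()/hashCode();
FuzzyRewriteCache and LRU_SIZE are invented names:

  import java.io.IOException;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Query;

  public class FuzzyRewriteCache {
    private static final int LRU_SIZE = 128; // arbitrary bound

    // WeakHashMap: cached entries go away together with their IndexReader.
    private final Map readerCaches = new WeakHashMap();

    public synchronized Query rewrite(IndexReader reader, Query fuzzy)
        throws IOException {
      Map lru = (Map) readerCaches.get(reader);
      if (lru == null) {
        // An access-ordered LinkedHashMap doubles as a simple LRU.
        lru = new LinkedHashMap(16, 0.75f, true) {
          protected boolean removeEldestEntry(Map.Entry eldest) {
            return size() > LRU_SIZE;
          }
        };
        readerCaches.put(reader, lru);
      }
      Query rewritten = (Query) lru.get(fuzzy);
      if (rewritten == null) {
        rewritten = fuzzy.rewrite(reader); // the expensive step
        lru.put(fuzzy, rewritten);
      }
      return rewritten;
    }
  }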

Timo

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Timo Nentwig
On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> The current TopDocCollector only allocates a ScoreDoc when the given
> score causes a new ScoreDoc to be added into the queue, but it does

I actually wrote my own HitCollector and now wonder about TopDocCollector:

  public void collect(int doc, float score) {
if (score > 0.0f) {
  totalHits++;
  if (hq.size() < numHits || score >= minScore) {
hq.insert(new ScoreDoc(doc, score));
minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
  }
}
  }

1) How can there be hits with score=0.0?
2) I don't understand minScore: inserts only document having a higher score 
than the lowest score already in queue?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1081) Remove the "Experimental" warnings from search.function package

2007-12-11 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550419
 ] 

Doron Cohen commented on LUCENE-1081:
-

{quote}
I think we should resolve LUCENE-1085 first and move this to 2.4?
{quote}
Done.

> Remove the "Experimental" warnings from search.function package
> ---
>
> Key: LUCENE-1081
> URL: https://issues.apache.org/jira/browse/LUCENE-1081
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 2.4
>
>
> I am using this package for a while, seems that others in this list use it 
> too, so let's remove those warnings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1081) Remove the "Experimental" warnings from search.function package

2007-12-11 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1081:


Fix Version/s: (was: 2.3)
   2.4
Affects Version/s: (was: 2.3)
   2.4

> Remove the "Experimental" warnings from search.function package
> ---
>
> Key: LUCENE-1081
> URL: https://issues.apache.org/jira/browse/LUCENE-1081
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 2.4
>
>
> I am using this package for a while, seems that others in this list use it 
> too, so let's remove those warnings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Reopened: (LUCENE-944) Remove deprecated methods in BooleanQuery

2007-12-11 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reopened LUCENE-944:
--

Lucene Fields: [Patch Available]  (was: [Patch Available, New])

You are right, Grant.

I will revert this for 2.3. Thanks for catching this!!

> Remove deprecated methods in BooleanQuery
> -
>
> Key: LUCENE-944
> URL: https://issues.apache.org/jira/browse/LUCENE-944
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: BooleanQuery20070626.patch
>
>
> Remove deprecated methods setUseScorer14 and getUseScorer14 in BooleanQuery, 
> and adapt javadocs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1081) Remove the "Experimental" warnings from search.function package

2007-12-11 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550411
 ] 

Michael Busch commented on LUCENE-1081:
---

I think we should resolve LUCENE-1085 first and move this to 2.4?

> Remove the "Experimental" warnings from search.function package
> ---
>
> Key: LUCENE-1081
> URL: https://issues.apache.org/jira/browse/LUCENE-1081
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.3
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 2.3
>
>
> I am using this package for a while, seems that others in this list use it 
> too, so let's remove those warnings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Shai Erera
Hi

Back from the experiments lab with more results. I've used two indexes (1
and 10 million documents) and ran the same 2,000 queries over each. Each run
was executed 4 times and I paste here the average of the last 3 (to discount
one-time OS caching effects and to mimic systems that have been running for a
while and therefore already have some data in the OS cache). Following are
the results:

Current TopDocCollector + PQ
----------------------------
Index Size          1M         10M
Avg. Time           8.519ms    289.232ms
Avg. Allocations    77.38      97.35
Avg. # results      51,113     461,019

Modified TopDocCollector + PQ
-----------------------------
Index Size          1M         10M
Avg. Time           9.619ms    298.197ms
Avg. Allocations    9.92       10.12
Avg. # results      51,113     461,019

Basically the results haven't changed from yesterday. There isn't any
significant difference in the execution time of both versions. The only
difference is the number of allocations.
Although the number of allocations is very small (100 for 461,000 results),
I think it should not be neglected. On systems that rely solely on memory
(such as powerful systems that are able to keep entire indexes in-memory),
the number of object allocations may be significant.

The way I see it we can do either of the following:
1. Add the method to PQ and change TDC's implementation to reuse ScoreDocs
(see the sketch after this list). We gain only in the number of allocations.
Basically, we don't lose anything by doing that, we only gain.
2. Add the method to PQ for applications that require it and not change
TDC's implementation. For example, applications that want to show the 10
most recent documents from a very large collection need to run a
MatchAllDocsQuery with some sorting. They may create a lot more instances of
ScoreDoc.
3. Do nothing.
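
For option 1, a minimal sketch of what the reuse could look like inside
collect(), assuming the insertWithOverflow() method proposed earlier in this
thread (an illustration, not the committed TopDocCollector):

  private ScoreDoc spare; // reusable instance; null until something overflows

  public void collect(int doc, float score) {
    if (score > 0.0f) {
      totalHits++;
      if (hq.size() < numHits || score >= minScore) {
        if (spare == null) {
          spare = new ScoreDoc(doc, score); // allocate only when nothing to reuse
        } else {
          spare.doc = doc;    // recycle whatever last fell off the queue
          spare.score = score;
        }
        // insertWithOverflow() hands back the dropped (or rejected) object,
        // which becomes the spare for the next insertion attempt.
        spare = (ScoreDoc) hq.insertWithOverflow(spare);
        minScore = ((ScoreDoc) hq.top()).score;
      }
    }
  }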

If you think I should run more tests, please let me know - I already have
the two indexes and any further tests can be performed quite immediately.

Thanks,

Shai

On Dec 10, 2007 11:46 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> On 10-Dec-07, at 1:20 PM, Shai Erera wrote:
>
> > Thanks for the info. Too bad I use Windows ...
>
> Just allocate a bunch of memory and free it.  This linux, but
> something similar should work on windows:
>
> $ vmstat -S M
> procs ---memory--
> r  b   swpd   free   buff  cache
> 0  0  0 45372786
>
> $ python -c '"a"*20'
>
> $ vmstat -S M
> procs ---memory--
> r  b   swpd   free   buff  cache
> 0  0463   1761  0  6
>
> -Mike
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Regards,

Shai Erera


[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Brian Pinkerton (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550376
 ] 

Brian Pinkerton commented on LUCENE-753:


BTW, I think the performance win with Yonik's patch for some workloads could be 
far greater than what the simple benchmark illustrates.  Sure, pread might be 
marginally faster.   But the real win is avoiding synchronized access to the 
file.

I did some IO tracing a while back on one particular workload that is 
characterized by:
* a small number of large compound indexes
* short average execution time, particularly compared to disk response time
* a 99+% FS cache hit rate
* cache misses that tend to cluster on rare queries

In this workload where each query hits each compound index, the locking in 
FSIndexInput means that a single rare query clobbers the response time for all 
queries.  The requests to read cached data are serialized (fairly, even) with 
those that hit the disk.  While we can't get rid of the rare queries, we can 
allow the common ones to proceed against cached data right away.
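
A sketch of the positional-read idea using java.nio: each call to
FileChannel.read(ByteBuffer, position) carries its own offset, so there is
no shared file pointer to lock (the helper below is illustrative, not the
actual patch):

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;

  // Concurrent readers each pass their own position, so a query served
  // from cached data never queues behind one that is stuck on a disk read.
  static void pread(FileChannel channel, byte[] dest, int destOffset,
                    int len, long fileOffset) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(dest, destOffset, len);
    while (buf.hasRemaining()) {
      int n = channel.read(buf, fileOffset + (buf.position() - destOffset));
      if (n < 0) {
        throw new IOException("read past EOF");
      }
    }
  }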



> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-11 Thread Brian Pinkerton (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550351
 ] 

Brian Pinkerton commented on LUCENE-753:


Yeah, the file was full of zeroes.  But I created the files w/o holes and was 
using filesystems that don't compress file contents.  Just to be sure, though, 
I repeated the tests with a file with random contents; the results above still 
hold.


> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]