[jira] Commented: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method
[ https://issues.apache.org/jira/browse/LUCENE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550843 ]

Shai Erera commented on LUCENE-1088:
------------------------------------

If you're adding a wouldBeInserted method, I'd also add an insertWithNoCheck that either calls put() (if there is room in the queue) or replaces the top() element, without re-evaluating the element (by calling lessThan()). This will save unnecessary calls (the lessThan() method can be expensive). The method's documentation should note, though, that insertWithNoCheck assumes wouldBeInserted was called first.

> PriorityQueue 'wouldBeInserted' method
> --------------------------------------
>
>                 Key: LUCENE-1088
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1088
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Peter Keegan
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1088.patch
>
>
> This is a request for a new method in PriorityQueue:
>
>   public boolean wouldBeInserted(Object element)
>   // returns true if doc would be inserted, without inserting
>
> This would allow an application to prevent duplicate entries from being added
> to the queue.
> Here is a reference to the discussion behind this request:
> http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
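The combination Shai suggests above can be sketched roughly as follows. This is a hypothetical, simplified bounded queue using Comparable instead of Lucene's lessThan() callback, and the class and method bodies are illustrative, not the actual patch:

```java
import java.util.PriorityQueue;

// Hypothetical sketch: a queue that keeps the top maxSize elements and
// exposes the two proposed methods. A min-heap keeps the weakest element
// on top, so it is the eviction candidate.
public class BoundedQueue<T extends Comparable<T>> {
    private final PriorityQueue<T> heap = new PriorityQueue<>();
    private final int maxSize;

    public BoundedQueue(int maxSize) { this.maxSize = maxSize; }

    // Proposed method: report whether element would make it into the
    // queue, without actually inserting it.
    public boolean wouldBeInserted(T element) {
        return heap.size() < maxSize || element.compareTo(heap.peek()) > 0;
    }

    // Suggested companion: insert without re-comparing, assuming the
    // caller already called wouldBeInserted(). Saves a second
    // (potentially expensive) comparison.
    public void insertWithNoCheck(T element) {
        if (heap.size() >= maxSize) {
            heap.poll();  // evict the current weakest element
        }
        heap.add(element);
    }

    public T top() { return heap.peek(); }

    public static void main(String[] args) {
        BoundedQueue<Integer> q = new BoundedQueue<>(2);
        q.insertWithNoCheck(1);
        q.insertWithNoCheck(3);
        System.out.println(q.wouldBeInserted(5)); // 5 beats the current top (1)
    }
}
```

A caller would pair the two methods: check wouldBeInserted() once (e.g. to reject duplicates), then call insertWithNoCheck() if it returned true.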
[jira] Updated: (LUCENE-944) Remove deprecated methods in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-944:
---------------------------------
    Fix Version/s:     (was: 2.3)
                   3.0

> Remove deprecated methods in BooleanQuery
> -----------------------------------------
>
>                 Key: LUCENE-944
>                 URL: https://issues.apache.org/jira/browse/LUCENE-944
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Paul Elschot
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: BooleanQuery20070626.patch
>
>
> Remove deprecated methods setUseScorer14 and getUseScorer14 in BooleanQuery,
> and adapt javadocs.
[jira] Resolved: (LUCENE-673) Exceptions when using Lucene over NFS
[ https://issues.apache.org/jira/browse/LUCENE-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-673.
---------------------------------------
    Resolution: Fixed

> Exceptions when using Lucene over NFS
> -------------------------------------
>
>                 Key: LUCENE-673
>                 URL: https://issues.apache.org/jira/browse/LUCENE-673
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.0.0
>         Environment: NFS server/client
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.2
>
>
> I'm opening this issue to track details on the known problems with
> Lucene over NFS.
> The summary is: if you have one machine writing to an index stored on
> an NFS mount, and other machine(s) reading (and periodically
> re-opening) the index, then sometimes on re-opening the index the
> reader will hit a FileNotFound exception.
> This has hit many users because this is a natural way to "scale up"
> your searching (single writer, multiple readers) across machines. The
> best current workaround (I think?) is to take the approach Solr takes
> (either by actually using Solr or by copying/modifying its approach):
> take snapshots of the index and have the readers open the snapshots
> instead of the "live" index being written to.
> I've been working on two patches for Lucene:
>   * A locking (LockFactory) implementation using native OS locks
>   * Lock-less commits
> (I'll open separate issues with the details for those.)
> I have a simple stress test where one machine is constantly adding
> docs to an index over NFS, and another machine is constantly
> re-opening the index searcher over NFS.
> These tests have revealed new details (at least for me!) about the
> root cause of our NFS problems:
>   * Even when using native locks over NFS, Lucene still hits these
>     exceptions! I was surprised by this because I had always thought
>     (assumed?) the NFS problem was because the "simple" file-based
>     locking was not correct over NFS, and that switching to native OS
>     filesystem locking would resolve it, but it doesn't. I can
>     reproduce the "FileNotFound" exceptions even when using NFS v4
>     (the latest NFS protocol), so this is not just a "your NFS server
>     is too old" issue.
>   * When running the same stress test with the lock-less changes, I
>     don't hit any exceptions. I've tested on NFS versions 2, 3 and 4
>     (using the "nfsvers=N" mount option).
> I think this means that in fact (as Hoss at one point suggested, I
> believe) the NFS problems are likely due to the cache coherence of
> the NFS file system (the "segments" file in particular) against the
> existence of the actual segment data files.
> In other words, even if you lock correctly, the reader side will
> sometimes see stale contents of the "segments" file, which lead it to
> try to open a now-deleted segment data file.
> So I think this is good news / bad news: the bad news is, native
> locking doesn't fix our problems with NFS (as at least I had expected
> it to). But the good news is, it looks like (I still need to do more
> thorough testing of this) the changes for lock-less commits do enable
> Lucene to work fine over NFS.
> [One quick side note in case it helps others: to get native locks
> working over NFS on Ubuntu/Debian Linux 6.06, I had to "apt-get
> install nfs-common" on the NFS client machines. Before I did this I
> would hit "No locks available" IOExceptions on calling the "tryLock"
> method. The default NFS server install on the server machine just
> worked because it runs in kernel mode and starts a lockd process.]
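For context, the "native OS locks" approach discussed above corresponds to java.nio's FileLock API, which maps to fcntl-style locks on POSIX systems (and to NLM locks over NFS, hence the lockd dependency mentioned in the side note). A minimal sketch of obtaining and releasing such a lock; the lock-file name here is arbitrary and this is not Lucene's actual lock-factory code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Sketch: try to obtain a native OS lock on a lock file, then release it.
// Returns false if another process already holds the lock.
public class NativeLockDemo {
    public static boolean obtainAndRelease(String lockFilePath) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(lockFilePath, "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock(); // null if held elsewhere
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(obtainAndRelease("demo.lock"));
    }
}
```

Note that, per the issue above, correct locking alone did not fix the NFS failures; the stale "segments" cache was the real culprit.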
[jira] Updated: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method
[ https://issues.apache.org/jira/browse/LUCENE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1088:
---------------------------------------
    Attachment: LUCENE-1088.patch

Attached patch. All tests pass. I plan to commit sometime tomorrow...
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550710 ]

Michael McCandless commented on LUCENE-753:
-------------------------------------------

OK, my results on Win XP now agree with Yonik's. On UNIX & OS X, ChannelPread is a bit (2-14%) better, but on Windows it's quite a bit (31-34%) slower.

Win Server 2003 R2 Enterprise x64 (Sun Java 1.6):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=68094, MB/sec=197.10654095808735
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=72594, MB/sec=184.88818359644048
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=98328, MB/sec=136.581360345
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=201563, MB/sec=66.58847506734867
{code}

Win XP Pro SP2, laptop (Sun Java 1.5):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=47449, MB/sec=282.8673481000653
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=54899, MB/sec=244.4811890926975
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=71683, MB/sec=187.237877878995
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=149475, MB/sec=89.79275999330991
{code}

Linux 2.6.22.1 (Sun Java 1.5):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=41162, MB/sec=326.0719304212623
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=53114, MB/sec=252.69745829724744
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=40226, MB/sec=333.65914582608264
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=59163, MB/sec=226.86092321214272
{code}

Mac OS X 10.4 (Sun Java 1.5):
{code}
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=85894, MB/sec=156.25972477705076
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=109939, MB/sec=122.08381738964336
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=75517, MB/sec=177.73180608339845
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=130156, MB/sec=103.12066136021389
{code}

> Use NIO positional read to avoid synchronization in FSIndexInput
> -----------------------------------------------------------------
>
>                 Key: LUCENE-753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-753
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>            Reporter: Yonik Seeley
>         Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java,
>                      FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of
> files in the index format.
[jira] Updated: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1044:
---------------------------------------
    Attachment: LUCENE-1044.take5.patch

Initial patch attached:

* Created new commit() method; deprecated public flush() method.
* Changed IndexWriter to not write segments_N when flushing, only when syncing (added a new private sync() for this). The current "policy" is to sync only after merges are committed. When autoCommit=false we do not sync until close() or commit() is called.
* Added MockRAMDirectory.crash() to simulate a machine crash. It keeps track of un-synced files, and then in crash() it goes and corrupts any unsynced files rather aggressively.
* Added a new unit test, TestCrash, to crash the MockRAMDirectory at various interesting times and make sure we can still load the resulting index.
* Added a new Directory.sync() method. In FSDirectory.sync, if I hit an IOException when opening or sync'ing, I retry (currently after waiting 5 msec, and retrying up to 5 times). If it still fails after that, the original exception is thrown and the new segments_N will not be written (and the previous commit will also not be deleted).

All tests now pass, but there is still a lot to do, e.g. at least:

* Javadocs.
* Refactor the syncing code so DirectoryIndexReader.doCommit can use it as well.
* Change the format of segments_N to include a hash of its contents at the end. I think this is now necessary in case we crash after writing segments_N but before we can sync it, to ensure that whoever next opens the reader can detect corruption in this segments_N file.

> Behavior on hard power shutdown
> -------------------------------
>
>                 Key: LUCENE-1044
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1044
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>         Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
>            Reporter: venkat rangan
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: FSyncPerfTest.java, LUCENE-1044.patch,
>                      LUCENE-1044.take2.patch, LUCENE-1044.take3.patch,
>                      LUCENE-1044.take4.patch, LUCENE-1044.take5.patch
>
>
> When indexing a large number of documents, upon a hard power failure (e.g.
> pull the power cord), the index seems to get corrupted. We start a Java
> application as a Windows Service and feed it documents. In some cases
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct.
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our
> customer deployments to 1.9 or a later version, but would be happy to back-port
> a patch, if the patch is small enough and if this problem is already solved.
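The retry-on-IOException behavior described for FSDirectory.sync above could look roughly like this. This is a sketch assuming the 5-retry / 5-msec parameters from the comment; it is not the actual patch code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: fsync a file, retrying on IOException up to 5 times with a
// 5 msec pause between attempts; if all attempts fail, rethrow the
// original (first) exception.
public class SyncWithRetry {
    public static void sync(String fileName) throws IOException {
        IOException first = null;
        for (int attempt = 0; attempt < 5; attempt++) {
            try (RandomAccessFile file = new RandomAccessFile(fileName, "rw")) {
                file.getFD().sync(); // force contents to stable storage
                return;
            } catch (IOException e) {
                if (first == null) first = e;
                try {
                    Thread.sleep(5);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw first;
                }
            }
        }
        throw first; // all retries failed: surface the original exception
    }
}
```

On failure the caller would skip writing the new segments_N, leaving the previous (already-synced) commit intact, which is the crash-safety property the patch is after.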
[jira] Updated: (LUCENE-1087) MultiSearcher.explain returns incorrect score/explanation relating to docFreq
[ https://issues.apache.org/jira/browse/LUCENE-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated LUCENE-1087:
-----------------------------
    Description:

Creating 2 different indexes, searching each individually and printing score details, then comparing to searching both indexes with MultiSearcher and printing score details: the "docFreq" value printed isn't correct - the values it prints are as if each index was searched individually.

Code is like:
{code}
MultiSearcher multi = new MultiSearcher(searchables);
Hits hits = multi.search(query);
for (int i = 0; i < hits.length(); i++)
{
  Explanation expl = multi.explain(query, hits.id(i));
  System.out.println(expl.toString());
}
{code}

I raised this in the Lucene user mailing list and was advised to log a bug; the email thread is given below.

{noformat}
-----Original Message-----
From: Chris Hostetter
Sent: Friday, December 07, 2007 10:30 PM
To: java-user
Subject: Re: does the MultiSearcher class calculate IDF properly?

a quick glance at the code seems to indicate that MultiSearcher has code
for calculating the docFreq across all of the Searchables when searching
(or when the docFreq method is explicitly called) but that explain method
just delegates to the Searchable that the specific docid came from.

if you compare that Explanation score you got with the score returned by
a HitCollector (or TopDocs) they probably won't match.

So i would say "yes MultiSearcher calculates IDF properly, but
MultiSearcher.explain is broken". Please file a bug about this, i can't
think of an easy way to fix it, but it certainly seems broken to me.

: Subject: does the MultiSearcher class calculate IDF properly?
:
: I tried the following. Creating 2 different indexes, search each
: individually and print score details and compare to searching both
: indexes with MultiSearcher and printing score details.
:
: The "docFreq" value printed doesn't seem right - is this just a problem
: with using Explain together with the MultiSearcher?
:
: Code is like:
: MultiSearcher multi = new MultiSearcher(searchables);
: Hits hits = multi.search(query);
: for (int i = 0; i < hits.length(); i++)
: {
:   Explanation expl = multi.explain(query, hits.id(i));
:   System.out.println(expl.toString());
: }
:
: Output:
: id = 14 score = 0.071
: 0.07073946 = (MATCH) fieldWeight(contents:climate in 2), product of:
:   1.0 = tf(termFreq(contents:climate)=1)
:   1.8109303 = idf(docFreq=1)
:   0.0390625 = fieldNorm(field=contents, doc=2)
{noformat}

> MultiSearcher.explain returns incorrect score/explanation relating to docFreq
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-1087
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1087
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Query/Scoring
>    Affects Versions: 2.2
>         Environment: No special hardware required to reproduce the issue.
>            Reporter: Yasoja Seneviratne
>            Priority: Minor
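The fix direction implied by the thread above is that explain() should use the docFreq summed across all sub-searchers, as search() already does, instead of delegating idf to the one index a hit came from. A toy sketch with a hypothetical Searchable interface (not Lucene's real one):

```java
import java.util.List;

// Sketch: aggregate docFreq across sub-searchers, the way
// MultiSearcher.search() already does and explain() should also do.
public class MultiDocFreq {
    // Hypothetical stand-in for the relevant slice of Lucene's Searchable.
    interface Searchable {
        int docFreq(String term);
    }

    public static int docFreq(List<Searchable> searchables, String term) {
        int sum = 0;
        for (Searchable s : searchables) {
            sum += s.docFreq(term); // each sub-index contributes its count
        }
        return sum;
    }
}
```

With the summed value, idf(docFreq=...) in the Explanation would match the score actually returned by a HitCollector or TopDocs across the combined indexes.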
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550701 ]

Michael McCandless commented on LUCENE-753:
-------------------------------------------

Thanks! I'll re-run.

{quote}
Well, at least we've learned that printing out all the input params for benchmarking programs is good practice :)
{quote}

Yes indeed :)
[jira] Updated: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-753:
--------------------------------
    Attachment: FileReadTest.java

OK, uploading the latest version of the test, which should fix ChannelTransfer (it's also slightly optimized to not create a new object per call).

Well, at least we've learned that printing out all the input params for benchmarking programs is good practice :-)
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550687 ]

Michael McCandless commented on LUCENE-753:
-------------------------------------------

Doh!! Woops :) I will rerun...
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550685 ]

Yonik Seeley commented on LUCENE-753:
-------------------------------------

I'll try fixing the transferTo test before anyone re-runs any tests.
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550682 ]

Yonik Seeley commented on LUCENE-753:
-------------------------------------

Mike, it looks like you are running with a bufsize of 6.5MB! Apologies for my hard-to-use benchmark program :-(
[jira] Assigned: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method
[ https://issues.apache.org/jira/browse/LUCENE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1088:
------------------------------------------
    Assignee: Michael McCandless
[jira] Created: (LUCENE-1088) PriorityQueue 'wouldBeInserted' method
PriorityQueue 'wouldBeInserted' method
--------------------------------------

                 Key: LUCENE-1088
                 URL: https://issues.apache.org/jira/browse/LUCENE-1088
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Other
            Reporter: Peter Keegan


This is a request for a new method in PriorityQueue:

  public boolean wouldBeInserted(Object element)
  // returns true if doc would be inserted, without inserting

This would allow an application to prevent duplicate entries from being added to the queue.

Here is a reference to the discussion behind this request:
http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550677 ]

Michael McCandless commented on LUCENE-753:
-------------------------------------------

I also just ran a test with 4 threads, random access, on Linux 2.6.22.1:

{code}
config: impl=ClassicFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
answer=-195110, ms=120856, MB/sec=444.22363142913883
config: impl=ChannelFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
answer=-195110, ms=88272, MB/sec=608.2006887801341
config: impl=ChannelPread serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
answer=-195110, ms=77672, MB/sec=691.2026367288084
config: impl=ChannelTransfer serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
answer=594875, ms=38390, MB/sec=1398.465517061735
{code}

ChannelTransfer got even faster (it scales up better with added threads).
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550675 ] Michael McCandless commented on LUCENE-753: --- I ran Yonik's most recent FileReadTest.java on the platforms below, testing single-threaded random access for fully cached 64 MB file. I tested two Windows XP Pro machines and got opposite results from Yonik. Yonik is your machine XP Home? I'm showing ChannelTransfer to be much faster on all platforms except Windows Server 2003 R2 Enterprise x64 where it's about the same as ChannelPread and ChannelFile. The ChannelTransfer test is giving the wrong checksum, but I think just a bug in how checksum is computed (it's using "len" which with ChannelTransfer is just the chunk size written on each call to write). So I think the MB/sec is still correct. Mac OS X 10.4 (Sun java 1.5) config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=32565, MB/sec=412.15331797942576 config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=19512, MB/sec=687.8727347273473 config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=19492, MB/sec=688.5785347835009 config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=16009, MB/sec=838.3892060715847 Linux 2.6.22.1 (Sun java 1.5) config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=37879, MB/sec=354.33281765622115 config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=21845, MB/sec=614.4093751430535 config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=21902, MB/sec=612.8103734818737 config: impl=ChannelTransfer serial=false nThreads=1 
iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=15978, MB/sec=840.015821754913

Windows Server 2003 R2 Enterprise x64 (Sun java 1.6)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=32703, MB/sec=410.4141149130049
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=23344, MB/sec=574.9559972583961
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=23329, MB/sec=575.3256804835183
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=23422, MB/sec=573.0412774314747

Windows XP Pro SP2, laptop (Sun Java 1.5)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=71253, MB/sec=188.36782731955148
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=57463, MB/sec=233.57243443607192
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=58043, MB/sec=231.23844046655068
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=20039, MB/sec=669.7825640001995

Windows XP Pro SP2, older desktop (Sun Java 1.6)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=53047, MB/sec=253.01662299470283
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=34047, MB/sec=394.2130819161747
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=34078, MB/sec=393.8544750278772
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=18781, MB/sec=714.6463340610192

> Use NIO positional read to avoid synchronization in FSIndexInput
> ----------------------------------------------------------------
>
>                 Key: LUCENE-753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-753
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>            Reporter: Yonik Seeley
>         Attachments: FileReadTest.java, FileReadTest.java, FSIndexInput.patch, FSIndexInput.patch
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the underlying file.
> This could mitigate any MT performance drop caused by reducing the number of files in the index format.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
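The "ChannelPread" variant that the benchmark exercises can be sketched as follows. This is not the FileReadTest.java attachment itself, just an illustrative example: FileChannel.read(ByteBuffer, long) takes an explicit position and never moves the channel's file pointer, so concurrent readers need no synchronization, unlike RandomAccessFile.seek()+read(), which share mutable position state.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PreadDemo {
    // Read up to len bytes at offset without touching the channel's position.
    static byte[] pread(FileChannel ch, long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, offset + buf.position()); // positional read
            if (n < 0) break;                              // hit EOF early
        }
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("pread", ".bin"); // scratch file for the demo
        tmp.deleteOnExit();
        try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
            f.write("hello world".getBytes());
            System.out.println(new String(pread(f.getChannel(), 6, 5))); // prints "world"
        }
    }
}
```

Because no per-call state lives on the channel, many threads can call pread() on one shared FileChannel at once, which is exactly what the serial=false/nThreads>1 benchmark configurations measure.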
Re: Performance Improvement for Search using PriorityQueue
On Dec 11, 2007 1:21 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote: > On Tuesday 11 December 2007 14:32:12 Shai Erera wrote: > > For (1) - I can't explain it but I've run into documents with 0.0f scores. > > For (2) - this is a simple logic - if the lowest score in the queue is 'x' > > and you want to top docs only, then there's no point in attempting to > > insert a document with score lower than 'x' (it will not be added). > > Sure. I didn't notice that score is passed as parameter and was surprised that > subsequent calls to collect() are supposed to be guaranteed to have a lower > score. One is not guaranteed this... collect() generally goes in docid order, and scores are unordered. If you are only gathering the top 10 docs by score, you can compare the current score to the lowest of the top 10 you currently have to determine if you should bother inserting into the queue. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
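The gate Yonik describes can be sketched with a plain java.util.PriorityQueue (an illustrative class, not Lucene's PriorityQueue): scores arrive in docid order, so before inserting into the heap of the current top K, compare against the lowest score already held and skip the insertion entirely when the candidate cannot get in.

```java
import java.util.PriorityQueue;

public class TopKScores {
    private final int k;
    private final PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap

    public TopKScores(int k) { this.k = k; }

    public void collect(float score) {
        if (heap.size() < k) {
            heap.add(score);              // queue not yet full: always insert
        } else if (score > heap.peek()) { // cheap check before any insertion
            heap.poll();
            heap.add(score);
        }                                 // else: candidate rejected, no work done
    }

    public float min() { return heap.peek(); } // lowest of the current top K
}
```

The point of the comparison is that most documents in a large result set fail it, so the heap (and any per-insert allocation) is touched only rarely once the queue fills up.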
Re: Performance Improvement for Search using PriorityQueue
On Tuesday 11 December 2007 14:32:12 Shai Erera wrote: > For (1) - I can't explain it but I've run into documents with 0.0f scores. > For (2) - this is a simple logic - if the lowest score in the queue is 'x' > and you want to top docs only, then there's no point in attempting to > insert a document with score lower than 'x' (it will not be added). Sure. I didn't notice that score is passed as parameter and was surprised that subsequent calls to collect() are supposed to be guaranteed to have a lower score. Ok, stupid question :) > Maybe I didn't understand your question correctly though ... > > On Dec 11, 2007 2:25 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote: > > On Monday 10 December 2007 09:15:12 Paul Elschot wrote: > > > The current TopDocCollector only allocates a ScoreDoc when the given > > > score causes a new ScoreDoc to be added into the queue, but it does > > > > I actually wrote my own HitCollector and now wonder about > > TopDocCollector: > > > > public void collect(int doc, float score) { > >if (score > 0.0f) { > > totalHits++; > > if (hq.size() < numHits || score >= minScore) { > >hq.insert(new ScoreDoc(doc, score)); > >minScore = ((ScoreDoc)hq.top()).score; // maintain minScore > > } > >} > > } > > > > 1) How can there be hits with score=0.0? > > 2) I don't understand minScore: inserts only document having a higher > > score > > than the lowest score already in queue? > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance Improvement for Search using PriorityQueue
Shai Erera wrote: Hi, I will open an issue and create the patch. One thing I'm not sure of is the wouldBeInserted method you mentioned - in what context should it be used? And ... lessThan shouldn't be public, it can stay protected. Sorry, this is a method Peter suggested (see below) in order to add de-duping logic on top of the PQ. We should do it separately (it's unrelated). Peter can you open an issue for that one? Thanks! Mike On 12/11/07, Michael McCandless <[EMAIL PROTECTED]> wrote: I think we can't make lessThan public since that would cause subclasses to fail to compile (ie this breaks backwards compatibility)? Adding "wouldBeInserted()" seems OK? Mike Peter Keegan wrote: See my similar request from last March: http://www.nabble.com/FieldSortedHitQueue-enhancement- to9733550.html#a9733550 Peter On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue": Hi Lucene's PQ implements two methods: put (assumes the PQ has room for the object) and insert (checks whether the object can be inserted etc.). The implementation of insert() requires the application that uses it to allocate a new object every time it calls insert. Specifically, it cannot reuse the objects that were removed from the PQ. I've read this entire thread, and would like to add my comments about three independent issues, which I think that can and perhaps should be considered separately: 1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue he couldn't just subclass PriorityQueue in his application, but rather was forced to modify PriorityQueue itself. Why? just because one field of that classi - "heap" - was defined "private" instead of "protected". Is there a special reason for that? If not, can we make the trivial change to make PriorityQueue's fields protected, to allow Shai and others (see the next point) to add functionality on top of PriorityQueue? 2. 
PriorityQueue, in addition to being used in about a dozen places inside Lucene, is also a public class that advanced users often find useful when implementing features like new collectors, new queries, and so on. Unfortunately, my experience exactly matches Shai's: In the two occasions where I used a PriorityQueue, I found that I needed such a insertWithOverflow() method. If this feature is so useful (I can't believe that Shai and me are the only ones who found it useful), I think it would be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used inside Lucene. Just to make it sound more interesting, let me give you an example where I needed (and implemented) an "insertWithOverflow()": I was implementing a faceted search capability over Lucene. It calculated a count for each facet value, and then I used a PriorityQueue to find the 10 best values. The problem is that I also needed an "other" aggregator, which was supposed to aggregate (in various ways) all the facets except the 10 best ones. For that, I needed to know which facets dropped off the priorityqueue. 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow() to be used in TopDocCollector. I have to admit I don't know how much of a benefit this will be in the "typical" case. But I do know that there's no such thing as a "typical" case... I can easily think of a quite typical "worst case" though: Consider a collection indexed in order of document age (a pretty typical scenario for a long-running index), and then you do a sorting query (TopFieldDocCollector), asking it to bring the 10 newest documents. In that case, each and every document will have a new DocScore created - which is the worst-case that Shai feared. It would be nice if Shai or someone else could provide a measurement in that case. P.S. When looking now at PriorityQueue's code, I found two tiny performance improvements that could be easily made to it - I wonder if there's any reason not to do them: 1. 
Insert can use heap[1] directly instead of calling top(). After all, this is done in an if() that already ensures that size>0.

2. Regardless, top() could return heap[1] always, without any if(). After all, the heap array is initialized to all nulls, so when size==0, heap[1] is null anyway.

PriorityQueue change (added insertWithOverflow method)
-------------------------------------------------------
/**
 * insertWithOverflow() is similar to insert(), except its return value: it
 * returns the object (if any) that was dropped off the heap because it was
 * full. This can be the given parameter (in case it is smaller than the
 * full heap's minimum, and couldn't be added) or another object that was
 * previously the smallest value in the heap and now has been replaced by a
 * larger one.
Re: Performance Improvement for Search using PriorityQueue
Hi, I will open an issue and create the patch. One thing I'm not sure of is the wouldBeInserted method you mentioned - in what context should it be used? And ... lessThan shouldn't be public, it can stay protected. On 12/11/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > I think we can't make lessThan public since that would cause > subclasses to fail to compile (ie this breaks backwards compatibility)? > > Adding "wouldBeInserted()" seems OK? > > Mike > > Peter Keegan wrote: > > > See my similar request from last March: > > http://www.nabble.com/FieldSortedHitQueue-enhancement- > > to9733550.html#a9733550 > > > > Peter > > > > On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> > > wrote: > > > >> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance > >> Improvement for > >> Search using PriorityQueue": > >>> Hi > >>> > >>> Lucene's PQ implements two methods: put (assumes the PQ has room > >>> for the > >>> object) and insert (checks whether the object can be inserted > >>> etc.). The > >>> implementation of insert() requires the application that uses it to > >> allocate > >>> a new object every time it calls insert. Specifically, it cannot > >>> reuse > >> the > >>> objects that were removed from the PQ. > >> > >> I've read this entire thread, and would like to add my comments about > >> three > >> independent issues, which I think that can and perhaps should be > >> considered > >> separately: > >> > >> 1. When Shai wanted to add the insertWithOverflow() method to > >> PriorityQueue > >> he couldn't just subclass PriorityQueue in his application, but > >> rather > >> was forced to modify PriorityQueue itself. Why? just because one > >> field > >> of that classi - "heap" - was defined "private" instead of > >> "protected". > >> > >> Is there a special reason for that? 
If not, can we make the trivial > >> change > >> to make PriorityQueue's fields protected, to allow Shai and > >> others (see > >> the > >> next point) to add functionality on top of PriorityQueue? > >> > >> 2. PriorityQueue, in addition to being used in about a dozen > >> places inside > >> Lucene, is also a public class that advanced users often find > >> useful > >> when > >> implementing features like new collectors, new queries, and so on. > >> Unfortunately, my experience exactly matches Shai's: In the two > >> occasions > >> where I used a PriorityQueue, I found that I needed such a > >> insertWithOverflow() method. If this feature is so useful (I can't > >> believe > >> that Shai and me are the only ones who found it useful), I think it > >> would > >> be nice to add it to Lucene's PriorityQueue, even if it isn't > >> (yet) used > >> inside Lucene. > >> > >> Just to make it sound more interesting, let me give you an > >> example where > >> I needed (and implemented) an "insertWithOverflow()": I was > >> implementing > >> a > >> faceted search capability over Lucene. It calculated a count for > >> each > >> facet value, and then I used a PriorityQueue to find the 10 best > >> values. > >> The problem is that I also needed an "other" aggregator, which was > >> supposed > >> to aggregate (in various ways) all the facets except the 10 best > >> ones. > >> For > >> that, I needed to know which facets dropped off the priorityqueue. > >> > >> 3. Finally, Shai asked for this new > >> PriorityQueue.insertWithOverflow() > >> to be used in TopDocCollector. I have to admit I don't know how > >> much > >> of a benefit this will be in the "typical" case. But I do know that > >> there's no such thing as a "typical" case... 
> >> I can easily think of a quite typical "worst case" though: > >> Consider a > >> collection indexed in order of document age (a pretty typical > >> scenario > >> for a long-running index), and then you do a sorting query > >> (TopFieldDocCollector), asking it to bring the 10 newest documents. > >> In that case, each and every document will have a new DocScore > >> created - > >> which is the worst-case that Shai feared. > >> It would be nice if Shai or someone else could provide a > >> measurement in > >> that case. > >> > >> P.S. When looking now at PriorityQueue's code, I found two tiny > >> performance improvements that could be easily made to it - I > >> wonder if > >> there's any reason not to do them: > >> > >> 1. Insert can use heap[1] directly instead of calling top(). > >> After all, > >>this is done in an if() that already ensures that size>0. > >> > >> 2. Regardless, top() could return heap[1] always, without any if > >> (). After > >>all, the heap array is initialized to all nulls, so when size==0, > >> heap[1] > >>is null anyway. > >> > >> > >> > >>> PriorityQueue change (added insertWithOverflow method) > >>> > >> - > >> -- > >>> /** > >>> * insertWithOverflow() is similar to insert(), except its > >>> return > >> value: > >>
Re: Performance Improvement for Search using PriorityQueue
I think we can't make lessThan public since that would cause subclasses to fail to compile (ie this breaks backwards compatibility)? Adding "wouldBeInserted()" seems OK? Mike Peter Keegan wrote: See my similar request from last March: http://www.nabble.com/FieldSortedHitQueue-enhancement- to9733550.html#a9733550 Peter On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue": Hi Lucene's PQ implements two methods: put (assumes the PQ has room for the object) and insert (checks whether the object can be inserted etc.). The implementation of insert() requires the application that uses it to allocate a new object every time it calls insert. Specifically, it cannot reuse the objects that were removed from the PQ. I've read this entire thread, and would like to add my comments about three independent issues, which I think that can and perhaps should be considered separately: 1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue he couldn't just subclass PriorityQueue in his application, but rather was forced to modify PriorityQueue itself. Why? just because one field of that classi - "heap" - was defined "private" instead of "protected". Is there a special reason for that? If not, can we make the trivial change to make PriorityQueue's fields protected, to allow Shai and others (see the next point) to add functionality on top of PriorityQueue? 2. PriorityQueue, in addition to being used in about a dozen places inside Lucene, is also a public class that advanced users often find useful when implementing features like new collectors, new queries, and so on. Unfortunately, my experience exactly matches Shai's: In the two occasions where I used a PriorityQueue, I found that I needed such a insertWithOverflow() method. 
If this feature is so useful (I can't believe that Shai and me are the only ones who found it useful), I think it would be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used inside Lucene. Just to make it sound more interesting, let me give you an example where I needed (and implemented) an "insertWithOverflow()": I was implementing a faceted search capability over Lucene. It calculated a count for each facet value, and then I used a PriorityQueue to find the 10 best values. The problem is that I also needed an "other" aggregator, which was supposed to aggregate (in various ways) all the facets except the 10 best ones. For that, I needed to know which facets dropped off the priorityqueue. 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow() to be used in TopDocCollector. I have to admit I don't know how much of a benefit this will be in the "typical" case. But I do know that there's no such thing as a "typical" case... I can easily think of a quite typical "worst case" though: Consider a collection indexed in order of document age (a pretty typical scenario for a long-running index), and then you do a sorting query (TopFieldDocCollector), asking it to bring the 10 newest documents. In that case, each and every document will have a new DocScore created - which is the worst-case that Shai feared. It would be nice if Shai or someone else could provide a measurement in that case. P.S. When looking now at PriorityQueue's code, I found two tiny performance improvements that could be easily made to it - I wonder if there's any reason not to do them: 1. Insert can use heap[1] directly instead of calling top(). After all, this is done in an if() that already ensures that size>0. 2. Regardless, top() could return heap[1] always, without any if (). After all, the heap array is initialized to all nulls, so when size==0, heap[1] is null anyway. 
PriorityQueue change (added insertWithOverflow method)
-------------------------------------------------------
/**
 * insertWithOverflow() is similar to insert(), except its return value: it
 * returns the object (if any) that was dropped off the heap because it was
 * full. This can be the given parameter (in case it is smaller than the
 * full heap's minimum, and couldn't be added) or another object that was
 * previously the smallest value in the heap and now has been replaced by a
 * larger one.
 */
public Object insertWithOverflow(Object element) {
  if (size < maxSize) {
    put(element);
    return null;
  } else if (size > 0 && !lessThan(element, top())) {
    Object ret = heap[1];
    heap[1] = element;
    adjustTop();
    return ret;
  } else {
    return element;
  }
}
[Very similar to insert(), only it returns the object that was kicked out of the Queue, or null]
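The patch above targets Lucene's heap-array PriorityQueue. As a rough illustration of the same contract and of the object-reuse pattern the thread is after, here is a hedged sketch built on java.util.PriorityQueue (the class name OverflowQueue is invented for the example): null means the element was inserted with room to spare, the argument itself means it was rejected, and anything else is the evicted former minimum, free for the caller to recycle.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class OverflowQueue<T> {
    private final int maxSize;
    private final Comparator<T> lessThan;   // plays the role of lessThan()
    private final PriorityQueue<T> pq;

    OverflowQueue(int maxSize, Comparator<T> lessThan) {
        this.maxSize = maxSize;
        this.lessThan = lessThan;
        this.pq = new PriorityQueue<>(maxSize, lessThan);
    }

    T insertWithOverflow(T element) {
        if (pq.size() < maxSize) {
            pq.add(element);
            return null;                    // inserted, nothing dropped
        }
        T top = pq.peek();
        if (lessThan.compare(element, top) < 0) {
            return element;                 // smaller than the minimum: rejected
        }
        pq.poll();                          // evict the former minimum
        pq.add(element);
        return top;                         // hand it back for reuse
    }

    T top() { return pq.peek(); }
}
```

A collector can then recycle the returned holder instead of allocating one per call: keep a single spare object, fill it with the current doc/score, call insertWithOverflow(spare), and set spare to the return value when it is non-null; a fresh allocation is needed only when the heap actually kept the spare.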
Re: Performance Improvement for Search using PriorityQueue
I agree that even though we don't see gains on the queries tested, there are in theory cases where there could be a great many allocations that would be saved. I think we should do Shai's suggested option 1 (add the method and change TDC to call it), change heap to be protected not private, plus the 2 tiny performance gains Nadav suggests below? Shai can you open a Jira issue & attach a patch for these changes? Thanks! Mike Nadav Har'El wrote: On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue": Hi Lucene's PQ implements two methods: put (assumes the PQ has room for the object) and insert (checks whether the object can be inserted etc.). The implementation of insert() requires the application that uses it to allocate a new object every time it calls insert. Specifically, it cannot reuse the objects that were removed from the PQ. I've read this entire thread, and would like to add my comments about three independent issues, which I think that can and perhaps should be considered separately: 1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue he couldn't just subclass PriorityQueue in his application, but rather was forced to modify PriorityQueue itself. Why? just because one field of that classi - "heap" - was defined "private" instead of "protected". Is there a special reason for that? If not, can we make the trivial change to make PriorityQueue's fields protected, to allow Shai and others (see the next point) to add functionality on top of PriorityQueue? 2. PriorityQueue, in addition to being used in about a dozen places inside Lucene, is also a public class that advanced users often find useful when implementing features like new collectors, new queries, and so on. Unfortunately, my experience exactly matches Shai's: In the two occasions where I used a PriorityQueue, I found that I needed such a insertWithOverflow() method. 
If this feature is so useful (I can't believe that Shai and me are the only ones who found it useful), I think it would be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used inside Lucene. Just to make it sound more interesting, let me give you an example where I needed (and implemented) an "insertWithOverflow()": I was implementing a faceted search capability over Lucene. It calculated a count for each facet value, and then I used a PriorityQueue to find the 10 best values. The problem is that I also needed an "other" aggregator, which was supposed to aggregate (in various ways) all the facets except the 10 best ones. For that, I needed to know which facets dropped off the priorityqueue. 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow() to be used in TopDocCollector. I have to admit I don't know how much of a benefit this will be in the "typical" case. But I do know that there's no such thing as a "typical" case... I can easily think of a quite typical "worst case" though: Consider a collection indexed in order of document age (a pretty typical scenario for a long-running index), and then you do a sorting query (TopFieldDocCollector), asking it to bring the 10 newest documents. In that case, each and every document will have a new DocScore created - which is the worst-case that Shai feared. It would be nice if Shai or someone else could provide a measurement in that case. P.S. When looking now at PriorityQueue's code, I found two tiny performance improvements that could be easily made to it - I wonder if there's any reason not to do them: 1. Insert can use heap[1] directly instead of calling top(). After all, this is done in an if() that already ensures that size>0. 2. Regardless, top() could return heap[1] always, without any if (). After all, the heap array is initialized to all nulls, so when size==0, heap[1] is null anyway. 
PriorityQueue change (added insertWithOverflow method)
-------------------------------------------------------
/**
 * insertWithOverflow() is similar to insert(), except its return value: it
 * returns the object (if any) that was dropped off the heap because it was
 * full. This can be the given parameter (in case it is smaller than the
 * full heap's minimum, and couldn't be added) or another object that was
 * previously the smallest value in the heap and now has been replaced by a
 * larger one.
 */
public Object insertWithOverflow(Object element) {
  if (size < maxSize) {
    put(element);
    return null;
  } else if (size > 0 && !lessThan(element, top())) {
    Object ret = heap[1];
    heap[1] = element;
    adjustTop();
    return ret;
  } else {
    return element;
  }
}
Re: Performance Improvement for Search using PriorityQueue
See my similar request from last March: http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550 Peter On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: > On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for > Search using PriorityQueue": > > Hi > > > > Lucene's PQ implements two methods: put (assumes the PQ has room for the > > object) and insert (checks whether the object can be inserted etc.). The > > implementation of insert() requires the application that uses it to > allocate > > a new object every time it calls insert. Specifically, it cannot reuse > the > > objects that were removed from the PQ. > > I've read this entire thread, and would like to add my comments about > three > independent issues, which I think that can and perhaps should be > considered > separately: > > 1. When Shai wanted to add the insertWithOverflow() method to > PriorityQueue > he couldn't just subclass PriorityQueue in his application, but rather > was forced to modify PriorityQueue itself. Why? just because one field > of that classi - "heap" - was defined "private" instead of "protected". > > Is there a special reason for that? If not, can we make the trivial > change > to make PriorityQueue's fields protected, to allow Shai and others (see > the > next point) to add functionality on top of PriorityQueue? > > 2. PriorityQueue, in addition to being used in about a dozen places inside > Lucene, is also a public class that advanced users often find useful > when > implementing features like new collectors, new queries, and so on. > Unfortunately, my experience exactly matches Shai's: In the two > occasions > where I used a PriorityQueue, I found that I needed such a > insertWithOverflow() method. If this feature is so useful (I can't > believe > that Shai and me are the only ones who found it useful), I think it > would > be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used > inside Lucene. 
> > Just to make it sound more interesting, let me give you an example where > I needed (and implemented) an "insertWithOverflow()": I was implementing > a > faceted search capability over Lucene. It calculated a count for each > facet value, and then I used a PriorityQueue to find the 10 best values. > The problem is that I also needed an "other" aggregator, which was > supposed > to aggregate (in various ways) all the facets except the 10 best ones. > For > that, I needed to know which facets dropped off the priorityqueue. > > 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow() > to be used in TopDocCollector. I have to admit I don't know how much > of a benefit this will be in the "typical" case. But I do know that > there's no such thing as a "typical" case... > I can easily think of a quite typical "worst case" though: Consider a > collection indexed in order of document age (a pretty typical scenario > for a long-running index), and then you do a sorting query > (TopFieldDocCollector), asking it to bring the 10 newest documents. > In that case, each and every document will have a new DocScore created - > which is the worst-case that Shai feared. > It would be nice if Shai or someone else could provide a measurement in > that case. > > P.S. When looking now at PriorityQueue's code, I found two tiny > performance improvements that could be easily made to it - I wonder if > there's any reason not to do them: > > 1. Insert can use heap[1] directly instead of calling top(). After all, >this is done in an if() that already ensures that size>0. > > 2. Regardless, top() could return heap[1] always, without any if(). After >all, the heap array is initialized to all nulls, so when size==0, > heap[1] >is null anyway. 
> > > > > PriorityQueue change (added insertWithOverflow method) > > > --- > > /** > > * insertWithOverflow() is similar to insert(), except its return > value: > > it > > * returns the object (if any) that was dropped off the heap because > it > > was > > * full. This can be the given parameter (in case it is smaller than > the > > * full heap's minimum, and couldn't be added) or another object > that > > was > > * previously the smallest value in the heap and now has been > replaced > > by a > > * larger one. > > */ > > public Object insertWithOverflow(Object element) { > > if (size < maxSize) { > > put(element); > > return null; > > } else if (size > 0 && !lessThan(element, top())) { > > Object ret = heap[1]; > > heap[1] = element; > > adjustTop(); > > return ret; > > } else { > > return element; > > } > > } > > [Very similar to insert(), only it returns the object that was kicked > out of > > the Queue, or null] >
Re: Performance Improvement for Search using PriorityQueue
On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue":
> Hi
>
> Lucene's PQ implements two methods: put (assumes the PQ has room for the
> object) and insert (checks whether the object can be inserted etc.). The
> implementation of insert() requires the application that uses it to allocate
> a new object every time it calls insert. Specifically, it cannot reuse the
> objects that were removed from the PQ.

I've read this entire thread, and would like to add my comments about three independent issues, which I think can and perhaps should be considered separately:

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue he couldn't just subclass PriorityQueue in his application, but rather was forced to modify PriorityQueue itself. Why? Just because one field of that class - "heap" - was defined "private" instead of "protected". Is there a special reason for that? If not, can we make the trivial change to make PriorityQueue's fields protected, to allow Shai and others (see the next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside Lucene, is also a public class that advanced users often find useful when implementing features like new collectors, new queries, and so on. Unfortunately, my experience exactly matches Shai's: In the two occasions where I used a PriorityQueue, I found that I needed such an insertWithOverflow() method. If this feature is so useful (I can't believe that Shai and I are the only ones who found it useful), I think it would be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used inside Lucene.

Just to make it sound more interesting, let me give you an example where I needed (and implemented) an "insertWithOverflow()": I was implementing a faceted search capability over Lucene. It calculated a count for each facet value, and then I used a PriorityQueue to find the 10 best values.
The problem is that I also needed an "other" aggregator, which was supposed to aggregate (in various ways) all the facets except the 10 best ones. For that, I needed to know which facets dropped off the priorityqueue. 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow() to be used in TopDocCollector. I have to admit I don't know how much of a benefit this will be in the "typical" case. But I do know that there's no such thing as a "typical" case... I can easily think of a quite typical "worst case" though: Consider a collection indexed in order of document age (a pretty typical scenario for a long-running index), and then you do a sorting query (TopFieldDocCollector), asking it to bring the 10 newest documents. In that case, each and every document will have a new DocScore created - which is the worst-case that Shai feared. It would be nice if Shai or someone else could provide a measurement in that case. P.S. When looking now at PriorityQueue's code, I found two tiny performance improvements that could be easily made to it - I wonder if there's any reason not to do them: 1. Insert can use heap[1] directly instead of calling top(). After all, this is done in an if() that already ensures that size>0. 2. Regardless, top() could return heap[1] always, without any if(). After all, the heap array is initialized to all nulls, so when size==0, heap[1] is null anyway. > PriorityQueue change (added insertWithOverflow method) > --- > /** > * insertWithOverflow() is similar to insert(), except its return value: > it > * returns the object (if any) that was dropped off the heap because it > was > * full. This can be the given parameter (in case it is smaller than the > * full heap's minimum, and couldn't be added) or another object that > was > * previously the smallest value in the heap and now has been replaced > by a > * larger one. 
> */
> public Object insertWithOverflow(Object element) {
>   if (size < maxSize) {
>     put(element);
>     return null;
>   } else if (size > 0 && !lessThan(element, top())) {
>     Object ret = heap[1];
>     heap[1] = element;
>     adjustTop();
>     return ret;
>   } else {
>     return element;
>   }
> }
> [Very similar to insert(), only it returns the object that was kicked out of
> the Queue, or null]

--
Nadav Har'El              | Tuesday, Dec 11 2007, 3 Tevet 5768
IBM Haifa Research Lab    |
                          | A professor is one who talks in someone
http://nadav.harel.org.il | else's sleep.

- To unsubscribe, e-mail: [EMAIL PROTECTED]
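Nadav's "other" aggregator use case can be sketched as follows (names and the sample counts are invented for the example; a real implementation would sit on an insertWithOverflow-style queue rather than java.util.PriorityQueue): keep the n largest facet counts in a min-heap, and fold every count that drops off the heap, or never gets in, into an "other" total.

```java
import java.util.PriorityQueue;

public class FacetTopN {
    static long otherTotal(int[] counts, int n, PriorityQueue<Integer> best) {
        long other = 0;
        for (int c : counts) {
            Integer dropped;
            if (best.size() < n) {
                best.add(c);
                dropped = null;            // still room: nothing falls out
            } else if (c > best.peek()) {
                dropped = best.poll();     // former minimum falls out of the top-n
                best.add(c);
            } else {
                dropped = c;               // never made it into the top-n
            }
            if (dropped != null) other += dropped; // aggregate into "other"
        }
        return other;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> best = new PriorityQueue<>(); // min-heap of counts
        long other = otherTotal(new int[]{40, 5, 17, 8, 52, 3}, 3, best);
        System.out.println("top=" + best + " other=" + other); // other == 16
    }
}
```

This is exactly why the return value of insertWithOverflow() matters here: without knowing which element was evicted, the "other" bucket cannot be maintained incrementally.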
Re: Performance Improvement for Search using PriorityQueue
For (1) - I can't explain it but I've run into documents with 0.0f scores.
For (2) - this is simple logic - if the lowest score in the queue is 'x' and you want top docs only, then there's no point in attempting to insert a document with a score lower than 'x' (it will not be added).

Maybe I didn't understand your question correctly though ...

On Dec 11, 2007 2:25 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote:
> On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> > The current TopDocCollector only allocates a ScoreDoc when the given
> > score causes a new ScoreDoc to be added into the queue, but it does
>
> I actually wrote my own HitCollector and now wonder about TopDocCollector:
>
> public void collect(int doc, float score) {
>   if (score > 0.0f) {
>     totalHits++;
>     if (hq.size() < numHits || score >= minScore) {
>       hq.insert(new ScoreDoc(doc, score));
>       minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
>     }
>   }
> }
>
> 1) How can there be hits with score=0.0?
> 2) I don't understand minScore: inserts only document having a higher score
> than the lowest score already in queue?
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

--
Regards, Shai Erera
Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
Hi I attached two patch files (for "java" and "test"). Due to a problem in my checkout project in Eclipse, I don't have them under "src". I also added a test and modified two tests in TestStandardAnalyzer. On Dec 10, 2007 11:44 PM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote: > >[ > https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202] > > Grant Ingersoll commented on LUCENE-1068: > - > > Hmmm, maybe there is a way in Eclipse to make the path relative to the > working directory? Otherwise, from the command line in the Lucene > directory: svn diff > StandardTokenizer-4.patch > > -Grant > > > > -- > Grant Ingersoll > http://lucene.grantingersoll.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > Invalid behavior of StandardTokenizerImpl > > - > > > > Key: LUCENE-1068 > > URL: https://issues.apache.org/jira/browse/LUCENE-1068 > > Project: Lucene - Java > > Issue Type: Bug > > Components: Analysis > >Reporter: Shai Erera > >Assignee: Grant Ingersoll > > Attachments: StandardTokenizerImpl-2.patch, > StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch, > standardTokenizerImpl.patch > > > > > > The following code prints the output of StandardAnalyzer: > > Analyzer analyzer = new StandardAnalyzer(); > > TokenStream ts = analyzer.tokenStream("content", new > StringReader("")); > > Token t; > > while ((t = ts.next()) != null) { > > System.out.println(t); > > } > > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) > (which is correct in my opinion). > > However, if you pass "www.abc.com." (notice the extra '.' at the end), > the output is (wwwabccom,0,12,type=). > > I think the behavior in the second case is incorrect for several > reasons: > > 1. It recognizes the string incorrectly (no argue on that). > > 2. 
It kind of prevents you from putting URLs at the end of a sentence, > which is perfectly legal. > > 3. An ACRONYM, at least to the best of my understanding, is of the form > A.B.C. and not ABC.DEF. > > I looked at StandardTokenizerImpl.jflex and I think the problem comes > from this definition: > > // acronyms: U.S.A., I.B.M., etc. > > // use a post-filter to remove dots > > ACRONYM= {ALPHA} "." ({ALPHA} ".")+ > > Notice how the comment relates to acronym as U.S.A., I.B.M. and not > something else. I changed the definition to > > ACRONYM= {LETTER} "." ({LETTER} ".")+ > > and it solved the problem. > > This was also reported here: > > > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 > > > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- Regards, Shai Erera
[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1068: --- Attachment: StandardTokenizer-test-4.patch StandardTokenizer-java-4.patch Code files under the java and test packages. These should be applied under "src". > Invalid behavior of StandardTokenizerImpl > - > > Key: LUCENE-1068 > URL: https://issues.apache.org/jira/browse/LUCENE-1068 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Reporter: Shai Erera >Assignee: Grant Ingersoll > Attachments: StandardTokenizer-java-4.patch, > StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch, > StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch, > standardTokenizerImpl.patch > > > The following code prints the output of StandardAnalyzer: > Analyzer analyzer = new StandardAnalyzer(); > TokenStream ts = analyzer.tokenStream("content", new > StringReader("")); > Token t; > while ((t = ts.next()) != null) { > System.out.println(t); > } > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) > (which is correct in my opinion). > However, if you pass "www.abc.com." (notice the extra '.' at the end), the > output is (wwwabccom,0,12,type=). > I think the behavior in the second case is incorrect for several reasons: > 1. It recognizes the string incorrectly (no argument on that). > 2. It kind of prevents you from putting URLs at the end of a sentence, which > is perfectly legal. > 3. An ACRONYM, at least to the best of my understanding, is of the form > A.B.C. and not ABC.DEF. > I looked at StandardTokenizerImpl.jflex and I think the problem comes from > this definition: > // acronyms: U.S.A., I.B.M., etc. > // use a post-filter to remove dots > ACRONYM= {ALPHA} "." ({ALPHA} ".")+ > Notice how the comment relates to acronym as U.S.A., I.B.M. and not something > else. I changed the definition to > ACRONYM= {LETTER} "." ({LETTER} ".")+ > and it solved the problem. 
> This was also reported here: > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Caching FuzzyQuery
Hi!

Actually FuzzyQuery.rewrite() is pretty expensive, so why not introduce a caching decorator? A WeakHashMap with key == IndexReader and value == an LRU of BooleanQueries.

Timo

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
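A minimal sketch of what such a decorator's cache could look like, using only standard collections. The type parameters Reader, QueryKey and RewrittenQuery are placeholders for IndexReader, FuzzyQuery and BooleanQuery; the class name and LRU bound are assumptions, not Lucene API:

```java
// Per-reader rewrite cache: a WeakHashMap keyed by the reader (entries
// disappear once the reader is garbage-collected) whose values are small
// access-ordered LinkedHashMaps acting as LRU caches of rewritten queries.
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

class RewriteCache<Reader, QueryKey, RewrittenQuery> {
    private static final int LRU_SIZE = 64; // assumed bound; tune as needed

    private final Map<Reader, Map<QueryKey, RewrittenQuery>> cache =
        Collections.synchronizedMap(new WeakHashMap<Reader, Map<QueryKey, RewrittenQuery>>());

    /** Returns the cached rewritten query, or null on a miss. */
    RewrittenQuery get(Reader reader, QueryKey key) {
        Map<QueryKey, RewrittenQuery> lru = cache.get(reader);
        return lru == null ? null : lru.get(key);
    }

    void put(Reader reader, QueryKey key, RewrittenQuery rewritten) {
        cache.computeIfAbsent(reader, r ->
            Collections.synchronizedMap(
                // access-order = true turns LinkedHashMap into an LRU map
                new LinkedHashMap<QueryKey, RewrittenQuery>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<QueryKey, RewrittenQuery> e) {
                        return size() > LRU_SIZE; // evict least-recently-used
                    }
                })).put(key, rewritten);
    }
}
```

The decorator itself would then check get() before delegating to the expensive rewrite() and put() the result afterwards. Keying on the reader matters because a rewritten fuzzy query enumerates terms from a specific index snapshot and is stale for any other reader.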
Re: Performance Improvement for Search using PriorityQueue
On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> The current TopDocCollector only allocates a ScoreDoc when the given
> score causes a new ScoreDoc to be added into the queue, but it does

I actually wrote my own HitCollector and now wonder about TopDocCollector:

public void collect(int doc, float score) {
  if (score > 0.0f) {
    totalHits++;
    if (hq.size() < numHits || score >= minScore) {
      hq.insert(new ScoreDoc(doc, score));
      minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
    }
  }
}

1) How can there be hits with score=0.0?
2) I don't understand minScore: does it insert only documents having a higher score than the lowest score already in the queue?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
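The minScore shortcut asked about in (2) can be sketched as follows: once the queue is full its minimum can only rise, so any score below the current minimum can be skipped without touching the heap. This uses a plain java.util.PriorityQueue of scores in place of Lucene's ScoreDoc queue, so it is an illustration of the logic, not the actual TopDocCollector:

```java
// Collector sketch: totalHits counts every positive-scoring hit, but the
// heap is only touched when the score could actually enter the top-N.
import java.util.PriorityQueue;

class MinScoreCollector {
    final PriorityQueue<Float> hq = new PriorityQueue<>(); // min-heap of scores
    final int numHits;
    float minScore = Float.NEGATIVE_INFINITY;
    int totalHits = 0;

    MinScoreCollector(int numHits) { this.numHits = numHits; }

    void collect(float score) {
        if (score > 0.0f) {
            totalHits++;
            if (hq.size() < numHits || score >= minScore) {
                hq.add(score);
                if (hq.size() > numHits) hq.poll(); // drop the old minimum
                minScore = hq.peek();               // maintain minScore
            }
        }
    }
}
```

Note that the check is a pure float comparison; the saving is exactly the ScoreDoc allocation and heap insertion that would otherwise happen for every sub-minimum hit.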
[jira] Commented: (LUCENE-1081) Remove the "Experimental" warnings from search.function package
[ https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550419 ] Doron Cohen commented on LUCENE-1081: - {quote} I think we should resolve LUCENE-1085 first and move this to 2.4? {quote} Done. > Remove the "Experimental" warnings from search.function package > --- > > Key: LUCENE-1081 > URL: https://issues.apache.org/jira/browse/LUCENE-1081 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4 >Reporter: Doron Cohen >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.4 > > > I am using this package for a while, seems that others in this list use it > too, so let's remove those warnings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1081) Remove the "Experimental" warnings from search.function package
[ https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1081: Fix Version/s: (was: 2.3) 2.4 Affects Version/s: (was: 2.3) 2.4 > Remove the "Experimental" warnings from search.function package > --- > > Key: LUCENE-1081 > URL: https://issues.apache.org/jira/browse/LUCENE-1081 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4 >Reporter: Doron Cohen >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.4 > > > I am using this package for a while, seems that others in this list use it > too, so let's remove those warnings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Reopened: (LUCENE-944) Remove deprecated methods in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch reopened LUCENE-944: -- Lucene Fields: [Patch Available] (was: [Patch Available, New]) You are right, Grant. I will revert this for 2.3. Thanks for catching this!! > Remove deprecated methods in BooleanQuery > - > > Key: LUCENE-944 > URL: https://issues.apache.org/jira/browse/LUCENE-944 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Paul Elschot >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > Attachments: BooleanQuery20070626.patch > > > Remove deprecated methods setUseScorer14 and getUseScorer14 in BooleanQuery, > and adapt javadocs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1081) Remove the "Experimental" warnings from search.function package
[ https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550411 ] Michael Busch commented on LUCENE-1081: --- I think we should resolve LUCENE-1085 first and move this to 2.4? > Remove the "Experimental" warnings from search.function package > --- > > Key: LUCENE-1081 > URL: https://issues.apache.org/jira/browse/LUCENE-1081 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.3 >Reporter: Doron Cohen >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3 > > > I am using this package for a while, seems that others in this list use it > too, so let's remove those warnings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance Improvement for Search using PriorityQueue
Hi

Back from the experiments lab with more results. I've used two indexes (1 and 10 million documents) and ran the 2000 queries over both. Each run was executed 4 times and I paste here the average of the latest 3 (to eliminate any caching that is done by the OS and to mimic systems that are already working and therefore have some data in the OS cache). Following are the results:

Current TopDocCollector + PQ
----------------------------
Index Size        1M        10M
Avg. Time         8.519ms   289.232ms
Avg. Allocations  77.38     97.35
Avg. # results    51,113    461,019

Modified TopDocCollector + PQ
-----------------------------
Index Size        1M        10M
Avg. Time         9.619ms   298.197ms
Avg. Allocations  9.92      10.12
Avg. # results    51,113    461,019

Basically the results haven't changed from yesterday. There isn't any significant difference in the execution time of the two versions. The only difference is the number of allocations. Although the number of allocations is very small (100 for 461,000 results), I think it should not be neglected. On systems that rely solely on memory (such as powerful systems that are able to keep entire indexes in memory), the number of object allocations may be significant.

The way I see it, we can do either of the following:

1. Add the method to PQ and change TDC's implementation to reuse ScoreDocs. We gain only in the number of allocations. Basically, we don't lose anything by doing that, we only gain.
2. Add the method to PQ for applications that require it and not change TDC's implementation. For example, applications that want to show the 10 most recent documents from a very large collection need to run a MatchAllDocsQuery with some sorting. They may create a lot more instances of ScoreDoc.
3. Do nothing.

If you think I should run more tests, please let me know - I already have the two indexes and any further tests can be performed quite quickly.

Thanks,
Shai

On Dec 10, 2007 11:46 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 10-Dec-07, at 1:20 PM, Shai Erera wrote:
>
> > Thanks for the info. Too bad I use Windows ...
> Just allocate a bunch of memory and free it. This is Linux, but
> something similar should work on Windows:
>
> $ vmstat -S M
> procs ---memory--
> r b swpd free buff cache
> 0 0 0 45372786
>
> $ python -c '"a"*20'
>
> $ vmstat -S M
> procs ---memory--
> r b swpd free buff cache
> 0 0463 1761 0 6
>
> -Mike
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

--
Regards,
Shai Erera
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550376 ] Brian Pinkerton commented on LUCENE-753:

BTW, I think the performance win with Yonik's patch for some workloads could be far greater than what the simple benchmark illustrates. Sure, pread might be marginally faster. But the real win is avoiding synchronized access to the file.

I did some IO tracing a while back on one particular workload that is characterized by:

* a small number of large compound indexes
* short average execution time, particularly compared to disk response time
* a 99+% FS cache hit rate
* cache misses that tend to cluster on rare queries

In this workload where each query hits each compound index, the locking in FSIndexInput means that a single rare query clobbers the response time for all queries. The requests to read cached data are serialized (fairly, even) with those that hit the disk. While we can't get rid of the rare queries, we can allow the common ones to proceed against cached data right away.

> Use NIO positional read to avoid synchronization in FSIndexInput > > > Key: LUCENE-753 > URL: https://issues.apache.org/jira/browse/LUCENE-753 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Reporter: Yonik Seeley > Attachments: FileReadTest.java, FileReadTest.java, > FSIndexInput.patch, FSIndexInput.patch > > > As suggested by Doug, we could use NIO pread to avoid synchronization on the > underlying file. > This could mitigate any MT performance drop caused by reducing the number of > files in the index format. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
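The positional read Brian describes maps to Java NIO's FileChannel.read(ByteBuffer, long), the pread(2) analogue: since it takes an explicit offset instead of mutating a shared file position, concurrent readers need no lock. A minimal sketch (the file name and offsets below are arbitrary, and this is an illustration of the technique, not the FSIndexInput patch itself):

```java
// Positional read: each call passes its own offset, so multiple threads
// can read the same FileChannel concurrently without synchronization,
// unlike the seek-then-read pattern which must serialize on the shared
// file position.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PositionalReadDemo {
    /** Reads up to len bytes starting at offset, without touching the channel's position. */
    static byte[] preadStyle(FileChannel ch, long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            // read(dst, position) is the pread(2) equivalent: no seek, no lock
            int n = ch.read(buf, offset + buf.position());
            if (n < 0) break; // EOF before len bytes were available
        }
        return buf.array();
    }

    static byte[] readAt(Path p, long offset, int len) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            return preadStyle(ch, offset, len);
        }
    }
}
```

With this shape, a query reading a hot (cached) region never waits behind a rare query that is blocked on the disk, which is exactly the serialization Brian's trace showed.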
[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550351 ] Brian Pinkerton commented on LUCENE-753: Yeah, the file was full of zeroes. But I created the files w/o holes and was using filesystems that don't compress file contents. Just to be sure, though, I repeated the tests with a file with random contents; the results above still hold. > Use NIO positional read to avoid synchronization in FSIndexInput > > > Key: LUCENE-753 > URL: https://issues.apache.org/jira/browse/LUCENE-753 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Reporter: Yonik Seeley > Attachments: FileReadTest.java, FileReadTest.java, > FSIndexInput.patch, FSIndexInput.patch > > > As suggested by Doug, we could use NIO pread to avoid synchronization on the > underlying file. > This could mitigate any MT performance drop caused by reducing the number of > files in the index format. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]