[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699545#action_12699545 ] Uwe Schindler commented on LUCENE-831:
Hi, looks good. I am just not sure what the right caching ValueSource would be. If you use a caching value source externally from the IndexReader, what should I use? The original trie patch used CachingValueSource (when that patch was written, only CachingValueSource existed):
{code}
+ public static final ValueSource TRIE_VALUE_SOURCE = new CachingValueSource(new TrieValueSource());
{code}
But would the correct choice be CacheByReaderValueSource as a per-JVM singleton? For the tests it is not a problem, because there is only one index with one segment. If I use CachingValueSource as a singleton, would it cache values from all index readers mixed together?

Complete overhaul of FieldCache API/Implementation
Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Assignee: Mark Miller Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
Motivation:
1) Complete overhaul of the API/implementation of FieldCache type things...
a) eliminate the global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders)
b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed.
e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders).
2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699558#action_12699558 ] Uwe Schindler commented on LUCENE-1536:
How about adding to DocIdSet a
{code}
boolean isRandomAccess() { return false; }
{code}
that is implemented to return false in the default abstract class for backwards compatibility. If a DocIdSet is random access (backed by OpenBitSet, or is the empty iterator), isRandomAccess() is overridden to return true and an additional method in DocIdSet is implemented, whose default would be:
{code}
boolean acceptDoc(int docid) { throw new UnsupportedOperationException(); }
{code}
Both changes are backwards compatible, but filters using OpenBitSet would automatically be random access and support acceptDoc().

if a filter can support random access API, we should use it
Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch
I ran some performance tests, comparing applying a filter via the random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to an iterator was a very sizable performance hit. Some notes on the test:
* Index is the first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
* I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. "u s" means "united states" (phrase search).
* I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)).
* Method "high" means I use the random-access filter API in IndexSearcher's main loop. Method "low" means I use the random-access filter API down in SegmentTermDocs (just like deleted docs today).
* Baseline (QPS) is current trunk, where the filter is applied as an iterator up high (ie in IndexSearcher's search loop).
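A rough sketch of how search code might branch on such an API (isRandomAccess() and acceptDoc() exist only as proposed in the comment above; the Scorer/HitCollector wiring shown uses the pre-2.9 style and is purely illustrative):
{code}
// Sketch only: apply a filter either through the proposed random-access
// methods or through the existing DocIdSetIterator, whichever the set supports.
void searchWithFilter(Scorer scorer, DocIdSet docIdSet, HitCollector collector) throws IOException {
  if (docIdSet.isRandomAccess()) {          // proposed method
    while (scorer.next()) {
      int doc = scorer.doc();
      if (docIdSet.acceptDoc(doc)) {        // proposed method
        collector.collect(doc, scorer.score());
      }
    }
  } else {
    DocIdSetIterator filterIter = docIdSet.iterator();
    // leapfrog the scorer and the filter iterator together,
    // as IndexSearcher does on trunk today
  }
}
{code}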
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699571#action_12699571 ] Uwe Schindler commented on LUCENE-1536:
The empty docidset instance should *not* be random access :), so the only change would be for OpenBitSet to override these two new methods from the default abstract class:
{code}
boolean isRandomAccess() { return true; }
boolean acceptDoc(int docid) { return get(docid); /* possibly inlined */ }
{code}
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699573#action_12699573 ] Uwe Schindler commented on LUCENE-1536:
And the switch for different densities: OpenBitSet could calculate its density in isRandomAccess() and return true or false depending on the density factors above. The search code would then only check isRandomAccess() once initially (before starting filtering) and then switch between the iterator and the random-access API.
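To make the density idea concrete, here is a minimal sketch of an OpenBitSet-backed DocIdSet (the 1% threshold is an arbitrary placeholder, and isRandomAccess()/acceptDoc() are still only the methods proposed above):
{code}
// Sketch: only advertise random access when the filter is dense enough
// for the random-access path to pay off.
class OpenBitSetDocIdSet extends DocIdSet {
  private final OpenBitSet bits;
  private final int maxDoc;

  OpenBitSetDocIdSet(OpenBitSet bits, int maxDoc) {
    this.bits = bits;
    this.maxDoc = maxDoc;
  }

  public DocIdSetIterator iterator() {
    return bits.iterator();
  }

  public boolean isRandomAccess() {                  // proposed method
    double density = (double) bits.cardinality() / maxDoc;
    return density > 0.01;                           // placeholder threshold
  }

  public boolean acceptDoc(int docid) {              // proposed method
    return bits.fastGet(docid);
  }
}
{code}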
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1606:
Attachment: automaton.patch
patch
[jira] Created: (LUCENE-1606) Automaton Query/Filter (scalable regex)
Automaton Query/Filter (scalable regex)
Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: automaton.patch
Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon a constant prefix, and runs the same query in 640ms. Some use cases I envision:
1. lexicography/etc on large text corpora
2. looking for things such as URLs where the prefix is not constant (http:// or ftp://)
The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms, but instead of a binary accept/reject do:
1. Look at the portion that is OK (did not enter a reject state in the DFA)
2. Generate the next possible String and seek to that.
The Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
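For readers who have not used BRICS before, a minimal sketch of the acceptance side of this (the pattern is just an example; the seek-ahead trick that makes the term enumeration fast lives in the patch itself):
{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class BricsAcceptDemo {
  public static void main(String[] args) {
    // Build the DFA once; anchors are implicit, the whole term must match.
    Automaton a = new RegExp("(http|ftp)://.*").toAutomaton();
    RunAutomaton run = new RunAutomaton(a);

    // Accept/reject candidate index terms against the DFA.
    System.out.println(run.run("http://lucene.apache.org")); // true
    System.out.println(run.run("mailto:someone"));           // false
  }
}
{code}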
[jira] Issue Comment Edited: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699571#action_12699571 ] Uwe Schindler edited comment on LUCENE-1536 at 4/16/09 2:27 AM:
The empty docidset instance should *not* be random access :), so the only change would be for OpenBitSet to override these two new methods from the default abstract class:
{code}
boolean isRandomAccess() { return true; }
boolean acceptDoc(int docid) { return fastGet(docid); /* possibly inlined */ }
{code}
was (Author: thetaphi):
The empty docidset instance should *not* be random access :), so the only change would be for OpenBitSet to override these two new methods from the default abstract class:
{code}
boolean isRandomAccess() { return true; }
boolean acceptDoc(int docid) { return get(docid); /* possibly inlined */ }
{code}
Re: Filtering documents out of IndexReader
On Tue, Apr 14, 2009 at 9:25 PM, Jeremy Volkman jvolk...@gmail.com wrote:
> Implementing this way allows me to write RAM indexes out to disk without blocking readers, and only block readers when I need to remap any filtered docs that may have been updated or deleted during the flushing process. I think this may beat using a straight IW for my requirements, but I'm not positive yet.
I think testing out-of-the-box NRT's performance should be your next step: if it's sufficient, why bring in all the complexity of tracking these RAM indices?
> So I've currently got a SuppressedIndexReader extends FilterIndexReader, but due to 1483 and 1573 I had to implement IndexReader.getFieldCacheKey() to get any sort of decent search performance, which I'd rather not do since I'm aware it's only temporary.
It's temporary because it's needed for the current field cache API, which we hope to replace with LUCENE-831. Still, it will likely be shipped w/ 2.9 and then removed in 3.0. LUCENE-1313 aims to support the RAM buffering for real, for cases where performance of the current NRT is in fact limiting, but we still have some iterating to do with that one.
> Is it possible to perform a bunch of adds and deletes from an IW in an atomic action? Should I use addIndexesNoOptimize?
IW doesn't support this, so you'll have to synchronize externally to achieve it. Earlier patches on LUCENE-1313 did have a Transaction class for an atomic set of updates.
> If I go the filtered searcher direction, my filter will have to be aware of the portion of the MultiReader that corresponds to the disk index. Can I assume that my disk index will populate the lower portion of doc id space if it comes first in the list passed to the MultiReader constructor? The code says yes but the docs don't say anything.
This is true today, but it is an implementation detail that's free to change from release to release. Also, I'd worry about search performance of the filtered searcher approach, if that's an issue in your app.
Mike
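To illustrate the "synchronize externally" suggestion, a minimal sketch using only the plain IndexWriter API (the lock object and the reopen policy are application-level choices, not Lucene features):
{code}
// Sketch: make a batch of updates appear atomic to readers by applying the
// whole batch under one application lock and only reopening readers after
// the commit that ends the batch.
private final Object batchLock = new Object();

void applyBatch(IndexWriter writer, List<Document> adds, List<Term> deletes) throws IOException {
  synchronized (batchLock) {
    for (Term t : deletes) {
      writer.deleteDocuments(t);
    }
    for (Document d : adds) {
      writer.addDocument(d);
    }
    writer.commit(); // readers (re)opened after this point see the whole batch
  }
}
{code}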
[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699604#action_12699604 ] Michael McCandless commented on LUCENE-1591:
All tests pass! And the patch looks good. I'll commit shortly. Thanks Shai!

Enable bzip compression in benchmark
Key: LUCENE-1591 URL: https://issues.apache.org/jira/browse/LUCENE-1591 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Fix For: 2.9 Attachments: commons-compress-dev20090413.jar, commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch
bzip compression can aid the benchmark package by not requiring extraction of bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false and, in the relevant tasks, either decompress the input file or compress the output file using the bzip streams. It will add a dependency on ant.jar, which contains two classes similar to GZIPOutputStream and GZIPInputStream that compress/decompress files using the bzip algorithm. bzip is known to be superior in its compression performance to the gzip algorithm (~20% better compression), although it does the compression/decompression a bit slower. I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes, so it can be inherited by all sub-classes.
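For context, the decompression side of the approach looks roughly like this (the commons-compress class name below is from later released versions of the library and may differ slightly from the dev jar attached here; the file name is just an example):
{code}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Sketch: open an enwiki .bz2 dump without extracting it first, so a
// DocMaker can consume the decompressed stream directly.
InputStream openBzip2(String path) throws IOException {
  InputStream raw = new BufferedInputStream(new FileInputStream(path));
  return new BZip2CompressorInputStream(raw);
}
{code}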
[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699620#action_12699620 ] Shai Erera commented on LUCENE-1591:
Mike, did you commit the commons-compress jar too?
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699607#action_12699607 ] Michael McCandless commented on LUCENE-1604:
OK, the patch looks good. All tests pass, even if I temporarily default disableFakeNorms to true (but back-compat tests fail, which is expected and is why we won't flip the default until 3.0). Thanks Shon! I still need to test the perf cost of this change...

Stop creating huge arrays to represent the absence of field norms
Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch
Creating and keeping around huge arrays that hold a constant value is very inefficient, both from a heap usage standpoint and from a locality of reference standpoint. It would be much more efficient to use null to represent a missing norms table.
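The consuming side then needs a null check instead of a fake array; a sketch (Similarity.decodeNorm is the existing API; treating a missing norm as 1.0f is an assumption about how callers would handle the null case):
{code}
// Sketch: treat a missing norms array as "no norm" rather than allocating
// a maxDoc-sized array filled with the same byte.
byte[] norms = reader.norms("body");   // may now return null
float norm = (norms == null)
    ? 1.0f                             // assumed default when the field has no norms
    : Similarity.decodeNorm(norms[doc]);
{code}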
[jira] Resolved: (LUCENE-1591) Enable bzip compression in benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1591.
Resolution: Fixed
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1606:
Attachment: automatonWithWildCard.patch
Here is an updated patch with AutomatonWildCardQuery. This implements the standard Lucene wildcard query with AutomatonFilter. This accelerates quite a few wildcard situations, such as ??(a|b)?cd*ef. Sorry, it provides no help for a leading *, but definitely for a leading ?. All wildcard tests pass.
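A sketch of the kind of wildcard-to-automaton translation this implies (built here from BRICS primitives; the actual patch may construct it differently):
{code}
import java.util.ArrayList;
import java.util.List;
import dk.brics.automaton.Automaton;
import dk.brics.automaton.BasicAutomata;
import dk.brics.automaton.BasicOperations;

// Sketch: translate a Lucene wildcard pattern (? = any one char, * = any
// number of chars) into a DFA by concatenating per-character automata.
Automaton wildcardToAutomaton(String pattern) {
  List<Automaton> parts = new ArrayList<Automaton>();
  for (int i = 0; i < pattern.length(); i++) {
    char c = pattern.charAt(i);
    if (c == '?') {
      parts.add(BasicAutomata.makeAnyChar());
    } else if (c == '*') {
      parts.add(BasicAutomata.makeAnyString());
    } else {
      parts.add(BasicAutomata.makeChar(c));
    }
  }
  Automaton a = BasicOperations.concatenate(parts);
  a.determinize();
  return a;
}
{code}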
[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699643#action_12699643 ] Michael McCandless commented on LUCENE-1591:
bq. Mike, did you commit the commons-compress jar too?
Woops, forgot, and now fixed -- thanks for catching that!
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699644#action_12699644 ] Mark Miller commented on LUCENE-831:
Right, you really want to use CacheByReaderValueSource. Better would probably be to get that cache onto the segment reader as well. But I think that would mean bringing back some sort of general cache to IndexReader. You would have to be able to attach arbitrary ValueSources to the reader. We will see what ends up materializing. I am agonizingly slow at understanding anything, but quick to move anyway ;)
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699656#action_12699656 ] Michael McCandless commented on LUCENE-1593:
bq. if so, can we agree on the new names (add, updateTop)?
I think it makes sense to add these, returning the min value (and deprecate the old ones).

Optimizations to TopScoreDocCollector and TopFieldCollector
Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9
This is a spin-off of LUCENE-1575 and proposes to optimize the TSDC and TFC code to remove unnecessary checks. The plan is:
# Ensure that IndexSearcher returns segments in increasing doc Id order, instead of by numDocs().
# Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
# Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
# Also move to changing the top element and then calling adjustTop(), in case we update the queue.
# Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
# Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)?
I will post a patch as well as some perf measurements as soon as I have them.
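To illustrate why the sentinel values matter, a sketch of the collect() hot path they enable (add()/updateTop() are the names under discussion, not existing PriorityQueue methods; the in-place ScoreDoc reuse mirrors the current TSDC style):
{code}
// Sketch: with the queue pre-filled with sentinel ScoreDocs
// (score = Float.NEGATIVE_INFINITY), collect() needs no null or size checks.
class SentinelCollectorSketch {
  private final PriorityQueue hq;   // assumed pre-filled with sentinel entries
  private ScoreDoc pqTop;           // worst entry currently in the queue
  private int docBase;

  SentinelCollectorSketch(PriorityQueue hq) {
    this.hq = hq;
    this.pqTop = (ScoreDoc) hq.top();
  }

  void collect(int doc, float score) {
    if (score <= pqTop.score) {
      return;                       // cannot beat the current worst (or a sentinel)
    }
    pqTop.doc = doc + docBase;      // overwrite the worst entry in place
    pqTop.score = score;
    pqTop = (ScoreDoc) hq.updateTop();  // proposed: re-heapify and return the new worst
  }
}
{code}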
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699657#action_12699657 ] Robert Muir commented on LUCENE-1606:
Mark, yeah, the enumeration helps a lot; it means a lot fewer comparisons, plus brics is *FAST*. Inside the AutomatonFilter I describe how it could possibly be done better, but I was afraid I would mess it up. It's affected somewhat by the size of the alphabet, so if you were using it against lots of CJK text, it might be worth it to instead use the State/Transition objects in the package. Transitions are described by min and max character intervals and you can access intervals in sorted order... it's all so nice, but I figure this is a start.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699659#action_12699659 ] Michael McCandless commented on LUCENE-1606:
Can this do everything that RegexQuery currently does? (Ie, would we deprecate RegexQuery?)
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1606:
Fix Version/s: 2.9
[jira] Commented: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
[ https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699660#action_12699660 ] Michael McCandless commented on LUCENE-1603:
I think the name is good, so it's clear you have to provide a MultiTermQuery yourself (via subclass) to use it.

Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
Key: LUCENE-1603 URL: https://issues.apache.org/jira/browse/LUCENE-1603 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4, 2.9 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1603.patch, LUCENE-1603.patch, LUCENE-1603.patch
This is a patch that is needed for the MultiTermQuery rewrite of TrieRange (LUCENE-1602):
- Make the private members protected, to have access to them from the very special TrieRangeTermEnum
- Fix a small inconsistency (docFreq() now only returns a value if a valid term exists)
- Improvement of MultiTermFilter.getDocIdSet to return DocIdSet.EMPTY_DOCIDSET if the TermEnum is empty (less memory usage) and faster.
- Add getLastNumberOfTerms() to MultiTermQuery for statistics on different multi term queries and how many terms they affect; using this new functionality, the improvement of TrieRange can be shown (extract from the test case there, 1 docs index, long values):
{code}
[junit] Average number of terms during random search on 'field8':
[junit] Trie query: 244.2
[junit] Classical query: 3136.94
[junit] Average number of terms during random search on 'field4':
[junit] Trie query: 38.3
[junit] Classical query: 3018.68
[junit] Average number of terms during random search on 'field2':
[junit] Trie query: 18.04
[junit] Classical query: 3539.42
{code}
All core tests pass.
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699649#action_12699649 ] Uwe Schindler commented on LUCENE-831:
This was the idea behind the FieldType: You register the parsers/valuesources at the top-level IndexReader/MultiReader/whatever (e.g. in a map keyed by field), all subreaders would also get this map (passed through), and if one asks for cache values for a specific field, they would get the correctly decoded fields (from CSF, Univerter, TrieUniverter, Stored Fields [not really, but it would be possible]). This was the original approach of this issue: attach caching to the single index/segment readers (with the possibility to register value sources for specific fields). In this case the SortField ctors taking ValueSource or Parser can be dropped (and we can do this for 2.9, as the Parser ctor of SortField was not yet released!).
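Purely to make that registration idea concrete, a hypothetical sketch (none of this API exists; the map-taking open() overload, TrieValueSource and UninvertingValueSource are only stand-ins for the shape being discussed):
{code}
// Hypothetical: register per-field value sources when opening a reader, so
// every SegmentReader decodes and caches those fields consistently.
Map<String, ValueSource> sources = new HashMap<String, ValueSource>();
sources.put("price", new TrieValueSource(8));        // hypothetical trie decoder
sources.put("title", new UninvertingValueSource());  // hypothetical "Univerter"

IndexReader reader = IndexReader.open(dir, sources); // hypothetical overload
// A SortField on "price" would then pick up the trie decoding automatically,
// with no per-SortField ValueSource or Parser override needed.
{code}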
[jira] Commented: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
[ https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699647#action_12699647 ] Michael McCandless commented on LUCENE-1603:
Patch looks good -- I'll commit shortly. Thanks Uwe!
[jira] Commented: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
[ https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699654#action_12699654 ] Uwe Schindler commented on LUCENE-1603:
Do you think the name is good? MultiTermQueryWrapperFilter, or simpler MultiTermFilter? It's not really either one; it's a mix between a wrapper and the real filter: it wraps the query, but does the getDocIdSet and TermEnums itself.
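Whatever the final name, usage from the TrieRange side would presumably look something like this (a sketch; the protected constructor and the TrieRangeQuery class are assumptions based on the patches discussed in this issue and LUCENE-1602):
{code}
// Sketch: a concrete filter is obtained by handing the wrapper a
// MultiTermQuery subclass; the wrapper drives the TermEnum and builds
// the DocIdSet itself.
public class TrieRangeFilter extends MultiTermQueryWrapperFilter {
  public TrieRangeFilter(TrieRangeQuery query) {  // TrieRangeQuery extends MultiTermQuery
    super(query);
  }
}
{code}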
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699663#action_12699663 ] Mark Miller commented on LUCENE-831:
That's somewhat possible now (with the exception that you can't yet set the value source for the segment reader - it would likely become an argument to the static open methods): ValueSource gets a field as an argument, so it is also easy enough to set a ValueSource that does trie encoding for arbitrary fields on the SegmentReader, e.g. FieldTypeValueSource could take arguments to configure it per field, and then you set it on the IndexReader when you open it. That's all still in the patch - it's just a bit more of a pain than being able to set it at any time on the SortField as an override. I guess I almost see things going just to the segment reader ValueSource option though - once FieldCache goes back to standard, it might make sense to drop the SortField ValueSource support too, and just do the segment ValueSource. Being able to init the SegmentReader with a ValueSource really allows for anything needed - I just wasn't sure if it was too much of a pain in comparison to also having a dynamic SortField override.
[jira] Updated: (LUCENE-1602) Rewrite TrieRange to use MultiTermQuery
[ https://issues.apache.org/jira/browse/LUCENE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1602:
Attachment: LUCENE-1602.patch
This is the final patch, with the changes for LUCENE-1603. I also added svn:eol-style to all files in trie and test-trie. Because LUCENE-1603 is not yet committed, the patch may still fail to apply, but I will commit in the next few hours.

Rewrite TrieRange to use MultiTermQuery
Key: LUCENE-1602 URL: https://issues.apache.org/jira/browse/LUCENE-1602 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, queries.zip, queries.zip
Issue for discussion here: http://www.lucidimagination.com/search/document/46a548a79ae9c809/move_trierange_to_core_module_and_integration_issues
This patch is a rewrite of TrieRange using MultiTermQuery, like all other core queries. This should make TrieRange identical in functionality to the core range queries.
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699669#action_12699669 ] Michael McCandless commented on LUCENE-1536:
I like this approach! But should we somehow decouple the density check from the is-random-access check? Ie, isRandomAccess should return true or false based on the underlying data structure. Then, somehow, I think the search code should determine whether a given DocIdSet should be randomly accessed vs iterated? (I'm not sure how yet!) Also, we somehow need the mechanism to denormalize the application of the filter from top to bottom, ie, each leaf TermQuery involved in the full query needs to know to apply the random access filter just like it applies deletes.
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1606: Attachment: automatonWithWildCard2.patch Oops, I did say in the javadocs that the score is constant / boost only, so when the wildcard pattern contains no wildcards and rewrites to a TermQuery, wrap it with ConstantScoreQuery(QueryWrapperFilter(...)) to ensure this. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
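For reference, the guard Robert describes could look roughly like the following sketch (field, pattern and the containsWildcard() helper are placeholders, not the actual patch code):
{code}
// Sketch: when the pattern has no wildcard characters the query degenerates to a
// single term, but the javadocs promise constant-score / boost-only behavior,
// so wrap the TermQuery instead of returning it directly.
public Query rewrite(IndexReader reader) throws IOException {
  if (!containsWildcard(pattern)) {                     // hypothetical helper
    Query wrapped = new ConstantScoreQuery(
        new QueryWrapperFilter(new TermQuery(new Term(field, pattern))));
    wrapped.setBoost(getBoost());                       // keep the boost, drop tf/idf scoring
    return wrapped;
  }
  return this;                                          // normal automaton enumeration path
}
{code}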
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699662#action_12699662 ] Robert Muir commented on LUCENE-1606: - Mike the thing it cant do is stuff that cannot be determinized. However I think you only need an NFA for capturing group related things: http://oreilly.com/catalog/regex/chapter/ch04.html One thing is that the brics syntax is a bit different. i.e. ^ and $ are implied and I think some things need to be escaped. So I think it can do everything RegexQuery does, but maybe different syntax is required. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
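To illustrate the syntax point: BRICS patterns are matched against the whole string, so ^ and $ are implied rather than written. A small sketch, assuming the dk.brics.automaton RegExp/RunAutomaton classes behave as documented:
{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

// "http://.*" is implicitly anchored: it matches strings that start with "http://",
// with no separate ^ or $ syntax as in java.util.regex.
Automaton a = new RegExp("http://.*").toAutomaton();
RunAutomaton r = new RunAutomaton(a);
System.out.println(r.run("http://lucene.apache.org"));     // true
System.out.println(r.run("see http://lucene.apache.org"));  // false: no implicit leading .*
{code}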
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699650#action_12699650 ] Mark Miller commented on LUCENE-1606: - Very nice Robert. This looks like it would make a very nice addition to our regex support. Found the benchmarks here quite interesting: http://tusker.org/regex/regex_benchmark.html (though it sounds like your special enumeration technique makes this regex imp even faster for our uses?) Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: automaton.patch, automatonWithWildCard.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
[ https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1603. Resolution: Fixed Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement Key: LUCENE-1603 URL: https://issues.apache.org/jira/browse/LUCENE-1603 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4, 2.9 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1603.patch, LUCENE-1603.patch, LUCENE-1603.patch This is a patch that is needed for the MultiTermQuery rewrite of TrieRange (LUCENE-1602): - Make the private members protected, to have access to them from the very special TrieRangeTermEnum - Fix a small inconsistency (docFreq() now only returns a value if a valid term exists) - Improve MultiTermFilter.getDocIdSet to return DocIdSet.EMPTY_DOCIDSET if the TermEnum is empty (less memory usage, and faster) - Add getLastNumberOfTerms() to MultiTermQuery for statistics on different multi-term queries and how many terms they affect; using this new functionality, the improvement of TrieRange can be shown (extract from the test case there, 1 docs index, long values): {code} [junit] Average number of terms during random search on 'field8': [junit] Trie query: 244.2 [junit] Classical query: 3136.94 [junit] Average number of terms during random search on 'field4': [junit] Trie query: 38.3 [junit] Classical query: 3018.68 [junit] Average number of terms during random search on 'field2': [junit] Trie query: 18.04 [junit] Classical query: 3539.42 {code} All core tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
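The getDocIdSet() improvement mentioned in the list above amounts to an early exit along these lines (a sketch of the idea, not the committed code):
{code}
// Sketch: if the enumeration is already exhausted, return the shared empty set
// instead of allocating an OpenBitSet sized to the whole index.
// "query" stands for the enclosing MultiTermQuery.
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  TermEnum enumerator = query.getEnum(reader);
  try {
    if (enumerator.term() == null) {
      return DocIdSet.EMPTY_DOCIDSET;          // no matching terms at all
    }
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    // ... walk the enum and OR each term's documents into bits ...
    return bits;
  } finally {
    enumerator.close();
  }
}
{code}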
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699672#action_12699672 ] Uwe Schindler commented on LUCENE-1606: --- I looked into the patch; it looks good. Maybe it would be good to make the new AutomatonRegExQuery a subclass of MultiTermQuery. As you also seek/exchange the TermEnum, the needed FilteredTermEnum may be a little bit complicated, but you could do it the same way I will commit soon for TrieRange (LUCENE-1602). The latest changes from LUCENE-1603 make it possible to write a FilteredTermEnum that hands over to differently positioned TermEnums, like you do. With MultiTermQuery you get everything for free: constant score, Boolean rewrite and optionally the Filter (which is not needed here, I think). And: you could also override difference() in FilteredTermEnum to rank the hits. A note: the FilteredTermEnum created by TrieRange is not necessarily ordered correctly according to Term.compareTo(), but this is not really needed for MultiTermQuery. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
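A rough outline of the MultiTermQuery route Uwe is suggesting; class and constructor names are illustrative only (AutomatonTermEnum in particular is hypothetical here), and the subclass inherits constant-score and BooleanQuery rewrite behavior from the base class:
{code}
// Sketch: the subclass only has to supply a FilteredTermEnum; MultiTermQuery
// handles constant-score rewrite, Boolean rewrite and the optional Filter.
public class AutomatonQuery extends MultiTermQuery {
  private final String field;
  private final Automaton automaton;   // dk.brics.automaton DFA

  public AutomatonQuery(String field, Automaton automaton) {
    this.field = field;
    this.automaton = automaton;
  }

  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    // Would do the seek-ahead enumeration, internally exchanging positioned
    // TermEnums as the LUCENE-1603 changes now allow.
    return new AutomatonTermEnum(reader, field, automaton);
  }

  public String toString(String field) {
    return "automaton(" + this.field + ")";
  }
}
{code}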
TermEnum.skipTo()
While I was mucking with term enumeration I found that TermEnum.skipTo() has a very simple implementation and says in its javadocs that 'some implementations are considerably more efficient', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes, to skip around I simply close the term enum and get a new one from the IndexReader at a different starting point. Not that I want to touch it, just mentioning I thought it was a little non-obvious that skipTo() is so inefficient: it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com
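For context, the base-class implementation Robert is describing, and the close-and-reopen workaround, look roughly like this (the reopen lines are a sketch; nextText is a placeholder for whatever target you want to jump to):
{code}
// TermEnum.skipTo() as shipped: a plain linear scan over the enumeration.
public boolean skipTo(Term target) throws IOException {
  do {
    if (!next()) {
      return false;
    }
  } while (target.compareTo(term()) > 0);
  return true;
}

// The workaround: close the enum and reopen it positioned at the new target,
// which lets IndexReader.terms(Term) use the term index to seek.
enumerator.close();
enumerator = reader.terms(new Term(field, nextText));
{code}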
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699673#action_12699673 ] Robert Muir commented on LUCENE-1606: - Uwe, I agree with you, with one caveat: for this functionality to work the Enum must be ordered correctly according to Term.compareTo(). Otherwise it will not work correctly... Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699674#action_12699674 ] Shon Vella commented on LUCENE-1604: Working on an update to the patch - MultiSegmentReader needs to set disableFakeNorms transitively to its subReaders as well as to new subReaders on reopen. Stop creating huge arrays to represent the absense of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a localility of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
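For readers following along, the practical consequence of representing missing norms as null is that consumers have to null-check instead of blindly indexing into a faked all-ones array; a minimal sketch (field and doc are placeholders):
{code}
byte[] norms = reader.norms(field);
float norm = (norms == null)
    ? 1.0f                                  // field omits norms: treat as neutral
    : Similarity.decodeNorm(norms[doc]);    // normal case: decode the stored byte
{code}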
Re: TermEnum.skipTo()
Robert Muir wrote: while I was mucking with term enumeration i found that TermEnum.skipTo() has a very simple implementation and has in javadocs that 'some implementations are considerably more efficent', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes to skip around i simply close the term enum and get a new one from the indexReader at a different starting point. Not that I want to touch it, just mentioning i thought it was a little non-obvious that skipTo() is so inefficient, it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com mailto:rcm...@gmail.com Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592 -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699675#action_12699675 ] Uwe Schindler commented on LUCENE-1536: --- I coupled the density check inside the OpenBitSet, because the internals of OpenBitSet are responsible for determining how fast a sequential vs. random approach is. Maybe someone invents a new hyper-bitset that can do sequential access faster even in sparsely filled bitsets (e.g. a fragmented bitset, or a bitset with an RDBMS-like index). In that case, it has the responsibility to say: if the density is between this and that, I would use sequential access. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
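A sketch of the coupling Uwe describes, where the bit set itself decides whether random access is worthwhile; the 1% threshold and the isRandomAccess()/acceptDoc() methods are illustrative, not committed API:
{code}
// Sketch: a DocIdSet backed by an OpenBitSet that reports itself as random
// access only when it is dense enough for per-doc get() to beat iteration.
public class OpenBitSetDocIdSet extends DocIdSet {
  private final OpenBitSet bits;
  private final int maxDoc;

  public OpenBitSetDocIdSet(OpenBitSet bits, int maxDoc) {
    this.bits = bits;
    this.maxDoc = maxDoc;
  }

  public DocIdSetIterator iterator() {
    return bits.iterator();
  }

  public boolean isRandomAccess() {              // proposed API, not committed
    return bits.cardinality() >= maxDoc / 100;   // purely illustrative threshold
  }

  public boolean acceptDoc(int docid) {          // proposed API, not committed
    return bits.get(docid);
  }
}
{code}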
Re: Lucene 2.9 status (to port to Lucene.Net)
Hi George, There's been a sudden burst of activity lately on 2.9 development... I know there are some biggish remaining features we may want to get into 2.9: * The new field cache (LUCENE-831; still being iterated/mulled), * Possible major rework of Field / Document index-time vs search-time Document * Applying filters via the random-access API when possible and performant (LUCENE-1536) * Possible further optimizations to how collection works (LUCENE-1593) * Maybe breaking core + contrib into a more uniform set of modules (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here) -- the Modularization uber-thread. * Further improvements to near-realtime search (using RAMDir for small recently flushed segments) * Many other small things and probably some big ones that I'm forgetting now :) So things are still in flux, and I'm really not sure on a release date at this point. Late last year, I was hoping for early this year, but it's no longer early this year ;) Mike On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote: Hi Folks, This is George Aroush, I'm one of the committers on Lucene.Net - a port of Java Lucene to C# Lucene. I'm looking at the current trunk code of the yet-to-be-released Lucene 2.9 and I would like to port it to Lucene.Net. If I do this now, we get the benefit of keeping our code base and release dates much closer to Java Lucene. However, this comes with the cost of carrying over unfinished work and known defects, and I have to keep an eye on new code that gets committed into Java Lucene, which must be ported over in a timely fashion. To help me determine when is a good time to start the port -- keep in mind, I will be taking the latest code off SVN -- I'd like to hear from the Java Lucene committers (and users who are playing with or using Lucene 2.9 off SVN) about these questions: 1) how stable is the current code in the trunk, 2) do you still have feature work to deliver or just bug fixes, and 3) what's your target date to release Java Lucene 2.9? #1 is important -- in particular, is anyone using it in production? Yes, I did look at the current open issues in JIRA, but that doesn't help me answer the above questions. Regards, -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699676#action_12699676 ] Uwe Schindler commented on LUCENE-1606: --- It will work, that was what I said. For MultiTermQuery it does *not* have to be ordered; the ordering is irrelevant for it, because MultiTermQuery only enumerates the terms. TrieRange is an example of that: the order of terms is not guaranteed to be correct (it is at the moment because of the internal implementation of splitLongRange(), but I tested it with the inverse order and it still worked). If you want to use the enum for something else, it will fail. The filters inside MultiTermQuery and the BooleanQuery do not need the terms to be ordered. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699678#action_12699678 ] Mark Miller commented on LUCENE-831: So I'm flopping around on this, but I guess my latest take is that: I want to drop the SortField ValueSource override option. Everything would need to be handled by overriding the segment reader ValueSource. Drop the current back compat code for FieldCache - its mostly unnecessary I think. Instead, perhaps go back to orig FieldCache impl, except if the Reader is a segment reader, use the new ValueSource API ? Grrr - except if someone has mucked with the ValueSource or used a custom FieldCache Parser, it won't match correctly...thats it - you just can't straddle the two APIs. So I'll revert FieldCache to its former self and just deprecate. Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Assignee: Mark Miller Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completley independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundent caching as client code migrades to new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699680#action_12699680 ] Michael McCandless commented on LUCENE-1536: OK, if we do choose to couple, maybe we should name it useRandomAccess()? Another filter optimization that'd be nice to get in is to somehow know that a filter has pre-incorporated deleted documents. This way, once we have a solution for the push filter down to all TermScorers, we could have it only check the filter and not also deleted docs. (This is one of the optimizations in LUCENE-1594). We might eventually want/need some sort of external FilterManager that would handle this (ie, convert a filter to sparse vs random-access as appropriate, multiply in deleted docs, handle caching, etc). if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
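One way to "pre-incorporate" deletions, as Mike suggests, is to fold them into the random-access filter bits once, up front, so the leaf scorers consult a single structure. A simplistic sketch (a real FilterManager would presumably do this more cheaply than a per-doc loop):
{code}
// Sketch: AND-NOT the deleted docs into a copy of the filter's bits.
OpenBitSet bits = (OpenBitSet) filterBits.clone();
if (reader.hasDeletions()) {
  int maxDoc = reader.maxDoc();
  for (int doc = 0; doc < maxDoc; doc++) {
    if (reader.isDeleted(doc)) {
      bits.fastClear(doc);
    }
  }
}
// bits can now be handed to the TermScorers as the single accept/reject check.
{code}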
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699685#action_12699685 ] Robert Muir commented on LUCENE-1606: - Uwe, i'll look and see how you do it for TrieRange. if it can make the code for this simpler that will be fantastic. maybe by then I will have also figured out some way to cleanly and non-recursively use min/max character intervals in the state machine to decrease the amount of seeks and optimize a little bit. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1602) Rewrite TrieRange to use MultiTermQuery
[ https://issues.apache.org/jira/browse/LUCENE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1602. --- Resolution: Fixed Committed revision 765618. Rewrite TrieRange to use MultiTermQuery --- Key: LUCENE-1602 URL: https://issues.apache.org/jira/browse/LUCENE-1602 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, queries.zip, queries.zip Issue for discussion here: http://www.lucidimagination.com/search/document/46a548a79ae9c809/move_trierange_to_core_module_and_integration_issues This patch is a rewrite of TrieRange using MultiTermQuery like all other core queries. This should make TrieRange identical in functionality to core range queries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699690#action_12699690 ] Uwe Schindler commented on LUCENE-1606: --- I committed TrieRange revision 765618. You can see the impl here: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/trie/TrieRangeTermEnum.java?view=markup Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: TermEnum.skipTo()
Mark Miller wrote: Robert Muir wrote: while I was mucking with term enumeration i found that TermEnum.skipTo() has a very simple implementation and has in javadocs that 'some implementations are considerably more efficent', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes to skip around i simply close the term enum and get a new one from the indexReader at a different starting point. Not that I want to touch it, just mentioning i thought it was a little non-obvious that skipTo() is so inefficient, it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com mailto:rcm...@gmail.com Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592 I've changed Some implementations are considerably more efficient than that. to Some implementations *could* be considerably more efficient than a linear scan. Check the implementation to be sure. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1592) fix or deprecate TermsEnum.skipTo
[ https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1592: -- Summary: fix or deprecate TermsEnum.skipTo (was: fix or deprecate TermsEnum.seek) fix or deprecate TermsEnum.skipTo - Key: LUCENE-1592 URL: https://issues.apache.org/jira/browse/LUCENE-1592 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor Fix For: 2.9 This method is a trap: it looks legitimate but it has hideously poor performance (simple linear scan implemented in the TermsEnum base class since none of the concrete impls override it with a more efficient implementation). The least we should do for 2.9 is deprecate the method with a strong warning about its performance. See here for background: http://www.lucidimagination.com/search/document/77dc4f8e893d3cf3/possible_terminfosreader_speedup And, here for historical context: http://www.lucidimagination.com/search/document/88f1b95b404ebf16/remove_termenum_skipto_term_target -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699693#action_12699693 ] Robert Muir commented on LUCENE-1606: - Uwe, thanks. I'll think on this and on other improvements. I'm not really confident I can make the code much cleaner at the end of the day, but it could be more efficient and get some things for free, as you say. For now it is working much better than a linear scan, and the improvements won't change the order, but might help a bit. Do you think I should do this work under this issue or create a separate issue? Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1592) fix or deprecate TermsEnum.skipTo
[ https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699696#action_12699696 ] Mark Miller commented on LUCENE-1592: - I made a quick update to the javadoc so it's a bit less misleading, but this still needs to be resolved in a stronger manner, à la this issue. fix or deprecate TermsEnum.skipTo - Key: LUCENE-1592 URL: https://issues.apache.org/jira/browse/LUCENE-1592 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor Fix For: 2.9 This method is a trap: it looks legitimate but it has hideously poor performance (simple linear scan implemented in the TermsEnum base class since none of the concrete impls override it with a more efficient implementation). The least we should do for 2.9 is deprecate the method with a strong warning about its performance. See here for background: http://www.lucidimagination.com/search/document/77dc4f8e893d3cf3/possible_terminfosreader_speedup And, here for historical context: http://www.lucidimagination.com/search/document/88f1b95b404ebf16/remove_termenum_skipto_term_target -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699697#action_12699697 ] Uwe Schindler commented on LUCENE-1606: --- Let's stay with this issue! Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: TermEnum.skipTo()
Maybe we should deprecate it? Mike On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com wrote: Mark Miller wrote: Robert Muir wrote: while I was mucking with term enumeration i found that TermEnum.skipTo() has a very simple implementation and has in javadocs that 'some implementations are considerably more efficent', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes to skip around i simply close the term enum and get a new one from the indexReader at a different starting point. Not that I want to touch it, just mentioning i thought it was a little non-obvious that skipTo() is so inefficient, it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com mailto:rcm...@gmail.com Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592 I've changed Some implementations are considerably more efficient than that. to Some implementations *could* be considerably more efficient than a linear scan. Check the implementation to be sure. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: TermEnum.skipTo()
I think it's a convenient method. Even if not performing, it's still more convenient than forcing everyone who wants to use it to implement it by himself. Perhaps a better implementation will exist in the future, and thus everyone who'll use this method will be silently upgraded. Maybe such a better implementation should be considered? On Thu, Apr 16, 2009 at 4:46 PM, Michael McCandless luc...@mikemccandless.com wrote: Maybe we should deprecate it? Mike On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com wrote: Mark Miller wrote: Robert Muir wrote: while I was mucking with term enumeration i found that TermEnum.skipTo() has a very simple implementation and has in javadocs that 'some implementations are considerably more efficent', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes to skip around i simply close the term enum and get a new one from the indexReader at a different starting point. Not that I want to touch it, just mentioning i thought it was a little non-obvious that skipTo() is so inefficient, it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com mailto:rcm...@gmail.com Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592 I've changed Some implementations are considerably more efficient than that. to Some implementations *could* be considerably more efficient than a linear scan. Check the implementation to be sure. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: TermEnum.skipTo()
That would be great... we need someone to pull a patch together (for SegmentReader Multi*Reader to implement it efficiently). Mike On Thu, Apr 16, 2009 at 9:50 AM, Shai Erera ser...@gmail.com wrote: I think it's a convenient method. Even if not performing, it's still more convenient than forcing everyone who wants to use it to implement it by himself. Perhaps a better implementation will exist in the future, and thus everyone who'll use this method will be silently upgraded. Maybe such a better implementation should be considered? On Thu, Apr 16, 2009 at 4:46 PM, Michael McCandless luc...@mikemccandless.com wrote: Maybe we should deprecate it? Mike On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com wrote: Mark Miller wrote: Robert Muir wrote: while I was mucking with term enumeration i found that TermEnum.skipTo() has a very simple implementation and has in javadocs that 'some implementations are considerably more efficent', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes to skip around i simply close the term enum and get a new one from the indexReader at a different starting point. Not that I want to touch it, just mentioning i thought it was a little non-obvious that skipTo() is so inefficient, it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com mailto:rcm...@gmail.com Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592 I've changed Some implementations are considerably more efficient than that. to Some implementations *could* be considerably more efficient than a linear scan. Check the implementation to be sure. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
I wanna contribute a Chinese analyzer to lucene
Hi All! I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language; it's called imdict-chinese-analyzer, as it is a subproject of imdict (http://www.imdict.net/), which is an intelligent online dictionary. The project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/ In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am) 中国人 (Chinese), not as 我 是中 国人. So the analyzer must segment each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously! There are two analyzer packages in the Apache repository which can handle Chinese, ChineseAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/) and CJKAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/), but they take each character or every two adjoining characters as a single word, which is obviously not how the language works; this strategy also increases the index size and hurts performance badly. The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize a Chinese sentence in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper on the HHMM-based Chinese lexical analyzer ICTCLAS (http://www.nlp.org.cn/project/project.php?proj_id=6). imdict-chinese-analyzer is a really fast, intelligent Chinese analyzer for Lucene, written in Java, and I want to share this project with everyone using Lucene. This Analyzer contains two parts, the source code and the lexical dictionary. I want to publish the source code under the Apache license, but the dictionary, which is under an ambiguous license, was not created by me. So, can I submit only the source code to the Lucene contrib repository, and let users download the dictionary from the Google Code site? Please help me with this contribution.
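For readers unfamiliar with how such an analyzer plugs into Lucene, the integration surface is small; the sketch below is generic, and all class names (SmartChineseAnalyzer, SegmentingTokenizer, WordSegmenter) are stand-ins rather than the actual imdict-chinese-analyzer classes:
{code}
// Generic sketch of a word-segmenting Chinese analyzer.
public class SmartChineseAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // The tokenizer asks the HMM segmenter for word boundaries, so that
    // 我是中国人 is emitted as 我 / 是 / 中国人 rather than characters or bigrams.
    return new SegmentingTokenizer(reader, new WordSegmenter());
  }
}

// Usage at indexing time (dir is some Directory):
IndexWriter writer = new IndexWriter(dir, new SmartChineseAnalyzer(), true,
    IndexWriter.MaxFieldLength.UNLIMITED);
{code}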
[jira] Issue Comment Edited: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699714#action_12699714 ] Shon Vella edited comment on LUCENE-1604 at 4/16/09 7:16 AM: - Setting disableFakeNorms transitively isn't really needed because MultiSegmentReader doesn't make any calls to the subreaders that would cause it to create it's own fake norms. We probably ought to preserve the flag on clone() and reopen() though, which is going to be a little messy because IndexReader doesn't really implement either so it would have to be handled at the root of each concrete class hierarchy that does implement those. Any thoughts on whether we need this or not? was (Author: svella): Setting disableFakeNorms transitively isn't really needed because MultiSegmentReader doesn't make any calls to the subreaders that would cause it to create it's own fake norms. We probably ought to preserve the flag on clone() and reopen() though, which is going to be a little messy because IndexReader doesn't really implement either so it would have to be handled at the root of each concrete class hierarchy that does implement those. Any thoughts? Stop creating huge arrays to represent the absense of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a localility of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699714#action_12699714 ] Shon Vella commented on LUCENE-1604: Setting disableFakeNorms transitively isn't really needed because MultiSegmentReader doesn't make any calls to the subreaders that would cause it to create it's own fake norms. We probably ought to preserve the flag on clone() and reopen() though, which is going to be a little messy because IndexReader doesn't really implement either so it would have to be handled at the root of each concrete class hierarchy that does implement those. Any thoughts? Stop creating huge arrays to represent the absense of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a localility of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: I wanna contribute a Chinese analyzer to lucene
I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language, it's called imdict-chinese-analyzer as it is a subproject of imdict (http://www.imdict.net/), which is an intelligent online dictionary. The project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/ I took a quick look, but didn't see any code posted there yet. [snip] This Analyzer contains two packages, the source code and the lexical dictionary. I want to publish the source code using Apache license, but the dictionary which is under an ambigus license was not create by me. So, can I only submit the source code to lucene contribution repository, and let the users download the dictionary from the google code site? I believe your code can be a contrib, with a reference to the dictionary. So a first step would be to open an issue in Lucene's Jira (http://issues.apache.org/jira/browse/LUCENE), and post your source as a patch. The best way to get the right answer to the legal issue is to post it to the legal-disc...@apache.org list (join it first), as Apache's lawyers can then respond to your specific question. -- Ken -- Ken Krugler +1 530-210-6378
Re: TermEnum.skipTo()
+1 on further handling (LUCENE-1592). I just wanted to get a doc change in now rather than wait for that to complete. The statement that some implementations provide more efficient impls is very misleading (it's almost an assertion that one exists) when no impls that ship with Lucene in fact do. On Thu, Apr 16, 2009 at 9:57 AM, Michael McCandless luc...@mikemccandless.com wrote: That would be great... we need someone to pull a patch together (for SegmentReader and Multi*Reader to implement it efficiently). Mike On Thu, Apr 16, 2009 at 9:50 AM, Shai Erera ser...@gmail.com wrote: I think it's a convenient method. Even if it doesn't perform well, it's still more convenient than forcing everyone who wants to use it to implement it by himself. Perhaps a better implementation will exist in the future, and thus everyone who uses this method will be silently upgraded. Maybe such a better implementation should be considered? On Thu, Apr 16, 2009 at 4:46 PM, Michael McCandless luc...@mikemccandless.com wrote: Maybe we should deprecate it? Mike On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com wrote: Mark Miller wrote: Robert Muir wrote: While I was mucking with term enumeration I found that TermEnum.skipTo() has a very simple implementation and says in its javadocs that 'some implementations are considerably more efficient', yet SegmentTermEnum definitely doesn't reimplement it in a more efficient way. For my purposes, to skip around I simply close the term enum and get a new one from the IndexReader at a different starting point. Not that I want to touch it, just mentioning I thought it was a little non-obvious that skipTo() is so inefficient; it keeps enumerating until compareTo() returns what it wants... -- Robert Muir rcm...@gmail.com Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592 I've changed Some implementations are considerably more efficient than that. to Some implementations *could* be considerably more efficient than a linear scan. Check the implementation to be sure. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
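For readers following the thread: the default TermEnum.skipTo() really is a linear scan driven by next() and compareTo(). A rough sketch of that shape (illustrative, not the verbatim Lucene source):
{code}
// Sketch of the default linear-scan behavior being discussed (not the exact source).
public boolean skipTo(Term target) throws IOException {
  do {
    if (!next())          // advance one term at a time...
      return false;       // ...and give up at the end of the enumeration
  } while (target.compareTo(term()) > 0);   // ...until we reach or pass the target
  return true;
}
{code}
Robert's workaround of closing the enum and asking the IndexReader for a new one at a different starting point (e.g. IndexReader.terms(new Term(field, text))) lets the term index do the seek instead of scanning term by term.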
[jira] Assigned: (LUCENE-1605) Add subset method to BitVector
[ https://issues.apache.org/jira/browse/LUCENE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1605: -- Assignee: Michael McCandless Add subset method to BitVector -- Key: LUCENE-1605 URL: https://issues.apache.org/jira/browse/LUCENE-1605 Project: Lucene - Java Issue Type: New Feature Components: Other Affects Versions: 2.9 Reporter: Jeremy Volkman Assignee: Michael McCandless Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1605.txt Recently I needed the ability to efficiently compute subsets of a BitVector. The method is: public BitVector subset(int start, int end) where start is the starting index, inclusive and end is the ending index, exclusive. Attached is a patch including the subset method as well as relevant unit tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1605) Add subset method to BitVector
[ https://issues.apache.org/jira/browse/LUCENE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1605. Resolution: Fixed Add subset method to BitVector -- Key: LUCENE-1605 URL: https://issues.apache.org/jira/browse/LUCENE-1605 Project: Lucene - Java Issue Type: New Feature Components: Other Affects Versions: 2.9 Reporter: Jeremy Volkman Assignee: Michael McCandless Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1605.txt Recently I needed the ability to efficiently compute subsets of a BitVector. The method is: public BitVector subset(int start, int end) where start is the starting index, inclusive and end is the ending index, exclusive. Attached is a patch including the subset method as well as relevant unit tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1605) Add subset method to BitVector
[ https://issues.apache.org/jira/browse/LUCENE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699718#action_12699718 ] Michael McCandless commented on LUCENE-1605: Patch looks good; I'll commit shortly. Thanks Jeremy! Add subset method to BitVector -- Key: LUCENE-1605 URL: https://issues.apache.org/jira/browse/LUCENE-1605 Project: Lucene - Java Issue Type: New Feature Components: Other Affects Versions: 2.9 Reporter: Jeremy Volkman Assignee: Michael McCandless Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1605.txt Recently I needed the ability to efficiently compute subsets of a BitVector. The method is: public BitVector subset(int start, int end) where start is the starting index, inclusive and end is the ending index, exclusive. Attached is a patch including the subset method as well as relevant unit tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
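To make the proposed API concrete, here is a hypothetical usage sketch. The subset(start, end) signature and its inclusive/exclusive bounds come from the issue description; the re-basing of the result and the placeholder variables are assumptions for illustration:
{code}
// Hypothetical usage of the proposed BitVector.subset(int start, int end).
// 'reader', 'segStart' and 'segEnd' are placeholders for this sketch.
// Assumption: the returned vector is re-based so that its bit 0 corresponds
// to bit 'segStart' of the source vector.
BitVector deletes = new BitVector(reader.maxDoc());
deletes.set(42);      // mark a couple of documents as deleted
deletes.set(1001);

// Slice out the bits for one sub-range of doc ids, e.g. one segment's range
// [segStart, segEnd) inside a larger composite reader.
BitVector segmentDeletes = deletes.subset(segStart, segEnd);
int deletedInSegment = segmentDeletes.count();
{code}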
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699720#action_12699720 ] Michael McCandless commented on LUCENE-1604: bq. Setting disableFakeNorms transitively isn't really needed because MultiSegmentReader doesn't make any calls to the subreaders that would cause it to create its own fake norms But since we score per-segment, TermScorer would ask each SegmentReader (in the MultiSegmentReader) for its norms? So I think the sub readers need to know the setting. bq. Any thoughts on whether we need this or not? I think we do need each class implementing clone() and reopen() to properly carry over this setting. Stop creating huge arrays to represent the absense of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a localility of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
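A minimal sketch of what "carrying the setting over" could look like in a concrete reader that implements reopen(); the accessor names are assumptions modeled on the patch's disableFakeNorms flag, not a committed API:
{code}
// Illustrative only -- real SegmentReader clone()/reopen() logic is much more
// involved. The point is simply that whatever new reader instance is produced
// must inherit the flag from the reader it was derived from. The accessor
// names below are assumptions modeled on the patch's disableFakeNorms flag,
// and doReopen() is a hypothetical internal helper.
public synchronized IndexReader reopen() throws IOException {
  IndexReader reopened = doReopen();
  reopened.setDisableFakeNorms(getDisableFakeNorms());  // carry the setting over
  return reopened;
}
{code}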
vacation
Just as a heads up, since we have so many neat Lucene improvements in flight: tomorrow I leave for a week long vacation, in a nice warm place that may or may not have internet access. So if suddenly I stop answering things, now you know why! Keep hacking away ;) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: vacation
If it's nice and warm I hope for you that it doesn't have internet access, so you won't be tempted to be dragged away from it ;) On Thu, Apr 16, 2009 at 5:45 PM, Michael McCandless luc...@mikemccandless.com wrote: Just as a heads up, since we have so many neat Lucene improvements in flight: tomorrow I leave for a week long vacation, in a nice warm place that may or may not have internet access. So if suddenly I stop answering things, now you know why! Keep hacking away ;) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: I wanna contribute a Chinese analyzer to lucene
On Thu, Apr 16, 2009 at 18:16, Ken Krugler kkrugler_li...@transpac.com wrote: I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language, it's called imdict-chinese-analyzer as it is a subproject of imdict, which is an intelligent online dictionary. The project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/ I took a quick look, but didn't see any code posted there yet. http://code.google.com/p/imdict-chinese-analyzer/downloads/list ? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: vacation
Yes I suppose that would be best ;) Mike On Thu, Apr 16, 2009 at 10:48 AM, Shai Erera ser...@gmail.com wrote: If it's nice and warm I hope for you that it doesn't have internet access, so you won't be tempted to be dragged away from it ;) On Thu, Apr 16, 2009 at 5:45 PM, Michael McCandless luc...@mikemccandless.com wrote: Just as a heads up, since we have so many neat Lucene improvements in flight: tomorrow I leave for a week long vacation, in a nice warm place that may or may not have internet access. So if suddenly I stop answering things, now you know why! Keep hacking away ;) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: Lucene 2.9 status (to port to Lucene.Net)
Thanks Mike. A quick follow up question. What's the status of http://issues.apache.org/jira/browse/LUCENE-1313? Can this work be applied to Lucene 2.4.1 and still get it's benefit or are there other dependency / issues with it that prevents us from doing so? If anyone else knows, I welcome your input. -- George -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, April 16, 2009 8:36 AM To: java-dev@lucene.apache.org Subject: Re: Lucene 2.9 status (to port to Lucene.Net) Hi George, There's been a sudden burst of activity lately on 2.9 development... I know there are some biggish remaining features we may want to get into 2.9: * The new field cache (LUCENE-831; still being iterated/mulled), * Possible major rework of Field / Document index-time vs search-time Document * Applying filters via random-access API when possible performant (LUCENE-1536) * Possible further optimizations to how collection works (LUCENE-1593) * Maybe breaking core + contrib into a more uniform set of modules (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here) -- the Modularization uber-thread. * Further improvements to near-realtime search (using RAMDir for small recently flushed segments) * Many other small things and probably some big ones that I'm forgetting now :) So things are still in flux, and I'm really not sure on a release date at this point. Late last year, I was hoping for early this year, but it's no longer early this year ;) Mike On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote: Hi Folks, This is George Aroush, I'm one of the committers on Lucene.Net - a port of Java Lucene to C# Lucene. I'm looking at the current trunk code of yet to be released Lucene 2.9 and I would like to port it to Lucene.Net. If I do this now, we get the benefit of keeping our code base and release dates much closer to Java Lucene. However, this comes with a cost of carrying over unfinished work, known defects, and I have to keep an eye on new code that get committed into Java Lucene which must be ported over in a timely fashion. To help me determine when is a good time to start the port -- keep in mind, I will be taking the latest code off SVN -- I like to hear from the Java Lucene committers (and users who are playing or using Lucene 2.9 off SVN) about those questions: 1) how stable the current code in the trunk is, 2) do you still have feature work to deliver or just bug fixes, and 3) what's your target date to release Java Lucene 2.9 #1 is important, such that is anyone using it in production? Yes, I did look at the current open issues in JIRA, but that doesn't help me answer the above questions. Regards, -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 status (to port to Lucene.Net)
I wouldn't be surprised if it didnt depend on a couple other little issues - Jason or Mike would probably have to tell you that. It does count a bit on LUCENE-1483 if you want to use it with FieldCaches or cached Filters though. It would still work with 1483, but would be much slower in those cases. - Mark George Aroush wrote: Thanks Mike. A quick follow up question. What's the status of http://issues.apache.org/jira/browse/LUCENE-1313? Can this work be applied to Lucene 2.4.1 and still get it's benefit or are there other dependency / issues with it that prevents us from doing so? If anyone else knows, I welcome your input. -- George -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, April 16, 2009 8:36 AM To: java-dev@lucene.apache.org Subject: Re: Lucene 2.9 status (to port to Lucene.Net) Hi George, There's been a sudden burst of activity lately on 2.9 development... I know there are some biggish remaining features we may want to get into 2.9: * The new field cache (LUCENE-831; still being iterated/mulled), * Possible major rework of Field / Document index-time vs search-time Document * Applying filters via random-access API when possible performant (LUCENE-1536) * Possible further optimizations to how collection works (LUCENE-1593) * Maybe breaking core + contrib into a more uniform set of modules (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here) -- the Modularization uber-thread. * Further improvements to near-realtime search (using RAMDir for small recently flushed segments) * Many other small things and probably some big ones that I'm forgetting now :) So things are still in flux, and I'm really not sure on a release date at this point. Late last year, I was hoping for early this year, but it's no longer early this year ;) Mike On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote: Hi Folks, This is George Aroush, I'm one of the committers on Lucene.Net - a port of Java Lucene to C# Lucene. I'm looking at the current trunk code of yet to be released Lucene 2.9 and I would like to port it to Lucene.Net. If I do this now, we get the benefit of keeping our code base and release dates much closer to Java Lucene. However, this comes with a cost of carrying over unfinished work, known defects, and I have to keep an eye on new code that get committed into Java Lucene which must be ported over in a timely fashion. To help me determine when is a good time to start the port -- keep in mind, I will be taking the latest code off SVN -- I like to hear from the Java Lucene committers (and users who are playing or using Lucene 2.9 off SVN) about those questions: 1) how stable the current code in the trunk is, 2) do you still have feature work to deliver or just bug fixes, and 3) what's your target date to release Java Lucene 2.9 #1 is important, such that is anyone using it in production? Yes, I did look at the current open issues in JIRA, but that doesn't help me answer the above questions. 
Regards, -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 status (to port to Lucene.Net)
Whoops - should read: It should still work *without* 1483 but would be much slower in those cases (reloading the filter/fieldcache per reader rather than per segment). Mark Miller wrote: I wouldn't be surprised if it didnt depend on a couple other little issues - Jason or Mike would probably have to tell you that. It does count a bit on LUCENE-1483 if you want to use it with FieldCaches or cached Filters though. It would still work with 1483, but would be much slower in those cases. - Mark George Aroush wrote: Thanks Mike. A quick follow up question. What's the status of http://issues.apache.org/jira/browse/LUCENE-1313? Can this work be applied to Lucene 2.4.1 and still get it's benefit or are there other dependency / issues with it that prevents us from doing so? If anyone else knows, I welcome your input. -- George -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: Lucene 2.9 status (to port to Lucene.Net)
These issues all depend so much on each other, i would suggest to simply try Lucene-2.9-dev trunk (e.g. from downloaded from Hudson). We have this running here without any problems. The problem with unreleased Lucene is more, that if you try new features, there may be non-compatible changes until the release, so you must keep track on changes on the components you try out. In general: If everything works for you, and you have backups of your indexes, you can simply try out. If it works correctly, just use it! Patching the relased version may make it more unstable than using the development tree, that is more tested by all our committers :) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: George Aroush [mailto:geo...@aroush.net] Sent: Thursday, April 16, 2009 5:05 PM To: java-dev@lucene.apache.org Subject: RE: Lucene 2.9 status (to port to Lucene.Net) Thanks Mike. A quick follow up question. What's the status of http://issues.apache.org/jira/browse/LUCENE-1313? Can this work be applied to Lucene 2.4.1 and still get it's benefit or are there other dependency / issues with it that prevents us from doing so? If anyone else knows, I welcome your input. -- George -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, April 16, 2009 8:36 AM To: java-dev@lucene.apache.org Subject: Re: Lucene 2.9 status (to port to Lucene.Net) Hi George, There's been a sudden burst of activity lately on 2.9 development... I know there are some biggish remaining features we may want to get into 2.9: * The new field cache (LUCENE-831; still being iterated/mulled), * Possible major rework of Field / Document index-time vs search-time Document * Applying filters via random-access API when possible performant (LUCENE-1536) * Possible further optimizations to how collection works (LUCENE-1593) * Maybe breaking core + contrib into a more uniform set of modules (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here) -- the Modularization uber-thread. * Further improvements to near-realtime search (using RAMDir for small recently flushed segments) * Many other small things and probably some big ones that I'm forgetting now :) So things are still in flux, and I'm really not sure on a release date at this point. Late last year, I was hoping for early this year, but it's no longer early this year ;) Mike On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote: Hi Folks, This is George Aroush, I'm one of the committers on Lucene.Net - a port of Java Lucene to C# Lucene. I'm looking at the current trunk code of yet to be released Lucene 2.9 and I would like to port it to Lucene.Net. If I do this now, we get the benefit of keeping our code base and release dates much closer to Java Lucene. However, this comes with a cost of carrying over unfinished work, known defects, and I have to keep an eye on new code that get committed into Java Lucene which must be ported over in a timely fashion. To help me determine when is a good time to start the port -- keep in mind, I will be taking the latest code off SVN -- I like to hear from the Java Lucene committers (and users who are playing or using Lucene 2.9 off SVN) about those questions: 1) how stable the current code in the trunk is, 2) do you still have feature work to deliver or just bug fixes, and 3) what's your target date to release Java Lucene 2.9 #1 is important, such that is anyone using it in production? 
Yes, I did look at the current open issues in JIRA, but that doesn't help me answer the above questions. Regards, -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: I wanna contribute a Chinese analyzer to lucene
In addition to Ken's suggestions, check out http://wiki.apache.org/lucene-java/HowToContribute for some help on getting set up. - Steve From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Thursday, April 16, 2009 10:16 AM To: java-dev@lucene.apache.org Subject: Re: I wanna contribute a Chinese analyzer to lucene I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language, it's called imdict-chinese-analyzer as it is a subproject of imdicthttp://www.imdict.net/, which is an intelligent online dictionary. The project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/ I took a quick look, but didn't see any code posted there yet. [snip] This Analyzer contains two packages, the source code and the lexical dictionary. I want to publish the source code using Apache license, but the dictionary which is under an ambigus license was not create by me. So, can I only submit the source code to lucene contribution repository, and let the users download the dictionary from the google code site? I believe your code can be a contrib, with a reference to the dictionary. So a first step would be to open an issue in Lucene's Jira (http://issues.apache.org/jira/browse/LUCENE), and post your source as a patch. The best way to get the right answer to the legal issue is to post it to the legal-disc...@apache.org list (join it first), as Apache's lawyers can then respond to your specific question. -- Ken -- Ken Krugler +1 530-210-6378
[jira] Commented: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible
[ https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699857#action_12699857 ] Jason Rutherglen commented on LUCENE-1600: -- contrib/MemoryIndex has a bunch of notes about how interning is slow, and using (I believe) hashmaps of strings is better. Comments on this approach? Reduce usage of String.intern(), performance is terrible Key: LUCENE-1600 URL: https://issues.apache.org/jira/browse/LUCENE-1600 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4, 2.4.1 Environment: Windows Server 2003 x64 Hotspot JDK 1.6.0_12 64-bit Reporter: Patrick Eger Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: intern.png, intern_perf.patch I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 fields of short text, Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS), then retrieved all documents via searcher.doc(i, fs). String.intern() showed up as a top hotspot (see attached screenshot), so i implemented a small optimization to not intern() for every new Field(), instead forcing the intern in the FieldInfos class and adding a optional internName constructor to Field. This reduced execution time for searching and iterating through all documents by 35%. Results were similar for -server and -client. TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible
[ https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699864#action_12699864 ] Patrick Eger commented on LUCENE-1600: -- Hashmaps would work also, but then they either need to be synchronized or kept per-thread, the former would probably kill all your performance gains and the latter would be annoying i think. A moderate usage of String.intern() is fine i think, my patch just takes it out of the hot-path (for my use-case at least). Other uses of String.intern() in the codebase may have different solutions/tradeoffs certainly. Reduce usage of String.intern(), performance is terrible Key: LUCENE-1600 URL: https://issues.apache.org/jira/browse/LUCENE-1600 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4, 2.4.1 Environment: Windows Server 2003 x64 Hotspot JDK 1.6.0_12 64-bit Reporter: Patrick Eger Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: intern.png, intern_perf.patch I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 fields of short text, Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS), then retrieved all documents via searcher.doc(i, fs). String.intern() showed up as a top hotspot (see attached screenshot), so i implemented a small optimization to not intern() for every new Field(), instead forcing the intern in the FieldInfos class and adding a optional internName constructor to Field. This reduced execution time for searching and iterating through all documents by 35%. Results were similar for -server and -client. TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
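For anyone skimming the thread, the shape of the optimization (intern once where field metadata is tracked, and let Field skip the intern when the caller already guarantees it) is roughly the following; this is a simplified sketch of the idea, not the actual patch or the real Field class:
{code}
// Simplified sketch of the internName idea (the real Field class has many more
// constructors and state; names here are for illustration only).
public class Field {
  private final String name;

  public Field(String name, String value) {
    this(name, value, true);             // old behavior: always intern
  }

  public Field(String name, String value, boolean internName) {
    // callers on a hot path (e.g. a fields reader holding names from FieldInfos)
    // can pass an already-interned name and avoid a per-Field String.intern() call
    this.name = internName ? name.intern() : name;
  }
}
{code}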
[jira] Commented: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible
[ https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699865#action_12699865 ] Uwe Schindler commented on LUCENE-1600: --- In addition to Mike's fixes, there are more places in FieldsReader where intern() is used. The best would be to add the same ctor to AbstractField too, and use it for LazyField and so on. If I have time, I'll attach a patch similar to Mike's (as he is on holidays). Reduce usage of String.intern(), performance is terrible Key: LUCENE-1600 URL: https://issues.apache.org/jira/browse/LUCENE-1600 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4, 2.4.1 Environment: Windows Server 2003 x64 Hotspot JDK 1.6.0_12 64-bit Reporter: Patrick Eger Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: intern.png, intern_perf.patch I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 fields of short text, Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS), then retrieved all documents via searcher.doc(i, fs). String.intern() showed up as a top hotspot (see attached screenshot), so i implemented a small optimization to not intern() for every new Field(), instead forcing the intern in the FieldInfos class and adding a optional internName constructor to Field. This reduced execution time for searching and iterating through all documents by 35%. Results were similar for -server and -client. TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 status (to port to Lucene.Net)
LUCENE-1313 relies on LUCENE-1516 which is in trunk. If you have other questions George, feel free to ask. On Thu, Apr 16, 2009 at 8:04 AM, George Aroush geo...@aroush.net wrote: Thanks Mike. A quick follow up question. What's the status of http://issues.apache.org/jira/browse/LUCENE-1313? Can this work be applied to Lucene 2.4.1 and still get it's benefit or are there other dependency / issues with it that prevents us from doing so? If anyone else knows, I welcome your input. -- George -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, April 16, 2009 8:36 AM To: java-dev@lucene.apache.org Subject: Re: Lucene 2.9 status (to port to Lucene.Net) Hi George, There's been a sudden burst of activity lately on 2.9 development... I know there are some biggish remaining features we may want to get into 2.9: * The new field cache (LUCENE-831; still being iterated/mulled), * Possible major rework of Field / Document index-time vs search-time Document * Applying filters via random-access API when possible performant (LUCENE-1536) * Possible further optimizations to how collection works (LUCENE-1593) * Maybe breaking core + contrib into a more uniform set of modules (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here) -- the Modularization uber-thread. * Further improvements to near-realtime search (using RAMDir for small recently flushed segments) * Many other small things and probably some big ones that I'm forgetting now :) So things are still in flux, and I'm really not sure on a release date at this point. Late last year, I was hoping for early this year, but it's no longer early this year ;) Mike On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote: Hi Folks, This is George Aroush, I'm one of the committers on Lucene.Net - a port of Java Lucene to C# Lucene. I'm looking at the current trunk code of yet to be released Lucene 2.9 and I would like to port it to Lucene.Net. If I do this now, we get the benefit of keeping our code base and release dates much closer to Java Lucene. However, this comes with a cost of carrying over unfinished work, known defects, and I have to keep an eye on new code that get committed into Java Lucene which must be ported over in a timely fashion. To help me determine when is a good time to start the port -- keep in mind, I will be taking the latest code off SVN -- I like to hear from the Java Lucene committers (and users who are playing or using Lucene 2.9 off SVN) about those questions: 1) how stable the current code in the trunk is, 2) do you still have feature work to deliver or just bug fixes, and 3) what's your target date to release Java Lucene 2.9 #1 is important, such that is anyone using it in production? Yes, I did look at the current open issues in JIRA, but that doesn't help me answer the above questions. Regards, -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: vacation
Enjoy, I just got back from mine, tropical Minneapolis. On Thu, Apr 16, 2009 at 7:45 AM, Michael McCandless luc...@mikemccandless.com wrote: Just as a heads up, since we have so many neat Lucene improvements in flight: tomorrow I leave for a week long vacation, in a nice warm place that may or may not have internet access. So if suddenly I stop answering things, now you know why! Keep hacking away ;) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible
[ https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699865#action_12699865 ] Uwe Schindler edited comment on LUCENE-1600 at 4/16/09 2:13 PM: In addition to Mikes fixes, there are more places in FieldsReader, where intern() is used. The best would be to add the sme ctor to AbstractField, too and use it for LayzyField and so on, too. If I have time, I attach a patch similar to Patrick's. was (Author: thetaphi): In addition to Mikes fixes, there are more places in FieldsReader, where intern() is used. The best would be to add the sme ctor to AbstractField, too and use it for LayzyField and so on, too. If I have time, I attach a patch similar to Mikes (as he is on holidays). Reduce usage of String.intern(), performance is terrible Key: LUCENE-1600 URL: https://issues.apache.org/jira/browse/LUCENE-1600 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4, 2.4.1 Environment: Windows Server 2003 x64 Hotspot JDK 1.6.0_12 64-bit Reporter: Patrick Eger Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: intern.png, intern_perf.patch I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 fields of short text, Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS), then retrieved all documents via searcher.doc(i, fs). String.intern() showed up as a top hotspot (see attached screenshot), so i implemented a small optimization to not intern() for every new Field(), instead forcing the intern in the FieldInfos class and adding a optional internName constructor to Field. This reduced execution time for searching and iterating through all documents by 35%. Results were similar for -server and -client. TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: I wanna contribute a Chinese analyzer to lucene
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Gao Pinker xiaoping...@gmail.com To: java-dev@lucene.apache.org Sent: Thursday, April 16, 2009 9:58:51 AM Subject: I wanna contribute a Chinese analyzer to lucene Hi All! I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called imdict-chinese-analyzer, as it is a subproject of imdict, which is an intelligent online dictionary. The project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/ In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我(I) 是(am) 中国人(Chinese), not 我 是中 国人. So the analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be seriously affected! Although there are two analyzer packages in the Apache repository which can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word; this is obviously not true in reality, and this strategy also increases the index size and hurts performance badly. The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper HHMM-based Chinese Lexical analyzer ICTCLAL. As imdict-chinese-analyzer is a really fast, intelligent Chinese Analyzer for Lucene written in Java, I want to share this project with everyone using Lucene. This Analyzer contains two packages, the source code and the lexical dictionary. I want to publish the source code under the Apache license, but the dictionary, which is under an ambiguous license, was not created by me. So, can I submit only the source code to the Lucene contrib repository, and let the users download the dictionary from the Google Code site? Please help me with this contribution.
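As a point of reference for how such an analyzer would be wired in, here is a minimal indexing sketch against the Lucene 2.4-era API; the class name SmartChineseAnalyzer and the Directory variable are placeholders, not necessarily what the contributed project actually calls them:
{code}
// Minimal sketch; "SmartChineseAnalyzer" is a placeholder name and 'dir' is an
// already-opened Directory -- neither is taken from the contributed code.
Analyzer analyzer = new SmartChineseAnalyzer();
IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
// with a word-level analyzer this should index 我 / 是 / 中国人 rather than single
// characters or bigrams
doc.add(new Field("body", "我是中国人", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();
{code}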
RE: vacation
Have fun and relax! My next holiday will be after a meeting in Japan, I will visit Kyoto (end of May). It will be hot there, too...! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, April 16, 2009 4:46 PM To: java-dev@lucene.apache.org Subject: vacation Just as a heads up, since we have so many neat Lucene improvements in flight: tomorrow I leave for a week long vacation, in a nice warm place that may or may not have internet access. So if suddenly I stop answering things, now you know why! Keep hacking away ;) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: I wanna contribute a Chinese analyzer to lucene
This would be a great contribution. I took a quick look at the ZIP file and noticed it depends on, say, net.imdict.wordsegment.WordSegmenter, but I didn't see that class anywhere. I assume you will patch and polish things, but I thought I'd point this out. Thanks! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Gao Pinker xiaoping...@gmail.com To: java-dev@lucene.apache.org Sent: Thursday, April 16, 2009 9:58:51 AM Subject: I wanna contribute a Chinese analyzer to lucene Hi All! I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language, it's called imdict-chinese-analyzer as it is a subproject of imdict, which is an intelligent online dictionary. The project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/ In Chinese, 我是中国人(I am Chinese), should be tokenized as 我(I) 是(am) 中国人(Chinese), not 我 是中 国人. So the analyzer must handle each sentence properly, or there will be mis-understandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously! Although there are two analyzer packages in apache repository which can handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word, this is obviously not true in reality, also this strategy will increase the index size and hit the performance baddly. The algorithm ofimdict-chinese-analyzer is based on Hidden Markov Model (HMM), so it can tokenize chinese sentence in a really intelligent way. Tokenizaion accuracy of this model is above 90% according to the paper HHMM-based Chinese Lexical analyzer ICTCLAL. As imdict-chinese-analyzer is a really fast intelligent Chinese Analyzer for lucene written in Java. I want to share this project with every one using Lucene. This Analyzer contains two packages, the source code and the lexical dictionary. I want to publish the source code using Apache license, but the dictionary which is under an ambigus license was not create by me. So, can I only submit the source code to lucene contribution repository, and let the users download the dictionary from the google code site? please help me about this contribution.
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699872#action_12699872 ] Shon Vella commented on LUCENE-1604: What should the transitive behavior of MultiReader, FilterReader, and ParallelReader be? I'm inclined to say they shouldn't pass through to their subordinate readers because they don't really own them. Stop creating huge arrays to represent the absense of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a localility of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699880#action_12699880 ] Mark Miller commented on LUCENE-831: Okay, now that I halfway understand this issue, I think I have to go back to the basic motivations. The original big win was taken away by 1483, so let's see if we really need a new API for the wins we have left.
h3. Advantages of new API (kind of as it is in the patch)
* FieldCache is an interface and it would be nice to move to an abstract class; ExtendedFieldCache is ugly.
* Avoid global sync by IndexReader to access the cache.
* It's easier/cleaner to block caching by multireaders (though I am almost thinking I would prefer warnings/advice about performance and encouragement to move to per segment).
* It becomes easier to share a ValueSource instance across readers.
h3. Disadvantages of new API
* If we want only SegmentReaders to have a ValueSource, you can't efficiently back the old API with the new, causing RAM requirement jumps if you straddle the two APIs and ask for the same array data from each.
* It's probably a higher barrier for a custom Parser to implement and init a Reader with a ValueSource (presumably one that works per field) than to simply pass the Parser on a SortField. However, Parser stops making sense if we end up being able to back ValueSource with column stride fields.
* We could allow ValueSource to be passed on the SortField (the current incarnation of this patch), but then you have to go back to a global cache, keyed by reader, for the ValueSources passed that way (you would also still have the per-segment-reader, settable ValueSource).
h3. Advantages of staying with old API
* Avoid forcing a large migration for users, with possible RAM requirement penalties if they don't switch from deprecated code (we are doing something similar with 1483 even without deprecated code though - if you were using an external multireader FieldCache that matched a sort FieldCache key, you'd double your RAM requirements).
h3. Thoughts
If we stayed with the old API, we could still allow a custom FieldCache to be supplied. We could still back FieldCacheImpl with Uninverter to reduce code. We could still have CachingFieldCache. Though CachingValueSource is much better :) FieldCache implies caching, and so the name would be confusing. We could also avoid CachingFieldCache though, as just making a pluggable FieldCache would allow alternate caching implementations (with a bit more effort). We could deprecate the Parser methods and force supplying a new FieldCache impl for custom uninversion to get to an API suitable to be backed by CSF. Or: we could also move to ValueSource, but allow a ValueSource on multi-readers. That would probably make straddling the APIs much more possible (and efficient) in the default case. We could advise that it's best to work per segment, but leave the option to the user.
h3. Conclusion
I am not sure. I thought I was convinced we might as well not even move from FieldCache at all, but now that I've written a bit out, I'm thinking it would be worth going to ValueSource. I'm just not positive on what we should support. SortField ValueSource override keyed by reader? ValueSources on MultiReaders?
Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Assignee: Mark Miller Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completley independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundent caching as client code migrades to new API.
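To anchor the terminology used in the comment above, the kind of ValueSource abstraction being weighed is roughly of the following shape; the method names and signatures here are illustrative assumptions, not the API of any attached patch:
{code}
// Illustrative shape only -- not the API from the patch. Names and signatures
// are assumptions to make the discussion above concrete.
public abstract class ValueSource {
  /** Per-document values for one (ideally segment-level) reader and field.
   *  Whether and how results are cached is left to the implementation, which
   *  is what makes wrappers such as a caching ValueSource possible. */
  public abstract int[] getInts(IndexReader reader, String field) throws IOException;

  public abstract float[] getFloats(IndexReader reader, String field) throws IOException;

  // ... getLongs/getDoubles/getStrings would follow the same pattern
}
{code}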
Re: vacation
On Thu, Apr 16, 2009 at 10:45:49AM -0400, Michael McCandless wrote: Just as a heads up, since we have so many neat Lucene improvements in flight: tomorrow I leave for a week long vacation, in a nice warm place that may or may not have internet access. So if suddenly I stop answering things, now you know why! I've got plenty to keep myself busy while you're gone. :) We'll manage on autopilot for a little while. Enjoy your break. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699893#action_12699893 ] Uwe Schindler commented on LUCENE-831: -- We have the problem with the ValueSource override not only with SortField; function queries and other places also need the additional ValueSource override. So a central place to register a ValueSource per field for an IndexReader (MultiReader, ... passing down to segments) would really be nice. For the caching problem: possibly the ValueSource given to SortField etc. behaves like the current parser. The cache in IndexReader should also be keyed by the ValueSource. So the SortField/FunctionQuery ValueSource override is passed down to IndexReader's cache. If the IndexReader has an entry in its cache for the same (field, ValueSource, ...) key, it could use the arrays from there; if not, it fills the cache with an array from the overridden ValueSource. I would really make the ValueSource per-field. The Uninverter inner class should be made public, the Uninverter should accept a starting term to iterate from (overridable), and the newTerm() method should be able to return false to stop iterating (see my ValueSource example for trie). With that one could easily create a subclass of Uninverter with its own parser logic (like trie). Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Assignee: Mark Miller Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completley independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundent caching as client code migrades to new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
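A sketch of the Uninverter extension points being asked for here (a public Uninverter with an overridable starting term and an early-exit hook); the class and method names are assumptions for illustration, not an existing public Lucene API:
{code}
// Illustrative sketch only: Uninverter, startTerm() and newTerm() as used here
// are assumed extension points, not committed API.
public class TrieUninverter extends Uninverter {
  private final String field;

  public TrieUninverter(String field) {
    this.field = field;
  }

  /** Overridable: where the term enumeration should begin for this field. */
  protected Term startTerm() {
    return new Term(field, "");   // a trie impl would seek to its highest-precision prefix here
  }

  /** Overridable: return false to stop enumerating early; custom parse logic goes here. */
  protected boolean newTerm(Term t) {
    return field.equals(t.field());   // stop once we leave this field's terms
  }
}
{code}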
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699892#action_12699892 ] Jason Rutherglen commented on LUCENE-1536: -- I thought we are going to get LUCENE-1518 working to compare the performance against passing the filter into TermDocs? if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1518: --- Fix Version/s: 2.9 Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch, that merges Queries and Filters in a way, that the new Filter class extends Query. This would make it possible, to use every filter as a query. The new abstract filter class would contain all methods of ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods he has nothing more to do, he could just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way, that every Filter can automatically be used at all places where a Query can be used (e.g. also alone a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters which is the current ConstantScore Logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part could be directly used, the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries the idea is, to optimize the BooleanQuery logic in such a way, that it detects if a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API and not take the burden of the ConstantScoreQuery (see LUCENE-1345). Here some ideas how to implement Searcher.search() with Query and Filter: - User runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. - User runs Searcher.search() using a Query as the only parameter: No change, all is the same as before - User runs Searcher.search() using a BooleanQuery as parameter: If the BooleanQuery does not contain a Query that is subclass of Filter (the new Filter) everything as usual. If the BooleanQuery only contains exactly one Filter and nothing else the Filter is used as a constant score query. If BooleanQuery contains clauses with Queries and Filters the new algorithm could be used: The queries are executed and the results filtered with the filters. For the user this has the main advantage: That he can construct his query using a simplified API without thinking about Filters oder Queries, you can just combine clauses together. The scorer/weight logic then identifies the cases to use the filter or the query weight API. Just like the query optimizer of a RDB. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
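Reduced to its core, the proposal in the description amounts to something like the following; this is a sketch of the idea (assumed to live in org.apache.lucene.search so it can reuse the constant-score machinery), not the attached patch:
{code}
// Sketch of the idea only -- not LUCENE-1518.patch itself.
public abstract class Filter extends Query {
  /** The one method a concrete filter still has to implement. */
  public abstract DocIdSet getDocIdSet(IndexReader reader) throws IOException;

  /** Used when the filter is handed to Searcher.search() as a plain Query:
   *  every matching doc gets a constant score, as ConstantScoreQuery does today. */
  protected Weight createWeight(Searcher searcher) throws IOException {
    return new ConstantScoreQuery(this).createWeight(searcher);
  }
}
{code}
With something of this shape in place, BooleanQuery could detect (via instanceof) that a clause is a Filter and use getDocIdSet() directly instead of going through the constant-score scorer, as the description suggests.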
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699939#action_12699939 ] Michael McCandless commented on LUCENE-1536: Ahh right, we should re-test performance of this after LUCENE-1518 is done. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699941#action_12699941 ] Michael McCandless commented on LUCENE-1604: bq. I'm inclined to say they shouldn't pass through to their subordinate readers because they don't really own them. I agree. Stop creating huge arrays to represent the absence of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a locality of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
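A minimal sketch of what consumer-side handling could look like, assuming (as the issue proposes) that IndexReader.norms(field) may return null for a field without norms; the NormLookup helper itself is purely illustrative and not part of the attached patch.
{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Illustrative helper: scoring code that copes with a null norms array
// instead of relying on a maxDoc()-sized array of identical default bytes.
final class NormLookup {
  private final byte[] norms;       // null => the field has no norms
  private final float defaultNorm;  // norm an unboosted field would get

  NormLookup(IndexReader reader, String field) throws IOException {
    this.norms = reader.norms(field);  // with the change, may be null
    this.defaultNorm = Similarity.decodeNorm(Similarity.encodeNorm(1.0f));
  }

  float norm(int doc) {
    return norms == null ? defaultNorm : Similarity.decodeNorm(norms[doc]);
  }
}
{code}
The saving is simply maxDoc() bytes per norm-less field per reader that no longer have to be allocated and scanned.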
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1263#action_1263 ] Mark Miller commented on LUCENE-831: I think we don't want to expose Uninverter though? The API should be neutral enough to naturally support loading from CSF, in which case Uninverter doesn't make sense...so we were going to go with having to override the value source to handle uninverter-type stuff. Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Assignee: Mark Miller Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completely independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
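Since the proposed API is still in flux, the following is only a guess at what "keep the value source neutral and override it for uninversion" could look like; the class names FieldValueSource/UninvertingValueSource and the getInts() signature are hypothetical placeholders, not the API from the attached patches. Only the uninversion loop itself reflects how FieldCache builds its arrays today.
{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Neutral contract: "give me the int values for this reader/field".
// Nothing in it mentions where the values come from.
abstract class FieldValueSource {
  abstract int[] getInts(IndexReader reader, String field) throws IOException;
}

// One subclass uninverts the term index (today's FieldCache behaviour)...
class UninvertingValueSource extends FieldValueSource {
  int[] getInts(IndexReader reader, String field) throws IOException {
    int[] vals = new int[reader.maxDoc()];
    TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms(new Term(field, ""));
    try {
      do {
        Term term = termEnum.term();
        if (term == null || !term.field().equals(field)) {
          break;
        }
        int value = Integer.parseInt(term.text());
        termDocs.seek(termEnum);
        while (termDocs.next()) {
          vals[termDocs.doc()] = value;
        }
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }
    return vals;
  }
}
// ...while another subclass could read the same values from a column-stride
// field (CSF) once that exists, with no change to the public contract.
{code}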