RE: Build failed in Hudson: Lucene-trunk #1091
Really cool: This time we hit the failure during the clover run: [junit] Testsuite: org.apache.lucene.index.TestIndexWriterMergePolicy [junit] Tests run: 6, Failures: 1, Errors: 0, Time elapsed: 28.519 sec [junit] [junit] Testcase: testMaxBufferedDocsChange(org.apache.lucene.index.TestIndexWriterMergePolicy): FAILED [junit] maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62 [junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62 [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:234) [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.__CLR2_6_3zf7i0317qu(TestIndexWriterMergePolicy.java:164) [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:125) [junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:214) [junit] [junit] [junit] Test org.apache.lucene.index.TestIndexWriterMergePolicy FAILED We also get the exact location of the failure: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/clover-report/org/apache/lucene/index/TestIndexWriterMergePolicy.html?line=125#src-125 And you can see how often each line was called before the failure occurred! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org] Sent: Thursday, February 11, 2010 5:16 AM To: java-dev@lucene.apache.org Subject: Build failed in Hudson: Lucene-trunk #1091 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/changes Changes: [uschindler] LUCENE-2248: Change core tests to use a global Version constant [uschindler] LUCENE-2258: Remove unneeded synchronization in FuzzyTermEnum -- [...truncated 23095 lines...] [javadoc] Loading source files for package org.apache.lucene.queryParser.standard... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.builders... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.config... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.nodes... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.parser... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.processors... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_22 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-queryparser/stylesheet.css... [javadoc] Note: Custom tags that were not seen: @lucene.experimental, @lucene.internal [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queryparser/lucene-queryparser-2010-02-11_02-03-57-javadoc.jar [echo] Building regex...
javadocs: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-regex [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.lucene.search.regex... [javadoc] Loading source files for package org.apache.regexp... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_22 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-regex/stylesheet.css... [javadoc] Note: Custom tags that were not seen: @lucene.experimental, @lucene.internal [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-02-11_02-03-57-javadoc.jar [echo] Building remote... javadocs: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.lucene.search... [javadoc] Constructing Javadoc
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832424#action_12832424 ] Eks Dev commented on LUCENE-2089: - {quote} What about this, http://www.catalysoft.com/articles/StrikeAMatch.html it seems logically more appropriate to (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance faster? {quote} Is it only me who sees plain, vanilla bigram distance here? What is new or better in StrikeAMatch compared to the first phase of the current SpellChecker (feeding PriorityQueue with candidates)? If you need to use this, nothing is simpler; you do not even need pair comparison (aka traversal), just index terms split into bigrams and search with a standard Query. The Automaton trick is a neat one. Imo, the only thing that would work better is to make the term dictionary a real trie (ternary, n-ary, dfa, makes no big diff). Making TermDict some sort of trie/dfa would permit smart beam-search, even without compiling a query DFA. Beam search also makes implementation of better distances possible (weighted edit distance without the metric constraint). I guess this is going to be possible with Flex; Mike was already talking about a DFA Dictionary :) It took a while to figure out the trick Robert pulled here, treating the term dictionary as another DFA due to the sortedness, nice. explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm). We can optimize fuzzyquery by using AutomatonTermsEnum; here is my idea: * up front, calculate the maximum required K edits needed to match the user's supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg; at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. I modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNNNNNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| As you can see, this prototype is no good yet, because it creates the DFA in a slow way: right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with Lucene, and here you can see the TermEnum is fast (AvgMS - AvgMS_DFA); there is no problem there. Instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.
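To make the "up front, calculate the maximum required K edits" step above concrete: with the classic FuzzyTermEnum scaling (similarity = 1 - editDistance / min(termLength, targetLength)), K falls out directly. A minimal sketch assuming that scaling convention; the method name is illustrative, not part of any patch on this issue:
{code}
/**
 * Largest edit distance K that can still satisfy the user's similarity
 * threshold, assuming similarity = 1 - distance / min(termLen, targetLen).
 * For targets at least as long as the query term, min(...) = termLength,
 * so a candidate needing more than K edits can never reach the threshold.
 */
static int maxRequiredEdits(float minSimilarity, int termLength) {
  return (int) ((1.0f - minSimilarity) * termLength);
}
{code}
For example, a 7-character term with threshold 0.5 yields K = 3, so only the DFAs for distances 1..3 would ever be needed.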
the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization; if someone wants to implement this, they should not worry about minimization. In fact, we need to determine at some point whether AutomatonQuery should minimize FSMs at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the Dumb vs Smart heuristic, and this can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
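Returning to the StrikeAMatch question at the top of the comment above: the linked article's measure is indeed a bigram (Dice) overlap. A minimal sketch of the set-based simplification for single terms — illustrative code, not taken from the SpellChecker:
{code}
import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {
  /** Adjacent letter pairs, e.g. "france" -> {fr, ra, an, nc, ce}. */
  private static Set<String> bigrams(String s) {
    Set<String> pairs = new HashSet<String>();
    for (int i = 0; i < s.length() - 1; i++) {
      pairs.add(s.substring(i, i + 2));
    }
    return pairs;
  }

  /** Dice coefficient: 2*|shared pairs| / (|pairs1| + |pairs2|), in [0,1]. */
  public static double similarity(String s1, String s2) {
    Set<String> p1 = bigrams(s1), p2 = bigrams(s2);
    int total = p1.size() + p2.size();
    p1.retainAll(p2); // p1 now holds only the shared bigrams
    return total == 0 ? 0.0 : (2.0 * p1.size()) / total;
  }
}
{code}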
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1720: --- Attachment: LUCENE-1720.patch Collected all the files in the issue into a single .patch file. I also included the following changes/additions: # Formatting. # Renamed TestTimeLimitedIndexReader to TimeLimitedIndexReaderBenchmark - I don't want it to be run by ant test. We should review it anyway and put it in benchmark. # Created TestActivityTimeMonitor which tests the ATM itself. This revealed a bug in the ATM, where it removed entries from the map while iterating over it (TimeoutThread.run()). #* BTW, the test testMultipleThreadsFailing() passes when the number of threads is 10, and fails when it's set higher. It is definitely a timing issue, b/c if I change each thread to sleep more than 200ms, the test passes. I suspect that when there are so many threads in the system, TimeoutThread will not be as accurate. However I'm not sure if we should do something about it ... if there are 50 threads running concurrently, things will be slow, and executing search requests will take longer, therefore the TimeoutThread will have enough time to pick up their timeouts. We need to create a test for the IndexReader case; I hope to get to it, but if you want, please take a stab at it. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
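The "removed from the map while iterating" bug mentioned above is the classic ConcurrentModificationException trap; the usual fix is to remove through the iterator. A sketch under assumed names — the real ATM's field types and its signaling mechanism may differ:
{code}
import java.util.Iterator;
import java.util.Map;

// Inside TimeoutThread.run(): remove expired entries via the iterator,
// never via map.remove(key), or the next iteration step throws
// ConcurrentModificationException.
void expireTimeouts(Map<Thread, Long> timeLimitedThreads, long now) {
  for (Iterator<Map.Entry<Thread, Long>> it =
           timeLimitedThreads.entrySet().iterator(); it.hasNext();) {
    Map.Entry<Thread, Long> entry = it.next();
    if (entry.getValue().longValue() <= now) {
      entry.getKey().interrupt(); // signal the timed-out thread (assumed here)
      it.remove();                // safe removal during iteration
    }
  }
}
{code}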
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832427#action_12832427 ] Shai Erera commented on LUCENE-1720: BTW Mark, if we assume the number of threads that will be monitored is relatively low, I agree w/ your comment on using the TreeMap. I think we can remove it from the code (the comment, I mean). What do you think? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154.patch Here is a first patch about cglib-generated proxy attributes. In IRC we found out yesterday that the proposed idea to share the attributes across all Multi*Enums would result in problems, as the call to next() on any sub-enum would overwrite the contents of the attributes of the previous sub-enum, which would break TermsEnum (because e.g. TermsEnum looks forward by calling next() on all sub-enums and choosing the lowest term to return - after calling each enum's next(), the attributes of the first enums cannot be restored without captureState, as they were overwritten by the next() call to the last enum). This patch needs cglib-nodep-2.2.jar put into the lib folder of the checkout [http://sourceforge.net/projects/cglib/files/cglib2/2.2/cglib-nodep-2.2.jar/download]. It contains a test that shows the usage. The central part is cglib's Enhancer, which creates a dynamic class extending ProxyAttributeImpl (which defines the general AttributeImpl methods delegating to the delegate) and implementing the requested Attribute interface using a MethodInterceptor. Please note: this uses no reflection (only during in-memory class file creation, which is only run one time on loading the proxy class). The proxy implements MethodInterceptor and uses the fast MethodProxy class (which is also generated by cglib for each proxied method) and can invoke the delegated method directly (without reflection) on the delegate. The test verifies everything works and also compares speed by using a TermAttribute natively and proxied. The speed is lower (which is not caused by reflection, but by the MethodInterceptor creating an array of parameters and boxing/unboxing native parameters into the Object[]), but for the test case I have seen only about 50% more time needed. The generated classes are cached and reused (like DEFAULT_ATTRIBUTE_FACTORY does). To get maximum speed and no external libraries, the code implemented by Enhancer can be rewritten natively using the Apache Harmony java.lang.reflect.Proxy implementation source code as a basis. The hardest part in generating bytecode is the ConstantPool in class files. But as the proxy methods are simply delegating and no magic like boxing/unboxing is needed, the generated bytecode is rather simple. One other use-case for these proxies is AppendingTokenStream, which is not possible since 3.0 without captureState (in the old TS API it was possible, because you could reuse the same Token instance even over the appended streams). In the new TS API, the appending stream must have a view on the attributes of the current consuming sub-stream. Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154.patch The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own.
Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
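For readers who have not used cglib, a stripped-down sketch of the Enhancer/MethodInterceptor delegation pattern the patch above builds on; the factory name is illustrative, and the real patch adds class caching and the AttributeImpl plumbing on top:
{code}
import java.lang.reflect.Method;
import net.sf.cglib.proxy.Enhancer;
import net.sf.cglib.proxy.MethodInterceptor;
import net.sf.cglib.proxy.MethodProxy;

public class DelegatingProxyFactory {
  /** Creates a proxy implementing iface that forwards every call to delegate. */
  @SuppressWarnings("unchecked")
  public static <T> T proxy(Class<T> iface, final T delegate) {
    Enhancer enhancer = new Enhancer();
    enhancer.setInterfaces(new Class[] { iface });
    enhancer.setCallback(new MethodInterceptor() {
      public Object intercept(Object obj, Method method, Object[] args,
                              MethodProxy fastProxy) throws Throwable {
        // MethodProxy invokes the delegate without java.lang.reflect overhead;
        // the remaining cost is the Object[] argument boxing mentioned above.
        return fastProxy.invoke(delegate, args);
      }
    });
    return (T) enhancer.create();
  }
}
{code}
The generated subclass is created once per proxied interface, which is why the patch caches and reuses the classes rather than calling Enhancer per instance.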
Re: Build failed in Hudson: Lucene-trunk #1091
This is LUCENE-2118 -- it strikes every so often. I think it's harmless (but really annoying). It looks like a corner case in the merge policy where somehow a level is able to have 11 segments after merging thinks it's done. Mike On Thu, Feb 11, 2010 at 3:03 AM, Uwe Schindler u...@thetaphi.de wrote: Really cool: This time we hit the failure during the clover run: [junit] Testsuite: org.apache.lucene.index.TestIndexWriterMergePolicy [junit] Tests run: 6, Failures: 1, Errors: 0, Time elapsed: 28.519 sec [junit] [junit] Testcase: testMaxBufferedDocsChange(org.apache.lucene.index.TestIndexWriterMergePolicy): FAILED [junit] maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62 [junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62 [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:234) [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.__CLR2_6_3zf7i0317qu(TestIndexWriterMergePolicy.java:164) [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:125) [junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:214) [junit] [junit] [junit] Test org.apache.lucene.index.TestIndexWriterMergePolicy FAILED We also get the exact location of the failure: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/clover-report/org/apache/lucene/index/TestIndexWriterMergePolicy.html?line=125#src-125 And you can see how often each line was called before the failure occurred! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org] Sent: Thursday, February 11, 2010 5:16 AM To: java-dev@lucene.apache.org Subject: Build failed in Hudson: Lucene-trunk #1091 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/changes Changes: [uschindler] LUCENE-2248: Change core tests to use a global Version constant [uschindler] LUCENE-2258: Remove unneeded synchronization in FuzzyTermEnum -- [...truncated 23095 lines...] [javadoc] Loading source files for package org.apache.lucene.queryParser.standard... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.builders... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.config... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.nodes... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.parser... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.processors... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_22 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-queryparser/stylesheet.css...
[javadoc] Note: Custom tags that were not seen: @lucene.experimental, @lucene.internal [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queryparser/lucene-queryparser-2010-02-11_02-03-57-javadoc.jar [echo] Building regex... javadocs: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-regex [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.lucene.search.regex... [javadoc] Loading source files for package org.apache.regexp... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_22 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-regex/stylesheet.css... [javadoc] Note: Custom tags that were not seen: @lucene.experimental, @lucene.internal [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-02-11_02-03-57-javadoc.jar [echo] Building remote...
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832444#action_12832444 ] Mark Harwood commented on LUCENE-1720: -- Thanks for the updates, Shai. Agreed on removing the treemap comment. As you suggest, there may be a low-level accuracy timing issue under heavy load, but for the typically longer timeout settings we may set, this is less likely to be an issue. Related: I did think of another feature for ATM - timeouts will typically be set to the maximum bearable value that can be sustained by the hardware without upsetting lots of users/customers who need answers. This setting is therefore a tough business decision to make and is likely to be on the high side to avoid annoying customers (10 seconds? 30?). The current monitoring solution only aborts at the latest possible stage, when the uppermost acceptable limit has been reached and expensive resources have already been burned. Maybe we could add a progress-testing method to ATM which can throw an exception earlier, e.g. public void checkForProjectedActivityTimeout(float percentActivityCompletedSoFar) Clients would need to estimate how far through a task they were and call this method periodically. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154.patch I had some more fun. Made ProxyAttributeSource non-final and added a class-name policy that also contains the corresponding interface (to make stack traces on errors nicer). Here is the example output:
{noformat}
[junit] DEBUG: Created class org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TermAttribute$$EnhancerByCGLIB$$6100bdf9 for attribute org.apache.lucene.analysis.tokenattributes.TermAttribute
[junit] DEBUG: Created class org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TypeAttribute$$EnhancerByCGLIB$$6f89c3ff for attribute org.apache.lucene.analysis.tokenattributes.TypeAttribute
[junit] DEBUG: Created class org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$FlagsAttribute$$EnhancerByCGLIB$$4668733c for attribute org.apache.lucene.analysis.tokenattributes.FlagsAttribute
[junit] Time taken using org.apache.lucene.analysis.tokenattributes.TermAttributeImpl:
[junit] 1476.090658 ms for 1000 iterations
[junit] Time taken using org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TermAttribute$$EnhancerByCGLIB$$6100bdf9:
[junit] 1881.295734 ms for 1000 iterations
{noformat}
Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154.patch, LUCENE-2154.patch The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own. Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832461#action_12832461 ] Shai Erera commented on LUCENE-1720: I like the idea of adding the projected activity timeout in general, but I'd like to question its usefulness in reality (or at least for search applications). The way I think of it (and it might be because I'm thinking of my use case), there are two problems with such an API: # It might not be very easy (if possible at all) or performant to project how much of the work has been done. For TermQuery it might be easy to tell (e.g. numSeenSoFar / df(term)), but that will add an 'if' to every document that is traversed, and possibly more operations. But for more complicated queries, I'm not sure you'll be able to tell how much of the query has been processed. # If I am willing to sustain a 10s query, then I guess I'd want to extract as much information as I can in those 10s. If after 1s I realize I haven't processed even 10% of the data, that doesn't mean I'd like to stop, right? Maybe the query/activity will speed up shortly? I think that if I put a cap on the query time, it means I don't mind spending that amount of time ... but I also recognize this may depend on the application, and therefore that is not too strong an argument. I think this approach is interesting, as it is able to detect 'hanging' threads (such as those stuck in infinite loops). I realize however that ActivityTimeMonitor is not search specific (which makes me think it should be moved to o.a.l.util or something) and therefore the projected activity timeout can have its uses in other places. How about we do it in a separate issue? We still need to write enough tests for what exists so far, and turn the Benchmark class into a benchmark task/alg. I think that if we can avoid extra functionality (which is likely to add more bugs to cover), it will be easier to finish that issue, no? BTW, in order to support this we'll need to store the startTime as well, not just the timeoutTime, which means that we either add another startTimesThreads map, or change the map to be from Thread to a Times object which encapsulates both times ... Minor thing though. Also, is this targeted to be added to 'core' or contrib? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832470#action_12832470 ] Mark Harwood commented on LUCENE-1720: -- The change to ATM isn't that big - as you say, just adding start to the data on each thread. Here's an (untested) example:
{code:title=Bar.java|borderStyle=solid}
/**
 * Checks to see if this thread is likely to exceed its pre-determined timeout.
 * This is a heavier-weight call than checkForTimeout and should not be called quite as frequently.
 *
 * Throws {@link ActivityTimedOutException} RuntimeException in the event of any anticipated timeout.
 * @param progress fraction of the activity completed so far (must be > 0)
 */
public static final void checkProjectedTimeoutOnThisThread(float progress) {
  Thread currentThread = Thread.currentThread();
  synchronized (timeLimitedThreads) {
    ActivityTime thisTimeOut = timeLimitedThreads.get(currentThread);
    if (thisTimeOut != null) {
      long now = System.currentTimeMillis();
      long maxDuration = thisTimeOut.scheduledTimeout - thisTimeOut.startTime;
      long durationSoFar = now - thisTimeOut.startTime;
      float expectedMinimumProgress = (float) durationSoFar / maxDuration;
      if (progress < expectedMinimumProgress) {
        // projected remaining time = durationSoFar * (1 - progress) / progress,
        // assuming progress continues at the rate observed so far
        long expectedOverrun = (long) (((durationSoFar * (1f - progress) / progress) + now) - thisTimeOut.scheduledTimeout);
        throw new ActivityTimedOutException("Thread " + currentThread + " is expected to time out, estimated overrun = " + expectedOverrun + " ms", expectedOverrun);
      }
    }
  }
}

static class ActivityTime {
  public ActivityTime(long startTime, long timeOutTime) {
    this.startTime = startTime;
    this.scheduledTimeout = timeOutTime;
  }
  long startTime;
  long scheduledTimeout;
}
{code}
I agree it will be challenging to work out when to call this from readers etc. and how to estimate completeness, but as a general utility class (as you suggest, in o.a.l.util) it seems like a useful addition. My suspicion is that this is currently contrib - but then TimeLimitingCollector is currently in core. Maybe TimeLimitingCollector could be rewritten to use ATM, and then we maintain a common generally reusable implementation? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0, 2.9.1 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 When working on the dynamic proxy classes using cglib/javassist I recognized a problem in the caching code inside AttributeSource: - AttributeSource has a static (!) cache map that holds implementation classes for attributes, to be faster on creating new attributes (reflection cost) - AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl Also: - VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API Both have the problem that this strong reference is inside Lucene's classloader and so persists as long as Lucene lives. The referenced classes can therefore never be unloaded, which would be fine if all lived in the same classloader. As soon as the Attribute or implementation class or the subclass of the deprecated API is loaded by a different classloader (e.g. Lucene lives in the bootclasspath of Tomcat, but the lucene-consumer with custom attributes lives in a webapp), they can never be unloaded, because a reference exists. Libs like cglib or Javassist or the JDK's reflect.Proxy have a similar cache for generated class files. They also manage this with a WeakHashMap. The cache will always work perfectly and no class will be evicted without reason, as classes are only unloaded when the classloader goes away, and this will only happen on request (e.g. by Tomcat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2260: -- Attachment: LUCENE-2260.patch Attached patch. I will commit this in a day and also merge to 2.9 and 3.0 (without VirtualMethod), as this is a resource leak. This problem is similar to LUCENE-2182. AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.1, 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2260.patch When working on the dynamic proxy classes using cglib/javassist I recognized a problem in the caching code inside AttributeSource: - AttributeSource has a static (!) cache map that holds implementation classes for attributes, to be faster on creating new attributes (reflection cost) - AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl Also: - VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API Both have the problem that this strong reference is inside Lucene's classloader and so persists as long as Lucene lives. The referenced classes can therefore never be unloaded, which would be fine if all lived in the same classloader. As soon as the Attribute or implementation class or the subclass of the deprecated API is loaded by a different classloader (e.g. Lucene lives in the bootclasspath of Tomcat, but the lucene-consumer with custom attributes lives in a webapp), they can never be unloaded, because a reference exists. Libs like cglib or Javassist or the JDK's reflect.Proxy have a similar cache for generated class files. They also manage this with a WeakHashMap. The cache will always work perfectly and no class will be evicted without reason, as classes are only unloaded when the classloader goes away, and this will only happen on request (e.g. by Tomcat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832483#action_12832483 ] Mark Harwood commented on LUCENE-1720: -- Agreed, it might be useful to provide a boolean response to the progress method - a kind of "how am I doing?" check. We can always provide a convenience wrapper method which throws an exception: ATM.blowUpIfNotGoingFastEnough(float progress) Re TimeLimitingCollector - agreed, you really do need to protect ATM start/stop calls in the same try...finally block. Maybe ATM could have a start method variant that takes an additional alreadyRunningSince argument, as opposed to the existing assumption that the activity is starting right now. The first collect could then call this with a timestamp initialised in the constructor. Even then, there is the issue of where to put the stop call - the collector has no close() call to signal the end of the activity. Doesn't seem like TimeLimitingCollector can be based on the same ATM code. Shame. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832491#action_12832491 ] Shai Erera commented on LUCENE-1720: bq. We can always provide a convenience wrapper method which throws an exception : ATM.blowUpIfNotGoingFastEnough(float progress) Right. Although I'm thinking, for the sake of keeping the API simple and back-compat sane, it's easy enough for the caller to do:
{code}
if (!ATM.checkProjectedTimeoutOnThisThread(progress)) {
  throw new ActivityTimedOutException("terminating due to projected timeout. progress = " + progress);
}
{code}
bq. The first collect Note: that means that every collect will need to check if it's the first. I think we should leave TLC out of this for now. Maybe w/ TimeLimitingIndexReader, which has finer granularity, TLC won't be used at all. At least, I definitely don't see why one would use both... About the packages - so we agree that ATM and the exception move to o.a.l.util, and the reader stays in index (like IndexReader)? And I don't think this warrants a contrib package, unless we move TLC there as well. But this seems like a useful Lucene core utility. So that we don't overwrite each other's changes - are you going to add the projected method to the patch? If so, can you also reorder the packages and remove the TreeMap comment? If not, I can do it. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2260: -- Attachment: LUCENE-2260.patch Improved patch; now all class references are weak. The assumption that the WeakReference inside addAttributeImpl is always != null holds, because the code keeps a strong reference to the implementing class. AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.1, 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2260.patch, LUCENE-2260.patch When working on the dynamic proxy classes using cglib/javassist I recognized a problem in the caching code inside AttributeSource: - AttributeSource has a static (!) cache map that holds implementation classes for attributes, to be faster on creating new attributes (reflection cost) - AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl Also: - VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API Both have the problem that this strong reference is inside Lucene's classloader and so persists as long as Lucene lives. The referenced classes can therefore never be unloaded, which would be fine if all lived in the same classloader. As soon as the Attribute or implementation class or the subclass of the deprecated API is loaded by a different classloader (e.g. Lucene lives in the bootclasspath of Tomcat, but the lucene-consumer with custom attributes lives in a webapp), they can never be unloaded, because a reference exists. Libs like cglib or Javassist or the JDK's reflect.Proxy have a similar cache for generated class files. They also manage this with a WeakHashMap. The cache will always work perfectly and no class will be evicted without reason, as classes are only unloaded when the classloader goes away, and this will only happen on request (e.g. by Tomcat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
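The scheme the patch describes - static caches whose keys and values are both weakly referenced, so a foreign classloader can be collected - looks roughly like this sketch; the class and field names are illustrative, not the actual AttributeSource internals:
{code}
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// A WeakHashMap alone is not enough: a strongly referenced value that
// points back at the key's class would keep the key reachable forever.
// Wrapping the value in a WeakReference breaks that cycle, so when the
// webapp classloader goes away, both entries become collectable.
final class AttributeClassCache {
  private static final Map<Class<?>, WeakReference<Class<?>>> cache =
      Collections.synchronizedMap(new WeakHashMap<Class<?>, WeakReference<Class<?>>>());

  static Class<?> lookup(Class<?> attClass) {
    WeakReference<Class<?>> ref = cache.get(attClass);
    return ref == null ? null : ref.get();
  }

  static void store(Class<?> attClass, Class<?> implClass) {
    cache.put(attClass, new WeakReference<Class<?>>(implClass));
  }
}
{code}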
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832500#action_12832500 ] Mark Harwood commented on LUCENE-1720: -- I'll pick this up. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832501#action_12832501 ] Shai Erera commented on LUCENE-1720: BTW, perhaps to allow even finer granularity and precision, we should move to use nanoTime()? The given timeout can be in ms, but the TimeoutThread can work w/ nanoTime() ... what do you think? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
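If ATM does move to nanoTime(), one detail is worth a sketch: nanoTime() values may wrap, so deadlines should be compared by difference against zero, never directly. An illustrative helper, not part of any patch on this issue:
{code}
import java.util.concurrent.TimeUnit;

final class Deadline {
  private final long deadlineNanos;

  Deadline(long timeoutMs) {
    // Caller-facing API stays in milliseconds; internal bookkeeping in nanos.
    deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
  }

  boolean expired() {
    // Compare the difference against zero: a direct
    // "System.nanoTime() >= deadlineNanos" test misbehaves near wrap-around.
    return System.nanoTime() - deadlineNanos >= 0;
  }
}
{code}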
[jira] Resolved: (LUCENE-2080) Improve the documentation of Version
[ https://issues.apache.org/jira/browse/LUCENE-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-2080. - Resolution: Fixed Committed revision 908975. Improve the documentation of Version Key: LUCENE-2080 URL: https://issues.apache.org/jira/browse/LUCENE-2080 Project: Lucene - Java Issue Type: Task Components: Javadocs Reporter: Robert Muir Assignee: Robert Muir Priority: Trivial Fix For: 2.9.2, 3.1, 3.0 Attachments: LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch In my opinion, we should elaborate more on the effects of changing the Version parameter. Particularly, changing this value, even if you recompile your code, likely involves reindexing your data. I do not think this is adequately clear from the current javadocs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: [jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
Hi Eks Dev, thanks for the comment! I like this: If you need to use this, nothing is simpler; you do not even need pair comparison (aka traversal), just index terms split into bigrams and search with a standard Query. And, I need to study all the tricks of Automaton... -Fuad -Original Message- From: Eks Dev (JIRA) [mailto:j...@apache.org] Sent: February-11-10 5:04 AM To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832424#action_12832424 ] Eks Dev commented on LUCENE-2089: - {quote} What about this, http://www.catalysoft.com/articles/StrikeAMatch.html it seems logically more appropriate to (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance faster? {quote} Is it only me who sees plain, vanilla bigram distance here? What is new or better in StrikeAMatch compared to the first phase of the current SpellChecker (feeding PriorityQueue with candidates)? If you need to use this, nothing is simpler; you do not even need pair comparison (aka traversal), just index terms split into bigrams and search with a standard Query. The Automaton trick is a neat one. Imo, the only thing that would work better is to make the term dictionary a real trie (ternary, n-ary, dfa, makes no big diff). Making TermDict some sort of trie/dfa would permit smart beam-search, even without compiling a query DFA. Beam search also makes implementation of better distances possible (weighted edit distance without the metric constraint). I guess this is going to be possible with Flex; Mike was already talking about a DFA Dictionary :) It took a while to figure out the trick Robert pulled here, treating the term dictionary as another DFA due to the sortedness, nice. explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm). We can optimize fuzzyquery by using AutomatonTermsEnum; here is my idea: * up front, calculate the maximum required K edits needed to match the user's supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg; at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. I modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNNNNNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| As you can see, this prototype is no good yet, because it creates the DFA in a slow way.
Right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with Lucene, and here you can see the TermEnum is fast (AvgMS - AvgMS_DFA); there is no problem there. Instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization; if someone wants to implement this, they should not worry about minimization. In fact, we need to determine at some point whether AutomatonQuery should minimize FSMs at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the Dumb vs Smart heuristic, and this can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. --
FieldCacheImpl concurrency
Hi, I would like to try and improve concurrency in Lucene in several places, and thought I would start with FieldCacheImpl. The implementation is heavily synchronized on both a global map and on creation of values for a pretty heavily used path (get). I think the weak hash map is a good solution (but also think that the ability to choose a soft hash map would add even greater flexibility); the problem is that there is no built-in concurrent weak/soft hash map. I have built several of these over the course of my Java life, but now, with Google Collections, I really don't think anyone should build anything like this anymore. I would like to take on the task of doing this; I wanted to first check if it is possible for Lucene to have a dependency on Google Collections? It's Apache licensed, and developed in cooperation with Doug Lea, and basically considered a valid java.util extension with possible inclusion of some of the work into future JDK versions (and who knows when those will come :) ). -shay.banon
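For reference, this is the sort of map Google Collections (1.0) can build with its MapMaker API; whether FieldCache would want exactly these reference strengths is an open question, so treat the wiring as an assumption:
{code}
import java.util.concurrent.ConcurrentMap;
import com.google.common.collect.MapMaker;

final class WeakSoftCacheExample {
  // Concurrent map with weakly referenced keys (an entry vanishes once its
  // key, e.g. an IndexReader, is otherwise unreachable) and softly
  // referenced values (the VM may evict cached arrays under memory pressure).
  static final ConcurrentMap<Object, Object> CACHE = new MapMaker()
      .weakKeys()
      .softValues()
      .makeMap();
}
{code}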
[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2260: -- Attachment: LUCENE-2260-lucene29.patch Patch for the 2.9 branch (without Java 5 generics) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.1, 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2260-lucene29.patch, LUCENE-2260.patch, LUCENE-2260.patch When working on the dynamic proxy classes using cglib/javassist, I recognized a problem in the caching code inside AttributeSource: - AttributeSource has a static (!) cache map that holds implementation classes for attributes, to be faster on creating new attributes (reflection cost) - AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl Also: - VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API Both have the problem that this strong reference is inside Lucene's classloader and so persists as long as Lucene lives. The classes referenced can therefore never be unloaded, which would be fine if all live in the same classloader. As soon as the Attribute or implementation class or the subclass of the deprecated API is loaded by a different classloader (e.g. Lucene lives in the boot classpath of Tomcat, but the Lucene consumer with custom attributes lives in a webapp), they can never be unloaded, because a reference exists. Libs like CGLIB or Javassist or the JDK's reflect.Proxy have a similar cache for generated class files. They also manage this with a WeakHashMap. The cache will always work perfectly and no class will be evicted without reason, as classes are only unloaded when the classloader goes away, and this will only happen on request (e.g. by Tomcat).
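The fix described amounts to keying the static caches weakly by Class. A minimal sketch of the pattern (class and method names hypothetical), with the usual caveat that if the cached value strongly references the key Class, the weak key is kept reachable and the classloader still leaks -- which is exactly the kind of reference chain this issue is about:

{code}
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class ClassCache<V> {
  // WeakHashMap holds its keys weakly: once nothing else references the
  // Class (e.g. the webapp's classloader was dropped), the entry goes away.
  private final Map<Class<?>, V> cache =
      Collections.synchronizedMap(new WeakHashMap<Class<?>, V>());

  public V get(Class<?> clazz) { return cache.get(clazz); }

  // Caution: if value strongly references clazz (or another class from the
  // same loader), the weak key stays reachable and the loader still leaks.
  public void put(Class<?> clazz, V value) { cache.put(clazz, value); }
}
{code}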
Re: FieldCacheImpl concurrency
On Thu, Feb 11, 2010 at 9:54 AM, Shay Banon kim...@gmail.com wrote: I would like to try and improve concurrency in Lucene in several places, and thought I would start with FieldCacheImpl. The implementation is heavily synchronized on both a global map and on creation values for a pretty heavily used path (get). It really shouldn't be heavily used. For a sorted search, get() is called once per segment in the index. There is no synchronization to retrieve per-document values. We have just barely moved to Java5 though, and so it would probably be pretty easy to improve the concurrency of the read path if it did become problematic. I think the weak hash map is a good solution (but also think that the ability to choose to use soft hash map would add even greater flexibility), the problem is that there is no built in concurrent weak/soft hash map. It might be a better idea to have interfaces that allow one to implement their own policies rather than to try and push all the different options into Lucene: https://issues.apache.org/jira/browse/LUCENE-831 -Yonik http://www.lucidimagination.com
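One possible Java 5 shape for the improved read path Yonik alludes to (a sketch, not FieldCacheImpl's actual code): reads hit a ConcurrentHashMap with no locking, and only a miss pays for creation, accepting that two racing threads may occasionally create the same value twice before putIfAbsent picks a winner.

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SegmentValueCache<K, V> {
  public interface Factory<K, V> { V create(K key); }

  private final ConcurrentMap<K, V> map = new ConcurrentHashMap<K, V>();

  // get() never blocks on a hit; on a miss, putIfAbsent resolves races
  // by keeping whichever value landed first and discarding the loser.
  public V get(K key, Factory<K, V> factory) {
    V value = map.get(key);
    if (value == null) {
      V created = factory.create(key);
      V raced = map.putIfAbsent(key, created);
      value = (raced != null) ? raced : created;
    }
    return value;
  }
}
{code}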
[jira] Created: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size - Key: LUCENE-2261 URL: https://issues.apache.org/jira/browse/LUCENE-2261 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Priority: Minor Fix For: Flex Branch, 3.1 MultiTermQuery has a TopTermsScoringBooleanRewrite that uses a priority queue to expand the query to the top-N terms. currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy query to at most only the 50 closest terms. at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite (it is private right now) and add a ctor to it, so a MultiTermQuery can instantiate one with its own limit.
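The mechanism behind the rewrite, stripped of Lucene specifics (a sketch; the class name is illustrative, not the patch's code): a size-bounded priority queue keeps the N best-scoring terms, with the current worst at the head so eviction is O(log N).

{code}
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopTermsCollector {
  static final class ScoredTerm {
    final String term; final float score;
    ScoredTerm(String term, float score) { this.term = term; this.score = score; }
  }

  private final int n;                       // replaces the hardcoded limit
  private final PriorityQueue<ScoredTerm> pq;

  public TopTermsCollector(int n) {
    this.n = n;
    // min-heap on score: the worst of the current top-N sits at the head
    this.pq = new PriorityQueue<ScoredTerm>(n, new Comparator<ScoredTerm>() {
      public int compare(ScoredTerm a, ScoredTerm b) {
        return Float.compare(a.score, b.score);
      }
    });
  }

  public void collect(String term, float score) {
    if (pq.size() < n) {
      pq.add(new ScoredTerm(term, score));
    } else if (score > pq.peek().score) {
      pq.poll();                             // evict the current worst
      pq.add(new ScoredTerm(term, score));
    }
  }
}
{code}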
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: Lucene-1720.patch Moved ATM to the o.a.l.util package; added an isProjectedToTimeout method to ATM and a corresponding JUnit test; removed the treemap comments. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing) Uses a new utility timeout class that is independent of IndexReader. The initial contribution includes a performance test class, but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.
[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2261: Attachment: LUCENE-2261.patch attached is a patch. i changed FuzzyQuery's pq test to use this so it does not have to do the try-finally dance with BooleanQuery.maxClauseCount.
[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832607#action_12832607 ] Uwe Schindler commented on LUCENE-2261: --- Patch looks good; some things because of Serializable: - The readResolve method must go to the singleton constant, which should also throw UOE when modified - equals/hashCode is needed for the rewrite mode, else FuzzyQuery & Co. would no longer compare equal It could be solved by doing it like AutoRewrite and its unmodifiable constant. I know: Queries are a pain because of Serializable. +1 on adding a param to the FuzzyQuery ctor
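Uwe's readResolve point, illustrated with a generic sketch of the serializable-singleton idiom (class name hypothetical, not the patch): deserialization would otherwise create a second instance of the rewrite constant, breaking == comparisons, and readResolve collapses it back. Once the rewrite carries per-instance state (the configurable pq size), a singleton no longer works and equals/hashCode must be defined instead, which is his second point.

{code}
import java.io.ObjectStreamException;
import java.io.Serializable;

public final class ExampleRewrite implements Serializable {
  public static final ExampleRewrite INSTANCE = new ExampleRewrite();

  private ExampleRewrite() {}

  // Called during deserialization: return the canonical constant instead
  // of the freshly deserialized copy, so identity comparisons keep working.
  private Object readResolve() throws ObjectStreamException {
    return INSTANCE;
  }
}
{code}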
[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2261: Attachment: LUCENE-2261.patch the previous patch was wrong for readResolve, here is a fix.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832638#action_12832638 ] Jason Rutherglen commented on LUCENE-1720: -- When's this ready to test with Solr?
[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2261: Attachment: LUCENE-2261.patch here is a patch with Uwe's suggestions. now that FuzzyQuery supports the param via its ctor, the singleton TopTerms is no longer used, so i removed it.
[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2261: Attachment: LUCENE-2261.patch sorry, i created a javadoc warning (the TOP_TERMS singleton was referenced at the top of the MTQ javadocs). here is the fix.
[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832683#action_12832683 ] Uwe Schindler commented on LUCENE-2261: --- Looks good at first glance. I have not tried the patch, but will do soon.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832687#action_12832687 ] Robert Muir commented on LUCENE-2089: - bq. Imo, the only thing that would work better is to make term dictionary real trie (ternary, n-ary, dfa, makes no big diff). yeah I think a real DFA would be more efficient, it's too much to think about. I guess this is why we leave this kind of thing to Mike :) I haven't grasped how this beam search would work with a DFA dictionary, although it definitely sounds cool. I assume you mean by weighted edit distance that the transitions in the state machine would have costs? If this is the case, couldn't we even define standard Levenshtein very easily (instead of nasty math), and would the beam search technique enumerate efficiently for us?
[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit
[ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832689#action_12832689 ] Tom Burton-West commented on LUCENE-2257: - Hi Michael, Thanks for your help. We mounted one of the indexes with 2.4 billion terms on our dev server and tested with and without the patch. (I discovered that queries containing Korean characters would consistently trigger the bug.) With the patch, we don't see any ArrayIndexOutOfBounds exceptions. We are going to do a bit more testing before we put this into production. (We rolled back our production indexes temporarily to indexes that split the terms over 2 segments and therefore didn't trigger the bug.) Other than walking through the code in the debugger, is there some systematic way of looking for any other places where an int is used that might also have problems when we have over 2.1 billion terms? Tom relax the per-segment max unique term limit --- Key: LUCENE-2257 URL: https://issues.apache.org/jira/browse/LUCENE-2257 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2257.patch Lucene can't handle more than 2.1B (the limit of a signed 32 bit int) unique terms in a single segment. But I think we can improve this to termIndexInterval (default 128) * 2.1B. There is one place (internal API only) where Lucene uses an int but should use a long.
[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit
[ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832698#action_12832698 ] Robert Muir commented on LUCENE-2257: - bq. (I discovered that queries containing Korean characters would consistently trigger the bug). this makes sense, because Hangul is sorted towards the end of the term dictionary; you can see this visually here: http://unicode.org/roadmaps/bmp/
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832721#action_12832721 ] Mark Harwood commented on LUCENE-1720: -- bq. When's this ready to test with Solr? I think the API is pretty stable - call try..start..finally...stop around time-critical stuff and use a TimeLimitedIndexReader to wrap your IndexReader. Internally the implementation feels reasonably stable too. In my tests it doesn't seem to add too much overhead to calls - I was getting response times of 3400 milliseconds on a heavy Wikipedia query with TimeLimitedIndexReader versus 3300 for the same query on a raw IndexReader without timeout protection. I'm tempted to put the timeout check calls directly into a version of IndexReader rather than in a delegating reader wrapper, just to see if the wrapper code is where the bulk of the extra overhead comes in. I'd hate to add any overhead to core IndexReader, but I'm keen to see just how low-cost this check can get.
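From the try..start..finally...stop description, usage would look roughly like the sketch below. The ActivityTimeMonitor signatures (a millisecond budget passed to start(), stop() clearing the deadline) and the wrapping constructor are assumptions inferred from this thread, not taken from the patch; per the attachments, an ActivityTimedOutException presumably aborts the call once the deadline passes.

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class TimedSearchExample {
  public static Document firstHit(Directory dir, Query query) throws Exception {
    // wrap the reader so *all* reader activity is subject to the deadline
    IndexReader reader = new TimeLimitedIndexReader(IndexReader.open(dir));
    IndexSearcher searcher = new IndexSearcher(reader);
    try {
      ActivityTimeMonitor.start(2000);   // assumed: time budget in ms
      TopDocs hits = searcher.search(query, 10);
      // unlike TimeLimitedCollector, document retrieval is covered too
      return hits.scoreDocs.length > 0 ? searcher.doc(hits.scoreDocs[0].doc) : null;
    } finally {
      ActivityTimeMonitor.stop();        // assumed: clears this thread's deadline
    }
  }
}
{code}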
[jira] Updated: (LUCENE-2257) relax the per-segment max unique term limit
[ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2257: --- Attachment: LUCENE-2257.patch
[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit
[ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832729#action_12832729 ] Michael McCandless commented on LUCENE-2257: bq. With the patch, we don't see any ArrayIndexOutOfBounds exceptions. Great! And the results look correct? bq. Other than walking through the code in the debugger, is there some systematic way of looking for any other places where an int is used that might also have problems when we have over 2.1 billion terms? Not that I know of! The code that handles the term dict lookup is fairly contained, in TermInfosReader and SegmentTermEnum. I think scrutinizing the code and testing (as you're doing) is the only way. I just looked again -- there are a few places where int is still being used. First, two methods (get(int position) and scanEnum) in TermInfosReader are actually dead code (package-private and unused). Second is int SegmentTermEnum.scanTo, but this is fine because it's never asked to scan more than termIndexInterval terms. I'll attach a patch that additionally just removes that dead code.
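For anyone auditing for the class of bug Tom is hunting, the usual shape is int arithmetic that overflows before being widened to long. A self-contained illustration (the numbers are hypothetical):

{code}
public class TermOrdinalOverflow {
  public static void main(String[] args) {
    int termIndexInterval = 128;
    int indexedTerms = 20000000;     // hypothetical count of index terms

    long broken = indexedTerms * termIndexInterval;        // int math wraps first
    long fixed = (long) indexedTerms * termIndexInterval;  // widen, then multiply

    System.out.println(broken);  // -1734967296: wrapped past 2^31 - 1
    System.out.println(fixed);   // 2560000000
  }
}
{code}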
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832741#action_12832741 ] Eks Dev commented on LUCENE-2089: - {quote} I assume you mean by weighted edit distance that the transitions in the state machine would have costs? {quote} Yes, kind of; not embedded in the trie, just defined externally. What I am talking about is a part of the noisy channel approach, modeling only the channel distribution. Have a look at http://norvig.com/spell-correct.html for the basic theory. I am suggesting almost the same, just applied at the character level and without the language model part. It is rather easy once you have your dictionary in some sort of tree structure. You guide your traversal over the trie by iterating on each char in your search term, accumulating log probabilities of single transformations (recycling the prefix part). When you hit a leaf, insert into a PriorityQueue of appropriate depth. What I mean by probabilities of single transformations is defined as: insertion(char a) // map char -> log probability (think of it as a kind of cost of inserting this particular character) deletion(char a) // map char -> log probability... transposition(char a, char b) replacement(char a, char b) // 2D matrix char,char -> probability (cost) If you wish, you could even add some positional information, boosting matches at the start/end of the string. I avoided the tricky mechanics of traversal, insertion, deletion, but on a trie you can do it by following different paths... The only good implementation (in memory) around that I know of is in the LingPipe spell checker (they implement the full noisy channel, with a language model driving traversal)... it has huge educational value, Bob is really great at explaining things. The code itself is proprietary. I would suggest you peek into this code to see this 2-minute rambling I wrote here properly explained :) Just ignore the language model part and assume you have a NULL language model (all chars in the language are equally probable), doing a full traversal over the trie. {quote} If this is the case couldn't we even define standard levenshtein very easily (instead of nasty math), and would the beam search technique enumerate efficiently for us? {quote} Standard Lev. is trivially configured once you have this; it is just setting all these costs to 1 (delete, insert... in the log domain)... But who would use standard distance with such a beast, reducing the impact of inserting/deleting a silent h as in Thomas / Tomas... Enumeration is trie traversal, practically calculating the distance against all terms at the same time and collecting the N best along the way. The place where you save your time is recycling the prefix part in this calculation. Enumeration is optimal, as the trie contains only the terms from the termDict; you are not trying all possible alphabet characters, and you can implement early path abandoning easily, either by cost (log probability) or/and by limiting the number of successive insertions. If interested in really in-depth things, look at http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 Great book (another great tip from b...@lingpipe). A bit strange with terminology (at least to me), but once you get used to it, it is really worth the time you spend trying to grasp it.
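The externally defined cost tables Eks Dev describes might look like the sketch below (names and cost values are hypothetical, not LingPipe's code): per-character negative-log-probability costs consulted during trie traversal. Setting every cost to 1 recovers standard Levenshtein, which is his point about how cheap the standard distance is to configure once this machinery exists.

{code}
import java.util.HashMap;
import java.util.Map;

public class EditCosts {
  private static final double DEFAULT = 1.0;
  private final Map<Character, Double> insertion = new HashMap<Character, Double>();
  private final Map<Character, Double> deletion = new HashMap<Character, Double>();
  private final Map<String, Double> replacement = new HashMap<String, Double>();

  public EditCosts() {
    // hypothetical channel model: a silent 'h' is cheap to insert or
    // delete, so "Thomas" and "Tomas" end up close together
    insertion.put('h', 0.2);
    deletion.put('h', 0.2);
    replacement.put("ae", 0.3); // common a/e confusion is cheap too
  }

  public double insert(char c) {
    Double d = insertion.get(c);
    return d != null ? d : DEFAULT;
  }

  public double delete(char c) {
    Double d = deletion.get(c);
    return d != null ? d : DEFAULT;
  }

  public double replace(char a, char b) {
    if (a == b) return 0.0;              // a match costs nothing
    Double d = replacement.get("" + a + b);
    return d != null ? d : DEFAULT;
  }
}
{code}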
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832773#action_12832773 ] Aaron McCurry commented on LUCENE-2089: --- I have written a Levenshtein generator today that seems to operate similarly to what is being discussed here. It generates all the possible matches to the Levenshtein algorithm given a term and a character set, then creates a BooleanQuery from it. For a given term with an edit distance of 1 or 2 it is extremely fast. I tested it on my dev data that has about 8 billion documents with 20 shards; each shard has about 170,000,000 terms in the field that I'm testing. The normal fuzzy query with a term length of 8 and an edit distance of 2 took about 110 seconds to complete, and the auto-generated query took around 1.5 seconds to complete. However this method will probably only work with edit distances in the 1 and 2 range, because once I hit 3 it spiked the memory usage and slowed way down (to be expected). Not sure if you all want to take a look at my code or not, but if you want me to post it I will.
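For reference, the generation approach Aaron describes is essentially Norvig-style edit-1 expansion restricted to a known alphabet (a sketch, not his code; transpositions omitted for brevity, and each generated string would become one TermQuery clause in the BooleanQuery). Over a 36-character alphabet a five-letter term yields a few hundred neighbors, versus the hundreds of thousands for the full Unicode alphabet raised in the next comment.

{code}
import java.util.HashSet;
import java.util.Set;

public class OneEditNeighbors {
  // "know your data": restrict generation to the characters actually indexed
  static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

  public static Set<String> neighbors(String term) {
    Set<String> out = new HashSet<String>();
    for (int i = 0; i < term.length(); i++) {
      out.add(term.substring(0, i) + term.substring(i + 1));          // delete char i
      for (int c = 0; c < ALPHABET.length(); c++) {
        char ch = ALPHABET.charAt(c);
        out.add(term.substring(0, i) + ch + term.substring(i + 1));   // substitute at i
        out.add(term.substring(0, i) + ch + term.substring(i));       // insert before i
      }
    }
    for (int c = 0; c < ALPHABET.length(); c++) {
      out.add(term + ALPHABET.charAt(c));                             // append at end
    }
    return out;
  }
}
{code}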
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832785#action_12832785 ] Robert Muir commented on LUCENE-2089: - Aaron, i think generation may pose a problem for a full unicode alphabet. e.g. an edit distance of 1 on "otter" is 720,891 strings... this is a large boolean query! so this is where the ping-ponging of automatonquery really helps: we don't have to do 720,891 term lookups, because the term dictionary is in sorted order, so it's treated like DFA intersection. btw, you can test this patch with your 170M terms by applying this patch to the flex branch, and using the following code.
{code}
LevenshteinAutomata a = new LevenshteinAutomata("otter"); // searching on "otter"
Automaton lev1 = a.toAutomaton(1); // edit distance of 1
Query query = new AutomatonQuery(new Term("field", "bogus"), lev1);
...
{code}
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832789#action_12832789 ] Robert Muir commented on LUCENE-2089: - bq. But who would use standard distance with such a beast, reducing impact of inserting/deleting silent h as in Thomas Tomas... hehe, well my only goal with this issue (and really automaton in general) is to speed up in a backwards compatible way, so I am stuck with standard distance for that purpose. but the more intelligent stuff you speak of could be really cool, esp. for spellchecking. sure you don't want to rewrite our spellchecker? btw it's not clear to me yet: could you implement that stuff on top of our ghetto DFA (the sorted terms dict we have now), or is something more sophisticated needed? it's a lot easier to write this stuff now with the flex MTQ apis :)
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832794#action_12832794 ] Aaron McCurry commented on LUCENE-2089: --- Excuse my ignorance, but what is a DFA?
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832798#action_12832798 ] Aaron McCurry commented on LUCENE-2089: --- {quote} Aaron i think generation may pose a problem for a full unicode alphabet. e.g. edit distance of 1 on otter is 720,891 strings... this is a large boolean query! {quote} So I have to ask, who has the entire unicode alphabet indexed in a single field? I don't, and I am willing to make tradeoffs in my system for performance. So for instance, I am willing to code a mapping file or scan my index on startup to know the character set (a-z 0-9 or whatever) to reduce the possibilities. This is a simple know your data problem, at least in IMHO. Aaaron
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832798#action_12832798 ]

Aaron McCurry edited comment on LUCENE-2089 at 2/12/10 2:22 AM:
---

{quote}
Aaron i think generation may pose a problem for a full unicode alphabet. e.g. edit distance of 1 on otter is 720,891 strings... this is a large boolean query!
{quote}

So I have to ask: who has the entire Unicode alphabet indexed in a single field? I don't, and I am willing to make tradeoffs in my system for performance. For instance, I am willing to code a mapping file, or scan my index on startup, to learn the character set (a-z, 0-9, or whatever) and reduce the possibilities. This is a simple know-your-data problem, at least IMHO.

Aaron

explore using automaton for fuzzyquery
--

Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java

Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm). We can optimize FuzzyQuery by using AutomatonTermsEnum. Here is my idea:
* Up front, calculate the maximum number of edits K needed to match the user's supplied float threshold (a worked example of this calculation follows after this message).
* For small common edit distances up to some max K (1, 2, 3, etc.), we should create a DFA for each edit distance. If the required edit distance is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on average: at a high edit distance, you will typically fill the pq very quickly, since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode.

I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNNNNNN (seven random digits), etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS(new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

As you can see, this prototype is no good yet, because it creates the DFA in a slow way: right now it creates an NFA, and all the wasted time is in NFA-to-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with Lucene, and here you can see the TermEnum is fast (AvgMS minus AvgMS_DFA); there is no problem there. Instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
We can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

The paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization; if someone wants to implement this, they should not worry about minimization. In fact, we need to determine at some point whether AutomatonQuery should minimize FSMs at all, or whether it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the Dumb vs. Smart heuristic, and this can be rewritten as a summation easily.) We need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out whether minimization is even helping right now.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
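The first bullet in the issue description (turning the float threshold into a concrete edit budget K) is a one-line calculation worth making explicit. A minimal sketch, assuming the classic FuzzyQuery-style definition similarity = 1 - editDistance / termLength (the exact formula in the patch may differ; class and method names here are invented for illustration):

{code}
// Hypothetical helper: largest edit distance k that still satisfies a fuzzy
// similarity threshold, assuming  similarity = 1 - k / length.
public final class MaxEditsCalculator {
  private MaxEditsCalculator() {}

  /** Largest k with 1 - k/length >= minSimilarity. */
  public static int maxEdits(String term, float minSimilarity) {
    int length = term.length();
    // Solve 1 - k/length >= minSimilarity for the largest integer k.
    return (int) Math.floor(length * (1.0 - minSimilarity));
  }

  public static void main(String[] args) {
    // "otter" (length 5) with the default 0.5 threshold allows 2 edits:
    // 1 - 2/5 = 0.6 >= 0.5, but 1 - 3/5 = 0.4 < 0.5.
    System.out.println(maxEdits("otter", 0.5f)); // prints 2
  }
}
{code}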
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832814#action_12832814 ]

Robert Muir commented on LUCENE-2089:
---

bq. So I have to ask, who has the entire unicode alphabet indexed in a single field? I don't, and I am willing to make tradeoffs in my system for performance.

Aaron, would you mind testing this patch with the flex branch (with the sample code I listed before)? There is no tradeoff needed to support Unicode. If you don't have Chinese chars in your index, the enum will not seek to them but will skip past them, as they do not exist, and Term(s)Enum always returns things in sorted order. For more details, see AutomatonTermsEnum in flex (this is the code that actually does the DFA-to-DFA intersection of any Automaton with Lucene's term dictionary).
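To make the "skip past terms that do not exist" point concrete, here is a toy, self-contained sketch of seek-based intersection with a sorted term dictionary. The real implementation is AutomatonTermsEnum in the flex branch; this sketch substitutes an invented hard-coded acceptor and a TreeSet for the term dictionary, but the enumeration strategy (accept, else seek forward in sorted order) is the same idea:

{code}
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy seek-based intersection of an "automaton" with a sorted term dictionary.
// The TreeSet stands in for Lucene's sorted terms; the acceptor is a toy
// (terms starting with 'o' or 't'). AutomatonTermsEnum computes its seek
// targets from the DFA itself rather than from a hard-coded table like this.
public class SeekIntersectionDemo {

  // Toy acceptor: pretend the fuzzy DFA matches terms starting with 'o' or 't'.
  static boolean accepts(String term) {
    return term.startsWith("o") || term.startsWith("t");
  }

  // Smallest first character >= c that could begin a match, or -1 if none.
  static int nextMatchingFirstChar(char c) {
    if (c <= 'o') return 'o';
    if (c <= 't') return 't';
    return -1;
  }

  public static void main(String[] args) {
    NavigableSet<String> terms = new TreeSet<>(
        Arrays.asList("apple", "banana", "otter", "oyster", "tiger", "zebra", "中文"));
    String term = terms.first();
    while (term != null) {
      if (accepts(term)) {
        System.out.println("match: " + term);
        term = terms.higher(term); // next term in sorted order
      } else {
        int c = nextMatchingFirstChar(term.charAt(0));
        if (c == -1) break;        // sorted order: no later term can match
        // Seek: jump directly to the next region that could match, skipping
        // everything in between ("banana" is never visited, nor is "中文").
        term = terms.ceiling(String.valueOf((char) c));
      }
    }
  }
}
{code}

This is why indexing non-Latin terms costs nothing for a query whose automaton never accepts them: the enum seeks over whole regions of the sorted dictionary instead of testing every term.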
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832817#action_12832817 ]

Robert Muir commented on LUCENE-2089:
---

bq. Excuse my ignorance, but what is a DFA?

Here is the Wikipedia page with a description: http://en.wikipedia.org/wiki/Deterministic_finite-state_machine
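For readers following along: a DFA (deterministic finite automaton) is just a set of states plus a transition function with no choice points, so matching is a single linear pass over the input. A throwaway example (states and alphabet invented for illustration, nothing to do with the fuzzy automata above) that recognizes binary strings with an even number of 1s:

{code}
// Minimal DFA example: accepts binary strings containing an even number of '1's.
// Two states: 0 = even count (the start state, also accepting), 1 = odd count.
public class EvenOnesDfa {
  static boolean accepts(String input) {
    int state = 0; // start in the accepting "even" state
    for (char c : input.toCharArray()) {
      if (c == '1') state = 1 - state;      // a '1' flips the parity
      else if (c != '0') return false;      // reject anything outside {0,1}
    }
    return state == 0; // accept iff we end in the "even" state
  }

  public static void main(String[] args) {
    System.out.println(accepts("1010")); // true  (two 1s)
    System.out.println(accepts("101"));  // false (three 1s)
  }
}
{code}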
[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Toke Eskildsen updated LUCENE-1990:
---

Attachment: LUCENE-1990-te20100212.patch

I've read through the comments on LUCENE-1990 and implemented most of what has been suggested. The attached patch contains implementations for all the variants we've talked about, including aligned. There's a known bug in persistence for aligned64 (and probably also for aligned32) that I haven't stomped yet. There's also a clear need for a more elaborate unit test with regard to persistence.

Other outstanding issues, as I see them, are whether or not mutable packed arrays should be requestable (as general-purpose data structures) and how the factory for creating a writer should work. I have added a getMutable method to the factory and not touched the return type Reader of the getReader method. That way read-only users will not be tempted to try and update the received structure. As for the arguments to the factory, Michael McCandless suggested that the preferences should be expressed with (packed | aligned32 | aligned64 | auto). As far as I can see, this should work. However, I've only just reached this conclusion and haven't had the time to implement it.

A speed test has been added and the results from my machine can be seen below. In order for it to be really usable, it should be tried on other machines too. I won't touch the code before sometime next week, but I'll keep an eye on LUCENE-1990 comments until then.

{code}
bitsPerValue valueCount getCount PackedDirectByte PackedDirectShort Packed32 PackedAligned32 PackedDirectInt Packed64 PackedAligned64 PackedDirectLong
110001000 167 141 258 242 172 264 242 183
1 1001000 224 232 266 233 246 262 238 338
110001000 359 469 280 278 508 278 272 551
310001000 168 166 265 241 163 262 243 166
3 1001000 227 226 261 251 239 274 249 330
310001000 406 476 301 304 522 300 308 547
410001000 167 168 266 239 164 285 239 169
4 1001000 228 231 294 274 262 291 269 314
410001000 385 480 308 333 514 331 315 557
710001000 172 174 278 248 162 271 238 177
7 1001000 224 236 289 281 272 278 277 345
710001000 405 473 389 447 516 399 402 553
810001000 192 171 268 242 174 291 240 163
8 1001000 226 232 291 284 286 274 265 314
81000
{code}
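For readers unfamiliar with the structures being benchmarked, here is a minimal sketch of the core bit-packing idea behind a Packed64-style implementation: fixed-width n-bit values stored contiguously in a long[], where a value may straddle the boundary between two longs. This is an illustrative simplification with invented names, not the patch's code (the patch also covers aligned and direct variants, persistence, and a factory):

{code}
// Simplified Packed64-style array: bitsPerValue-wide values packed
// back-to-back into a long[]. A value may span two adjacent longs.
public class SimplePacked64 {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask;

  public SimplePacked64(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
    long totalBits = (long) valueCount * bitsPerValue;
    this.blocks = new long[(int) ((totalBits + 63) / 64)];
  }

  // Write-once set: assumes the slot is still zero.
  public void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);       // which long holds the low bits
    int offset = (int) (bitPos & 63);       // bit offset within that long
    blocks[block] |= (value & mask) << offset;
    if (offset + bitsPerValue > 64) {       // value spills into the next long
      blocks[block + 1] |= (value & mask) >>> (64 - offset);
    }
  }

  public long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int offset = (int) (bitPos & 63);
    long result = blocks[block] >>> offset;
    if (offset + bitsPerValue > 64) {       // pick up the spilled high bits
      result |= blocks[block + 1] << (64 - offset);
    }
    return result & mask;
  }

  public static void main(String[] args) {
    SimplePacked64 arr = new SimplePacked64(100, 7); // 100 values of 7 bits each
    arr.set(9, 100);                                 // value 9 straddles longs 0 and 1
    arr.set(10, 27);
    System.out.println(arr.get(9) + " " + arr.get(10)); // 100 27
  }
}
{code}

The "aligned" variants trade a little space to guarantee that no value crosses a long (or int) boundary, which removes the spill branch from get() at the cost of wasted padding bits.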
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832819#action_12832819 ]

Aaron McCurry commented on LUCENE-2089:
---

Sure, I will give it a try; I'm still wrapping my head around the flex branch.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832822#action_12832822 ]

Robert Muir commented on LUCENE-2089:
---

Thanks, this will give you something to play with wrt DFA queries. In flex, regex and wildcard are already implemented this way.

One additional complexity here that this code will not handle (until some other issues are resolved and I update the patch) is the pq trick. This is the fact that FuzzyQuery (or really anything else that uses TopTermsRewrite) is really an n-best list, and we shouldn't seek to terms that will simply be overflowed off the priority queue anyway, as it's just a waste of time. This is the part I am working on now; most of it is already done under LUCENE-2140, and some setup is now under LUCENE-2261.
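The "pq trick" in a nutshell: once the top-N queue is full, its weakest entry defines a minimum competitive score, and any candidate that cannot beat it would be pushed straight back off the queue, so enumerating (or seeking to) it is wasted work. A minimal standalone sketch with invented scores (java.util.PriorityQueue standing in for Lucene's term priority queue):

{code}
import java.util.PriorityQueue;

// Sketch of top-N collection with a competitive-score bound. With a DFA enum,
// the "not competitive" branch is where you could swap in a tighter (lower
// edit distance) automaton and seek past whole runs of uncompetitive terms.
public class TopTermsSketch {
  public static void main(String[] args) {
    final int n = 3;
    PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap: head = weakest

    float[] candidateScores = {0.9f, 0.4f, 0.8f, 0.3f, 0.95f, 0.5f};
    for (float s : candidateScores) {
      if (pq.size() < n) {
        pq.add(s);
      } else if (s > pq.peek()) {
        pq.poll();   // evict the weakest of the current top-N
        pq.add(s);
      }
      // else: s cannot enter the top-N, so skip it entirely
    }
    System.out.println("minimum competitive score: " + pq.peek()); // 0.8
  }
}
{code}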
Hudson build is back to normal : Lucene-trunk #1092
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1092/changes - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832858#action_12832858 ]

Shai Erera commented on LUCENE-1720:
---

Thanks Mark. I will take a look at the patch a bit later. TestActivityMonitor should move to o.a.l.util as well; I can do that. I also want to experiment with using ConcurrentHashMap, as most of the time the map is just checked and not modified. I'll see if it removes some of the synchronization. The synchronization is needed because of TimeoutThread, which only checks the map ... if it's possible, I believe it will improve the performance of ATM.

About adding it to IndexReader directly - it'd be interesting to see if it affects performance in any way, but I wouldn't want to see it directly in IR, since if I don't want to limit anything, I don't want to add an 'if' everywhere in IR, or a call to ATM ... it'd be interesting to see if it improves anything though. I also want to add a TestTimeLimitedIndexReader.

BTW, about the name - I think TimeLimitingIndexReader is more accurate, since it's not the IR which is time limited, but the operations it performs. What do you think? (Also we'd follow TimeLimitingCollector.) ...

TimeLimitedIndexReader and associated utility class
---

Key: LUCENE-1720
URL: https://issues.apache.org/jira/browse/LUCENE-1720
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java

An alternative to TimeLimitedCollector that has the following advantages:
1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing).

Uses a new utility timeout class that is independent of IndexReader. The initial contribution includes a performance test class, but there has not been time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
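The ConcurrentHashMap idea above is easy to sketch: map each working thread to an absolute deadline, have a single watcher thread scan for overruns, and keep the hot-path reads lock-free. This is a rough Java 8 illustration of the approach under discussion, not the attached ActivityTimeMonitor code; class and method names are invented:

{code}
import java.util.concurrent.ConcurrentHashMap;

// Sketch: each working thread registers an absolute deadline; a daemon
// watcher scans the map and interrupts overrunning threads. Map reads are
// lock-free, which is the concurrency win being discussed.
public class SimpleActivityMonitor {
  private static final ConcurrentHashMap<Thread, Long> deadlines = new ConcurrentHashMap<>();

  public static void start(long timeoutMillis) {
    deadlines.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
  }

  public static void stop() {
    deadlines.remove(Thread.currentThread());
  }

  static { // single daemon watcher thread
    Thread watcher = new Thread(() -> {
      while (true) {
        long now = System.currentTimeMillis();
        deadlines.forEach((thread, deadline) -> {
          if (now > deadline && deadlines.remove(thread, deadline)) {
            thread.interrupt(); // signal the timed-out activity
          }
        });
        try { Thread.sleep(10); } catch (InterruptedException e) { return; }
      }
    });
    watcher.setDaemon(true);
    watcher.start();
  }
}
{code}

A time-limiting reader wrapper would call start() before each guarded operation and stop() in a finally block, so untimed callers pay nothing beyond one map write per operation.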
Re: FieldCacheImpl concurrency
On Thu, Feb 11, 2010 at 5:41 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Thu, Feb 11, 2010 at 9:54 AM, Shay Banon <kim...@gmail.com> wrote:
> > I would like to try and improve concurrency in Lucene in several places, and thought I would start with FieldCacheImpl. The implementation is heavily synchronized on both a global map and on creation values for a pretty heavily used path (get).
>
> It really shouldn't be heavily used. For a sorted search, get() is called once per segment in the index. There is no synchronization to retrieve per-document values.

Sorting is not the only reason why FieldCache would be used...

> We have just barely moved to Java5 though, and so it would probably be pretty easy to improve the concurrency of the read path if it did become problematic.

I know you just moved to Java 5, and my first email talked about the concurrency aspect. I think the weak hash map is a good solution (though the ability to choose a soft hash map would add even greater flexibility); the problem is that there is no built-in concurrent weak/soft hash map.

> It might be a better idea to have interfaces that allow one to implement their own policies rather than to try and push all the different options into Lucene: https://issues.apache.org/jira/browse/LUCENE-831

I think that trying to provide an extension that either syncs or not is a pain in terms of extension points (it will be very convoluted), and it often hinders the optimizations that can be done in concurrent systems. Just see what is done in FieldCacheImpl and the creation marker, which is not really needed with Google Collections (they allow for a single entry to be created atomically).

> -Yonik
> http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
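The "single entry created atomically" point maps onto the classic memoizer pattern: publish a Future into a ConcurrentHashMap with putIfAbsent so that exactly one thread computes a given cache entry while concurrent callers block on the same Future. A minimal sketch with generic names (not FieldCacheImpl's actual internals):

{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Classic memoizer: at most one thread computes the value for a given key;
// concurrent callers for the same key wait on the shared Future instead of
// recomputing, and reads of already-cached entries are lock-free.
public class Memoizer<K, V> {
  private final ConcurrentHashMap<K, Future<V>> cache = new ConcurrentHashMap<>();

  public V get(K key, Callable<V> loader) throws InterruptedException, ExecutionException {
    Future<V> f = cache.get(key);
    if (f == null) {
      FutureTask<V> task = new FutureTask<>(loader);
      f = cache.putIfAbsent(key, task); // atomic: only one task wins the race
      if (f == null) {
        f = task;
        task.run(); // the winning thread computes; losers block in f.get()
      }
    }
    return f.get();
  }
}
{code}

This removes the hand-rolled creation marker, though a weak- or soft-referenced variant as discussed above would additionally need eviction, which is where a library map builder like Google Collections' MapMaker comes in.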