[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org)
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702192#action_12702192 ] Otis Gospodnetic commented on LUCENE-1284:
--
Hm, I feel that because of these command-line, non-Java, GPL-licensed tools it may not be possible (or will be very clunky) to integrate this with Lucene. What do others think? Felipe, although Java equivalents of those command-line tools don't currently exist, do you think one could implement them in Java (and release them under the ASL)? I don't know exactly what is in those tools or what it would take to port them to Java. Thanks.

> Set of Java classes that allow the Lucene search engine to use morphological
> information developed for the Apertium open-source machine translation
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
> Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should
> work on any other Java-compliant platform
> Reporter: Felipe Sánchez Martínez
> Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
> Set of Java classes that allow the Lucene search engine to use morphological
> information developed for the Apertium open-source machine translation
> platform (http://www.apertium.org). Morphological information is used to
> index new documents and to process smarter queries in which morphological
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for
> the open-source machine translation platform Apertium (http://apertium.org)
> and, optionally, the part-of-speech taggers developed for it. Currently there
> are morphological dictionaries available for Spanish, Catalan, Galician,
> Portuguese, Aranese, Romanian, French and English.
> In addition, new dictionaries are being developed for Esperanto, Occitan,
> Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope
> more language pairs will be added to the Apertium machine translation
> platform in the near future.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
LUCENE-1483 and auto resolution
Just got off the train (NY to CT has a brilliant bar car), so lest I forget: LUCENE-1483 moved auto sort-type resolution from FieldSortedHitQueue to IndexSearcher - which is a back-compat break if you were using a FieldSortedHitQueue without an IndexSearcher (Solr does it - anyone could). Annoying. If I remember right, I did it so that auto resolution happens on the MultiReader rather than on each individual segment reader. So the change is both needed and not allowed. Perhaps it could just re-resolve like before, though - if IndexSearcher has already resolved, fine; otherwise it will be done again at the FieldSortedHitQueue level. I'll open an issue for it later.

--
- Mark

http://www.lucidimagination.com
[jira] Updated: (LUCENE-1341) BoostingNearQuery class (prototype)
[ https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Keegan updated LUCENE-1341: - Attachment: lucene-1341-new-1.patch As I was debugging a unit test for BoostingNearQuery, I discovered that not all the payloads were getting read. The 'needToLoadPayload' flag on the termpos was getting reset on the last term in the span by NearSpansOrdered. Then I noticed that the term positions aren't even needed in BNQ because they were already collected by the Spans in 'matchPayload'. So, here is a newer, simpler implementation of BNQ along with some unit tests. Peter > BoostingNearQuery class (prototype) > --- > > Key: LUCENE-1341 > URL: https://issues.apache.org/jira/browse/LUCENE-1341 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.3.1 >Reporter: Peter Keegan >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 3.0 > > Attachments: bnq.patch, bnq.patch, BoostingNearQuery.java, > BoostingNearQuery.java, lucene-1341-new-1.patch, LUCENE-1341-new.patch, > LUCENE-1341.patch > > > This patch implements term boosting for SpanNearQuery. Refer to: > http://www.gossamer-threads.com/lists/lucene/java-user/62779 > This patch works but probably needs more work. I don't like the use of > 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the > payload code is mostly a copy of what's in BoostingTermQuery and could be > common-sourced somewhere. Feel free to throw darts at it :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1602) Rewrite TrieRange to use MultiTermQuery
[ https://issues.apache.org/jira/browse/LUCENE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702013#action_12702013 ] Uwe Schindler commented on LUCENE-1602:
--
Fixed the incomplete hashCode(), equals() and toString() of the TrieRange queries in revision 767982.

> Rewrite TrieRange to use MultiTermQuery
> --
>
> Key: LUCENE-1602
> URL: https://issues.apache.org/jira/browse/LUCENE-1602
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Affects Versions: 2.9
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 2.9
> Attachments: LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch,
> LUCENE-1602.patch, LUCENE-1602.patch, queries.zip, queries.zip
>
> Issue for discussion here:
> http://www.lucidimagination.com/search/document/46a548a79ae9c809/move_trierange_to_core_module_and_integration_issues
> This patch is a rewrite of TrieRange using MultiTermQuery like all other core
> queries. This should make TrieRange identical in functionality to core range
> queries.
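For readers unfamiliar with why an incomplete hashCode()/equals() matters for a query class: equal queries must hash identically, or query caches silently miss. Below is a generic, hedged sketch of the contract for a range-like value class; the class and field names are invented for illustration and are not Lucene's actual TrieRange members.

```java
// Hypothetical value class standing in for a trie-encoded range query.
// equals() compares every field that defines query identity, and
// hashCode() folds in the same fields, so two equal queries always land
// in the same hash bucket (important for query/filter caches).
public class RangeSpec {
    private final String field;
    private final long min, max;
    private final boolean minInclusive, maxInclusive;

    public RangeSpec(String field, long min, long max,
                     boolean minInclusive, boolean maxInclusive) {
        this.field = field; this.min = min; this.max = max;
        this.minInclusive = minInclusive; this.maxInclusive = maxInclusive;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RangeSpec)) return false;
        RangeSpec r = (RangeSpec) o;
        return field.equals(r.field) && min == r.min && max == r.max
            && minInclusive == r.minInclusive && maxInclusive == r.maxInclusive;
    }

    @Override public int hashCode() {
        int h = field.hashCode();
        h = 31 * h + (int) (min ^ (min >>> 32));   // fold long into int
        h = 31 * h + (int) (max ^ (max >>> 32));
        h = 31 * h + (minInclusive ? 1 : 0);
        h = 31 * h + (maxInclusive ? 2 : 0);
        return h;
    }

    @Override public String toString() {
        return field + ":" + (minInclusive ? "[" : "{") + min + " TO " + max
             + (maxInclusive ? "]" : "}");
    }
}
```

The toString() format mimics Lucene's range-query syntax only loosely; the essential part is that it, too, covers all identity-defining fields so debugging output distinguishes unequal queries.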
[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702011#action_12702011 ] Earwin Burrfoot commented on LUCENE-1609:
--
You cannot put all these fields into the state object, because that introduces state to it, and it can no longer be unsafely published.

> one thread may exchange the state object to a IndexRead, but another one
> still sees the reference to the IndexNotRead object

Nothing terrible there: a thread hitting a stale IndexNotRead synchronizes and short-circuits at the beginning of the method. The problem is that seeing the proper state object doesn't guarantee seeing the fields it is supposed to guard :) Yes, it's not fixable here without volatile or proper synchronization. But I still have the feeling that lazy loading (and the consequent synchronization) is not needed here at all.

> Eliminate synchronization contention on initial index reading in
> TermInfosReader ensureIndexIsRead
> --
>
> Key: LUCENE-1609
> URL: https://issues.apache.org/jira/browse/LUCENE-1609
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.9
> Environment: Solr
> Tomcat 5.5
> Ubuntu 2.6.20-17-generic
> Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
> Reporter: Dan Rosher
> Attachments: LUCENE-1609.patch
>
> The synchronized method ensureIndexIsRead in TermInfosReader causes contention
> under heavy load.
> Simple to reproduce: e.g. under Solr, with all caches turned off, run a simple
> range search, e.g. id:[0 TO 99], on even a small index (in my case 28K
> docs) under a load/stress-test application; examining the thread dump
> (kill -3) afterwards, many threads are blocked on 'waiting for monitor
> entry' for this method.
> Rather than using double-checked locking, which is known to have issues, this
> implementation uses a state pattern, where only one thread can move the
> object from the IndexNotRead state to IndexRead, and in doing so alters the
> object's behavior, i.e. once the index is loaded, it no longer needs a
> synchronized method.
> In my particular test, this increased throughput at least 30 times.
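The safe-publication point being argued in this thread can be made concrete. The sketch below is a generic illustration under Java 5's memory model, not the actual TermInfosReader code; all class and field names are invented. The immutable holder plus a volatile reference gives the "no synchronization after initialization" property the patch wants, without the visibility hazard Earwin and Uwe describe.

```java
// Lazy one-time loading of index data without per-call synchronization
// after initialization. The key is that 'index' is volatile: under the
// Java 5 memory model, a thread that sees a non-null 'index' is also
// guaranteed to see the array contents written before the volatile store.
public class LazyIndex {
    // Immutable holder: all fields final, fully assigned in the constructor.
    private static final class Index {
        final long[] pointers;
        Index(long[] pointers) { this.pointers = pointers; }
    }

    private volatile Index index;           // null until loaded

    private Index ensureLoaded() {
        Index local = index;                // one volatile read per call
        if (local == null) {
            synchronized (this) {           // contended only during startup
                local = index;
                if (local == null) {
                    // stand-in for reading the term-index file from disk
                    long[] p = new long[] {0L, 42L, 99L};
                    local = new Index(p);
                    index = local;          // safe publication via volatile store
                }
            }
        }
        return local;
    }

    public long pointer(int i) { return ensureLoaded().pointers[i]; }
}
```

Note this is exactly the double-checked locking idiom that is broken without volatile (and on Java 1.4, which Lucene targeted at the time, volatile does not provide these guarantees either), which is why the thread's objection stands for the non-volatile state-object patch.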
[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701994#action_12701994 ] Uwe Schindler commented on LUCENE-1609:
--
You could fix this if you put all these fields into the state object, too (an abstract class instead of an interface, containing these variables), and cloned them when creating the new state. But then you have the problem mentioned before: one thread may exchange the state object for an IndexRead, but another one still sees the reference to the no-longer-used IndexNotRead object. As long as you do not also synchronize the state-object change, or make it volatile (Java 1.5), it will still not work. That is what I meant. In my opinion, this is not fixable with this kind of state object, yes?
[jira] Issue Comment Edited: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701986#action_12701986 ] Earwin Burrfoot edited comment on LUCENE-1609 at 4/23/09 9:41 AM: -- The problem is not with indexState not being volatile. You can unsafely publish objects that have no internal state, or their state is consistent enough for you under any memory visibility/reordering effects. See working example of it in LUCENE-1607, Yonik's hash for interning strings. The problem is that indexState guards indexTerms, indexInfos, indexPointers, which aren't volatile too and aren't guarded by the lock. It is possible that one thread does load these fields and then sets indexState = new IndexRead(), but another thread sees only the last write and dies with NPE. The thing I don't get, is why do we want lazy loading here at all? Is there any usage for TermInfosReader that prevents it from initializing in constructor? was (Author: earwin): The problem is not with indexState not being volatile. You can unsafely publish objects that have no internal state, or their state is consistent enough for you under any memory visibility/reordering effects. See working example of it in LUCENE-1607, Yonik's hash for interning strings. The problem is that indexState guards indexTerms, indexInfos, indexPointers, which aren't volatile too and aren't guarded by the lock. It is possible that one thread does load these fields and then sets indexState = new IndexRead(), but another thread sees only the last write and dies with NPE. 
[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701973#action_12701973 ] Uwe Schindler commented on LUCENE-1609:
--
Are you sure this works correctly? If the indexState is changed in the synchronized block, another thread not synchronizing on the lock may still see the old indexState. At the very least, indexState must be volatile, but even that only works correctly with Java 1.5 (and Lucene currently requires only Java 1.4).
[jira] Updated: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Rosher updated LUCENE-1609:
--
Attachment: LUCENE-1609.patch
[jira] Created: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
--

Key: LUCENE-1609
URL: https://issues.apache.org/jira/browse/LUCENE-1609
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.9
Environment: Solr
Tomcat 5.5
Ubuntu 2.6.20-17-generic
Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
Reporter: Dan Rosher

The synchronized method ensureIndexIsRead in TermInfosReader causes contention under heavy load.

Simple to reproduce: e.g. under Solr, with all caches turned off, run a simple range search, e.g. id:[0 TO 99], on even a small index (in my case 28K docs) under a load/stress-test application; examining the thread dump (kill -3) afterwards, many threads are blocked on 'waiting for monitor entry' for this method.

Rather than using double-checked locking, which is known to have issues, this implementation uses a state pattern, where only one thread can move the object from the IndexNotRead state to IndexRead, and in doing so alters the object's behavior, i.e. once the index is loaded, it no longer needs a synchronized method.

In my particular test, this increased throughput at least 30 times.
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701957#action_12701957 ] Paul Elschot commented on LUCENE-1252: -- There is no patch for now. HitCollectors should not be affected by this, as they would only be involved when a real match is found, and that, when position info is needed, necessarily involves the positions. Extending this with a cheap score brings another issue: should a cheap score be given for a document that might match, but in the end does not really match when positions are used? At the moment, I don't think so: score values are normally cheap to compute, but accessing positions is not cheap. > Avoid using positions when not all required terms are present > - > > Key: LUCENE-1252 > URL: https://issues.apache.org/jira/browse/LUCENE-1252 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Paul Elschot >Priority: Minor > > In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, > currently next() and skipTo() will use position information even when other > parts of the query cannot match because some required terms are not present. > This could be avoided by adding some methods to Scorer that relax the > postcondition of next() and skipTo() to something like "all required terms > are present, but no position info was checked yet", and implementing these > methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and > SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
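The two-phase idea in Paul's wish (confirm with positions only after all required terms are known to be present) can be sketched independently of Lucene's Scorer machinery. This is a hedged, generic illustration: sorted int arrays stand in for posting lists, and the expensive positional check is an injected predicate rather than real phrase matching.

```java
// Two-phase conjunction matching: a cheap phase intersects doc-id lists
// (all required terms present), and only the survivors pay for the
// expensive positional check.
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class TwoPhase {
    // a, b: sorted doc ids for two required terms.
    // positionsMatch: stands in for the costly position-based verification.
    public static List<Integer> match(int[] a, int[] b, IntPredicate positionsMatch) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else {                                   // both terms present in this doc
                if (positionsMatch.test(a[i])) hits.add(a[i]);
                i++; j++;
            }
        }
        return hits;
    }
}
```

This also makes Paul's "cheap score" question concrete: a candidate that survives the intersection but fails positionsMatch (doc 5 in the test below) is exactly the document one would have to decide whether to give a cheap score to.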
Re: Greetings and questions about patches
Thanks all. Despite my aesthetic preference for removing unused code, I'm *really* not in favor of causing extra work (for myself or others) to satisfy it .. Especially when there's reasonable expectations that the code in question *will* be used in the foreseeable future. Ok, I'll leave the code in place as-is and provide a patch with unit tests sometime real soon now. Erick On Thu, Apr 23, 2009 at 6:15 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Welcome Erick! > > Because nextHighestPowerOfTwo methods are public, I think we cannot > change what they return, nor remove them. At most we could deprecate > them now (and remove in 3.0), though I think it's fine to simply keep > them around even though nothing inside Lucene uses them today: since > we are heavy users of BitSet/Vector/Array/etc., it seems possible > we'll need them at some point. > > EG we are looking to make a better data structure to share > nearly-identical deleted doc bitsets between near realtime readers, > which could conceivably use advanced BitUtil methods. > > Keep hacking ;) > > Mike > > On Wed, Apr 22, 2009 at 9:33 PM, Erick Erickson > wrote: > > Hi all: > > > > I've been participating in the user list for some time, and I'd like > > to start helping maintain/enhance the code. So I thought I'd start > > with something small, mostly to get the process down. Unit tests > > sure fit the bill it seems to me, less chance of introducing errors > > through ignorance but a fine way to extend *my* understanding > > of Lucene. > > > > I managed to check out the code and run the unit tests, which > > was amazingly easy. I even managed to get the project into > > IntelliJ and connect the codestyle.xml file. Kudos for whoever > > set up the checkout/build process, I was dreading spending > > days setting this up, fortunately I didn't have to. 
> > > > So I, with Chris's help, found the code coverage report and > > chose something pretty straightforward to test, BitUtil since it > > was nice and self-contained. As I said, I'm looking at understanding > > the process rather than adding much value the first time. > > > > Alas, even something as simple as BitUtil generates questions > > that I'm asking mostly to understand what approach the veterans > > prefer. I'll argue with y'all next year sometime . > > > > So, according to the coverage report, there are two methods that > > are never executed by the unit tests (actually 4, 2 that operate on > > ints and 2 that operate on longs), isPowerOfTwo and > > nextHighestPowerOfTwo. nextHighestPowerOfTwo is especially > > clever, had to get out a paper and pencil to really understand it. > > > > Issues: > > 1> none of these methods is ever called. I commented them out > > and ran all the unit tests and all is well. Additionally, commenting > > out one of the other methods produces compile-time errors so I'm > > fairly sure I didn't do something completely stupid that just > *looked* > > like it was OK. I grepped recursively and they're nowhere in the > > *.java files. > > 1a> What's the consensus about unused code? Take it out (my > > preference) along with leaving a comment on where it can > > be found (since it *is* clever code)? Leave it in because > someone > > found some pretty neat algorithms that we may need sometime? > > 1b> I'm not entirely sure about the contrib area, but the contrib jars > > are all new so I assume "ant clean test" compiles them as well. > > > > 2> I don't agree with the behavior of nextHighestPowerOfTwo. Should > > I make changes if we decide to keep it? > > 2a> Why should it return the parameter passed in when it happens to be > > a perfect power of two? e.g. this passes: > >assertEquals(BitUtil.nextHighestPowerOfTwo(128L), 128); > >I'd expect this to actually return 256, given the name. > > 2b> Why should it ever return 0? 
There's no power of two that is > >zero. e.g. this passes: > >assertEquals(BitUtil.nextHighestPowerOfTwo(-1), 0); > >as does this: assertEquals(BitUtil.nextHighestPowerOfTwo(0), 0). > >*Assuming* that someone wants to use this sometime to, say, size > > an array they'd have to test against a return of 0. > > > > > > I'm fully aware that these are trivial issues in the grand scheme of > things, > > and I *really* don't want to waste much time hashing them over. I'll > provide > > a patch either way and go on to something slightly more complicated for > > my next trick. > > > > Best > > Erick > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
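The behaviors Erick observed (128 -> 128, 0 -> 0, -1 -> 0) all fall out of the standard bit-smearing formulation of this trick, which - judging from the tests he quotes - is evidently the variant in BitUtil. A sketch:

```java
// "Round up to a power of two" via bit smearing: decrement, OR the highest
// set bit into every lower position, then increment. The leading v-- is
// what makes exact powers of two map to themselves (128 -> 128), and the
// overflow behavior is what yields 0 for inputs 0 and -1.
public class Pow2 {
    public static int nextHighestPowerOfTwo(int v) {
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;   // all bits below the highest set bit are now 1
        v++;
        return v;
    }
}
```

So his 2a/2b observations are properties of this formulation, not bugs in his tests: dropping the `v--` would make 128 return 256, and any negative input smears to -1 and wraps to 0 on the final increment.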
Re: Synonym filter with support for phrases?
>> engine. So guys looking for "MSU CMC" really want to get "Московский
>> Государственный Университет, факультет ВМиК" (Moscow State University,
>> Faculty of Computational Mathematics and Cybernetics) and friends.
> And? How often do they extend this particular phrase with further terms?

They don't need to. Variations of this phrase alone killed my first several approaches to synonyms :)

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Synonym filter with support for phrases?
> It'd be great to get multi-word synonyms fully working...

I agree -- this is something that seems to be useful to a fairly wide audience.

> How would you change how Lucene indexes token positions to do this "correctly"?

Kirill has some interesting points on this. I have a busy day today, but I'll try to clean up and post the code that I put together for another project. It'll be a starting point for refining into better directions.

Dawid
Fuzzy search optimization
Hi,

I was going through the Levenshtein distance code in org.apache.lucene.search.FuzzyTermEnum in the 2.4.1 build. I noticed that a small but effective optimization is possible in the distance-calculation code (the initialization). I have the code ready; I can post it if anyone is interested.

Thanks and regards,

Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd.
http://maps.mapmyindia.com
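Varun doesn't say which optimization he means, but the classic one for Levenshtein distance is collapsing the full m x n matrix to two rows, since each row depends only on the previous one. A generic, hedged sketch (not the FuzzyTermEnum code itself):

```java
// Levenshtein distance with O(min-row) memory: keep only the previous and
// current rows of the dynamic-programming matrix instead of the full table.
public class Lev {
    public static int distance(String s, String t) {
        int n = t.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j;   // distance from empty prefix of s
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;                             // distance to empty prefix of t
            for (int j = 1; j <= n; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;         // reuse rows
        }
        return prev[n];
    }
}
```

For fuzzy term enumeration there is a further common refinement - abandoning a row early once every entry exceeds the allowed edit distance - which may be closer to what Varun has in mind; the two-row layout above is the prerequisite for both.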
Re: Synonym filter with support for phrases?
> On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot wrote:
>
>> Your synonyms will break if you try searching for phrases.
>> Building on your example, "food place in new york" will find nothing,
>> because 'place' and 'in' share the same position.
>
> It'd be great to get multi-word synonyms fully working...
>
> How would you change how Lucene indexes token positions to do this
> "correctly"?

You need the ability to put two tokens in the same position with different position increments. One variant off the top of my head is to introduce a notion of span, so a token becomes (text, span, incr):

(restaurant, 1, 0), (food, 0, 1), (place, 0, 1), (in, 0, 1), (new, 0, 1), (york, 0, 1)

The span affects the distance calculation between this term and ones that follow. E.g. dist(food, in) = 2, because both food and place have incr=1; but despite restaurant and food having the same start position, dist(restaurant, in) = 1, because restaurant spans an additional position. With something like that, I think it is possible to formulate an algorithm for indexing and query rewriting that does "correct" multi-word synonyms.

Right now I cheat when rewriting a query. If my syngroup is part of the phrase, and I know that this syngroup has longer phrases than the one currently detected, I do a span or sloppy phrase query. That works, but could theoretically match a wrong document.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
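To make the breakage concrete, here is a minimal model of position increments and exact phrase matching. It deliberately does not model Kirill's proposed span field - only the status quo, where a naive multi-word synonym injection can leave 'place' and 'in' on the same position and so defeat a phrase query. The token layout and increments below are hypothetical, and the model is simplified to one occurrence per term.

```java
// Positions are the running sum of position increments; an exact phrase
// match requires each query term at the previous term's position + 1.
import java.util.HashMap;
import java.util.Map;

public class Positions {
    // tokens[i] = {text, posIncrement}; first token is expected to carry
    // increment 1 (positions start at 0), as in Lucene's convention.
    public static Map<String, Integer> index(String[][] tokens) {
        Map<String, Integer> pos = new HashMap<>();
        int p = -1;
        for (String[] t : tokens) {
            p += Integer.parseInt(t[1]);
            pos.put(t[0], p);
        }
        return pos;
    }

    public static boolean phraseMatches(Map<String, Integer> pos, String... terms) {
        for (int i = 1; i < terms.length; i++) {
            Integer a = pos.get(terms[i - 1]), b = pos.get(terms[i]);
            if (a == null || b == null || b != a + 1) return false;
        }
        return true;
    }
}
```

With a layout where 'in' received increment 0 (sharing a position with 'place'), the pair "food place" still matches, but the full phrase "food place in new york" fails at the place -> in step - exactly the failure mode quoted at the top of the thread.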
Re: Greetings and questions about patches
Welcome Erick!

Because the nextHighestPowerOfTwo methods are public, I think we cannot
change what they return, nor remove them. At most we could deprecate
them now (and remove them in 3.0), though I think it's fine to simply
keep them around even though nothing inside Lucene uses them today:
since we are heavy users of BitSet/Vector/Array/etc., it seems possible
we'll need them at some point. E.g. we are looking to make a better data
structure to share nearly identical deleted-doc bitsets between near
real-time readers, which could conceivably use advanced BitUtil methods.

Keep hacking ;)

Mike

On Wed, Apr 22, 2009 at 9:33 PM, Erick Erickson wrote:
> Hi all:
>
> I've been participating in the user list for some time, and I'd like
> to start helping maintain/enhance the code. So I thought I'd start
> with something small, mostly to get the process down. Unit tests
> seem to fit the bill: less chance of introducing errors through
> ignorance, but a fine way to extend *my* understanding of Lucene.
>
> I managed to check out the code and run the unit tests, which
> was amazingly easy. I even managed to get the project into
> IntelliJ and connect the codestyle.xml file. Kudos to whoever
> set up the checkout/build process; I was dreading spending
> days setting this up, and fortunately I didn't have to.
>
> So, with Chris's help, I found the code coverage report and
> chose something pretty straightforward to test: BitUtil, since it
> was nice and self-contained. As I said, I'm looking at understanding
> the process rather than adding much value the first time.
>
> Alas, even something as simple as BitUtil generates questions
> that I'm asking mostly to understand what approach the veterans
> prefer. I'll argue with y'all next year sometime.
>
> So, according to the coverage report, there are two methods that
> are never executed by the unit tests (actually four: two that operate
> on ints and two that operate on longs): isPowerOfTwo and
> nextHighestPowerOfTwo.
> nextHighestPowerOfTwo is especially clever; I had to get out paper and
> pencil to really understand it.
>
> Issues:
> 1> None of these methods is ever called. I commented them out
> and ran all the unit tests, and all is well. Additionally, commenting
> out one of the other methods produces compile-time errors, so I'm
> fairly sure I didn't do something completely stupid that just *looked*
> like it was OK. I grepped recursively and they're nowhere in the
> *.java files.
> 1a> What's the consensus about unused code? Take it out (my
> preference) along with leaving a comment on where it can
> be found (since it *is* clever code)? Or leave it in because someone
> found some pretty neat algorithms that we may need sometime?
> 1b> I'm not entirely sure about the contrib area, but the contrib jars
> are all new, so I assume "ant clean test" compiles them as well.
>
> 2> I don't agree with the behavior of nextHighestPowerOfTwo. Should
> I make changes if we decide to keep it?
> 2a> Why should it return the parameter passed in when it happens to be
> an exact power of two? E.g. this passes:
> assertEquals(BitUtil.nextHighestPowerOfTwo(128L), 128);
> I'd expect this to actually return 256, given the name.
> 2b> Why should it ever return 0? There's no power of two that is
> zero. E.g. this passes:
> assertEquals(BitUtil.nextHighestPowerOfTwo(-1), 0);
> as does this:
> assertEquals(BitUtil.nextHighestPowerOfTwo(0), 0);
> *Assuming* someone wants to use this sometime to, say, size
> an array, they'd have to test against a return of 0.
>
> I'm fully aware that these are trivial issues in the grand scheme of
> things, and I *really* don't want to waste much time hashing them over.
> I'll provide a patch either way and go on to something slightly more
> complicated for my next trick.
>
> Best
> Erick
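Both behaviors Erick questions fall out of the standard bit-twiddling technique for rounding up to a power of two (the approach BitUtil's int variant is based on; the demo class name here is ours, not Lucene's):

```java
// Round v up to the next power of two by smearing the highest set bit
// into every lower bit, then adding one. The leading v-- is exactly why
// an exact power of two maps to itself, and sign-extension of negative
// inputs (plus wraparound on 0) is why 0 and -1 both yield 0.
public class PowerOfTwoDemo {
    static int nextHighestPowerOfTwo(int v) {
        v--;           // exact powers of two drop below their bit boundary
        v |= v >> 1;   // smear the highest set bit rightward...
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;  // ...until all bits below it are set
        return v + 1;  // all-ones + 1 is a power of two
    }

    public static void main(String[] args) {
        System.out.println(nextHighestPowerOfTwo(100)); // 128
        System.out.println(nextHighestPowerOfTwo(128)); // 128, not 256
        System.out.println(nextHighestPowerOfTwo(0));   // 0
        System.out.println(nextHighestPowerOfTwo(-1));  // 0
    }
}
```

For 0 and -1, the signed `>>` keeps the value at all-ones after the decrement, so the final increment wraps to 0; callers sizing an array would indeed have to guard against that return value, as Erick notes.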
Re: Synonym filter with support for phrases?
On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot wrote:
> Your synonyms will break if you try searching for phrases.
> Building on your example, "food place in new york" will find nothing,
> because 'place' and 'in' share the same position.

It'd be great to get multi-word synonyms fully working...

How would you change how Lucene indexes token positions to do this
"correctly"?

Mike