[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-04-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702192#action_12702192
 ] 

Otis Gospodnetic commented on LUCENE-1284:
--

Hm, I feel that because of these command-line non-Java and GPLed tools it may 
not be possible (or will be very clunky) to integrate this with Lucene.

What do others think?

Felipe, although Java equivalents of those command-line tools don't exist 
currently, do you think one could implement them in Java (and release them 
under ASL)?  I don't know what exactly is in those tools and what it would take 
to port them to Java.
Thanks.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work on any other Java-compliant platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, Aranese, Romanian, French and English. In addition, new 
> dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, 
> Danish, Welsh, Polish and Italian, among others; we hope more language pairs 
> will be added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Lucene 1483 and Auto resolution

2009-04-23 Thread Mark Miller
Just got off the train, and NY to CT has a brilliant bar car, so lest I 
forget:


LUCENE-1483 moved auto resolution from FieldSortedHitQueue (fshq) to 
IndexSearcher, which is a back-compat break if you were using a fshq without 
an IndexSearcher (Solr does it, and anyone could). Annoying. If I remember 
right, I did it to resolve auto on the MultiReader rather than on each 
individual segment reader. So the change is both needed and not allowed. 
Perhaps it could just re-resolve like before, though: if IndexSearcher has 
already resolved, fine; otherwise it will be done again at the fshq level. 
I'll open an issue for it later.


--
- Mark

http://www.lucidimagination.com







[jira] Updated: (LUCENE-1341) BoostingNearQuery class (prototype)

2009-04-23 Thread Peter Keegan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Keegan updated LUCENE-1341:
-

Attachment: lucene-1341-new-1.patch

As I was debugging a unit test for BoostingNearQuery, I discovered that not all 
the payloads were getting read. The 'needToLoadPayload' flag on the termpos was 
getting reset on the last term in the span by NearSpansOrdered. Then I noticed 
that the term positions aren't even needed in BNQ because they were already 
collected by the Spans in 'matchPayload'. So, here is a newer, simpler 
implementation of BNQ along with some unit tests.

Peter



> BoostingNearQuery class (prototype)
> ---
>
> Key: LUCENE-1341
> URL: https://issues.apache.org/jira/browse/LUCENE-1341
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Affects Versions: 2.3.1
>Reporter: Peter Keegan
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 3.0
>
> Attachments: bnq.patch, bnq.patch, BoostingNearQuery.java, 
> BoostingNearQuery.java, lucene-1341-new-1.patch, LUCENE-1341-new.patch, 
> LUCENE-1341.patch
>
>
> This patch implements term boosting for SpanNearQuery. Refer to: 
> http://www.gossamer-threads.com/lists/lucene/java-user/62779
> This patch works but probably needs more work. I don't like the use of 
> 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the 
> payload code is mostly a copy of what's in BoostingTermQuery and could be 
> common-sourced somewhere. Feel free to throw darts at it :)




[jira] Commented: (LUCENE-1602) Rewrite TrieRange to use MultiTermQuery

2009-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702013#action_12702013
 ] 

Uwe Schindler commented on LUCENE-1602:
---

Fixed the incomplete hashCode(), equals() and toString() of TrieRangeQueries in 
revision 767982.

> Rewrite TrieRange to use MultiTermQuery
> ---
>
> Key: LUCENE-1602
> URL: https://issues.apache.org/jira/browse/LUCENE-1602
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, 
> LUCENE-1602.patch, LUCENE-1602.patch, queries.zip, queries.zip
>
>
> Issue for discussion here: 
> http://www.lucidimagination.com/search/document/46a548a79ae9c809/move_trierange_to_core_module_and_integration_issues
> This patch is a rewrite of TrieRange using MultiTermQuery like all other core 
> queries. This should make TrieRange identical in functionality to core range 
> queries.




[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702011#action_12702011
 ] 

Earwin Burrfoot commented on LUCENE-1609:
-

You cannot put all these fields into the state object, because that introduces 
state into it, and it can then no longer be unsafely published.

> one thread may exchange the state object to an IndexRead, but another one 
> still sees the reference to the IndexNotRead object
Nothing terrible here: a thread hitting a stale IndexNotRead synchronizes and 
short-circuits at the beginning of the method. The problem is that seeing the 
proper state object doesn't guarantee seeing the fields it is supposed to guard :)

Yes, it's not fixable here without volatile or proper synchronization. But I 
still have a feeling that lazy loading (and the consequent synchronization) is 
not needed here at all.

> Eliminate synchronization contention on initial index reading in 
> TermInfosReader ensureIndexIsRead 
> ---
>
> Key: LUCENE-1609
> URL: https://issues.apache.org/jira/browse/LUCENE-1609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
> Environment: Solr 
> Tomcat 5.5
> Ubuntu 2.6.20-17-generic
> Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
>Reporter: Dan Rosher
> Attachments: LUCENE-1609.patch
>
>
> The synchronized method ensureIndexIsRead in TermInfosReader causes contention 
> under heavy load.
> Simple to reproduce: e.g. under Solr, with all caches turned off, run a simple 
> range search such as id:[0 TO 99] on even a small index (in my case 28K 
> docs) under a load/stress test application. A later thread dump (kill -3) 
> shows many threads blocked on 'waiting for monitor entry' to this method.
> Rather than using Double-Checked Locking, which is known to have issues, this 
> implementation uses a state pattern, where only one thread can move the 
> object from the IndexNotRead state to IndexRead, and in doing so alters the 
> object's behavior, i.e. once the index is loaded, the index no longer needs a 
> synchronized method. 
> In my particular test, this increased throughput at least 30 times.




[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701994#action_12701994
 ] 

Uwe Schindler commented on LUCENE-1609:
---

You could fix this if you also put all these fields into the state object (an 
abstract class instead of an interface, containing these variables) and cloned 
them when creating the new state. But then you have the problem I mentioned: 
one thread may exchange the state object for an IndexRead, but another one still 
sees the reference to the IndexNotRead object, which is no longer in use. As long 
as you do not also synchronize the state object change, or make it volatile in 
Java 1.5, it will still not work. That is what I meant.
In my opinion, this is not fixable in any case with this type of state 
object, yes?




[jira] Issue Comment Edited: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701986#action_12701986
 ] 

Earwin Burrfoot edited comment on LUCENE-1609 at 4/23/09 9:41 AM:
--

The problem is not that indexState isn't volatile. You can unsafely publish 
objects that have no internal state, or whose state is consistent enough for 
you under any memory visibility/reordering effects. See a working example of 
this in LUCENE-1607, Yonik's hash for interning strings.

The problem is that indexState guards indexTerms, indexInfos and indexPointers, 
which aren't volatile either and aren't guarded by the lock. It is possible that 
one thread loads these fields and then sets indexState = new IndexRead(), 
but another thread sees only the last write and dies with an NPE.

What I don't get is why we want lazy loading here at all. Is there any 
usage of TermInfosReader that prevents it from initializing in the constructor?
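
A minimal sketch of the visibility problem discussed in this comment and of one way around it, assuming the Java 5 memory model (which, as noted elsewhere in this thread, Lucene's Java 1.4 baseline does not guarantee). IndexData and LazyIndexHolder are hypothetical names chosen for illustration; this is not the code in LUCENE-1609.patch or in TermInfosReader. The idea is to bundle the guarded fields into one immutable object and publish it through a single volatile reference, so that a thread which sees the reference also sees the fields.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Immutable bundle of the data the state flag is meant to guard (hypothetical stand-in). */
final class IndexData {
    final String[] indexTerms;
    final long[] indexPointers;

    IndexData(String[] indexTerms, long[] indexPointers) {
        this.indexTerms = indexTerms;
        this.indexPointers = indexPointers;
    }
}

/** Hypothetical lazy holder; an illustration only, not Lucene's TermInfosReader. */
final class LazyIndexHolder {
    // volatile: a thread that reads a non-null reference also sees the
    // fully written IndexData (assumes Java 5 or later).
    private volatile IndexData data;

    // Counts how often the expensive load ran; only for demonstration.
    final AtomicInteger loadCount = new AtomicInteger();

    IndexData get() {
        IndexData d = data;              // fast path: one volatile read, no lock
        if (d == null) {
            synchronized (this) {        // slow path: contended only until the first load completes
                d = data;
                if (d == null) {
                    d = load();
                    data = d;            // publish only after the object is fully built
                }
            }
        }
        return d;
    }

    private IndexData load() {
        loadCount.incrementAndGet();
        // Made-up sample data standing in for the real index.
        return new IndexData(new String[] {"a", "m", "z"}, new long[] {0L, 128L, 256L});
    }

    public static void main(String[] args) throws InterruptedException {
        final LazyIndexHolder holder = new LazyIndexHolder();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> holder.get());
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("loads = " + holder.loadCount.get());
    }
}
```

If lazy loading turns out not to be needed, loading the same data in the constructor into final fields would avoid both the lock and the volatile read; that is the alternative the question above points at.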




[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701986#action_12701986
 ] 

Earwin Burrfoot commented on LUCENE-1609:
-

The problem is not that indexState isn't volatile. You can unsafely publish 
objects that have no internal state, or whose state is consistent enough for 
you under any memory visibility/reordering effects. See a working example of 
this in LUCENE-1607, Yonik's hash for interning strings.

The problem is that indexState guards indexTerms, indexInfos and indexPointers, 
which aren't volatile either and aren't guarded by the lock. It is possible that 
one thread loads these fields and then sets indexState = new IndexRead(), 
but another thread sees only the last write and dies with an NPE.




[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701973#action_12701973
 ] 

Uwe Schindler commented on LUCENE-1609:
---

Are you sure this works correctly? If the indexState is changed in the 
synchronized block, another thread not synchronizing on the lock may still see 
the old indexState. At the very least, indexState must be volatile, but that only 
works correctly with Java 1.5 (and Lucene only requires Java 1.4).




[jira] Updated: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Dan Rosher (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Rosher updated LUCENE-1609:
---

Attachment: LUCENE-1609.patch




[jira] Created: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-23 Thread Dan Rosher (JIRA)
Eliminate synchronization contention on initial index reading in 
TermInfosReader ensureIndexIsRead 
---

 Key: LUCENE-1609
 URL: https://issues.apache.org/jira/browse/LUCENE-1609
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
 Environment: Solr 
Tomcat 5.5
Ubuntu 2.6.20-17-generic
Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
Reporter: Dan Rosher


The synchronized method ensureIndexIsRead in TermInfosReader causes contention 
under heavy load.

Simple to reproduce: e.g. under Solr, with all caches turned off, run a simple 
range search such as id:[0 TO 99] on even a small index (in my case 28K docs) 
under a load/stress test application. A later thread dump (kill -3) shows many 
threads blocked on 'waiting for monitor entry' to this method.

Rather than using Double-Checked Locking, which is known to have issues, this 
implementation uses a state pattern, where only one thread can move the object 
from the IndexNotRead state to IndexRead, and in doing so alters the object's 
behavior, i.e. once the index is loaded, the index no longer needs a 
synchronized method. 

In my particular test, this increased throughput at least 30 times.






[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present

2009-04-23 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701957#action_12701957
 ] 

Paul Elschot commented on LUCENE-1252:
--

There is no patch for now.

HitCollectors should not be affected by this, as they would only be involved 
when a real match is found, and that, when position info is needed, necessarily 
involves the positions.

Extending this with a cheap score brings another issue: should a cheap score be 
given for a document that might match, but in the end does not really match 
when positions are used? At the moment, I don't think so: score values are 
normally cheap to compute, but accessing positions is not cheap.





> Avoid using positions when not all required terms are present
> -
>
> Key: LUCENE-1252
> URL: https://issues.apache.org/jira/browse/LUCENE-1252
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
>
> In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, 
> currently next() and skipTo() will use position information even when other 
> parts of the query cannot match because some required terms are not present.
> This could be avoided by adding some methods to Scorer that relax the 
> postcondition of next() and skipTo() to something like "all required terms 
> are present, but no position info was checked yet", and implementing these 
> methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and 
> SpanScorer/NearSpans.
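
The idea in the description above can be sketched in plain Java as a two-phase check: first a cheap test that all required terms are present, and only then the expensive position check. This is a toy model under stated assumptions, not Lucene's Scorer API: a document is represented as a map from term to sorted positions, and the class and method names (TwoPhasePhraseSketch, allTermsPresent, phraseMatches) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy two-phase matcher; hypothetical names, not Lucene's Scorer API. */
final class TwoPhasePhraseSketch {
    // Counts how often positions were consulted; only for demonstration.
    static int positionChecks = 0;

    /** Cheap phase: are all the given terms present in the document at all? */
    static boolean allTermsPresent(Map<String, int[]> doc, List<String> terms) {
        for (String t : terms) {
            if (!doc.containsKey(t)) {
                return false;
            }
        }
        return true;
    }

    /** Expensive phase: do the phrase terms occur at consecutive positions? */
    static boolean phraseMatches(Map<String, int[]> doc, List<String> phrase) {
        positionChecks++;
        for (int start : doc.get(phrase.get(0))) {
            boolean ok = true;
            for (int i = 1; i < phrase.size() && ok; i++) {
                // Position arrays are assumed sorted, so binary search is valid.
                ok = Arrays.binarySearch(doc.get(phrase.get(i)), start + i) >= 0;
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Query: the phrase AND every term in required. Positions are read last. */
    static boolean matches(Map<String, int[]> doc, List<String> phrase, List<String> required) {
        if (!allTermsPresent(doc, required) || !allTermsPresent(doc, phrase)) {
            return false;   // no position information was touched
        }
        return phraseMatches(doc, phrase);
    }

    public static void main(String[] args) {
        Map<String, int[]> doc = new HashMap<String, int[]>();
        doc.put("new", new int[] {0});
        doc.put("york", new int[] {1});
        System.out.println(matches(doc, Arrays.asList("new", "york"), Arrays.asList("pizza")));
        System.out.println("position checks so far: " + positionChecks);
    }
}
```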




Re: Greetings and questions about patches

2009-04-23 Thread Erick Erickson
Thanks all. Despite my aesthetic preference for removing unused code,
I'm *really* not in favor of causing extra work (for myself or others) to
satisfy it, especially when there are reasonable expectations that the
code in question *will* be used in the foreseeable future.

Ok, I'll leave the code in place as-is and provide a patch with unit tests
sometime real soon now.

Erick

On Thu, Apr 23, 2009 at 6:15 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Welcome Erick!
>
> Because nextHighestPowerOfTwo methods are public, I think we cannot
> change what they return, nor remove them.  At most we could deprecate
> them now (and remove in 3.0), though I think it's fine to simply keep
> them around even though nothing inside Lucene uses them today: since
> we are heavy users of BitSet/Vector/Array/etc., it seems possible
> we'll need them at some point.
>
> EG we are looking to make a better data structure to share
> nearly-identical deleted doc bitsets between near realtime readers,
> which could conceivably use advanced BitUtil methods.
>
> Keep hacking ;)
>
> Mike
>
> On Wed, Apr 22, 2009 at 9:33 PM, Erick Erickson wrote:
> > Hi all:
> >
> > I've been participating in the user list for some time, and I'd like
> > to start helping maintain/enhance the code. So I thought I'd start
> > with something small, mostly to get the process down. Unit tests
> > sure fit the bill it seems to me, less chance of introducing errors
> > through ignorance but a fine way to extend *my* understanding
> > of Lucene.
> >
> > I managed to check out the code and run the unit tests, which
> > was amazingly easy. I even managed to get the project into
> > IntelliJ and connect the codestyle.xml file. Kudos for whoever
> > set up the checkout/build process, I was dreading spending
> > days setting this up, fortunately I didn't have to.
> >
> > So I, with Chris's help, found the code coverage report and
> > chose something pretty straightforward to test, BitUtil since it
> > was nice and self-contained. As I said, I'm looking at understanding
> > the process rather than adding much value the first time.
> >
> > Alas, even something as simple as BitUtil generates questions
> > that I'm asking mostly to understand what approach the veterans
> > prefer. I'll argue with y'all next year sometime.
> >
> > So, according to the coverage report, there are two methods that
> > are never executed by the unit tests (actually 4, 2 that operate on
> > ints and 2 that operate on longs), isPowerOfTwo and
> > nextHighestPowerOfTwo. nextHighestPowerOfTwo is especially
> > clever, had to get out a paper and pencil to really understand it.
> >
> > Issues:
> > 1> none of these methods is ever called. I commented them out
> >  and ran all the unit tests and all is well. Additionally, commenting
> >  out one of the other methods produces compile-time errors so I'm
> >  fairly sure I didn't do something completely stupid that just *looked*
> >  like it was OK. I grepped recursively and they're nowhere in the
> >  *.java files.
> >   1a> What's the consensus about unused code? Take it out (my
> >  preference) along with leaving a comment on where it can
> >  be found (since it *is* clever code)? Leave it in because someone
> >  found some pretty neat algorithms that we may need sometime?
> >   1b> I'm not entirely sure about the contrib area, but the contrib jars
> >  are all new so I assume "ant clean test" compiles them as well.
> >
> > 2> I don't agree with the behavior of nextHighestPowerOfTwo. Should
> >  I make changes if we decide to keep it?
> >   2a> Why should it return the parameter passed in when it happens to be
> > a perfect power of two? e.g. this passes:
> >assertEquals(BitUtil.nextHighestPowerOfTwo(128L), 128);
> >I'd expect this to actually return 256, given the name.
> > 2b> Why should it ever return 0? There's no power of two that is
> >zero. e.g. this passes:
> >assertEquals(BitUtil.nextHighestPowerOfTwo(-1), 0);
> >as does this: assertEquals(BitUtil.nextHighestPowerOfTwo(0), 0).
> >*Assuming* that someone wants to use this sometime to, say, size
> > an array they'd have to test against a return of 0.
> >
> >
> > I'm fully aware that these are trivial issues in the grand scheme of things,
> > and I *really* don't want to waste much time hashing them over. I'll provide
> > a patch either way and go on to something slightly more complicated for
> > my next trick.
> >
> > Best
> > Erick
> >
>
>
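
The nextHighestPowerOfTwo behavior questioned in the quoted message (128 stays 128, and both 0 and -1 give 0) is what the classic bit-smearing technique produces. The sketch below is written from that technique and checked against the assertions quoted above; it is not copied from Lucene's BitUtil, and the class name PowerOfTwoSketch and the method nextPowerOfTwoAbove are hypothetical, the latter showing the "strictly greater" reading that the name might suggest.

```java
/** Bit-smearing sketch; hypothetical class, not Lucene's BitUtil. */
final class PowerOfTwoSketch {

    /**
     * Smallest power of two greater than or equal to v, for v in [1, 2^30].
     * For v <= 0 the result is 0, matching the assertions quoted above.
     */
    static int nextHighestPowerOfTwo(int v) {
        v--;             // so that an exact power of two maps to itself
        v |= v >> 1;     // copy the highest set bit into every lower bit
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v++;             // all-ones below the top bit, plus one, is a power of two
        return v;
    }

    /**
     * Hypothetical variant: smallest power of two strictly greater than v,
     * for v in [0, 2^30 - 1]. Overflow for larger inputs is not handled.
     */
    static int nextPowerOfTwoAbove(int v) {
        return nextHighestPowerOfTwo(v + 1);
    }

    public static void main(String[] args) {
        System.out.println(nextHighestPowerOfTwo(128));  // exact power of two is returned unchanged
        System.out.println(nextHighestPowerOfTwo(129));
        System.out.println(nextPowerOfTwoAbove(128));
    }
}
```

A caller sizing an array with the first method would still need to guard against a result of 0, as the quoted message points out.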


Re: Synonym filter with support for phrases?

2009-04-23 Thread Earwin Burrfoot
>> engine. So guys looking for "MSU CMC" really want to get "Московский
>> Государственный Университет, факультет ВМиК" and his friends.
> And? How often do they extend this particular phrase with further terms?
They don't need to. Variations of this phrase alone killed my first
several approaches to synonyms :)

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Synonym filter with support for phrases?

2009-04-23 Thread Dawid Weiss



> It'd be great to get multi-word synonyms fully working...


I agree -- this is something that seems to be useful for a wider group of 
people.


> How would you change how Lucene indexes token positions to do this "correctly"?


Kirill has some interesting points on this. I have a busy day today, but I'll 
try to clean up and post the code that I put together for another project. It'll 
be a starting point for refining in better directions.


Dawid




Fuzzy search optimization

2009-04-23 Thread Varun Dhussa
Hi,

I was going through the Levenshtein distance code in
org.apache.lucene.search.FuzzyTermEnum.java of the 2.4.1 build. I
noticed that there can be a small, but effective optimization to the
distance calculation code (initialization). I have the code ready with
me. I can post it if anyone is interested.

Thanks and regards
Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd.
http://maps.mapmyindia.com





Re: Synonym filter with support for phrases?

2009-04-23 Thread Earwin Burrfoot
> On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot  wrote:
>
>> Your synonyms will break if you try searching for phrases.
>> Building on your example, "food place in new york" will find nothing,
>> because 'place' and 'in' share the same position.
>
> It'd be great to get multi-word synonyms fully working...
>
> How would you change how Lucene indexes token positions to do this 
> "correctly"?
You need an ability to put two tokens in the same position, with
different posIncrements.

One variant off the top of my head is to introduce a notion of span,
so a token becomes (text, span, incr):
(restaurant, 1, 0), (food, 0, 1), (place, 0, 1), (in, 0, 1),
(new, 0, 1), (york, 0, 1)

The span affects the distance calculation between this term and any that follows.
E.g. dist(food, in) = 2, because both food and place have incr=1, but
despite restaurant and food having the same start position,
dist(restaurant, in) = 1, because restaurant spans an additional
position.

With something like that I think it is possible to formulate an
algorithm for indexing and query rewriting that does "correct"
multiword synonyms.
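
The (text, span, incr) idea above can be sketched in plain Java. This is only one possible reading of the proposal, not an existing Lucene API. It assumes that incr is the number of positions to advance before the next token starts, and that span is the number of extra positions a token covers beyond its start. All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class SpanTokenSketch {
    static final class Token {
        final String text;
        final int span; // assumed: extra positions covered beyond the start
        final int incr; // assumed: positions to advance before the next token
        Token(String text, int span, int incr) {
            this.text = text;
            this.span = span;
            this.incr = incr;
        }
    }

    // Start position of each token: running sum of the preceding incr values.
    static int[] startPositions(List<Token> tokens) {
        int[] starts = new int[tokens.size()];
        int pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            starts[i] = pos;
            pos += tokens.get(i).incr;
        }
        return starts;
    }

    // Distance from the end of token `from` to the start of token `to`.
    static int dist(List<Token> tokens, int from, int to) {
        int[] starts = startPositions(tokens);
        return starts[to] - (starts[from] + tokens.get(from).span);
    }

    // The token stream from the message above.
    static List<Token> example() {
        return Arrays.asList(
            new Token("restaurant", 1, 0),
            new Token("food", 0, 1),
            new Token("place", 0, 1),
            new Token("in", 0, 1),
            new Token("new", 0, 1),
            new Token("york", 0, 1));
    }
}
```

Under these assumptions, `dist(example(), 1, 3)` (food to in) is 2 and `dist(example(), 0, 3)` (restaurant to in) is 1, matching the figures in the message, so both "restaurant in" and "food place in" read as runs of adjacent terms.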

Right now I cheat when rewriting a query. If my syngroup is a part of
the phrase, and I know that this syngroup has longer phrases than the
one currently detected, I do a span or sloppy phrase query. That
works, but theoretically could match a wrong document.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Greetings and questions about patches

2009-04-23 Thread Michael McCandless
Welcome Erick!

Because the nextHighestPowerOfTwo methods are public, I think we cannot
change what they return, nor remove them.  At most we could deprecate
them now (and remove in 3.0), though I think it's fine to simply keep
them around even though nothing inside Lucene uses them today: since
we are heavy users of BitSet/Vector/Array/etc., it seems possible
we'll need them at some point.

EG we are looking to make a better data structure to share
nearly-identical deleted doc bitsets between near realtime readers,
which could conceivably use advanced BitUtil methods.

Keep hacking ;)

Mike

On Wed, Apr 22, 2009 at 9:33 PM, Erick Erickson  wrote:
> Hi all:
>
> I've been participating in the user list for some time, and I'd like
> to start helping maintain/enhance the code. So I thought I'd start
> with something small, mostly to get the process down. Unit tests
> sure fit the bill it seems to me, less chance of introducing errors
> through ignorance but a fine way to extend *my* understanding
> of Lucene.
>
> I managed to check out the code and run the unit tests, which
> was amazingly easy. I even managed to get the project into
> IntelliJ and connect the codestyle.xml file. Kudos for whoever
> set up the checkout/build process, I was dreading spending
> days setting this up, fortunately I didn't have to.
>
> So I, with Chris's help, found the code coverage report and
> chose something pretty straightforward to test, BitUtil since it
> was nice and self-contained. As I said, I'm looking at understanding
> the process rather than adding much value the first time.
>
> Alas, even something as simple as BitUtil generates questions
> that I'm asking mostly to understand what approach the veterans
> prefer. I'll argue with y'all next year sometime.
>
> So, according to the coverage report, there are two methods that
> are never executed by the unit tests (actually 4, 2 that operate on
> ints and 2 that operate on longs), isPowerOfTwo and
> nextHighestPowerOfTwo. nextHighestPowerOfTwo is especially
> clever, had to get out a paper and pencil to really understand it.
>
> Issues:
> 1> None of these methods is ever called. I commented them out
>    and ran all the unit tests and all is well. Additionally, commenting
>    out one of the other methods produces compile-time errors so I'm
>    fairly sure I didn't do something completely stupid that just *looked*
>    like it was OK. I grepped recursively and they're nowhere in the
>    *.java files.
>    1a> What's the consensus about unused code? Take it out (my
>        preference) along with leaving a comment on where it can
>        be found (since it *is* clever code)? Leave it in because someone
>        found some pretty neat algorithms that we may need sometime?
>    1b> I'm not entirely sure about the contrib area, but the contrib jars
>        are all new so I assume "ant clean test" compiles them as well.
>
> 2> I don't agree with the behavior of nextHighestPowerOfTwo. Should
>    I make changes if we decide to keep it?
>    2a> Why should it return the parameter passed in when it happens to be
>        a perfect power of two? e.g. this passes:
>          assertEquals(BitUtil.nextHighestPowerOfTwo(128L), 128);
>        I'd expect this to actually return 256, given the name.
>    2b> Why should it ever return 0? There's no power of two that is
>        zero. e.g. this passes:
>          assertEquals(BitUtil.nextHighestPowerOfTwo(-1), 0);
>        as does this:
>          assertEquals(BitUtil.nextHighestPowerOfTwo(0), 0);
>        *Assuming* that someone wants to use this sometime to, say, size
>        an array, they'd have to test against a return of 0.
>
>
> I'm fully aware that these are trivial issues in the grand scheme of things,
> and I *really* don't want to waste much time hashing them over. I'll provide
> a patch either way and go on to something slightly more complicated for
> my next trick.
>
> Best
> Erick
>
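
The return values questioned in the quoted message (128 for 128, 0 for 0 and for -1) are what the well-known bit-smearing idiom for rounding up to a power of two produces. The sketch below shows that idiom in a stand-alone class. It is an assumption that BitUtil uses the same approach, since the actual source is not quoted in this thread, and the class name is hypothetical.

```java
final class PowerOfTwoSketch {
    // Rounds v up to a power of two using the common bit-smearing idiom.
    // A value that is already a power of two is returned unchanged.
    static long nextHighestPowerOfTwo(long v) {
        v--;           // so that exact powers of two map to themselves
        v |= v >> 1;   // copy the highest set bit into every lower bit
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v |= v >> 32;
        v++;           // all-ones below the top bit, plus one, is a power of two
        return v;
    }
}
```

With this sketch, 100 and 128 both give 128 and 129 gives 256. For 0 and -1 the decrement yields a negative value, the signed shifts smear it to all ones, and the final increment wraps to 0, so a caller sizing an array from the result would need to guard against 0.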




Re: Synonym filter with support for phrases?

2009-04-23 Thread Michael McCandless
On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot  wrote:

> Your synonyms will break if you try searching for phrases.
> Building on your example, "food place in new york" will find nothing,
> because 'place' and 'in' share the same position.
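
The failure described in the quoted text can be reproduced with a small stand-alone simulation of flat token positions. It assumes the indexed text was "restaurant in new york" with the synonym "food place" injected at the position of "restaurant"; the original example is not quoted here, so that input is a guess. The code mimics position increments and exact phrase matching in plain Java and does not call Lucene. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FlatPositionSketch {
    // Builds term -> positions; each token is placed after advancing the
    // running position by its increment (an increment of 0 stacks tokens).
    static Map<String, List<Integer>> index(String[] terms, int[] posIncrements) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int pos = -1;
        for (int i = 0; i < terms.length; i++) {
            pos += posIncrements[i];
            postings.computeIfAbsent(terms[i], k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // Exact phrase match: term k must occur at (start of term 0) + k.
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String... phrase) {
        List<Integer> starts = postings.get(phrase[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int k = 1; k < phrase.length && ok; k++) {
                List<Integer> positions = postings.get(phrase[k]);
                ok = positions != null && positions.contains(start + k);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Assumed stream: restaurant@0, food@0, place@1, in@1, new@2, york@3.
    static Map<String, List<Integer>> example() {
        String[] terms = {"restaurant", "food", "place", "in", "new", "york"};
        int[] incs = {1, 0, 1, 0, 1, 1};
        return index(terms, incs);
    }
}
```

Here the phrase "restaurant in new york" matches, but "food place in new york" does not, because "place" and "in" both sit at position 1 and the phrase needs "in" at position 2.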

It'd be great to get multi-word synonyms fully working...

How would you change how Lucene indexes token positions to do this "correctly"?

Mike
