Problematic documentation in o.a.l.analysis.package.html

2009-11-22 Thread Shai Erera
Hi

I've read the analysis package.html and found a few issues:

1) The code sample under "Invoking the Analyzer" is broken. It calls
incrementToken(), but inside the while loop it prints 'ts' (the TokenStream)
and then does "t = ts.next()", which no longer works. That's an easy fix, so I
don't think a JIRA issue is needed.
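
For reference, a corrected consumer loop could look like the following sketch
(2.9-style; the cast is unnecessary with 3.0's generic addAttribute(); the
analyzer, field name and text are illustrative):

{code}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class InvokeAnalyzer {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new WhitespaceAnalyzer();
    TokenStream ts = analyzer.tokenStream("myfield",
        new StringReader("some text goes here"));
    // fetch the attribute once, before consuming; the instance never changes
    TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      // print the term text of the current token, not the stream itself
      System.out.println(termAtt.term());
    }
    ts.end();
    ts.close();
  }
}
{code}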

2) The documentation states that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression of how this API should be used. What if the
TokenStream does not care about this attribute? It will not fill it with any
information. The example of LengthFilter, which calls addAttribute() instead
of hasAttribute()/getAttribute(), is a good illustration of why you shouldn't
just call addAttribute(). LengthFilter relies on the given TokenStream to fill
TermAttribute with some information, so that it can later filter out terms
of length < threshold. But what if I create a LengthFilter and give it a
TokenStream which creates just a PartOfSpeechAttribute? Or outputs terms that
are not in a TermAttribute? Obviously it would be silly for me to do that, but
no one stops me from doing so. LengthFilter should either document that it
expects a TermAttribute to be provided by the input TokenStream, or better
yet, enforce it in the constructor --> if you pass a TokenStream that does
not provide a TermAttribute, throw an IllegalArgumentException.
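
For illustration, a hypothetical sketch of such a constructor check (the class
and parameter names are made up, 2.9-style casts; this is not Lucene's actual
LengthFilter):

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class MyLengthFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private final int minLength;

  public MyLengthFilter(TokenStream input, int minLength) {
    super(input);
    if (!input.hasAttribute(TermAttribute.class)) {
      // fail fast instead of silently filtering on an attribute nobody fills
      throw new IllegalArgumentException("input must provide a TermAttribute");
    }
    this.termAtt = (TermAttribute) input.getAttribute(TermAttribute.class);
    this.minLength = minLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.termLength() >= minLength) {
        return true; // keep terms at or above the threshold
      }
    }
    return false;
  }
}
{code}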

But anyway, the current documentation is, IMO, wrong and may give a wrong
impression. I don't know if this warrants a larger issue to investigate all
the current TokenFilters and validate their input TokenStreams. In my filters,
I enforce the existence of a certain attribute. If I've misunderstood
something, please correct me.

3) I think it would help if there were some documentation/examples of how
TokenFilters are expected to process an Attribute before they return it.
For example, if I have a TokenFilter which processes a certain TermAttribute
by returning two other TermAttributes, then according to my understanding,
upon a call to incrementToken() it should:
3.1) On the first call, clone the TokenStream's TermAttribute into an instance
variable, then process it and store in the TokenStream's TermAttribute the
first TA it should return.
3.2) On the second call, process it again and store the second TA in the
TokenStream's TermAttribute.
That's because the consumer will call incrementToken() and then getAttribute().
That getAttribute() will return the TokenStream's attribute, not the filter's.
I think I've read this somewhere, but it doesn't appear in this package.html.

Shai


[jira] Commented: (LUCENE-2068) fix reverseStringFilter for unicode 4.0

2009-11-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781102#action_12781102
 ] 

Simon Willnauer commented on LUCENE-2068:
-

Robert, we can take the matchVersion variant of the patch, though I still think 
this one should not maintain backwards compatibility here, as that preserves 
extremely broken code.

I plan to commit this soon if nobody objects.

simon

> fix reverseStringFilter for unicode 4.0
> ---
>
> Key: LUCENE-2068
> URL: https://issues.apache.org/jira/browse/LUCENE-2068
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2068.patch, LUCENE-2068.patch, LUCENE_2068.patch, 
> LUCENE_2068.patch
>
>
> ReverseStringFilter is not aware of supplementary characters: when it 
> reverses, it will create unpaired surrogates, which will be replaced by U+FFFD 
> by the indexer (but not at query time).
> The wrong words will conflate with each other, and the right words won't match; 
> basically the whole thing falls apart.
> This patch implements an in-place reverse with the algorithm from Apache 
> Harmony's AbstractStringBuilder.reverse0()
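
For illustration, a sketch of the surrogate-aware in-place reversal idea (an
adaptation of the general technique, not the attached patch): reverse all
chars first, then re-swap any surrogate pairs that ended up in low/high order.

{code}
public class SurrogateAwareReverse {
  public static void reverse(char[] buf, int start, int len) {
    // plain char-by-char reverse first
    for (int i = start, j = start + len - 1; i < j; i++, j--) {
      char tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }
    // surrogate pairs are now (low, high); swap them back to (high, low)
    for (int i = start; i < start + len - 1; i++) {
      if (Character.isLowSurrogate(buf[i]) && Character.isHighSurrogate(buf[i + 1])) {
        char tmp = buf[i]; buf[i] = buf[i + 1]; buf[i + 1] = tmp;
        i++; // skip the second half of the re-paired surrogate
      }
    }
  }

  public static void main(String[] args) {
    char[] s = "ab\uD835\uDD0Acd".toCharArray(); // contains a supplementary char
    reverse(s, 0, s.length);
    // prints "dc<same supplementary char>ba" with no unpaired surrogates
    System.out.println(new String(s));
  }
}
{code}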

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Problematic documentation in o.a.l.analysis.package.html

2009-11-22 Thread Uwe Schindler
Hi Shai,

 

Thanks for the suggestions!

 

About your points: 

1)   This is really wrong; we can easily fix it for 3.1. Lucene 3.0 is
already in the vote phase and 2.9.x is also already out.

2)   Maybe the explanation is not so good. This text comes mostly from the
2.9 migration from the old to the new TokenStream API. The current indexer
normally uses all attributes, and the default Token implementation implements
all six (in 2.9). We had a lot of TokenStreams in core that used getAttribute()
without checking whether the attribute is really there. With Token/TokenWrapper
as the implementation this was always OK, but as soon as you switched to the
new API only, the attributes were implemented by separate instances, so after
adding a TermAttribute you could not be sure to also have a PosIncr attribute.
Even Lucene's indexer had that problem. So for the basic six attributes you
can always use addAttribute(), because if you do not add an attribute when the
stream is instantiated, it is added by the indexer. This applies to
TermAttribute, PositionIncrementAttribute and OffsetAttribute (if term vectors
are enabled). So if the chain does not add them, they are added by the indexer
(consumer). Because all attributes have a default value, this is no problem,
and the indexer code is even faster (e.g. it does not make sense to check for
a PositionIncrementAttribute every time, as reading the default int value of 1
is just as fast, even faster without the extra null check; and with the
current index format it is always tracked and indexed). If you have your own
attribute, it may well make sense to check with hasAttribute() and throw an
IllegalArgumentException. But after this check, you can still use
addAttribute() to get the reference to the attribute (there was a discussion
about removing getAttribute() entirely). getAttribute() and addAttribute()
have the same speed when fetching an already existing attribute. So it is up
to you how you implement your TokenStreams; these are just hints for
beginners. The point about hasAttribute() and addAttribute() is also
mentioned in the docs a few lines down, in the "optimization" part.

3)   Attribute interfaces cannot be cloned (the Attribute interface has no
clone()). You can only clone implementations (AttributeImpl), but it may
happen that you clone not only the TermAttribute but some other attributes
too (e.g. if you use Token as the impl, as 2.9 does by default for backwards
compatibility). Because of that, attribute instances cannot simply be cloned,
only their implementing classes. What you generally should do (see the caching
TokenFilter): use AttributeSource.captureState() to capture all current
attributes in the filter chain. In incrementToken(), you can restore these
captured states again into the existing attributes. Such states can be held
in local instance variables, Lists, etc.
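
For illustration, a minimal hypothetical filter (2.9-style cast) showing that
pattern, emitting two tokens per input token:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class TwoForOneFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private State savedState; // captured attribute state of the last input token

  public TwoForOneFilter(TokenStream input) {
    super(input);
    this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (savedState != null) {
      // second call for this token: restore all captured attributes, then
      // derive the second token (a real filter would likely also set the
      // position increment to 0 here)
      restoreState(savedState);
      savedState = null;
      termAtt.setTermBuffer(termAtt.term() + "-2");
      return true;
    }
    if (!input.incrementToken()) {
      return false; // input exhausted
    }
    // first call: capture the complete attribute state for the second token
    savedState = captureState();
    return true;
  }
}
{code}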
Your second point I do not understand: consumers always call addAttribute()
(or hasAttribute()/getAttribute()) exactly once, before consuming! So they
only hold the current attribute instance, which can never change during the
lifetime of a TokenStream chain. Also, all
TokenFilters/TokenStreams/Tokenizers use the same instances! Because of this
the TokenStream chain works; if every TokenStream had its own attributes, they
could not communicate. The important thing is that you cannot rely on whether
a chain contains a given attribute or not: even if you created the Tokenizer
and one TokenFilter in the chain and checked hasAttribute() in the ctor of the
filter, it may happen that a third TokenFilter adds a new Attribute, which is
then suddenly available (but the second filter does not see it - OK, it may
not need to know about it, because the Tokenizer will never set it). Because
of that it is always best to use addAttribute() if you are "interested" in an
attribute. If it does not exist, you just get default values (like
termLength() == 0 for TermAttribute, posIncr == 1, ...).

 

Hope that answers your questions.

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

  _  

From: Shai Erera [mailto:[email protected]] 
Sent: Sunday, November 22, 2009 9:37 AM
To: [email protected]
Subject: Problematic documentation in o.a.l.analysis.package.html

 

Hi

I've read the analysis package.html and found a few issues:

1) The code sample under "Invoking the Analyzer" is broken. It calls
incrementToken(), but inside the while loop it prints 'ts' (the TokenStream)
and then does "t = ts.next()", which no longer works. That's an easy fix, so I
don't think a JIRA issue is needed.

2) The documentation states that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression of how this API should be used. What if the
TokenStream does not care about this attribute? It will not fill it with
any informat

Re: Hiding JIRA issues

2009-11-22 Thread Michael McCandless
Eek, that's kinda spooky... that we didn't get to the root cause.  I
sure hope Lucene is not to blame ;)

Mike

On Sat, Nov 21, 2009 at 9:21 PM, Robert Muir  wrote:
> I sent a note to infrastructure about this; they reindexed, and everything
> is fixed now.
>
> On Sat, Nov 21, 2009 at 10:20 AM, Uwe Schindler  wrote:
>>
>> Issues are missing after JIRA Client 1.4.1 synchronization, too.
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: [email protected]
>>
>>
>> > -Original Message-
>> > From: Uwe Schindler [mailto:[email protected]]
>> > Sent: Saturday, November 21, 2009 4:04 PM
>> > To: [email protected]; [email protected]
>> > Subject: RE: Hiding JIRA issues
>> >
>> > I am now trying to find it with JIRA Client GUI...
>> >
>> > -
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: [email protected]
>> >
>> >
>> > > -Original Message-
>> > > From: Simon Willnauer [mailto:[email protected]]
>> > > Sent: Saturday, November 21, 2009 3:57 PM
>> > > To: [email protected]
>> > > Subject: Re: Hiding JIRA issues
>> > >
>> > > On Sat, Nov 21, 2009 at 3:53 PM, Robert Muir  wrote:
>> > > > Mark, there's actually a huge chunk of hidden issues around that
>> > > > time.
>> > > > click 'view all issues' for lucene, and sort descending. we have
>> > > > about
>> > > 20
>> > > > right now, and you can easily see the gaps, those are the hidden
>> > > > ones!
>> > >
>> > > Uwe, that way you should be able to spot your infra issue too
>> > > >
>> > > > On Sat, Nov 21, 2009 at 9:50 AM, Mark Miller 
>> > > wrote:
>> > > >>
>> > > >> By the way, I'm sure it's probably the same issue - KayKay reported 2065
>> > > >> -
>> > > >> but if you go the list of issues that KayKay reported - 2065 is not
>> > in
>> > > >> that list.
>> > > >>
>> > > >> Mark Miller wrote:
>> > > >> > Robert suggested on Twitter that perhaps the JIRA Lucene index is
>> > > >> > corrupt :)
>> > > >> >
>> > > >> > Uwe Schindler wrote:
>> > > >> >
>> > > >> >> I opened an issue in INFRA to get a Hudson account. I forgot the
>> > > issue
>> > > >> >> number and now I cannot find it. It is not even listed under my
>> > > name.
>> > > >> >> How
>> > > >> >> can this happen?
>> > > >> >>
>> > > >> >> Uwe
>> > > >> >>
>> > > >> >> -
>> > > >> >> Uwe Schindler
>> > > >> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > >> >> http://www.thetaphi.de
>> > > >> >> eMail: [email protected]
>> > > >> >>
>> > > >> >>
>> > > >> >>
>> > > >> >>
>> > > >> >>> -Original Message-
>> > > >> >>> From: Mark Miller [mailto:[email protected]]
>> > > >> >>> Sent: Saturday, November 21, 2009 3:43 PM
>> > > >> >>> To: [email protected]
>> > > >> >>> Subject: Hiding JIRA issues
>> > > >> >>>
>> > > >> >>> Robert mentioned there are issues that appear to be hidden in
>> > JIRA
>> > > >> >>> unless you know what they are. Anyone else notice this?
>> > > >> >>>
>> > > >> >>> An example he gave is
>> > http://issues.apache.org/jira/browse/LUCENE-
>> > > 2065
>> > > >> >>> -
>> > > >> >>> there it is, but now try and find it in the summary/search view
>> > > >> >>> -
>> > I
>> > > >> >>> have
>> > > >> >>> and I can't. Kind of disturbing. There are others as well.
>> > > >> >>>
>> > > >> >>> - Mark
>> > > >> >>>
>> > > >> >>>
>> > > >> >>> -
>> > --
>> > > --
>> > > >> >>> To unsubscribe, e-mail: [email protected]
>> > > >> >>> For additional commands, e-mail:
>> > > >> >>> [email protected]
>> > > >> >>>
>> > > >> >>>
>> > > >> >>
>> > > >> >>
>> > > >> >> --
>> > --
>> > > -
>> > > >> >> To unsubscribe, e-mail: [email protected]
>> > > >> >> For additional commands, e-mail: [email protected]
>> > > >> >>
>> > > >> >>
>> > > >> >>
>> > > >> >
>> > > >> >
>> > > >>
>> > > >>
>> > > >>
>> > > >> -
>> > > >> To unsubscribe, e-mail: [email protected]
>> > > >> For additional commands, e-mail: [email protected]
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Robert Muir
>> > > > [email protected]
>> > > >
>> > >
>> > > -
>> > > To unsubscribe, e-mail: [email protected]
>> > > For additional commands, e-mail: [email protected]
>> >
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>>
>>
>>
>> -
>> To unsubscribe, e-mail: [email protected]
>> For additional comma

[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781110#action_12781110
 ] 

Michael McCandless commented on LUCENE-2075:


{quote}
bq. BTW the flex branch fixes this - TermsEnum.seek always checks the cache.

Can we fix this for trunk, too? But I think this issue talks about trunk.
{quote}

Right, this issue is about trunk (not the flex API).  I think we could fix this 
(see my suggestions above), basically decoupling "useCache" from 
"seekTheEnum"... but we have to fix the TermInfo cache to also store the term's 
ord.  I'll try out this approach...

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
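
For illustration, one possible shape of the double-barrel idea described above
(a hypothetical sketch, not the attached patch; the benign race between get()
and the swap only costs an occasional extra cache miss):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

  public DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        primary.put(key, value); // promote recently used entries to primary
      }
    }
    return value;
  }

  public synchronized void put(K key, V value) {
    if (primary.size() >= maxSize) {
      // primary is full: demote it to secondary (dropping the old secondary)
      // and start a fresh primary - the "swap"
      secondary = primary;
      primary = new ConcurrentHashMap<K, V>();
    }
    primary.put(key, value);
  }
}
{code}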

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)

2009-11-22 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1278#action_1278
 ] 

Thomas Mueller commented on LUCENE-1877:


> take it somewhere other than this closed issue.

Yes, where?

> shouldn't active code like that live in the application layer?

Why?

> exceed the polling interval for a low priority process on a system under 
> heavy load?

The watchdog thread runs with high priority (see the H2 docs). It's still 
possible that the thread isn't run at all, but that is highly unlikely; high-
priority threads are quite reliable. I wrote an MP3 player in Java 
(mp3transform) which I used a lot, and I never heard any gaps. If the thread 
can be avoided, that would be great, of course. I'm just trying to say that in 
theory the thread is problematic but in practice it isn't, while file locking 
is not a problem in theory, but is in practice.

> What happens when the app sleeps?

Good question! Standby / hibernate are not supported. I didn't think about 
that. Is there a way to detect the wakeup?

> host name and the pid

Yes. It is not so easy to get the PID in Java, I found: 
http://stackoverflow.com/questions/35842/process-id-in-java 
"ManagementFactory.getRuntimeMXBean().getName()". What do you do if the lock 
was generated by another machine? I tried with using a server socket, so you 
need the IP address, but unfortunately, sometimes the network is not configured 
correctly (but maybe it's possible to detect that). Maybe the two machines 
can't access each other over TCP/IP.
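
For what it's worth, a minimal sketch of that MXBean hack (the "pid@hostname"
format is common on Sun JVMs but not guaranteed anywhere):

{code}
import java.lang.management.ManagementFactory;

public class PidProbe {
  public static void main(String[] args) {
    String name = ManagementFactory.getRuntimeMXBean().getName();
    String pid = name.split("@")[0]; // best-effort; the format is JVM-specific
    System.out.println("name=" + name + ", pid guess=" + pid);
  }
}
{code}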

> hard links

Yes, but it looks like this doesn't always work.



> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Javadocs
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, 
> LUCENE-1877.patch
>
>
> A user requested we add a note in IndexWriter alerting the availability of 
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm 
> exit). Seems reasonable to me - we want users to be able to easily stumble 
> upon this class. The below code looks like a good spot to add a note - could 
> also improve whats there a bit - opening an IndexWriter does not necessarily 
> create a lock file - that would depend on the LockFactory used.
> {code}  Opening an IndexWriter creates a lock file for the 
> directory in use. Trying to open
>   another IndexWriter on the same directory will lead to a
>   {...@link LockObtainFailedException}. The {...@link 
> LockObtainFailedException}
>   is also thrown if an IndexReader on the same directory is used to delete 
> documents
>   from the index.{code}
> Anyone remember why NativeFSLockFactory is not the default over 
> SimpleFSLockFactory?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781112#action_12781112
 ] 

Michael McCandless commented on LUCENE-2075:



bq. in both cases, its slower than trunk, but I assume this is due to flex 
branch not being optimized yet?

The automaton benchmark looks great -- I'll dig into why the flex branch
is slower in both of these cases.

The first case tests old API on top of an old index, which I'm
surprised to see not matching trunk's performance.  The flex changes
are supposed to "optimize" that case by directly using the old (trunk)
code.

The second test tests old API emulated over a flex index, which I'm
also surprised to see is not faster than trunk -- there must be
something silly going on in the API emulation.

I'll dig...

When I tested MTQs (TermRangeQuery, WildcardQuery), using flex API on
flex index, they were reasonably faster, so I'll also try to get
automaton's FilteredTermEnum cutover to the flex API, and test that.


> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Problematic documentation in o.a.l.analysis.package.html

2009-11-22 Thread Shai Erera
Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of
habit. I'll read more about captureState, but I think copyTo works fine for
me.

About (2), I still think it's confusing. When I read addAttribute, I get the
impression that by calling this method, it is guaranteed that the attribute
will be processed somehow by the input TokenStream. And this may not be true.
There is a semantic difference between addAttribute and input.addAttribute.
The former means I add the attribute to *my* TS instance, while the latter
adds it to the input TS. Even though both are the same (as TokenFilter
attributes share the same instances as the TokenStream's attributes), they
are semantically different. I'd expect to do the former just to ensure this
attribute is in the 'attributes' map, and the latter to ensure the input TS
will process it.

BTW, remember that whatever looks clear to you, the implementer, may not be
clear to users. I still think that the current documentation may confuse
people, and adding an extra "NOTE: this does not ensure the input TS will do
anything with the passed Attribute. If you need that, call input.hasAttribute
to determine whether the attribute is handled by the input TS" won't hurt.

Shai

On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler  wrote:

>  Hi Shai,
>
>
>
> Thanks for the suggestions!
>
>
>
> About your points:
>
> 1)   This is really wrong; we can easily fix it for 3.1. Lucene 3.0 is
> already in the vote phase and 2.9.x is also already out.
>
> 2)   Maybe the explanation is not so good. This text comes mostly from the
> 2.9 migration from the old to the new TokenStream API. The current indexer
> normally uses all attributes, and the default Token implementation
> implements all six (in 2.9). We had a lot of TokenStreams in core that used
> getAttribute() without checking whether the attribute is really there. With
> Token/TokenWrapper as the implementation this was always OK, but as soon as
> you switched to the new API only, the attributes were implemented by
> separate instances, so after adding a TermAttribute you could not be sure
> to also have a PosIncr attribute. Even Lucene's indexer had that problem.
> So for the basic six attributes you can always use addAttribute(), because
> if you do not add an attribute when the stream is instantiated, it is added
> by the indexer. This applies to TermAttribute, PositionIncrementAttribute
> and OffsetAttribute (if term vectors are enabled). So if the chain does not
> add them, they are added by the indexer (consumer). Because all attributes
> have a default value, this is no problem, and the indexer code is even
> faster (e.g. it does not make sense to check for a
> PositionIncrementAttribute every time, as reading the default int value of
> 1 is just as fast, even faster without the extra null check; and with the
> current index format it is always tracked and indexed). If you have your
> own attribute, it may well make sense to check with hasAttribute() and
> throw an IllegalArgumentException. But after this check, you can still use
> addAttribute() to get the reference to the attribute (there was a
> discussion about removing getAttribute() entirely). getAttribute() and
> addAttribute() have the same speed when fetching an already existing
> attribute. So it is up to you how you implement your TokenStreams; these
> are just hints for beginners. The point about hasAttribute() and
> addAttribute() is also mentioned in the docs a few lines down, in the
> "optimization" part.
>
> 3)   Attribute interfaces cannot be cloned (the Attribute interface has no
> clone()). You can only clone implementations (AttributeImpl), but it may
> happen that you clone not only the TermAttribute but some other attributes
> too (e.g. if you use Token as the impl, as 2.9 does by default for
> backwards compatibility). Because of that, attribute instances cannot
> simply be cloned, only their implementing classes. What you generally
> should do (see the caching TokenFilter): use AttributeSource.captureState()
> to capture all current attributes in the filter chain. In incrementToken(),
> you can restore these captured states again into the existing attributes.
> Such states can be held in local instance variables, Lists, etc.
> Your second point I do not understand: consumers always call addAttribute()
> (or hasAttribute()/getAttribute()) exactly once, before consuming! So they
> only hold the current attribute instance, which can never change during the
> lifetime of a TokenStream chain. Also, all
> TokenFilters/TokenStreams/Tokenizers use the same instances! Because of
> this the TokenStream chain works; if every TokenStream had its own
> attributes, they could not communicate. The important thing is that you
> cannot rely on whether a chain contains a given attribute or not: even if
> you created the Tokenizer and one TokenFilter in the chain and checked
> hasAttribute() in the ctor of the filter, it may happen that a third
> TokenFilter adds a new Attribute, w

[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)

2009-11-22 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781113#action_12781113
 ] 

Thomas Mueller commented on LUCENE-1877:


> detect the wakeup / polling interval exceeded

An obvious solution is to use System.currentTimeMillis() and compare it against 
the expected delay, and then stop writing and throw an exception.
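
For illustration, a hypothetical sketch of that idea: a watchdog that compares
elapsed wall-clock time against the expected polling delay and invalidates the
lock when the gap is exceeded (names and thresholds are made up):

{code}
public class LockWatchdog extends Thread {
  private static final long POLL_INTERVAL_MS = 1000;
  private static final long MAX_ALLOWED_GAP_MS = 5 * POLL_INTERVAL_MS;
  private volatile boolean lockValid = true;

  public boolean isLockValid() {
    return lockValid; // the writer checks this before each write
  }

  @Override
  public void run() {
    long last = System.currentTimeMillis();
    while (lockValid) {
      try {
        Thread.sleep(POLL_INTERVAL_MS);
      } catch (InterruptedException e) {
        return;
      }
      long now = System.currentTimeMillis();
      if (now - last > MAX_ALLOWED_GAP_MS) {
        // suspended (standby/hibernate) or starved far beyond the polling
        // interval: the lock can no longer be trusted, so stop writing
        lockValid = false;
      }
      last = now;
    }
  }
}
{code}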

> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Javadocs
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, 
> LUCENE-1877.patch
>
>
> A user requested we add a note in IndexWriter alerting the availability of 
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm 
> exit). Seems reasonable to me - we want users to be able to easily stumble 
> upon this class. The below code looks like a good spot to add a note - could 
> also improve whats there a bit - opening an IndexWriter does not necessarily 
> create a lock file - that would depend on the LockFactory used.
> {code}  Opening an IndexWriter creates a lock file for the 
> directory in use. Trying to open
>   another IndexWriter on the same directory will lead to a
>   {...@link LockObtainFailedException}. The {...@link 
> LockObtainFailedException}
>   is also thrown if an IndexReader on the same directory is used to delete 
> documents
>   from the index.{code}
> Anyone remember why NativeFSLockFactory is not the default over 
> SimpleFSLockFactory?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Problematic documentation in o.a.l.analysis.package.html

2009-11-22 Thread Uwe Schindler
About (3): copyTo() is also only available for AttributeImpls, not for the
interfaces (you have to cast first, and then you are warned). If you call
copyTo() on a TermAttribute, it may also copy other attributes with it, if
TermAttribute and the PosIncr attribute are implemented by the same
AttributeImpl. With captureState() you can be sure to have copied all
attributes.

 

There is one important thing: *always* call
AttributeSource.clearAttributes() when you generate new tokens in
incrementToken(). This affects mainly Tokenizers (they always have to call
clearAttributes(), see the docs), but if you inject tokens inside a
TokenFilter, always call clearAttributes() too. If you do not do this and
e.g. only touch the TermAttribute, the PosIncr may stay the same as before
and not get the default, so e.g. StopFilter gets broken when you inject
tokens (the SmartChinese analyzer had that problem).
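
For illustration, a minimal hypothetical Tokenizer showing that rule
(2.9-style casts; the class is made up):

{code}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SingleCharTokenizer extends Tokenizer {
  private final TermAttribute termAtt =
      (TermAttribute) addAttribute(TermAttribute.class);
  private final OffsetAttribute offsetAtt =
      (OffsetAttribute) addAttribute(OffsetAttribute.class);
  private int pos = 0;

  public SingleCharTokenizer(Reader input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // reset ALL attributes first, e.g. position increment back to its default of 1
    clearAttributes();
    int c = input.read();
    if (c == -1) {
      return false;
    }
    termAtt.setTermBuffer(new char[] {(char) c}, 0, 1);
    offsetAtt.setOffset(pos, pos + 1);
    pos++;
    return true;
  }
}
{code}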

 

About (2): There is no semantic difference; the call to input.addAttribute()
does exactly the same as this.addAttribute(), because all TokenFilters in a
chain share the same attributes. It has exactly the same effect (you get the
same instance of the attribute), so the two calls are semantically, and in
reality, the same. If you think this is different, you have not yet understood
the whole attributes API.
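
A tiny demonstration of that sharing (2.9-style casts; tokenizer, filter and
input text are illustrative):

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SharedAttributeDemo {
  public static void main(String[] args) {
    TokenStream tok = new WhitespaceTokenizer(new StringReader("Hello World"));
    TokenStream filter = new LowerCaseFilter(tok);
    TermAttribute fromFilter = (TermAttribute) filter.addAttribute(TermAttribute.class);
    TermAttribute fromInput = (TermAttribute) tok.addAttribute(TermAttribute.class);
    System.out.println(fromFilter == fromInput); // prints true: one shared instance
  }
}
{code}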

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

  _  

From: Shai Erera [mailto:[email protected]] 
Sent: Sunday, November 22, 2009 11:39 AM
To: [email protected]
Subject: Re: Problematic documentation in o.a.l.analysis.package.html

 

Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of
habit. I'll read more about captureState, but I think copyTo works fine for
me.

About (2), I still think it's confusing. When I read addAttribute, I get the
impression that by calling this method, it is guaranteed that the attribute
will be processed somehow by the input TokenStream. And this may not be true.
There is a semantic difference between addAttribute and input.addAttribute.
The former means I add the attribute to *my* TS instance, while the latter
adds it to the input TS. Even though both are the same (as TokenFilter
attributes share the same instances as the TokenStream's attributes), they
are semantically different. I'd expect to do the former just to ensure this
attribute is in the 'attributes' map, and the latter to ensure the input TS
will process it.

BTW, remember that whatever looks clear to you, the implementer, may not be
clear to users. I still think that the current documentation may confuse
people, and adding an extra "NOTE: this does not ensure the input TS will do
anything with the passed Attribute. If you need that, call input.hasAttribute
to determine whether the attribute is handled by the input TS" won't hurt.

Shai

On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler  wrote:

Hi Shai,

 

Thanks for the suggestions!

 

About your points: 

1)   This is really wrong; we can easily fix it for 3.1. Lucene 3.0 is
already in the vote phase and 2.9.x is also already out.

2)   Maybe the explanation is not so good. This text comes mostly from the
2.9 migration from the old to the new TokenStream API. The current indexer
normally uses all attributes, and the default Token implementation implements
all six (in 2.9). We had a lot of TokenStreams in core that used getAttribute()
without checking whether the attribute is really there. With Token/TokenWrapper
as the implementation this was always OK, but as soon as you switched to the
new API only, the attributes were implemented by separate instances, so after
adding a TermAttribute you could not be sure to also have a PosIncr attribute.
Even Lucene's indexer had that problem. So for the basic six attributes you
can always use addAttribute(), because if you do not add an attribute when the
stream is instantiated, it is added by the indexer. This applies to
TermAttribute, PositionIncrementAttribute and OffsetAttribute (if term vectors
are enabled). So if the chain does not add them, they are added by the indexer
(consumer). Because all attributes have a default value, this is no problem,
and the indexer code is even faster (e.g. it does not make sense to check for
a PositionIncrementAttribute every time, as reading the default int value of 1
is just as fast, even faster without the extra null check; and with the
current index format it is always tracked and indexed). If you have your own
attribute, it may well make sense to check with hasAttribute() and throw an
IllegalArgumentException. But after this check, you can still use
addAttribute() to get the reference to the attribute (there was a discussion
about removing getAttribute() entirely). getAttribute() and addAttribute()
have the same speed when fetching an already existing attribute. So it is up
to you how you implement your TokenStreams; these are just hints for
beginners. The point about hasAttribute() and addAttribute() is also
mentioned in the docs a few lines down, in the

Re: Problematic documentation in o.a.l.analysis.package.html

2009-11-22 Thread Shai Erera
Perhaps copyTo works for me because I reference Token, but like I said, it's
working for me ...

Thanks for the tip regarding clearAttributes(). I assume I'll get the same
behavior if I clear the attributes one by one, defaulting their values to
whatever my defaults are.

Well ... IMO as a user, there is a semantic difference. I understand that
eventually both are the same. And even though I don't understand the new TS
API as well as you do, I think I have a fair understanding of it. And I also
think that naive users will be confused ... In Java, when you call
object.doSomething() vs. doSomething(), you understand that the former is
performed on 'object' and the latter on your own instance. That you (in the
new API code) made sure both are the same is invisible to the user, unless
he/she carefully reads the javadocs. The current impl is smart and useful; I
just think some better documentation could help new users.

Shai

On Sun, Nov 22, 2009 at 12:51 PM, Uwe Schindler  wrote:

>  About (3): copyTo() is also only available for AttributeImpls, not for
> the interfaces (you have to cast first, and then you are warned). If you
> call copyTo() on a TermAttribute, it may also copy other attributes with
> it, if TermAttribute and the PosIncr attribute are implemented by the same
> AttributeImpl. With captureState() you can be sure to have copied all
> attributes.
>
>
>
> There is one important thing: **always** call
> AttributeSource.clearAttributes() when you generate new tokens in
> incrementToken(). This affects mainly Tokenizers (they always have to call
> clearAttributes(), see the docs), but if you inject tokens inside a
> TokenFilter, always call clearAttributes() too. If you do not do this and
> e.g. only touch the TermAttribute, the PosIncr may stay the same as before
> and not get the default, so e.g. StopFilter gets broken when you inject
> tokens (the SmartChinese analyzer had that problem).
>
>
>
> About (2): There is no semantic difference; the call to
> input.addAttribute() does exactly the same as this.addAttribute(), because
> all TokenFilters in a chain share the same attributes. It has exactly the
> same effect (you get the same instance of the attribute), so the two calls
> are semantically, and in reality, the same. If you think this is different,
> you have not yet understood the whole attributes API.
>
>
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>   --
>
> *From:* Shai Erera [mailto:[email protected]]
> *Sent:* Sunday, November 22, 2009 11:39 AM
>
> *To:* [email protected]
> *Subject:* Re: Problematic documentation in o.a.l.analysis.package.html
>
>
>
> Thanks Uwe.
>
> About (3), I use copyTo, not clone. I used the word 'clone' just out of
> habit. I'll read more about captureState, but I think copyTo works fine for
> me.
>
> About (2), I still think it's confusing. When I read addAttribute, I get
> the impression that by calling this method, it is guaranteed that the
> attribute will be processed somehow by the input TokenStream. And this may
> not be true. There is a semantic difference between addAttribute and
> input.addAttribute. The former means I add the attribute to *my* TS
> instance, while the latter adds it to the input TS. Even though both are
> the same (as TokenFilter attributes share the same instances as the
> TokenStream's attributes), they are semantically different. I'd expect to
> do the former just to ensure this attribute is in the 'attributes' map, and
> the latter to ensure the input TS will process it.
>
> BTW, remember that whatever looks clear to you, the implementer, may not be
> clear to users. I still think that the current documentation may confuse
> people, and adding an extra "NOTE: this does not ensure the input TS will
> do anything with the passed Attribute. If you need that, call
> input.hasAttribute to determine whether the attribute is handled by the
> input TS" won't hurt.
>
> Shai
>
> On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler  wrote:
>
> Hi Shai,
>
>
>
> Thanks for the suggestions!
>
>
>
> About your points:
>
> 1)   This is really wrong; we can easily fix it for 3.1. Lucene 3.0 is
> already in the vote phase and 2.9.x is also already out.
>
> 2)   Maybe the explanation is not so good. This text comes mostly from the
> 2.9 migration from the old to the new TokenStream API. The current indexer
> normally uses all attributes, and the default Token implementation
> implements all six (in 2.9). We had a lot of TokenStreams in core that used
> getAttribute() without checking whether the attribute is really there. With
> Token/TokenWrapper as the implementation this was always OK, but as soon as
> you switched to the new API only, the attributes were implemented by
> separate instances, so after adding a TermAttribute you could not be sure
> to also have a PosIncr attribute. Even Lucene's indexer had that problem.
> So for the basic six attributes you can always use addAttribute(), because
> if you n

RE: Problematic documentation in o.a.l.analysis.package.html

2009-11-22 Thread Uwe Schindler
About clearAttributes: just a warning - if you clear the attributes one by
one, you have two problems:

-  You can only clear attributes you know about. E.g. most
Tokenizers just set TermAttribute and OffsetAttribute (because only these
two attributes are interesting). The PosIncr attribute does not need to be
touched, as it defaults to 1, which is correct for generating tokens from a
Reader. If you only clear those two attributes in the Tokenizer, the posIncr
attribute is not reset. If you add a StopFilter or SynonymFilter after
this tokenizer, which uses the PosIncr attribute, it would not see this
attribute reset. This led to the bug in the SmartChineseAnalyzer, which
missed resetting *all* attributes; therefore the StopFilter incremented the
PosIncr each time, eventually overflowing with invalid values. So please,
whenever you inject new tokens, call clearAttributes() on the TokenStream.

-  If you clear the attributes one by one, you may clear them
multiple times. If all six attributes are implemented by Token, and you clear
them one by one, Token.clear() is called six times. Not very efficient.
That is also why no attribute interface contains copyTo() or clone() or
equals()/hashCode().

 

I know the documentation can be improved. Sorry about that, but you are
trying to use expert features. Normal TokenFilters just add some attributes
in the ctor and use them. Token may be the implementation behind them, but
you cannot rely on that!!! If the Tokenizer already added attributes, the
Token impl added by the filter will never be added (because the interfaces
are already available). So, just to "keep" Token, do *not* try to cast all
attributes to Token; this will fail (see next mail).

 

Uwe

 

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

  _  

From: Shai Erera [mailto:[email protected]] 
Sent: Sunday, November 22, 2009 12:06 PM
To: [email protected]
Subject: Re: Problematic documentation in o.a.l.analysis.package.html

 

Perhaps copyTo works for me because I reference Token, but like I said, it's
working for me ...

Thanks for the tip regarding clearAttributes(). I assume I'll get the same
behavior if I clear the attributes one by one, defaulting their values to
whatever my defaults are.

Well ... IMO as a user, there is a semantic difference. I understand that
eventually both are the same. And even though I don't understand the new TS
API as well as you do, I think I have a fair understanding of it. And I also
think that naive users will be confused ... In Java, when you call
object.doSomething() vs. doSomething(), you understand that the former is
performed on 'object' and the latter on your own instance. That you (in the
new API code) made sure both are the same is invisible to the user, unless
he/she carefully reads the javadocs. The current impl is smart and useful; I
just think some better documentation could help new users.

Shai

On Sun, Nov 22, 2009 at 12:51 PM, Uwe Schindler  wrote:

About (3): copyTo() is also only available for AttributeImpls, not for the
interfaces (you have to cast first, and then you are warned). If you call
copyTo() on a TermAttribute, it may also copy other attributes with it, if
TermAttribute and the PosIncr attribute are implemented by the same
AttributeImpl. With captureState() you can be sure to have copied all
attributes.

 

There is one important thing: *always* call
AttributeSource.clearAttributes() when you generate new tokens in
incrementToken(). This affects mainly Tokenizers (they always have to call
clearAttributes(), see the docs), but if you inject tokens inside a
TokenFilter, always call clearAttributes() too. If you do not do this and
e.g. only touch the TermAttribute, the PosIncr may stay the same as before
and not get the default, so e.g. StopFilter gets broken when you inject
tokens (the SmartChinese analyzer had that problem).

 

About (2): There is no semantic difference; the call to input.addAttribute()
does exactly the same as this.addAttribute(), because all TokenFilters in a
chain share the same attributes. It has exactly the same effect (you get the
same instance of the attribute), so the two calls are semantically, and in
reality, the same. If you think this is different, you have not yet
understood the whole attributes API.

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

  _  

From: Shai Erera [mailto:[email protected]] 
Sent: Sunday, November 22, 2009 11:39 AM


To: [email protected]

Subject: Re: Problematic documentation in o.a.l.analysis.package.html

 

Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of
habit. I'll read more about captureState, but I think copyTo works fine for
me.

About (2), I still think it's confusing. When I read addAttribute, I get the
impression that by calling this method, it is guaranteed that the
attribute will be processe

[jira] Created: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler (JIRA)
AttributeSource.addAttribute should only accept interfaces, the missing test 
leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
-

 Key: LUCENE-2088
 URL: https://issues.apache.org/jira/browse/LUCENE-2088
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9.1, 2.9, 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Blocker
 Fix For: 3.0


This is a blocker, because you can call addAttribute(Token.class) without 
getting an error message.

I will commit the fix and restart the vote for 3.0. This also applies to 2.9, 
but there is no Token Attribute Factory. But I will merge to 2.9, too, if a 
2.9.2 comes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Updated: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2088:
--

Attachment: LUCENE-2088.patch

Here is the patch; I will commit soon and respin the 3.0 vote.

I will also merge it to the 2.9 branch.

> AttributeSource.addAttribute should only accept interfaces, the missing test 
> leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> -
>
> Key: LUCENE-2088
> URL: https://issues.apache.org/jira/browse/LUCENE-2088
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 3.0
>
> Attachments: LUCENE-2088.patch
>
>
> This is a blocker, because you can call addAttribute(Token.class) without 
> getting an error message.
> I will commit the fix and restart the vote for 3.0. This also applies to 2.9, 
> but there is no Token Attribute Factory. But I will merge to 2.9, too, if a 
> 2.9.2 comes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781122#action_12781122
 ] 

Earwin Burrfoot commented on LUCENE-2088:
-

bq. && Attribute.class.isAssignableFrom(attClass)
What is this for? This line
bq. public <A extends Attribute> A addAttribute(Class<A> attClass) {
ensures the same at compile time.

> AttributeSource.addAttribute should only accept interfaces, the missing test 
> leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> -
>
> Key: LUCENE-2088
> URL: https://issues.apache.org/jira/browse/LUCENE-2088
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 3.0
>
> Attachments: LUCENE-2088.patch
>
>
> This is a blocker, because you can call addAttribute(Token.class) without 
> getting an error message.
> I will commit the fix and restart the vote for 3.0. This also applies to 2.9, 
> but there is no Token Attribute Factory. But I will merge to 2.9, too, if a 
> 2.9.2 comes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: [jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler
If you use it in a type-unsafe way, without generics, it will break. And we
need the check for 2.9.

I was thinking about both variants and thought it would be better to leave
it in. I will now merge this to 2.9, too.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -Original Message-
> From: Earwin Burrfoot (JIRA) [mailto:[email protected]]
> Sent: Sunday, November 22, 2009 2:35 PM
> To: [email protected]
> Subject: [jira] Commented: (LUCENE-2088) AttributeSource.addAttribute
> should only accept interfaces, the missing test leads to problems with
> Token.TOKEN_ATTRIBUTE_FACTORY
> 
> 
> [ https://issues.apache.org/jira/browse/LUCENE-
> 2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel&focusedCommentId=12781122#action_12781122 ]
> 
> Earwin Burrfoot commented on LUCENE-2088:
> -
> 
> bq. && Attribute.class.isAssignableFrom(attClass)
> What is this for? This line
> bq. public <A extends Attribute> A addAttribute(Class<A> attClass) {
> ensures the same at compile time.
> 
> > AttributeSource.addAttribute should only accept interfaces, the missing
> test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> > 
> -
> >
> > Key: LUCENE-2088
> > URL: https://issues.apache.org/jira/browse/LUCENE-2088
> > Project: Lucene - Java
> >  Issue Type: Bug
> >Affects Versions: 2.9, 2.9.1, 3.0
> >Reporter: Uwe Schindler
> >Assignee: Uwe Schindler
> >Priority: Blocker
> > Fix For: 3.0
> >
> > Attachments: LUCENE-2088.patch
> >
> >
> > This is a blocker, because you can call addAttribute(Token.class)
> without getting an error message.
> > I will commit the fix and restart the vote for 3.0. This also applies to
> 2.9, but there is no Token Attribute Factory. But I will merge to 2.9,
> too, if a 2.9.2 comes.
> 
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]



-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781124#action_12781124
 ] 

Uwe Schindler commented on LUCENE-2088:
---

If you use it in a type-unsafe way, without generics, it will break. And we need 
the check for 2.9.

You can break this if you do:
addAttribute((Class) List.class)

I was thinking about both variants and thought it would be better to leave it 
in. I will now merge this to 2.9, too, where we need it in all cases.
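
For illustration, a self-contained version of that break (sketched against the
3.0 generic signature; the class name is made up):

{code}
import java.util.List;

import org.apache.lucene.util.AttributeSource;

public class BreakAddAttribute {
  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    AttributeSource src = new AttributeSource();
    // the raw cast sneaks a non-Attribute class past the compile-time bound;
    // only the runtime isAssignableFrom check can reject it
    src.addAttribute((Class) List.class);
  }
}
{code}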

> AttributeSource.addAttribute should only accept interfaces, the missing test 
> leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> -
>
> Key: LUCENE-2088
> URL: https://issues.apache.org/jira/browse/LUCENE-2088
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 3.0
>
> Attachments: LUCENE-2088.patch
>
>
> This is a blocker, because you can call addAttribute(Token.class) without 
> getting an error message.
> I will commit the fix and restart the vote for 3.0. This also applies to 2.9, 
> but there is no Token Attribute Factory. But I will merge to 2.9, too, if a 
> 2.9.2 comes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Updated: (LUCENE-2087) Remove recursion in NumericRangeTermEnum

2009-11-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2087:
--

Fix Version/s: 3.0

> Remove recursion in NumericRangeTermEnum
> 
>
> Key: LUCENE-2087
> URL: https://issues.apache.org/jira/browse/LUCENE-2087
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0, 3.1
>
> Attachments: LUCENE-2087.patch
>
>
> The current FilteredTermEnum in NRQ uses setEnum(), which itself calls 
> next(). This may lead to a recursion that can overflow the stack if the index 
> is empty and a large range with a low precStep is used. With 64-bit numbers 
> and precStep == 1 there may be 127 recursions, as each sub-range would hit no 
> term on an empty index, and the setEnum() call would then call next(), which 
> itself calls setEnum() again. This leads to a recursion depth of 256.
> Attached is a patch that converts this to an iterative approach. setEnum() is 
> now unused and throws UnsupportedOperationException (like endEnum()).
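
For illustration, a toy version of the conversion (not NRQ's actual code):
scanning sub-ranges with a loop keeps the stack flat, while the old mutual
recursion grew it once per empty sub-range.

{code}
import java.util.LinkedList;
import java.util.Queue;

public class IterativeSubRangeScan {
  /** Returns the first non-empty [lower, upper) range, or null if all are empty. */
  static long[] firstNonEmpty(Queue<long[]> subRanges) {
    while (!subRanges.isEmpty()) { // iterate instead of recursing
      long[] r = subRanges.poll();
      if (r[1] > r[0]) {
        return r; // a sub-range that would actually hit terms
      }
      // empty sub-range: continue the loop; the recursive variant would
      // call itself here, adding stack frames once per empty range
    }
    return null;
  }

  public static void main(String[] args) {
    Queue<long[]> ranges = new LinkedList<long[]>();
    for (int i = 0; i < 127; i++) {
      ranges.add(new long[] {i, i}); // 127 empty sub-ranges, as on an empty index
    }
    ranges.add(new long[] {0, 100});
    System.out.println(java.util.Arrays.toString(firstNonEmpty(ranges)));
  }
}
{code}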

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Updated: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2088:
--

Attachment: LUCENE-2088-test.patch

This patch shows how you can break it.

As Shai said, the problem is not only that it may have no effect; it completely 
breaks the behaviour of AttributeSource when you do this. Because of that, the 
extra check is needed.

> AttributeSource.addAttribute should only accept interfaces, the missing test 
> leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> -
>
> Key: LUCENE-2088
> URL: https://issues.apache.org/jira/browse/LUCENE-2088
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 3.0
>
> Attachments: LUCENE-2088-test.patch, LUCENE-2088.patch
>
>
> This is a blocker, because you can call addAttribute(Token.class) without 
> getting an error message.
> I will commit the fix and restart the vote for 3.0. This also applies to 2.9, 
> but there is no Token Attribute Factory. But I will merge to 2.9, too, if a 
> 2.9.2 comes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781129#action_12781129
 ] 

Uwe Schindler commented on LUCENE-2088:
---

Thinking about it more and reading 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6461827, maybe we should 
remove it again. But addAttributeImpl already does a lot of isAssignableFrom 
checks (though cached), so maybe we should remove it for 3.0/3.1. In 2.9 it 
must stay in.

What do others think?

> AttributeSource.addAttribute should only accept interfaces, the missing test 
> leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> -
>
> Key: LUCENE-2088
> URL: https://issues.apache.org/jira/browse/LUCENE-2088
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 3.0
>
> Attachments: LUCENE-2088-test.patch, LUCENE-2088.patch




[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY

2009-11-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781131#action_12781131
 ] 

Uwe Schindler commented on LUCENE-2088:
---

But it's no problem anymore; the Sun bug has been fixed since:
Release Fixed: 6u2(b01), 5.0u12(b02) (Bug ID: 2144702), hs10(b07) (Bug 
ID: 2146432), 7(b07) (Bug ID: 2176843)

Let's keep it in.

> AttributeSource.addAttribute should only accept interfaces, the missing test 
> leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
> -
>
> Key: LUCENE-2088
> URL: https://issues.apache.org/jira/browse/LUCENE-2088
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 3.0
>
> Attachments: LUCENE-2088-test.patch, LUCENE-2088.patch




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781134#action_12781134
 ] 

Michael McCandless commented on LUCENE-1606:


Are we going to deprecate contrib/regex with this?

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
> not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
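A rough sketch of the seek-driven enumeration idea described above; TermDict and Dfa are invented stand-ins for the term dictionary and the BRICS automaton, not real APIs:

    interface TermDict {
      String seekCeil(String target); // smallest stored term >= target, or null
      String next(String term);       // term following 'term', or null
    }

    interface Dfa {
      boolean accepts(String s);
      // smallest string > s that could still be accepted, or null
      String nextPossibleMatch(String s);
    }

    final class AutomatonEnumSketch {
      static void enumerate(TermDict dict, Dfa dfa, java.util.List<String> out) {
        String term = dict.seekCeil(""); // start at the first term
        while (term != null) {
          if (dfa.accepts(term)) {
            out.add(term);              // accepted: collect and step forward
            term = dict.next(term);
          } else {
            // rejected: compute the next string that could match and seek
            // there, skipping the whole rejected range of the term dictionary
            String target = dfa.nextPossibleMatch(term);
            term = (target == null) ? null : dict.seekCeil(target);
          }
        }
      }
    }

This skipping is what makes the approach fast without a constant prefix: rejected regions of the term dictionary are never enumerated term by term.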




[jira] Created: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)
explore using automaton for fuzzyquery
--

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor


Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is 
itching to write that nasty algorithm).

We can optimize FuzzyQuery by using AutomatonTermEnum; here is my idea:
* Up front, calculate the maximum number of edits N required to match the 
user's supplied float threshold (see the sketch below).
* For at least the common N (1, 2, 3, etc.) we should use AutomatonTermEnum. If 
it's outside of that, maybe use the existing slow logic. At high N, it will seek 
too much to be helpful anyway.

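As a back-of-the-envelope sketch of that first step, assuming FuzzyQuery's similarity is roughly 1 - edits/length (a simplification, not the exact formula):

    final class MaxEdits {
      // Largest edit count N that can still reach the similarity threshold.
      static int maxEdits(float minSimilarity, int termLength) {
        return (int) ((1.0f - minSimilarity) * termLength);
      }
    }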
I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|   
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

As you can see, this prototype is no good yet, because it creates the DFA in a 
slow way: right now it creates an NFA, and all the wasted time is in the 
NFA->DFA conversion. So, for a very long string, it just gets worse and worse. 
This has nothing to do with Lucene, and as you can see here, the TermEnum 
itself is fast (AvgMS - AvgMS_DFA); there is no problem there.

Instead we should just build a DFA to begin with, maybe with this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
We can precompute the tables with that algorithm up to some reasonable N, and 
then I think we are ok.

The paper references using http://portal.acm.org/citation.cfm?id=135907 for 
linear minimization; if someone wants to implement this, they should not worry 
about minimization. In fact, we need to determine at some point whether 
AutomatonQuery should minimize FSMs at all, or whether it is simply enough for 
them to be deterministic with no transitions to dead states. (The only code 
that actually assumes a minimal DFA is the "Dumb" vs "Smart" heuristic, and 
this can easily be rewritten as a summation.) We need to benchmark really 
complex DFAs (i.e. write a regex benchmark) to figure out whether minimization 
is even helping right now.






[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781135#action_12781135
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Are we going to deprecate contrib/regex with this? 

I would argue against that, only because the other regex implementations have 
different features and syntax, even if they are slow.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch




[VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-22 Thread Uwe Schindler
Hi,

I have built the artifacts for the final release of "Apache Lucene Java
3.0.0" a second time, because of a bug in the TokenStream API (found by Shai
Erera, who wanted to do "bad" things with addAttribute, breaking its
behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to prevent
a stack overflow, LUCENE-2087). They are targeted for release on 2009-11-25.

The artifacts are here:
http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/

You can find the changes in the corresponding subfolder. The SVN revision is
883080; here is the manifest with build system info:

Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.0
Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
Specification-Title: Lucene Search Engine
Specification-Version: 3.0.0
Specification-Vendor: The Apache Software Foundation
Implementation-Title: org.apache.lucene
Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
Implementation-Vendor: The Apache Software Foundation
X-Compile-Source-JDK: 1.5
X-Compile-Target-JDK: 1.5

Please vote to officially release these artifacts as "Apache Lucene Java
3.0.0".

We need at least 3 binding (PMC) votes.

Thanks everyone for all their hard work on this and I am very sorry for
requesting a vote again, but that's life! Thanks Shai for the pointer to the
bug!




Here is the proposed release note, please edit, if needed:
--

Hello Lucene users, 

On behalf of the Lucene dev community (a growing community far larger than
just the committers) I would like to announce the release of Lucene Java
3.0:

The new version is mostly a cleanup release without any new features. All
deprecations targeted to be removed in version 3.0 were removed. If you are
upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
warnings in your code base to be able to recompile against this version.

This is the first Lucene release with Java 5 as a minimum requirement. The
API was cleaned up to make use of Java 5's generics, varargs, enums, and
autoboxing. New users of Lucene are advised to use this version for new
developments, because it has a clean, type-safe new API. Upgrading users can
now remove unnecessary casts and add generics to their code, too. If you
have not upgraded your installation to Java 5, please read the file
JRE_VERSION_MIGRATION.txt (please note that this is not specific to Lucene
3.0; it also applies to any previous release when you upgrade your
Java environment).

Lucene 3.0 has some changes regarding compressed fields: 2.9 already
deprecated compressed fields, and support for them has now been removed. Lucene 3.0
is still able to read indexes with compressed fields, but as soon as merges
occur or the index is optimized, all compressed fields are decompressed and
converted to Field.Store.YES. Because of this, indexes with compressed
fields can suddenly get larger.

While we generally try to maintain full backwards compatibility between
major versions, Lucene 3.0 has some minor breaks, mostly related to
deprecation removal, pointed out in the 'Changes in backwards compatibility
policy' section of CHANGES.txt. Notable are:

- IndexReader.open(Directory) now opens in read-only mode by default (this
method was deprecated in 2.9 because of that). The same applies to
IndexSearcher.

- Already started in 2.9, core TokenStreams have now been made final to
enforce the decorator pattern.

- If you interrupt an IndexWriter merge thread, IndexWriter now throws an
unchecked ThreadInterruptedException that extends RuntimeException and
clears the interrupt status.
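For example, a consumer might handle this as follows (a sketch; the package org.apache.lucene.util for ThreadInterruptedException is assumed):

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.util.ThreadInterruptedException;

    final class InterruptExample {
      static void optimizeSafely(IndexWriter writer) throws java.io.IOException {
        try {
          writer.optimize(); // may be interrupted from another thread
        } catch (ThreadInterruptedException e) {
          // the interrupt status was cleared; restore it for callers
          Thread.currentThread().interrupt();
        }
      }
    }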

--



Thanks,
Uwe


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]






[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781136#action_12781136
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. (i will assign this to him, I know he is itching to write that nasty 
algorithm
Ha - too much wine last night to laugh that hard this morning. Painful.

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2089:


Description: 
Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
itching to write that nasty algorithm)

we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
* up front, calculate the maximum required K edits needed to match the users 
supplied float threshold.
* for at least common K (1,2,3, etc) we should use automatontermenum. if its 
outside of that, maybe use the existing slow logic. At high N, it will seek too 
much to be helpful anyway.

i modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|   
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

as you can see, this prototype is no good yet, because it creates the DFA in a 
slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA 
conversion.
So, for a very long string, it just gets worse and worse. This has nothing to 
do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), 
there is no problem there.

instead we should just build a DFA to begin with, maybe with this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
we can precompute the tables with that algorithm up to some reasonable K, and 
then I think we are ok.

the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
linear minimization, if someone wants to implement this they should not worry 
about minimization.
in fact, we need to at some point determine if AutomatonQuery should even 
minimize FSM's at all, or if it is simply enough for them to be deterministic 
with no transitions to dead states. (The only code that actually assumes 
minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
summation easily). we need to benchmark really complex DFAs (i.e. write a regex 
benchmark) to figure out if minimization is even helping right now.



  was:
Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
itching to write that nasty algorithm)

we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
* up front, calculate the maximum required N edits needed to match the users 
supplied float threshold.
* for at least common N (1,2,3, etc) we should use automatontermenum. if its 
outside of that, maybe use the existing slow logic. At high N, it will seek too 
much to be helpful anyway.

i modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|   
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

as you can see, this prototype is no good yet, because it creates the DFA in a 
slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA 
conversion.
So, for a very long string, it just gets worse and worse. This has nothing to 
do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), 
there is no problem there.

instead we should just build a DFA to begin with, maybe with this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
we can precompute the tables with that algorithm up to some reasonable N, and 
then I think we are ok.

the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
linear minimization, if someone wants to implement this they should not worry 
about minimization.
in fact, we need to at some point determine if AutomatonQuery should even 
minimize FSM's at all, or if it is simply enough for them to be deterministic 
with no transitions to dead states. (The only code that actually assumes 
minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
summation easily). we need to benchmark really complex DFAs (i.e. write a regex 
benchmark) to figure out if minimization is even helping right now.




Modified the description to be readable: K is the number of edits, N refers to 
the length of the word.

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor

[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2089:


Description: 
Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
itching to write that nasty algorithm)

we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
* up front, calculate the maximum required K edits needed to match the users 
supplied float threshold.
* for at least common K (1,2,3, etc) we should use automatontermenum. if its 
outside of that, maybe use the existing slow logic. At high K, it will seek too 
much to be helpful anyway.

i modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|   
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

as you can see, this prototype is no good yet, because it creates the DFA in a 
slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA 
conversion.
So, for a very long string, it just gets worse and worse. This has nothing to 
do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), 
there is no problem there.

instead we should just build a DFA to begin with, maybe with this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
we can precompute the tables with that algorithm up to some reasonable K, and 
then I think we are ok.

the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
linear minimization, if someone wants to implement this they should not worry 
about minimization.
in fact, we need to at some point determine if AutomatonQuery should even 
minimize FSM's at all, or if it is simply enough for them to be deterministic 
with no transitions to dead states. (The only code that actually assumes 
minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
summation easily). we need to benchmark really complex DFAs (i.e. write a regex 
benchmark) to figure out if minimization is even helping right now.



  was:
Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
itching to write that nasty algorithm)

we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
* up front, calculate the maximum required K edits needed to match the users 
supplied float threshold.
* for at least common K (1,2,3, etc) we should use automatontermenum. if its 
outside of that, maybe use the existing slow logic. At high N, it will seek too 
much to be helpful anyway.

i modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|   
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

as you can see, this prototype is no good yet, because it creates the DFA in a 
slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA 
conversion.
So, for a very long string, it just gets worse and worse. This has nothing to 
do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), 
there is no problem there.

instead we should just build a DFA to begin with, maybe with this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
we can precompute the tables with that algorithm up to some reasonable K, and 
then I think we are ok.

the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
linear minimization, if someone wants to implement this they should not worry 
about minimization.
in fact, we need to at some point determine if AutomatonQuery should even 
minimize FSM's at all, or if it is simply enough for them to be deterministic 
with no transitions to dead states. (The only code that actually assumes 
minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
summation easily). we need to benchmark really complex DFAs (i.e. write a regex 
benchmark) to figure out if minimization is even helping right now.




> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781138#action_12781138
 ] 

Uwe Schindler commented on LUCENE-2089:
---

bq. ha - too much wine last night to laugh that hard this morning. Painful.

I need more beer, after that 3.0.0 problem with the AttributeSource API :-( 
(LUCENE-2088).

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781139#action_12781139
 ] 

Robert Muir commented on LUCENE-2089:
-

By the way, the only open implementation of this algorithm I could find is at 
http://rrette.com/moman.html (ZSpell), in Python.
The download link for 0.2 is not available, and it appears from the website 
that this is the version with the updated algorithm.


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
Was this why I saw strange benchmark results?

On Sun, Nov 22, 2009 at 9:52 AM,  wrote:

> Author: mikemccand
> Date: Sun Nov 22 14:52:02 2009
> New Revision: 883088
>
> URL: http://svn.apache.org/viewvc?rev=883088&view=rev
> Log:
> LUCENE-1458 (on flex branch): small optimization to terms dict cache: don't
> store redundant TermRef
>
> Modified:
>
>  lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
>
>  
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
>
> Modified:
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
> URL:
> http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java?rev=883088&r1=883087&r2=883088&view=diff
>
> ==
> ---
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
> (original)
> +++
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
> Sun Nov 22 14:52:02 2009
> @@ -36,6 +36,8 @@
> copy(text);
>   }
>
> +  // nocommit: we could do this w/ UnicodeUtil w/o requiring
> +  // allocation of new bytes[]?
>   public void copy(String text) {
> try {
>   bytes = text.getBytes("UTF-8");
>
> Modified:
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
> URL:
> http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java?rev=883088&r1=883087&r2=883088&view=diff
>
> ==
> ---
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
> (original)
> +++
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
> Sun Nov 22 14:52:02 2009
> @@ -304,6 +304,7 @@
>   public SeekStatus seek(TermRef term) throws IOException {
> ReuseLRUCache cache = null;
> CacheEntry entry = null;
> +TermRef entryKey = null;
>
> if (docs.canCaptureState()) {
>   final ThreadResources resources = getThreadResources();
> @@ -312,7 +313,7 @@
>   entry = cache.get(term);
>   if (entry != null) {
> docFreq = entry.freq;
> -bytesReader.term.copy(entry.term);
> +bytesReader.term.copy(term);
> docs.setState(entry, docFreq);
> termUpto = entry.termUpTo;
> // nocommit -- would be better to do this lazy?
> @@ -384,16 +385,17 @@
> entry = cache.eldest;
> cache.eldest = null;
> docs.captureState(entry);
> -entry.term.copy(bytesReader.term);
> +entryKey = cache.eldestKey;
> +entryKey.copy(bytesReader.term);
>   } else {
> entry = docs.captureState(null);
> -entry.term = (TermRef) bytesReader.term.clone();
> +entryKey = (TermRef) bytesReader.term.clone();
>   }
>   entry.freq = docFreq;
>   entry.termUpTo = termUpto;
>   entry.filePointer = in.getFilePointer();
>
> -  cache.put(entry.term, entry);
> +  cache.put(entryKey, entry);
> }
> return SeekStatus.FOUND;
>   } else if (cmp > 0) {
> @@ -517,9 +519,8 @@
>
>   // nocommit -- scrutinize API
>   public static class CacheEntry {
> -int termUpTo;
> -TermRef term; // nocommit -- really needed?
> -long filePointer;
> +int termUpTo; // ord for this term
> +long filePointer; // fp into the terms
> dict primary file (_X.tis)
>
> // nocommit -- belongs in Pulsing's CacheEntry class:
> public int freq;
> @@ -563,6 +564,7 @@
> private final static float LOADFACTOR = 0.75f;
> private int cacheSize;
> V eldest;
> +K eldestKey;
>
> /**
>  * Creates a last-recently-used cache with the specified size.
> @@ -580,6 +582,7 @@
>   boolean remove = size() > ReuseLRUCache.this.cacheSize;
>   if (remove) {
> this.eldest = eldest.getValue();
> +this.eldestKey = eldest.getKey();
>   }
>   return remove;
> }
>
>
>


-- 
Robert Muir
[email protected]


[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781140#action_12781140
 ] 

Robert Muir commented on LUCENE-2089:
-

I hope it's obvious from the benchmark why we shouldn't use the crappy 
prototype, and why we should optimize K=1 (which is probably very common).
If someone has a small index with very long terms (bio sequences, or 
something), then fuzzy would actually get slower.


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781142#action_12781142
 ] 

Mark Miller commented on LUCENE-2089:
-

I'll take a look anyway - too bad I can't find my stupid automata textbook; 
I've already bought the stupid thing twice in my life. The Python translation 
sounds easier anyway :)

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781144#action_12781144
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. I'll take a look anyway - too bad I can't find my stupid automata text book 
- already bought the stupid thing twice in my life. Python translation sounds 
easier anyway

You will need more wine. If it starts to catch, I'll be happy to offer help 
with the automaton API, etc.


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781149#action_12781149
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. we can precompute the tables with that algorithm up to some reasonable K, 
and then I think we are ok.

I wonder what the best way to handle this would be; these transition tables 
appear to get very large very fast. It looks like the Python code is now 
calculating them on the fly. You wouldn't want to load them all into RAM for 
users who aren't using them. Perhaps some sort of LRU cache loaded on demand, 
or something similar - though you can't really fully warm up that way.
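A minimal sketch of such a load-on-demand LRU cache (illustrative only, using LinkedHashMap's access order; not actual Lucene code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    final class TableCache<K, V> extends LinkedHashMap<K, V> {
      private final int maxSize;

      TableCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behaviour
        this.maxSize = maxSize;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the least-recently-used table
      }
    }

A missed lookup would compute the transition table on the fly and put it in the cache, so only the tables actually used stay in RAM.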

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor




Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
No, not really... just an optimization I found when hunting ;)

I'm working now on an AutomatonTermsEnum that uses the flex API
directly, to test that performance.

One of the major challenges with flex is the 4-way testing required.
I.e., you can have a non-flex or flex index, and then you can access it
via non-flex or flex API.  All 4 are allowed, and must work (for
back-compat).

I'm most concerned about performance of flex API on top of flex index,
since that's the future, but not hurting performance of the other 3 is
also important.

Mike

On Sun, Nov 22, 2009 at 10:22 AM, Robert Muir  wrote:
> was this why i saw strange benchmark results?
>
> On Sun, Nov 22, 2009 at 9:52 AM,  wrote:
>>
>> Author: mikemccand
>> Date: Sun Nov 22 14:52:02 2009
>> New Revision: 883088
>>
>> URL: http://svn.apache.org/viewvc?rev=883088&view=rev
>> Log:
>> LUCENE-1458 (on flex branch): small optimization to terms dict cache:
>> don't store redundant TermRef
>>

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781151#action_12781151
 ] 

Robert Muir commented on LUCENE-2089:
-

Mark, they would get large fast, but I think we only need, say, K=1,2,3.
Once you go to high K, it will be many disk seeks anyway, and 'linear' mode, 
like it does now, will probably be faster.


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781154#action_12781154
 ] 

Robert Muir commented on LUCENE-2089:
-

Another twist is that we have to support the 'constant prefix' as well.
The easiest way would be that, if the user asks for this, we intersect the 
automaton with a 'prefix automaton' such as abc.*

But BasicOperations.intersection is quadratic in the # of states, because it 
supports both NFAs and DFAs. We might need to implement a better DFA-only 
intersection algorithm for this constant prefix, so that supplying a constant 
prefix doesn't actually make things slower :)
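A minimal sketch of that intersection idea against the brics API; the "fuzzy"
pattern here is just a stand-in regexp, not a real Levenshtein automaton:

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.BasicOperations;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class PrefixIntersection {
  public static void main(String[] args) {
    Automaton fuzzy = new RegExp("d.g").toAutomaton();    // stand-in for the fuzzy DFA
    Automaton prefix = new RegExp("do.*").toAutomaton();  // constant prefix "do"
    Automaton both = BasicOperations.intersection(fuzzy, prefix);
    both.determinize();  // defensive; the product of two DFAs is already deterministic
    RunAutomaton matcher = new RunAutomaton(both);
    System.out.println(matcher.run("dog"));  // true: matches d.g and starts with "do"
    System.out.println(matcher.run("dig"));  // false: fails the prefix
  }
}
{code}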





Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
On Sun, Nov 22, 2009 at 11:23 AM, Michael McCandless <
[email protected]> wrote:

> No, not really... just an optimization I found when hunting ;)
>
> I'm working now on an AutomatonTermsEnum that uses the flex API
> directly, to test that performance.
>
>
I didn't mean to 'bail out' on this but I could not tell if TermsEnum was
close to stabilized, and it might be significant work to convert it?
Maybe benching numeric range would be easier and accomplish the same thing?

-- 
Robert Muir
[email protected]


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781155#action_12781155
 ] 

Michael McCandless commented on LUCENE-1606:


{quote}
bq. Are we going to deprecate contrib/regex with this?

I would argue against that, only because the other regex impl's have different 
features and syntax, even if they are slow.
{quote}

Ahh OK, I agree then.  I didn't realize this new query doesn't subsume 
contrib's.  Would be good to call out what's different in the javadocs 
somewhere...

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as URLs where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
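A toy rendering of that enumerate-and-seek loop, with a TreeSet standing in for
the sorted term dictionary. The seek-string construction is deliberately crude,
just enough to show the two steps, and is not the patch's actual logic:

{code}
import java.util.TreeSet;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class EnumSeekToy {
  public static void main(String[] args) {
    TreeSet<String> terms = new TreeSet<String>();  // stand-in for the term dict
    for (String t : new String[] {"aaa", "abc", "abd", "azz", "bcd", "xbc"})
      terms.add(t);
    RunAutomaton dfa = new RunAutomaton(new RegExp("a.c").toAutomaton());

    String seek = "";
    while (true) {
      String term = terms.ceiling(seek);  // "seek": the smallest term >= seek
      if (term == null) break;
      if (dfa.run(term)) {
        System.out.println("hit: " + term);
        seek = term + '\0';               // accepted: continue with the next term
        continue;
      }
      // step 1: find the portion that is OK (walk until we would hit a reject state)
      int state = dfa.getInitialState(), ok = 0;
      while (ok < term.length()) {
        int next = dfa.step(state, term.charAt(ok));
        if (next == -1) break;
        state = next;
        ok++;
      }
      if (ok == term.length()) { seek = term + '\0'; continue; }  // valid prefix
      // step 2: generate the next possible string and seek to that
      seek = term.substring(0, ok) + (char) (term.charAt(ok) + 1);
    }
  }
}
{code}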




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781156#action_12781156
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Would be good to call out what's different in the javadocs somewhere...

good idea, let me know if you have some suggested wording.
It's really a general issue: the supported features and syntax of even the 
existing regex implementations in contrib are different, I think?
(i.e. they are not compatible: you cannot just swap impls around without 
testing that the new impl supports the syntax and features you are using)





[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781157#action_12781157
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. Mark, they would get large fast, but i think we only need say 1,2,3.

Ah, right - you mention that above in the summary.

bq. Another twist, is that we have to support the 'constant prefix' as well.

Generally, if you have any kind of length to the prefix, linear wouldn't be so 
bad either though right? Guess it depends on your terms though.




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781160#action_12781160
 ] 

Michael McCandless commented on LUCENE-1606:


I don't have any wording -- I don't really know the differences :)

If it's "only" that the syntax is different, that's one thing... but if eg 
certain functionality isn't possible w/ new or old, that's another.




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781161#action_12781161
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. Generally, if you have any kind of length to the prefix, linear wouldn't be 
so bad either though right? Guess it depends on your terms though.

not so bad, but not so good either. see my wildcard results, for example (on 
the 10M equally distributed terms)
||Pattern||Iter||AvgHits||AvgMS (old)||AvgMS (new)||
|N?N?N?N|10|1000.0|288.6|38.5|




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781162#action_12781162
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. If it's "only" that the syntax is different, that's one thing... but if eg 
certain functionality isn't possible w/ new or old, that's another.

from a glance, it appears to me that both the syntax and functionality of our 
two contrib impls (java.util and jakarta) are very different.





[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781162#action_12781162
 ] 

Robert Muir edited comment on LUCENE-1606 at 11/22/09 5:06 PM:
---

bq. If it's "only" that the syntax is different, that's one thing... but if eg 
certain functionality isn't possible w/ new or old, that's another.

from a glance, it appears to me that both the syntax and functionality of our 
two contrib impls (java.util and jakarta) are very different.

Here is one example: java.util supports reluctant {m,n} closures; Jakarta does 
not, and says so right in its javadocs:
http://jakarta.apache.org/regexp/apidocs/org/apache/regexp/RE.html

"Should RE support reluctant {m,n} closures (does anyone care)?"
But it supports reluctant versus greedy for other operators.

in automaton, the concept of reluctant versus greedy does not even exist, as 
spelled out in their FAQ:
"The * operator is mathematically the Kleene star operator (i.e. we don't have 
greedy/reluctant/possessive variants)."
http://www.brics.dk/automaton/faq.html

this is an example where all 3 are different... I guess I kind of assumed 
everyone was aware that all these regex packages are very different.
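A concrete taste of the greedy/reluctant difference in java.util.regex (brics,
per the FAQ above, simply has no such distinction: its automata only decide
whole-string membership):

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlavorDemo {
  public static void main(String[] args) {
    Matcher greedy = Pattern.compile("a{1,3}").matcher("aaa");
    Matcher reluctant = Pattern.compile("a{1,3}?").matcher("aaa");
    greedy.find();
    reluctant.find();
    System.out.println(greedy.group());    // "aaa" - takes as much as it can
    System.out.println(reluctant.group()); // "a"   - takes as little as it can
  }
}
{code}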

  was (Author: rcmuir):
bq. If it's "only" that the syntax is different, that's one thing... but if 
eg certain functionality isn't possible w/ new or old, that's another.

from a glance, it appears to me that both the syntax and functionality of our 
two contrib impls (java.util and jakarta) are very different.

  



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781167#action_12781167
 ] 

Mark Miller commented on LUCENE-2089:
-

Right, I wouldn't expect it to be great with a single-char prefix - especially 
with the random terms you're making. But I think that's a much worse case than a 
slightly longer prefix with a normal term distribution.




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781168#action_12781168
 ] 

Robert Muir commented on LUCENE-1606:
-

we call this out nicely in the current RegexQuery:
"The expressions supported depend on the regular expression implementation used 
by way of the RegexCapabilities interface."

what should I say for the automaton implementation? it already has a javadoc 
link to the precise syntax supported,
so in my opinion it's actually less ambiguous than contrib RegexQuery.

but maybe improve this, instead of
{code}
The supported syntax is documented in the {@link RegExp} class.
{code}

maybe:
{code}
The supported syntax is documented in the {@link RegExp} class.
warning: this might not be the syntax you are used to!
{code}





[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170
 ] 

Robert Muir commented on LUCENE-2089:
-

Mark, maybe, though it also depends largely on the size of the "alphabet" in the 
term dictionary.

for my test, the alphabet is pretty small, [0-9]. With smaller alphabets, 
say [GACD], the automaton enumeration becomes even more effective over any 
"prefix" mechanism.

so the performance is language-dependent.





[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170
 ] 

Robert Muir edited comment on LUCENE-2089 at 11/22/09 5:44 PM:
---

Mark, maybe, though it also depends largely on the size of the "alphabet" in the 
term dictionary.

for my test, the alphabet is pretty small, [0-9]. With smaller alphabets, 
say [GACD], the automaton enumeration becomes even more effective over any 
"prefix" mechanism.

so the performance is language-dependent.

also Mark, you said you want to "scale" fuzzy query, right? Taking a linear 
algorithm and dividing it by a constant (alphabet size), which is what constant 
prefix does, does not improve the scalability. For English, it just makes your 
fuzzy query 26 times faster. So IMO it would be better to improve the real 
scalability... the constant prefix is just an optimization/hack and does not 
really solve the problem
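A back-of-the-envelope rendering of that point, assuming (purely for
illustration) uniformly distributed terms:

{code}
public class PrefixScaling {
  public static void main(String[] args) {
    long n = 10000000L;  // terms in the dictionary
    int alphabet = 26;   // English letters
    int prefixLen = 1;   // constant prefix of one character
    // a length-p prefix splits the dictionary into alphabet^p buckets, so a
    // linear scan still visits about n / alphabet^p terms: O(n), smaller constant
    long visited = (long) (n / Math.pow(alphabet, prefixLen));
    System.out.println(visited + " terms still scanned");  // ~384615
  }
}
{code}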


  was (Author: rcmuir):
Mark, maybe, though it also depends largely on the size of the "alphabet" in 
the term dictionary.

for my test, the alphabet is pretty small, [0-9]. With smaller alphabets, 
say [GACD], the automaton enumeration becomes even more effective over any 
"prefix" mechanism.

so the performance is language-dependent.

  



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781173#action_12781173
 ] 

Mark Miller commented on LUCENE-2089:
-

bq.  the constant prefix is just an optimization/hack

Right - which is part of why I am not so worried about handling it with a new 
impl - it's really only there now because the thing is so slow without it - if 
we really needed to support it, it wouldn't be so bad to fall back to it as it is. 
Though if we can support it with the new impl, fine - but I don't think I would 
go out of the way for it myself.

bq. for English, it just makes your fuzzy query 26 times faster

With a prefix of 1 again? Yeah - you really need to use a longer prefix to get 
much benefit - but I wouldn't think we care about a prefix of 1 with the new 
impl - just don't use a prefix. It's just a hack to somewhat get around the 
current scalability issues.





[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to 
get much benefit - but I wouldn't think we care about a prefix of 1 with the new 
impl - just don't use a prefix. It's just a hack to somewhat get around the 
current scalability issues.

you write that crazy algorithm from the paper, and I will do the easy part 
(fast DFA intersection) so we can use the constant prefix.
we must "use" the prefix, so the results are the same as the old FuzzyQuery for 
back compat; might as well do it right.





[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174
 ] 

Robert Muir edited comment on LUCENE-2089 at 11/22/09 6:01 PM:
---

bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to 
get much benefit - but I wouldn't think we care about a prefix of 1 with the new 
impl - just don't use a prefix. It's just a hack to somewhat get around the 
current scalability issues.

you write that crazy algorithm from the paper, and I will do the easy part 
(fast DFA intersection) so we can use the constant prefix.
we must "use" the prefix, so the results are the same as the old FuzzyQuery for 
back compat; might as well do it right.

btw, I am implying here that we should never do a real Levenshtein dynamic 
programming calculation if we do not have to.
for example, in my k=1 prototype, it is never used:
{code}
public float difference() {
  if (currentTerm == null || currentTerm.equals(term))
    return 1.0F;
  else
    return 1.0F - (1.0F / (Math.max(currentTerm.text().length(),
                                    term.text().length())));
}
{code}

instead, if up to k=3 is required, we should do these steps:
* equals(), then 1.0F
* matches k=1 automaton?
* matches k=2 automaton?
* matches k=3 automaton?

even for k=3 I am sure this is likely faster on average than doing Levenshtein, 
especially if the query matches many terms.
Automaton matching is linear time, and just array access.
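A sketch of that tiered check, assuming the per-K Levenshtein RunAutomatons have
already been built; the class is illustrative, and the 1 - k/maxLen score just
generalizes the k=1 prototype above:

{code}
import dk.brics.automaton.RunAutomaton;

public class TieredFuzzyMatcher {
  private final String queryTerm;
  private final RunAutomaton[] matchers;  // matchers[i] accepts terms within i+1 edits

  public TieredFuzzyMatcher(String queryTerm, RunAutomaton[] matchers) {
    this.queryTerm = queryTerm;
    this.matchers = matchers;
  }

  // scores a candidate without ever running dynamic-programming Levenshtein
  public float difference(String candidate) {
    if (queryTerm.equals(candidate))
      return 1.0F;                        // step 1: exact match
    for (int k = 1; k <= matchers.length; k++) {
      // step k+1: DFA membership test - linear time, just array access
      if (matchers[k - 1].run(candidate))
        return 1.0F - ((float) k / Math.max(queryTerm.length(), candidate.length()));
    }
    return 0.0F;  // more than K edits away; caller falls back to the slow path
  }
}
{code}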
 

  was (Author: rcmuir):
bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix 
to get much benefit - but I wouldn't think we care about a prefix of 1 with the 
new impl - just don't use a prefix. It's just a hack to somewhat get around the 
current scalability issues.

you write that crazy algorithm from the paper, and I will do the easy part 
(fast DFA intersection) so we can use the constant prefix.
we must "use" the prefix, so the results are the same as old FuzzyQuery for 
back compat, might as well do it right.

  



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for 
back compat

I'm not so worried about that myself. The prefix on the old Fuzzy is just to 
get around scalability - we could just deprecate and create a new Fuzzy without 
it. Why carry along the hack when we fix the root problem?




Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir  wrote:

>> No, not really... just an optimization I found when hunting ;)
>>
>> I'm working now on an AutomatonTermsEnum that uses the flex API
>> directly, to test that performance.
>>
>
> I didn't mean to 'bail out' on this

You didn't 'bail out'; I 'bailed in' ;)  This is the joy of open
source... great big noisy Bazaar.

> but I could not tell if TermsEnum was close to stabilized

I think it's close; we need to do this port anyway, once automaton is
committed to trunk, so really I saved Mark some work ;)

> and it might be significant work to convert it?

It wasn't too bad, but maybe you can look it over once I post patch
and see if I messed anything up :)

> Maybe benching numeric range would be easier and accomplish the same thing?

Yeah benching NRQ would be good too... many benchmarks still to run.

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781178#action_12781178
 ] 

Michael McCandless commented on LUCENE-1606:


OK that warning seems good.  Maybe also reference contrib/regex, as another 
alternative, noting that syntax/capabilities are different?

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175
 ] 

Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:06 PM:
---

bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for 
back compat

I'm not so worried about that myself. The prefix on the old Fuzzy is just to 
get around scalability - we could just deprecate and create a new Fuzzy without 
it. Why carry along the hack when we fix the root problem?

*edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've 
even seen an email where Doug talked about just dropping it. He didn't like the 
non scalable queries. Even if this new impl didn't exactly match the results of 
the old, I'd have no problem just deprecating the old and saying we are 
starting over with a "real" fuzzyquery - the old one goes away on the next 
major - anyone dying to stick with it can just pull the query into their own 
code.

I think if we keep the prefix, it should be for a good reason in itself, rather 
than just back compat with the old crappy version.

  was (Author: [email protected]):
bq. we must "use" the prefix, so the results are the same as old FuzzyQuery 
for back compat

I'm not so worried about that myself. The prefix on the old Fuzzy is just to 
get around scalability - we could just deprecate and create a new Fuzzy without 
it. Why carry along the hack when we fix the root problem?
  
> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781179#action_12781179
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. Basically, what I'm saying is the old FuzzyQuery is just garbage really - 
I've even seen an email were Doug talked about just dropping it.

I agree with garbage, but we cannot completely replace it anyway. for example 
what if someone supplies a term of length 54 and asks for distance of 0.5?
we should not use this algorithm for nonsense like that, in that case I think 
they should just use the garbage algorithm.

Here is a quote from that moman page:
It means that in theory, you could ask for a Levenshtein distance of 27! Well, 
if you have a week ahead of you... 

we shouldn't burn cycles creating useless tables that will be huge arrays either 
in fuzzyquery, or whatever. we can't compute all the way up to infinity, this 
is why i think something like 1,2,3 is "reasonable" , if it requires more edits 
than that, go with the old brute force algorithm?
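
For concreteness, here is the arithmetic behind that length-54 example (just a 
sketch, assuming FuzzyQuery's usual definition of 
similarity = 1 - editDistance/termLength; the class and method names are 
illustrative, not a proposed API):

{code}
public class FuzzyEditBudget {
  /** Maximum edit count K that can still satisfy the similarity threshold. */
  static int maxRequiredEdits(int termLength, float minSimilarity) {
    return (int) Math.floor(termLength * (1.0f - minSimilarity));
  }

  public static void main(String[] args) {
    // length 54, threshold 0.5 -> up to 27 edits, i.e. the "Levenshtein
    // distance of 27" from the moman quote: far too big to precompute
    // tables for, so this case should fall back to the brute force enum.
    System.out.println(maxRequiredEdits(54, 0.5f)); // prints 27
  }
}
{code}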

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
Mike, I guess what I am implying is should i even bother with lucene-1606
and trunk?

or instead, should i be helping you, looking at TermsEnum, and working on
integrating it into flex?

On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless <
[email protected]> wrote:

> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir  wrote:
>
> >> No, not really... just an optimization I found when hunting ;)
> >>
> >> I'm working now on an AutomatonTermsEnum that uses the flex API
> >> directly, to test that performance.
> >>
> >
> > I didn't mean to 'bail out' on this
>
> You didn't 'bail out'; I 'bailed in' ;)  This is the joy of open
> source... great big noisy Bazaar.
>
> > but I could not tell if TermsEnum was close to stabilized
>
> I think it's close; we need to do this port anyway, once automaton is
> committed to trunk, so really I saved Mark some work ;)
>
> > and it might be significant work to convert it?
>
> It wasn't too bad, but maybe you can look it over once I post patch
> and see if I messed anything up :)
>
> > Maybe benching numeric range would be easier and accomplish the same
> thing?
>
> Yeah benching NRQ would be good too... many benchmarks still to run.
>
> Mike
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]


[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175
 ] 

Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:12 PM:
---

bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for 
back compat

I'm not so worried about that myself. The prefix on the old Fuzzy is just to 
get around scalability - we could just deprecate and create a new Fuzzy without 
it. Why carry along the hack when we fix the root problem?

*edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've 
even seen an email where Doug talked about just dropping it. He didn't like the 
non scalable queries. Even if this new impl didn't exactly match the results of 
the old, I'd have no problem just deprecating the old and saying we are 
starting over with a "real" fuzzyquery - the old one goes away on the next 
major - anyone dying to stick with it can just pull the query into their own 
code.

I think if we keep the prefix, it should be for a good reason in itself, rather 
than just back compat with the old crappy version.

*edit*

Not that I can't see people wanting just the tail end to be fuzzy - in that 
case, fine - do your dirty work :) I just don't know how much of a real req 
that is compared to a prefix query. Perhaps its more useful than I'd guess 
myself. I think we are quite a ways from getting this alg implemented anyway :) 
Though I am trying to track down the latest python code.

  was (Author: [email protected]):
bq. we must "use" the prefix, so the results are the same as old FuzzyQuery 
for back compat

I'm not so worried about that myself. The prefix on the old Fuzzy is just to 
get around scalability - we could just deprecate and create a new Fuzzy without 
it. Why carry along the hack when we fix the root problem?

*edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've 
even seen an email where Doug talked about just dropping it. He didn't like the 
non scalable queries. Even if this new impl didn't exactly match the results of 
the old, I'd have no problem just deprecating the old and saying we are 
starting over with a "real" fuzzyquery - the old one goes away on the next 
major - anyone dying to stick with it can just pull the query into their own 
code.

I think if we keep the prefix, it should be for a good reason in itself, rather 
than just back compat with the old crappy version.
  
> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781180#action_12781180
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. if it requires more edits than that, go with the old brute force algorithm?

Perhaps - but it should be an option. We don't want perf to just drop off a 
cliff (and that's an understatement on a large index) just because the input 
crossed some line. I'd almost prefer that the input just doesn't work - in my 
mind FuzzyQuery just doesn't work on large indices. But far be it from me to 
take away the option ...

I wouldn't really like it if it was automatic by default - n=3 takes a few 
milliseconds, then n=4 takes an hour. Whoops.

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1606:
---

Attachment: LUCENE-1606-flex.patch

First cut @ cutting over to flex API attached -- note that this
applies to the flex branch, not trunk!

I made some small changes to the benchmarker: use constant score
filter mode, and print the min (not avg) time (less noise).
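
For reference, "constant score filter mode" presumably means the 2.9 
MultiTermQuery rewrite knob; the actual change is in the attached benchmarker, 
so this is just a sketch:

{code}
// force the filter-based constant score rewrite instead of BooleanQuery rewrite
WildcardQuery q = new WildcardQuery(new Term("body", "N?N?N*"));
q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
{code}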

Also, I ported the AutomatonTermEnum to the flex API, so this is now a
better measure ("flex on flex") of what future perf will be.  It's
possible there's a bug here, though TestWildcard passes.

I still need to investigate why "non-flex on non-flex" and "non-flex
on flex" perform worse.

I ran like this:

  java -server -Xmx1g -Xms1g BenchWildcard

java is 1.6.0_14 64 bit, on OpenSolaris.

Results (msec is min of 10 runs each):

||Pattern||Trunk (min msec)||Flex (min msec)||
|N?N?N?N?|13|18|
|?NN|1|3|
|??N|4|6|
|???|23|28|
|NNN|210|170|
|NN??NNN|3|3|
|NN?N*|7|4|
|?NN*|62|30|
|*N|4332|2576|
|N??|1|1|

Looks like flex API is faster for the slow queries.  Once I fix caching on 
trunk we should retest...


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781182#action_12781182
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. I wouldnt really like it if it was default automatic - n=3 takes a few 
milliseconds, then n=4 takes an hour. Whoops.

you are completely right, it should not drop off a cliff. though i think as n 
increases, it will get significantly worse, because of so much seeking.
we find the nice n where this is almost as bad as brute force, and cut her off 
there?

this is basically what i did for the enum already, if you have a wildcard with 
leading * or a regex with .*, its faster to linear scan than to seek around.
Uwe and I are referring to these as "dumb" mode versus "smart" mode. I'm gonna add 
a constructor to the enum, so this can be specified rather than "auto" as well.
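
A hypothetical sketch of that knob (the names and the heuristic here are 
illustrative, not necessarily what the patch will use):

{code}
public class EnumModeSketch {
  enum EnumMode { DUMB /* linear scan */, SMART /* DFA-directed seeking */ }

  // AUTO behavior: a pattern that matches everything up front (leading *,
  // or a regex starting with .*) would make the enum seek constantly, so
  // a straight linear scan over the terms is cheaper.
  static EnumMode pickMode(String pattern) {
    return (pattern.startsWith("*") || pattern.startsWith(".*"))
        ? EnumMode.DUMB : EnumMode.SMART;
  }
}
{code}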


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781183#action_12781183
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Looks like flex API is faster for the slow queries. Once I fix caching on 
trunk we should retest...

Mike, this is cool. I like the results. It appears, tentatively, that the flex 
api is faster for both "dumb" (brute force linear reading) and "smart" (lots of 
seeking) modes, at least looking at NNN and *N, which are the worst cases of 
each here. So it would seem it's faster in every case.

I'll look at what you did to port this to the TermsEnum api!


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
I think you should keep doing all LUCENE-1606 work (and, any other
issues) on trunk, and then we merge down to flex branch once it's
committed?

We shouldn't hold up any trunk features because flex is
coming... merging down every so often seems manageable so far (Mark?).

I'm hoping to finish flex soonish -- largely what remains (I think!)
is better testing (correctness & performance) of the 4-way
combinations.  I think the codecs approach is generally working
well.. the fact that we have initial Pulsing & PforDelta codecs
working is great.

Mike

On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir  wrote:
> Mike, I guess what I am implying is should i even bother with lucene-1606
> and trunk?
>
> or instead, should i be helping you, looking at TermsEnum, and working on
> integrating it into flex?
>
> On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless
>  wrote:
>>
>> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir  wrote:
>>
>> >> No, not really... just an optimization I found when hunting ;)
>> >>
>> >> I'm working now on an AutomatonTermsEnum that uses the flex API
>> >> directly, to test that performance.
>> >>
>> >
>> > I didn't mean to 'bail out' on this
>>
>> You didn't 'bail out'; I 'bailed in' ;)  This is the joy of open
>> source... great big noisy Bazaar.
>>
>> > but I could not tell if TermsEnum was close to stabilized
>>
>> I think it's close; we need to do this port anyway, once automaton is
>> committed to trunk, so really I saved Mark some work ;)
>>
>> > and it might be significant work to convert it?
>>
>> It wasn't too bad, but maybe you can look it over once I post patch
>> and see if I messed anything up :)
>>
>> > Maybe benching numeric range would be easier and accomplish the same
>> > thing?
>>
>> Yeah benching NRQ would be good too... many benchmarks still to run.
>>
>> Mike
>>
>> -
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
>
> --
> Robert Muir
> [email protected]
>

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781185#action_12781185
 ] 

Robert Muir commented on LUCENE-1606:
-

Mike, I think your port to TermsEnum is correct, and its definitely faster here.

One question, is it possible to speed this up further, by using 
UnicodeUtil/char[] conversion from TermRef instead of String?
Because it's trivial to use char[] with the Automaton api (even though that is 
not exposed, it's no problem).

I use only string because of the old TermEnum api.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781188#action_12781188
 ] 

Michael McCandless commented on LUCENE-1606:


{quote}
One question, is it possible to speed this up further, by using 
UnicodeUtil/char[] conversion from TermRef instead of String?
Because its trivial to use char[] with the Automaton api (even tho that is not 
exposed, its no problem)

I use only string because of the old TermEnum api.
{quote}

Oh, that'd be great!  It would be faster.  I was bummed at how many new 
String()'s I was doing...

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
ok, I only ask because some rework of this enum could be necessary to take
advantage of the new api.

examples include changing it to use char[] (easy) to prevent lots of string
creation, which was unavoidable with TermEnum since it is based on string.

i will never mention this again, but it could also run on byte[] pretty
easily.
However I think high-level processing like this should use utf-16
processing, as java intended, although I'm pretty positive it would be
extremely fast.

On Sun, Nov 22, 2009 at 1:33 PM, Michael McCandless <
[email protected]> wrote:

> I think you should keep doing all LUCENE-1606 work (and, any other
> issues) on trunk, and then we merge down to flex branch once it's
> committed?
>
> We shouldn't hold up any trunk features because flex is
> coming... merging down every so often seems manageable so far (Mark?).
>
> I'm hoping to finish flex soonish -- largely what remains (I think!)
> is better testing (correctness & performance) of the 4-way
> combinations.  I think the codecs approach is generally working
> well.. the fact that we have initial Pulsing & PforDelta codecs
> working is great.
>
> Mike
>
> On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir  wrote:
> > Mike, I guess what I am implying is should i even bother with lucene-1606
> > and trunk?
> >
> > or instead, should i be helping you, looking at TermsEnum, and working on
> > integrating it into flex?
> >
> > On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless
> >  wrote:
> >>
> >> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir  wrote:
> >>
> >> >> No, not really... just an optimization I found when hunting ;)
> >> >>
> >> >> I'm working now on an AutomatonTermsEnum that uses the flex API
> >> >> directly, to test that performance.
> >> >>
> >> >
> >> > I didn't mean to 'bail out' on this
> >>
> >> You didn't 'bail out'; I 'bailed in' ;)  This is the joy of open
> >> source... great big noisy Bazaar.
> >>
> >> > but I could not tell if TermsEnum was close to stabilized
> >>
> >> I think it's close; we need to do this port anyway, once automaton is
> >> committed to trunk, so really I saved Mark some work ;)
> >>
> >> > and it might be significant work to convert it?
> >>
> >> It wasn't too bad, but maybe you can look it over once I post patch
> >> and see if I messed anything up :)
> >>
> >> > Maybe benching numeric range would be easier and accomplish the same
> >> > thing?
> >>
> >> Yeah benching NRQ would be good too... many benchmarks still to run.
> >>
> >> Mike
> >>
> >> -
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
> >
> >
> > --
> > Robert Muir
> > [email protected]
> >
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781190#action_12781190
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Oh, that'd be great! It would be faster. I was bummed at how many new 
String()'s I was doing...

it would be nice I think if TermRef provided a helper method to make the char[] 
available? 
i.e. I don't think i should do unicode conversion in a multitermquery?
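
Something along these lines, say (the field names bytes/offset/length and the 
UnicodeUtil calls follow the existing 2.9 utility class; the real flex-branch 
TermRef may differ):

{code}
// hypothetical helper on TermRef
public char[] toChars() {
  UnicodeUtil.UTF16Result utf16 = new UnicodeUtil.UTF16Result();
  UnicodeUtil.UTF8toUTF16(bytes, offset, length, utf16);
  char[] chars = new char[utf16.length];
  System.arraycopy(utf16.result, 0, chars, 0, utf16.length);
  return chars; // ready for Automaton stepping, no new String() needed
}
{code}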


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity

2009-11-22 Thread Johan Kindgren (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Kindgren updated LUCENE-1260:
---

Attachment: Lucene-1260-2.patch

I've added the old static methods again, but made them deprecated.

In contrib/misc there is still a reference to the static encodeNorm method, 
maybe that should be replaced with 
Similarity.getDefaultSimilarity().encodeNormValue(f)? This call to the static 
method is only done if no similarity is passed to the FieldNormModifier.

I added a short javadoc description to the static methods, not sure if that is 
enough? (I guess they will be removed, so the relevant javadoc is probably in 
the instance methods?)
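
For clarity, the deprecation pattern in the patch is roughly this (a sketch, 
not the exact patch contents; the default encodeNormValue body follows the 
existing 8-bit SmallFloat codec):

{code}
// inside Similarity
/** @deprecated Use {@link #encodeNormValue(float)} instead. */
@Deprecated
public static byte encodeNorm(float f) {
  return getDefaultSimilarity().encodeNormValue(f);
}

/** Instance-level codec that subclasses can now override. */
public byte encodeNormValue(float f) {
  return SmallFloat.floatToByte315(f);
}
{code}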

> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
>Reporter: Karl Wettin
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
> Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt
>
>
> The static span and resolution of the 8 bit norms codec might not fit with 
> all applications. 
> My use case requires that 100f-250f is discretized in 60 bags instead of the 
> default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Mark Miller
bq. merging down every so often seems manageable so far (Mark?).

Yeah, this has been working great from my perspective.

Michael McCandless wrote:
> I think you should keep doing all LUCENE-1606 work (and, any other
> issues) on trunk, and then we merge down to flex branch once it's
> committed?
>
> We shouldn't hold up any trunk features because flex is
> coming... 
>
> I'm hoping to finish flex soonish -- largely what remains (I think!)
> is better testing (correctness & performance) of the 4-way
> combinations.  I think the codecs approach is generally working
> well.. the fact that we have initial Pulsing & PforDelta codecs
> working is great.
>
> Mike
>
> On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir  wrote:
>   
>> Mike, I guess what I am implying is should i even bother with lucene-1606
>> and trunk?
>>
>> or instead, should i be helping you, looking at TermsEnum, and working on
>> integrating it into flex?
>>
>> On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless
>>  wrote:
>> 
>>> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir  wrote:
>>>
>>>   
> No, not really... just an optimization I found when hunting ;)
>
> I'm working now on an AutomatonTermsEnum that uses the flex API
> directly, to test that performance.
>
>   
 I didn't mean to 'bail out' on this
 
>>> You didn't 'bail out'; I 'bailed in' ;)  This is the joy of open
>>> source... great big noisy Bazaar.
>>>
>>>   
 but I could not tell if TermsEnum was close to stabilized
 
>>> I think it's close; we need to do this port anyway, once automaton is
>>> committed to trunk, so really I saved Mark some work ;)
>>>
>>>   
 and it might be significant work to convert it?
 
>>> It wasn't too bad, but maybe you can look it over once I post patch
>>> and see if I messed anything up :)
>>>
>>>   
 Maybe benching numeric range would be easier and accomplish the same
 thing?
 
>>> Yeah benching NRQ would be good too... many benchmarks still to run.
>>>
>>> Mike
>>>
>>> -
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>   
>>
>> --
>> Robert Muir
>> [email protected]
>>
>> 
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>   


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781204#action_12781204
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. we find the nice n where this is almost as bad as brute force, and cut her 
off there?

yeah, that sounds good if there is a nice transition at a relatively lower n.

bq. this is basically what i did for the enum already, if you have a wildcard 
with leading * or a regex with .*, its faster to linear scan than to seek 
around.

Yeah, and I think this makes sense - but you will notice both the Lucene qp and 
Solr don't allow leading wildcard by default - the perf is generally bad enough 
on a large index that we assume you don't want to allow it and force users to 
"turn on" the option.

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781205#action_12781205
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. but you will notice both the Lucene qp and Solr don't allow leading 
wildcard by default

if we add this automaton stuff, I think we should reconsider this rule: 
instead of disallowing both leading * and ? by default, just disallow a 
leading * by default.

i won't really argue for it though, because a wildcard query of 
"?abc" is probably just as bad as "*abc"

> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users 
> supplied float threshold.
> * for at least common K (1,2,3, etc) we should use automatontermenum. if its 
> outside of that, maybe use the existing slow logic. At high K, it will seek 
> too much to be helpful anyway.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN (seven Ns), etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way. right now it creates an NFA, and all this wasted time is in 
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to 
> do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> we can precompute the tables with that algorithm up to some reasonable K, and 
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
> linear minimization, if someone wants to implement this they should not worry 
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even 
> minimize FSM's at all, or if it is simply enough for them to be deterministic 
> with no transitions to dead states. (The only code that actually assumes 
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
> summation easily). we need to benchmark really complex DFAs (i.e. write a 
> regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781206#action_12781206
 ] 

Mark Miller commented on LUCENE-2089:
-

I think it makes sense to allow leading ? - ?abc is prob rare enough 
that it's worth it to allow by default. Or perhaps a separate knob to turn it on 
and leave leading * off.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781207#action_12781207
 ] 

Robert Muir commented on LUCENE-2089:
-

mark, you are right.

plus, the qp does not throw exceptions if you have a fuzzy query with no 
constant prefix, which is actually going to be worse than even a leading *!


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781208#action_12781208
 ] 

Mark Miller commented on LUCENE-2089:
-

solr doesn't even allow for a constant prefix with fuzzy - it's broken, broken, 
broken :) DisMax to the rescue.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2072) Upgrade contrib/regex to jakarta-regex 1.5

2009-11-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781209#action_12781209
 ] 

Simon Willnauer commented on LUCENE-2072:
-

I just added a testcase to check if the prefix method does what it is supposed 
to do. Well, it doesn't! The 1.4 version works fine with prefixes; after upgrading 
to 1.5 the prefix is always null. I wrote to the jakarta-regexp list but I do 
not expect to get any reply - jakarta-regexp seems to be dead.

simon

> Upgrade contrib/regex to jakarta-regex 1.5 
> ---
>
> Key: LUCENE-2072
> URL: https://issues.apache.org/jira/browse/LUCENE-2072
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9.1
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: jakarta-regexp-1.5.jar, LUCENE-2072.patch
>
>
> contrib/regex uses jakarta-regexp 1.4 while 1.5 has been out for a while and makes 
> the package-private workaround obsolete. We should upgrade the lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Resolved: (LUCENE-2072) Upgrade contrib/regex to jakarta-regex 1.5

2009-11-22 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2072.
-

Resolution: Later

once jakarta-regexp fixes their issues we can go ahead and upgrade. for now this 
doesn't make sense.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
Yeah I think there will be lots of optimizing we can do, after flex lands.

Maybe stick w/ String for now?  But open an issue, today, to remind us
to cutover to char[] post-flex?

Doing all processing in UTF8 is tantalizing too ;)  This would mean no
conversion of the terms data on iterating from the terms dict...

Mike

On Sun, Nov 22, 2009 at 1:56 PM, Robert Muir  wrote:
> ok, I only ask because some rework of this enum could be necessary to take
> advantage of the new api.
>
> examples include changing it to use char[] (easy) to prevent lots of string
> creation, which was unavoidable with TermEnum since it is based on string.
>
> i will never mention this again, but it could also run on byte[] pretty
> easily.
> However I think high-level processing like this should use utf-16
> processing, as java intended, although I'm pretty positive it would be
> extremely fast.
>
> On Sun, Nov 22, 2009 at 1:33 PM, Michael McCandless
>  wrote:
>>
>> I think you should keep doing all LUCENE-1606 work (and, any other
>> issues) on trunk, and then we merge down to flex branch once it's
>> committed?
>>
>> We shouldn't hold up any trunk features because flex is
>> coming... merging down every so often seems manageable so far (Mark?).
>>
>> I'm hoping to finish flex soonish -- largely what remains (I think!)
>> is better testing (correctness & performance) of the 4-way
>> combinations.  I think the codecs approach is generally working
>> well.. the fact that we have initial Pulsing & PforDelta codecs
>> working is great.
>>
>> Mike
>>
>> On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir  wrote:
>> > Mike, I guess what I am implying is should i even bother with
>> > lucene-1606
>> > and trunk?
>> >
>> > or instead, should i be helping you, looking at TermsEnum, and working
>> > on
>> > integrating it into flex?
>> >
>> > On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless
>> >  wrote:
>> >>
>> >> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir  wrote:
>> >>
>> >> >> No, not really... just an optimization I found when hunting ;)
>> >> >>
>> >> >> I'm working now on an AutomatonTermsEnum that uses the flex API
>> >> >> directly, to test that performance.
>> >> >>
>> >> >
>> >> > I didn't mean to 'bail out' on this
>> >>
>> >> You didn't 'bail out'; I 'bailed in' ;)  This is the joy of open
>> >> source... great big noisy Bazaar.
>> >>
>> >> > but I could not tell if TermsEnum was close to stabilized
>> >>
>> >> I think it's close; we need to do this port anyway, once automaton is
>> >> committed to trunk, so really I saved Mark some work ;)
>> >>
>> >> > and it might be significant work to convert it?
>> >>
>> >> It wasn't too bad, but maybe you can look it over once I post patch
>> >> and see if I messed anything up :)
>> >>
>> >> > Maybe benching numeric range would be easier and accomplish the same
>> >> > thing?
>> >>
>> >> Yeah benching NRQ would be good too... many benchmarks still to run.
>> >>
>> >> Mike
>> >>
>> >> -
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >
>> >
>> >
>> > --
>> > Robert Muir
>> > [email protected]
>> >
>>
>> -
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
>
> --
> Robert Muir
> [email protected]
>

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781213#action_12781213
 ] 

Michael McCandless commented on LUCENE-1606:


bq. it would be nice I think if TermRef provided a helper method to make the 
char[] available? 

I agree... though, this requires state (UnicodeUtil.UTF16Result).  We could 
lazily set such state on the TermRef, but that's making TermRef kinda heavy 
(it's nice and lightweight now).  Hmmm.
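One way out is to keep the conversion state with the consumer rather than on
TermRef itself; a sketch assuming the flex branch's TermRef carries
byte[]/offset/length and the existing UnicodeUtil helper:

    import org.apache.lucene.util.UnicodeUtil;

    public class TermRefChars {
      // Reused across terms, so there is no per-term allocation and TermRef
      // itself stays stateless and lightweight.
      private final UnicodeUtil.UTF16Result utf16 = new UnicodeUtil.UTF16Result();

      /** Converts a term's UTF-8 bytes to UTF-16; valid until the next call. */
      public char[] chars(byte[] bytes, int offset, int length) {
        UnicodeUtil.UTF8toUTF16(bytes, offset, length, utf16);
        return utf16.result; // utf16.length holds the number of valid chars
      }
    }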

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
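A toy rendering of that enumeration loop: a TreeSet stands in for the sorted
term dictionary, a trivial predicate for the BRICS RunAutomaton, and
nextPossible() for the DFA-driven "next possible String" computation, which
here degenerates to the smallest string greater than the rejected term:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.TreeSet;

    public class SeekEnumSketch {
      static boolean accept(String term) { return term.startsWith("b"); }

      static String nextPossible(String rejected) { return rejected + '\u0000'; }

      public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<String>(
            Arrays.asList("apple", "bar", "baz", "car", "dog"));
        List<String> hits = new ArrayList<String>();
        String term = terms.first();
        while (term != null) {
          if (accept(term)) {
            hits.add(term);                           // accepted: a real match
            term = terms.higher(term);
          } else {
            term = terms.ceiling(nextPossible(term)); // "seek" past the reject
          }
        }
        System.out.println(hits); // [bar, baz]
      }
    }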

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
On Sun, Nov 22, 2009 at 3:50 PM, Michael McCandless <
[email protected]> wrote:

> Yeah I think there will be lots of optimizing we can do, after flex lands.
>
> Maybe stick w/ String for now?  But open an issue, today, to remind us
> to cutover to char[] post-flex?
>

ok, i'll create one.


>
> Doing all processing in UTF8 is tantalizing too ;)  This would mean no
> conversion of the terms data on iterating from the terms dict...
>

lets please not go this route :) its gonna be enough trouble fixing the
char[]-based code for unicode 4, forget about byte[]

>
> Mike
>
>
-- 
Robert Muir
[email protected]


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781214#action_12781214
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. I agree... though, this requires state (UnicodeUtil.UTF16Result). We could 
lazily set such state on the TermRef, but that's making TermRef kinda heavy 
(it's nice and lightweight now). Hmmm.

i guess the state could be in the TermsEnum, but that doesn't make for general 
use of TermRef.
what else uses TermRef?



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
On Sun, Nov 22, 2009 at 3:52 PM, Robert Muir  wrote:
>
> On Sun, Nov 22, 2009 at 3:50 PM, Michael McCandless
>  wrote:
>>
>> Yeah I think there will be lots of optimizing we can do, after flex lands.
>>
>> Maybe stick w/ String for now?  But open an issue, today, to remind us
>> to cutover to char[] post-flex?
>
> ok, i'll create one.

Thanks.

>> Doing all processing in UTF8 is tantalizing too ;)  This would mean no
>> conversion of the terms data on iterating from the terms dict...
>
> lets please not go this route :) its gonna be enough trouble fixing the
> char[]-based code for unicode 4, forget about byte[]

I'll defer to you ;)

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Created: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

2009-11-22 Thread Robert Muir (JIRA)
convert automaton to char[] based processing and TermRef / TermsEnum api


 Key: LUCENE-2090
 URL: https://issues.apache.org/jira/browse/LUCENE-2090
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: 3.1


The automaton processing is currently done with String, mostly because TermEnum 
is based on String.
It is easy to change the processing to work with char[], since behind the 
scenes this is used anyway.

In general I think we should make sure char[]-based processing is exposed in 
the automaton pkg anyway, for things like pattern-based tokenizers and such.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
I guess here is where I just say that unicode and java are optimized for
utf-16 processing, and so while I agree with byte[] being available in
places like this for flex indexing,
I'm already nervous about seeing code / optimizations that only work well
with latin-1, and are very slow / buggy for anything else.

On Sun, Nov 22, 2009 at 3:58 PM, Michael McCandless <
[email protected]> wrote:

> On Sun, Nov 22, 2009 at 3:52 PM, Robert Muir  wrote:
> >
> > On Sun, Nov 22, 2009 at 3:50 PM, Michael McCandless
> >  wrote:
> >>
> >> Yeah I think there will be lots of optimizing we can do, after flex
> lands.
> >>
> >> Maybe stick w/ String for now?  But open an issue, today, to remind us
> >> to cutover to char[] post-flex?
> >
> > ok, i'll create one.
>
> Thanks.
>
> >> Doing all processing in UTF8 is tantalizing too ;)  This would mean no
> >> conversion of the terms data on iterating from the terms dict...
> >
> > lets please not go this route :) its gonna be enough trouble fixing the
> > char[]-based code for unicode 4, forget about byte[]
>
> I'll defer to you ;)
>
> Mike
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]


[jira] Updated: (LUCENE-2068) fix reverseStringFilter for unicode 4.0

2009-11-22 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2068:


Attachment: LUCENE_2068.patch

added a CHANGES.txt entry. Will commit soon.

> fix reverseStringFilter for unicode 4.0
> ---
>
> Key: LUCENE-2068
> URL: https://issues.apache.org/jira/browse/LUCENE-2068
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2068.patch, LUCENE-2068.patch, LUCENE_2068.patch, 
> LUCENE_2068.patch, LUCENE_2068.patch
>
>
> ReverseStringFilter is not aware of supplementary characters: when it 
> reverses it will create unpaired surrogates, which will be replaced by U+FFFD 
> by the indexer (but not at query time).
> The wrong words will conflate to each other, and the right words won't match, 
> basically the whole thing falls apart.
> This patch implements in-place reverse with the algorithm from apache harmony 
> AbstractStringBuilder.reverse0()
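The Harmony-style approach amounts to a plain reverse followed by a fix-up pass
that re-swaps any surrogate pair left in low/high order; an illustrative
sketch, not the patch itself:

    public class SurrogateAwareReverse {
      // Reverse in place, then repair surrogate pairs: after a plain reverse
      // a pair appears as low surrogate followed by high surrogate.
      static void reverse(char[] buf, int start, int len) {
        for (int i = start, j = start + len - 1; i < j; i++, j--) {
          char tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
        }
        for (int i = start; i < start + len - 1; i++) {
          if (Character.isLowSurrogate(buf[i])
              && Character.isHighSurrogate(buf[i + 1])) {
            char tmp = buf[i]; buf[i] = buf[i + 1]; buf[i + 1] = tmp;
            i++; // skip over the repaired pair
          }
        }
      }

      public static void main(String[] args) {
        char[] s = "a\uD835\uDD0Ab".toCharArray(); // 'a', U+1D50A, 'b'
        reverse(s, 0, s.length);
        System.out.println(new String(s)); // "b", U+1D50A, "a" -- pair intact
      }
    }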

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Resolved: (LUCENE-2068) fix reverseStringFilter for unicode 4.0

2009-11-22 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2068.
-

Resolution: Fixed

Committed in revision 883149


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
On Sun, Nov 22, 2009 at 4:06 PM, Robert Muir  wrote:
> I guess here is where I just say that unicode and java are optimized for
> utf-16 processing

I agree, though leaving things as UTF8 works fine for low level stuff
(sorting, comparing equality, etc.)?
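(UTF-8 was designed so that unsigned byte-wise comparison matches code point
order, which is why no decode is needed for those operations; a sketch:)

    public class Utf8OrderSketch {
      // Unsigned byte-wise comparison of UTF-8 yields code point order.
      static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
          int diff = (a[i] & 0xff) - (b[i] & 0xff); // unsigned is essential
          if (diff != 0) return diff;
        }
        return a.length - b.length;
      }

      public static void main(String[] args) throws Exception {
        byte[] z = "z".getBytes("UTF-8");      // 0x7A
        byte[] e = "\u00e9".getBytes("UTF-8"); // 0xC3 0xA9 for U+00E9
        System.out.println(compareUtf8(z, e) < 0); // true: U+007A < U+00E9
      }
    }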

> and so while I agree with byte[] being available in
> places like this for flex indexing,
> I'm already nervous about seeing code / optimizations that only work well
> with latin-1, and are very slow / buggy for anything else.

Buggy we should clearly outright fix.

Slower, maybe.  But very slow, I hope not?

What places specifically are you worried about?

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Robert Muir
On Sun, Nov 22, 2009 at 4:16 PM, Michael McCandless <
[email protected]> wrote:

> On Sun, Nov 22, 2009 at 4:06 PM, Robert Muir  wrote:
> > I guess here is where I just say that unicode and java are optimized for
> > utf-16 processing
>
> I agree, though leaving things as UTF8 works fine for low level stuff
> (sorting, comparing equality, etc.)?
>

+1


>
> > and so while I agree with byte[] being available in
> > places like this for flex indexing,
> > I'm already nervous about seeing code / optimizations that only work well
> > with latin-1, and are very slow / buggy for anything else.
>
> Buggy we should clearly outright fix.
>
> Slower, maybe.  But very slow, I hope not?
>
> What places specifically are you worried about?
>

places like AutomatonQuery, where I found myself wanting to consider the
option of processing byte[], when I know this is very bad!


> Mike
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781219#action_12781219
 ] 

Michael McCandless commented on LUCENE-1606:


Besides TermsEnum.. TermRef is used by the terms dict, when doing the binary 
search + scan to find a term.  And also by TermsConsumer (implemented by the 
codec, and used when writing a segment to the index).

Maybe MTQ holds the state, or FilteredTermsEnum?  Other consumers of TermsEnum 
don't need to convert to char[].

We can discuss this under the new, separate "optimization" issue for MTQs?

Also, remember that the current API is doing not only new String() but also new 
Term() when it enumerates the terms, so having to do new String() for MTQs on the 
flex API is OK for starters.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-22 Thread Uwe Schindler
Hi,

As a non-counting vote:

+1 to release these artifacts as Lucene 3.0

I tested lucene-core-3.0.0.jar with my updated application, no problems
occurred. QueryParser search works, fieldcache/sorting works, numeric range
works. Reopen also works correctly, no leftover open files. MMapDirectory on
64-bit Java 1.5.0_22 on Solaris. Merging old indexes with compressed fields
created newer, but larger segments (as expected).
I also reindexed my indexes; the filesize was identical to the one after
merge/optimize with compression expansion, so the indexes seem to be
identical.

I only had to change some parts of my code and remove lots of unneeded casts
(thanks to generics). :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -Original Message-
> From: Uwe Schindler [mailto:[email protected]]
> Sent: Sunday, November 22, 2009 4:07 PM
> To: [email protected]; [email protected]
> Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
> 
> Hi,
> 
> I have built the artifacts for the final release of "Apache Lucene Java
> 3.0.0" a second time, because of a bug in the TokenStream API (found by
> Shai
> Erera, who wanted to make "bad" things with addAttribute, breaking its
> behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
> prevent
> stack overflow, LUCENE-2087). They are targeted for release on 2009-11-25.
> 
> The artifacts are here:
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
> 
> You find the changes in the corresponding sub folder. The SVN revision is
> 883080, here the manifest with build system info:
> 
> Manifest-Version: 1.0
> Ant-Version: Apache Ant 1.7.0
> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
> Specification-Title: Lucene Search Engine
> Specification-Version: 3.0.0
> Specification-Vendor: The Apache Software Foundation
> Implementation-Title: org.apache.lucene
> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
> Implementation-Vendor: The Apache Software Foundation
> X-Compile-Source-JDK: 1.5
> X-Compile-Target-JDK: 1.5
> 
> Please vote to officially release these artifacts as "Apache Lucene Java
> 3.0.0".
> 
> We need at least 3 binding (PMC) votes.
> 
> Thanks everyone for all their hard work on this and I am very sorry for
> requesting a vote again, but that's life! Thanks Shai for the pointer to
> the
> bug!
> 
> 
> 
> 
> Here is the proposed release note, please edit, if needed:
> --
> 
> Hello Lucene users,
> 
> On behalf of the Lucene dev community (a growing community far larger than
> just the committers) I would like to announce the release of Lucene Java
> 3.0:
> 
> The new version is mostly a cleanup release without any new features. All
> deprecations targeted to be removed in version 3.0 were removed. If you
> are
> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
> warnings in your code base to be able to recompile against this version.
> 
> This is the first Lucene release with Java 5 as a minimum requirement. The
> API was cleaned up to make use of Java 5's generics, varargs, enums, and
> autoboxing. New users of Lucene are advised to use this version for new
> developments, because it has a clean, type safe new API. Upgrading users
> can
> now remove unnecessary casts and add generics to their code, too. If you
> have not upgraded your installation to Java 5, please read the file
> JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene
> 3.0, it will also happen with any previous release when you upgrade your
> Java environment).
> 
> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
> deprecated compressed fields; support for them was removed now. Lucene 3.0
> is still able to read indexes with compressed fields, but as soon as
> merges
> occur or the index is optimized, all compressed fields are decompressed
> and
> converted to Field.Store.YES. Because of this, indexes with compressed
> fields can suddenly get larger.
> 
> While we generally try and maintain full backwards compatibility between
> major versions, Lucene 3.0 has some minor breaks, mostly related to
> deprecation removal, pointed out in the 'Changes in backwards
> compatibility
> policy' section of CHANGES.txt. Notable are:
> 
> - IndexReader.open(Directory) now opens in read-only mode per default
> (this
> method was deprecated because of that in 2.9). The same occurs to
> IndexSearcher.
> 
> - Already started in 2.9, core TokenStreams are now made final to enforce
> the decorator pattern.
> 
> - If you interrupt an IndexWriter merge thread, IndexWriter now throws an
> unchecked ThreadInterruptedException that extends RuntimeException and
> clears the interrupt status.
> 
> --
> 
> 
> 
> Thanks,
> Uwe
> 
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen

Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java

2009-11-22 Thread Michael McCandless
On Sun, Nov 22, 2009 at 4:19 PM, Robert Muir  wrote:

>> What places specifically are you worried about?
>
> places like AutomatonQuery, where I found myself wanting to consider the
> option of processing byte[], when I know this is very bad!

Ahh OK :)  Well you got the better of yourself before it was too late!

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781220#action_12781220
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. We can discuss this under the new, separate "optimization" issue for 
MTQs?

is there a jira issue for this??

bq. Also, remember that the current API is doing not only new String() but also 
new Term() when it enumerates the terms, so having to do new String() for MTQs on 
the flex API is OK for starters.

oh yeah, it's clear the flex API is already better, from the benchmarks.
I am just trying to think of ways to make it both faster and, at the same time, 
easy too.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781221#action_12781221
 ] 

Michael McCandless commented on LUCENE-1606:


bq. is there a jira issue for this??

I thought you were about to open one!

bq. I am just trying to think of ways to make it both faster, at the same time 
easy too.

Which is great: keep it up!

Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to 
be most convenient for "higher up" APIs like automaton?  If we did char[] you'd 
still have to handle surrogate processing (and then it's not unlike doing byte[]).


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781222#action_12781222
 ] 

Michael McCandless commented on LUCENE-2090:


Spinoff from LUCENE-1606.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781223#action_12781223
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. I thought you were about to open one!

I opened one for Automaton specifically, should i change it to be all MTQ?

bq. Actually... wouldn't we need to convert to int[] (for Unicode 4) not 
char[], to be most convenient for "higher up" APIs like automaton? If we did 
char[] you'd still have to handle surrogate processing (and then it's not unlike 
doing byte[]).

nope, because unicode and java are optimized for UTF-16, not UTF-32. so we 
should use char[], but use the codePoint APIs, which are designed such that you 
can process text in UTF-16 (char[]) efficiently, yet also handle the rare case 
of supplementary characters.
char[] is correct, it's just that we have to be careful to use the right APIs 
for processing it.
With String a lot of the APIs, such as String.toLowerCase, do this 
automatically for you, so most applications have no issues.
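A minimal example of that style of processing: Character.codePointAt and
charCount do the surrogate handling, so the loop runs over a plain char[] yet
sees a supplementary character as a single unit:

    public class CodePointLoop {
      static void dumpCodePoints(char[] text) {
        for (int i = 0; i < text.length; ) {
          int cp = Character.codePointAt(text, i); // merges surrogate pairs
          System.out.printf("U+%04X%n", cp);
          i += Character.charCount(cp);            // advance by 1 or 2 chars
        }
      }

      public static void main(String[] args) {
        // 'a' plus one supplementary character written as a surrogate pair
        dumpCodePoints("a\uD835\uDD0A".toCharArray()); // U+0061, U+1D50A
      }
    }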



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2009-11-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781224#action_12781224
 ] 

Michael McCandless commented on LUCENE-1260:


Patch looks good!  Thanks Johan.  I'll commit in a day or two...

> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
>Reporter: Karl Wettin
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
> Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt
>
>
> The static span and resolution of the 8 bit norms codec might not fit with 
> all applications. 
> My use case requires that 100f-250f is discretized in 60 bags instead of the 
> default.. 10?
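For context, this is the kind of pluggable codec the issue asks for, sketched
with org.apache.lucene.util.SmallFloat; the instance-method hook names follow
the patch's direction and are not the released 2.9/3.0 API (there
encodeNorm/decodeNorm are static):

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.util.SmallFloat;

    // Trades range for resolution: SmallFloat's 5-mantissa-bit encoding
    // instead of the default 3-bit floatToByte315 used for norms.
    public class FineResolutionSimilarity extends DefaultSimilarity {
      public byte encodeNormValue(float f) {
        return SmallFloat.floatToByte52(f);
      }
      public float decodeNormValue(byte b) {
        return SmallFloat.byte52ToFloat(b);
      }
    }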

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-22 Thread Uwe Schindler

And I forgot:
I was able to build and test the whole distribution from the source ZIP.
JavaDocs of binary distrib were ok, too.

So no change: +1 :-)

Uwe


[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2009-11-22 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2089:


Attachment: Moman-0.1.tar.gz

From the Moman author:

Absolutely. Sorry for the missing links. I had some problems with my
provider, which backed up an old version of my website.
The library is MIT licensed, so feel free to take anything you want. I
didn't make the subversion repository available since I was working
on Daciuk's "Incremental construction of minimal acyclic finite-state
automata", improving the memory footprint of the algorithm.
Since I want to be sure the new algorithm is right before publishing
it, the repository isn't available. Anyway, here's the source
code (attached). I must warn you that there are no comments at all,
which is unfortunate; I was contacted by many people
lately who had the same kind of request as yours.

If you need anything, just don't hesitate to ask.


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
> Attachments: Moman-0.1.tar.gz
>
>
> Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is 
> itching to write that nasty algorithm).
> We can optimize FuzzyQuery by using AutomatonTermEnum. Here is my idea:
> * Up front, calculate the maximum required number of edits K needed to match 
> the user's supplied float threshold (see the sketch after this quoted 
> description).
> * For at least the common K values (1, 2, 3, etc.) we should use 
> AutomatonTermEnum; outside of that range, maybe use the existing slow logic. 
> At high K, it will seek too much to be helpful anyway.
> I modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for a string of seven Ns (NNNNNNN), etc.
> * AvgMS_DFA: the time spent creating the automaton (constructor).
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9| 
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> As you can see, this prototype is no good yet, because it creates the DFA in 
> a slow way: right now it creates an NFA, and all the wasted time is in the 
> NFA->DFA conversion.
> So for a very long string, it just gets worse and worse. This has nothing to 
> do with Lucene, and as you can see, the TermEnum itself is fast (AvgMS - 
> AvgMS_DFA); there is no problem there.
> Instead we should build a DFA directly, maybe with this paper: 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> We can precompute the tables for that algorithm up to some reasonable K, and 
> then I think we are OK.
> The paper references http://portal.acm.org/citation.cfm?id=135907 for 
> linear-time minimization; if someone wants to implement this, they should 
> not worry about minimization.
> In fact, we need to determine at some point whether AutomatonQuery should 
> minimize FSMs at all, or whether it is simply enough for them to be 
> deterministic with no transitions to dead states. (The only code that 
> actually assumes a minimal DFA is the "Dumb" vs. "Smart" heuristic, and this 
> can easily be rewritten as a summation.) We need to benchmark really complex 
> DFAs (i.e. write a regex benchmark) to figure out whether minimization is 
> even helping right now.
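
A minimal sketch of the up-front K computation from the first bullet above,
assuming FuzzyQuery's usual similarity = 1 - distance / length scaling (the
method name is illustrative, not actual Lucene code):

    // Hedged sketch, not actual Lucene code: the largest edit distance K
    // that can still satisfy the user's float threshold under the
    // similarity = 1 - distance/length convention used by FuzzyTermEnum.
    static int maxRequiredEdits(float minimumSimilarity, int termLength) {
        // Any distance d with 1 - d/termLength >= minimumSimilarity matches,
        // so the largest admissible d is
        // floor((1 - minimumSimilarity) * termLength).
        return (int) ((1.0f - minimumSimilarity) * termLength);
    }

For example, minimumSimilarity=0.5 and termLength=7 (the 7N pattern) gives
K=3; such small K could go to AutomatonTermEnum, and larger K to the
existing slow enumeration.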

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781227#action_12781227
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Actually... wouldn't we need to convert to int[] (for Unicode 4) not 
char[], to be most convenient for "higher up" APIs like automaton? If we did 
char[] you'd still have to handle surrogate processing (and then it's not 
unlike doing byte[]).

I wanted to make another comment here: I agree that this is somewhat like 
byte[]. But there are some major differences:
# the Java API provides mechanisms in Character, etc. for processing text 
this way.
# lots of stuff is unaffected. For example, .startsWith() is not broken for 
supplementary characters: it does not have to use codepoints anywhere, it 
can just compare chars, which happen to be surrogates, and this is OK. So 
lots of char[]-based processing is already compatible and completely unaware 
of this issue; this is not true for byte[]. (A small runnable illustration 
follows below.)
# it will perform the best overall: the codepoint handling is only needed in 
very few places, and we can be very careful where we add these checks, so we 
don't slow anything down or increase RAM usage, etc.
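
A small self-contained illustration of point 2 (plain JDK, runnable as-is):
char-by-char operations such as String.startsWith() stay correct for
supplementary characters, because the surrogate pair is simply compared as
two chars.

    public class SurrogateStartsWith {
        public static void main(String[] args) {
            // U+10330 (GOTHIC LETTER AHSA) lies outside the BMP and is
            // stored as a surrogate pair (two chars) in Java's UTF-16
            // Strings.
            String ahsa = new String(Character.toChars(0x10330));
            String text = ahsa + "abc";

            // startsWith compares char-by-char; both surrogate chars match,
            // so no codepoint-aware logic is needed here.
            System.out.println(text.startsWith(ahsa)); // true

            // Char count and code point count differ, which is where the
            // few codepoint-aware checks mentioned above come in.
            System.out.println(text.length());                         // 5
            System.out.println(text.codePointCount(0, text.length())); // 4
        }
    }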

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if 
> it's not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally, all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon a constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc. on large text corpora
>  2. looking for things such as URLs where the prefix is not constant 
> (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to 
> convert regular expressions into a DFA. Then the Filter "enumerates" terms 
> in a special way, using the underlying state machine. Here is my short 
> description from the comments (a sketch of this loop follows after this 
> quoted description):
>  The algorithm here is pretty basic. Enumerate terms, but instead of a 
> binary accept/reject, do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA).
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the Filter with ConstantScoreQuery.
> I did not include automaton.jar inside the patch, but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
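
A hedged sketch of the enumeration loop described above (the names
termsEnum.seekTo(), smallestAcceptedString(), and nextAcceptedString() are
illustrative assumptions, not the patch's actual API; dfa.run() stands for a
BRICS RunAutomaton-style acceptance test):

    // Sketch only: accept/seek enumeration over the sorted term dictionary.
    String term = termsEnum.seekTo(smallestAcceptedString(dfa));
    while (term != null) {
        if (dfa.run(term)) {
            collect(term);           // accepted: a real match
            term = termsEnum.next(); // continue in term order
        } else {
            // Use the longest prefix that did not enter a reject state to
            // build the smallest accepted string > term, then seek there,
            // skipping the whole rejected gap in one step.
            String next = nextAcceptedString(dfa, term);
            if (next == null) break; // the automaton cannot match anything else
            term = termsEnum.seekTo(next);
        }
    }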

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


