Problematic documentation in o.a.l.analysis.package.html
Hi

I've read the analysis package.html and I found some issues:

1) The code sample under "Invoking the Analyzer" is broken. It calls incrementToken(), but inside the while loop it prints 'ts' (which is the TokenStream) and then does "t = ts.next()", which no longer works. That's an easy fix, so I don't think a JIRA issue is needed. (A corrected sketch appears right after this message.)

2) The documentation specifies that "Even consumers of TokenStreams should normally call addAttribute() instead of getAttribute(), because it would not fail if the TokenStream does not have this Attribute". IMO this is wrong and will give a wrong impression of how this API should be used. What if the TokenStream does not care about this attribute? It will not fill it with any information. The example with LengthFilter, which calls addAttribute instead of has/getAttribute, is a good illustration of why you shouldn't just call addAttribute. LengthFilter relies on the given TokenStream to fill TermAttribute with some information, so that it can later filter out terms whose length is below the threshold. But what if I create a LengthFilter and give it a TokenStream which creates just a PartOfSpeechAttribute, or outputs terms that are not TermAttributes? Obviously it would be silly for me to do so, but nothing prevents me from doing it. LengthFilter should either document that it expects TermAttribute to be returned from the input TokenStream, or better yet, enforce it in the constructor: if you pass a TokenStream that does not return TermAttribute, throw an IllegalArgumentException. Either way, the current documentation is, IMO, wrong and may lead to a wrong impression. I don't know if this warrants a larger issue to investigate all the current TokenFilters and validate the input TokenStream. In my filters, I enforce the existence of a certain attribute. If I've misunderstood something, please correct me.

3) I think it would help if there were some documentation/examples about how TokenFilters are expected to process an Attribute before they return it. For example, if I have a TokenFilter which processes a certain TermAttribute by returning two other TermAttributes, then according to my understanding, upon calling incrementToken() it should:
3.1) On the first call, clone the TokenStream's TermAttribute into an instance variable, then process it and store in the TokenStream's TermAttribute the first TA it should return.
3.2) On the second call, process it again and store the second TA in the TokenStream's TermAttribute.
That's because the consumer will call incrementToken() and then getAttribute(), and that getAttribute() will return the TokenStream's attribute, not the filter's. I think I've read this somewhere, but it doesn't appear in this package.html.

Shai
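For reference, a minimal sketch of what the corrected consumer loop could look like against the new (2.9/3.0) API -- not the exact text of package.html, just the pattern being described, assuming the 3.0 generics and Version.LUCENE_30:

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class InvokeAnalyzer {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    // obtain the attribute once, up front; the same instance is reused for every token
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // print the token's term text, not the TokenStream itself
      System.out.println(termAtt.term());
    }
    ts.end();
    ts.close();
  }
}
{code}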
[jira] Commented: (LUCENE-2068) fix reverseStringFilter for unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781102#action_12781102 ]

Simon Willnauer commented on LUCENE-2068:
-

Robert, we can take the matchVersion variant of the patch, though I still think we should not maintain back-compatibility here, as that would preserve extremely broken code. I plan to commit this soon if nobody objects.

simon

> fix reverseStringFilter for unicode 4.0
> ---
>
> Key: LUCENE-2068
> URL: https://issues.apache.org/jira/browse/LUCENE-2068
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Reporter: Robert Muir
> Assignee: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2068.patch, LUCENE-2068.patch, LUCENE_2068.patch, LUCENE_2068.patch
>
> ReverseStringFilter is not aware of supplementary characters: when it
> reverses, it will create unpaired surrogates, which will be replaced by U+FFFD
> by the indexer (but not at query time).
> The wrong words will conflate with each other, and the right words won't match;
> basically the whole thing falls apart.
> This patch implements an in-place reverse with the algorithm from Apache Harmony's
> AbstractStringBuilder.reverse0().
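The core trick is easy to sketch. A hedged illustration of the surrogate-aware in-place reverse the description refers to -- my own rendering of the idea credited to Harmony, not the committed patch:

{code}
// Reverse a char[] region in place without creating unpaired surrogates:
// first reverse all code units naively, then swap back any surrogate pair
// the naive pass left in low/high order. Sketch only, not the LUCENE-2068 patch.
public class SupplementaryReverse {
  public static void reverse(char[] buf, int start, int len) {
    for (int i = start, j = start + len - 1; i < j; i++, j--) {
      char tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }
    for (int i = start, end = start + len - 1; i < end; i++) {
      if (Character.isLowSurrogate(buf[i]) && Character.isHighSurrogate(buf[i + 1])) {
        char tmp = buf[i]; buf[i] = buf[i + 1]; buf[i + 1] = tmp;
        i++; // skip the pair we just repaired
      }
    }
  }

  public static void main(String[] args) {
    // 'a', a supplementary character (one surrogate pair), 'b'
    char[] s = "a\uD835\uDD0Ab".toCharArray();
    reverse(s, 0, s.length);
    System.out.println(new String(s)); // 'b', the same supplementary character, 'a'
  }
}
{code}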
RE: Problematic documentation in o.a.l.analysis.package.html
Hi Shai,

Thanks for the suggestions! About your points:

1) This is really wrong; we can easily fix it for 3.1. Lucene 3.0 is already in the vote phase and 2.9.x is also already out.

2) Maybe the explanation is not so good. This text comes mostly from the 2.9 old-to-new TokenStream migration. The current indexer normally uses all attributes, and the default Token implementation implements all six (in 2.9). We had a lot of TokenStreams in core that used getAttribute() without checking whether the attribute is really there. With Token/TokenWrapper as the implementation this was always OK, but as soon as you switched to the new API only, the attributes were implemented by single instances, so after adding a TermAttribute you could not be sure to also have a PositionIncrementAttribute. Even Lucene's indexer had that problem. So for the basic six attributes you can always use addAttribute(), because if you do not add an attribute when instantiating the stream, it is added by the indexer. This applies to TermAttribute, PositionIncrementAttribute and OffsetAttribute (if term vectors are enabled). So if the chain does not add them, they are added by the indexer (the consumer). Because all attributes have a default value this is no problem, and the indexer code is even faster: e.g. it does not make sense to check for a PositionIncrementAttribute every time, as reading the default int value of 1 is just as fast -- faster, even, without the extra null check -- and with the current index format it is always tracked and indexed. If you have your own attribute, it may indeed make sense to check with hasAttribute() and throw an IllegalArgumentException (see the sketch after this message). But after this check you can still use addAttribute() to get the reference to the attribute (there was a discussion about removing getAttribute() altogether); getAttribute() and addAttribute() have the same speed when fetching an already existing attribute. So it is up to you how you implement your TokenStreams; those are just hints for beginners. The thing with hasAttribute() and addAttribute() is also mentioned in the docs a few lines down, in the "optimization" part.

3) Attribute interfaces cannot be cloned (the Attribute interface has no clone()). You can only clone implementations (AttributeImpl), and it may happen that you clone not only the TermAttribute but some other attributes as well (e.g. if you use Token as the impl, as 2.9 does by default for backwards compatibility). Because of that, attribute instances cannot simply be cloned, only their implementing classes. What you generally should do (see the caching TokenFilter): use AttributeSource.captureState() to capture all current attributes in the filter chain. In incrementToken() you can restore these captured states into the existing attributes again. Such states can be held in local instance variables, Lists, etc.

Your second point I do not understand: consumers call addAttribute() (or hasAttribute()/getAttribute()) exactly once, before consuming! So they only have the current attribute instance, which can never change during the lifetime of a TokenStream chain. Also, all TokenFilters/TokenStreams/Tokenizers use the same instances -- that is how the TS chain works. If every TokenStream had its own attributes, they could not communicate!

The important thing is that you cannot rely on whether a chain contains an attribute or not: even if you created the Tokenizer and one TokenFilter in the chain and checked hasAttribute() in the ctor of the filter, it may happen that a third TokenFilter adds a new attribute, which is then suddenly available (the second filter does not see it -- OK, it may not need to know about it, because the Tokenizer will never set it). Because of that it is always best to use addAttribute() if you are "interested" in an attribute. If it does not exist, you just get default values (like termLength() == 0 for TermAttribute, posIncr == 1, etc.).

Hope that explains your questions.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
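To make the hasAttribute()/addAttribute() pattern concrete, here is a hypothetical filter in the spirit of the LengthFilter discussion. This is my own sketch against the 3.0 generics API (Lucene's actual LengthFilter does not perform this constructor check, which is exactly Shai's complaint); it also ignores position increments of dropped tokens for brevity:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical filter: fail fast in the ctor if the chain does not provide
// a TermAttribute, then use addAttribute() to obtain the shared instance.
public final class MinLengthFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private final int minLength;

  public MinLengthFilter(TokenStream input, int minLength) {
    super(input);
    if (!input.hasAttribute(TermAttribute.class)) {
      throw new IllegalArgumentException("input must provide a TermAttribute");
    }
    // same instance as the one used by the rest of the chain
    this.termAtt = addAttribute(TermAttribute.class);
    this.minLength = minLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.termLength() >= minLength) {
        return true; // keep this token
      }
      // else: skip short tokens and keep pulling from the input
    }
    return false;
  }
}
{code}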
Re: Hiding JIRA issues
Eek, that's kinda spooky... that we didn't get to the root cause. I sure hope Lucene is not to blame ;)

Mike

On Sat, Nov 21, 2009 at 9:21 PM, Robert Muir wrote:
> i sent a note to infrastructure about this, they reindexed, and everything is fixed now.

Earlier in the thread:

On Sat, Nov 21, 2009 at 10:20 AM, Uwe Schindler wrote:
> Issues are missing after JIRA Client 1.4.1 synchronization, too.

Uwe Schindler wrote:
> I am now trying to find it with JIRA Client GUI...

On Sat, Nov 21, 2009 at 3:57 PM, Simon Willnauer wrote:
> Uwe, that way you should be able to spot your infra issue too.

On Sat, Nov 21, 2009 at 3:53 PM, Robert Muir wrote:
> Mark, there's actually a huge chunk of hidden issues around that time. Click 'view all issues' for Lucene, and sort descending. We have about 20 right now, and you can easily see the gaps -- those are the hidden ones!

On Sat, Nov 21, 2009 at 9:50 AM, Mark Miller wrote:
> By the way, I'm sure it's probably the same issue - KayKay reported 2065 - but if you go to the list of issues that KayKay reported, 2065 is not in that list.

Mark Miller wrote:
> Robert suggested on Twitter that perhaps the JIRA Lucene index is corrupt :)

Uwe Schindler wrote:
> I opened an issue in INFRA to get a Hudson account. I forgot the issue number and now I cannot find it. It is not even listed under my name. How can this happen?

Mark Miller wrote (starting the thread, "Hiding JIRA issues"):
> Robert mentioned there are issues that appear to be hidden in JIRA unless you know what they are. Anyone else notice this?
> An example he gave is http://issues.apache.org/jira/browse/LUCENE-2065 - there it is, but now try and find it in the summary/search view - I have and I can't. Kind of disturbing. There are others as well.
>
> - Mark
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781110#action_12781110
]
Michael McCandless commented on LUCENE-2075:
{quote}
bq. BTW the flex branch fixes this - TermsEnum.seek always checks the cache.
Can we fix this for trunk, too? But I think this issue talks about trunk.
{quote}
Right, this issue is about trunk (not flex API). I think we could fix this
(see my suggestions above), basically decoupling "useCache" from
"seekTheEnum"... but we have to fix the terminfo cache to also store the term's
ord. I'll try out this approach...
> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch,
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch,
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage. You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap. One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary). You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary. Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
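A rough sketch of that double-barrel idea, just to make the proposal concrete. Class and method names are mine, not the attached patch; note the LRU ordering is only approximate (an entry survives a swap only if it was touched while resident), which is the trade-off that makes it cheap under concurrency:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DoubleBarrelLRUCache<K, V> {
  private final int maxSize;
  private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

  public DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    Map<K, V> p = primary;
    V value = p.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        p.put(key, value); // promote, so the entry survives the next swap
      }
    }
    return value;
  }

  public synchronized void put(K key, V value) {
    if (primary.size() >= maxSize) {
      // primary is full: drop the old secondary, demote primary to secondary
      Map<K, V> newPrimary = secondary;
      newPrimary.clear();
      secondary = primary;
      primary = newPrimary;
    }
    primary.put(key, value);
  }
}
{code}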
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1278#action_1278
]
Thomas Mueller commented on LUCENE-1877:
> take it somewhere other than this closed issue.
Yes, where?
> shouldn't active code like that live in the application layer?
Why?
> exceed the polling interval for a low priority process on a system under
> heavy load?
The watchdog thread runs with high priority (see the H2 docs). It's still
possible the thread isn't run at all, but that is highly unlikely; high-priority
threads are quite reliable. I wrote an MP3 player in Java (mp3transform) which I
used a lot, and I never heard any gaps. If the thread can be avoided, that would be
great, of course. I'm just trying to say that in theory the thread is
problematic but in practice it isn't, while file locking is not a problem in
theory but is in practice.
> What happens when the app sleeps?
Good question! Standby / hibernate are not supported. I didn't think about
that. Is there a way to detect the wakeup?
> host name and the pid
Yes. It is not so easy to get the PID in Java; what I found:
http://stackoverflow.com/questions/35842/process-id-in-java
"ManagementFactory.getRuntimeMXBean().getName()". What do you do if the lock
was generated by another machine? I tried using a server socket, so you
need the IP address, but unfortunately sometimes the network is not configured
correctly (though maybe it's possible to detect that), and maybe the two machines
can't access each other over TCP/IP.
> hard links
Yes, but it looks like this doesn't always work.
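As a concrete note on the MXBean trick mentioned above: on Sun-era JVMs the runtime name conventionally has the form "pid@hostname", so both values needed for a lock-file marker can be parsed from it -- but that format is a de-facto convention, not a guarantee of the spec, hence the fallback in this sketch:

{code}
import java.lang.management.ManagementFactory;

public class LockOwner {
  public static String pidAndHost() {
    String name = ManagementFactory.getRuntimeMXBean().getName();
    // usually "12345@myhost"; pass the raw name through if the format differs
    return name.contains("@") ? name : "unknown@" + name;
  }

  public static void main(String[] args) {
    System.out.println(pidAndHost()); // e.g. "12345@myhost"
  }
}
{code}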
> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Javadocs
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch,
> LUCENE-1877.patch
>
>
> A user requested we add a note in IndexWriter alerting the availability of
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm
> exit). Seems reasonable to me - we want users to be able to easily stumble
> upon this class. The below code looks like a good spot to add a note - could
> also improve whats there a bit - opening an IndexWriter does not necessarily
> create a lock file - that would depend on the LockFactory used.
> {code} Opening an IndexWriter creates a lock file for the
> directory in use. Trying to open
> another IndexWriter on the same directory will lead to a
> {@link LockObtainFailedException}. The {@link LockObtainFailedException}
> is also thrown if an IndexReader on the same directory is used to delete
> documents
> from the index.{code}
> Anyone remember why NativeFSLockFactory is not the default over
> SimpleFSLockFactory?
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781112#action_12781112 ]

Michael McCandless commented on LUCENE-2075:
-

bq. in both cases, its slower than trunk, but I assume this is due to flex branch not being optimized yet?

The automaton benchmark looks great -- I'll dig into why the flex branch is slower in both of these cases.

The first case tests the old API on top of an old index, which I'm surprised to see not matching trunk's performance; the flex changes are supposed to "optimize" that case by directly using the old (trunk) code. The second case tests the old API emulated over a flex index, which I'm also surprised to see is not faster than trunk -- there must be something silly going on in the API emulation. I'll dig...

When I tested MTQs (TermRangeQuery, WildcardQuery) using the flex API on a flex index, they were reasonably faster, so I'll also try to get automaton's FilteredTermEnum cut over to the flex API, and test that.
Re: Problematic documentation in o.a.l.analysis.package.html
Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of habit. I'll read more about captureState, but I think copyTo works fine for me.

About (2), I still think it's confusing. When I read addAttribute, I get the impression that by calling this method, it is guaranteed that the attribute will be processed somehow by the input TokenStream -- and this may not be true. There is a semantic difference between addAttribute and input.addAttribute. The former means I add the attribute to *my* TS instance, while the latter adds it to the input TS. Even though both are the same (as TokenFilter attributes share the same instances as the TokenStream's attributes), they are semantically different. I'd expect to do the former just to ensure the attribute is in the 'attributes' map, and the latter to ensure the input TS will process it.

BTW, remember that whatever looks clear to you, the implementer, may not be clear to users. I still think that the current documentation may confuse people, and adding an extra "NOTE: this does not ensure the input TS will do anything with the passed Attribute. If you need to know, call input.hasAttribute to determine whether that attribute is handled by the input TS" won't hurt.

Shai
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781113#action_12781113
]
Thomas Mueller commented on LUCENE-1877:
> detect the wakeup / polling interval exceeded
An obvious solution is to use System.currentTimeMillis(), compare with the
expected delay, and then stop writing and throw an exception.
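A sketch of that detection, under the assumption of a fixed polling interval; the names and the overdue threshold are mine, not from H2:

{code}
// Watchdog that notes the wall-clock time of each poll; if far more time
// elapsed than the configured interval (e.g. after hibernation or heavy
// load), the lock can no longer be trusted, so the holder must stop writing.
public class LockWatchdog extends Thread {
  private static final long POLL_INTERVAL_MS = 1000;
  private static final long MAX_OVERDUE_MS = 10 * POLL_INTERVAL_MS; // assumed threshold
  private volatile boolean lockValid = true;

  public LockWatchdog() {
    setPriority(Thread.MAX_PRIORITY);
    setDaemon(true);
  }

  @Override
  public void run() {
    long last = System.currentTimeMillis();
    while (lockValid) {
      try {
        Thread.sleep(POLL_INTERVAL_MS);
      } catch (InterruptedException e) {
        return;
      }
      long now = System.currentTimeMillis();
      if (now - last > MAX_OVERDUE_MS) {
        // we overslept: the process may have been suspended and another
        // process may have taken over the lock in the meantime
        lockValid = false;
        return;
      }
      last = now;
      // ... refresh the lock file here ...
    }
  }

  public boolean isLockValid() {
    return lockValid;
  }
}
{code}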
RE: Problematic documentation in o.a.l.analysis.package.html
About (3): copyTo() is also only available on AttributeImpls, not on the interfaces (you have to cast first, and then you are warned). If you call copyTo() on a TermAttribute, it may also copy other attributes with it, e.g. if TermAttribute and PositionIncrementAttribute are implemented by the same AttributeImpl. With captureState() you can be sure to have copied all attributes.

There is one important thing: *always* call AttributeSource.clearAttributes() when you generate new tokens in incrementToken(). This affects mainly Tokenizers (they always have to clearAttributes(); see the docs), but if you inject tokens inside a TokenFilter, always call clearAttributes() too (see the sketch after this message). If you do not do this and e.g. only touch the TermAttribute, the PosIncr may stay the same as before and does not get its default, so e.g. StopFilter breaks when you inject tokens (the SmartChinese analyzer had that problem).

About (2): There is no semantic difference; the call to input.addAttribute() does exactly the same as this.addAttribute(), because all TokenFilters in a chain share the same attributes. It has exactly the same effect (you get the same instance of the attribute), so they are semantically and in reality the same. If you think this is different, you have not understood the whole attributes API.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
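A hypothetical token-injecting filter showing where that clearAttributes() call belongs. This is my own sketch against the 2.9/3.0 API, not code from Lucene:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Emits a synthetic token after every real one. The injected token must
// reset *all* attributes (including ones this filter never touches, like
// the position increment), not just the ones it sets itself -- hence the
// clearAttributes() call.
public final class MarkerInjectingFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private boolean injectPending = false;

  public MarkerInjectingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (injectPending) {
      injectPending = false;
      clearAttributes();              // resets posIncr to 1, offsets, flags, ...
      termAtt.setTermBuffer("#marker#");
      return true;
    }
    if (input.incrementToken()) {
      injectPending = true;           // emit the marker on the next call
      return true;
    }
    return false;
  }
}
{code}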
Re: Problematic documentation in o.a.l.analysis.package.html
Perhaps copyTo works for me because I reference Token, but like I said, it's working for me...

Thanks for the tip regarding clearAttributes(). I assume I'll get the same behavior if I clear the attributes one by one, defaulting their values to whatever my defaults are.

Well... IMO, as a user, there is a semantic difference. I understand that eventually both are the same. And even though I don't understand the new TS API as well as you do, I think I have a fair understanding of it. And I also think that naive users will be confused... In Java, when you call object.doSomething() vs. doSomething(), you understand that the former is performed on 'object' and the latter on your own instance. The fact that you (in the new API code) made sure both are the same is invisible to the user, unless he/she carefully reads the javadocs. The current impl is smart and useful; I just think some better documentation would help new users.

Shai
RE: Problematic documentation in o.a.l.analysis.package.html
About clearAttributes: just a warning -- if you clear the attributes one by one, you have two problems:

- You can only clear attributes you know about. E.g. most Tokenizers just set TermAttribute and OffsetAttribute (because only these two attributes are interesting). The PosIncr attribute does not need to be touched, as it defaults to 1, which is correct for generating tokens from a Reader. If you only clear those two attributes in the Tokenizer, the PosIncr attribute is not reset. If you add a StopFilter or SynonymFilter after this Tokenizer, which uses the PosIncr attribute, it will not see that attribute reset. This led to the bug in the SmartChineseAnalyzer, which missed resetting *all* attributes, so the StopFilter incremented the PosIncr each time, eventually overflowing with invalid values. So please, whenever you inject new tokens, call clearAttributes() on the TokenStream.

- If you clear the attributes one by one, you may clear them multiple times. If all six attributes are implemented by Token and you clear them one by one, Token.clear() is called six times. Not very efficient. This is also why no attribute interface exposes copyTo(), clone() or equals/hashCode.

I know the documentation can be improved -- sorry about that -- but you are trying to use expert features. Normal TokenFilters just add some attributes in the ctor and use them. Token may be the implementation behind them, but you cannot rely on that!!! If the Tokenizer already added attributes, the Token impl added by the filter will never be added (because the interfaces are already available). So just for "keeping" Token, do *not* try to cast all attributes to Token; this will fail (see next mail).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
[jira] Created: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
-

Key: LUCENE-2088
URL: https://issues.apache.org/jira/browse/LUCENE-2088
Project: Lucene - Java
Issue Type: Bug
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Blocker
Affects Versions: 2.9.1, 2.9, 3.0
Fix For: 3.0

This is a blocker, because you can call addAttribute(Token.class) without getting an error message.
I will commit the fix and restart the vote for 3.0. This also applies to 2.9, but there is no Token Attribute Factory. But I will merge to 2.9, too, if a 2.9.2 comes.
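The guard being added boils down to a two-part runtime test; a sketch of the idea (not the verbatim patch -- the actual registry lookup that follows the check is elided):

{code}
import org.apache.lucene.util.Attribute;

// Reject concrete classes such as Token, which would otherwise be silently
// accepted and break the attribute registry.
public class AddAttributeGuard {
  public static <A extends Attribute> void checkInterface(Class<A> attClass) {
    if (!(attClass.isInterface() && Attribute.class.isAssignableFrom(attClass))) {
      throw new IllegalArgumentException(
          "addAttribute() only accepts an interface that extends Attribute: "
          + attClass.getName());
    }
  }
}
{code}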
[jira] Updated: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
[ https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2088:
-

Attachment: LUCENE-2088.patch

Here is the patch; I will commit soon and respawn the 3.0 vote. I will also merge to the 2.9 branch.
[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
[
https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781122#action_12781122
]
Earwin Burrfoot commented on LUCENE-2088:
-
bq. && Attribute.class.isAssignableFrom(attClass)
What is this for? This line
bq. public <A extends Attribute> A addAttribute(Class<A> attClass) {
ensures the same at compile time.
RE: [jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
If you use it type-unsafely, without generics, it will break. And we need it for 2.9. I was thinking about both variants and thought it would be better to leave it in. I will merge this now to 2.9, too.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
[ https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781124#action_12781124 ]

Uwe Schindler commented on LUCENE-2088:
-

If you use it type-unsafely, without generics, it will break. And we need it for 2.9. You can break this if you do:

addAttribute((Class) List.class)

I was thinking about both variants and thought it would be better to leave it in. I will merge this now to 2.9, too, where we need it in all cases.
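A hypothetical demo of that break, spelled out: the raw Class cast compiles with only an "unchecked" warning, which is why the compile-time generics bound alone is not enough and the runtime check has to stay.

{code}
import java.util.List;

import org.apache.lucene.util.AttributeSource;

// An unchecked cast (or a raw-typed call from pre-generics client code)
// defeats the compile-time bound, so a non-interface or non-Attribute class
// can still reach addAttribute(). With the fix it throws IAE at runtime.
public class UncheckedCastDemo {
  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    AttributeSource src = new AttributeSource();
    // compiles with only an "unchecked" warning; fails at runtime:
    src.addAttribute((Class) List.class);
  }
}
{code}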
[jira] Updated: (LUCENE-2087) Remove recursion in NumericRangeTermEnum
[ https://issues.apache.org/jira/browse/LUCENE-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2087:
-

Fix Version/s: 3.0

> Remove recursion in NumericRangeTermEnum
> ---
>
> Key: LUCENE-2087
> URL: https://issues.apache.org/jira/browse/LUCENE-2087
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9, 2.9.1, 3.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Minor
> Fix For: 3.0, 3.1
>
> Attachments: LUCENE-2087.patch
>
> The current FilteredTermEnum in NRQ uses setEnum(), which itself calls next().
> This may lead to recursion that can overflow the stack if the index is empty
> and a large range with a low precStep is used. With 64-bit numbers and
> precStep == 1 there may be 127 recursions, as each sub-range would hit no term
> on an empty index and the setEnum() call would then call next(), which itself
> calls setEnum() again. This leads to a recursion depth of 256.
> The attached patch converts this to an iterative approach. setEnum() is now
> unused and throws UOE (like endEnum()).
[jira] Updated: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
[ https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2088:
-

Attachment: LUCENE-2088-test.patch

This patch shows how you can break it. As Shai said, the problem is not only that it may have no effect; it completely breaks the behaviour of AttributeSource when you do this. Because of that, the extra check is needed.
[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
[ https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781129#action_12781129 ]

Uwe Schindler commented on LUCENE-2088:
-

Thinking about it more, and reading http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6461827, maybe we should remove it again. But addAttributeImpl already does a lot of isAssignableFrom checks (though cached), so maybe we should remove it for 3.0/3.1. In 2.9 it must stay. What do others think?
[jira] Commented: (LUCENE-2088) AttributeSource.addAttribute should only accept interfaces, the missing test leads to problems with Token.TOKEN_ATTRIBUTE_FACTORY
[ https://issues.apache.org/jira/browse/LUCENE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781131#action_12781131 ]

Uwe Schindler commented on LUCENE-2088:
-

But it's no problem anymore; the Sun bug has been fixed since:

Release Fixed: 6u2(b01), 5.0u12(b02) (Bug ID: 2144702), hs10(b07) (Bug ID: 2146432), 7(b07) (Bug ID: 2176843)

Let's keep it in.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781134#action_12781134 ] Michael McCandless commented on LUCENE-1606: Are we going to deprecate contrib/regex with this? > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
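[Editor's note: a rough sketch of the enumerate-and-seek loop described in the issue above; collect() and nextValidSeekTerm() are hypothetical stand-ins, and the actual patch structures this differently:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import dk.brics.automaton.RunAutomaton;

    abstract class EnumerateAndSeek {
      abstract void collect(Term t);                  // record a match
      abstract String nextValidSeekTerm(String term); // smallest string > term
                                                      // that could still be accepted

      void enumerate(IndexReader reader, String field, RunAutomaton dfa)
          throws IOException {
        TermEnum termEnum = reader.terms(new Term(field, ""));
        try {
          while (true) {
            Term t = termEnum.term();
            if (t == null || !t.field().equals(field)) break; // end of field
            if (dfa.run(t.text())) {
              collect(t);                   // DFA accepts: a real match
              if (!termEnum.next()) break;  // advance linearly
            } else {
              String next = nextValidSeekTerm(t.text());
              if (next == null) break;      // no accepting terms can follow
              termEnum.close();
              termEnum = reader.terms(new Term(field, next)); // seek ahead
            }
          }
        } finally {
          termEnum.close();
        }
      }
    }

The seek step is what removes the dependence on a constant prefix: rejected runs of the term dictionary are skipped wholesale instead of scanned.]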
[jira] Created: (LUCENE-2089) explore using automaton for fuzzyquery
explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea * up front, calculate the maximum required N edits needed to match the users supplied float threshold. * for at least common N (1,2,3, etc) we should use automatontermenum. if its outside of that, maybe use the existing slow logic. At high N, it will seek too much to be helpful anyway. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable N, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
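[Editor's note: for the first bullet, a minimal sketch of deriving the edit count K from FuzzyQuery's float threshold, assuming the usual similarity = 1 - edits/length scaling; the rounding policy is an assumption:

    // similarity = 1 - edits/length  =>  edits <= (1 - similarity) * length
    static int maxEdits(float minSimilarity, int termLength) {
      return (int) ((1.0f - minSimilarity) * termLength);
    }
    // e.g. maxEdits(0.5f, 7) == 3, so precomputed tables for K = 1..3
    // would cover a 7-character term at that threshold.]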
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781135#action_12781135 ] Robert Muir commented on LUCENE-1606: - bq. Are we going to deprecate contrib/regex with this? I would argue against that, only because the other regex impl's have different features and syntax, even if they are slow. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[VOTE] Release Apache Lucene Java 3.0.0 (take #2)
Hi, I have built the artifacts for the final release of "Apache Lucene Java 3.0.0" a second time, because of a bug in the TokenStream API (found by Shai Erera, who wanted to do "bad" things with addAttribute, breaking its behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to prevent stack overflow, LUCENE-2087). They are targeted for release on 2009-11-25. The artifacts are here: http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/ You will find the changes in the corresponding subfolder. The SVN revision is 883080; here is the manifest with build system info: Manifest-Version: 1.0 Ant-Version: Apache Ant 1.7.0 Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.) Specification-Title: Lucene Search Engine Specification-Version: 3.0.0 Specification-Vendor: The Apache Software Foundation Implementation-Title: org.apache.lucene Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49 Implementation-Vendor: The Apache Software Foundation X-Compile-Source-JDK: 1.5 X-Compile-Target-JDK: 1.5 Please vote to officially release these artifacts as "Apache Lucene Java 3.0.0". We need at least 3 binding (PMC) votes. Thanks everyone for all their hard work on this, and I am very sorry for requesting a vote again, but that's life! Thanks Shai for the pointer to the bug! Here is the proposed release note, please edit if needed: -- Hello Lucene users, On behalf of the Lucene dev community (a growing community far larger than just the committers) I would like to announce the release of Lucene Java 3.0: The new version is mostly a cleanup release without any new features. All deprecations targeted to be removed in version 3.0 were removed. If you are upgrading from version 2.9.1 of Lucene, you have to fix all deprecation warnings in your code base to be able to recompile against this version. This is the first Lucene release with Java 5 as a minimum requirement. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use this version for new developments, because it has a clean, type-safe new API. Upgrading users can now remove unnecessary casts and add generics to their code, too. If you have not upgraded your installation to Java 5, please read the file JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene 3.0; it will also happen with any previous release when you upgrade your Java environment). Lucene 3.0 has some changes regarding compressed fields: 2.9 already deprecated compressed fields; support for them has now been removed. Lucene 3.0 is still able to read indexes with compressed fields, but as soon as merges occur or the index is optimized, all compressed fields are decompressed and converted to Field.Store.YES. Because of this, indexes with compressed fields can suddenly get larger. While we generally try to maintain full backwards compatibility between major versions, Lucene 3.0 has some minor breaks, mostly related to deprecation removal, pointed out in the 'Changes in backwards compatibility policy' section of CHANGES.txt. Notable are: - IndexReader.open(Directory) now opens in read-only mode by default (this method was deprecated for that reason in 2.9). The same applies to IndexSearcher. - As already started in 2.9, core TokenStreams are now made final to enforce the decorator pattern. - If you interrupt an IndexWriter merge thread, IndexWriter now throws an unchecked ThreadInterruptedException that extends RuntimeException and clears the interrupt status.
-- Thanks, Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
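[Editor's note: to illustrate the IndexReader.open change called out in the release note above, a small sketch; the index path is a placeholder:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class OpenReadOnlyDemo {
      public static void main(String[] args) throws IOException {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexReader readOnly = IndexReader.open(dir);        // read-only in 3.0
        IndexReader writable = IndexReader.open(dir, false); // explicit read-write
        writable.close();
        readOnly.close();
        dir.close();
      }
    }]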
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781136#action_12781136 ] Mark Miller commented on LUCENE-2089: - bq. (i will assign this to him, I know he is itching to write that nasty algorithm ha - too much wine last night to laugh that hard this morning. Painful. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required N edits needed to match the users > supplied float threshold. > * for at least common N (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high N, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable N, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2089: Description: Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least common K (1,2,3, etc) we should use automatontermenum. if its outside of that, maybe use the existing slow logic. At high N, it will seek too much to be helpful anyway. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. was: Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea * up front, calculate the maximum required N edits needed to match the users supplied float threshold. * for at least common N (1,2,3, etc) we should use automatontermenum. if its outside of that, maybe use the existing slow logic. At high N, it will seek too much to be helpful anyway. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. 
instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable N, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. modify description to be readable, K is number of edits, N refers to length of word. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm)
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2089: Description: Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least common K (1,2,3, etc) we should use automatontermenum. if its outside of that, maybe use the existing slow logic. At high K, it will seek too much to be helpful anyway. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. was: Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least common K (1,2,3, etc) we should use automatontermenum. if its outside of that, maybe use the existing slow logic. At high N, it will seek too much to be helpful anyway. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. 
instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front,
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781138#action_12781138 ] Uwe Schindler commented on LUCENE-2089: --- bq. ha - too much wine last night to laugh that hard this morning. Painful. I need more beer, after that 3.0.0 problem with the AttributeSource API :-( (LUCENE-2088). > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781139#action_12781139 ] Robert Muir commented on LUCENE-2089: - by the way, the only open impl of this algorithm i could find is at http://rrette.com/moman.html (ZSpell) in python. the download link for 0.2 is not available, and it appears from the website this is the version with the updated algorithm. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
was this why i saw strange benchmark results?
On Sun, Nov 22, 2009 at 9:52 AM, wrote:
> Author: mikemccand
> Date: Sun Nov 22 14:52:02 2009
> New Revision: 883088
>
> URL: http://svn.apache.org/viewvc?rev=883088&view=rev
> Log:
> LUCENE-1458 (on flex branch): small optimization to terms dict cache: don't
> store redundant TermRef
>
> Modified:
>
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
>
>
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
>
> Modified:
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
> URL:
> http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java?rev=883088&r1=883087&r2=883088&view=diff
>
> ===================================================================
> ---
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
> (original)
> +++
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
> Sun Nov 22 14:52:02 2009
> @@ -36,6 +36,8 @@
> copy(text);
> }
>
> + // nocommit: we could do this w/ UnicodeUtil w/o requiring
> + // allocation of new bytes[]?
> public void copy(String text) {
> try {
> bytes = text.getBytes("UTF-8");
>
> Modified:
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
> URL:
> http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java?rev=883088&r1=883087&r2=883088&view=diff
>
> ===================================================================
> ---
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
> (original)
> +++
> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
> Sun Nov 22 14:52:02 2009
> @@ -304,6 +304,7 @@
> public SeekStatus seek(TermRef term) throws IOException {
> ReuseLRUCache cache = null;
> CacheEntry entry = null;
> +TermRef entryKey = null;
>
> if (docs.canCaptureState()) {
> final ThreadResources resources = getThreadResources();
> @@ -312,7 +313,7 @@
> entry = cache.get(term);
> if (entry != null) {
> docFreq = entry.freq;
> -bytesReader.term.copy(entry.term);
> +bytesReader.term.copy(term);
> docs.setState(entry, docFreq);
> termUpto = entry.termUpTo;
> // nocommit -- would be better to do this lazy?
> @@ -384,16 +385,17 @@
> entry = cache.eldest;
> cache.eldest = null;
> docs.captureState(entry);
> -entry.term.copy(bytesReader.term);
> +entryKey = cache.eldestKey;
> +entryKey.copy(bytesReader.term);
> } else {
> entry = docs.captureState(null);
> -entry.term = (TermRef) bytesReader.term.clone();
> +entryKey = (TermRef) bytesReader.term.clone();
> }
> entry.freq = docFreq;
> entry.termUpTo = termUpto;
> entry.filePointer = in.getFilePointer();
>
> - cache.put(entry.term, entry);
> + cache.put(entryKey, entry);
> }
> return SeekStatus.FOUND;
> } else if (cmp > 0) {
> @@ -517,9 +519,8 @@
>
> // nocommit -- scrutinize API
> public static class CacheEntry {
> -int termUpTo;
> -TermRef term; // nocommit -- really needed?
> -long filePointer;
> +int termUpTo; // ord for this term
> +long filePointer; // fp into the terms
> dict primary file (_X.tis)
>
> // nocommit -- belongs in Pulsing's CacheEntry class:
> public int freq;
> @@ -563,6 +564,7 @@
> private final static float LOADFACTOR = 0.75f;
> private int cacheSize;
> V eldest;
> +K eldestKey;
>
> /**
> * Creates a last-recently-used cache with the specified size.
> @@ -580,6 +582,7 @@
> boolean remove = size() > ReuseLRUCache.this.cacheSize;
> if (remove) {
> this.eldest = eldest.getValue();
> +this.eldestKey = eldest.getKey();
> }
> return remove;
> }
>
>
>
--
Robert Muir
[email protected]
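[Editor's note: the core of the change is easier to see outside the diff; a compact sketch of the LinkedHashMap-based cache it modifies, with details beyond what the diff shows treated as assumptions:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // An LRU cache that remembers the evicted entry's value *and* key,
    // so the caller can reuse both for the next insertion.
    class ReuseLRUCache<K, V> extends LinkedHashMap<K, V> {
      private final int cacheSize;
      V eldest;     // value of the most recently evicted entry
      K eldestKey;  // its key, available for in-place reuse

      ReuseLRUCache(int cacheSize) {
        super((int) (cacheSize / 0.75f) + 1, 0.75f, true); // access order
        this.cacheSize = cacheSize;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        boolean remove = size() > cacheSize;
        if (remove) {
          this.eldest = eldest.getValue();
          this.eldestKey = eldest.getKey();
        }
        return remove;
      }
    }

Capturing eldestKey is what lets the diff drop the TermRef stored inside CacheEntry: the put() is keyed on the recycled eldestKey rather than on a term the entry carried around redundantly.]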
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781140#action_12781140 ] Robert Muir commented on LUCENE-2089: - I hope its obvious from the benchmark why we shouldn't use the crappy prototype, and optimize K=1 (which is probably very common). if someone has a small index, with very long terms (bio sequences, or something), then fuzzy would actually get slower. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781142#action_12781142 ] Mark Miller commented on LUCENE-2089: - I'll take a look anyway - too bad I can't find my stupid automata text book - already bought the stupid thing twice in my life. Python translation sounds easier anyway :) > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781144#action_12781144 ] Robert Muir commented on LUCENE-2089: - bq. I'll take a look anyway - too bad I can't find my stupid automata text book - already bought the stupid thing twice in my life. Python translation sounds easier anyway you will need more wine. If it starts to catch, I'll be happy to offer help with use of the automaton API, etc. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781149#action_12781149 ] Mark Miller commented on LUCENE-2089: - bq. we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. I wonder what the best way to handle this would be - these transition tables appear to get very large very fast. Looks like the python stuff is now calculating them on the fly. You wouldn't want to load them all up into RAM for those not using them. Perhaps some sort of lru cache load on demand or something - though can't really fully warm up that way. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
No, not really... just an optimization I found when hunting ;)
I'm working now on an AutomatonTermsEnum that uses the flex API
directly, to test that performance.
One of the major challenges with flex is the 4-way testing required.
Ie, you can have a non-flex or flex index, and then you can access it
via non-flex or flex API. All 4 are allowed, and must work (for
back-compat).
I'm most concerned about performance of flex API on top of flex index,
since that's the future, but not hurting performance of the other 3 is
also important.
Mike
On Sun, Nov 22, 2009 at 10:22 AM, Robert Muir wrote:
> was this why i saw strange benchmark results?
>
> On Sun, Nov 22, 2009 at 9:52 AM, wrote:
>>
>> Author: mikemccand
>> Date: Sun Nov 22 14:52:02 2009
>> New Revision: 883088
>>
>> URL: http://svn.apache.org/viewvc?rev=883088&view=rev
>> Log:
>> LUCENE-1458 (on flex branch): small optimization to terms dict cache:
>> don't store redundant TermRef
>>
>> Modified:
>>
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
>>
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
>>
>> Modified:
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
>> URL:
>> http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java?rev=883088&r1=883087&r2=883088&view=diff
>>
>> ===================================================================
>> ---
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
>> (original)
>> +++
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/TermRef.java
>> Sun Nov 22 14:52:02 2009
>> @@ -36,6 +36,8 @@
>> copy(text);
>> }
>>
>> + // nocommit: we could do this w/ UnicodeUtil w/o requiring
>> + // allocation of new bytes[]?
>> public void copy(String text) {
>> try {
>> bytes = text.getBytes("UTF-8");
>>
>> Modified:
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
>> URL:
>> http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java?rev=883088&r1=883087&r2=883088&view=diff
>>
>> ===================================================================
>> ---
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
>> (original)
>> +++
>> lucene/java/branches/flex_1458/src/java/org/apache/lucene/index/codecs/standard/StandardTermsDictReader.java
>> Sun Nov 22 14:52:02 2009
>> @@ -304,6 +304,7 @@
>> public SeekStatus seek(TermRef term) throws IOException {
>> ReuseLRUCache cache = null;
>> CacheEntry entry = null;
>> + TermRef entryKey = null;
>>
>> if (docs.canCaptureState()) {
>> final ThreadResources resources = getThreadResources();
>> @@ -312,7 +313,7 @@
>> entry = cache.get(term);
>> if (entry != null) {
>> docFreq = entry.freq;
>> - bytesReader.term.copy(entry.term);
>> + bytesReader.term.copy(term);
>> docs.setState(entry, docFreq);
>> termUpto = entry.termUpTo;
>> // nocommit -- would be better to do this lazy?
>> @@ -384,16 +385,17 @@
>> entry = cache.eldest;
>> cache.eldest = null;
>> docs.captureState(entry);
>> - entry.term.copy(bytesReader.term);
>> + entryKey = cache.eldestKey;
>> + entryKey.copy(bytesReader.term);
>> } else {
>> entry = docs.captureState(null);
>> - entry.term = (TermRef) bytesReader.term.clone();
>> + entryKey = (TermRef) bytesReader.term.clone();
>> }
>> entry.freq = docFreq;
>> entry.termUpTo = termUpto;
>> entry.filePointer = in.getFilePointer();
>>
>> - cache.put(entry.term, entry);
>> + cache.put(entryKey, entry);
>> }
>> return SeekStatus.FOUND;
>> } else if (cmp > 0) {
>> @@ -517,9 +519,8 @@
>>
>> // nocommit -- scrutinize API
>> public static class CacheEntry {
>> - int termUpTo;
>> - TermRef term; // nocommit -- really needed?
>> - long filePointer;
>> + int termUpTo; // ord for this term
>> + long filePointer; // fp into the terms
>> dict primary file (_X.tis)
>>
>> // nocommit -- belongs in Pulsing's CacheEntry class:
>> public int freq;
>> @@ -563,6 +564,7 @@
>> private final static float LOADFACTOR = 0.75f;
>> private int cacheSize;
>> V eldest;
>> + K eldestKey;
>>
>> /**
>> * Creates a last-recently-used cache with the specified size.
>> @@ -580,6 +582,7 @@
>> boolean remove = size() > ReuseLRUCache.this.cacheSize;
>>
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781151#action_12781151 ] Robert Muir commented on LUCENE-2089: - Mark, they would get large fast, but i think we only need say 1,2,3. once you go high K, it will be many disk seeks anyway, and 'linear' mode like it does now will be probably faster. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
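[Editor's note: a sketch of the dispatch Robert describes; buildAutomatonEnum() and levenshteinAutomaton() are hypothetical stand-ins for classes the patch would provide:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.FuzzyTermEnum;
    import dk.brics.automaton.Automaton;

    abstract class FuzzyDispatch {
      static final int MAX_PRECOMPUTED_K = 3; // only small K has tables

      abstract TermEnum buildAutomatonEnum(IndexReader reader, Automaton dfa)
          throws IOException;
      abstract Automaton levenshteinAutomaton(Term term, int k);

      TermEnum fuzzyEnum(IndexReader reader, Term term, int k)
          throws IOException {
        if (k <= MAX_PRECOMPUTED_K) {
          // cheap: walk the term dictionary with the precomputed K-edit DFA
          return buildAutomatonEnum(reader, levenshteinAutomaton(term, k));
        }
        // high K: the DFA would seek too much; keep the existing linear scan
        return new FuzzyTermEnum(reader, term);
      }
    }]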
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781154#action_12781154 ] Robert Muir commented on LUCENE-2089: - Another twist, is that we have to support the 'constant prefix' as well. the easiest way would be that if the user asks for this, we intersect the automaton with a 'prefix automaton' such as abc.* but the BasicOperations.intersection is quadratic in # of states, because it supports both NFA and DFA we might need to implement a better DFA-only intersection alg for this constant prefix, so that supplying constant prefix doesnt actually make things slower :) > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
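[Editor's note: with the BRICS API that could look roughly like this; a sketch only, since the DFA-only product Robert mentions would replace the generic intersection:

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.BasicAutomata;
    import dk.brics.automaton.BasicOperations;

    final class PrefixIntersection {
      // Constrain a fuzzy DFA to terms with a constant prefix by
      // intersecting it with the automaton for "<prefix>.*".
      static Automaton withPrefix(Automaton fuzzy, String prefix) {
        Automaton prefixAuto = BasicOperations.concatenate(
            BasicAutomata.makeString(prefix), BasicAutomata.makeAnyString());
        return BasicOperations.intersection(fuzzy, prefixAuto);
      }
    }]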
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 11:23 AM, Michael McCandless <[email protected]> wrote:
> No, not really... just an optimization I found when hunting ;)
>
> I'm working now on an AutomatonTermsEnum that uses the flex API
> directly, to test that performance.

I didn't mean to 'bail out' on this but I could not tell if TermsEnum was close to stabilized, and it might be significant work to convert it? Maybe benching numeric range would be easier and accomplish the same thing?

--
Robert Muir
[email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781155#action_12781155 ]
Michael McCandless commented on LUCENE-1606:
{quote}
bq. Are we going to deprecate contrib/regex with this?
I would argue against that, only because the other regex impls have different
features and syntax, even if they are slow.
{quote}
Ahh OK, I agree then. I didn't realize this new query doesn't subsume
contrib's. Would be good to call out what's different in the javadocs
somewhere...
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are
> really slow if there is no constant prefix. This implementation does not
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as URLs where the prefix is not constant (http://
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert
> regular expressions into a DFA. Then, the filter "enumerates" terms in a
> special way, by using the underlying state machine. Here is my short
> description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a
> binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the
> DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded
> from http://www.brics.dk/automaton/ and is BSD-licensed.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781156#action_12781156 ]
Robert Muir commented on LUCENE-1606:
-
bq. Would be good to call out what's different in the javadocs somewhere...
Good idea; let me know if you have some suggested wording. It's really a general issue: the supported features and syntax of even the existing regex implementations in contrib are different, I think? (i.e. they are not compatible: you cannot just swap impls around without testing that the one you pick supports the syntax and features you are using)
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781157#action_12781157 ]
Mark Miller commented on LUCENE-2089:
-
bq. Mark, they would get large fast, but I think we only need, say, 1, 2, 3.
Ah, right - you mention that above in the summary.
bq. Another twist is that we have to support the 'constant prefix' as well.
Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either, though, right? I guess it depends on your terms.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781160#action_12781160 ]
Michael McCandless commented on LUCENE-1606:
I don't have any wording -- I don't really know the differences :) If it's "only" that the syntax is different, that's one thing... but if e.g. certain functionality isn't possible w/ new or old, that's another.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781161#action_12781161 ]
Robert Muir commented on LUCENE-2089:
-
bq. Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either, though, right? I guess it depends on your terms.
Not so bad, but not so good either. See my wildcard results, for example (on the 10M equally distributed terms):
||Pattern||Iter||AvgHits||AvgMS (old)||AvgMS (new)||
|N?N?N?N|10|1000.0|288.6|38.5|
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781162#action_12781162 ]
Robert Muir commented on LUCENE-1606:
-
bq. If it's "only" that the syntax is different, that's one thing... but if e.g. certain functionality isn't possible w/ new or old, that's another.
From a glance, it appears to me that both the syntax and functionality of our two contrib impls (java.util and jakarta) are very different.
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781162#action_12781162 ]
Robert Muir edited comment on LUCENE-1606 at 11/22/09 5:06 PM:
---
bq. If it's "only" that the syntax is different, that's one thing... but if
e.g. certain functionality isn't possible w/ new or old, that's another.
From a glance, it appears to me that both the syntax and functionality of our
two contrib impls (java.util and jakarta) are very different.
Here is one example: java.util supports reluctant {m,n} closures; jakarta does
not, and it says so right in its javadocs:
http://jakarta.apache.org/regexp/apidocs/org/apache/regexp/RE.html
"Should RE support reluctant {m,n} closures (does anyone care)?"
But it does support reluctant versus greedy for other operators.
In automaton, this concept of reluctant versus greedy does not even exist, as
spelled out on their page:
"The * operator is mathematically the Kleene star operator (i.e. we don't have
greedy/reluctant/possessive variants)."
http://www.brics.dk/automaton/faq.html
This is an example where all three are different... I guess I kind of assumed
everyone was aware that all these regex packages are very different.
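To illustrate the greedy/reluctant distinction concretely (this example is ours, not from the issue), java.util.regex treats {m,n} as greedy and {m,n}? as reluctant, while a DFA package like BRICS has only the mathematical closure:
{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClosureDemo {
  public static void main(String[] args) {
    String input = "aaaa";
    // Greedy {1,3}: consumes as many 'a's as allowed -> matches "aaa"
    Matcher greedy = Pattern.compile("a{1,3}").matcher(input);
    if (greedy.find()) System.out.println(greedy.group());       // prints "aaa"
    // Reluctant {1,3}?: consumes as few as possible -> matches "a"
    Matcher reluctant = Pattern.compile("a{1,3}?").matcher(input);
    if (reluctant.find()) System.out.println(reluctant.group()); // prints "a"
    // A DFA-based matcher only answers "is the whole string in the
    // language?", so the greedy-vs-reluctant question never arises.
  }
}
{code}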
was (Author: rcmuir):
bq. If it's "only" that the syntax is different, that's one thing... but if
e.g. certain functionality isn't possible w/ new or old, that's another.
From a glance, it appears to me that both the syntax and functionality of our
two contrib impls (java.util and jakarta) are very different.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781167#action_12781167 ]
Mark Miller commented on LUCENE-2089:
-
Right, I wouldn't expect it to be great with a single-char prefix - especially with the random terms you're making. But I think that's a much worse case than a slightly longer prefix with a normal term distribution.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781168#action_12781168 ]
Robert Muir commented on LUCENE-1606:
-
We call this out nicely in the current RegexQuery:
"The expressions supported depend on the regular expression implementation used
by way of the RegexCapabilities interface."
What should I say for the automaton implementation? It already has a javadoc
link to the precise syntax supported, so in my opinion it's actually less
ambiguous than contrib RegexQuery.
But maybe improve this; instead of
{code}
The supported syntax is documented in the {@link RegExp} class.
{code}
maybe:
{code}
The supported syntax is documented in the {@link RegExp} class.
warning: this might not be the syntax you are used to!
{code}
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170 ]
Robert Muir commented on LUCENE-2089:
-
Mark, maybe, though it also depends largely on the size of the "alphabet" in the term dictionary. For my test, the alphabet is pretty small: [0-9]. With smaller alphabets, say [GACD], the automaton enumeration becomes even more effective over any "prefix" mechanism, so the performance is language-dependent.
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170 ]
Robert Muir edited comment on LUCENE-2089 at 11/22/09 5:44 PM:
---
Mark, maybe, though it also depends largely on the size of the "alphabet" in the term dictionary. For my test, the alphabet is pretty small: [0-9]. With smaller alphabets, say [GACD], the automaton enumeration becomes even more effective over any "prefix" mechanism, so the performance is language-dependent.
Also, Mark, you said you want to "scale" fuzzy query, right? Taking a linear algorithm and dividing it by a constant (the alphabet size), which is what a constant prefix does, does not improve the scalability. For English, it just makes your fuzzy query 26 times faster. So IMO it would be better to improve the real scalability... the constant prefix is just an optimization/hack and does not really solve the problem.
was (Author: rcmuir):
Mark, maybe, though it also depends largely on the size of the "alphabet" in the term dictionary. For my test, the alphabet is pretty small: [0-9]. With smaller alphabets, say [GACD], the automaton enumeration becomes even more effective over any "prefix" mechanism, so the performance is language-dependent.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781173#action_12781173 ]
Mark Miller commented on LUCENE-2089:
-
bq. the constant prefix is just an optimization/hack
Right - which is part of why I am not so worried about handling it with a new impl - it's really only there now because the thing is so slow without it. If we really needed to support it, it wouldn't be so bad to fall back to the old logic as it is. Though if we can support it with the new impl, fine - but I don't think I would go out of the way for it myself.
bq. For English, it just makes your fuzzy query 26 times faster
With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174 ]
Robert Muir commented on LUCENE-2089:
-
bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.
You write that crazy algorithm from the paper, and I will do the easy part (fast DFA intersection) so we can use the constant prefix.
We must "use" the prefix, so the results are the same as the old FuzzyQuery for back compat; might as well do it right.
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174 ]
Robert Muir edited comment on LUCENE-2089 at 11/22/09 6:01 PM:
---
bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to
get much benefit - but I wouldn't think we care about a prefix of 1 with the new
impl - just don't use a prefix. It's just a hack to somewhat get around the
current scalability issues.
You write that crazy algorithm from the paper, and I will do the easy part
(fast DFA intersection) so we can use the constant prefix.
We must "use" the prefix, so the results are the same as the old FuzzyQuery for
back compat; might as well do it right.
btw, I am implying here that we should never do a real Levenshtein dynamic
programming calculation if we do not have to.
For example, in my k=1 prototype, it is never used:
{code}
public float difference() {
  if (currentTerm == null || currentTerm.equals(term))
    return 1.0F;
  else
    return 1.0F - (1.0F / (Math.max(currentTerm.text().length(),
                                    term.text().length())));
}
{code}
Instead, if up to k=3 is required, we should do these steps:
* equals()? then 1.0F
* matches the k=1 automaton?
* matches the k=2 automaton?
* matches the k=3 automaton?
Even for k=3 I am sure this is likely faster on average than doing Levenshtein,
especially if the query matches many terms:
automaton matching is linear time, and just array access.
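A minimal sketch (ours, not the patch) of that staged check, using BRICS RunAutomaton probes; lev1/lev2/lev3 are hypothetical automata precomputed for the query term, and the score mirrors the 1 - k/minLength convention of the old FuzzyQuery:
{code}
import dk.brics.automaton.RunAutomaton;

// Hedged sketch: score a candidate term by probing precomputed Levenshtein
// automata for k = 1, 2, 3 instead of running the dynamic-programming
// Levenshtein calculation. RunAutomaton.run() is linear-time array access.
static float difference(String candidate, String queryTerm,
                        RunAutomaton lev1, RunAutomaton lev2, RunAutomaton lev3) {
  if (candidate.equals(queryTerm))
    return 1.0f;                         // exact match
  final int k;
  if (lev1.run(candidate))      k = 1;   // within 1 edit
  else if (lev2.run(candidate)) k = 2;   // within 2 edits
  else if (lev3.run(candidate)) k = 3;   // within 3 edits
  else return 0.0f;                      // the enumerator should never hand us this
  return 1.0f - ((float) k / Math.min(candidate.length(), queryTerm.length()));
}
{code}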
was (Author: rcmuir):
bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix
to get much benefit - but I wouldn't think we care about a prefix of 1 with the
new impl - just don't use a prefix. It's just a hack to somewhat get around the
current scalability issues.
You write that crazy algorithm from the paper, and I will do the easy part
(fast DFA intersection) so we can use the constant prefix.
We must "use" the prefix, so the results are the same as the old FuzzyQuery for
back compat; might as well do it right.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]
Mark Miller commented on LUCENE-2089:
-
bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat
I'm not so worried about that myself. The prefix on the old Fuzzy is just there to get around scalability - we could just deprecate it and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir wrote:
>> No, not really... just an optimization I found when hunting ;)
>>
>> I'm working now on an AutomatonTermsEnum that uses the flex API
>> directly, to test that performance.
>
> I didn't mean to 'bail out' on this

You didn't 'bail out'; I 'bailed in' ;) This is the joy of open source... great big noisy Bazaar.

> but I could not tell if TermsEnum was close to stabilized

I think it's close; we need to do this port anyway, once automaton is committed to trunk, so really I saved Mark some work ;)

> and it might be significant work to convert it?

It wasn't too bad, but maybe you can look it over once I post the patch and see if I messed anything up :)

> Maybe benching numeric range would be easier and accomplish the same thing?

Yeah, benching NRQ would be good too... many benchmarks still to run.

Mike
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781178#action_12781178 ]
Michael McCandless commented on LUCENE-1606:
OK, that warning seems good. Maybe also reference contrib/regex as another alternative, noting that the syntax/capabilities are different?
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]
Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:06 PM:
---
bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat
I'm not so worried about that myself. The prefix on the old Fuzzy is just there to get around scalability - we could just deprecate it and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?
*edit* Basically, what I'm saying is the old FuzzyQuery is just garbage, really - I've even seen an email where Doug talked about just dropping it; he didn't like the non-scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old one and saying we are starting over with a "real" FuzzyQuery - the old one goes away on the next major release; anyone dying to stick with it can just pull the query into their own code. I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version.
was (Author: [email protected]):
bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat
I'm not so worried about that myself. The prefix on the old Fuzzy is just there to get around scalability - we could just deprecate it and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781179#action_12781179 ] Robert Muir commented on LUCENE-2089: - bq. Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email were Doug talked about just dropping it. I agree with garbage, but we cannot completely replace it anyway. for example what if someone supplies a term of length 54 and asks for distance of 0.5? we should not use this algorithm for nonsense like that, in that case I think they should just use the garbage algorithm. Here is a quote from that moman page: It means that in theory, you could ask for a Levenshtein distance of 27! Well, if you have a week ahead of you... we shouldnt burn cycles creating useless tables that will be huge arrays either in fuzzyquery, or whatever. we can't compute all the way up to infinity, this is why i think something like 1,2,3 is "reasonable" , if it requires more edits than that, go with the old brute force algorithm? > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
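Robert's length-54 example is just arithmetic on the threshold. A back-of-envelope sketch of the "up front, calculate the maximum required K edits" step (the scaling mirrors how the FuzzyTermEnum-era code relates similarity to edit distance; treat the exact rounding as an assumption):

{code:java}
public class MaxEditsSketch {
  // Classic FuzzyTermEnum-style scaling: similarity ~ 1 - edits/length,
  // so the worst-case edit count grows linearly with the term length.
  static int maxEdits(float minSimilarity, int termLength) {
    return (int) ((1.0f - minSimilarity) * termLength);
  }

  public static void main(String[] args) {
    System.out.println(maxEdits(0.5f, 4));   // 2  -> automaton-friendly
    System.out.println(maxEdits(0.5f, 54));  // 27 -> the "week ahead of you" case
  }
}
{code}

The same 0.5 threshold that needs only K=2 on a short term demands K=27 on a 54-character term, which is why a fixed small cutoff (1, 2, 3) plus a fallback is the natural split.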
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
Mike, I guess what I am implying is should i even bother with lucene-1606 and trunk? or instead, should i be helping you, looking at TermsEnum, and working on integrating it into flex? On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless < [email protected]> wrote: > On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir wrote: > > >> No, not really... just an optimization I found when hunting ;) > >> > >> I'm working now on an AutomatonTermsEnum that uses the flex API > >> directly, to test that performance. > >> > > > > I didn't mean to 'bail out' on this > > You didn't 'bail out'; I 'bailed in' ;) This is the joy of open > source... great big noisy Bazaar. > > > but I could not tell if TermsEnum was close to stabilized > > I think it's close; we need to do this port anyway, once automaton is > committed to trunk, so really I saved Mark some work ;) > > > and it might be significant work to convert it? > > It wasn't too bad, but maybe you can look it over once I post patch > and see if I messed anything up :) > > > Maybe benching numeric range would be easier and accomplish the same > thing? > > Yeah benching NRQ would be good too... many benchmarks still to run. > > Mike > > - > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Robert Muir [email protected]
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ] Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:12 PM: --- bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem? *edit* Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old and saying we are starting over with a "real" fuzzyquery - the old one goes away on the next major - anyone dying to stick with it can just pull the query into their own code. I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version. *edit* Not that I can't see people wanting just the tail end to be fuzzy - in that case, fine - do your dirty work :) I just don't know how much of a real req that is compared to a prefix query. Perhaps its more useful than I'd guess myself. I think we are quite a ways from getting this alg implemented anyway :) Though I am trying to track down the latest python code. was (Author: [email protected]): bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem? *edit* Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old and saying we are starting over with a "real" fuzzyquery - the old one goes away on the next major - anyone dying to stick with it can just pull the query into their own code. I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. 
> * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781180#action_12781180 ] Mark Miller commented on LUCENE-2089: - bq. if it requires more edits than that, go with the old brute force algorithm? Perhaps - but it should be an option. We don't want perf to just drop off a cliff (and thats an understatement on a large index), just because the input crossed some line. I'd almost prefer the input just doesn't work - in my mind FuzzyQuery just doesn't work on large indices. But far be it from me to take away the option ... I wouldnt really like it if it was default automatic - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
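A hypothetical guard along the lines Mark is asking for -- every name here is invented and the cutoff constant is assumed, but it shows the shape of "explicit option instead of silent cliff":

{code:java}
public class FuzzyDispatchSketch {
  static final int MAX_AUTOMATON_K = 3;   // assumed precomputed-table limit

  enum Fallback { BRUTE_FORCE, REJECT }

  static String choose(int k, Fallback fallback) {
    if (k <= MAX_AUTOMATON_K) {
      return "AutomatonTermEnum(k=" + k + ")";   // the fast path
    }
    if (fallback == Fallback.BRUTE_FORCE) {
      return "legacy linear-scan FuzzyTermEnum"; // explicitly opted into
    }
    throw new IllegalArgumentException(
        "edit distance " + k + " exceeds " + MAX_AUTOMATON_K
        + "; enable BRUTE_FORCE explicitly if you really want this");
  }

  public static void main(String[] args) {
    System.out.println(choose(2, Fallback.REJECT));       // automaton
    System.out.println(choose(4, Fallback.BRUTE_FORCE));  // opted-in fallback
    // choose(4, Fallback.REJECT) throws: n=3 is milliseconds, n=4 an hour.
  }
}
{code}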
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-1606:
---
Attachment: LUCENE-1606-flex.patch
First cut @ cutting over to flex API attached -- note that this
applies to the flex branch, not trunk!
I made some small changes to the benchmarker: use constant score
filter mode, and print the min (not avg) time (less noise).
Also, I ported the AutomatonTermEnum to the flex API, so this is now a
better measure ("flex on flex") of what future perf will be. It's
possible there's a bug here, though TestWildcard passes.
I still need to investigate why "non-flex on non-flex" and "non-flex
on flex" perform worse.
I ran like this:
java -server -Xmx1g -Xms1g BenchWildcard
java is 1.6.0_14 64 bit, on OpenSolaris.
Results (msec is min of 10 runs each):
||Pattern||Trunk (min msec)||Flex (min msec)||
|N?N?N?N|13|18|
|?NN|1|3|
|??N|4|6|
|???|23|28|
|NNN|210|170|
|NN??NNN|3|3|
|NN?N*|7|4|
|?NN*|62|30|
|*N|4332|2576|
|N??|1|1|
Looks like flex API is faster for the slow queries. Once I fix caching on
trunk we should retest...
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are
> really slow if there is no constant prefix. This implementation does not
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http://
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert
> regular expressions into a DFA. Then, the filter "enumerates" terms in a
> special way, by using the underlying state machine. Here is my short
> description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a
> binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the
> DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded
> from http://www.brics.dk/automaton/ and is BSD-licensed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781182#action_12781182 ] Robert Muir commented on LUCENE-2089: - bq. I wouldnt really like it if it was default automatic - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops. you are completely right, it should not drop off a cliff. though i think as n increases, it will get significantly worse, because of so much seeking. we find the nice n where this is almost as bad as brute force, and cut her off there? this is basically what i did for the enum already, if you have a wildcard with leading * or a regex with .*, its faster to linear scan than to seek around. Uwe and I are referring to this as "dumb" mode versus "smart" mode. I'm gonna add a constructor to the enum, so this can be specified rather than "auto" as well. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
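A rough sketch of the constructor Robert mentions (hypothetical names, not the committed code): the mode is pinned up front, and only AUTO consults the leading-wildcard heuristic:

{code:java}
public class AutomatonTermEnumSketch {
  // DUMB = brute-force linear scan, SMART = seek around using the DFA.
  enum Mode { AUTO, DUMB, SMART }

  private final Mode mode;

  AutomatonTermEnumSketch(Mode mode) { this.mode = mode; }

  /** Decides the enumeration strategy for one query. */
  boolean useLinearScan(boolean patternHasLeadingWildcard) {
    switch (mode) {
      case DUMB:  return true;                       // caller forced brute force
      case SMART: return false;                      // caller forced seeking
      default:    return patternHasLeadingWildcard;  // the "auto" heuristic
    }
  }
}
{code}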
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781183#action_12781183 ] Robert Muir commented on LUCENE-1606: - bq. Looks like flex API is faster for the slow queries. Once I fix caching on trunk we should retest... Mike, this is cool. I like the results. it appears tentatively, that flex api is faster for both "dumb" (brute force linear reading) and "fast" (lots of seeking) modes. at least looking at NNN, and *N, which are the worst cases of both here. So it would seem its faster in every case. I'll look at what you did to port this to the TermsEnum api! > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
I think you should keep doing all LUCENE-1606 work (and, any other issues) on trunk, and then we merge down to flex branch once it's committed? We shouldn't hold up any trunk features because flex is coming... merging down every so often seems manageable so far (Mark?). I'm hoping to finish flex soonish -- largely what remains (I think!) is better testing (correctness & performance) of the 4-way combinations. I think the codecs approach is generally working well.. the fact that we have initial Pulsing & PforDelta codecs working is great. Mike On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir wrote: > Mike, I guess what I am implying is should i even bother with lucene-1606 > and trunk? > > or instead, should i be helping you, looking at TermsEnum, and working on > integrating it into flex? > > On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless > wrote: >> >> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir wrote: >> >> >> No, not really... just an optimization I found when hunting ;) >> >> >> >> I'm working now on an AutomatonTermsEnum that uses the flex API >> >> directly, to test that performance. >> >> >> > >> > I didn't mean to 'bail out' on this >> >> You didn't 'bail out'; I 'bailed in' ;) This is the joy of open >> source... great big noisy Bazaar. >> >> > but I could not tell if TermsEnum was close to stabilized >> >> I think it's close; we need to do this port anyway, once automaton is >> committed to trunk, so really I saved Mark some work ;) >> >> > and it might be significant work to convert it? >> >> It wasn't too bad, but maybe you can look it over once I post patch >> and see if I messed anything up :) >> >> > Maybe benching numeric range would be easier and accomplish the same >> > thing? >> >> Yeah benching NRQ would be good too... many benchmarks still to run. >> >> Mike >> >> - >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> > > > > -- > Robert Muir > [email protected] > - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781185#action_12781185 ] Robert Muir commented on LUCENE-1606: - Mike, I think your port to TermsEnum is correct, and its definitely faster here. One question, is it possible to speed this up further, by using UnicodeUtil/char[] conversion from TermRef instead of String? Because its trivial to use char[] with the Automaton api (even tho that is not exposed, its no problem) I use only string because of the old TermEnum api. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
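For reference, the consumer-side conversion Robert is suggesting could look roughly like this; the UnicodeUtil.UTF8toUTF16 signature below is from the 2.9-era API as best I recall, so treat it as an assumption:

{code:java}
import java.util.Arrays;

import org.apache.lucene.util.UnicodeUtil;

public class TermRefToCharsSketch {
  // Reused across terms, so the per-term String allocation goes away.
  private final UnicodeUtil.UTF16Result scratch = new UnicodeUtil.UTF16Result();

  /** Converts a TermRef-style UTF-8 slice to UTF-16 chars. */
  char[] toChars(byte[] utf8, int offset, int length) {
    UnicodeUtil.UTF8toUTF16(utf8, offset, length, scratch);
    // scratch.result is valid in [0, scratch.length); the DFA could run on
    // it directly -- the copy here only keeps the sketch self-contained.
    return Arrays.copyOf(scratch.result, scratch.length);
  }
}
{code}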
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781188#action_12781188
]
Michael McCandless commented on LUCENE-1606:
{quote}
One question, is it possible to speed this up further, by using
UnicodeUtil/char[] conversion from TermRef instead of String?
Because its trivial to use char[] with the Automaton api (even tho that is not
exposed, its no problem)
I use only string because of the old TermEnum api.
{quote}
Oh, that'd be great! It would be faster. I was bummed at how many new
String()'s I was doing...
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are
> really slow if there is no constant prefix. This implementation does not
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http://
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert
> regular expressions into a DFA. Then, the filter "enumerates" terms in a
> special way, by using the underlying state machine. Here is my short
> description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a
> binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the
> DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded
> from http://www.brics.dk/automaton/ and is BSD-licensed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
ok, I only ask because some rework of this enum could be necessary to take advantage of the new api. examples include changing it to use char[] (easy) to prevent lots of string creation, which was unavoidable with TermEnum since it is based on string. i will never mention this again, but it could also run on byte[] pretty easily. However I think high-level processing like this should use utf-16 processing, as java intended, although I'm pretty positive it would be extremely fast. On Sun, Nov 22, 2009 at 1:33 PM, Michael McCandless < [email protected]> wrote: > I think you should keep doing all LUCENE-1606 work (and, any other > issues) on trunk, and then we merge down to flex branch once it's > committed? > > We shouldn't hold up any trunk features because flex is > coming... merging down every so often seems manageable so far (Mark?). > > I'm hoping to finish flex soonish -- largely what remains (I think!) > is better testing (correctness & performance) of the 4-way > combinations. I think the codecs approach is generally working > well.. the fact that we have initial Pulsing & PforDelta codecs > working is great. > > Mike > > On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir wrote: > > Mike, I guess what I am implying is should i even bother with lucene-1606 > > and trunk? > > > > or instead, should i be helping you, looking at TermsEnum, and working on > > integrating it into flex? > > > > On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless > > wrote: > >> > >> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir wrote: > >> > >> >> No, not really... just an optimization I found when hunting ;) > >> >> > >> >> I'm working now on an AutomatonTermsEnum that uses the flex API > >> >> directly, to test that performance. > >> >> > >> > > >> > I didn't mean to 'bail out' on this > >> > >> You didn't 'bail out'; I 'bailed in' ;) This is the joy of open > >> source... great big noisy Bazaar. > >> > >> > but I could not tell if TermsEnum was close to stabilized > >> > >> I think it's close; we need to do this port anyway, once automaton is > >> committed to trunk, so really I saved Mark some work ;) > >> > >> > and it might be significant work to convert it? > >> > >> It wasn't too bad, but maybe you can look it over once I post patch > >> and see if I messed anything up :) > >> > >> > Maybe benching numeric range would be easier and accomplish the same > >> > thing? > >> > >> Yeah benching NRQ would be good too... many benchmarks still to run. > >> > >> Mike > >> > >> - > >> To unsubscribe, e-mail: [email protected] > >> For additional commands, e-mail: [email protected] > >> > > > > > > > > -- > > Robert Muir > > [email protected] > > > > - > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Robert Muir [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781190#action_12781190 ] Robert Muir commented on LUCENE-1606: - bq. Oh, that'd be great! It would be faster. I was bummed at how many new String()'s I was doing... it would be nice I think if TermRef provided a helper method to make the char[] available? i.e. I don't think i should do unicode conversion in a multitermquery? > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
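A hypothetical shape for that helper, with the conversion (and the reusable scratch buffer) living on TermRef itself so MultiTermQuery subclasses never touch UTF-8 directly -- the field and method names here are invented:

{code:java}
import org.apache.lucene.util.UnicodeUtil;

public class TermRefHelperSketch {
  byte[] bytes;          // UTF-8 term bytes, as in the flex-branch TermRef
  int offset, length;

  // Scratch buffer owned by the ref, amortized across terms.
  private final UnicodeUtil.UTF16Result utf16 = new UnicodeUtil.UTF16Result();

  /** Returns the term as UTF-16 chars; valid in [0, result.length). */
  public UnicodeUtil.UTF16Result toUTF16() {
    UnicodeUtil.UTF8toUTF16(bytes, offset, length, utf16);
    return utf16;
  }
}
{code}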
[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Kindgren updated LUCENE-1260: --- Attachment: Lucene-1260-2.patch I've added the old static methods again, but made them deprecated. In contrib/misc there is still a reference to the static encodeNorm method, maybe that should be replaced with Similarity.getDefaultSimilarity().encodeNormValue(f)? This call to the static method is only done if no similarity is passed to the FieldNormModifier. I added a short javadoc description to the static methods, not sure if that is enough? (I guess they will be removed, so the relevant javadoc is probably in the instance methods?) > Norm codec strategy in Similarity > - > > Key: LUCENE-1260 > URL: https://issues.apache.org/jira/browse/LUCENE-1260 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.3.1 >Reporter: Karl Wettin >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, > Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt > > > The static span and resolution of the 8 bit norms codec might not fit with > all applications. > My use case requires that 100f-250f is discretized in 60 bags instead of the > default.. 10? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
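Sketched out, the call pattern Johan describes looks like this (encodeNormValue is the instance method this patch introduces; the null branch mirrors the FieldNormModifier fallback he mentions):

{code:java}
import org.apache.lucene.search.Similarity;

public class NormCodecMigration {
  /** Encode with an explicit Similarity, or fall back the way the
   *  deprecated statics now do. */
  static byte encode(Similarity sim, float norm) {
    if (sim == null) {
      // No Similarity passed in: route through the default instance,
      // which is what the deprecated static encodeNorm now delegates to.
      return Similarity.getDefaultSimilarity().encodeNormValue(norm);
    }
    return sim.encodeNormValue(norm);   // per-instance codec from this patch
  }
}
{code}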
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
bq. merging down every so often seems manageable so far (Mark?). Yeah, this has been working great from my perspective. Michael McCandless wrote: > I think you should keep doing all LUCENE-1606 work (and, any other > issues) on trunk, and then we merge down to flex branch once it's > committed? > > We shouldn't hold up any trunk features because flex is > coming... > > I'm hoping to finish flex soonish -- largely what remains (I think!) > is better testing (correctness & performance) of the 4-way > combinations. I think the codecs approach is generally working > well.. the fact that we have initial Pulsing & PforDelta codecs > working is great. > > Mike > > On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir wrote: > >> Mike, I guess what I am implying is should i even bother with lucene-1606 >> and trunk? >> >> or instead, should i be helping you, looking at TermsEnum, and working on >> integrating it into flex? >> >> On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless >> wrote: >> >>> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir wrote: >>> >>> > No, not really... just an optimization I found when hunting ;) > > I'm working now on an AutomatonTermsEnum that uses the flex API > directly, to test that performance. > > I didn't mean to 'bail out' on this >>> You didn't 'bail out'; I 'bailed in' ;) This is the joy of open >>> source... great big noisy Bazaar. >>> >>> but I could not tell if TermsEnum was close to stabilized >>> I think it's close; we need to do this port anyway, once automaton is >>> committed to trunk, so really I saved Mark some work ;) >>> >>> and it might be significant work to convert it? >>> It wasn't too bad, but maybe you can look it over once I post patch >>> and see if I messed anything up :) >>> >>> Maybe benching numeric range would be easier and accomplish the same thing? >>> Yeah benching NRQ would be good too... many benchmarks still to run. >>> >>> Mike >>> >>> - >>> To unsubscribe, e-mail: [email protected] >>> For additional commands, e-mail: [email protected] >>> >>> >> >> -- >> Robert Muir >> [email protected] >> >> > > - > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781204#action_12781204 ] Mark Miller commented on LUCENE-2089: - bq. we find the nice n where this is almost as bad as brute force, and cut her off there? yeah, that sounds good if there is a nice transition at a relatively lower n. bq. this is basically what i did for the enum already, if you have a wildcard with leading * or a regex with .*, its faster to linear scan than to seek around. Yeah, and I think this makes sense - but you will notice both the Lucene qp and Solr don't allow leading wildcard by default - the perf is generally bad enough on a large index that we assume you don't want to allow it and force users to "turn on" the option. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
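For reference, the "turn on" Mark refers to is a one-liner on the query parser (package names per the 2.9/3.0 layout; a sketch, not run here):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class LeadingWildcardOptIn {
  public static void main(String[] args) throws Exception {
    QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, "body",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    // By default qp.parse("*abc") throws ParseException: leading wildcards
    // are assumed too expensive to allow silently on a large index.
    qp.setAllowLeadingWildcard(true);   // the explicit opt-in
    Query q = qp.parse("*abc");
    System.out.println(q);              // prints the wildcard query, e.g. body:*abc
  }
}
{code}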
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781205#action_12781205 ] Robert Muir commented on LUCENE-2089: - bq. but you will notice both the Lucene qp and Solr don't allow leading wildcard by default if we add this automaton stuff, I think we should reconsider this rule. instead of disallowing leading * or ? by default, just disallow leading * by default. i won't really argue for it though, because a wildcard query of "?abc" is probably just as bad as "*abc" > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781206#action_12781206 ] Mark Miller commented on LUCENE-2089: - I think it makes sense to allow leading ? - ?abc is prob rare enough that its worth it to allow by default. Or perhaps a separate knob to turn it on and leave leading * off. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781207#action_12781207 ] Robert Muir commented on LUCENE-2089: - mark, you are right. plus, the qp does not throw exceptions if you have a fuzzy query with no constant prefix, which is going to be actually worse than even leading *!!! > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781208#action_12781208 ] Mark Miller commented on LUCENE-2089: - solr doesnt even allow for a constant prefix with fuzzy - its broken, broken, broken :) DisMax to the rescue. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least common K (1,2,3, etc) we should use automatontermenum. if its > outside of that, maybe use the existing slow logic. At high K, it will seek > too much to be helpful anyway. > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2072) Upgrade contrib/regex to jakarta-regex 1.5
[ https://issues.apache.org/jira/browse/LUCENE-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781209#action_12781209 ] Simon Willnauer commented on LUCENE-2072: - I just added a testcase to check if the prefix method does what it is supposed to do. Well, it doesn't! The 1.4 version works fine with prefixes; after upgrading to 1.5 the prefix is always null. I wrote to the jakarta-regexp list but I do not expect to get any reply - jakarta-regexp seems to be dead. simon > Upgrade contrib/regex to jakarta-regex 1.5 > --- > > Key: LUCENE-2072 > URL: https://issues.apache.org/jira/browse/LUCENE-2072 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.9.1 >Reporter: Simon Willnauer >Assignee: Simon Willnauer >Priority: Minor > Fix For: 3.1 > > Attachments: jakarta-regexp-1.5.jar, LUCENE-2072.patch > > > contrib/regex uses jakarta regex 1.4 while 1.5 has been out for a while and makes > the package-private workaround obsolete. We should upgrade the lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
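For reference, the failing check Simon describes would look roughly like the sketch below. It assumes the contrib/regex RegexCapabilities interface with its compile() and prefix() methods, as in contrib at the time; under jakarta-regexp 1.4 the computed constant prefix for a pattern like "luc.*" is "luc", while after the 1.5 upgrade it comes back null:

// Hypothetical minimal reproduction of the broken prefix extraction.
// Assumes org.apache.lucene.search.regex.JakartaRegexpCapabilities with
// compile()/prefix(); treat the exact API as an assumption.
import org.apache.lucene.search.regex.JakartaRegexpCapabilities;

public class PrefixCheck {
  public static void main(String[] args) {
    JakartaRegexpCapabilities caps = new JakartaRegexpCapabilities();
    caps.compile("luc.*");
    // Expected "luc" under jakarta-regexp 1.4; null after upgrading to 1.5.
    System.out.println("prefix = " + caps.prefix());
  }
}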
[jira] Resolved: (LUCENE-2072) Upgrade contrib/regex to jakarta-regex 1.5
[ https://issues.apache.org/jira/browse/LUCENE-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2072. - Resolution: Later Once jakarta-regexp fixes their issues we can go on and upgrade; for now this doesn't make sense. > Upgrade contrib/regex to jakarta-regex 1.5 > --- > > Key: LUCENE-2072 > URL: https://issues.apache.org/jira/browse/LUCENE-2072 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
Yeah I think there will be lots of optimizing we can do, after flex lands. Maybe stick w/ String for now? But open an issue, today, to remind us to cutover to char[] post-flex? Doing all processing in UTF8 is tantalizing too ;) This would mean no conversion of the terms data on iterating from the terms dict... Mike On Sun, Nov 22, 2009 at 1:56 PM, Robert Muir wrote: > ok, I only ask because some rework of this enum could be necessary to take > advantage of the new api. > > examples include changing it to use char[] (easy) to prevent lots of string > creation, which was unavoidable with TermEnum since it is based on string. > > i will never mention this again, but it could also run on byte[] pretty > easily. > However I think high-level processing like this should use utf-16 > processing, as java intended, although I'm pretty positive it would be > extremely fast. > > On Sun, Nov 22, 2009 at 1:33 PM, Michael McCandless > wrote: >> >> I think you should keep doing all LUCENE-1606 work (and, any other >> issues) on trunk, and then we merge down to flex branch once it's >> committed? >> >> We shouldn't hold up any trunk features because flex is >> coming... merging down every so often seems manageable so far (Mark?). >> >> I'm hoping to finish flex soonish -- largely what remains (I think!) >> is better testing (correctness & performance) of the 4-way >> combinations. I think the codecs approach is generally working >> well.. the fact that we have initial Pulsing & PforDelta codecs >> working is great. >> >> Mike >> >> On Sun, Nov 22, 2009 at 1:11 PM, Robert Muir wrote: >> > Mike, I guess what I am implying is should i even bother with >> > lucene-1606 >> > and trunk? >> > >> > or instead, should i be helping you, looking at TermsEnum, and working >> > on >> > integrating it into flex? >> > >> > On Sun, Nov 22, 2009 at 1:05 PM, Michael McCandless >> > wrote: >> >> >> >> On Sun, Nov 22, 2009 at 11:31 AM, Robert Muir wrote: >> >> >> >> >> No, not really... just an optimization I found when hunting ;) >> >> >> >> >> >> I'm working now on an AutomatonTermsEnum that uses the flex API >> >> >> directly, to test that performance. >> >> >> >> >> > >> >> > I didn't mean to 'bail out' on this >> >> >> >> You didn't 'bail out'; I 'bailed in' ;) This is the joy of open >> >> source... great big noisy Bazaar. >> >> >> >> > but I could not tell if TermsEnum was close to stabilized >> >> >> >> I think it's close; we need to do this port anyway, once automaton is >> >> committed to trunk, so really I saved Mark some work ;) >> >> >> >> > and it might be significant work to convert it? >> >> >> >> It wasn't too bad, but maybe you can look it over once I post patch >> >> and see if I messed anything up :) >> >> >> >> > Maybe benching numeric range would be easier and accomplish the same >> >> > thing? >> >> >> >> Yeah benching NRQ would be good too... many benchmarks still to run. >> >> >> >> Mike >> >> >> >> - >> >> To unsubscribe, e-mail: [email protected] >> >> For additional commands, e-mail: [email protected] >> >> >> > >> > >> > >> > -- >> > Robert Muir >> > [email protected] >> > >> >> - >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> > > > > -- > Robert Muir > [email protected] > - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
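The char[] cutover Robert proposes in this thread is mechanical: instead of materializing a String per enumerated term, the automaton runs directly over a reusable character buffer. A minimal sketch (the transition table here is a plain array lookup, not the brics API; field names are illustrative):

// Match terms against a DFA straight from a char[] buffer, so the
// enumerator never allocates a String per term.
class CharArrayDfa {
  final int[][] transitions; // [state][char] -> next state, -1 = reject
  final boolean[] accept;    // accept[state]

  CharArrayDfa(int[][] transitions, boolean[] accept) {
    this.transitions = transitions;
    this.accept = accept;
  }

  // Matches term[0..len) with zero object allocation per term.
  boolean run(char[] term, int len) {
    int state = 0;
    for (int i = 0; i < len; i++) {
      state = transitions[state][term[i]];
      if (state == -1) return false;
    }
    return accept[state];
  }
}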
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781213#action_12781213 ] Michael McCandless commented on LUCENE-1606: bq. it would be nice I think if TermRef provided a helper method to make the char[] available? I agree... though, this requires state (UnicodeUtil.UTF16Result). We could lazily set such state on the TermRef, but, that's making TermRef kinda heavy (it's nice and light weight now). Hmmm. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 3:50 PM, Michael McCandless < [email protected]> wrote: > Yeah I think there will be lots of optimizing we can do, after flex lands. > > Maybe stick w/ String for now? But open an issue, today, to remind us > to cutover to char[] post-flex? > ok, i'll create one. > > Doing all processing in UTF8 is tantalizing too ;) This would mean no > conversion of the terms data on iterating from the terms dict... > lets please not go this route :) its gonna be enough trouble fixing the char[]-based code for unicode 4, forget about byte[] > > Mike > > -- Robert Muir [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781214#action_12781214 ] Robert Muir commented on LUCENE-1606: - bq. I agree... though, this requires state (UnicodeUtil.UTF16Result). We could lazily set such state on the TermRef, but, that's making TermRef kinda heavy (it's nice and light weight now). Hmmm. I guess the state could be in the TermsEnum, but that doesn't make for general use of TermRef. What else uses TermRef? > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 3:52 PM, Robert Muir wrote: > > On Sun, Nov 22, 2009 at 3:50 PM, Michael McCandless > wrote: >> >> Yeah I think there will be lots of optimizing we can do, after flex lands. >> >> Maybe stick w/ String for now? But open an issue, today, to remind us >> to cutover to char[] post-flex? > > ok, i'll create one. Thanks. >> Doing all processing in UTF8 is tantalizing too ;) This would mean no >> conversion of the terms data on iterating from the terms dict... > > lets please not go this route :) its gonna be enough trouble fixing the > char[]-based code for unicode 4, forget about byte[] I'll defer to you ;) Mike - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Created: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api
convert automaton to char[] based processing and TermRef / TermsEnum api Key: LUCENE-2090 URL: https://issues.apache.org/jira/browse/LUCENE-2090 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Priority: Minor Fix For: 3.1 The automaton processing is currently done with String, mostly because TermEnum is based on String. it is easy to change the processing to work with char[], since behind the scenes this is used anyway. in general I think we should make sure char[] based processing is exposed in the automaton pkg anyway, for things like pattern-based tokenizers and such. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
I guess here is where I just say that unicode and java are optimized for utf-16 processing, and so while I agree with byte[] being available in places like this for flex indexing, I'm already nervous about seeing code / optimizations that only work well with latin-1, and are very slow / buggy for anything else. On Sun, Nov 22, 2009 at 3:58 PM, Michael McCandless < [email protected]> wrote: > On Sun, Nov 22, 2009 at 3:52 PM, Robert Muir wrote: > > > > On Sun, Nov 22, 2009 at 3:50 PM, Michael McCandless > > wrote: > >> > >> Yeah I think there will be lots of optimizing we can do, after flex > lands. > >> > >> Maybe stick w/ String for now? But open an issue, today, to remind us > >> to cutover to char[] post-flex? > > > > ok, i'll create one. > > Thanks. > > >> Doing all processing in UTF8 is tantalizing too ;) This would mean no > >> conversion of the terms data on iterating from the terms dict... > > > > lets please not go this route :) its gonna be enough trouble fixing the > > char[]-based code for unicode 4, forget about byte[] > > I'll defer to you ;) > > Mike > > - > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Robert Muir [email protected]
[jira] Updated: (LUCENE-2068) fix reverseStringFilter for unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2068: Attachment: LUCENE_2068.patch added a CHANGES.txt entry. Will commit soon. > fix reverseStringFilter for unicode 4.0 > --- > > Key: LUCENE-2068 > URL: https://issues.apache.org/jira/browse/LUCENE-2068 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Simon Willnauer >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2068.patch, LUCENE-2068.patch, LUCENE_2068.patch, > LUCENE_2068.patch, LUCENE_2068.patch > > > ReverseStringFilter is not aware of supplementary characters: when it > reverses it will create unpaired surrogates, which will be replaced by U+FFFD > by the indexer (but not at query time). > The wrong words will conflate to each other, and the right words won't match, > basically the whole thing falls apart. > This patch implements in-place reverse with the algorithm from apache harmony > AbstractStringBuilder.reverse0() -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Resolved: (LUCENE-2068) fix reverseStringFilter for unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2068. - Resolution: Fixed Committed in revision 883149 > fix reverseStringFilter for unicode 4.0 > --- > > Key: LUCENE-2068 > URL: https://issues.apache.org/jira/browse/LUCENE-2068 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
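The committed fix matters because a naive char-by-char reversal turns each surrogate pair (high, low) into the invalid sequence (low, high), which the indexer then replaces with U+FFFD. The essence of a harmony-style in-place reverse, sketched with illustrative names rather than the actual ReverseStringFilter code:

// Surrogate-aware in-place reverse: reverse the chars, then re-swap any
// surrogate pairs the reversal flipped, so supplementary characters
// survive intact.
public class SurrogateSafeReverse {
  static void reverse(char[] buf, int start, int len) {
    // plain code-unit reverse
    for (int i = start, j = start + len - 1; i < j; i++, j--) {
      char tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }
    // restore surrogate pairs that the reverse flipped to (low, high)
    for (int i = start; i < start + len - 1; i++) {
      if (Character.isLowSurrogate(buf[i]) && Character.isHighSurrogate(buf[i + 1])) {
        char tmp = buf[i]; buf[i] = buf[i + 1]; buf[i + 1] = tmp;
        i++; // skip the repaired pair
      }
    }
  }

  public static void main(String[] args) {
    char[] s = "a\uD835\uDD0Ab".toCharArray(); // 'a', U+1D50A (supplementary), 'b'
    reverse(s, 0, s.length);
    System.out.println(new String(s)); // 'b', U+1D50A, 'a' - no unpaired surrogates
  }
}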
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 4:06 PM, Robert Muir wrote: > I guess here is where I just say that unicode and java are optimized for > utf-16 processing I agree, though leaving things as UTF8 works fine for low level stuff (sorting, comparing equality, etc.)? > and so while I agree with byte[] being available in > places like this for flex indexing, > I'm already nervous about seeing code / optimizations that only work well > with latin-1, and are very slow / buggy for anything else. Buggy we should clearly outright fix. Slower, maybe. But very slow, I hope not? What places specifically are you worried about? Mike - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
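Mike's "low level stuff" point holds because unsigned byte-wise comparison of UTF-8 is exactly Unicode code point order, so term sorting and equality never need a decode to UTF-16. (UTF-16 code unit order disagrees for supplementary characters, which is one reason higher-level processing is another story.) A small sketch:

import java.io.UnsupportedEncodingException;

// Unsigned lexicographic compare of UTF-8 bytes equals code point order.
public class Utf8Compare {
  static int compare(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff); // unsigned byte compare
      if (diff != 0) return diff;
    }
    return a.length - b.length;
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    byte[] x = "\uFFFD".getBytes("UTF-8");       // U+FFFD (BMP)
    byte[] y = "\uD835\uDD0A".getBytes("UTF-8"); // U+1D50A (supplementary)
    System.out.println(compare(x, y) < 0); // true: U+FFFD < U+1D50A in code point order
    // In UTF-16 the surrogate 0xD835 sorts below 0xFFFD, so the two orders differ.
  }
}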
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 4:16 PM, Michael McCandless < [email protected]> wrote: > On Sun, Nov 22, 2009 at 4:06 PM, Robert Muir wrote: > > I guess here is where I just say that unicode and java are optimized for > > utf-16 processing > > I agree, though leaving things as UTF8 works fine for low level stuff > (sorting, comparing equality, etc.)? > +1 > > > and so while I agree with byte[] being available in > > places like this for flex indexing, > > I'm already nervous about seeing code / optimizations that only work well > > with latin-1, and are very slow / buggy for anything else. > > Buggy we should clearly outright fix. > > Slower, maybe. But very slow, I hope not? > > What places specifically are you worried about? > places like AutomatonQuery, where I found myself wanting to consider the option of processing byte[], when I know this is very bad! > Mike > > - > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Robert Muir [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781219#action_12781219 ] Michael McCandless commented on LUCENE-1606: Besides TermsEnum... TermRef is used by the terms dict when doing the binary search + scan to find a term, and also by TermsConsumer (implemented by the codec, and used when writing a segment to the index). Maybe MTQ holds the state, or FilteredTermsEnum? Other consumers of TermsEnum don't need to convert to char[]. We can discuss this under the new, separate "optimization" issue for MTQs? Also, remember that the current API is doing not only new String() but also new Term() when it enums the terms, so having to do new String() for MTQs on the flex API is OK for starters. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
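One way to keep TermRef light while still offering chars, per the comment above: the enum (or the MTQ) owns a single reusable UnicodeUtil.UTF16Result and decodes on demand. TermRef's bytes/offset/length fields are assumed from the flex branch; treat this as a sketch of where the state could live, not the committed design:

import org.apache.lucene.index.TermRef;
import org.apache.lucene.util.UnicodeUtil;

// Hold the conversion state outside TermRef so TermRef stays lightweight.
class TermCharsAdapter {
  private final UnicodeUtil.UTF16Result utf16 = new UnicodeUtil.UTF16Result();

  // Decode the term's UTF-8 bytes into a reused char[] buffer.
  char[] toChars(TermRef term) {
    UnicodeUtil.UTF8toUTF16(term.bytes, term.offset, term.length, utf16);
    return utf16.result; // valid range is [0, utf16.length)
  }
}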
RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
Hi, As a non-counting vote: +1 to release these artifacts as Lucene 3.0 I tested lucene-core.3.0.0.jar with my updated application, no problems occurred. QueryParser search works, fieldcache/sorting works, numeric range works. Reopen also works correctly, no leftover open files. MMapDirectory on 64 bit Java 1.5.0_22 on Solaris. Merging old indexes with compressed fields created newer, but larger segments (as expected). I also reindexed my indexes, filesize was identical to the one after merge/optimize with compress expansion, so the indexes seem to be identical. I only had to change some parts of my code and remove lots of unneeded casts (thanks to generics). :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [email protected] > -Original Message- > From: Uwe Schindler [mailto:[email protected]] > Sent: Sunday, November 22, 2009 4:07 PM > To: [email protected]; [email protected] > Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2) > > Hi, > > I have built the artifacts for the final release of "Apache Lucene Java > 3.0.0" a second time, because of a bug in the TokenStream API (found by > Shai > Erera, who wanted to make "bad" things with addAttribute, breaking its > behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to > prevent > stack overflow, LUCENE-2087). They are targeted for release on 2009-11-25. > > The artifacts are here: > http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/ > > You find the changes in the corresponding sub folder. The SVN revision is > 883080, here the manifest with build system info: > > Manifest-Version: 1.0 > Ant-Version: Apache Ant 1.7.0 > Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.) > Specification-Title: Lucene Search Engine > Specification-Version: 3.0.0 > Specification-Vendor: The Apache Software Foundation > Implementation-Title: org.apache.lucene > Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49 > Implementation-Vendor: The Apache Software Foundation > X-Compile-Source-JDK: 1.5 > X-Compile-Target-JDK: 1.5 > > Please vote to officially release these artifacts as "Apache Lucene Java > 3.0.0". > > We need at least 3 binding (PMC) votes. > > Thanks everyone for all their hard work on this and I am very sorry for > requesting a vote again, but that's life! Thanks Shai for the pointer to > the > bug! > > > > > Here is the proposed release note, please edit, if needed: > -- > > Hello Lucene users, > > On behalf of the Lucene dev community (a growing community far larger than > just the committers) I would like to announce the release of Lucene Java > 3.0: > > The new version is mostly a cleanup release without any new features. All > deprecations targeted to be removed in version 3.0 were removed. If you > are > upgrading from version 2.9.1 of Lucene, you have to fix all deprecation > warnings in your code base to be able to recompile against this version. > > This is the first Lucene release with Java 5 as a minimum requirement. The > API was cleaned up to make use of Java 5's generics, varargs, enums, and > autoboxing. New users of Lucene are advised to use this version for new > developments, because it has a clean, type safe new API. Upgrading users > can > now remove unnecessary casts and add generics to their code, too. If you > have not upgraded your installation to Java 5, please read the file > JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene > 3.0, it will also happen with any previous release when you upgrade your > Java environment).
> > Lucene 3.0 has some changes regarding compressed fields: 2.9 already > deprecated compressed fields; support for them was removed now. Lucene 3.0 > is still able to read indexes with compressed fields, but as soon as > merges > occur or the index is optimized, all compressed fields are decompressed > and > converted to Field.Store.YES. Because of this, indexes with compressed > fields can suddenly get larger. > > While we generally try and maintain full backwards compatibility between > major versions, Lucene 3.0 has some minor breaks, mostly related to > deprecation removal, pointed out in the 'Changes in backwards > compatibility > policy' section of CHANGES.txt. Notable are: > > - IndexReader.open(Directory) now opens in read-only mode per default > (this > method was deprecated because of that in 2.9). The same occurs to > IndexSearcher. > > - Already started in 2.9, core TokenStreams are now made final to enforce > the decorator pattern. > > - If you interrupt an IndexWriter merge thread, IndexWriter now throws an > unchecked ThreadInterruptedException that extends RuntimeException and > clears the interrupt status. > > -- > > > > Thanks, > Uwe > > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen
Re: svn commit: r883088 - in /lucene/java/branches/flex_1458/src/java/org/apache/lucene/index: TermRef.java codecs/standard/StandardTermsDictReader.java
On Sun, Nov 22, 2009 at 4:19 PM, Robert Muir wrote: >> What places specifically are you worried about? > > places like AutomatonQuery, where I found myself wanting to consider the > option of processing byte[], when I know this is very bad! Ahh OK :) Well you got the better of yourself before it was too late! Mike - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781220#action_12781220 ] Robert Muir commented on LUCENE-1606: - bq. We can discuss this under the new, separate "optimization" issue for MTQs? Is there a JIRA issue for this? bq. Also, remember that the current API is doing not only new String() but also new Term() when it enums the terms, so having to do new String() for MTQs on the flex API is OK for starters. Oh yeah, it's clear the flex API is already better, from benchmarks. I am just trying to think of ways to make it both faster and, at the same time, easy too. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781221#action_12781221 ] Michael McCandless commented on LUCENE-1606: bq. Is there a JIRA issue for this? I thought you were about to open one! bq. I am just trying to think of ways to make it both faster and, at the same time, easy too. Which is great: keep it up! Actually... wouldn't we need to convert to int[] (for Unicode 4), not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]). > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api
[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781222#action_12781222 ] Michael McCandless commented on LUCENE-2090: Spinoff from LUCENE-1606. > convert automaton to char[] based processing and TermRef / TermsEnum api > > Key: LUCENE-2090 > URL: https://issues.apache.org/jira/browse/LUCENE-2090 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781223#action_12781223 ] Robert Muir commented on LUCENE-1606: - bq. I thought you were about to open one! I opened one for Automaton specifically; should I change it to be all MTQ? bq. Actually... wouldn't we need to convert to int[] (for Unicode 4), not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]). Nope, because unicode and java are optimized for UTF-16, not UTF-32. So we should use char[], but use the codePoint APIs, which are designed such that you can process text in UTF-16 (char[]) efficiently, yet also handle the rare case of supplementary characters. char[] is correct; it's just that we have to be careful to use the right APIs for processing it. With String a lot of the APIs such as String.toLowerCase do this automatically for you, so most applications have no issues. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
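The codePoint-API pattern Robert means looks like this: iterate the char[] in UTF-16 as usual, but step by code point so the rare surrogate pairs are consumed as one character. This is standard java.lang.Character usage, no Lucene API involved:

// Walk a char[] by Unicode code point rather than by UTF-16 code unit.
public class CodePointWalk {
  static void walk(char[] text, int len) {
    for (int i = 0; i < len; ) {
      int cp = Character.codePointAt(text, i, len); // pairs surrogates for us
      System.out.printf("U+%04X%n", cp);            // ...or feed cp to a DFA transition
      i += Character.charCount(cp);                 // 1 for BMP, 2 for supplementary
    }
  }

  public static void main(String[] args) {
    char[] term = "a\uD835\uDD0Ab".toCharArray();
    walk(term, term.length); // prints U+0061, U+1D50A, U+0062
  }
}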
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781224#action_12781224 ] Michael McCandless commented on LUCENE-1260: Patch looks good! Thanks Johan. I'll commit in a day or two... > Norm codec strategy in Similarity > - > > Key: LUCENE-1260 > URL: https://issues.apache.org/jira/browse/LUCENE-1260 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.3.1 >Reporter: Karl Wettin >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, > Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt > > > The static span and resolution of the 8 bit norms codec might not fit with > all applications. > My use case requires that 100f-250f is discretized in 60 bags instead of the > default.. 10? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
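Karl's use case on LUCENE-1260 translates into a Similarity whose norm codec spreads [100, 250] across 60 evenly spaced bags. The sketch below assumes the patch makes the codec overridable per Similarity instance via encodeNormValue/decodeNormValue hooks; the hook names and the linear mapping are assumptions for illustration, not the committed patch:

import org.apache.lucene.search.DefaultSimilarity;

// Custom 8-bit norm codec: clamp to [100, 250] and quantize into 60 bags
// instead of the default SmallFloat-style encoding.
public class BaggedNormSimilarity extends DefaultSimilarity {
  private static final float MIN = 100f, MAX = 250f;
  private static final int BAGS = 60;

  @Override
  public byte encodeNormValue(float f) {
    float clamped = Math.max(MIN, Math.min(MAX, f));
    return (byte) Math.round((clamped - MIN) / (MAX - MIN) * (BAGS - 1));
  }

  @Override
  public float decodeNormValue(byte b) {
    return MIN + (b & 0xff) * (MAX - MIN) / (BAGS - 1);
  }
}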
RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
And I forgot: I was able to build and test the whole distribution from the source ZIP. JavaDocs of the binary distrib were ok, too. So no change: +1 :-) Uwe
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2089: Attachment: Moman-0.1.tar.gz From the Moman author: Absolutely. Sorry for the missing links. I had some problems with my provider, which backed up an old version of my website. The library is MIT licensed, so feel free to take anything you want. I didn't make the subversion repository available since I was working on Daciuk's "Incremental construction of minimal acyclic finite-state automata", improving the memory footprint of the algorithm. Since I want to be sure the new algorithm is right before publishing it, the repository isn't available. Anyway, here's the source code (attached). I must warn you that there are no comments at all, which is unfortunate, since I was contacted by many people lately that had the same kind of request as yours. If you need anything, just don't hesitate to ask. > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Attachments: Moman-0.1.tar.gz -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781227#action_12781227 ] Robert Muir commented on LUCENE-1606: - bq. Actually... wouldn't we need to convert to int[] (for Unicode 4), not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]). I wanted to make another comment here: I agree that this is somewhat like byte[]. But there are some major differences: # the Java API provides mechanisms in Character, etc. for processing text this way. # lots of stuff is unaffected: for example, .startsWith() is not broken for supplementary characters. It does not have to use code points anywhere; it can just compare chars, which are surrogates, but this is OK. So lots of char[]-based processing is already compatible, and completely unaware of this issue. This is not true for byte[]. # it will perform the best overall; it's only needed in very few places and we can be very careful where we add these checks, so we don't slow anything down or increase RAM usage, etc. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
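Robert's second point above is easy to demonstrate: pure code-unit operations like a prefix check compare surrogates pairwise and stay correct for supplementary characters, while per-char transforms are exactly the places that need the code point APIs. A small illustrative sketch:

// startsWith over char[] needs no code point logic: a surrogate pair in
// the prefix matches the identical pair in the term, unit by unit.
public class SurrogateSafety {
  static boolean startsWith(char[] s, int len, char[] prefix) {
    if (prefix.length > len) return false;
    for (int i = 0; i < prefix.length; i++) {
      if (s[i] != prefix[i]) return false; // raw char compare is fine here
    }
    return true;
  }

  public static void main(String[] args) {
    char[] term = "\uD835\uDD0Axyz".toCharArray(); // starts with U+1D50A
    System.out.println(startsWith(term, term.length, "\uD835\uDD0A".toCharArray())); // true
    // By contrast, Character.toLowerCase(term[0]) on a lone surrogate is wrong;
    // use Character.toLowerCase(Character.codePointAt(term, 0)) instead.
  }
}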
