RE: Build failed in Hudson: Lucene-trunk #1091

2010-02-11 Thread Uwe Schindler
Really cool:
This time we hit the failure during the clover run:

[junit] Testsuite: org.apache.lucene.index.TestIndexWriterMergePolicy
[junit] Tests run: 6, Failures: 1, Errors: 0, Time elapsed: 28.519 sec
[junit] 
[junit] Testcase: 
testMaxBufferedDocsChange(org.apache.lucene.index.TestIndexWriterMergePolicy):  
  FAILED
[junit] maxMergeDocs=2147483647; numSegments=11; upperBound=10; 
mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 
_5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 
_62:c9-_32 _64:c1-_62
[junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; 
numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 
_5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 
_60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62
[junit] at 
org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:234)
[junit] at 
org.apache.lucene.index.TestIndexWriterMergePolicy.__CLR2_6_3zf7i0317qu(TestIndexWriterMergePolicy.java:164)
[junit] at 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:125)
[junit] at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:214)
[junit] 
[junit] 
[junit] Test org.apache.lucene.index.TestIndexWriterMergePolicy FAILED


We also get the exact location of the failure:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/clover-report/org/apache/lucene/index/TestIndexWriterMergePolicy.html?line=125#src-125

And you can see how often each line was called before the failure occurred!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
 Sent: Thursday, February 11, 2010 5:16 AM
 To: java-dev@lucene.apache.org
 Subject: Build failed in Hudson: Lucene-trunk #1091
 
 See http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/1091/changes
 
 Changes:
 
 [uschindler] LUCENE-2248: Change core tests to use a global Version
 constant
 
 [uschindler] LUCENE-2258: Remove unneeded synchronization in
 FuzzyTermEnum
 
 --
 [...truncated 23095 lines...]
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.builders...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.config...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.nodes...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.parser...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.processors...
   [javadoc] Constructing Javadoc information...
   [javadoc] Standard Doclet version 1.5.0_22
   [javadoc] Building tree for all the packages and classes...
   [javadoc] Building index for all the packages and classes...
   [javadoc] Building index for all classes...
   [javadoc] Generating
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-queryparser/stylesheet.css...
   [javadoc] Note: Custom tags that were not seen:
 @lucene.experimental, @lucene.internal
   [jar] Building jar:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/contrib/queryparser/lucene-queryparser-2010-02-
 11_02-03-57-javadoc.jar
  [echo] Building regex...
 
 javadocs:
 [mkdir] Created dir:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-regex
   [javadoc] Generating Javadoc
   [javadoc] Javadoc execution
   [javadoc] Loading source files for package
 org.apache.lucene.search.regex...
   [javadoc] Loading source files for package org.apache.regexp...
   [javadoc] Constructing Javadoc information...
   [javadoc] Standard Doclet version 1.5.0_22
   [javadoc] Building tree for all the packages and classes...
   [javadoc] Building index for all the packages and classes...
   [javadoc] Building index for all classes...
   [javadoc] Generating
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-regex/stylesheet.css...
   [javadoc] Note: Custom tags that were not seen:
 @lucene.experimental, @lucene.internal
   [jar] Building jar:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-02-11_02-03-57-
 javadoc.jar
  [echo] Building remote...
 
 javadocs:
 [mkdir] Created dir:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-remote
   [javadoc] Generating Javadoc
   [javadoc] Javadoc execution
   [javadoc] Loading source files for package
 org.apache.lucene.search...
   [javadoc] Constructing Javadoc 

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832424#action_12832424
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
 What about this,
http://www.catalysoft.com/articles/StrikeAMatch.html
it seems logically more appropriate to (human-entered) text objects than 
Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance 
faster? 
{quote}

Is it only me who sees plain, vanilla bigram distance here? What is new or 
better in StrikeAMatch compared to the first phase of the current SpellChecker 
(feeding the PriorityQueue with candidates)? 

If you need to use this, nothing is simpler: you do not even need pairwise 
comparison (aka traversal), just index terms split into bigrams and search with 
a standard Query. 
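For illustration, a minimal sketch of that idea against the 3.0-era API (the 
"bigrams" field name and its indexing scheme are assumptions):
{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// OR together the word's bigrams against a field indexed the same way
static BooleanQuery bigramQuery(String word) {
  BooleanQuery bq = new BooleanQuery();
  for (int i = 0; i < word.length() - 1; i++) {
    bq.add(new TermQuery(new Term("bigrams", word.substring(i, i + 2))),
           BooleanClause.Occur.SHOULD);
  }
  return bq;
}
{code}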


The Automaton trick is a neat one. IMO, the only thing that would work better is 
to make the term dictionary a real trie (ternary, n-ary, DFA, makes no big 
difference). Making the TermDict some sort of trie/DFA would permit smart beam 
search, even without compiling the query DFA. Beam search also makes 
implementation of better distances possible (weighted edit distance without the 
metric constraint). I guess this is going to be possible with Flex; Mike was 
already talking about a DFA dictionary :)

It took a while to figure out the trick Robert pulled here, treating the term 
dictionary as another DFA due to its sortedness. Nice. 

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA->DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.
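A hedged reading of the "maximum required K edits" step above, using the usual 
FuzzyQuery relation similarity = 1 - edits/termLength (an assumption, not code 
from the patch):
{code}
static int maxRequiredEdits(float minSimilarity, int termLength) {
  // largest K such that a term within K edits can still reach minSimilarity
  return (int) ((1.0f - minSimilarity) * termLength);
}
{code}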

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1720:
---

Attachment: LUCENE-1720.patch

Collected all the files in the issue into a single .patch file. I also included 
the following changes/additions:
# Formatting.
# Renamed TestTimeLimitedIndexReader to TimeLimitedIndexReaderBenchmark - I 
don't want it to be run by ant test. We should review it anyway and put it in 
benchmark.
# Created TestActivityTimeMonitor which tests the ATM itself. This revealed a 
bug in the ATM, where it removed from the map while iterating on it 
(TimeoutThread.run()); a sketch of the safe-removal pattern follows this list.
#* BTW, the test testMultipleThreadsFailing() passes when the number of threads 
is 10, and fails when it's set higher. It is definitely a timing issue, b/c if 
I change each thread to sleep more than 200ms, the test passes. I suspect that 
when there are so many threads in the system, TimeoutThread will not be as 
accurate. However I'm not sure if we should do something about it ... if there 
are 50 threads running concurrently, things will be slow, and executing search 
requests will take longer, therefore the TimeoutThread will have enough time to 
pick up their timeouts.
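A minimal sketch of the safe-removal pattern (the map name and value type are 
assumptions, not the patch):
{code}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

Map<Thread, Long> timeLimitedThreads = new HashMap<Thread, Long>();
long now = System.currentTimeMillis();
for (Iterator<Map.Entry<Thread, Long>> it =
         timeLimitedThreads.entrySet().iterator(); it.hasNext();) {
  Map.Entry<Thread, Long> entry = it.next();
  if (entry.getValue().longValue() <= now) {
    it.remove();                // safe: removes via the iterator, not the map
    entry.getKey().interrupt(); // hypothetical timeout action
  }
}
{code}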

We need to create a test for the IndexReader case, I hope to get to it, but if 
you want, please take a stab at it.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832427#action_12832427
 ] 

Shai Erera commented on LUCENE-1720:


BTW Mark, if we assume the number of threads that will be monitored is 
relatively low, I agree w/ your comment on using the TreeMap. I think we can 
remove it from the code (the comment I mean). What do you think?

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers

2010-02-11 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2154:
--

Attachment: LUCENE-2154.patch

Here is a first patch for cglib-generated proxy attributes.

In IRC we found out yesterday that the proposed idea to share the attributes 
across all Multi*Enums would result in problems, as the call to next() on any 
sub-enum would overwrite the contents of the attributes of the previous 
sub-enum, which would make TermsEnum stop working (because e.g. TermsEnum looks 
forward by calling next() on all sub-enums and choosing the lowest term to 
return - after calling each enum's next(), the attributes of the first enums 
cannot be restored without captureState & co, as they are overwritten by the 
next() call to the last enum).

This patch needs cglib-nodep-2.2.jar put into the lib-folder of the checkout 
[http://sourceforge.net/projects/cglib/files/cglib2/2.2/cglib-nodep-2.2.jar/download].

It contains a test that shows the usage. The central part is cglib's 
Enhancer, which creates a dynamic class extending ProxyAttributeImpl (which 
defines the general AttributeImpl methods delegating to the delegate) and 
implementing the requested Attribute interface using a MethodInterceptor.

Please note: This uses no reflection (only during in-memory class file 
creation, which is run only once when loading the proxy class). The proxy 
implements MethodInterceptor and uses the fast MethodProxy class (which is also 
generated by cglib for each proxied method) and can invoke the delegated 
method directly (without reflection) on the delegate.
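A minimal sketch of such an Enhancer setup (assumed from the description 
above, not taken from the patch; currentDelegate() is hypothetical):
{code}
import java.lang.reflect.Method;
import net.sf.cglib.proxy.Enhancer;
import net.sf.cglib.proxy.MethodInterceptor;
import net.sf.cglib.proxy.MethodProxy;

Enhancer enhancer = new Enhancer();
// dynamic subclass of ProxyAttributeImpl implementing the requested interface
enhancer.setSuperclass(ProxyAttributeImpl.class);
enhancer.setInterfaces(new Class[] { TermAttribute.class });
enhancer.setCallback(new MethodInterceptor() {
  public Object intercept(Object obj, Method method, Object[] args,
                          MethodProxy proxy) throws Throwable {
    // re-invoke the same method on the current delegate via the generated
    // MethodProxy - no java.lang.reflect call at invocation time
    return proxy.invoke(currentDelegate(), args);
  }
});
Object proxyAttribute = enhancer.create();
{code}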

The test verifies everything works and also compares speed by using a 
TermAttribute natively and proxied. The speed is lower (which is not caused by 
reflection, but by the MethodInterceptor creating an array of parameters and 
boxing/unboxing native parameters into the Object[]), but for the test case I 
have seen only about 50% more time needed.

The generated classes are cached and reused (like DEFAULT_ATTRIBUTE_FACTORY 
does).

To get maximum speed and no external libraries, the code implemented by 
Enhancer can be rewritten natively using the Apache Harmony 
java.lang.reflect.Proxy implementation source code as a basis. The hardest part 
in generating bytecode is the ConstantPool in class files. But as the proxy 
methods simply delegate and no magic like boxing/unboxing is needed, the 
generated bytecode is rather simple.

One other use case for these proxies is AppendingTokenStream, which is not 
possible since 3.0 without captureState (in the old TS API it was possible, 
because you could reuse the same Token instance even across the appended 
streams). In the new TS API, the appending stream must have a view on the 
attributes of the current consuming sub-stream.

 Need a clean way for Dir/MultiReader to merge the AttributeSources of the 
 sub-readers
 ---

 Key: LUCENE-2154
 URL: https://issues.apache.org/jira/browse/LUCENE-2154
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
 Fix For: Flex Branch

 Attachments: LUCENE-2154.patch


 The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum 
 levels, for a codec to set custom attrs.
 But, it's currently broken for Dir/MultiReader, which must somehow share 
 attrs across all the sub-readers.  Somehow we must make a single attr source, 
 and tell each sub-reader's enum to use that instead of creating its own.  
 Hopefully Uwe can work some magic here :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-trunk #1091

2010-02-11 Thread Michael McCandless
This is LUCENE-2118 -- it strikes every so often.

I think it's harmless (but really annoying).  It looks like a corner
case in the merge policy where somehow a level is able to have 11
segments after merging thinks it's done.
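For context, a hedged sketch of the invariant being asserted (numSegments and
upperBound are the names from the failure message; the check itself is an
assumption, not quoted from checkInvariants):

    static void checkLevelInvariant(int numSegments, int upperBound) {
      // once merging has settled, a level should hold at most upperBound
      // (== mergeFactor) segments; the failure above shows 11 vs upperBound=10
      if (numSegments > upperBound) {
        throw new AssertionError("numSegments=" + numSegments
            + "; upperBound=" + upperBound);
      }
    }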

Mike

On Thu, Feb 11, 2010 at 3:03 AM, Uwe Schindler <u...@thetaphi.de> wrote:
 Really cool:
 This time we hit the failure during the clover run:

    [junit] Testsuite: org.apache.lucene.index.TestIndexWriterMergePolicy
    [junit] Tests run: 6, Failures: 1, Errors: 0, Time elapsed: 28.519 sec
    [junit]
    [junit] Testcase: 
 testMaxBufferedDocsChange(org.apache.lucene.index.TestIndexWriterMergePolicy):
     FAILED
    [junit] maxMergeDocs=2147483647; numSegments=11; upperBound=10; 
 mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 
 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 
 _62:c9-_32 _64:c1-_62
    [junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; 
 numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 
 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 
 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62
    [junit]     at 
 org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:234)
    [junit]     at 
 org.apache.lucene.index.TestIndexWriterMergePolicy.__CLR2_6_3zf7i0317qu(TestIndexWriterMergePolicy.java:164)
    [junit]     at 
 org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:125)
    [junit]     at 
 org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:214)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.index.TestIndexWriterMergePolicy FAILED


 We also get the exact location of the failure:
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/clover-report/org/apache/lucene/index/TestIndexWriterMergePolicy.html?line=125#src-125

 And you can see how often each line was called before the failure occurred!

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
 Sent: Thursday, February 11, 2010 5:16 AM
 To: java-dev@lucene.apache.org
 Subject: Build failed in Hudson: Lucene-trunk #1091

 See http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/1091/changes

 Changes:

 [uschindler] LUCENE-2248: Change core tests to use a global Version
 constant

 [uschindler] LUCENE-2258: Remove unneeded synchronization in
 FuzzyTermEnum

 --
 [...truncated 23095 lines...]
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.builders...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.config...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.nodes...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.parser...
   [javadoc] Loading source files for package
 org.apache.lucene.queryParser.standard.processors...
   [javadoc] Constructing Javadoc information...
   [javadoc] Standard Doclet version 1.5.0_22
   [javadoc] Building tree for all the packages and classes...
   [javadoc] Building index for all the packages and classes...
   [javadoc] Building index for all classes...
   [javadoc] Generating
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-queryparser/stylesheet.css...
   [javadoc] Note: Custom tags that were not seen:
 @lucene.experimental, @lucene.internal
       [jar] Building jar:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/contrib/queryparser/lucene-queryparser-2010-02-
 11_02-03-57-javadoc.jar
      [echo] Building regex...

 javadocs:
     [mkdir] Created dir:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-regex
   [javadoc] Generating Javadoc
   [javadoc] Javadoc execution
   [javadoc] Loading source files for package
 org.apache.lucene.search.regex...
   [javadoc] Loading source files for package org.apache.regexp...
   [javadoc] Constructing Javadoc information...
   [javadoc] Standard Doclet version 1.5.0_22
   [javadoc] Building tree for all the packages and classes...
   [javadoc] Building index for all the packages and classes...
   [javadoc] Building index for all classes...
   [javadoc] Generating
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/docs/api/contrib-regex/stylesheet.css...
   [javadoc] Note: Custom tags that were not seen:
 @lucene.experimental, @lucene.internal
       [jar] Building jar:
 http://hudson.zones.apache.org/hudson/job/Lucene-
 trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-02-11_02-03-57-
 javadoc.jar
      [echo] Building remote...

 

[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832444#action_12832444
 ] 

Mark Harwood commented on LUCENE-1720:
--

Thanks for the updates, Shai.

Agreed on removing the TreeMap comment.
As you suggest, there may be a low-level timing-accuracy issue under heavy load, 
but for the typically longer timeout settings we may set, this is less likely to 
be an issue. 

Related: I did think of another feature for ATM - timeouts will typically be 
set to the maximum bearable value that can be sustained by the hardware without 
upsetting lots of users/customers who need answers.
This setting is therefore a tough business decision to make and is likely to be 
on the high side to avoid annoying customers (10 seconds? 30?).
The current monitoring solution only aborts at the latest possible stage, when 
the uppermost acceptable limit has been reached and expensive resources have 
already been burned.
Maybe we could add a progress-testing method to ATM which can throw an 
exception earlier, e.g.
public void checkForProjectedActivityTimeout(float percentActivityCompletedSoFar)
Clients would need to estimate how far through a task they were and call this 
method periodically.



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers

2010-02-11 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2154:
--

Attachment: LUCENE-2154.patch

I had some more fun. Made ProxyAttributeSource non-final and added a class-name 
policy to also include the corresponding interface (to make stack traces on 
errors nicer).

Here is the example output:
{noformat}
[junit] DEBUG: Created class 
org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TermAttribute$$EnhancerByCGLIB$$6100bdf9
 for attribute org.apache.lucene.analysis.tokenattributes.TermAttribute
[junit] DEBUG: Created class 
org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TypeAttribute$$EnhancerByCGLIB$$6f89c3ff
 for attribute org.apache.lucene.analysis.tokenattributes.TypeAttribute
[junit] DEBUG: Created class 
org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$FlagsAttribute$$EnhancerByCGLIB$$4668733c
 for attribute org.apache.lucene.analysis.tokenattributes.FlagsAttribute
[junit] Time taken using 
org.apache.lucene.analysis.tokenattributes.TermAttributeImpl:
[junit]   1476.090658 ms for 1000 iterations
[junit] Time taken using 
org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TermAttribute$$EnhancerByCGLIB$$6100bdf9:
[junit]   1881.295734 ms for 1000 iterations
{noformat}

 Need a clean way for Dir/MultiReader to merge the AttributeSources of the 
 sub-readers
 ---

 Key: LUCENE-2154
 URL: https://issues.apache.org/jira/browse/LUCENE-2154
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
 Fix For: Flex Branch

 Attachments: LUCENE-2154.patch, LUCENE-2154.patch


 The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum 
 levels, for a codec to set custom attrs.
 But, it's currently broken for Dir/MultiReader, which must somehow share 
 attrs across all the sub-readers.  Somehow we must make a single attr source, 
 and tell each sub-reader's enum to use that instead of creating its own.  
 Hopefully Uwe can work some magic here :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832461#action_12832461
 ] 

Shai Erera commented on LUCENE-1720:


I like the idea of adding the projected activity timeout in general, but I'd 
like to question its usefulness in reality (or at least for search 
applications). The way I think of it (and it might be because I'm thinking of 
my use case), there are two problems with such an API:
# It might not be easy (if possible at all), or performant, to project how much 
of the work has been done. For TermQuery it might be easy to tell (e.g. 
numSeenSoFar / df(term), as sketched after this list), but that will add an 
'if' to every document that is traversed, and possibly more operations. But for 
more complicated queries, I'm not sure you'll be able to tell how much of the 
query has been processed.
# If I am willing to sustain a 10s query, then I guess I'd want to extract as 
much information as I can in those 10s. If after 1s I realize I haven't 
processed even 10% of the data, that doesn't mean I'd like to stop, right? Maybe 
the query/activity will speed up shortly? I think that if I put a cap on the 
query time, it means I don't mind spending that amount of time ... but I also 
recognize this may depend on the application, and therefore it is not too 
strong an argument.
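A minimal sketch of that estimate (names are hypothetical):
{code}
// fraction of a TermQuery's postings consumed so far; note the extra branch
// (and division, unless hoisted) this adds per collected document
static float termQueryProgress(int docsSeenSoFar, int docFreq) {
  return docFreq == 0 ? 1f : (float) docsSeenSoFar / docFreq;
}
{code}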

I think this approach is interesting, as it is able to detect 'hanging' threads 
(such as those stuck in infinite loops).

I realize however that ActivityTimeMonitor is not search specific (which makes 
me think it should be moved to o.a.l.util or something) and therefore the 
projected activity timeout can have its usage in other places.

How about if we do it in a separate issue? We still need to write enough tests 
for what exists so far, and turn the Benchmark class into a benchmark task/alg. 
I think that if we can avoid extra functionality (which is likely to add more 
bugs to cover) it will be easier to finish that issue, no?
BTW, in order to support this we'll need to store the startTime as well, not 
just the timeoutTime, which means that we either add another startTimesThreads 
map, or change the map to be from Thread to a Times object which encapsulates 
both times ... Minor thing though.

Also, is this targeted to be added to 'core' or contrib?

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832470#action_12832470
 ] 

Mark Harwood commented on LUCENE-1720:
--

The change to ATM isn't that big - as you say, just adding the start time to 
the data held for each thread.
Here's an (untested) example:
{code:title=Bar.java|borderStyle=solid}
/**
 * Checks to see if this thread is likely to exceed its pre-determined timeout.
 * This is a heavier-weight call than checkForTimeout and should not
 * be called quite as frequently.
 * 
 * Throws {@link ActivityTimedOutException} (a RuntimeException) in the
 * event of any anticipated timeout.
 * @param progress estimated fraction of the activity completed so far
 */
public static final void checkProjectedTimeoutOnThisThread(float progress)
{
    Thread currentThread = Thread.currentThread();
    synchronized (timeLimitedThreads)
    {
        ActivityTime thisTimeOut = timeLimitedThreads.get(currentThread);
        if (thisTimeOut != null)
        {
            long now = System.currentTimeMillis();
            long maxDuration = thisTimeOut.scheduledTimeout - thisTimeOut.startTime;
            long durationSoFar = now - thisTimeOut.startTime;
            // progress expected if the activity were exactly on schedule
            float expectedMinimumProgress = (float) durationSoFar / maxDuration;
            if (progress < expectedMinimumProgress)
            {
                long expectedOverrun = (long)
                    (((durationSoFar * (1f - progress)) + now) - thisTimeOut.scheduledTimeout);
                throw new ActivityTimedOutException("Thread " + currentThread
                    + " is expected to time out, estimated overrun = "
                    + expectedOverrun + " ms", expectedOverrun);
            }
        }
    }
}

static class ActivityTime
{
    public ActivityTime(long startTime, long timeOutTime)
    {
        this.startTime = startTime;
        this.scheduledTimeout = timeOutTime;
    }
    long startTime;
    long scheduledTimeout;
}
{code} 

I agree it will be challenging to work out when to call this from readers etc. 
and how to estimate completeness, but as a general utility class (as you 
suggest, in o.a.l.util) it seems like a useful addition.

My suspicion is that this currently belongs in contrib - but then 
TimeLimitingCollector is currently in core.
Maybe TimeLimitingCollector could be rewritten to use ATM, and then we maintain 
a common, generally reusable implementation?






 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are

2010-02-11 Thread Uwe Schindler (JIRA)
AttributeSource holds strong reference to class instances and prevents 
unloading e.g. in Solr if webapplication reload and custom attributes in 
separate classloaders are used (e.g. in the Solr plugins classloader)
-

 Key: LUCENE-2260
 URL: https://issues.apache.org/jira/browse/LUCENE-2260
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0, 2.9.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9.2, 3.0.1, 3.1


When working on the dynamic proxy classes using cglib/javassist I recognized a 
problem in the caching code inside AttributeSource:
- AttributeSource has a static (!) cache map that holds implementation classes 
for attributes, to be faster when creating new attributes (reflection cost)
- AttributeSource has a static (!) cache map that holds a list of all 
interfaces implemented by a specific AttributeImpl

Also:
- VirtualMethod in 3.1 holds a map of implementation distances keyed by 
subclasses of the deprecated API

Both have the problem that this strong reference is inside Lucene's classloader 
and so persists as long as Lucene lives. The classes referenced can therefore 
never be unloaded, which would be fine if all live in the same classloader. As 
soon as the Attribute or implementation class or the subclass of the deprecated 
API is loaded by a different classloader (e.g. Lucene lives in the bootclasspath 
of Tomcat, but the Lucene consumer with custom attributes lives in a webapp), 
they can never be unloaded, because a reference exists.

Libs like CGLIB or JavaAssist or the JDK's reflect.Proxy have a similar cache 
for generated class files. They also manage this with a WeakHashMap. The cache 
will still work perfectly and no class will be evicted without reason, as 
classes are only unloaded when the classloader goes away, and this will only 
happen on request (e.g. by Tomcat).
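A hedged sketch of such a weak caching scheme (names illustrative, not the 
patch): weak keys don't pin the attribute class, and wrapping the value in a 
WeakReference avoids a strong path back to the key's classloader, so both can 
be unloaded together with their classloader:
{code}
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

final class AttributeClassCache {
  // hypothetical replacement for the static strong-reference cache map
  static final Map<Class<?>, WeakReference<Class<?>>> implCache =
      Collections.synchronizedMap(
          new WeakHashMap<Class<?>, WeakReference<Class<?>>>());
}
{code}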

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are

2010-02-11 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2260:
--

Attachment: LUCENE-2260.patch

Attached patch. I will commit this in a day and also merge to 2.9 and 3.0 
(without VirtualMethod) as this is a resource leak. This problem is similar to 
LUCENE-2182.

 AttributeSource holds strong reference to class instances and prevents 
 unloading e.g. in Solr if webapplication reload and custom attributes in 
 separate classloaders are used (e.g. in the Solr plugins classloader)
 -

 Key: LUCENE-2260
 URL: https://issues.apache.org/jira/browse/LUCENE-2260
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9.1, 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2260.patch


 When working on the dynamic proxy classes using cglib/javassist I recognized 
 a problem in the caching code inside AttributeSource:
 - AttributeSource has a static (!) cache map that holds implementation 
 classes for attributes, to be faster when creating new attributes (reflection 
 cost)
 - AttributeSource has a static (!) cache map that holds a list of all 
 interfaces implemented by a specific AttributeImpl
 Also:
 - VirtualMethod in 3.1 holds a map of implementation distances keyed by 
 subclasses of the deprecated API
 Both have the problem that this strong reference is inside Lucene's 
 classloader and so persists as long as Lucene lives. The classes referenced 
 can therefore never be unloaded, which would be fine if all live in the same 
 classloader. As soon as the Attribute or implementation class or the subclass 
 of the deprecated API is loaded by a different classloader (e.g. Lucene lives 
 in the bootclasspath of Tomcat, but the Lucene consumer with custom attributes 
 lives in a webapp), they can never be unloaded, because a reference exists.
 Libs like CGLIB or JavaAssist or the JDK's reflect.Proxy have a similar cache 
 for generated class files. They also manage this with a WeakHashMap. The cache 
 will still work perfectly and no class will be evicted without reason, as 
 classes are only unloaded when the classloader goes away, and this will only 
 happen on request (e.g. by Tomcat).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832483#action_12832483
 ] 

Mark Harwood commented on LUCENE-1720:
--

Agreed, it might be useful to provide a boolean response from the progress 
method - a kind of "how am I doing?" check.
We can always provide a convenience wrapper method which throws an exception: 
ATM.blowUpIfNotGoingFastEnough(float progress)

Re TimeLimitingCollector - agreed, you really do need to protect the ATM 
start/stop calls in the same try...finally block.
Maybe ATM could have a start method variant that takes an additional 
alreadyRunningSince argument, as opposed to the existing assumption that the 
activity is starting right now; a sketch follows. The first collect could then 
call this with a timestamp initialised in the constructor.
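A hedged sketch of that variant (timeLimitedThreads and ActivityTime are the 
names from the earlier example; the method name is made up):
{code}
public static void startTimeLimitedActivity(long timeoutMs, long alreadyRunningSince)
{
    synchronized (timeLimitedThreads)
    {
        // measure the deadline from when the activity actually began,
        // not from when the monitor was told about it
        timeLimitedThreads.put(Thread.currentThread(),
            new ActivityTime(alreadyRunningSince, alreadyRunningSince + timeoutMs));
    }
}
{code}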
Even then, there is the issue of where to put the stop call - a collector has 
no close call to signal the end of the activity.

It doesn't seem like TimeLimitingCollector can be based on the same ATM code. 
Shame.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832491#action_12832491
 ] 

Shai Erera commented on LUCENE-1720:


bq. We can always provide a convenience wrapper method which throws an 
exception : ATM.blowUpIfNotGoingFastEnough(float progress) 

Right. Although I'm thinking, for the sake of keeping the API simple and 
back-compat sane, it's easy enough for the caller to do:
{code}
if (!ATM.checkProjectedTimeoutOnThisThread(progress)) {
  throw new ActivityTimedOutException("terminating due to projected timeout. progress = " + progress);
}
{code}

bq. The first collect

Note: that means that every collect will need to check if it's the first.

I think we should leave TLC out of this for now. Maybe w/ 
TimeLimitingIndexReader, which has finer granularity, TLC won't be used at all. 
At least, I definitely don't see why one would use both...

About the packages, so we agree that ATM and the exception move to o.a.l.util, 
and the reader stays in index (like IndexReader)? And I don't think this 
warrants a contrib package, unless we move TLC there as well. But this seems 
like a useful Lucene core utility.

So that we don't overwrite each other - are you going to add the projected 
method to the patch? If so, can you also order the packages and remove the 
TreeMap comment? If not, I can do it.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are

2010-02-11 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2260:
--

Attachment: LUCENE-2260.patch

Improved patch; now all class references are weak. The assumption that the 
WeakReference inside addAttributeImpl is never null holds, because the calling 
code keeps a strong reference to the implementing class.

 AttributeSource holds strong reference to class instances and prevents 
 unloading e.g. in Solr if webapplication reload and custom attributes in 
 separate classloaders are used (e.g. in the Solr plugins classloader)
 -

 Key: LUCENE-2260
 URL: https://issues.apache.org/jira/browse/LUCENE-2260
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9.1, 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2260.patch, LUCENE-2260.patch


 When working on the dynamic proxy classes using cglib/javassist I recognized 
 a problem in the caching code inside AttributeSource:
 - AttributeSource has a static (!) cache map that holds implementation 
 classes for attributes, to be faster when creating new attributes (reflection 
 cost)
 - AttributeSource has a static (!) cache map that holds a list of all 
 interfaces implemented by a specific AttributeImpl
 Also:
 - VirtualMethod in 3.1 holds a map of implementation distances keyed by 
 subclasses of the deprecated API
 Both have the problem that this strong reference is inside Lucene's 
 classloader and so persists as long as Lucene lives. The classes referenced 
 can therefore never be unloaded, which would be fine if all live in the same 
 classloader. As soon as the Attribute or implementation class or the subclass 
 of the deprecated API is loaded by a different classloader (e.g. Lucene lives 
 in the bootclasspath of Tomcat, but the Lucene consumer with custom attributes 
 lives in a webapp), they can never be unloaded, because a reference exists.
 Libs like CGLIB or JavaAssist or the JDK's reflect.Proxy have a similar cache 
 for generated class files. They also manage this with a WeakHashMap. The cache 
 will still work perfectly and no class will be evicted without reason, as 
 classes are only unloaded when the classloader goes away, and this will only 
 happen on request (e.g. by Tomcat).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832500#action_12832500
 ] 

Mark Harwood commented on LUCENE-1720:
--

I'll pick this up

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832501#action_12832501
 ] 

Shai Erera commented on LUCENE-1720:


BTW, perhaps to allow even finer granularity and precision, we should move to 
using nanoTime()? The given timeout can be in ms, but the TimeoutThread can work 
w/ nanoTime() ... what do you think?
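A tiny sketch of that split (an assumption, not patch code):
{code}
import java.util.concurrent.TimeUnit;

long timeoutMs = 10000; // public API stays in milliseconds (example value)
// the TimeoutThread tracks the deadline on the monotonic nanoTime() clock
long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
boolean timedOut = System.nanoTime() - deadlineNanos >= 0; // overflow-safe
{code}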

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2080) Improve the documentation of Version

2010-02-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2080.
-

Resolution: Fixed

Committed revision 908975.

 Improve the documentation of Version
 

 Key: LUCENE-2080
 URL: https://issues.apache.org/jira/browse/LUCENE-2080
 Project: Lucene - Java
  Issue Type: Task
  Components: Javadocs
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 2.9.2, 3.1, 3.0

 Attachments: LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch, 
 LUCENE-2080.patch


 In my opinion, we should elaborate more on the effects of changing the 
 Version parameter.
 Particularly, changing this value, even if you recompile your code, likely 
 involves reindexing your data.
 I do not think this is adequately clear from the current javadocs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: [jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Fuad Efendi

Hi Eks Dev, thanks for the comment!

I like this:
 If you need to use this, nothing is simpler: you do not even need pairwise
 comparison (aka traversal), just Index terms split into bigrams and
 search with standard Query.


And, I need to study all the tricks of Automaton...


-Fuad




 -Original Message-
 From: Eks Dev (JIRA) [mailto:j...@apache.org]
 Sent: February-11-10 5:04 AM
 To: java-dev@lucene.apache.org
 Subject: [jira] Commented: (LUCENE-2089) explore using automaton for
 fuzzyquery
 
 
 [ https://issues.apache.org/jira/browse/LUCENE-
 2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
 tabpanel&focusedCommentId=12832424#action_12832424 ]
 
 Eks Dev commented on LUCENE-2089:
 -
 
 {quote}
  What about this,
 http://www.catalysoft.com/articles/StrikeAMatch.html
 it seems logically more appropriate to (human-entered) text objects than
 Levenshtein distance, and it is (in theory) extremely fast; is DFA-
 distance faster?
 {quote}
 
 Is it only me who sees plain, vanilla bigram distance here? What is
 new or better in StrikeAMatch compared to the first phase of the current
 SpellChecker (feeding the PriorityQueue with candidates)?
 
 If you need to use this, nothing is simpler: you do not even need pairwise
 comparison (aka traversal), just index terms split into bigrams and
 search with a standard Query.
 
 
 The Automaton trick is a neat one. IMO, the only thing that would work
 better is to make the term dictionary a real trie (ternary, n-ary, DFA,
 makes no big difference). Making the TermDict some sort of trie/DFA would
 permit smart beam search, even without compiling the query DFA. Beam search
 also makes implementation of better distances possible (weighted edit
 distance without the metric constraint). I guess this is going to be
 possible with Flex; Mike was already talking about a DFA dictionary :)
 
 It took a while to figure out the trick Robert pulled here, treating the
 term dictionary as another DFA due to its sortedness. Nice.
 
  explore using automaton for fuzzyquery
  --
 
  Key: LUCENE-2089
  URL: https://issues.apache.org/jira/browse/LUCENE-2089
  Project: Lucene - Java
   Issue Type: Wish
   Components: Search
 Reporter: Robert Muir
 Assignee: Mark Miller
 Priority: Minor
  Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz,
 TestFuzzy.java
 
 
  Mark brought this up on LUCENE-1606 (i will assign this to him, I know
 he is itching to write that nasty algorithm)
  we can optimize fuzzyquery by using AutomatonTermsEnum, here is my
 idea
  * up front, calculate the maximum required K edits needed to match the
 users supplied float threshold.
  * for at least small common E up to some max K (1,2,3, etc) we should
 create a DFA for each E.
  if the required E is above our supported max, we use dumb mode at
 first (no seeking, no DFA, just brute force like now).
  As the pq fills, we swap progressively lower DFAs into the enum, based
 upon the lowest score in the pq.
  This should work well on avg, at high E, you will typically fill the
 pq very quickly since you will match many terms.
  This not only provides a mechanism to switch to more efficient DFAs
 during enumeration, but also to switch from dumb mode to smart mode.
  i modified my wildcard benchmark to generate random fuzzy queries.
  * Pattern: 7N stands for a string of seven Ns, etc.
  * AvgMS_DFA: this is the time spent creating the automaton
 (constructor)
  ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
  |7N|10|64.0|4155.9|38.6|20.3|
  |14N|10|0.0|2511.6|46.0|37.9|
  |28N|10|0.0|2506.3|93.0|86.6|
  |56N|10|0.0|2524.5|304.4|298.5|
  as you can see, this prototype is no good yet, because it creates the
 DFA in a slow way. right now it creates an NFA, and all this wasted time
 is in NFA-to-DFA conversion.
  So, for a very long string, it just gets worse and worse. This has
 nothing to do with lucene, and here you can see, the TermEnum is fast
 (AvgMS - AvgMS_DFA), there is no problem there.
  instead we should just build a DFA to begin with, maybe with this
 paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
  we can precompute the tables with that algorithm up to some reasonable
 K, and then I think we are ok.
  the paper references using
 http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if
 someone wants to implement this they should not worry about
 minimization.
  in fact, we need to at some point determine if AutomatonQuery should
 even minimize FSM's at all, or if it is simply enough for them to be
 deterministic with no transitions to dead states. (The only code that
 actually assumes minimal DFA is the Dumb vs Smart heuristic and this
 can be rewritten as a summation easily). we need to benchmark really
 complex DFAs (i.e. write a regex benchmark) to figure out if
 minimization is even helping right now.
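 
 As a sketch of the first bullet above (an assumed formula, mirroring how
 FuzzyQuery relates a similarity threshold to edit distance; not part of any
 patch here):
 
 {code}
 // Assumed: maximum number of edits K implied by a similarity threshold,
 // since similarity is roughly 1 - edits / term length.
 int maxEdits(float minimumSimilarity, int termLength) {
   return (int) ((1.0f - minimumSimilarity) * termLength);
 }
 // e.g. minimumSimilarity = 0.5 on a 6-char term allows K = 3 edits.
 {code}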
 
 --
 

FieldCacheImpl concurrency

2010-02-11 Thread Shay Banon
Hi,

   I would like to try to improve concurrency in Lucene in several places,
and thought I would start with FieldCacheImpl. The implementation is heavily
synchronized, on both a global map and on creation of values, for a pretty
heavily used path (get). I think the weak hash map is a good solution (but
also think that the ability to choose a soft hash map would add even
greater flexibility); the problem is that there is no built-in concurrent
weak/soft hash map. I have built several of these over the course of my Java
life, but now, with Google Collections, I really don't think anyone should
build anything like this anymore.

   I would like to take on this task; I first wanted to check whether it is
possible for Lucene to have a dependency on Google Collections. It is
Apache-licensed, developed in cooperation with Doug Lea, and basically
considered a valid extension of java.util, with possible inclusion of some of
the work into future JDK versions (and who knows when those will come :) ).
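
A minimal sketch of what this could look like with Google Collections' MapMaker
(the loader function below is a hypothetical stand-in for the real cache-miss
logic):

{code}
import com.google.common.base.Function;
import com.google.common.collect.MapMaker;
import java.util.concurrent.ConcurrentMap;

// Concurrent map with weakly referenced keys: entries disappear once the
// reader key is no longer strongly reachable, like WeakHashMap, but the
// map itself is safe for concurrent access without global locking.
ConcurrentMap<Object, Object> cache = new MapMaker()
    .weakKeys()
    .makeComputingMap(new Function<Object, Object>() {
      public Object apply(Object key) {
        return createEntry(key); // hypothetical cache-miss loader
      }
    });
{code}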

-shay.banon


[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are

2010-02-11 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2260:
--

Attachment: LUCENE-2260-lucene29.patch

Patch for 2.9 branch (without Java 5 generics)

 AttributeSource holds strong reference to class instances and prevents 
 unloading e.g. in Solr if webapplication reload and custom attributes in 
 separate classloaders are used (e.g. in the Solr plugins classloader)
 -

 Key: LUCENE-2260
 URL: https://issues.apache.org/jira/browse/LUCENE-2260
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9.1, 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2260-lucene29.patch, LUCENE-2260.patch, 
 LUCENE-2260.patch


 When working on the dynamic proxy classes using CGLIB/Javassist I recognized 
 a problem in the caching code inside AttributeSource:
 - AttributeSource has a static (!) cache map that holds implementation 
 classes for attributes, to be faster when creating new attributes (reflection 
 cost)
 - AttributeSource has a static (!) cache map that holds a list of all 
 interfaces implemented by a specific AttributeImpl
 Also:
 - VirtualMethod in 3.1 holds a map of implementation distances keyed by 
 subclasses of the deprecated API
 Both have the problem that this strong reference is inside Lucene's 
 classloader and so persists as long as Lucene lives. The referenced classes 
 can therefore never be unloaded, which would be fine if all lived in the same 
 classloader. As soon as the Attribute or implementation class or the subclass 
 of the deprecated API is loaded by a different classloader (e.g. Lucene lives 
 in the boot classpath of Tomcat, but a Lucene consumer with custom attributes 
 lives in a webapp), they can never be unloaded, because a reference exists.
 Libs like CGLIB or Javassist or the JDK's reflect.Proxy have a similar cache 
 for generated class files. They also manage this with a WeakHashMap. The cache 
 will always work perfectly and no class will be evicted without reason, as 
 classes are only unloaded when the classloader goes, and this will only happen 
 on request (e.g. by Tomcat).
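 
 A minimal sketch of that weak-cache direction (illustration only; the actual 
 patch may differ):
 
 {code}
 import java.lang.ref.WeakReference;
 import java.util.Collections;
 import java.util.Map;
 import java.util.WeakHashMap;
 
 // Key the cache weakly on the attribute interface class and hold the
 // implementation class only via a WeakReference, so neither side pins a
 // foreign classloader once its webapp is undeployed.
 static final Map<Class<?>, WeakReference<Class<?>>> IMPL_CACHE =
     Collections.synchronizedMap(new WeakHashMap<Class<?>, WeakReference<Class<?>>>());
 {code}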

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: FieldCacheImpl concurrency

2010-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2010 at 9:54 AM, Shay Banon kim...@gmail.com wrote:
    I would like to try and improve concurrency in Lucene in several places,
 and thought I would start with FieldCacheImpl. The implementation is heavily
 synchronized on both a global map and on creation values for a pretty
 heavily used path (get).

It really shouldn't be heavily used.
For a sorted search, get() is called once per segment in the index.
There is no synchronization to retrieve per-document values.

We have just barely moved to Java5 though, and so it would probably be
pretty easy to improve the concurrency of the read path if it did
become problematic.

 I think the weak hash map is a good solution (but
 also think that the ability to choose a soft hash map would add even
 greater flexibility); the problem is that there is no built-in concurrent
 weak/soft hash map.

It might be a better idea to have interfaces that allow one to
implement their own policies rather than to try and push all the
different options into Lucene:
https://issues.apache.org/jira/browse/LUCENE-831
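
Purely as a hypothetical illustration of that direction (none of these names
exist in Lucene today):

{code}
import java.util.concurrent.ConcurrentMap;

// Hypothetical pluggable cache policy in the spirit of LUCENE-831: Lucene
// asks the policy for a map and stays agnostic about weak vs. soft
// references or the concrete concurrent implementation behind it.
public interface CachePolicy {
  <K, V> ConcurrentMap<K, V> newCacheMap();
}
{code}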

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Robert Muir (JIRA)
configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
-

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1


MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority queue 
to expand the query to the top-N terms.

currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
query to at most only the 50 closest terms.

at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
(it is private right now) and add a ctor to it, so a MultiTermQuery can 
instantiate one with its own limit.
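
As a rough sketch of what the resulting API could look like (names and the ctor
are assumed from the description above, since the rewrite class is still
private today):

{code}
// Hypothetical: a rewrite instance carrying its own pq size, so a
// MultiTermQuery caps expansion at the 50 best terms instead of
// BooleanQuery.getMaxClauseCount().
MultiTermQuery query = new FuzzyQuery(new Term("name", "lucene"));
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanRewrite(50));
{code}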

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: Lucene-1720.patch

Moved ATM to the o.a.l.util package
Added an isProjectedToTimeout method to ATM and a corresponding JUnit test
Removed the TreeMap comments

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited, rather than just single searches, 
 e.g. the document retrieval phase.
 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, 
 before the last collect stage of query processing)
 Uses a new utility timeout class that is independent of IndexReader.
 The initial contribution includes a performance test class, but I have not had 
 time as yet to work up a formal JUnit test.
 TimeLimitedIndexReader is coded as JDK 1.5 but this can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2261:


Attachment: LUCENE-2261.patch

attached is a patch. i changed FuzzyQuery's pq test to use this so it does not 
have to do the try-finally dance with BooleanQuery.maxClauseCount

 configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
 -

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2261.patch


 MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority 
 queue to expand the query to the top-N terms.
 currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
 nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
 query to at most only the 50 closest terms.
 at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
 (it is private right now) and add a ctor to it, so a MultiTermQuery can 
 instantiate one with its own limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832607#action_12832607
 ] 

Uwe Schindler commented on LUCENE-2261:
---

Patch looks good; some things because of Serializable:
- The readResolve method must go to the singleton constant, which should also 
throw UOE when modified
- equals/hashCode is needed for the rewrite mode, else FuzzyQuery & Co would no 
longer compare equal

It could be solved by doing it like AutoRewrite and its unmodifiable constant. 
I know: Queries are a pain because of Serializable.

+1 on adding a param to the FuzzyQuery ctor
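
Roughly, the Serializable plumbing being asked for looks like this (a sketch,
not the patch itself; the constant and field names are assumed):

{code}
// On the singleton constant: keep deserialization from duplicating it.
private Object readResolve() {
  return TOP_TERMS_SCORING_BOOLEAN_REWRITE; // assumed constant name
}

// On the configurable variant: equals/hashCode over the pq size so two
// FuzzyQuery instances with the same limit still compare equal.
@Override
public int hashCode() {
  return 31 * size;
}

@Override
public boolean equals(Object obj) {
  if (this == obj) return true;
  if (obj == null || getClass() != obj.getClass()) return false;
  return size == ((TopTermsScoringBooleanRewrite) obj).size;
}
{code}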

 configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
 -

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2261.patch


 MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority 
 queue to expand the query to the top-N terms.
 currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
 nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
 query to at most only the 50 closest terms.
 at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
 (it is private right now) and add a ctor to it, so a MultiTermQuery can 
 instantiate one with its own limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2261:


Attachment: LUCENE-2261.patch

the previous patch was wrong for readResolve, here is a fix.


 configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
 -

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2261.patch, LUCENE-2261.patch


 MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority 
 queue to expand the query to the top-N terms.
 currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
 nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
 query to at most only the 50 closest terms.
 at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
 (it is private right now) and add a ctor to it, so a MultiTermQuery can 
 instantiate one with its own limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832638#action_12832638
 ] 

Jason Rutherglen commented on LUCENE-1720:
--

When's this ready to test with Solr?

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited, rather than just single searches, 
 e.g. the document retrieval phase.
 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, 
 before the last collect stage of query processing)
 Uses a new utility timeout class that is independent of IndexReader.
 The initial contribution includes a performance test class, but I have not had 
 time as yet to work up a formal JUnit test.
 TimeLimitedIndexReader is coded as JDK 1.5 but this can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2261:


Attachment: LUCENE-2261.patch

here is a patch with uwe's suggestions. now that fuzzyquery supports the param 
via ctor, the singleton TopTerms is no longer used so i removed it.

 configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
 -

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2261.patch, LUCENE-2261.patch, LUCENE-2261.patch


 MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority 
 queue to expand the query to the top-N terms.
 currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
 nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
 query to at most only the 50 closest terms.
 at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
 (it is private right now) and add a ctor to it, so a MultiTermQuery can 
 instantiate one with its own limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2261:


Attachment: LUCENE-2261.patch

sorry, i created a javadoc warning (the TOP_TERMS singleton was referenced at 
the top of the MTQ javadocs). here is the fix

 configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
 -

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2261.patch, LUCENE-2261.patch, LUCENE-2261.patch, 
 LUCENE-2261.patch


 MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority 
 queue to expand the query to the top-N terms.
 currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
 nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
 query to at most only the 50 closest terms.
 at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
 (it is private right now) and add a ctor to it, so a MultiTermQuery can 
 instantiate one with its own limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size

2010-02-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832683#action_12832683
 ] 

Uwe Schindler commented on LUCENE-2261:
---

Looks good at first glance. I have not tried the patch; will do so soon.

 configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
 -

 Key: LUCENE-2261
 URL: https://issues.apache.org/jira/browse/LUCENE-2261
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Priority: Minor
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2261.patch, LUCENE-2261.patch, LUCENE-2261.patch, 
 LUCENE-2261.patch


 MultiTermQuery has a TopTermsScoringBooleanRewrite, that uses a priority 
 queue to expand the query to the top-N terms.
 currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be 
 nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy 
 query to at most only the 50 closest terms.
 at a glance it seems one way would be to expose TopTermsScoringBooleanRewrite 
 (it is private right now) and add a ctor to it, so a MultiTermQuery can 
 instantiate one with its own limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832687#action_12832687
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. Imo, the only thing that would work better is to make the term dictionary a 
real trie (ternary, n-ary, DFA, makes no big difference).

yeah I think a real DFA would be more efficient, but it's too much to think 
about. I guess this is why we leave this kind of thing to Mike :)

I haven't grasped how this beam search would work with a DFA dictionary, 
although it definitely sounds cool. 
I assume you mean by weighted edit distance that the transitions in the state 
machine would have costs?
If this is the case, couldn't we even define standard Levenshtein very easily 
(instead of the nasty math), and would the beam search technique enumerate 
efficiently for us?


 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for a string of seven Ns, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-to-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-11 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832689#action_12832689
 ] 

Tom Burton-West commented on LUCENE-2257:
-

Hi Michael,

Thanks for your help. We mounted one of the indexes with 2.4 billion terms on 
our dev server and tested with and without the patch. (I discovered that 
queries containing Korean characters would consistently trigger the bug).   
With the patch, we don't see any ArrayIndexOutOfBounds exceptions.  We are 
going to do a bit more testing before we put this into production. (We rolled 
back our production indexes temporarily to indexes that split the terms over 2 
segments and therefore didn't trigger the bug).

Other than walking through the code in the debugger, is there some systematic 
way of looking for any other places where an int is used that might also have 
problems when we have over 2.1 billion terms?

Tom

 relax the per-segment max unique term limit
 ---

 Key: LUCENE-2257
 URL: https://issues.apache.org/jira/browse/LUCENE-2257
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2257.patch


 Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms 
 in a single segment.
 But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
 There is one place (internal API only) where Lucene uses an int but should 
 use a long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832698#action_12832698
 ] 

Robert Muir commented on LUCENE-2257:
-

bq. (I discovered that queries containing Korean characters would consistently 
trigger the bug).

this makes sense because Hangul is sorted towards the end of the term dictionary

you can see this visually here: http://unicode.org/roadmaps/bmp/


 relax the per-segment max unique term limit
 ---

 Key: LUCENE-2257
 URL: https://issues.apache.org/jira/browse/LUCENE-2257
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2257.patch


 Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms 
 in a single segment.
 But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
 There is one place (internal API only) where Lucene uses an int but should 
 use a long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832721#action_12832721
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. When's this ready to test with Solr?

I think the API is pretty stable - call try..start..finally...stop around 
time-critical stuff and use a TimeLimitedIndexReader to wrap your IndexReader.

Internally the implementation feels reasonably stable too.

In my tests it doesn't seem to add too much overhead to calls -  I was getting 
response times of 3400 milliseconds on a heavy wikipedia query with 
TimeLimitedIndexReader versus 3300 for the same query on a raw IndexReader 
without timeout protection.

I'm tempted to try putting the timeout check calls directly into a version of 
IndexReader, rather than in a delegating reader wrapper, just to see if the 
wrapper code is where the bulk of the extra overhead comes in. I'd hate to add 
any overhead to the core IndexReader, but I'm keen to see just how low-cost this 
check can get.
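
In code, the pattern looks roughly like this (method names inferred from the
attached classes; treat the signatures as assumptions):

{code}
// Sketch only: wrap the reader, then bound any activity by bracketing the
// time-critical section with start/stop.
IndexReader reader = new TimeLimitedIndexReader(IndexReader.open(dir));
ActivityTimeMonitor.start(5000); // assumed: time budget in milliseconds
try {
  // searches, document retrieval, etc. against the wrapped reader
} finally {
  ActivityTimeMonitor.stop();
}
{code}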

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited, rather than just single searches, 
 e.g. the document retrieval phase.
 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, 
 before the last collect stage of query processing)
 Uses a new utility timeout class that is independent of IndexReader.
 The initial contribution includes a performance test class, but I have not had 
 time as yet to work up a formal JUnit test.
 TimeLimitedIndexReader is coded as JDK 1.5 but this can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2257:
---

Attachment: LUCENE-2257.patch

 relax the per-segment max unique term limit
 ---

 Key: LUCENE-2257
 URL: https://issues.apache.org/jira/browse/LUCENE-2257
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2257.patch, LUCENE-2257.patch


 Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms 
 in a single segment.
 But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
 There is one place (internal API only) where Lucene uses an int but should 
 use a long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832729#action_12832729
 ] 

Michael McCandless commented on LUCENE-2257:


bq. With the patch, we don't see any ArrayIndexOutOfBounds exceptions.

Great!  And the results look correct?

bq. Other than walking through the code in the debugger, is there some 
systematic way of looking for any other places where an int is used that might 
also have problems when we have over 2.1 billion terms?

Not that I know of!  The code that handles the term dict lookup is
fairly contained, in TermInfosReader and SegmentTermEnum.  I think
scrutinizing the code and testing (as you're doing) is the only way.

I just looked again -- there are a few places where int is still being used.

First, two methods (get(int position) and scanEnum) in
TermInfosReader are actually dead code (package private &
unused).  Second, SegmentTermEnum.scanTo takes an int, but this is fine
because it's never asked to scan more than termIndexInterval terms.

I'll attach a patch that additionally just removes that dead code.
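
To illustrate the class of bug (an invented example, not the actual patch): the
term ordinal has to be computed in 64 bits once position * termIndexInterval
can exceed 2^31.

{code}
// With ~2.5 billion terms an int product silently wraps negative:
int indexOffset = 20000000;        // position into the term index
int termIndexInterval = 128;
int badOrd = indexOffset * termIndexInterval;           // overflows int
long goodOrd = (long) indexOffset * termIndexInterval;  // 2,560,000,000
{code}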


 relax the per-segment max unique term limit
 ---

 Key: LUCENE-2257
 URL: https://issues.apache.org/jira/browse/LUCENE-2257
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2257.patch, LUCENE-2257.patch


 Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms 
 in a single segment.
 But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
 There is one place (internal API only) where Lucene uses an int but should 
 use a long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832741#action_12832741
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
I assume you mean by weighted edit distance that the transitions in the state 
machine would have costs?
{quote}

Yes, kind of, not embedded in the trie, just defined externally.

What I am talking about is part of the noisy channel approach, modeling only 
the channel distribution. Have a look at http://norvig.com/spell-correct.html 
for the basic theory. I am suggesting almost the same, just applied at the 
character level and without the language model part. It is rather easy once you 
have your dictionary in some sort of tree structure.


You guide your traversal over the trie by iterating on each char in your 
search term, accumulating log probabilities of single transformations 
(recycling the prefix part). When you hit a leaf, insert into a PriorityQueue 
of appropriate depth. The probabilities of single transformations are 
defined as:
insertion(char a) // map char -> log probability (think of it as the cost of 
inserting this particular character)
deletion(char a) // map char -> log probability...
transposition(char a, char b)
replacement(char a, char b) // 2D matrix (char, char) -> log probability (cost)
If you wish, you could even add some positional information, boosting matches 
on the start/end of the string.

I skipped the tricky mechanics of traversal for insertion and deletion, but on 
a trie you can do it by following different paths... 

The only good in-memory implementation around that I know of is in the LingPipe 
spell checker (they implement the full noisy channel, with the language model 
driving traversal)... it has huge educational value; Bob is really great at 
explaining things. The code itself is proprietary. 
I would suggest you peek into this code to see the 2-minute rambling I wrote 
here properly explained :) Just ignore the language model part, assume you have 
a NULL language model (all chars in the language are equally probable), and do 
a full traversal over the trie. 

{quote}
If this is the case, couldn't we even define standard Levenshtein very easily 
(instead of the nasty math), and would the beam search technique enumerate 
efficiently for us?
{quote}
 Standard Lev. is trivially configured once you have this; it is just setting 
all these costs to 1 (delete, insert... in the log domain)... But who would use 
the standard distance with such a beast, reducing the impact of 
inserting/deleting a silent 'h' as in Thomas / Tomas... 
Enumeration is trie traversal, practically calculating the distance against all 
terms at the same time and collecting the N best along the way. The place where 
you save time is recycling the prefix part of this calculation. Enumeration is 
optimal, as the trie contains only the terms from the term dict; you are not 
trying all possible alphabet characters, and you can implement early path 
abandoning easily, either by cost (log probability) and/or by limiting the 
number of successive insertions.

If you are interested in really in-depth material, look at 
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198
Great book (another great tip from b...@lingpipe). A bit strange with its 
terminology (at least to me), but once you get used to it, it is really worth 
the time you spend trying to grasp it.
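
A tiny sketch of those cost tables (all costs invented for illustration):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-operation costs in the log domain; a trie traversal
// would accumulate these instead of the uniform cost 1 of standard
// Levenshtein distance.
Map<Character, Double> insertCost = new HashMap<Character, Double>();
Map<Character, Double> deleteCost = new HashMap<Character, Double>();
Map<String, Double> replaceCost = new HashMap<String, Double>();

insertCost.put('h', 0.2);    // cheap: silent 'h' (Thomas ~ Tomas)
deleteCost.put('h', 0.2);
replaceCost.put("sz", 0.3);  // 's' and 'z' are easily confused
// Anything not listed falls back to a default cost of 1.0, which
// recovers plain Levenshtein distance.
{code}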




 

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for a string of seven Ns, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Aaron McCurry (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832773#action_12832773
 ] 

Aaron McCurry commented on LUCENE-2089:
---

I have written a Levenshtein generator today that seems to operate similarly to 
what is being discussed here.  It generates all the possible matches for the 
Levenshtein algorithm given a term and a character set, then creates a 
BooleanQuery from them.  For a given term with an edit distance of 1 or 2 it is 
extremely fast.  I tested it on my dev data, which has about 8 billion documents 
across 20 shards; each shard has about 170,000,000 terms in the field that I'm 
testing.  The normal fuzzy query with a term length of 8 and an edit distance 
of 2 took about 110 seconds to complete, and the auto-generated query took 
around 1.5 seconds to complete.  However this method will probably only work 
with edit distances in the 1 and 2 range, because once I hit 3 it spiked the 
memory usage and slowed way down (to be expected).  Not sure if you all want to 
take a look at my code or not, but if you want me to post it I will.
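
For reference, an edit-distance-1 neighborhood generator over a restricted
alphabet looks roughly like this (a sketch in the spirit of the description
above, not Aaron's code):

{code}
import java.util.HashSet;
import java.util.Set;

// All strings within edit distance 1 of term over a small alphabet; each
// result would become one SHOULD clause of the generated BooleanQuery.
static Set<String> neighborhood(String term, char[] alphabet) {
  Set<String> out = new HashSet<String>();
  for (int i = 0; i < term.length(); i++) {            // deletions
    out.add(term.substring(0, i) + term.substring(i + 1));
  }
  for (int i = 0; i <= term.length(); i++) {           // insertions
    for (char c : alphabet) {
      out.add(term.substring(0, i) + c + term.substring(i));
    }
  }
  for (int i = 0; i < term.length(); i++) {            // substitutions
    for (char c : alphabet) {
      out.add(term.substring(0, i) + c + term.substring(i + 1));
    }
  }
  return out;
}
{code}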

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for a string of seven Ns, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-to-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832785#action_12832785
 ] 

Robert Muir commented on LUCENE-2089:
-

Aaron i think generation may pose a problem for a full unicode alphabet.

e.g. edit distance of 1 on "otter" is 720,891 strings... this is a large 
boolean query!

so this is where the ping-ponging of AutomatonQuery really helps: we don't have 
to do 720,891 term lookups, because the term dictionary is in sorted order, so 
it's treated like DFA intersection.

btw, you can test this patch with your 170M terms by applying this patch to the 
flex branch, and using the following code.

{code}
LevenshteinAutomata a = new LevenshteinAutomata("otter"); // searching on "otter"
Automaton lev1 = a.toAutomaton(1); // edit distance of 1
Query query = new AutomatonQuery(new Term(field, "bogus"), lev1);
...
{code}


 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for a string of seven Ns, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-to-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832789#action_12832789
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. But who would use the standard distance with such a beast, reducing the 
impact of inserting/deleting a silent 'h' as in Thomas / Tomas...

hehe, well my only goal with this issue (and really automaton in general) is to 
speed things up in a backwards-compatible way, so I am stuck with the standard 
distance for that purpose.

but the more intelligent stuff you speak of could be really cool, esp. for 
spellchecking. sure you don't want to rewrite our spellchecker?

btw it's not clear to me yet: could you implement that stuff on top of the 
ghetto DFA (the sorted terms dict we have now), or is something more 
sophisticated needed? it's a lot easier to write this stuff now with the flex 
MTQ APIs :)


 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for a string of seven Ns, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-to-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Aaron McCurry (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832794#action_12832794
 ] 

Aaron McCurry commented on LUCENE-2089:
---

Excuse my ignorance, but what is a DFA?

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for a string of seven Ns, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-to-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Aaron McCurry (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832798#action_12832798
 ] 

Aaron McCurry commented on LUCENE-2089:
---

{quote}
Aaron i think generation may pose a problem for a full unicode alphabet.
e.g. edit distance of 1 on "otter" is 720,891 strings... this is a large 
boolean query!
{quote}

So I have to ask: who has the entire unicode alphabet indexed in a single 
field?  I don't, and I am willing to make tradeoffs in my system for 
performance.  For instance, I am willing to code a mapping file or scan my 
index on startup to learn the character set (a-z 0-9 or whatever) to reduce the 
possibilities.  This is a simple "know your data" problem, at least IMHO.

Aaron




[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Aaron McCurry (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832798#action_12832798
 ] 

Aaron McCurry edited comment on LUCENE-2089 at 2/12/10 2:22 AM:


{quote}
Aaron i think generation may pose a problem for a full unicode alphabet.
e.g. edit distance of 1 on otter is 720,891 strings... this is a large 
boolean query!
{quote}

So I have to ask: who has the entire unicode alphabet indexed in a single 
field?  I don't, and I am willing to make tradeoffs in my system for 
performance.  For instance, I am willing to code a mapping file or scan my 
index on startup to learn the character set (a-z, 0-9, or whatever) to reduce 
the possibilities.  This is a simple know-your-data problem, at least IMHO.

Aaron

  was (Author: amccurry):
{quote}
Aaron i think generation may pose a problem for a full unicode alphabet.
e.g. edit distance of 1 on otter is 720,891 strings... this is a large 
boolean query!
{quote}

So I have to ask, who has the entire unicode alphabet indexed in a single 
field?  I don't, and I am willing to make tradeoffs in my system for 
performance.  So for instance, I am willing to code a mapping file or scan my 
index on startup to know the character set (a-z 0-9 or whatever) to reduce the 
possibilities.  This is a simple know your data problem, at least in IMHO.

Aaaron
  



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832814#action_12832814
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. So I have to ask, who has the entire unicode alphabet indexed in a single 
field? I don't, and I am willing to make tradeoffs in my system for performance.

Aaron, would you mind testing this patch with the flex branch (with the sample 
code I listed before)? There is no tradeoff needed to support Unicode.
If you don't have Chinese characters in your index, the enum will not seek to 
them but will skip past them, since they do not exist, and Term(s)Enum always 
returns terms in sorted order.

For more details, see AutomatonTermsEnum in flex (this is the code that 
actually does the DFA-to-DFA intersection of any Automaton with Lucene's term 
dictionary).
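
That "skip past them" behavior is just the sorted intersection of the
automaton with the term dictionary. A minimal sketch of that loop follows;
TermDict, Dfa and their methods are hypothetical stand-ins, not the flex API:

{code}
// Sketch: intersect a DFA with a sorted term dictionary. Terms the
// automaton cannot match are never visited; we seek straight to the
// next accepted string instead.
final class IntersectSketch {
  interface TermDict {
    String seek(String target);  // first term >= target, or null at end
    String next();               // following term, or null at end
  }

  interface Dfa {
    boolean accepts(String s);
    String nextAcceptedString(String s);  // smallest accepted string > s, or null
  }

  static java.util.List<String> intersect(TermDict dict, Dfa dfa) {
    java.util.List<String> hits = new java.util.ArrayList<String>();
    String term = dict.seek("");             // start at the first term
    while (term != null) {
      if (dfa.accepts(term)) {
        hits.add(term);                      // a real match: collect it
        term = dict.next();
      } else {
        String target = dfa.nextAcceptedString(term);
        if (target == null) break;           // automaton exhausted
        term = dict.seek(target);            // skip every term in between
      }
    }
    return hits;
  }
}
{code}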





[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832817#action_12832817
 ] 

Robert Muir commented on LUCENE-2089:
-

bq. Excuse my ignorance, but what is a DFA? 

Here is the Wikipedia page with a description: 
http://en.wikipedia.org/wiki/Deterministic_finite-state_machine
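
In code, a DFA is simply a set of states with exactly one transition per
(state, input symbol) pair and no backtracking. A toy example (not Lucene
code) that accepts strings over {a, b} containing an even number of b's:

{code}
// Toy DFA with two states. State 0 (accepting) = even number of 'b's
// seen so far; state 1 = odd.
static boolean accepts(String input) {
  int state = 0;
  for (char c : input.toCharArray()) {
    if (c == 'b') {
      state = 1 - state;  // flip parity
    } else if (c != 'a') {
      return false;       // symbol outside the alphabet: reject
    }
    // 'a' leaves the state unchanged
  }
  return state == 0;      // accept iff we end in the accepting state
}
{code}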





[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util

2010-02-11 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-1990:
---

Attachment: LUCENE-1990-te20100212.patch

I've read through the comments on LUCENE-1990 and implemented most of what has 
been suggested. The attached patch contains implementations for all the 
variants we've talked about, including the aligned ones. There's a known bug 
in persistence for aligned64 (and probably also for aligned32) that I haven't 
stomped yet. There's also a clear need for a more elaborate unit test with 
regard to persistence.

Other outstanding issues, as I see them, are whether or not mutable packed 
arrays should be requestable (as general-purpose data structures) and how the 
factory for creating a writer should work. I have added a getMutable method to 
the factory and left the return type of the getReader method as Reader, so 
that read-only users will not be tempted to try to update the received 
structure. As for the arguments to the factory, Michael McCandless suggested 
that the preferences should be expressed with (packed | aligned32 | aligned64 
| auto). As far as I can see, this should work. However, I've only just 
reached this conclusion and haven't had the time to implement it.
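
For context on what the packed (non-aligned) variants do under the hood: each
value occupies a fixed number of bits in a long[] backing array and may
straddle a 64-bit block boundary. A minimal sketch of that layout (class and
method names are illustrative, not the patch's API):

{code}
// Packed64-style storage sketch: fixed-width values packed back to back.
public final class PackedSketch {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long maskRight;

  public PackedSketch(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.maskRight = ~0L >>> (64 - bitsPerValue);
    this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
  }

  public long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);   // bitPos / 64
    int offset = (int) (bitPos & 63);   // bitPos % 64
    long value = blocks[block] >>> offset;
    int read = 64 - offset;
    if (read < bitsPerValue) {          // value straddles two blocks
      value |= blocks[block + 1] << read;
    }
    return value & maskRight;
  }

  public void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int offset = (int) (bitPos & 63);
    blocks[block] = (blocks[block] & ~(maskRight << offset)) | (value << offset);
    int written = 64 - offset;
    if (written < bitsPerValue) {       // spill the remaining high bits
      blocks[block + 1] = (blocks[block + 1] & ~(maskRight >>> written))
          | (value >>> written);
    }
  }
}
{code}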

A speed-test has been added and the results from my machine can be seen below. 
In order for it to be really usable, it should be tried on other machines too.

I won't touch the code before sometime next week, but I'll keep an eye on 
LUCENE-1990 comments until then.

{code}
bitsPerValue valueCount getCount  PackedDirectByte PackedDirectShort Packed32 PackedAligned32 PackedDirectInt Packed64 PackedAligned64 PackedDirectLong
   110001000                                   167               141      258             242             172      264             242              183
   1 1001000                                   224               232      266             233             246      262             238              338
   110001000                                   359               469      280             278             508      278             272              551
   310001000                                   168               166      265             241             163      262             243              166
   3 1001000                                   227               226      261             251             239      274             249              330
   310001000                                   406               476      301             304             522      300             308              547
   410001000                                   167               168      266             239             164      285             239              169
   4 1001000                                   228               231      294             274             262      291             269              314
   410001000                                   385               480      308             333             514      331             315              557
   710001000                                   172               174      278             248             162      271             238              177
   7 1001000                                   224               236      289             281             272      278             277              345
   710001000                                   405               473      389             447             516      399             402              553
   810001000                                   192               171      268             242             174      291             240              163
   8 1001000                                   226               232      291             284             286      274             265              314
   81000

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Aaron McCurry (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832819#action_12832819
 ] 

Aaron McCurry commented on LUCENE-2089:
---

Sure, I will give it a try; I'm still wrapping my head around the flex branch.




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832822#action_12832822
 ] 

Robert Muir commented on LUCENE-2089:
-

Thanks - this will give you something to play with wrt DFA queries. 
In flex, regex and wildcard queries are already implemented this way.

One additional complexity that this code will not handle (until some other 
issues are resolved and I update the patch) is the pq trick.

This is the fact that FuzzyQuery (or really anything else that uses 
TopTermsRewrite) is really an n-best list, and we shouldn't seek to terms that 
will simply be overflowed off the priority queue anyway, as that is just a 
waste of time.

This is the part I am working on now; most of it is already done under 
LUCENE-2140, and some setup is now under LUCENE-2261.
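
The n-best cutoff itself is simple; a minimal sketch (illustrative names, not
TopTermsRewrite's actual code):

{code}
import java.util.PriorityQueue;

// Sketch of the pq trick: once the queue holds n terms, a candidate that
// cannot beat the current worst score would just overflow off the queue,
// so it is not worth seeking to at all.
final class NBestSketch {
  static final class ScoredTerm implements Comparable<ScoredTerm> {
    final String term;
    final float score;
    ScoredTerm(String term, float score) { this.term = term; this.score = score; }
    public int compareTo(ScoredTerm other) {
      return Float.compare(this.score, other.score);  // min-heap on score
    }
  }

  static void offer(PriorityQueue<ScoredTerm> pq, int n, ScoredTerm candidate) {
    if (pq.size() < n) {
      pq.add(candidate);
    } else if (candidate.score > pq.peek().score) {   // peek() = current worst
      pq.poll();                                      // evict the worst
      pq.add(candidate);
    }
    // else: skip the candidate entirely; seeking to it is wasted time
  }
}
{code}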





Hudson build is back to normal : Lucene-trunk #1092

2010-02-11 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1092/changes






[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832858#action_12832858
 ] 

Shai Erera commented on LUCENE-1720:


Thanks Mark. I will take a look at the patch a bit later. TestActivityMonitor 
should move to o.a.l.util as well; I can do that.
I also want to experiment with using ConcurrentHashMap, as most of the time 
the map is just checked and not modified. I'll see if it removes some of the 
synchronization. The synchronization is needed because of TimeoutThread, which 
only checks the map ... if it's possible, I believe it will improve the 
performance of ATM.

About adding it to IndexReader directly - it'd be interesting to see if it 
affects performance in any way, but I wouldn't want to see it directly in IR: 
if I don't want to limit anything, I don't want an 'if' everywhere in IR, or a 
call to ATM ... still, it'd be interesting to see if it improves anything.

I also want to add a TestTimeLimitedIndexReader.

BTW, about the name - I think TimeLimitingIndexReader is more accurate, since 
it's not the IR which is time-limited, but the operations it performs. What do 
you think? (We would also then follow TimeLimitingCollector.) ...
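
A rough sketch of the ConcurrentHashMap idea (names are illustrative only, not
the ActivityTimeMonitor code from the patch):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-thread deadlines in a ConcurrentHashMap: the hot start/stop path
// needs no global lock, and the watchdog thread only reads the map.
final class TimeoutMonitorSketch {
  private static final ConcurrentHashMap<Thread, Long> DEADLINES =
      new ConcurrentHashMap<Thread, Long>();

  static void start(long timeoutMillis) {
    DEADLINES.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
  }

  static void stop() {
    DEADLINES.remove(Thread.currentThread());
  }

  // Called periodically by the watchdog (TimeoutThread-style) thread.
  static void checkForTimeouts() {
    long now = System.currentTimeMillis();
    for (Map.Entry<Thread, Long> e : DEADLINES.entrySet()) {
      if (e.getValue() <= now) {
        e.getKey().interrupt();  // signal the timed-out activity
      }
    }
  }
}
{code}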

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited, rather than just single searches, 
 e.g. the document-retrieval phase.
 2) Times out faster (i.e. runaway queries such as fuzzies are detected 
 quickly, before the last collect stage of query processing).
 It uses a new utility timeout class that is independent of IndexReader.
 The initial contribution includes a performance test class, but I have not 
 yet had time to work up a formal JUnit test.
 TimeLimitedIndexReader is coded as JDK 1.5, but this can easily be undone.




Re: FieldCacheImpl concurrency

2010-02-11 Thread Shay Banon
On Thu, Feb 11, 2010 at 5:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Feb 11, 2010 at 9:54 AM, Shay Banon kim...@gmail.com wrote:
 I would like to try and improve concurrency in Lucene in several
 places,
  and thought I would start with FieldCacheImpl. The implementation is
 heavily
  synchronized on both a global map and on creation values for a pretty
  heavily used path (get).

 It really shouldn't be heavily used.
 For a sorted search, get() is called once per segment in the index.
 There is no synchronization to retrieve per-document values.


Sorting is not the only reason why FieldCache would be used...



 We have just barely moved to Java5 though, and so it would probably be
 pretty easy to improve the concurrency of the read path if it did
 become problematic.


I know you just moved to Java 5, and my first email talked about the
concurrency aspect.



  I think the weak hash map is a good solution (but
  also think that the ability to choose to use soft hash map would add even
  greater flexibility), the problem is that there is no built in concurrent
  weak/soft hash map.

 It might be a better idea to have interfaces that allow one to
 implement their own policies rather than to try and push all the
 different options into Lucene:
 https://issues.apache.org/jira/browse/LUCENE-831


I think that trying to provide an extension point that either syncs or not is 
a pain (it will be very convoluted) and often hinders the optimizations that 
can be done in concurrent systems... Just look at what is done in 
FieldCacheImpl with the creation marker, which is not really needed with 
Google Collections (it allows a single entry to be created atomically).
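
That "single entry created atomically" behavior is essentially the classic
Java 5 memoizer pattern; a minimal sketch (after Goetz's "Java Concurrency in
Practice", and not FieldCacheImpl's actual code):

{code}
import java.util.concurrent.*;

// Memoizer: readers never block each other, and each key's value is
// computed at most once, even under concurrent first access.
final class Memoizer<K, V> {
  private final ConcurrentMap<K, FutureTask<V>> cache =
      new ConcurrentHashMap<K, FutureTask<V>>();

  V get(K key, Callable<V> loader) throws Exception {
    FutureTask<V> task = cache.get(key);
    if (task == null) {
      FutureTask<V> created = new FutureTask<V>(loader);
      task = cache.putIfAbsent(key, created);  // atomic: only one wins
      if (task == null) {
        task = created;
        task.run();                            // the winner computes the value
      }
    }
    return task.get();                         // everyone else waits for it
  }
}
{code}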


 -Yonik
 http://www.lucidimagination.com
