[jira] [Created] (LUCENE-3968) Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens
Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens

Key: LUCENE-3968
URL: https://issues.apache.org/jira/browse/LUCENE-3968
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0

MockGraphTokenFilter is rather hairy... I've managed to simplify it (I think!) by breaking apart its two functions... I think LookaheadTokenFilter can be used in the future for other graph-aware filters.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3966) smokeTestRelease should accept a local (file://) staging URL
smokeTestRelease should accept a local (file://) staging URL

Key: LUCENE-3966
URL: https://issues.apache.org/jira/browse/LUCENE-3966
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless

I'll also fix buildAndPushRelease so it can push to a local URL; this way at any time we can build, push to local staging, and run the smoke tester on it, and hopefully nothing fails... But really any tests in the smoke tester should ideally be pushed back earlier in our dev process (into jenkins, into ant test).
[jira] [Created] (LUCENE-3942) SynonymFilter should set pos length att
SynonymFilter should set pos length att

Key: LUCENE-3942
URL: https://issues.apache.org/jira/browse/LUCENE-3942
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0

Tokenizers/Filters can now produce graphs instead of a single linear chain of tokens, by setting the PositionLengthAttribute, expressing where (how many positions ahead) this token ends. The default is 1, meaning it ends at the next position, to be backwards compatible. SynonymFilter produces graph output tokens, as long as the output is a single token, but currently never sets the pos length to express this. EG for the rule wifi network -> hotspot, the hotspot token should have pos length = 2. With LUCENE-3940 this will allow us to verify that the offsets for such tokens are correct...
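The bookkeeping described above can be illustrated with a tiny standalone model (hypothetical types, not Lucene's attribute API): the synonym output starts at the same position as the first input token (position increment 0) and spans two positions (position length 2).

```java
import java.util.List;

// Toy model (NOT Lucene's attribute API) of the token graph for the rule
// "wifi network -> hotspot": posInc says where a token starts relative to
// the previous token, posLen says how many positions it spans.
public class PosLengthDemo {
    record Token(String term, int posInc, int posLen) {}

    static List<Token> wifiNetworkGraph() {
        return List.of(
            new Token("wifi", 1, 1),    // starts a new position (0), spans 1
            new Token("hotspot", 0, 2), // same position as "wifi", spans 2 positions
            new Token("network", 1, 1)  // next position (1), spans 1
        );
    }

    public static void main(String[] args) {
        for (Token t : wifiNetworkGraph()) {
            System.out.println(t.term() + " posInc=" + t.posInc() + " posLen=" + t.posLen());
        }
    }
}
```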
[jira] [Created] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Key: LUCENE-3940
URL: https://issues.apache.org/jira/browse/LUCENE-3940
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 4.0

I modified BaseTokenStreamTestCase to assert that the start/end offsets match for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the decompounding of a compound token has a punctuation token that's dropped. In this case we should leave hole(s) so that the graph is intact, ie, the graph should look the same as if the punctuation tokens were not initially removed, but then a StopFilter had removed them. This also affects tokens that have no compound over them, ie we fail to leave a hole today when we remove the punctuation tokens. I'm not sure this is serious enough to warrant fixing in 3.6 at the last minute...
[jira] [Created] (LUCENE-3912) Improved the checked-in tiny line file docs
Improved the checked-in tiny line file docs

Key: LUCENE-3912
URL: https://issues.apache.org/jira/browse/LUCENE-3912
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0

I think it may not have any surrogate pairs (it was derived from Europarl).
[jira] [Created] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
HTMLStripCharFilter produces invalid final offset

Key: LUCENE-3913
URL: https://issues.apache.org/jira/browse/LUCENE-3913
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0

Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.
[jira] [Created] (LUCENE-3905) BaseTokenStreamTestCase should test analyzers on real-ish content
BaseTokenStreamTestCase should test analyzers on real-ish content

Key: LUCENE-3905
URL: https://issues.apache.org/jira/browse/LUCENE-3905
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless

We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well.
[jira] [Created] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
Improve the Edge/NGramTokenizer/Filters

Key: LUCENE-3907
URL: https://issues.apache.org/jira/browse/LUCENE-3907
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0

Our ngram tokenizers/filters could use some love. EG, they output ngrams in multiple passes, instead of stacked, which messes up offsets/positions and requires too much buffering (can hit OOME for long tokens). They clip at 1024 chars (tokenizers) but the token filters don't. They split up surrogate pairs incorrectly.
[jira] [Created] (LUCENE-3890) GroupFacetCollectorTest nightly build failure
GroupFacetCollectorTest nightly build failure

Key: LUCENE-3890
URL: https://issues.apache.org/jira/browse/LUCENE-3890
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 4.0

Failure from nightly build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/2022/testReport/junit/org.apache.lucene.search.grouping/GroupFacetCollectorTest/testRandom/

It reproduces for me with:

{noformat}
ant test -Dtestcase=GroupFacetCollectorTest -Dtestmethod=testRandom -Dtests.seed=7d227aa075b7bfb8:550d2a0828ce2537:-3553c99f6a4d293e -Dtests.multiplier=3 -Dargs="-Dfile.encoding=US-ASCII"
{noformat}
[jira] [Created] (LUCENE-3891) Documents loaded at search time (IndexReader.document) should be a different class from the index-time Document
Documents loaded at search time (IndexReader.document) should be a different class from the index-time Document

Key: LUCENE-3891
URL: https://issues.apache.org/jira/browse/LUCENE-3891
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless

The fact that the Document you can load at search time is the same Document class you had indexed is horribly trappy in Lucene, because the loaded document necessarily loses information like field boost, whether a field was tokenized, etc. (See LUCENE-3854 for a recent example). We should fix this, statically, so that it's an entirely different class at search time vs index time.
[jira] [Created] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Key: LUCENE-3892
URL: https://issues.apache.org/jira/browse/LUCENE-3892
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0

On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project.
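As a rough standalone sketch of the frame-of-reference (FOR) family named above (my own toy code, not any of the attached patches): each value in a block is packed with a single fixed bit width; PFOR-style variants would additionally store a few oversized values as patched exceptions so the common width stays small.

```java
// Toy FOR bit packing, assuming bits < 64 and that every value fits in
// 'bits' bits. Real PFOR/PFORDelta variants patch oversized "exception"
// values separately instead of widening the whole block.
public class ForBlock {
    static long[] pack(long[] values, int bits) {
        long[] packed = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6, off = bitPos & 63;
            packed[word] |= values[i] << off;
            if (off + bits > 64) {                   // value straddles two words
                packed[word + 1] |= values[i] >>> (64 - off);
            }
        }
        return packed;
    }

    static long unpack(long[] packed, int bits, int i) {
        int bitPos = i * bits;
        int word = bitPos >>> 6, off = bitPos & 63;
        long v = packed[word] >>> off;
        if (off + bits > 64) {                       // gather the spilled high bits
            v |= packed[word + 1] << (64 - off);
        }
        return v & ((1L << bits) - 1);
    }
}
```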
[jira] [Created] (LUCENE-3893) TermsFilter should use AutomatonQuery
TermsFilter should use AutomatonQuery

Key: LUCENE-3893
URL: https://issues.apache.org/jira/browse/LUCENE-3893
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless

I think we could see perf gains if TermsFilter sorted the terms, built a minimal automaton, and used TermsEnum.intersect to visit the terms... This idea came up on the dev list recently.
[jira] [Created] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil
Make BaseTokenStreamTestCase a bit more evil

Key: LUCENE-3894
URL: https://issues.apache.org/jira/browse/LUCENE-3894
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Throw an exception from the Reader while tokenizing, stop after not consuming all tokens, sometimes spoon-feed chars from the reader...
[jira] [Created] (LUCENE-3877) Lucene should not call System.out.println
Lucene should not call System.out.println

Key: LUCENE-3877
URL: https://issues.apache.org/jira/browse/LUCENE-3877
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0

We seem to have accumulated a few random sops... Eg, PairOutputs.java (oal.util.fst) and MultiDocValues.java, at least. Can we somehow detect (eg, have a test failure) if we accidentally leave errant System.out.println's (leftover from debugging)...?
[jira] [Created] (LUCENE-3872) Index changes are lost if you call prepareCommit() then close()
Index changes are lost if you call prepareCommit() then close()

Key: LUCENE-3872
URL: https://issues.apache.org/jira/browse/LUCENE-3872
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

You are supposed to call commit() after calling prepareCommit(), but... if you forget, and call close() after prepareCommit() without calling commit(), then any changes done after the prepareCommit() are silently lost (including adding/deleting docs, but also any completed merges). Spinoff from java-user thread "lots of .cfs (compound files) in the index directory" from Tim Bogaert. I think to fix this, IW.close should throw an IllegalStateException if prepareCommit() was called with no matching call to commit().
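The proposed guard could look roughly like this (a minimal standalone sketch, not Lucene's actual IndexWriter): close() refuses to proceed while a prepared commit is still pending.

```java
// Minimal sketch of the proposed close() guard: a pending prepared commit
// must be resolved by commit() (or rollback()) before close() is allowed,
// so changes can no longer be silently dropped.
public class CommitGuard {
    private boolean commitPending = false;

    public void prepareCommit() { commitPending = true;  /* flush + write pending segments */ }
    public void commit()        { commitPending = false; /* make the prepared commit visible */ }
    public void rollback()      { commitPending = false; /* discard the prepared commit */ }

    public void close() {
        if (commitPending) {
            throw new IllegalStateException(
                "cannot close: prepareCommit() was not followed by commit() or rollback()");
        }
    }
}
```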
[jira] [Created] (LUCENE-3870) VarDerefBytesImpl doc values prefix length may fall across two pages
VarDerefBytesImpl doc values prefix length may fall across two pages

Key: LUCENE-3870
URL: https://issues.apache.org/jira/browse/LUCENE-3870
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 4.0
Reporter: Michael McCandless
Fix For: 4.0

The VarDerefBytesImpl doc values encodes the unique byte[] with prefix (1 or 2 bytes) first, followed by bytes, so that it can use PagedBytes.fillSliceWithPrefix. It does this itself rather than using PagedBytes.copyUsingLengthPrefix... The problem is, it can write an invalid 2 byte prefix spanning two blocks (ie, last byte of block N and first byte of block N+1), which fillSliceWithPrefix won't decode correctly.
[jira] [Created] (LUCENE-3846) Fuzzy suggester
Fuzzy suggester

Key: LUCENE-3846
URL: https://issues.apache.org/jira/browse/LUCENE-3846
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Would be nice to have a suggester that can handle some fuzziness (like spell correction) so that it's able to suggest completions that are near what you typed. As a first go at this, I implemented 1T (ie up to 1 edit, including a transposition), except the first letter must be correct. But there is a penalty, ie, the corrected suggestion needs to have a much higher freq than the exact match suggestion before it can compete. Still tons of nocommits, and somehow we should merge this / make it work with analyzing suggester too (LUCENE-3842).
[jira] [Created] (LUCENE-3829) Lucene40 codec's DocValues DirectSource impls aren't thread-safe
Lucene40 codec's DocValues DirectSource impls aren't thread-safe

Key: LUCENE-3829
URL: https://issues.apache.org/jira/browse/LUCENE-3829
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 4.0

Our DirectSource impls hold IndexInput(s) open against the dat/idx files, which we then seek + read when loading a specific document's value. But this is in no way protected against multiple threads I think...?
[jira] [Created] (LUCENE-3824) TermOrdVal/DocValuesComparator does too much work in compareBottom
TermOrdVal/DocValuesComparator does too much work in compareBottom

Key: LUCENE-3824
URL: https://issues.apache.org/jira/browse/LUCENE-3824
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 3.6, 4.0

We now have logic to fall back to by-value comparison, when the bottom slot is not from the current reader. But this is silly, because if the bottom slot is from a different reader, it means the tie-break case is not possible (since the current reader didn't have the bottom value), so when the incoming ord equals the bottom ord we should always return a non-zero result. I added a new random string sort test case to TestSort... I also renamed DocValues.SortedSource.getByValue -> getOrdByValue and cleaned up some whitespace.
[jira] [Created] (LUCENE-3769) Simplify NRTManager
Simplify NRTManager

Key: LUCENE-3769
URL: https://issues.apache.org/jira/browse/LUCENE-3769
Project: Lucene - Java
Issue Type: Improvement
Components: core/search
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

NRTManager is hairy now, because the applyDeletes is separately passed to ctor, passed to maybeReopen, passed to getSearcherManager, etc. I think, instead, you should pass it only to the ctor, and if you have some cases needing deletes and others not then you can make two NRTManagers. This should be no less efficient than we have today, just simpler. I think it will also enable NRTManager to subclass ThingyManager (LUCENE-3761).
[jira] [Created] (LUCENE-3766) Remove/deprecate Tokenizer's default ctor
Remove/deprecate Tokenizer's default ctor

Key: LUCENE-3766
URL: https://issues.apache.org/jira/browse/LUCENE-3766
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 3.6, 4.0

I was working on a new Tokenizer... and I accidentally forgot to call super(input) (and super.reset(input) from my reset method)... which then meant my correctOffset() calls were silently a no-op; this is very trappy. Fortunately the awesome BaseTokenStreamTestCase caught this (I hit failures because the offsets were not in fact being corrected). One minimal thing we can do (but it sounds like from Robert there may be reasons why we can't) is add {{assert input != null}} in Tokenizer.correctOffset:

{noformat}
Index: lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java	(revision 1242316)
+++ lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java	(working copy)
@@ -82,6 +82,7 @@
    * @see CharStream#correctOffset
    */
   protected final int correctOffset(int currentOff) {
+    assert input != null : "subclass failed to call super(Reader) or super.reset(Reader)";
     return (input instanceof CharStream) ? ((CharStream) input).correctOffset(currentOff) : currentOff;
   }
{noformat}

But best would be to remove the default ctor that leaves input null...
[jira] [Created] (LUCENE-3767) Explore streaming Viterbi search in Kuromoji
Explore streaming Viterbi search in Kuromoji

Key: LUCENE-3767
URL: https://issues.apache.org/jira/browse/LUCENE-3767
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

I've been playing with the idea of changing the Kuromoji viterbi search to be 2 passes (intersect, backtrace) instead of 4 passes (break into sentences, intersect, score, backtrace)... this is very much a work in progress, so I'm just getting my current state up. It's got tons of nocommits, doesn't properly handle the user dict nor extended modes yet, etc. One thing I'm playing with is to add a double backtrace for the long compound tokens, ie, instead of penalizing these tokens so that shorter tokens are picked, leave the scores unchanged but on backtrace take that penalty and use it as a threshold for a 2nd best segmentation...
[jira] [Created] (LUCENE-3760) Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()
Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()

Key: LUCENE-3760
URL: https://issues.apache.org/jira/browse/LUCENE-3760
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Spinoff from Ryan's dev thread "DR.getCommitUserData() vs DR.getIndexCommit().getUserData()"... these methods are confusing/dups right now.
[jira] [Created] (LUCENE-3756) Don't allow IndexWriterConfig setters to chain
Don't allow IndexWriterConfig setters to chain

Key: LUCENE-3756
URL: https://issues.apache.org/jira/browse/LUCENE-3756
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless

Spinoff from LUCENE-3736. I don't like that IndexWriterConfig's setters are chainable; it results in code in our tests like this:

{noformat}
IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
    TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMaxBufferedDocs(2).setMergePolicy(newLogMergePolicy()));
{noformat}

I think in general we should avoid chaining since it encourages hard to read code (code is already hard enough to read!).
[jira] [Created] (LUCENE-3738) Be consistent about negative vInt/vLong
Be consistent about negative vInt/vLong

Key: LUCENE-3738
URL: https://issues.apache.org/jira/browse/LUCENE-3738
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0

Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had this assert been in since the beginning we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!).
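For reference, a self-contained sketch of the standard vInt scheme (7 payload bits per byte, high bit as a continuation flag) shows the trap described above: the unsigned shift keeps a sign-extended negative int "large", so it always costs the worst-case 5 bytes.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the vInt encoding discussed above: emit the low 7 bits per
// byte, setting the high bit while more significant bits remain. A
// negative int's sign-extended high bits force the 5-byte worst case.
public class VIntDemo {
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {        // more than 7 significant bits left
            out.write((i & 0x7F) | 0x80); // low 7 bits + continuation flag
            i >>>= 7;                     // unsigned shift: negatives stay "large"
        }
        out.write(i);
        return out.toByteArray();
    }
}
```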
[jira] [Created] (LUCENE-3742) SynFilter doesn't set offsets for outputs that hang off the end of the input tokens
SynFilter doesn't set offsets for outputs that hang off the end of the input tokens

Key: LUCENE-3742
URL: https://issues.apache.org/jira/browse/LUCENE-3742
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0
Attachments: LUCENE-3742.patch

If you have syn rule a -> x y and input a then output is a/x y but... what should y's offsets be? Right now we set to 0/0.
[jira] [Created] (LUCENE-3729) Allow using FST to hold terms data in DocValues.BYTES_*_SORTED
Allow using FST to hold terms data in DocValues.BYTES_*_SORTED

Key: LUCENE-3729
URL: https://issues.apache.org/jira/browse/LUCENE-3729
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
[jira] [Created] (LUCENE-3725) Add optional packing to FST building
Add optional packing to FST building

Key: LUCENE-3725
URL: https://issues.apache.org/jira/browse/LUCENE-3725
Project: Lucene - Java
Issue Type: Improvement
Components: core/FSTs
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

The FSTs produced by Builder can be further shrunk if you are willing to spend highish transient RAM to do so... our Builder today tries hard not to use much RAM (and has options to tweak down the RAM usage, in exchange for somewhat larger FST), even when building immense FSTs. But for apps that can afford highish transient RAM to get a smaller net FST, I think we should offer packing.
[jira] [Created] (LUCENE-3685) Add top-down version of BlockJoinQuery
Add top-down version of BlockJoinQuery

Key: LUCENE-3685
URL: https://issues.apache.org/jira/browse/LUCENE-3685
Project: Lucene - Java
Issue Type: Improvement
Components: modules/join
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Today, BlockJoinQuery can join from child docIDs up to parent docIDs. EG this works well for product (parent) + many SKUs (child) search. But the reverse, which BJQ cannot do, is also useful in some cases. EG say you index songs (child) within albums (parent), but you want to search and present by song not album while involving some fields from the album in the query. In this case you want to wrap a parent query (against album), joining down to the child document space.
[jira] [Created] (LUCENE-3684) Add offsets to postings (DPEnum)
Add offsets to postings (DPEnum)

Key: LUCENE-3684
URL: https://issues.apache.org/jira/browse/LUCENE-3684
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0

I think we should explore making start/end offsets a first-class attr in the postings APIs, and fixing the indexer to index them into postings. This will make term vector access cleaner (we now have to jump through hoops w/ non-first-class offset attr). It can also enable efficient highlighting without term vectors / reanalyzing, if the app indexes offsets into the postings.
[jira] [Created] (LUCENE-3681) FST.BYTE2 should save as fixed 2 byte not as vInt
FST.BYTE2 should save as fixed 2 byte not as vInt - Key: LUCENE-3681 URL: https://issues.apache.org/jira/browse/LUCENE-3681 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 We currently write BYTE1 as a single byte, but BYTE2/4 as vInt, which I think is confusing. Also, for the FST for the new Kuromoji analyzer (LUCENE-3305), writing as 2 bytes instead shrank the FST and ran faster, presumably because more values were >= 16384 than were < 128. Separately, the whole INPUT_TYPE is very confusing... really all it's doing is declaring the allowed range of the characters of the input alphabet, and then the only thing that uses that is the write/readLabel methods (well, and some confusing sugar methods in Builder!). Not sure how to fix that yet... It's a simple change but it changes the FST binary format, so any users w/ FSTs out there will have to rebuild (FST is marked experimental...).
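The size tradeoff follows directly from the vInt format (7 payload bits per byte: 1 byte below 128, 2 bytes below 16384, 3 bytes beyond). A small sketch with a hypothetical byte-counting helper:

```java
public class VIntSizeDemo {
  // Bytes a value takes as a vInt: 7 payload bits per byte, high bit = "more".
  static int vIntSize(int v) {
    int bytes = 1;
    while ((v & ~0x7F) != 0) { // more than 7 bits remaining?
      bytes++;
      v >>>= 7;
    }
    return bytes;
  }

  public static void main(String[] args) {
    System.out.println(vIntSize(100));   // 1 byte  (< 128)
    System.out.println(vIntSize(1000));  // 2 bytes (< 16384)
    System.out.println(vIntSize(40000)); // 3 bytes -- a fixed 2-byte label wins here
  }
}
```

So for label distributions skewed toward large values (as in the Kuromoji FST), a fixed 2-byte encoding is both smaller and cheaper to decode.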
[jira] [Created] (LUCENE-3679) Replace IndexReader.getFieldNames with IndexReader.getFieldInfos
Replace IndexReader.getFieldNames with IndexReader.getFieldInfos Key: LUCENE-3679 URL: https://issues.apache.org/jira/browse/LUCENE-3679 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0
[jira] [Created] (LUCENE-3658) NRTCachingDir has invalid asserts (if same file name is written twice)
NRTCachingDir has invalid asserts (if same file name is written twice) -- Key: LUCENE-3658 URL: https://issues.apache.org/jira/browse/LUCENE-3658 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3658.patch Normally Lucene is write-once (except for segments.gen file, which NRTCachingDir never caches), but in some tests (TestDoc, TestCrash) we can write the same file more than once. I don't think NRTCachingDir should have these asserts, and I think on createOutput it should remove any old file if present. I also found and fixed a possible concurrency issue (if more than one thread syncs at the same time; IndexWriter doesn't ever do this today but it has in the past).
[jira] [Created] (LUCENE-3639) Add test case support for shard searching
Add test case support for shard searching - Key: LUCENE-3639 URL: https://issues.apache.org/jira/browse/LUCENE-3639 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0, 3.5 New test case that helps stress test the APIs to support sharding
[jira] [Created] (LUCENE-3640) remove IndexSearcher.close
remove IndexSearcher.close -- Key: LUCENE-3640 URL: https://issues.apache.org/jira/browse/LUCENE-3640 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Now that IS is never heavy (since you have to pass in your own IR), IS.close is truly a no-op... I think we should remove it.
[jira] [Created] (LUCENE-3634) remove old static main methods in core
remove old static main methods in core -- Key: LUCENE-3634 URL: https://issues.apache.org/jira/browse/LUCENE-3634 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.6, 4.0 We have a few random static main methods that I think are very rarely used... we should remove them (IndexReader, UTF32ToUTF8, English). The IndexReader main lets you list / extract the sub-files from a CFS... I think we should move this to a new tool in contrib/misc.
[jira] [Created] (LUCENE-3618) FST suggester should allow saving to Directory (not just File)
FST suggester should allow saving to Directory (not just File) -- Key: LUCENE-3618 URL: https://issues.apache.org/jira/browse/LUCENE-3618 Project: Lucene - Java Issue Type: Improvement Components: modules/spellchecker Reporter: Michael McCandless Currently FSTCompletionLookup has a store method, taking File storeDir, which it treats as a directory and then saves the FST to file fst.bin inside there. I think we should also add a store method taking a Lucene Directory? Eg then I can store my suggest FST in a RAMDir.
[jira] [Created] (PYLUCENE-15) Add spellchecker JAR
Add spellchecker JAR Key: PYLUCENE-15 URL: https://issues.apache.org/jira/browse/PYLUCENE-15 Project: PyLucene Issue Type: Improvement Reporter: Michael McCandless 3.x's lucene/contrib/spellchecker has the spellchecker and suggest packages... would be nice to have PyLucene wrap these by default.
[jira] [Created] (PYLUCENE-14) Add PythonIndexDeletionPolicy so we can implement IndexDeletionPolicy in Python
Add PythonIndexDeletionPolicy so we can implement IndexDeletionPolicy in Python Key: PYLUCENE-14 URL: https://issues.apache.org/jira/browse/PYLUCENE-14 Project: PyLucene Issue Type: Improvement Reporter: Michael McCandless
[jira] [Created] (PYLUCENE-12) Add PythonReusableAnalyzerBase, so we can create analyzers in Python
Add PythonReusableAnalyzerBase, so we can create analyzers in Python Key: PYLUCENE-12 URL: https://issues.apache.org/jira/browse/PYLUCENE-12 Project: PyLucene Issue Type: Improvement Reporter: Michael McCandless Lucene now has a useful helper class, ReusableAnalyzerBase; you subclass it and override one method, to create an analyzer that provides reusableTokenStream impl. I think we should expose it in Python... patch is simple.
[jira] [Created] (LUCENE-3572) MultiIndexDocValues pretends it can merge sorted sources
MultiIndexDocValues pretends it can merge sorted sources Key: LUCENE-3572 URL: https://issues.apache.org/jira/browse/LUCENE-3572 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 4.0 Nightly build hit this failure: {noformat} ant test-core -Dtestcase=TestSort -Dtestmethod=testReverseSort -Dtests.seed=791b126576b0cfab:-48895c7243ecc5d0:743c683d1c9f7768 -Dtests.multiplier=3 -Dargs=-Dfile.encoding=ISO8859-1 [junit] Testcase: testReverseSort(org.apache.lucene.search.TestSort): Caused an ERROR [junit] expected:[CEGIA] but was:[ACEGI] [junit] at org.apache.lucene.search.TestSort.assertMatches(TestSort.java:1248) [junit] at org.apache.lucene.search.TestSort.assertMatches(TestSort.java:1216) [junit] at org.apache.lucene.search.TestSort.testReverseSort(TestSort.java:759) [junit] at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:523) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51) {noformat} It's happening in the test for reverse-sort of a string field with DocValues, when the test had gotten SlowMultiReaderWrapper. I committed a fix to the test to avoid testing this case, but we need a better fix to the underlying bug. MultiIndexDocValues cannot merge sorted sources (I think?), yet somehow it's pretending it can (in the above test, the three subs had BYTES_FIXED_SORTED type, and the TypePromoter happily claims to merge these to BYTES_FIXED_SORTED). I think MultiIndexDocValues should return null for the sorted source in this case?
[jira] [Created] (LUCENE-3575) Field names can be wrong for stored fields / term vectors after merging
Field names can be wrong for stored fields / term vectors after merging --- Key: LUCENE-3575 URL: https://issues.apache.org/jira/browse/LUCENE-3575 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 The good news is this bug only exists in trunk... the bad news is it's been here for some time (created by accident in LUCENE-2881). But the good news is it should strike fairly rarely. SegmentMerger sometimes incorrectly thinks it can bulk-copy TVs/stored fields when it cannot (because field numbers don't map to the same names across segments). I think it happens only with addIndexes, or indexes that have pre-trunk segments, and then SM falsely thinks it can bulk-merge only when the last field number has the same field name across segments.
[jira] [Created] (LUCENE-3564) rename IndexWriter.rollback to .rollbackAndClose
rename IndexWriter.rollback to .rollbackAndClose Key: LUCENE-3564 URL: https://issues.apache.org/jira/browse/LUCENE-3564 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Spinoff from LUCENE-3454, where Shai noticed that rollback is trappy since it [unexpectedly] closes the IW. I think we should rename it to rollbackAndClose.
[jira] [Created] (LUCENE-3562) Stop storing TermsEnum in CloseableThreadLocal inside Terms instance
Stop storing TermsEnum in CloseableThreadLocal inside Terms instance Key: LUCENE-3562 URL: https://issues.apache.org/jira/browse/LUCENE-3562 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 We have sugar methods in Terms.java (docFreq, totalTermFreq, docs, docsAndPositions) that use a saved thread-private TermsEnum to do the lookups. But on apps that send many threads through Lucene, and/or have many segments, this can add up to a lot of RAM, especially if the codecs impl holds onto stuff. Also, Terms has a close method (closes the CloseableThreadLocal) which must be called, but we fail to do so in some places. These saved enums are the cause of the recent OOME in TestNRTManager (TestNRTManager.testNRTManager -seed 2aa27e1aec20c4a2:-4a5a5ecf46837d0e:-7c4f651f1f0b75d7 -mult 3 -nightly). Really sharing these enums is a holdover from before Lucene queries would share state (ie, save the TermState from the first pass, and use it later to pull enums, get docFreq, etc.). It's not helpful anymore, and it can use gobs of RAM, so I'd like to remove it.
[jira] [Created] (LUCENE-3539) IndexFormatTooOld/NewExc should try to include fileName + directory when possible
IndexFormatTooOld/NewExc should try to include fileName + directory when possible - Key: LUCENE-3539 URL: https://issues.apache.org/jira/browse/LUCENE-3539 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 (Spinoff from http://markmail.org/thread/t6s7nn3ve765nojc ) When we throw a too old/new exc we should try to include the full path to the offending file, if possible.
[jira] [Created] (LUCENE-3524) Add direct PackedInts.Reader impl, that reads directly from disk on each get
Add direct PackedInts.Reader impl, that reads directly from disk on each get -- Key: LUCENE-3524 URL: https://issues.apache.org/jira/browse/LUCENE-3524 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Spinoff from LUCENE-3518. If we had a direct PackedInts.Reader impl we could use that instead of the RandomAccessReaderIterator.
[jira] [Created] (LUCENE-3520) If the NRT reader hasn't changed then IndexReader.openIfChanged should return null
If the NRT reader hasn't changed then IndexReader.openIfChanged should return null -- Key: LUCENE-3520 URL: https://issues.apache.org/jira/browse/LUCENE-3520 Project: Lucene - Java Issue Type: Bug Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 I hit a failure in TestSearcherManager (NOTE: doesn't always fail): {noformat} ant test -Dtestcase=TestSearcherManager -Dtestmethod=testSearcherManager -Dtests.seed=459ac99a4256789c:-29b8a7f52497c3b4:145ae632ae9e1ecf {noformat} It was tripping the assert inside SearcherLifetimeManager.record, because two different IndexSearcher instances had different IR instances sharing the same version. This was happening because IW.getReader always returns a new reader even when there are no changes. I think we should fix that... Separately I found a deadlock in TestSearcherManager.testIntermediateClose, if the test gets SerialMergeScheduler and needs to merge during the second commit.
[jira] [Created] (LUCENE-3518) Add sort-by-term with DocValues
Add sort-by-term with DocValues --- Key: LUCENE-3518 URL: https://issues.apache.org/jira/browse/LUCENE-3518 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 There are two sorted byte[] types with DocValues (BYTES_VAR_SORTED, BYTES_FIXED_SORTED), so you can index this type, but you can't yet sort by it. So I added a FieldComparator just like TermOrdValComparator, except it pulls from the doc values instead. There are some small diffs, eg with doc values there are never null values (see LUCENE-3504).
[jira] [Created] (LUCENE-3519) BlockJoinCollector only allows retrieving groups for only one BlockJoinQuery
BlockJoinCollector only allows retrieving groups for only one BlockJoinQuery Key: LUCENE-3519 URL: https://issues.apache.org/jira/browse/LUCENE-3519 Project: Lucene - Java Issue Type: Bug Components: modules/join Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Spinoff from Mark Harwood's email (subject BlockJoin concerns) to dev list. It's fine to use multiple nested joins in a single query, and BlockJoinCollector should let you retrieve the top groups for all of them. But currently it always returns null after the first query's groups have been retrieved, because of a silly bug.
[jira] [Created] (LUCENE-3515) Possible slowdown of indexing/merging on 3.x vs trunk
Possible slowdown of indexing/merging on 3.x vs trunk - Key: LUCENE-3515 URL: https://issues.apache.org/jira/browse/LUCENE-3515 Project: Lucene - Java Issue Type: Bug Components: core/index Reporter: Michael McCandless Fix For: 3.5, 4.0 Opening an issue to pursue the possible slowdown Marc Sturlese uncovered.
[jira] [Created] (LUCENE-3510) BooleanScorer should not limit number of prohibited clauses
BooleanScorer should not limit number of prohibited clauses --- Key: LUCENE-3510 URL: https://issues.apache.org/jira/browse/LUCENE-3510 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Today it's limited to 32, because it uses a separate bit in the mask for each clause. But I don't understand why it does this; I think all prohibited clauses can share a single boolean/bit? Any match on a prohibited clause sets this bit and the doc is not collected; we don't need each prohibited clause to have a dedicated bit? We also use the mask for required clauses, but this code is now commented out (we always use BS2 if there are any required clauses); if we re-enable this code (and I think we should, at least in certain cases: I suspect it'd be faster than BS2 in many cases), I think we can cutover to an int count instead of bit masks, and then have no limit on the required clauses sent to BooleanScorer also. Separately I cleaned a few things up about BooleanScorer: all of the embedded scorer methods (nextDoc, docID, advance, score) now throw UOE; pre-allocate the buckets instead of doing it lazily per-sub-collect.
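The proposed change can be sketched in isolation: instead of a dedicated mask bit per prohibited clause (hence the 32-clause limit), every prohibited clause sets the same flag, since a doc is rejected as soon as any of them matches. The clause sets below are hypothetical stand-ins for real scorers:

```java
import java.util.Set;

public class SharedProhibitedBit {
  static final int PROHIBITED = 1; // one shared bit, regardless of clause count

  // Returns true if the doc should be collected, ie no prohibited clause matched.
  static boolean collect(int doc, Set<Integer>[] prohibitedClauses) {
    int bits = 0;
    for (Set<Integer> clause : prohibitedClauses) {
      if (clause.contains(doc)) bits |= PROHIBITED; // any match sets the same bit
    }
    return (bits & PROHIBITED) == 0;
  }

  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    Set<Integer>[] clauses = new Set[] {Set.of(1, 2), Set.of(7)};
    System.out.println(collect(3, clauses)); // true: no prohibited clause matches doc 3
    System.out.println(collect(7, clauses)); // false: doc 7 is rejected
  }
}
```

Since per-clause identity is never needed for rejection, this removes the 32-clause ceiling entirely.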
[jira] [Created] (LUCENE-3502) Packed ints: move .getArray into Reader API
Packed ints: move .getArray into Reader API --- Key: LUCENE-3502 URL: https://issues.apache.org/jira/browse/LUCENE-3502 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 This is a simple code cleanup... it's messy that a consumer of PackedInts.Reader must check whether the impl is Direct8/16/32/64 in order to get an array; it's better to move up the .getArray into the Reader interface and then make the DirectN impls package private.
[jira] [Created] (LUCENE-3503) DisjunctionSumScorer gives slightly (float iotas) different scores when you .nextDoc vs .advance
DisjunctionSumScorer gives slightly (float iotas) different scores when you .nextDoc vs .advance Key: LUCENE-3503 URL: https://issues.apache.org/jira/browse/LUCENE-3503 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Attachments: LUCENE-3503.patch Spinoff from LUCENE-1536. I dug into why we hit a score diff when using luceneutil to benchmark the patch. At first I thought it was BS1/BS2 difference, but because of a bug in the patch it was still using BS2 (but should be BS1) -- Robert's last patch fixes that. But it's actually a diff in BS2 itself, whether you next or advance through the docs. It's because DisjunctionSumScorer, when summing the float scores for a given doc that matches multiple sub-scorers, might sum in a different order, when you had .nextDoc'd to that doc than when you had .advance'd to it. This in turn is because the PQ used by that scorer (ScorerDocQueue) makes no effort to break ties. So, when the top N scorers are on the same doc, the PQ doesn't care what order they are in. Fixing ScorerDocQueue to break ties will likely be a non-trivial perf hit, though, so I'm not sure whether we should do anything here...
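That float sums depend on evaluation order is easy to demonstrate in isolation: float addition is not associative once the summands' magnitudes differ enough, which is exactly why the queue's pop order leaks into the score:

```java
public class FloatOrderDemo {
  public static void main(String[] args) {
    float big = 16777216f;     // 2^24: the point where 1.0f falls below one ulp
    float a = (big + 1f) + 1f; // each 1f is rounded away separately
    float b = big + (1f + 1f); // the combined 2f survives
    System.out.println(a);       // 1.6777216E7
    System.out.println(b);       // 1.6777218E7
    System.out.println(a == b);  // false: same operands, different grouping
  }
}
```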
[jira] [Created] (LUCENE-3504) DocValues: deref/sorted bytes types shouldn't return empty byte[] when doc didn't have a value
DocValues: deref/sorted bytes types shouldn't return empty byte[] when doc didn't have a value -- Key: LUCENE-3504 URL: https://issues.apache.org/jira/browse/LUCENE-3504 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 I'm looking at making a FieldComparator that uses DV's SortedSource to sort by string field (ie just like TermOrdValComparator, except using DV instead of FieldCache). We already have comparators for DV int and float fields. But one thing I noticed is we can't detect documents that didn't have any value indexed vs documents that had empty byte[] indexed. This is easy to fix (and we used to do this): because these types are deref'd (ie, each doc stores an address, and then separately looks up the byte[] at that address), we can reserve ord/address 0 to mean the doc didn't have the field. Then we should return null when you retrieve the BytesRef value for that field.
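A toy sketch of the reserved-address idea (the arrays here are hypothetical stand-ins for the real DocValues storage):

```java
public class ReservedOrdDemo {
  // values[0] is the reserved "missing" slot; real values start at address 1.
  static final byte[][] values = {null, "apple".getBytes(), "banana".getBytes()};
  // Per-doc addresses into the values table: doc 1 had no value indexed.
  static final int[] docToAddress = {1, 0, 2};

  static byte[] get(int doc) {
    return values[docToAddress[doc]]; // null when the doc had no value
  }

  public static void main(String[] args) {
    System.out.println(new String(get(0))); // apple
    System.out.println(get(1));             // null: distinguishable from empty byte[]
  }
}
```

Because every doc already pays for an address, reserving address 0 costs nothing and makes "missing" unambiguous.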
[jira] [Created] (LUCENE-3486) Add SearcherLifetimeManager, so you can retrieve the same searcher you previously used
Add SearcherLifetimeManager, so you can retrieve the same searcher you previously used -- Key: LUCENE-3486 URL: https://issues.apache.org/jira/browse/LUCENE-3486 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Attachments: LUCENE-3486.patch The idea is similar to SOLR-2809 (adding searcher leases to Solr). This utility class sits above whatever your source is for the current searcher (eg NRTManager, SearcherManager, etc.), and records (holds a reference to) each searcher in recent history. The idea is to ensure that when a user does a follow-on action (clicks next page, drills down/up), or when two or more searcher invocations within a single user search need to happen against the same searcher (eg in distributed search), you can retrieve the same searcher you used last time. I think with the new searchAfter API (LUCENE-2215), doing follow-on searches on the same searcher is more important, since the bottom (score/docID) held for that API can easily shift when a new searcher is opened. When you do a new search, you record the searcher you used with the manager, and it returns to you a long token (currently just the IR.getVersion()), which you can later use to retrieve the same searcher. Separately you must periodically call prune(), to prune the old searchers, ideally from the same thread / at the same time that you open a new searcher.
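The record/acquire/prune lifecycle described above can be sketched with a plain map keyed by the version token; the real class manages actual IndexSearcher instances and reference counting, so this is only a shape sketch with simplified names:

```java
import java.util.HashMap;
import java.util.Map;

public class LifetimeSketch {
  record Searcher(long version) {} // stand-in for an IndexSearcher

  private final Map<Long, Searcher> byVersion = new HashMap<>();

  long record(Searcher s) {       // called after each search
    byVersion.put(s.version(), s);
    return s.version();           // token the client holds onto (eg in session state)
  }

  Searcher acquire(long token) {  // follow-on page: same searcher, or null if pruned
    return byVersion.get(token);
  }

  void prune(long minVersionToKeep) { // called periodically to drop old searchers
    byVersion.values().removeIf(s -> s.version() < minVersionToKeep);
  }

  public static void main(String[] args) {
    LifetimeSketch mgr = new LifetimeSketch();
    long token = mgr.record(new Searcher(41));
    mgr.record(new Searcher(42));              // a refresh opened a newer searcher
    System.out.println(mgr.acquire(token).version()); // 41: page 2 sees page 1's view
    mgr.prune(42);
    System.out.println(mgr.acquire(token));    // null: too old, client must re-search
  }
}
```

A null from acquire is the signal that the lease expired and the follow-on action must fall back to the current searcher.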
[jira] [Created] (SOLR-2807) Upgrade to Tika 0.10
Upgrade to Tika 0.10 Key: SOLR-2807 URL: https://issues.apache.org/jira/browse/SOLR-2807 Project: Solr Issue Type: Improvement Reporter: Michael McCandless Tika 0.10 was recently released... seems like we should upgrade?
[jira] [Created] (LUCENE-3477) Fix JFlex tokenizer compiler warnings
Fix JFlex tokenizer compiler warnings - Key: LUCENE-3477 URL: https://issues.apache.org/jira/browse/LUCENE-3477 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3477.patch We get lots of distracting fallthrough warnings running ant compile in modules/analysis, from the tokenizers generated from JFlex. Digging a bit, they actually do look spooky. So I managed to edit the JFlex inputs to insert a bunch of break statements in our rules, but I have no idea if this is right/dangerous, and it seems a bit weird having to do such insertions of naked breaks. But, this does fix all the warnings, and all tests pass...
[jira] [Created] (LUCENE-3478) TestSimpleExplanations failure
TestSimpleExplanations failure

Key: LUCENE-3478
URL: https://issues.apache.org/jira/browse/LUCENE-3478
Project: Lucene - Java
Issue Type: Bug
Components: core/search
Reporter: Michael McCandless
Fix For: 4.0

{noformat}
ant test -Dtestcase=TestSimpleExplanations -Dtestmethod=testDMQ8 -Dtests.seed=144152895b276837:eb7ba4953db943f:33373b79a971db02
{noformat}

fails w/ this on current trunk... looks like a silly floating point precision issue:

{noformat}
[junit] Testsuite: org.apache.lucene.search.TestSimpleExplanations
[junit] 1.4508595 = (MATCH) sum of:
[junit] 1.4508595 = (MATCH) weight(field:yy in 2) [DefaultSimilarity], result of:
[junit] 1.4508595 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 1.287682 = queryWeight, product of:
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] 145085.95 = (MATCH) weight(field:xx^10.0 in 2) [DefaultSimilarity], result of:
[junit] 145085.95 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 128768.2 = queryWeight, product of:
[junit] 10.0 = boost
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] expected:145086.66 but was:145086.69)
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.544 sec
[junit]
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestSimpleExplanations -Dtestmethod=testDMQ8 -Dtests.seed=144152895b276837:eb7ba4953db943f:33373b79a971db02
[junit] NOTE: test params are: codec=PreFlex, sim=RandomSimilarityProvider(queryNorm=false,coord=false): {field=DefaultSimilarity, alt=DFR I(ne)LZ(0.3), KEY=IB LL-D2}, locale=en_IN, timezone=Pacific/Samoa
[junit] NOTE: all tests run in this JVM:
[junit] [TestSimpleExplanations]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=130426744,total=189988864
[junit] - ---
[junit] Testcase: testDMQ8(org.apache.lucene.search.TestSimpleExplanations): FAILED
[junit] ((field:yy field:w5^100.0) | field:xx^10.0)~0.5: score(doc=2)=145086.66 != explanationScore=145086.69 Explanation: 145086.69 = (MATCH) max plus 0.5 times others of:
[junit] 1.4508595 = (MATCH) sum of:
[junit] 1.4508595 = (MATCH) weight(field:yy in 2) [DefaultSimilarity], result of:
[junit] 1.4508595 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 1.287682 = queryWeight, product of:
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] 145085.95 = (MATCH) weight(field:xx^10.0 in 2) [DefaultSimilarity], result of:
[junit] 145085.95 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 128768.2 = queryWeight, product of:
[junit] 10.0 = boost
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] expected:145086.66 but was:145086.69
[junit] junit.framework.AssertionFailedError: ((field:yy field:w5^100.0) | field:xx^10.0)~0.5: score(doc=2)=145086.66 != explanationScore=145086.69 Explanation: 145086.69 = (MATCH) max plus 0.5 times others of:
[junit] 1.4508595 = (MATCH) sum of:
[junit] 1.4508595 = (MATCH) weight(field:yy in 2) [DefaultSimilarity], result of:
[junit] 1.4508595 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 1.287682 = queryWeight, product of:
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit]
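The size of the discrepancy (145086.66 vs. 145086.69) is consistent with single-precision rounding: near 145086, one float ulp is 2^-6 = 0.015625, so mathematically equivalent computations done in different orders can disagree in the second decimal place. A minimal sketch of how little resolution float has at that magnitude (plain Java, not Lucene code; the values merely echo the failing score):

```java
public class FloatUlpDemo {
    public static void main(String[] args) {
        // Near the failing score (~145086) float has a 24-bit significand,
        // so one ulp is 2^-6 = 0.015625; anything below ~0.008 rounds away.
        float score = 145085.95f;
        System.out.println(Math.ulp(score)); // prints 0.015625

        // Adding 0.001f a thousand times changes nothing: each addend is
        // under half an ulp, so every addition rounds back to the old value.
        float accumulated = score;
        for (int i = 0; i < 1000; i++) {
            accumulated += 0.001f;
        }
        System.out.println(accumulated == score); // prints true

        // Accumulating in double first preserves the ~1.0 total contribution.
        double precise = score;
        for (int i = 0; i < 1000; i++) {
            precise += 0.001f;
        }
        System.out.println((float) precise > score); // prints true
    }
}
```

This is why score-vs-explain comparisons at large boosts need an epsilon proportional to the score, not an absolute one.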
[jira] [Created] (LUCENE-3479) TestGrouping failure
TestGrouping failure

Key: LUCENE-3479
URL: https://issues.apache.org/jira/browse/LUCENE-3479
Project: Lucene - Java
Issue Type: Bug
Components: modules/grouping
Reporter: Michael McCandless
Assignee: Michael McCandless

{noformat}
ant test -Dtestcase=TestGrouping -Dtestmethod=testRandom -Dtests.seed=295cdb78b4a442d4:-4c5d64ef4d698c27:-425d4c1eb87211ba
{noformat}

fails with this on current trunk:

{noformat}
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestGrouping -Dtestmethod=testRandom -Dtests.seed=295cdb78b4a442d4:-4c5d64ef4d698c27:-425d4c1eb87211ba
[junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, content=MockSep, sort2=SimpleText, groupend=Pulsing(freqCutoff=3 minBlockSize=65 maxBlockSize=132), sort1=Memory, group=Memory}, sim=RandomSimilarityProvider(queryNorm=true,coord=false): {id=DFR I(F)L2, content=DFR BeL3(800.0), sort2=DFR GL3(800.0), groupend=DFR G2, sort1=DFR GB3(800.0), group=LM Jelinek-Mercer(0.70)}, locale=zh_TW, timezone=America/Indiana/Indianapolis
[junit] NOTE: all tests run in this JVM:
[junit] [TestGrouping]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=143246344,total=281804800
[junit] - ---
[junit] Testcase: testRandom(org.apache.lucene.search.grouping.TestGrouping): FAILED
[junit] expected:11 but was:7
[junit] junit.framework.AssertionFailedError: expected:11 but was:7
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:148)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
[junit] at org.apache.lucene.search.grouping.TestGrouping.assertEquals(TestGrouping.java:980)
[junit] at org.apache.lucene.search.grouping.TestGrouping.testRandom(TestGrouping.java:865)
[junit] at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:611)
[junit]
[junit]
{noformat}

I dug for a while... the test is a bit sneaky because it compares docs sorted by score across two indexes. Index #1 has no deletions; index #2 has the same docs, but organized into doc blocks by group, and has some deletions. In theory (I think) even though the deletions will cause scores to differ across the two indices, they should not alter the sort order of the docs. Here is the explain output of the docs that sorted differently:

{noformat}
#1: top hit in the has-deletes doc-block index (id=239):
explain: 2.394486 = (MATCH) weight(content:real1 in 292) [DFRSimilarity], result of:
2.394486 = score(DFRSimilarity, doc=292, freq=1.0), computed from:
1.0 = termFreq=1
41.944084 = NormalizationH3, computed from:
1.0 = tf
5.3102274 = avgFieldLength
2.56 = len
102.829 = BasicModelBE, computed from:
41.944084 = tfn
880.0 = numberOfDocuments
239.0 = totalTermFreq
0.023286095 = AfterEffectL, computed from:
41.944084 = tfn

#2: hit in the no-deletes normal index (id=229)
ID=229 explain=2.382285 = (MATCH) weight(content:real1 in 225) [DFRSimilarity], result of:
2.382285 = score(DFRSimilarity, doc=225, freq=1.0), computed from:
1.0 = termFreq=1
41.765594 = NormalizationH3, computed from:
1.0 = tf
5.3218827 = avgFieldLength
10.24 = len
101.879845 = BasicModelBE, computed from:
41.765594 = tfn
786.0 = numberOfDocuments
215.0 = totalTermFreq
0.023383282 = AfterEffectL, computed from:
41.765594 = tfn

Then I went and called explain on the no-deletes normal index for the top doc (id=239):
explain: 2.3822558 = (MATCH) weight(content:real1 in 17) [DFRSimilarity], result of:
2.3822558 = score(DFRSimilarity, doc=17, freq=1.0), computed from:
1.0 = termFreq=1
42.165264 = NormalizationH3, computed from:
1.0 = tf
5.3218827 = avgFieldLength
2.56 = len
102.8307 = BasicModelBE, computed from:
42.165264 = tfn
786.0 = numberOfDocuments
215.0 = totalTermFreq
0.023166776 = AfterEffectL, computed from:
42.165264 = tfn
{noformat}

-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
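One detail worth noting in the explain output above: in the no-deletes index, id=229 (2.382285) edges out id=239 (2.3822558) by only ~3e-5, while the has-deletes index's shifted collection statistics (numberOfDocuments 880 vs. 786, avgFieldLength 5.3102274 vs. 5.3218827) put id=239 on top. A tiny sketch of how such a near-tie flips under a small stat-driven change (plain Java; the two scores are copied from the explains, the perturbation size is hypothetical):

```java
public class NearTieDemo {
    public static void main(String[] args) {
        // Scores taken from the explain output of the no-deletes index.
        double scoreId229 = 2.382285;
        double scoreId239 = 2.3822558;
        System.out.println(scoreId229 > scoreId239); // prints true: id=229 ranks first

        // A relative shift of ~1e-4, roughly the size of the effect the
        // changed collection statistics have, flips the ordering.
        double shifted239 = scoreId239 * (1 + 1e-4); // hypothetical stat-driven shift
        System.out.println(shifted239 > scoreId229); // prints true: id=239 now ranks first
    }
}
```

So even if deletions "should" preserve relative order, docs whose scores differ by less than the stat-induced perturbation can legitimately swap ranks between the two indexes.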
[jira] [Created] (LUCENE-3465) IndexSearcher fails to pass docBase to Collector when using ExecutorService
IndexSearcher fails to pass docBase to Collector when using ExecutorService

Key: LUCENE-3465
URL: https://issues.apache.org/jira/browse/LUCENE-3465
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.5

This bug is causing the failure in TestSearchAfter. When you use ExecutorService with IndexSearcher, we now always pass docBase 0 to the Collector. This doesn't affect trunk (AtomicReaderContext carries the right docBase); only 3.x.
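To see why a wrong docBase matters (a toy model of the per-segment collection protocol, not the actual Lucene 3.x API): a collector receives segment-local doc ids plus the segment's docBase (via setNextReader in 3.x), and adds the two to recover index-wide ids. If the searcher always hands it 0, every segment's local ids collide. The segment sizes below are hypothetical:

```java
public class DocBaseDemo {
    // Hypothetical index with three segments of these sizes.
    static final int[] SEGMENT_SIZES = {100, 50, 25};

    // A collector maps a segment-local docid to a global docid by
    // adding the docBase it was handed for that segment.
    static int globalId(int docBase, int localDoc) {
        return docBase + localDoc;
    }

    // The correct docBase is the sum of all earlier segments' maxDocs.
    static int docBaseOf(int segment) {
        int base = 0;
        for (int s = 0; s < segment; s++) base += SEGMENT_SIZES[s];
        return base;
    }

    public static void main(String[] args) {
        // Correct: local doc 7 in segment 2 is global doc 150 + 7.
        System.out.println(globalId(docBaseOf(2), 7)); // prints 157

        // The bug: the ExecutorService path always passes docBase 0,
        // so doc 7 of every segment collapses onto "global" id 7.
        System.out.println(globalId(0, 7)); // prints 7
    }
}
```

Any collector that records global ids (e.g. to fetch stored fields later, or to dedupe hits across segments) silently returns wrong documents under the bug, which is exactly the kind of mismatch TestSearchAfter trips on.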