Contributing code
All,

Now that we've moved past the proposal stage and defined our Initial Committers list, I'd like to address how to be a Contributor to Lucene.Net.

Some quick things to note upfront about roles. Previously I made a point of distinguishing between Contributors and Committers at ASF. This was meant to help motivated individuals decide what level of commitment they wanted to make to the project. I did not intend to suggest that there is a special status of being a Contributor. I listed those who had come forward offering support in the proposal mostly to show that the community around the project was vital, with a lot of motivated individuals. I hope that this wasn't interpreted as implying a special status for those people, or implying that others, not on that list, could not be contributors. There is no special status of Contributor that someone must gain prior to submitting code. Anyone can write and submit code patches at any time. As soon as you have done that, you are a Contributor.

All code contributions to ASF projects follow the same pattern. First, a JIRA issue is created for the patch, with a description of the change and with the patch file attached to it. A project Committer will find the issue, review the patch, and commit it to SVN (or reject the patch and provide an explanation).

Here's a quick guideline to the process for committing code to Lucene.Net.

Step-by-Step Example

Suppose I have downloaded the source code and made a change to 'HelloWorld.cs', and suppose I'm using TortoiseSVN.

STEP 1: Make a patch file

From TortoiseSVN, right click on the changed file(s) and select 'Create Patch' from the 'TortoiseSVN' context menu. Save it as 'HelloWorld.cs.patch'.
STEP 2: Create a JIRA issue

Lucene.Net's JIRA issue tracker is located here: https://issues.apache.org/jira/browse/LUCENENET

If you don't have an account in JIRA, you can sign up easily (click 'Login' in the upper right, and from that screen click 'Sign Up'). Once you're logged in to JIRA, you can create a new issue in the issue tracker. For code patches, use issue type 'Improvement' or 'Bug'. Please describe the patch you made with enough information that someone else can understand both the code and the reasons why you wrote it.

STEP 3: Attach the patch file to the JIRA issue

After creating the issue, attach the 'HelloWorld.cs.patch' file to it. For large patches, you may want to compress the source code into a zip file.

STEP 4: A Committer will apply or reject the patch

A Committer will find the new issue, review the patch, and either commit it to SVN or reject it with an explanation. This often involves a discussion in the comments for the issue. Please remain engaged with the conversation to ensure the completion of the issue; perhaps only a small change needs to be made to the patch in order for it to be accepted. An example of an issue that follows this process is here: http://issues.apache.org/jira/browse/LUCENENET-331

I'd like to see a description of this process made available on the project web page. I think this is a point of confusion for a lot of would-be contributors.

Thanks,
Troy
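The patch file described in the steps above is a plain unified diff, the same format produced by `svn diff` on the command line for those not using the TortoiseSVN GUI. A minimal illustration of that format using Python's difflib (the file contents here are hypothetical, just to show what a reviewer sees in the attached .patch):

```python
import difflib

# Hypothetical before/after contents of HelloWorld.cs
old = ["class HelloWorld {\n", "    // TODO\n", "}\n"]
new = ["class HelloWorld {\n", "    // patched\n", "}\n"]

# Produce a unified diff, the same format 'svn diff' and
# TortoiseSVN's 'Create Patch' emit
patch = "".join(difflib.unified_diff(old, new,
                                     fromfile="HelloWorld.cs",
                                     tofile="HelloWorld.cs"))
print(patch)
```

Lines prefixed with `-` are removed and lines prefixed with `+` are added, which is what makes a patch reviewable in isolation before a Committer applies it.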
Small change in one of the sample files, i.e., samples/mansearch.py
Hello,

I have just installed pylucene and tested some of the sample scripts. In samples/mansearch.py, line 68 should be

    parser = QueryParser(Version.LUCENE_CURRENT, keywords, StandardAnalyzer(Version.LUCENE_CURRENT))

rather than

    parser = QueryParser(keywords, StandardAnalyzer(Version.LUCENE_CURRENT))

Maybe you could update that. Many thanks.

Jean-Luc
Re: Small change in one of the sample files, i.e., samples/mansearch.py
On Fri, 14 Jan 2011, Jean Luc Truchtersheim wrote:

> I have just installed pylucene and tested some of the sample scripts. In
> samples/mansearch.py, line 68 should be
>
>     parser = QueryParser(Version.LUCENE_CURRENT, keywords, StandardAnalyzer(Version.LUCENE_CURRENT))
>
> rather than
>
>     parser = QueryParser(keywords, StandardAnalyzer(Version.LUCENE_CURRENT))
>
> Maybe you could update that.

Fixed in rev 1059118 of pylucene_2_9 branch.
Fixed in rev 1059131 of pylucene_3_0 branch.
Fixed in rev 1059134 of branch_3_x branch.

Thanks !

Andi..
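The underlying cause of the fix above is that newer QueryParser constructors take a Version constant as their first argument, so the old two-argument call no longer lines up. A toy sketch of why the call breaks (these classes are stand-ins for illustration, not the real pylucene API):

```python
# Toy stand-ins for the pylucene classes; NOT the real API.
class StandardAnalyzer:
    def __init__(self, version):
        self.version = version

class QueryParser:
    # Newer-style signature: a Version constant comes first.
    def __init__(self, version, field, analyzer):
        self.version, self.field, self.analyzer = version, field, analyzer

LUCENE_CURRENT = "LUCENE_CURRENT"

# Old-style call: missing the leading version argument, so the
# analyzer lands in the wrong slot and Python raises a TypeError.
try:
    QueryParser("keywords", StandardAnalyzer(LUCENE_CURRENT))
except TypeError:
    broken = True

# Corrected call, matching the fix suggested for mansearch.py line 68:
parser = QueryParser(LUCENE_CURRENT, "keywords",
                     StandardAnalyzer(LUCENE_CURRENT))
```

The same shape of breakage applies to any positional-argument API that grows a new leading parameter.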
[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default
[ https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981674#action_12981674 ]

Simon Willnauer commented on LUCENE-2773:
-----------------------------------------

bq. So for 3.x/trunk (which already take deletions into account by default), I'll switch maxMergeMB default to 2 GB. I think this is an OK default given that it means your biggest segments will range from 2GB - 20GB.

Mike, this also means that an optimize will have no effect if all segments are > 2GB with this as default? It seems kind of odd to me, ey?

Don't create compound file for large segments by default

Key: LUCENE-2773
URL: https://issues.apache.org/jira/browse/LUCENE-2773
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9.4, 3.0.3, 3.1, 4.0
Attachments: LUCENE-2773.patch

Spinoff from LUCENE-2762. CFS is useful for keeping the open file count down. But, it costs some added time during indexing to build, and also ties up temporary disk space, causing e.g. a large spike on the final merge of an optimize. Since MergePolicy dictates which segments should be CFS, we can change it to only build CFS for smallish merges. I think we should also set a maxMergeMB by default so that very large merges aren't done.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981682#action_12981682 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

{quote}
When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.
{quote}

First of all, I get nervous when it comes to stuff like this! That said, I can see where this could be useful: for instance, if you have one and the same FuzzyQuery / RegexpQuery, which has a rather large setup cost, in more than one clause of a BooleanQuery, then this would absolutely help. For other queries like TermQuery, the TermState cache in TermsEnum already helps you a lot, so for those this wouldn't make a big difference.

bq. Query rewriteUsingCache(IndexReader indexReader)

I think one major issue here is how you would clear such a cache. WeakReferences would work, but I wouldn't want to put any cache into any query. In general we shouldn't make any query heavyweight or somewhat stateful at all. Yet, if we passed a RewriteCache into Query#rewrite(IR, RC) that has a per-IS#search lifetime, this could actually work. This would also be easy to implement: Query#rewrite(IR, RC) would just forward to Query#rewrite(IR) by default, and combining queries (BooleanQuery) could override the new one. Eventually, MultiTermQuery can provide such an impl and check the cache to see if it needs to rewrite itself or can return an already rewritten version.
It should be easy to make use of TermCache; rewritten queries should be shared automatically

Key: LUCENE-2868
URL: https://issues.apache.org/jira/browse/LUCENE-2868
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Karl Wright

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981685#action_12981685 ]

Karl Wright commented on LUCENE-2868:
-------------------------------------

Fine by me if you have a better way of doing it! Who would create the RewriteCache object? The IndexSearcher?
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981686#action_12981686 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

bq. Who would create the RewriteCache object? The IndexSearcher?

it could.. or it could just be an overloaded IS.search method
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981706#action_12981706 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

Actually, I think we need to clarify the description of this issue. This has nothing to do with TermCache at all. It actually reads very scary, though, since caches are really tricky, and this one is mainly about rewrite cost in MTQ. That said, adding a method to Query just for the sake of MTQ rewrite seems kind of odd. We should rather optimize the query structure somehow instead of caching a rewrite method.
[jira] Resolved: (LUCENE-2864) add maxtf to fieldinvertstate
[ https://issues.apache.org/jira/browse/LUCENE-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2864.
---------------------------------

Resolution: Fixed
Assignee: Robert Muir

Committed revision 1058939, 1058944 (3x)

add maxtf to fieldinvertstate
-----------------------------

Key: LUCENE-2864
URL: https://issues.apache.org/jira/browse/LUCENE-2864
Project: Lucene - Java
Issue Type: New Feature
Components: Query/Scoring
Reporter: Robert Muir
Assignee: Robert Muir
Fix For: 3.1, 4.0
Attachments: LUCENE-2864.patch

the maximum within-document TF is a very useful scoring value, we should expose it so that people can use it in scoring

consider the following sim:

{code}
@Override
public float idf(int docFreq, int numDocs) {
  return 1.0F; /* not used */
}

@Override
public float computeNorm(String field, FieldInvertState state) {
  return state.getBoost() / (float) Math.sqrt(state.getMaxTF());
}
{code}

which is surprisingly effective, but more interesting for practical reasons.
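The computeNorm in the sim above boils down to boost / sqrt(maxTF). A re-statement of just that arithmetic in plain Python, with toy values (this is only the formula, not Lucene code):

```python
import math

def compute_norm(boost, max_tf):
    """Norm from the sim above: field boost divided by the square root
    of the maximum within-document term frequency."""
    return boost / math.sqrt(max_tf)

# A document whose most frequent term occurs 4 times, field boost 1.0:
print(compute_norm(1.0, 4))   # 0.5
```

Because maxTF grows with repeated terms, the norm damps documents dominated by a single term, which is one way to read why the sim is "surprisingly effective".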
CorruptIndexException when indexing
hi all,

we have confronted this problem 3 times when testing. The exception stack is:

Exception in thread "Lucene Merge Thread #2"
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: docs out of order (7286 <= 7286 )
	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:355)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:319)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (7286 <= 7286 )
	at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:75)
	at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:880)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:818)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:756)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:187)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5354)
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4937)

Or:

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: 330
	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:355)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:319)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 330
	at org.apache.lucene.util.BitVector.get(BitVector.java:102)
	at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:238)
	at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:168)
	at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
	at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:870)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:818)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:756)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:187)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5354)

We did some minor modification based on Lucene 2.9.1 and Solr 1.4.0: we modified the frq file to store 4 bytes for the positions where the term occurred in these documents (accessing full positions in the prx file is too time consuming to meet our needs). I can't tell whether it's our bug or Lucene's own bug. I searched the mailing list and found the mail "problem during index merge" posted on 2010-10-21. It's similar to our case. It seems the docList in the frq file is wrongly stored. When merging, when it's decoded, the wrong docID may be larger than maxDoc (the BitVector deletedDocs), which causes the second exception; or the docID delta is less than 0 (it reads wrongly), which causes the first exception. We are still continuing testing, turning off our modification and opening infoStream in solrconfig.xml.

We found a strange phenomenon. When we test, it sometimes hits exceptions, but in our production environment it never hits any. The hardware and software environments are the same. We checked carefully and the only difference is this line in solrconfig.xml:

    <ramBufferSizeMB>32</ramBufferSizeMB>    in the testing environment
    <ramBufferSizeMB>256</ramBufferSizeMB>   in the production environment

The number of indexed documents for each machine is also roughly the same: 10M+ documents. I can't be sure the indices in the production env are correct, because even if some terms' docLists are wrong, if the doc deltas are > 0 and there are no deleted documents, it will not hit the two exceptions. We don't find anything strange in the search results in the production env.

Could a too-small ramBufferSizeMB result in index corruption?
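The two failure modes described above (a doc delta decoding to <= 0, or a decoded docID landing past maxDoc) can be sketched generically. This is a toy delta decoder for illustration only, not Lucene's actual frq-reading code, but it raises on the same two conditions as the stack traces:

```python
def decode_doc_ids(deltas, max_doc):
    """Toy decoder for a delta-encoded doc list (the .frq file stores
    gaps between docIDs). A gap <= 0 reproduces the 'docs out of
    order' check; a decoded docID >= max_doc is what would trip the
    deleted-docs BitVector bounds check during a merge."""
    doc_ids, doc = [], -1
    for delta in deltas:
        if delta <= 0:
            raise ValueError("docs out of order (%d <= %d )" % (doc + delta, doc))
        doc += delta
        if doc >= max_doc:
            raise IndexError("docID %d out of bounds (maxDoc=%d)" % (doc, max_doc))
        doc_ids.append(doc)
    return doc_ids

# A well-formed list decodes cleanly:
print(decode_doc_ids([1, 3, 2], max_doc=10))   # [0, 3, 5]
```

A corrupted gap of 0 or a gap that overshoots maxDoc would raise, mirroring the CorruptIndexException and ArrayIndexOutOfBoundsException respectively.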
[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default
[ https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981726#action_12981726 ]

Michael McCandless commented on LUCENE-2773:
--------------------------------------------

bq. Mike, this also means that an optimize will have no effect if all segments are > 2GB with this as default? It seems kind of odd to me, ey?

There was a separate issue for this -- LUCENE-2701. I agree it's debatable... and it's not clear which way we should default it.
[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default
[ https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981729#action_12981729 ]

Simon Willnauer commented on LUCENE-2773:
-----------------------------------------

bq. There was a separate issue for this - LUCENE-2701.

I think we should reopen and fix this. I expect optimize to have single-segment semantics if I call optimize(), as the javadocs state. However we do that :)
[jira] Reopened: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reopened LUCENE-2701:
-------------------------------------

This change, together with LUCENE-2773, caused a change in the semantics of IW#optimize() and friends. IW#optimize() says:

{code}
/**
 * Requests an optimize operation on an index, priming the index
 * for the fastest available search. Traditionally this has meant
 * merging all segments into a single segment as is done in the
 * default merge policy, but individual merge policies may implement
 * optimize in different ways.
 */
{code}

Which is not entirely true anymore, since the default now uses:

{code}
/** Default maximum segment size. A segment of this size
 * or larger will never be merged. @see setMaxMergeMB */
public static final double DEFAULT_MAX_MERGE_MB = 2048;
{code}

This is not what I would expect from optimize(), even if it were documented that way. A plain optimize call should by default result in a single segment, IMO. Yet, we could make this settable by a flag in LogMergePolicy, maybe something like LogMergePolicy#useMaxMergeSizeForOptimize = false as a default?

Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Key: LUCENE-2701
URL: https://issues.apache.org/jira/browse/LUCENE-2701
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1, 4.0
Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch

LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken into consideration in regular merges, yet ignored by findMergesForOptimize. I think it'd be good if we took that into consideration even when optimizing. This will allow the caller to specify two constraints: maxNumSegments and maxMergeMB.
Obviously both may not be satisfied, and therefore we will guarantee that if there is any segment above the threshold, the threshold constraint takes precedence, and therefore you may end up with more than maxNumSegments segments (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into consideration. As part of this change, I plan to change some methods to protected (from private), and some members as well. I realized that if one wishes to implement his own LMP extension, he needs to either put it under o.a.l.index or copy some code over to his impl. I'll attach a patch shortly.
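Whether optimize still collapses everything to one segment under the proposed flag comes down to how maxMergeMB filters the candidate set. A toy sketch of that selection logic (illustrative only, not LogMergePolicy's real implementation; the flag name mirrors Simon's suggested useMaxMergeSizeForOptimize):

```python
def segments_to_merge(segment_sizes_mb, max_merge_mb, use_max_for_optimize):
    """Toy version of the proposed behaviour: with the flag off,
    optimize considers every segment (classic single-segment
    semantics); with it on, segments at or above max_merge_mb are
    left alone, so optimize may leave more than one segment."""
    if not use_max_for_optimize:
        return list(segment_sizes_mb)
    return [s for s in segment_sizes_mb if s < max_merge_mb]

sizes = [100, 500, 3000]                       # MB; 3000 exceeds the 2048 default
print(segments_to_merge(sizes, 2048, True))    # [100, 500]
print(segments_to_merge(sizes, 2048, False))   # [100, 500, 3000]
```

With the flag on, the 3000 MB segment survives optimize untouched, which is exactly the surprise the reopen describes.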
[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated LUCENE-2868:
--------------------------------

Description:

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().

was:

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
Summary: It should be easy to make use of TermState; rewritten queries should be shared automatically
(was: It should be easy to make use of TermCache; rewritten queries should be shared automatically)
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981746#action_12981746 ]

Karl Wright commented on LUCENE-2868:
-------------------------------------

I reworded the description. I think the word "cache" is correct, but what we really need is simply a cache that has the lifetime of a top-level rewrite. I agree that putting the data in the query object itself would not have this characteristic, but on the other hand a second Query method that is cache-aware seems reasonable. For example:

    Query rewriteMinimal(RewriteCache rc, IndexReader ir)

... where RewriteCache is an object that has a lifetime consistent with the highest-level rewrite operation done on the query graph. The rewriteMinimal() method would look for the rewrite of the current query in the RewriteCache and, if found, would return that; otherwise it would call plain old rewrite() and then save the result.

So the patch would include:
(a) the change as specified to Query.java
(b) an implementation of RewriteCache, which *could* just be simplified to Map<Query,Query>
(c) changes to the callers of rewrite(), so that the minimal rewrite is called instead.

Thoughts?
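The rewriteMinimal idea above is plain memoization over the query graph. A self-contained sketch with toy query classes (the RewriteCache here is just a dict, following the Map<Query,Query> simplification; none of this is the real Lucene API):

```python
class TermQuery:
    """Toy leaf query; rewriting is the identity, but we count calls
    to show how many real rewrites happen."""
    rewrites = 0

    def __init__(self, term):
        self.term = term

    def rewrite(self):
        TermQuery.rewrites += 1
        return self

    def rewrite_minimal(self, cache):
        # Look up the already-rewritten form first; fall back to
        # plain rewrite() and remember the result.
        if self not in cache:
            cache[self] = self.rewrite()
        return cache[self]

class BooleanQuery:
    """Toy combining query: pushes the cache down to every clause."""
    def __init__(self, clauses):
        self.clauses = clauses

    def rewrite_minimal(self, cache):
        self.clauses = [c.rewrite_minimal(cache) for c in self.clauses]
        return self

shared = TermQuery("lucene")
q = BooleanQuery([shared, shared, shared])  # same clause three times
q.rewrite_minimal({})                       # cache lives for one top-level rewrite
print(TermQuery.rewrites)                   # 1, not 3: the rewrite was shared
```

Because the dict is created per top-level call, it naturally has the "lifetime of a top-level rewrite" Karl asks for, with nothing stateful stored on the query objects themselves.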
How to submit code?
Hi

I started looking into Lucene, as I might need it on a project. As there was no GermanAnalyzer in the .NET version, I ported the code that was available in the Java version to .NET. As I am new to the open-source world, I do not know exactly how I need to proceed to get this piece of code included. Send it to a contributor? Thanks for any advice.

Regards
Jörg Lang
[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981762#action_12981762 ]

Robert Muir commented on LUCENE-2723:
-------------------------------------

Ok, we are caught up to trunk... but we need to integrate getBulkPostingsEnum with TermState to fix the nocommits in TermQuery. This should also finally allow us to fix the cost of that extra per-segment docFreq.

Speed up Lucene's low level bulk postings read API
--------------------------------------------------

Key: LUCENE-2723
URL: https://issues.apache.org/jira/browse/LUCENE-2723
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0
Attachments: LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch

Spinoff from LUCENE-1410. The flex DocsEnum has a simple bulk-read API that reads the next chunk of docs/freqs. But it's a poor fit for intblock codecs like FOR/PFOR (from LUCENE-1410). This is not unlike sucking coffee through those tiny plastic coffee stirrers they hand out on airplanes that, surprisingly, also happen to function as a straw. As a result we see no perf gain from using FOR/PFOR. I had hacked up a fix for this, described in my blog post at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html and I'm opening this issue to get that work to a committable point.

So... I've worked out a new bulk-read API to address this performance bottleneck. It has some big changes over the current bulk-read API:

* You can now also bulk-read positions (but not payloads), but I have yet to cut over positional queries.
* The buffer contains doc deltas, not absolute values, for docIDs and positions (freqs are absolute).
* Deleted docs are not filtered out. * The doc freq buffers need not be aligned. For fixed intblock codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16, Group varint, etc.) they won't be. It's still a work in progress... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
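The delta-buffer point above ("the buffer contains doc deltas, not absolute values") comes down to a prefix sum on the consumer side. A minimal, self-contained sketch of that decoding step follows; the class and method names are invented for illustration and are not part of the DocsEnum API:

```java
import java.util.Arrays;

// Sketch of delta decoding as used by the proposed bulk-read API:
// the codec fills a buffer with document-ID deltas, and the consumer
// reconstructs absolute docIDs with a running (prefix) sum.
public class DeltaDecode {
    // Turn a buffer of doc deltas into absolute docIDs.
    public static int[] decode(int[] deltas) {
        int[] docIDs = new int[deltas.length];
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];      // each delta is relative to the previous doc
            docIDs[i] = doc;
        }
        return docIDs;
    }

    public static void main(String[] args) {
        // docIDs 5, 8, 10, 17 encoded as deltas 5, 3, 2, 7
        int[] docs = decode(new int[]{5, 3, 2, 7});
        System.out.println(Arrays.toString(docs)); // [5, 8, 10, 17]
    }
}
```

Small deltas are the whole point: they pack into far fewer bits than absolute docIDs under FOR/PFOR or varint encodings.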
[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2868:
------------------------------------

Attachment: query-rewriter.patch

I just sketched out what I have in mind that could solve this problem and create the infrastructure to do way more than just caching a query#rewrite. This patch (which is just a sketch to show what I have in mind) adds a QueryRewriter class that walks the Query AST and rewrites each query node in the tree. The default implementation does nothing special; it just forwards to the query's rewrite() method, but there seems to be a whole lot of potential in such a tree-walker / visitor. For instance, we could subclass it to optimize certain queries if we fix the coord problem. Yet another use case is to decouple the MTQ rewriter entirely from MTQ (not sure if we want that though), or somebody might want to wrap a query during rewrite. Going even further, somebody could rewrite against the field cache? Maybe this can be made more general still and just be a QueryVisitor so folks can easily walk the tree. I think this is really something that should be solved in general AND in a different issue.

simon

It should be easy to make use of TermState; rewritten queries should be shared automatically
--------------------------------------------------------------------------------------------
Key: LUCENE-2868
URL: https://issues.apache.org/jira/browse/LUCENE-2868
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Karl Wright
Attachments: query-rewriter.patch

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.

This can be most readily (and powerfully) done by introducing a new method to Query.java:

Query rewriteUsingCache(IndexReader indexReader) ...

and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
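As a rough illustration of the caching idea, a memoizing wrapper might look like the sketch below. The classes are hypothetical stand-ins: the real rewriteUsingCache would take an IndexReader and live on Lucene's own Query class. An identity map is used because the point is to share the rewrite of the *same* query node appearing twice in the tree:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Minimal sketch (hypothetical classes, not the real Lucene API) of the
// rewriteUsingCache idea: memoize each node's rewritten form so that a
// query instance shared across the tree is rewritten only once.
public class RewriteCache {
    public static abstract class Query {
        public abstract Query rewrite(); // stands in for Query.rewrite(IndexReader)
    }

    public static class ExpensiveQuery extends Query {
        public static int rewriteCalls = 0;          // count real rewrites for the demo
        public Query rewrite() { rewriteCalls++; return this; }
    }

    private final Map<Query, Query> cache = new IdentityHashMap<>();

    // Return the cached rewritten query, performing the rewrite on first sight only.
    public Query rewriteUsingCache(Query q) {
        return cache.computeIfAbsent(q, Query::rewrite);
    }

    public static void main(String[] args) {
        RewriteCache rc = new RewriteCache();
        Query shared = new ExpensiveQuery();
        rc.rewriteUsingCache(shared);
        rc.rewriteUsingCache(shared);   // same node appears twice in the tree
        System.out.println(ExpensiveQuery.rewriteCalls); // 1
    }
}
```

A real implementation would also need to scope the cache to a single reader, since a rewrite is only valid against the reader it was computed for.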
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981774#action_12981774 ]

Earwin Burrfoot commented on LUCENE-2868:
-----------------------------------------

We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than the default visitor pattern in Java. Usage looks roughly like:

{code}
class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of the walk result
  String visit(TermQuery q) {
    return "{term: " + q.getTerm() + "}";
  }

  String visit(BooleanQuery q) {
    StringBuffer buf = new StringBuffer();
    buf.append("{boolean: ");
    for (BooleanQuery.Clause clause : q.clauses()) {
      buf.append(dispatch(clause.getQuery())).append(", "); // Here we recurse
    }
    buf.append("}");
    return buf.toString();
  }

  String visit(SpanQuery q) {
    // Runs for all SpanQueries ...
  }

  String visit(Query q) {
    // Runs for all Queries not covered by a more exact visit() method ...
  }
}

Query query = ...;
String stringRepresentation = new ToStringWalker().dispatch(query);
{code}

dispatch() checks its parameter's runtime type, picks the closest visit() overload (according to the Java rules for compile-time overloaded method resolution), and invokes it.
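For readers unfamiliar with the trick, the dispatch() behavior described above can be sketched in plain Java: walk the argument's runtime class up the hierarchy until a visit() overload declared for that exact type exists. ReflectiveDispatch and its nested query classes below are invented stand-ins for illustration, not Earwin's actual code:

```java
import java.lang.reflect.Method;

// Sketch of reflection-based single-argument dispatch: find the visit()
// overload for the argument's runtime type, falling back to supertypes.
public class ReflectiveDispatch {
    // Tiny stand-in hierarchy for the Query classes.
    public static class Query {}
    public static class TermQuery extends Query {}
    public static class FuzzyQuery extends TermQuery {} // no visit() of its own

    public String dispatch(Query q) {
        // Try the exact runtime type first, then each supertype in turn.
        for (Class<?> c = q.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method m = getClass().getMethod("visit", c);
                return (String) m.invoke(this, q);
            } catch (NoSuchMethodException e) {
                // no overload declared for this type; try the supertype next
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        throw new IllegalArgumentException("no visit() overload found");
    }

    public String visit(TermQuery q) { return "term"; }
    public String visit(Query q)     { return "query"; }

    public static void main(String[] args) {
        ReflectiveDispatch v = new ReflectiveDispatch();
        System.out.println(v.dispatch(new FuzzyQuery())); // term (nearest overload)
        System.out.println(v.dispatch(new Query()));      // query
    }
}
```

Note one difference from compile-time overload resolution: this walks only the superclass chain, so interface-typed overloads would need extra handling in a production version.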
[jira] Updated: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2723:
------------------------------------

Attachment: LUCENE-2723.patch

Here is a fix for the nocommit Robert put into TermQuery. All tests pass; I will commit in a bit.
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981778#action_12981778 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

bq. I can share a generic reflection-based visitor that's somewhat more handy than default visitor pattern in java.

Earwin - I think we should open a new issue and get something like that implemented there, which is more general than what I just sketched out. If you could share your code that would be awesome!
[jira] Created: (LUCENE-2869) remove Query.getSimilarity()
remove Query.getSimilarity()
----------------------------
Key: LUCENE-2869
URL: https://issues.apache.org/jira/browse/LUCENE-2869
Project: Lucene - Java
Issue Type: Task
Reporter: Robert Muir

Spinoff of LUCENE-2854. See LUCENE-2828 and LUCENE-2854 for reference.

In general, the SimilarityDelegator was problematic with regards to back-compat, and if queries want to score differently, trying to runtime-subclass Similarity only causes trouble. The reason we could not fix this in LUCENE-2854 is because:

{noformat}
Michael McCandless added a comment - 08/Jan/11 01:53 PM

bq. Is it possible to remove this method Query.getSimilarity also? I don't understand why we need this method!

I would love to! But I think that's for another day...

I looked into this and got stuck with BoostingQuery, which rewrites to an anon subclass of BQ overriding its getSimilarity to in turn override its coord method. Rather twisted... if we can do this differently I think we could remove Query.getSimilarity.
{noformat}

Here is the method in question:

{noformat}
/** Expert: Returns the Similarity implementation to be used for this query.
 * Subclasses may override this method to specify their own Similarity
 * implementation, perhaps one that delegates through that of the Searcher.
 * By default the Searcher's Similarity implementation is returned. */
public Similarity getSimilarity(IndexSearcher searcher) {
  return searcher.getSimilarity();
}
{noformat}
[jira] Updated: (LUCENE-2869) remove Query.getSimilarity()
[ https://issues.apache.org/jira/browse/LUCENE-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2869:
--------------------------------

Attachment: LUCENE-2869.patch

Here's a patch. To fix the BoostingQuery in contrib, it overrides BooleanWeight. (Also a test that instantiates BooleanScorer with a null weight had to be fixed.)
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981803#action_12981803 ]

Jason Rutherglen commented on LUCENE-2701:
------------------------------------------

I agree that there should not be a default max merge segment size for optimize, though it's good to have the option.

Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
----------------------------------------------------------------
Key: LUCENE-2701
URL: https://issues.apache.org/jira/browse/LUCENE-2701
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1, 4.0
Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch

LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken into consideration in regular merges, yet ignored by findMergesForOptimize. I think it'd be good if we take that into consideration even when optimizing. This will allow the caller to specify two constraints: maxNumSegments and maxMergeMB. Obviously both may not be satisfied, and therefore we will guarantee that if there is any segment above the threshold, the threshold constraint takes precedence; therefore you may end up w/ more than maxNumSegments (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into consideration.

As part of this change, I plan to change some methods (and members) from private to protected. I realized that if one wishes to implement his own LMP extension, he needs to either put it under o.a.l.index or copy some code over to his impl. I'll attach a patch shortly.
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981813#action_12981813 ]

Shai Erera commented on LUCENE-2701:
------------------------------------

I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VALUE. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it. I expect that if I use IW with an LMP that sets maxMergeMB, then even if I call optimize() this setting will take effect.

BTW, I don't remember introducing this default as part of this issue. This issue only changed LMP to take the already existing setting into account. So maybe reverting this default should be handled within the issue it was changed in?
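The sentinel-default idea above (Long.MAX_VALUE meaning "no limit, so all segments are considered") can be sketched as follows. The names are hypothetical, not LogMergePolicy's actual fields or methods:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names) of defaulting maxMergeSize to Long.MAX_VALUE:
// if the user never sets a limit, every segment is eligible for the optimize
// merge; setting a real limit excludes oversized segments.
public class MergeEligibility {
    public static long maxMergeSize = Long.MAX_VALUE; // "unset" sentinel: no limit

    // Keep only the segments small enough to be merged during optimize.
    public static List<Long> eligibleForOptimize(List<Long> segmentSizes) {
        List<Long> out = new ArrayList<>();
        for (long size : segmentSizes) {
            if (size <= maxMergeSize) out.add(size);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(10L, 5_000L, 9_000_000L);
        System.out.println(eligibleForOptimize(sizes).size()); // 3: no limit set
        maxMergeSize = 1_000_000L;                             // user-set cap
        System.out.println(eligibleForOptimize(sizes).size()); // 2: big segment excluded
    }
}
```

The sentinel avoids a separate useDefaultMaxMergeMb boolean: one field expresses both "no limit" and "this limit".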
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981817#action_12981817 ]

Simon Willnauer commented on LUCENE-2701:
-----------------------------------------

bq. BTW, I don't remember introducing this default as part of this issue. This issue only changed LMP to take the already existing setting into account. So maybe reverting this default should be handled within the issue it was changed in?

True, this was done in LUCENE-2773 - but this seemed to be more related?!

bq. I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VALUE. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it.

I think Mike did that on purpose to prevent large segs from merging during indexing, so what is wrong with disabling that limit during optimize?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981827#action_12981827 ]

Jason Rutherglen commented on LUCENE-2324:
------------------------------------------

I'm taking a guess here, however the ThreadAffinityDocumentsWriterThreadPool.getAndLock method looks a little suspicious, as we're iterating on ThreadStates and calling put on a non-concurrent hashmap while not in a lock?

Per thread DocumentsWriters that write their own private segments
-----------------------------------------------------------------
Key: LUCENE-2324
URL: https://issues.apache.org/jira/browse/LUCENE-2324
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Realtime Branch
Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, test.out

See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293:

Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981830#action_12981830 ]

Jason Rutherglen commented on LUCENE-2324:
------------------------------------------

Also, multiple threads can call DocumentsWriterPerThread.addDocument and that's resulting in this:

{code}
[junit] java.lang.AssertionError: omitTermFreqAndPositions:false postings.docFreqs[termID]:0
[junit]   at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:143)
[junit]   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:234)
[junit]   at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:91)
[junit]   at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:274)
[junit]   at org.apache.lucene.index.DocumentsWriterPerThread.addDocument(DocumentsWriterPerThread.java:184)
[junit]   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:374)
[junit]   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1403)
[junit]   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1375)
{code}
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981832#action_12981832 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

bq. as we're iterating on ThreadStates and on a non-concurrent hashmap calling put while not in a lock?

The threadBindings hashmap is a ConcurrentHashMap, and getActivePerThreadsIterator() is threadsafe, I believe.
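The lock-free binding pattern under discussion (a ConcurrentHashMap mapping threads to per-thread states) can be sketched as below. This illustrates the general idiom, not the actual ThreadAffinityDocumentsWriterThreadPool code:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch (hypothetical names) of thread-affinity binding: each thread claims
// a per-thread state without a global lock, relying on the atomicity of
// ConcurrentHashMap.putIfAbsent to resolve races.
public class ThreadAffinityPool {
    public static class ThreadState {} // stands in for a per-thread DocumentsWriter

    private final ConcurrentHashMap<Thread, ThreadState> threadBindings =
            new ConcurrentHashMap<>();

    // Return this thread's state, creating and binding one on first use.
    public ThreadState getAndLock() {
        Thread t = Thread.currentThread();
        ThreadState s = threadBindings.get(t);
        if (s == null) {
            ThreadState fresh = new ThreadState();
            // Atomic: if two threads race on the same key, exactly one wins.
            s = threadBindings.putIfAbsent(t, fresh);
            if (s == null) s = fresh;   // we won the race; use our fresh state
        }
        return s;
    }

    public static void main(String[] args) {
        ThreadAffinityPool pool = new ThreadAffinityPool();
        // The same thread always gets the same state back.
        System.out.println(pool.getAndLock() == pool.getAndLock()); // true
    }
}
```

Note this only makes the *binding* safe; as the follow-up comment observes, two threads still must not end up sharing one state concurrently, which is a separate invariant.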
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981836#action_12981836 ]

Michael McCandless commented on LUCENE-2701:
--------------------------------------------

bq. I think mike did that on purpose to prevent large segs from merging during indexing.

Right - large merges are really quite nasty - they mess up searches, NRT turnaround, IW.close() suddenly and unexpectedly takes like an hour, etc. But really the best fix, which I'd love to do at some point, is to fix our merge policy so that insanely large merges are done w/ fewer segments (eg only 2 segments at once). I think BalancedMP does this.
Release schedule Lucene 4?
Dear Lucene team,

I am wondering whether there is an updated Lucene release schedule for the v4.0 stream. Any earliest/latest alpha/beta/stable dates? And if not yet, where can I track such info?

Thanks in advance from Germany,
gregor
[jira] Created: (LUCENE-2870) if a segment is 100% deletions, we should just drop it
if a segment is 100% deletions, we should just drop it
------------------------------------------------------
Key: LUCENE-2870
URL: https://issues.apache.org/jira/browse/LUCENE-2870
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Fix For: 3.1, 4.0

I think in IndexWriter, if the delCount ever == maxDoc() for a segment, we should just drop it? We don't, today, and so we force it to be merged, which is silly. I won't have time for this any time soon, so if someone wants to take it, please do!! Should be simple.
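The proposed check is simple enough to sketch. The Segment record below is a hypothetical stand-in, not the real SegmentInfo:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the LUCENE-2870 idea: a segment whose delCount equals its
// maxDoc holds no live documents, so it can be dropped outright instead
// of being merged away.
public class DropDeadSegments {
    public static class Segment {
        public final int maxDoc, delCount;
        public Segment(int maxDoc, int delCount) {
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }
    }

    // Keep only segments that still contain at least one live document.
    public static List<Segment> dropFullyDeleted(List<Segment> segments) {
        List<Segment> live = new ArrayList<>();
        for (Segment s : segments) {
            if (s.delCount < s.maxDoc) live.add(s);
        }
        return live;
    }

    public static void main(String[] args) {
        List<Segment> segs = new ArrayList<>();
        segs.add(new Segment(914, 1));   // mostly live: keep
        segs.add(new Segment(100, 100)); // 100% deleted: drop
        System.out.println(dropFullyDeleted(segs).size()); // 1
    }
}
```

Dropping the segment record is cheap; the savings come from never scheduling a merge whose only job is to erase already-dead documents.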
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981839#action_12981839 ]

Jason Rutherglen commented on LUCENE-2324:
------------------------------------------

bq. The threadBindings hashmap is a ConcurrentHashMap and the getActivePerThreadsIterator() is threadsafe I believe.

Sorry, yes, CHM is used and it all looks thread safe, but there must be multiple threads accessing a single DWPT at the same time for some of these errors to be occurring.
[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs
[ https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981843#action_12981843 ] Michael McCandless commented on LUCENE-2666: Can you run CheckIndex on this index and post the result? And, enable assertions. And if possible turn on IndexWriter's infoStream and capture/post the output leading up to the corruption. Many updates during indexing are just fine... and I don't know whether rolling back to older Lucene releases will help (until we've isolated the issue). But: maybe try rolling forward to 3.0.3? It's possible you're hitting a bug fixed in 3.0.3 (though this doesn't ring a bell for me). ArrayIndexOutOfBoundsException when iterating over TermDocs --- Key: LUCENE-2666 URL: https://issues.apache.org/jira/browse/LUCENE-2666 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.2 Reporter: Shay Banon A user got this very strange exception, and I managed to get the index that it happens on. Basically, iterating over the TermDocs causes an AIOOBE. I easily reproduced it using the FieldCache, which does exactly that (the field in question is indexed as numeric). Here is the exception:

{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
        at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
        at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
        at TestMe.main(TestMe.java:56)
{noformat}

It happens on the following segment: _26t docCount: 914 delCount: 1 delFileName: _26t_1.del And as you can see, it smells like a corner case (it fails for document number 912; the AIOOBE happens from the deleted docs).
The code to recreate it is simple:

{code:java}
FSDirectory dir = FSDirectory.open(new File("index"));
IndexReader reader = IndexReader.open(dir, true);
IndexReader[] subReaders = reader.getSequentialSubReaders();
for (IndexReader subReader : subReaders) {
  Field field = subReader.getClass().getSuperclass().getDeclaredField("si");
  field.setAccessible(true);
  SegmentInfo si = (SegmentInfo) field.get(subReader);
  System.out.println("-- " + si);
  if (si.getDocStoreSegment().contains("_26t")) {
    // this is the problematic one...
    System.out.println("problematic one...");
    FieldCache.DEFAULT.getLongs(subReader, "__documentdate", FieldCache.NUMERIC_UTILS_LONG_PARSER);
  }
}
{code}

Here is the result of a check index on that segment:

{noformat}
8 of 10: name=_26t docCount=914
  compound=true
  hasProx=true
  numFiles=2
  size (MB)=1.641
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_26t_1.del]
  test: open reader.........OK [1 deleted docs]
  test: fields..............OK [32 fields]
  test: field norms.........OK [32 fields]
  test: terms, freq, prox...ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
        at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)
  test: stored fields.......ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
        at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)
  test: term vectors........ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104) at
{noformat}
[jira] Resolved: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1821. --- Resolution: Fixed Assignee: Simon Willnauer This is resolved by adding AtomicReaderContext in 4.0 (LUCENE-2831). Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-1821.patch Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id of the documents it matches (only the relative doc offset into the segment). If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid. Suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub-weights must pass this offset down to created sub-weights. Details on workaround: In order to work around this, you must do the following: * Subclass IndexSearcher * Add an int getIndexReaderBase(IndexReader) method to your subclass * During Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass) * During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader) * The Scorer can now rebase any collected docids using this offset Example implementation of getIndexReaderBase():

{code}
// NOTE: a more efficient implementation can be done if you cache
// the result of gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getReader()) {
    return 0;
  } else {
    List readers = new ArrayList();
    gatherSubReaders(readers);
    Iterator iter = readers.iterator();
    int maxDoc = 0;
    while (iter.hasNext()) {
      IndexReader r = (IndexReader) iter.next();
      if (r == reader) {
        return maxDoc;
      }
      maxDoc += r.maxDoc();
    }
  }
  return -1; // reader not in searcher
}
{code}

Notes: * This workaround makes it so you cannot serialize your custom Weight implementation
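The rebasing arithmetic in the workaround above is just a running sum: each sub-reader's base is the total maxDoc() of all readers before it in the composite. A self-contained sketch of that computation over plain arrays (names here are hypothetical, not Lucene's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocBase {
    /**
     * Map each sub-reader (identified by name) to its starting docid in the
     * composite index: the sum of maxDoc of every reader that precedes it.
     */
    static Map<String, Integer> computeBases(String[] readerNames, int[] maxDocs) {
        Map<String, Integer> bases = new LinkedHashMap<>();
        int base = 0;
        for (int i = 0; i < readerNames.length; i++) {
            bases.put(readerNames[i], base);
            base += maxDocs[i]; // next segment's docids start after this one's
        }
        return bases;
    }
}
```

A per-segment docid is then rebased to a global one by adding the segment's base, which is exactly what the Scorer in the workaround does with the value returned from getIndexReaderBase().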
[jira] Closed: (LUCENE-2439) Composite readers (Multi/DirIndexReader) should not subclass IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-2439. - Resolution: Duplicate Duplicate of LUCENE-2858. Composite readers (Multi/DirIndexReader) should not subclass IndexReader Key: LUCENE-2439 URL: https://issues.apache.org/jira/browse/LUCENE-2439 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Fix For: 4.0 I'd like to change Multi/DirIndexReader so that they no longer implement the low level methods of IndexReader, and instead act more like an ordered collection of sub readers. I think to do this we'd need a new interface, common to atomic readers (SegmentReader) and the composite readers, which IndexSearcher would accept. We should also require that the core Query scorers always receive an atomic reader. We've taken strong initial steps here with flex, by forcing users to use separate MultiFields static methods to obtain Fields/Terms/etc. from a composite reader. This issue is to finish this cutover. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2010) Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.
[ https://issues.apache.org/jira/browse/LUCENE-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2010: -- Fix Version/s: 4.0 3.1 Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs. Key: LUCENE-2010 URL: https://issues.apache.org/jira/browse/LUCENE-2010 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Uwe Schindler Fix For: 3.1, 4.0 I do not know if this is a bug in 2.9.0, but it seems that segments with all documents deleted are not automatically removed:

{noformat}
4 of 14: name=_dlo docCount=5
  compound=true
  hasProx=true
  numFiles=2
  size (MB)=0.059
  diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, os=SunOS, os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, source=flush}
  has deletions [delFileName=_dlo_1.del]
  test: open reader.........OK [5 deleted docs]
  test: fields..............OK [136 fields]
  test: field norms.........OK [136 fields]
  test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
  test: stored fields.......OK [0 total field count; avg ? fields per doc]
  test: term vectors........OK [0 total vector count; avg ? term/freq vector fields per doc]
{noformat}

Shouldn't such segments be removed automatically during the next commit/close of IndexWriter? *Mike McCandless:* Lucene doesn't actually short-circuit this case, ie, if every single doc in a given segment has been deleted, it will still merge it [away] like normal, rather than simply dropping it immediately from the index, which I agree would be a simple optimization. Can you open a new issue? I would think IW can drop such a segment immediately (ie not wait for a merge or optimize) on flushing new deletes.
[jira] Closed: (LUCENE-2870) if a segment is 100% deletions, we should just drop it
[ https://issues.apache.org/jira/browse/LUCENE-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-2870. - Resolution: Duplicate Duplicate of LUCENE-2010. if a segment is 100% deletions, we should just drop it -- Key: LUCENE-2870 URL: https://issues.apache.org/jira/browse/LUCENE-2870 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Fix For: 3.1, 4.0 I think in IndexWriter if the delCount ever == maxDoc() for a segment we should just drop it? We don't, today, and so we force it to be merged, which is silly. I won't have time for this any time soon so if someone wants to take it, please do!! Should be simple.
Re: [jira] Created: (LUCENE-2863) Updating a document loses its fields that are only indexed, also NumericField tries are completely lost
This is behaving as intended if I'm reading this correctly. Lucene has never fetched fields that aren't stored, and that's what you're asking it to do. To see why, consider indexing but not storing a normal text field with, say, stop word removal and stemming. The *only* data kept in the index is the analyzed data, so even if you did reconstruct the field (no easy task, BTW), you'd have something that was not the original text and would be pretty unsatisfactory. Kudos for providing the test case by the way, that makes figuring out what the answer is much easier... If this makes sense, could you close the JIRA? If not, we can hash it out a bit more... Best Erick On Wed, Jan 12, 2011 at 2:12 PM, Tamas Sandor (JIRA) j...@apache.org wrote: Updating a document loses its fields that are only indexed, also NumericField tries are completely lost --- Key: LUCENE-2863 URL: https://issues.apache.org/jira/browse/LUCENE-2863 Project: Lucene - Java Issue Type: Bug Components: Store Affects Versions: 3.0.3, 3.0.2 Environment: WindowsXP, Java 1.6.20, using a RAMDirectory Reporter: Tamas Sandor I have a code snippet (see below) which creates a new document with standard (stored, indexed), *not-stored, indexed-only* and some *NumericField* fields. Then it updates the document by adding a new string field. The result is that all fields that are not stored but indexed-only, and especially the NumericField trie tokens, are completely lost from the index after an update or delete/add.
{code:java}
Directory ramDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(ramDir, new WhitespaceAnalyzer(), MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("ID", "HO1234", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("PATTERN", "HELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new NumericField("LAT", Store.YES, true).setDoubleValue(51.48826603066d));
doc.add(new NumericField("LNG", Store.YES, true).setDoubleValue(-0.08913399651646614d));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("ID", "HO", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("PATTERN", "BELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new NumericField("LAT", Store.YES, true).setDoubleValue(101.48826603066d));
doc.add(new NumericField("LNG", Store.YES, true).setDoubleValue(-100.08913399651646614d));
writer.addDocument(doc);
Term t = new Term("ID", "HO1234");
Query q = new TermQuery(t);
IndexSearcher searcher = new IndexSearcher(writer.getReader());
TopDocs hits = searcher.search(q, 1);
if (hits.scoreDocs.length > 0) {
  Document ndoc = searcher.doc(hits.scoreDocs[0].doc);
  ndoc.add(new Field("FINAL", "FINAL", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
  writer.updateDocument(t, ndoc);
  // writer.deleteDocuments(q);
  // writer.addDocument(ndoc);
} else {
  LOG.info("Couldn't find the document via the query");
}
searcher = new IndexSearcher(writer.getReader());
hits = searcher.search(new TermQuery(new Term("PATTERN", "HELLO")), 1);
LOG.info("_hits HELLO: " + hits.totalHits); // should be 1 but it's 0
writer.close();
{code}

And I have a bounding-box query based on *NumericRangeQuery*. After the document update it doesn't return any hit.
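Erick's explanation above can be reduced to a toy model: a retrieved Document contains only its STORED fields, so re-indexing that retrieved copy silently drops every indexed-only field (including NumericField trie terms). The following sketch uses purely illustrative types, not Lucene's:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StoredVsIndexed {
    /** Toy document: stored fields are retrievable, indexed-only fields are not. */
    static final class Doc {
        final Map<String, String> stored = new HashMap<>(); // survives retrieval
        final Set<String> indexedOnly = new HashSet<>();    // searchable, not retrievable
    }

    /** Simulates IndexSearcher.doc(): only stored fields come back. */
    static Doc retrieve(Doc original) {
        Doc copy = new Doc();
        copy.stored.putAll(original.stored);
        // indexed-only fields are NOT copied: no stored form of them exists
        return copy;
    }
}
```

Updating via retrieve-then-updateDocument, as in the snippet above, therefore re-indexes a document that never had the un-stored fields to begin with; the original source data must be kept outside the index to rebuild them.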
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981895#action_12981895 ] Jason Rutherglen commented on LUCENE-2324: -- Also, why are we always (well, likely) assigning the DWPT to a different thread state if tryLock returns false? If there's a lot of contention (eg, far more incoming threads than DWPTs), then won't the thread assignment code become a hotspot? In ThreadAffinityDocumentsWriterThreadPool.clearThreadBindings(ThreadState perThread) we're actually clearing the entire map. When this is called in IW.flush (which is unsynced on IW), if there are multiple concurrent flushes, then perhaps a single DWPT is in use by multiple threads. To safeguard against this, and perhaps more easily add an assertion, maybe we should lock on the DWPT rather than the ThreadState? Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes).
The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981930#action_12981930 ] Alexander Kanarsky commented on SOLR-1301: -- Note for the Hadoop 0.21 users: the current patch can be used as is with 0.21, but you will need to make sure to compile it with appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e. apache-solr-hadoop-1.4.x-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: Next Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. 
Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
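The batching behavior described in the design above (SolrRecordWriter accumulating documents and periodically submitting them to the embedded server, with a final flush and commit on close) can be sketched in isolation. All names here are illustrative stand-ins, not the actual SOLR-1301 classes:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchWriter {
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    // Stand-in for the EmbeddedSolrServer: records each submitted batch.
    final List<List<String>> submitted = new ArrayList<>();

    BatchWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Called once per reduce-task (key, value) pair, analogous to write(). */
    void write(String doc) {
        batch.add(doc);
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    /** Submit the pending batch to the server and clear the buffer. */
    void flush() {
        if (!batch.isEmpty()) {
            submitted.add(new ArrayList<>(batch));
            batch.clear();
        }
    }

    /** On close the final partial batch is submitted, mirroring commit()/optimize(). */
    void close() {
        flush();
    }
}
```

Buffering like this is what lets the real patch avoid a round-trip per document while still guaranteeing, via the close-time flush, that no documents are lost when the reduce task finishes.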
[jira] Issue Comment Edited: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981930#action_12981930 ] Alexander Kanarsky edited comment on SOLR-1301 at 1/14/11 4:27 PM: --- Note for the Hadoop 0.21 users: the current patch can be used as is with 0.21, but you will need to make sure to compile it with appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e. apache-solr-hadoop-xxx-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20. was (Author: kanarsky): Note for the Hadoop 0.21 users: the current patch can be used as is with 0.21, but you will need to make sure to compile it with appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e. apache-solr-hadoop-1.4.x-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: Next Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. 
The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. 
[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981967#action_12981967 ] Steven Rowe commented on LUCENE-2611: - bq. And perhaps the copyright setup should be set up for ASL. bq. I've used the copyright plugin a lot and its a great way to ensure that the ASL is added to any new files. Might be useful to add it to reduce the hassle for new contributors. Committed IntelliJ IDEA Copyright Plugin configuration for the Apache Software Licence: trunk rev. 1059199, branch_3x rev. 1059200 IntelliJ IDEA and Eclipse setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2611-branch-3x-part2.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test_2.patch Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming. The attached patches add a new top level directory {{dev-tools/}} with sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, as well as top-level ant targets named idea and eclipse that copy these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit run configuration per module is included. 
The Eclipse configuration includes a source entry for each source/test/resource location and classpath setup: a library entry for each jar. For IDEA, once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. For Eclipse, once {{ant eclipse}} has been run, the user has to refresh the project (right-click on the project and choose Refresh). If these patches are committed, Subversion svn:ignore properties should be added/modified to ignore the destination IDEA and Eclipse configuration locations. Iam Jambour has written up on the Lucene wiki a detailed set of instructions for applying the 3.X branch patch for IDEA: http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ
Lucene-3.x - Build # 242 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/242/ All tests passed Build Log (for compile errors): [...truncated 21064 lines...]
[jira] Resolved: (SOLR-975) admin-extra.html not correctly displayed when using multicore configuration
[ https://issues.apache.org/jira/browse/SOLR-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-975. --- Resolution: Fixed Fix Version/s: 4.0 Assignee: Yonik Seeley Thanks for verifying, Edward. admin-extra.html not correctly displayed when using multicore configuration - Key: SOLR-975 URL: https://issues.apache.org/jira/browse/SOLR-975 Project: Solr Issue Type: Bug Components: web gui Affects Versions: 1.4 Environment: Jetty openjdk 1.6.0 1.0.b12 (EPEL package for EL5) Reporter: Edward Rudd Assignee: Yonik Seeley Fix For: 4.0 I'm having cross-talk issues with using the Solr nightlies (and probably w/ the 1.3.0 release, but I have not tested, as I needed newer features of the DataImportHandler in the nightlies). Basic scenario for this bug is as follows: I have two cores configured and BOTH have a customized admin-extra.html; however, going to the admin pages uses the SAME admin-extra.html for all cores. The one used is whichever core is browsed first. This looks like a caching bug where the cache is not taking the core into account. Basically my admin-extra.html has a link to the data importer script and a link to reload the core (which has to have the core name explicitly in the per-core admin-extra.html).
[jira] Created: (SOLR-2315) analysis.jsp highlight matches no longer works
analysis.jsp highlight matches no longer works Key: SOLR-2315 URL: https://issues.apache.org/jira/browse/SOLR-2315 Project: Solr Issue Type: Bug Components: web gui Reporter: Hoss Man Fix For: 3.1, 4.0 As noted by Teruhiko Kurosaka on the mailing list, at some point since Solr 1.4, highlight matches stopped working on analysis.jsp -- on both the 3x and trunk branches
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982028#action_12982028 ] Shai Erera commented on LUCENE-1540: Patch looks good! Can you make TrecContentSource.read() public and not package-private? That way people can use it outside benchmark's package as well, supporting other/newer/older TREC formats. Improvements to contrib.benchmark for TREC collections -- Key: LUCENE-1540 URL: https://issues.apache.org/jira/browse/LUCENE-1540 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Tim Armstrong Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-1540.patch The benchmarking utilities for TREC test collections (http://trec.nist.gov) are quite limited and do not support some of the variations in format of older TREC collections. I have been doing some benchmarking work with Lucene and have had to modify the package to support: * Older TREC document formats, which the current parser fails on due to missing document headers. * Variations in query format - newlines after the title tag causing the query parser to get confused. * Ability to detect and read in uncompressed text collections * Storage of document numbers by default without storing full text. I can submit a patch if there is interest, although I will probably want to write unit tests for the new functionality first.
Solr-3.x - Build # 228 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-3.x/228/ All tests passed Build Log (for compile errors): [...truncated 20279 lines...]
Lucene-Solr-tests-only-trunk - Build # 3783 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3783/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1127) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1059) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:227) Build Log (for compile errors): [...truncated 8229 lines...]