Contributing code

2011-01-14 Thread Troy Howard
All,

Now that we've moved past the proposal stage and defined our Initial
Committers list, I'd like to address the topic of how to be a
Contributor to Lucene.Net.

Some quick things to note upfront about roles. Previously I made a
point of distinguishing between Contributors and Committers at ASF.
This was meant to help motivated individuals to decide what level of
commitment they wanted to make to the project. I did not intend to
suggest that there is a special status of being a Contributor. I
listed those who had come forward offering support in the proposal
mostly to show that the community around the project was vital with a
lot of motivated individuals. I hope that this wasn't interpreted as
implying a special status to those people, or implying that others,
not on that list, could not be contributors.

There is no special status of Contributor that someone must gain prior
to submitting code. Anyone can write and submit code patches at any
time. As soon as you have done that, you are a Contributor.

All code contributions to ASF projects follow the same pattern. First,
a JIRA issue is created for the patch, with a description of the
change, and with the patch file attached to it. A project Committer
will find the issue, review the patch, and commit to SVN (or reject
the patch and provide an explanation).


Here's a quick guideline to the process for committing code to Lucene.Net:


Step-by-Step Example


Suppose I have downloaded the source code, and made a change to
'HelloWorld.cs'. Suppose I'm using TortoiseSVN.


STEP 1: Make a patch file

From TortoiseSVN, right click on the changed file/files and select
'Create Patch' from the 'TortoiseSVN' context menu. Save it as
'HelloWorld.cs.patch'.


STEP 2: Create a JIRA Issue

Lucene.Net's JIRA issue tracker is located here:

https://issues.apache.org/jira/browse/LUCENENET

If you don't have an account in JIRA, you can sign up easily (click
'Login' in the upper right and, from that screen, click 'SignUp').

Once you're logged in to JIRA, you can create a new issue in the issue
tracker. For code patches, use issue type 'Improvement' or 'Bug'.
Please describe the patch you made with enough information that
someone else can understand both the code and the reasons why you
patched it.


STEP 3: Attach patch file to the JIRA Issue

After creating the issue, attach the 'HelloWorld.cs.patch' file to the
issue. For large patches, you may want to compress the source code
into a zip file.


STEP 4: Committer will apply or reject patch

A Committer will find the new issue, review the patch and either
commit to SVN or reject the patch with an explanation. This often
involves a discussion in the comments for the issue. Please remain
engaged with the conversation to ensure the completion of the issue.
Perhaps only a small change needs to be made to the patch in order for
it to be accepted.




An example of an issue that follows this process is here:

http://issues.apache.org/jira/browse/LUCENENET-331

I'd like to see a description of this process made available on the
project web page. I think this is a point of confusion for a lot of
would-be contributors.


Thanks,
Troy


Small change in one of the sample files, i.e., samples/mansearch.py

2011-01-14 Thread Jean Luc Truchtersheim
Hello,

I have just installed pylucene and tested some of the sample scripts.

In samples/mansearch.py, line 68 should be

parser = QueryParser(Version.LUCENE_CURRENT,keywords,
StandardAnalyzer(Version.LUCENE_CURRENT))

rather than
parser = QueryParser(keywords, StandardAnalyzer(Version.LUCENE_CURRENT))

Maybe you could update that.
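
For reference, the corresponding call against the Lucene 3.x Java API (which the
PyLucene sample wraps) looks roughly like the sketch below; the field name and
analyzer just mirror the sample:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ParseExample {
    public static void main(String[] args) throws Exception {
        // The 3.x QueryParser takes the match version as its first argument,
        // just like the corrected PyLucene call above.
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "keywords",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query q = parser.parse("lucene");
        System.out.println(q);
    }
}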

Many thanks.
Jean-Luc


Re: Small change in one of the sample files, i.e., samples/mansearch.py

2011-01-14 Thread Andi Vajda


On Fri, 14 Jan 2011, Jean Luc Truchtersheim wrote:


I have just installed pylucene and tested some of the sample scripts.

In samples/mansearch.py, line 68 should be

parser = QueryParser(Version.LUCENE_CURRENT,keywords,
StandardAnalyzer(Version.LUCENE_CURRENT))

rather than
parser = QueryParser(keywords, StandardAnalyzer(Version.LUCENE_CURRENT))

Maybe you could update that.


Fixed in rev 1059118 of pylucene_2_9 branch.
Fixed in rev 1059131 of pylucene_3_0 branch.
Fixed in rev 1059134 of branch_3_x branch.

Thanks !

Andi..


[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981674#action_12981674
 ] 

Simon Willnauer commented on LUCENE-2773:
-

bq. So for 3.x/trunk (which already take deletions into account by default), 
I'll switch maxMergeMB default to 2 GB. I think this is an OK default given 
that it means your biggest segments will range from 2GB - 20GB.
Mike, this also means that an optimize will have no effect if all segments are
> 2GB with this as the default? It seems kind of odd to me, eh?


 Don't create compound file for large segments by default
 

 Key: LUCENE-2773
 URL: https://issues.apache.org/jira/browse/LUCENE-2773
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9.4, 3.0.3, 3.1, 4.0

 Attachments: LUCENE-2773.patch


 Spinoff from LUCENE-2762.
 CFS is useful for keeping the open file count down.  But, it costs
 some added time during indexing to build, and also ties up temporary
 disk space, causing eg a large spike on the final merge of an
 optimize.
 Since MergePolicy dictates which segments should be CFS, we can
 change it to only build CFS for smallish merges.
 I think we should also set a maxMergeMB by default so that very large
 merges aren't done.
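
 For context, a rough sketch of the merge-policy knobs involved, assuming the
 trunk/3.x IndexWriterConfig-based setup (the 2048 value just mirrors the 2 GB
 default discussed above; the issue itself is about changing defaults, not user code):
 {code}
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.index.LogByteSizeMergePolicy;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.RAMDirectory;
 import org.apache.lucene.util.Version;

 public class MergePolicySetup {
   public static void main(String[] args) throws Exception {
     LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
     mp.setMaxMergeMB(2048.0);        // segments larger than ~2 GB are not picked for regular merges
     mp.setUseCompoundFile(true);     // newly created segments use the compound file format

     Directory dir = new RAMDirectory();
     IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_CURRENT,
         new StandardAnalyzer(Version.LUCENE_CURRENT)).setMergePolicy(mp);
     IndexWriter writer = new IndexWriter(dir, conf);
     writer.close();
   }
 }
 {code}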




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981682#action_12981682
 ] 

Simon Willnauer commented on LUCENE-2868:
-

{quote}
When you have the same query in a query hierarchy multiple times, tremendous 
savings can now be had if the user knows enough to share the rewritten queries 
in the hierarchy, due to the TermCache addition. But this is clumsy and 
requires a lot of coding by the user to take advantage of. Lucene should be 
smart enough to share the rewritten queries automatically.
{quote}

First of all, I get nervous when it comes to stuff like this! That said, I can see
where this could be useful: for instance, if you have one and the same FuzzyQuery
/ RegexpQuery, which has a rather large setup cost, in more than one clause of a
boolean query, then this would absolutely help. For other queries like
TermQuery, the TermState cache in TermsEnum already helps you a lot, so for those
this wouldn't make a big difference though.

bq. Query rewriteUsingCache(IndexReader indexReader)
I think one major issue here is how you would clear such a cache.
WeakReferences would work, but I wouldn't want to put any cache into any query. In
general we shouldn't make any query heavyweight or somewhat stateful at
all. Yet, if we passed a RewriteCache into Query#rewrite(IR, RC) that has a
per-IS#search lifetime, this could actually work. This would also be easy to
implement: Query#rewrite(IR, RC) would just forward to Query#rewrite(IR) by
default, and combining queries (BooleanQuery) could override the new one.
Eventually, MultiTermQuery could provide such an impl and check the cache to see
if it needs to rewrite itself or can return an already rewritten version.
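
A minimal sketch of the per-search cache idea (RewriteCache and rewriteOnce are
hypothetical names, not existing Lucene APIs):
{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;

// Hypothetical per-search rewrite cache: one instance lives for the duration of
// a single IndexSearcher#search call and is thrown away afterwards.
public class RewriteCache {
  private final Map<Query, Query> rewritten = new HashMap<Query, Query>();

  /** Rewrite 'query' against 'reader', reusing an earlier result for an equal query. */
  public Query rewriteOnce(Query query, IndexReader reader) throws IOException {
    Query cached = rewritten.get(query);
    if (cached != null) {
      return cached;                 // same (equal) query already rewritten in this search
    }
    Query result = query.rewrite(reader);
    rewritten.put(query, result);    // relies on Query#equals/hashCode
    return result;
  }
}
{code}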

 It should be easy to make use of TermCache; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright

 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermCache addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically

2011-01-14 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981685#action_12981685
 ] 

Karl Wright commented on LUCENE-2868:
-

Fine by me if you have a better way of doing it!

Who would create the RewriteCache object?  The IndexSearcher?



 It should be easy to make use of TermCache; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright

 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermCache addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981686#action_12981686
 ] 

Simon Willnauer commented on LUCENE-2868:
-

bq. Who would create the RewriteCache object? The IndexSearcher?
It could... or it could just be an overloaded IS.search method.

 It should be easy to make use of TermCache; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright

 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermCache addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981706#action_12981706
 ] 

Simon Willnauer commented on LUCENE-2868:
-

Actually, I think we need to clarify the description of this issue. This has
nothing to do with TermCache at all. It actually reads rather scary, since
caches are really tricky, and this one is mainly about rewrite cost in MTQ. That
said, adding a method to Query just for the sake of MTQ rewrite seems kind of
odd. We should rather optimize the query structure somehow instead of
caching rewrite() results.



 It should be easy to make use of TermCache; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright

 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermCache addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Resolved: (LUCENE-2864) add maxtf to fieldinvertstate

2011-01-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2864.
-

Resolution: Fixed
  Assignee: Robert Muir

Committed revision 1058939, 1058944 (3x)

 add maxtf to fieldinvertstate
 -

 Key: LUCENE-2864
 URL: https://issues.apache.org/jira/browse/LUCENE-2864
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Query/Scoring
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2864.patch


 the maximum within-document TF is a very useful scoring value;
 we should expose it so that people can use it in scoring.
 Consider the following sim:
 {code}
 @Override
 public float idf(int docFreq, int numDocs) {
   return 1.0F; /* not used */
 }
 @Override
 public float computeNorm(String field, FieldInvertState state) {
   return state.getBoost() / (float) Math.sqrt(state.getMaxTF());
 }
 {code}
 which is surprisingly effective, but more interesting for practical reasons.




CorruptIndexException when indexing

2011-01-14 Thread Li Li
Hi all,
   We have run into this problem 3 times while testing.
   The exception stack is:
Exception in thread "Lucene Merge Thread #2"
org.apache.lucene.index.MergePolicy$MergeException:
org.apache.lucene.index.CorruptIndexException: docs out of order (7286
<= 7286 )
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:355)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:319)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of
order (7286 <= 7286 )
at 
org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:75)
at 
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:880)
at 
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:818)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:756)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:187)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5354)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4937)

Or
Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException:
java.lang.ArrayIndexOutOfBoundsException: 330
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:355)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:319)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 330
at org.apache.lucene.util.BitVector.get(BitVector.java:102)
at 
org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:238)
at 
org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:168)
at 
org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
at 
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:870)
at 
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:818)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:756)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:187)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5354)


   We made some minor modifications based on Lucene 2.9.1 and Solr
1.4.0: we modified the frq file to store 4 bytes for the positions where the
term occurs in each document (accessing the full positions in the prx file is too
time-consuming to meet our needs). I can't tell whether it's our bug or Lucene's
own bug.
   I searched the mailing list and found the mail "problem during index
merge" posted on 2010-10-21. It's similar to our case.
   It seems the docList in the frq file is stored incorrectly. When merging,
when it's decoded, the wrong docID may be larger than maxDoc (the BitVector of
deletedDocs), which causes the second exception; or the docID delta is less than
0 (it is read wrongly), which causes the first exception.
   We are still continuing to test, turning off our modification and
enabling infoStream in solrconfig.xml.

   We found a strange phenomenon: when we test, it sometimes hits these
exceptions, but in our production environment it never hits any.
   The hardware and software environments are the same. We checked
carefully and found that the only difference is this line in solrconfig.xml:
  <ramBufferSizeMB>32</ramBufferSizeMB>   in the testing environment
  <ramBufferSizeMB>256</ramBufferSizeMB>  in the production environment
  The number of indexed documents on each machine is also roughly the
same: 10M+ documents.
  I can't be sure the indexes in the production environment are correct, because
even if some terms' docLists are wrong, the two exceptions will not be hit as long
as the doc deltas are > 0 and there are no deleted documents.
  We also checked the search results in the production environment and don't find
any strange results.

  Could a ramBufferSizeMB that is too small result in index corruption?




[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default

2011-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981726#action_12981726
 ] 

Michael McCandless commented on LUCENE-2773:


bq. Mike, this also means that an optimize will have no effect if all segments
are > 2GB with this as the default? It seems kind of odd to me, eh?

There was a separate issue for this -- LUCENE-2701.

I agree it's debatable... and it's not clear which way we should default it.

 Don't create compound file for large segments by default
 

 Key: LUCENE-2773
 URL: https://issues.apache.org/jira/browse/LUCENE-2773
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9.4, 3.0.3, 3.1, 4.0

 Attachments: LUCENE-2773.patch


 Spinoff from LUCENE-2762.
 CFS is useful for keeping the open file count down.  But, it costs
 some added time during indexing to build, and also ties up temporary
 disk space, causing eg a large spike on the final merge of an
 optimize.
 Since MergePolicy dictates which segments should be CFS, we can
 change it to only build CFS for smallish merges.
 I think we should also set a maxMergeMB by default so that very large
 merges aren't done.




[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981729#action_12981729
 ] 

Simon Willnauer commented on LUCENE-2773:
-

bq. There was a separate issue for this - LUCENE-2701.
I think we should reopen and fix this. I expect optimize to have single-segment
semantics if I call optimize(), as the JavaDocs state, however we do that :)

 Don't create compound file for large segments by default
 

 Key: LUCENE-2773
 URL: https://issues.apache.org/jira/browse/LUCENE-2773
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9.4, 3.0.3, 3.1, 4.0

 Attachments: LUCENE-2773.patch


 Spinoff from LUCENE-2762.
 CFS is useful for keeping the open file count down.  But, it costs
 some added time during indexing to build, and also ties up temporary
 disk space, causing eg a large spike on the final merge of an
 optimize.
 Since MergePolicy dictates which segments should be CFS, we can
 change it to only build CFS for smallish merges.
 I think we should also set a maxMergeMB by default so that very large
 merges aren't done.




[jira] Reopened: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

2011-01-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reopened LUCENE-2701:
-


This change, together with LUCENE-2773, changed the semantics of IW#optimize() and
friends.
IW#optimize() says:
{code}
 /**
   * Requests an optimize operation on an index, priming the index
   * for the fastest available search. Traditionally this has meant
   * merging all segments into a single segment as is done in the
   * default merge policy, but individual merge policies may implement
   * optimize in different ways.
   *

{code}

This is not entirely true anymore, since the default now uses:

{code}
  /** Default maximum segment size.  A segment of this size
   *  or larger will never be merged.  @see setMaxMergeMB */
  public static final double DEFAULT_MAX_MERGE_MB = 2048;
{code}

This is not what I would expect from optimize(), even if it were documented
that way. A plain optimize() call should by default result in a single segment,
IMO. Yet, we could make this settable by a flag in LogMergePolicy, maybe something
like LogMergePolicy#useMaxMergeSizeForOptimize = false as the default?
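
A tiny illustration of the proposed behavior with that hypothetical flag
(illustrative only, not the real LogMergePolicy code):
{code}
public class OptimizeEligibility {

  // Hypothetical flag with the proposed default: optimize() ignores the size cap.
  static boolean useMaxMergeSizeForOptimize = false;

  static boolean eligibleForOptimize(double segmentSizeMB, double maxMergeMB) {
    if (!useMaxMergeSizeForOptimize) {
      return true;                        // plain optimize(): merge down to a single segment
    }
    return segmentSizeMB <= maxMergeMB;   // otherwise honor the regular-merge size cap
  }

  public static void main(String[] args) {
    // With the proposed default, even a 4 GB segment stays eligible for optimize:
    System.out.println(eligibleForOptimize(4096, 2048));   // prints true
  }
}
{code}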

 Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
 

 Key: LUCENE-2701
 URL: https://issues.apache.org/jira/browse/LUCENE-2701
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch


 LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken 
 into consideration in regular merges, yet ignored by findMergesForOptimze. I 
 think it'd be good if we take that into consideration even when optimizing. 
 This will allow the caller to specify two constraints: maxNumSegments and 
 maxMergeMB. Obviously both may not be satisfied, and therefore we will 
 guarantee that if there is any segment above the threshold, the threshold 
 constraint takes precedence and therefore you may end up w/ maxNumSegments 
 (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into 
 consideration.
 As part of this change, I plan to change some methods to protected (from 
 private) and members as well. I realized that if one wishes to implement his 
 own LMP extension, he needs to either put it under o.a.l.index or copy some 
 code over to his impl.
 I'll attach a patch shortly.




[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically

2011-01-14 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated LUCENE-2868:


Description: 
When you have the same query in a query hierarchy multiple times, tremendous 
savings can now be had if the user knows enough to share the rewritten queries 
in the hierarchy, due to the TermState addition.  But this is clumsy and 
requires a lot of coding by the user to take advantage of.  Lucene should be 
smart enough to share the rewritten queries automatically.

This can be most readily (and powerfully) done by introducing a new method to 
Query.java:

Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then 
work for all.  Of course, all callers would want to use this new method rather 
than the current rewrite().


  was:
When you have the same query in a query hierarchy multiple times, tremendous 
savings can now be had if the user knows enough to share the rewritten queries 
in the hierarchy, due to the TermCache addition.  But this is clumsy and 
requires a lot of coding by the user to take advantage of.  Lucene should be 
smart enough to share the rewritten queries automatically.

This can be most readily (and powerfully) done by introducing a new method to 
Query.java:

Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then 
work for all.  Of course, all callers would want to use this new method rather 
than the current rewrite().


Summary: It should be easy to make use of TermState; rewritten queries 
should be shared automatically  (was: It should be easy to make use of 
TermCache; rewritten queries should be shared automatically)

 It should be easy to make use of TermState; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright

 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermState addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically

2011-01-14 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981746#action_12981746
 ] 

Karl Wright commented on LUCENE-2868:
-

I reworded the description.

I think the word cache is correct, but what we really need is simply a cache 
that has the lifetime of a top-level rewrite.  I agree that putting the data in 
the query object itself would not have this characteristic, but on the other 
hand a second Query method that is cache aware seems reasonable.  For example:

Query rewriteMinimal(RewriteCache rc, IndexReader ir)

... where RewriteCache was an object that had a lifetime consistent with the 
highest-level rewrite operation done on the query graph.  The rewriteMinimal() 
method would look for the rewrite of the current query in the RewriteCache, 
and if found, would return that, otherwise would call plain old rewrite() and 
then save the result.

So the patch would include:
(a) the change as specified to Query.java
(b) an implementation of RewriteCache, which *could* just be simplified to
Map<Query,Query>
(c) changes to the callers of rewrite(), so that the minimal rewrite was called 
instead.

Thoughts?


 It should be easy to make use of TermState; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright

 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermState addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




How to submit code?

2011-01-14 Thread Jörg Lang
Hi 

I started looking into Lucene, as I might need it on a project. As there was 
no GermanAnalyzer in the dotNet version, I ported the code that was available 
in the Java version to .NET.

As I am new to the open-source world, I do not know exactly how I need to proceed
to get this piece of code included.
Should I send it to a contributor?

Thanks for any advice.

Regards
Jörg Lang


[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API

2011-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981762#action_12981762
 ] 

Robert Muir commented on LUCENE-2723:
-

Ok, we are caught up to trunk... but we need to integrate getBulkPostingsEnum 
with termstate to fix the nocommits in TermQuery.

This should also finally allow us to fix the cost of that extra per-segment 
docFreq.


 Speed up Lucene's low level bulk postings read API
 --

 Key: LUCENE-2723
 URL: https://issues.apache.org/jira/browse/LUCENE-2723
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2723-termscorer.patch, 
 LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, 
 LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, 
 LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, 
 LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, 
 LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch


 Spinoff from LUCENE-1410.
 The flex DocsEnum has a simple bulk-read API that reads the next chunk
 of docs/freqs.  But it's a poor fit for intblock codecs like FOR/PFOR
 (from LUCENE-1410).  This is not unlike sucking coffee through those
 tiny plastic coffee stirrers they hand out on airplanes that,
 surprisingly, also happen to function as a straw.
 As a result we see no perf gain from using FOR/PFOR.
 I had hacked up a fix for this, described in my blog post at
 http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
 I'm opening this issue to get that work to a committable point.
 So... I've worked out a new bulk-read API to address the performance
 bottleneck.  It has some big changes over the current bulk-read API:
   * You can now also bulk-read positions (but not payloads), but, I
  have yet to cutover positional queries.
   * The buffer contains doc deltas, not absolute values, for docIDs
 and positions (freqs are absolute).
   * Deleted docs are not filtered out.
   * The doc & freq buffers need not be aligned.  For fixed intblock
 codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16,
 Group varint, etc.) they won't be.
 It's still a work in progress...
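
 A hypothetical consumer-side sketch of that contract (the helper and array names
 are illustrative, not the actual flex enum API): the caller accumulates deltas
 into absolute docIDs and skips deleted docs itself.
 {code}
 import org.apache.lucene.util.Bits;

 // Hypothetical consumer of the bulk-read contract described above: the buffers
 // hold docID deltas (freqs are absolute) and deletions are not filtered for us.
 public class BulkPostingsSketch {
   static int sumFreqsOfLiveDocs(int[] docDeltas, int[] freqs, int count, Bits deletedDocs) {
     int doc = 0;
     int total = 0;
     for (int i = 0; i < count; i++) {
       doc += docDeltas[i];                       // deltas -> absolute docIDs
       if (deletedDocs != null && deletedDocs.get(doc)) {
         continue;                                // skip deleted docs ourselves
       }
       total += freqs[i];                         // freqs are absolute values
     }
     return total;
   }
 }
 {code}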




[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically

2011-01-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2868:


Attachment: query-rewriter.patch

I just sketched out what I have in mind; it could solve this problem and create the
infrastructure to do way more than just caching a Query#rewrite.
This patch (which is just a sketch to show what I have in mind) adds a
QueryRewriter class that walks the Query AST and rewrites each query node
in the tree. The default implementation does nothing special; it just forwards
to the query's rewrite method, but there seems to be a whole lot of potential
in such a tree-walker / visitor.  For instance, we could subclass it to optimize
certain queries if we fix the coord problem. Yet another use case is to decouple
the MTQ rewriter entirely from MTQ (not sure if we want that though), or somebody
might want to wrap a query during rewrite. A rough sketch of the idea follows.
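
Roughly, such a tree-walking rewriter could look like the following sketch
(illustrative only; the attached query-rewriter.patch may differ, and only
BooleanQuery is handled here):
{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Sketch of a tree-walking rewriter: recurse into BooleanQuery clauses and
// rewrite each node; subclasses could intercept nodes to cache, optimize or wrap.
public class QueryRewriter {
  public Query rewrite(Query query, IndexReader reader) throws IOException {
    if (query instanceof BooleanQuery) {
      BooleanQuery source = (BooleanQuery) query;
      BooleanQuery result = new BooleanQuery(source.isCoordDisabled());
      result.setBoost(source.getBoost());
      result.setMinimumNumberShouldMatch(source.getMinimumNumberShouldMatch());
      for (BooleanClause clause : source.clauses()) {
        // recurse into sub-queries, keeping each clause's occur flag
        result.add(rewrite(clause.getQuery(), reader), clause.getOccur());
      }
      return result;
    }
    return query.rewrite(reader);   // default: forward to the query's own rewrite()
  }
}
{code}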

Going even further, somebody could rewrite against the FieldCache? Maybe this can be even
more general and just be a QueryVisitor so folks can easily walk the tree.

I think this is really something that should be solved in general AND in a 
different issue.

simon

 It should be easy to make use of TermState; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright
 Attachments: query-rewriter.patch


 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermState addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically

2011-01-14 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981774#action_12981774
 ] 

Earwin Burrfoot commented on LUCENE-2868:
-

Here we use an intermediate query AST, with a number of walkers that do synonym
substitution, optimization, caching, rewriting for multiple fields, and finally
generate a tree of Lucene Queries.

I can share a generic reflection-based visitor that's somewhat more handy than
the default visitor pattern in Java.
Usage looks roughly like:
{code}
class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of walk result
  String visit(TermQuery q) {
    return "{term: " + q.getTerm() + "}";
  }

  String visit(BooleanQuery q) {
    StringBuffer buf = new StringBuffer();
    buf.append("{boolean: ");
    for (BooleanClause clause : q.clauses()) {
      buf.append(dispatch(clause.getQuery())).append(", "); // Here we recurse into the sub-query
    }
    buf.append("}");
    return buf.toString();
  }

  String visit(SpanQuery q) { // Runs for all SpanQueries
    ...
  }

  String visit(Query q) { // Runs for all Queries not covered by a more exact visit() method
    ...
  }
}

Query query = ...;
String stringRepresentation = new ToStringWalker().dispatch(query);
{code}

dispatch() checks its parameter's runtime type, picks the closest matching visit()
overload (according to Java's rules for compile-time overloaded method
resolution), and invokes it.
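
A rough sketch of what such a reflection-based dispatcher might look like
(hypothetical code, and a simplified approximation of the overload selection; not
the actual implementation offered here):
{code}
import java.lang.reflect.Method;

// Hypothetical sketch: pick the most specific applicable visit(...) overload for
// the argument's runtime type and invoke it. Error handling is illustrative only.
public abstract class DispatchingVisitor<R> {

  @SuppressWarnings("unchecked")
  public R dispatch(Object node) {
    Method best = null;
    for (Method m : getClass().getDeclaredMethods()) {
      if (!m.getName().equals("visit") || m.getParameterTypes().length != 1) {
        continue;
      }
      Class<?> param = m.getParameterTypes()[0];
      if (!param.isInstance(node)) {
        continue;                                   // this overload doesn't apply to the node
      }
      // keep the most specific applicable overload seen so far
      if (best == null || best.getParameterTypes()[0].isAssignableFrom(param)) {
        best = m;
      }
    }
    if (best == null) {
      throw new IllegalArgumentException("no visit() overload for " + node.getClass());
    }
    try {
      best.setAccessible(true);                     // visit() methods may be package-private
      return (R) best.invoke(this, node);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
{code}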

 It should be easy to make use of TermState; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright
 Attachments: query-rewriter.patch


 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermState addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Updated: (LUCENE-2723) Speed up Lucene's low level bulk postings read API

2011-01-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2723:


Attachment: LUCENE-2723.patch

Here is a fix for the nocommit Robert put into TermQuery. All tests pass, I
will commit in a bit.

 Speed up Lucene's low level bulk postings read API
 --

 Key: LUCENE-2723
 URL: https://issues.apache.org/jira/browse/LUCENE-2723
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2723-termscorer.patch, 
 LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, 
 LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, 
 LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, 
 LUCENE-2723_facetPerSeg.patch, LUCENE-2723_facetPerSeg.patch, 
 LUCENE-2723_openEnum.patch, LUCENE-2723_termscorer.patch, 
 LUCENE-2723_wastedint.patch


 Spinoff from LUCENE-1410.
 The flex DocsEnum has a simple bulk-read API that reads the next chunk
 of docs/freqs.  But it's a poor fit for intblock codecs like FOR/PFOR
 (from LUCENE-1410).  This is not unlike sucking coffee through those
 tiny plastic coffee stirrers they hand out on airplanes that,
 surprisingly, also happen to function as a straw.
 As a result we see no perf gain from using FOR/PFOR.
 I had hacked up a fix for this, described in my blog post at
 http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
 I'm opening this issue to get that work to a committable point.
 So... I've worked out a new bulk-read API to address the performance
 bottleneck.  It has some big changes over the current bulk-read API:
   * You can now also bulk-read positions (but not payloads), but, I
  have yet to cutover positional queries.
   * The buffer contains doc deltas, not absolute values, for docIDs
 and positions (freqs are absolute).
   * Deleted docs are not filtered out.
  * The doc & freq buffers need not be aligned.  For fixed intblock
 codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16,
 Group varint, etc.) they won't be.
 It's still a work in progress...




[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981778#action_12981778
 ] 

Simon Willnauer commented on LUCENE-2868:
-

bq. I can share a generic reflection-based visitor that's somewhat more handy 
than default visitor pattern in java.
Earwin - I think we should make a new issue and get something like that 
implemented in there which is more general than what I just sketched out. If 
you could share your code that would be awesome!

 It should be easy to make use of TermState; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright
 Attachments: query-rewriter.patch


 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermState addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().




[jira] Created: (LUCENE-2869) remove Query.getSimilarity()

2011-01-14 Thread Robert Muir (JIRA)
remove Query.getSimilarity()


 Key: LUCENE-2869
 URL: https://issues.apache.org/jira/browse/LUCENE-2869
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir


Spinoff of LUCENE-2854.

See LUCENE-2828 and LUCENE-2854 for reference.

In general, the SimilarityDelegator was problematic with regards to 
back-compat, and if queries
want to score differently, trying to runtime subclass Similarity only causes 
trouble.

The reason we could not fix this in LUCENE-2854 is because:
{noformat}
Michael McCandless added a comment - 08/Jan/11 01:53 PM
bq. Is it possible to remove this method Query.getSimilarity also? I don't 
understand why we need this method!

I would love to! But I think that's for another day...

I looked into this and got stuck with BoostingQuery, which rewrites to an anon 
subclass of BQ overriding its getSimilarity in turn override its coord method. 
Rather twisted... if we can do this differently I think we could remove 
Query.getSimilarity.
{noformat}

here is the method in question:

{noformat}
/** Expert: Returns the Similarity implementation to be used for this query.
 * Subclasses may override this method to specify their own Similarity
 * implementation, perhaps one that delegates through that of the Searcher.
 * By default the Searcher's Similarity implementation is returned.*/
public Similarity getSimilarity(IndexSearcher searcher) {
  return searcher.getSimilarity();
}
{noformat}
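
For context, a rough sketch of the pattern being removed (not the actual contrib
source): BoostingQuery rewrites to an anonymous BooleanQuery whose getSimilarity()
swaps in a Similarity just to change coord():

{code}
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;

// Rough sketch of the "twisted" pattern (illustrative, not the contrib source):
// an anonymous BooleanQuery subclass overrides getSimilarity() solely to return
// a Similarity with a different coord().
public class CoordOverrideSketch {
  public static BooleanQuery demoteFullCoord(final float demotionBoost) {
    return new BooleanQuery() {
      @Override
      public Similarity getSimilarity(IndexSearcher searcher) {
        return new DefaultSimilarity() {
          @Override
          public float coord(int overlap, int maxOverlap) {
            // documents matching every clause get demoted instead of rewarded
            return overlap == maxOverlap ? demotionBoost : 1.0f;
          }
        };
      }
    };
  }
}
{code}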





[jira] Updated: (LUCENE-2869) remove Query.getSimilarity()

2011-01-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2869:


Attachment: LUCENE-2869.patch

Here's a patch.

To fix the BoostingQuery in contrib, it now overrides BooleanWeight.
(Also, a test that instantiates BooleanScorer with a null weight had to be
fixed.)


 remove Query.getSimilarity()
 

 Key: LUCENE-2869
 URL: https://issues.apache.org/jira/browse/LUCENE-2869
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
 Attachments: LUCENE-2869.patch


 Spinoff of LUCENE-2854.
 See LUCENE-2828 and LUCENE-2854 for reference.
 In general, the SimilarityDelegator was problematic with regards to 
 back-compat, and if queries
 want to score differently, trying to runtime subclass Similarity only causes 
 trouble.
 The reason we could not fix this in LUCENE-2854 is because:
 {noformat}
 Michael McCandless added a comment - 08/Jan/11 01:53 PM
 bq. Is it possible to remove this method Query.getSimilarity also? I don't 
 understand why we need this method!
 I would love to! But I think that's for another day...
 I looked into this and got stuck with BoostingQuery, which rewrites to an 
 anon 
 subclass of BQ overriding its getSimilarity in turn override its coord 
 method. 
 Rather twisted... if we can do this differently I think we could remove 
 Query.getSimilarity.
 {noformat}
 here is the method in question:
 {noformat}
 /** Expert: Returns the Similarity implementation to be used for this query.
  * Subclasses may override this method to specify their own Similarity
  * implementation, perhaps one that delegates through that of the Searcher.
  * By default the Searcher's Similarity implementation is returned.*/
 public Similarity getSimilarity(IndexSearcher searcher) {
   return searcher.getSimilarity();
 }
 {noformat}




[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

2011-01-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981803#action_12981803
 ] 

Jason Rutherglen commented on LUCENE-2701:
--

I agree that there should not be a default for the max merge segment size for
optimize, though it's good to have the option.

 Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
 

 Key: LUCENE-2701
 URL: https://issues.apache.org/jira/browse/LUCENE-2701
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch


 LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken 
 into consideration in regular merges, yet ignored by findMergesForOptimze. I 
 think it'd be good if we take that into consideration even when optimizing. 
 This will allow the caller to specify two constraints: maxNumSegments and 
 maxMergeMB. Obviously both may not be satisfied, and therefore we will 
 guarantee that if there is any segment above the threshold, the threshold 
 constraint takes precedence and therefore you may end up w/ maxNumSegments 
 (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into 
 consideration.
 As part of this change, I plan to change some methods to protected (from 
 private) and members as well. I realized that if one wishes to implement his 
 own LMP extension, he needs to either put it under o.a.l.index or copy some 
 code over to his impl.
 I'll attach a patch shortly.




[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

2011-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981813#action_12981813
 ] 

Shai Erera commented on LUCENE-2701:


I don't think we need a useDefaultMaxMergeMb. Instead, we can default the
member to Long.MAX_VALUE. That way, if no one sets it, all segments will be
considered for merge, and if one wants, he can set it.

I expect that if I use IW with a LMP that sets maxMergeMB, then even if I call 
optimize() this setting will take effect.

BTW, I don't remember introducing this default as part of this issue. This issue
only changed LMP to take the already-existing setting into account. So maybe
reverting this default should be handled within the issue it was changed in?

 Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
 

 Key: LUCENE-2701
 URL: https://issues.apache.org/jira/browse/LUCENE-2701
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch


 LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken 
 into consideration in regular merges, yet ignored by findMergesForOptimze. I 
 think it'd be good if we take that into consideration even when optimizing. 
 This will allow the caller to specify two constraints: maxNumSegments and 
 maxMergeMB. Obviously both may not be satisfied, and therefore we will 
 guarantee that if there is any segment above the threshold, the threshold 
 constraint takes precedence and therefore you may end up w/ maxNumSegments 
 (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into 
 consideration.
 As part of this change, I plan to change some methods to protected (from 
 private) and members as well. I realized that if one wishes to implement his 
 own LMP extension, he needs to either put it under o.a.l.index or copy some 
 code over to his impl.
 I'll attach a patch shortly.




[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

2011-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981817#action_12981817
 ] 

Simon Willnauer commented on LUCENE-2701:
-

bq. BTW, I don't remember introducing this default as part of this issue. This
issue only changed LMP to take the already-existing setting into account. So
maybe reverting this default should be handled within the issue it was changed
in?
True, this was done in LUCENE-2773 - but this issue seemed more related?!
bq. I don't think we need a useDefaultMaxMergeMb. Instead, we can default the
member to Long.MAX_VALUE. That way, if no one sets it, all segments will be
considered for merge, and if one wants, he can set it.

I think Mike did that on purpose to prevent large segments from merging during
indexing, so what is wrong with disabling that limit during optimize?

 Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
 

 Key: LUCENE-2701
 URL: https://issues.apache.org/jira/browse/LUCENE-2701
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch


 LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken 
 into consideration in regular merges, yet ignored by findMergesForOptimze. I 
 think it'd be good if we take that into consideration even when optimizing. 
 This will allow the caller to specify two constraints: maxNumSegments and 
 maxMergeMB. Obviously both may not be satisfied, and therefore we will 
 guarantee that if there is any segment above the threshold, the threshold 
 constraint takes precedence and therefore you may end up w/ maxNumSegments 
 (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into 
 consideration.
 As part of this change, I plan to change some methods to protected (from 
 private) and members as well. I realized that if one wishes to implement his 
 own LMP extension, he needs to either put it under o.a.l.index or copy some 
 code over to his impl.
 I'll attach a patch shortly.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981827#action_12981827
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

I'm taking a guess here; however, the
ThreadAffinityDocumentsWriterThreadPool.getAndLock method looks a little
suspicious, as we're iterating over the ThreadStates and calling put on a
non-concurrent HashMap while not holding a lock?
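
An illustrative-only sketch of the hazard being described (not the actual
ThreadAffinityDocumentsWriterThreadPool code): concurrent iteration plus an
unsynchronized put() on a plain HashMap is unsafe:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: iterating a plain HashMap while other threads call
// put() without a lock can corrupt the map or throw ConcurrentModificationException.
public class UnsynchronizedMapHazard {
  private final Map<Thread, Object> states = new HashMap<Thread, Object>();

  Object getAndLock(Thread requester) {
    for (Map.Entry<Thread, Object> e : states.entrySet()) {   // unguarded iteration...
      if (e.getKey() == requester) {
        return e.getValue();
      }
    }
    Object state = new Object();
    states.put(requester, state);   // ...racing with this unguarded put() is unsafe
    return state;
  }
}
{code}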

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, 
 test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981830#action_12981830
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Also, multiple threads can call DocumentsWriterPerThread.addDocument, and that's 
resulting in this:

{code}[junit] java.lang.AssertionError: omitTermFreqAndPositions:false 
postings.docFreqs[termID]:0
[junit] at 
org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:143)
[junit] at 
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:234)
[junit] at 
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:91)
[junit] at 
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:274)
[junit] at 
org.apache.lucene.index.DocumentsWriterPerThread.addDocument(DocumentsWriterPerThread.java:184)
[junit] at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:374)
[junit] at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1403)
[junit] at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1375)
{code}

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, 
 test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981832#action_12981832
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. we're iterating over ThreadStates and calling put on a non-concurrent 
hashmap while not holding a lock?

The threadBindings hashmap is a ConcurrentHashMap, and 
getActivePerThreadsIterator() is threadsafe, I believe.
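
For readers following along, a purely hypothetical illustration of the kind of lookup being discussed (the class, field, and method names below are illustrative only and are not taken from the actual ThreadAffinityDocumentsWriterThreadPool code):

{code}
import java.util.concurrent.ConcurrentHashMap;

public class ThreadAffinityExample {
  static class ThreadState { /* would wrap a DocumentsWriterPerThread */ }

  private final ConcurrentHashMap<Thread, ThreadState> threadBindings =
      new ConcurrentHashMap<Thread, ThreadState>();

  ThreadState stateForCurrentThread() {
    Thread t = Thread.currentThread();
    ThreadState state = threadBindings.get(t);
    if (state == null) {
      ThreadState fresh = new ThreadState();
      // putIfAbsent is atomic, so concurrent callers never need an external lock
      ThreadState raced = threadBindings.putIfAbsent(t, fresh);
      state = (raced != null) ? raced : fresh;
    }
    return state;
  }
}
{code}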

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, 
 test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

2011-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981836#action_12981836
 ] 

Michael McCandless commented on LUCENE-2701:


bq. I think Mike did that on purpose to prevent large segments from merging during 
indexing.

Right -- large merges are really quite nasty -- they mess up searches and NRT 
turnaround, IW.close() suddenly takes like an hour, etc.

But, really the best fix, which I'd love to do at some point, is to fix our 
merge policy so that insanely large merges are done w/ fewer segments (eg only 
2 segments at once).  I think BalancedMP does this.


 Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
 

 Key: LUCENE-2701
 URL: https://issues.apache.org/jira/browse/LUCENE-2701
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch


 LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken 
 into consideration in regular merges, yet ignored by findMergesForOptimize. I 
 think it'd be good if we take that into consideration even when optimizing. 
 This will allow the caller to specify two constraints: maxNumSegments and 
 maxMergeMB. Obviously both may not be satisfied, and therefore we will 
 guarantee that if there is any segment above the threshold, the threshold 
 constraint takes precedence and therefore you may end up w/ maxNumSegments 
 (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into 
 consideration.
 As part of this change, I plan to change some methods to protected (from 
 private) and members as well. I realized that if one wishes to implement his 
 own LMP extension, he needs to either put it under o.a.l.index or copy some 
 code over to his impl.
 I'll attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Release schedule Lucene 4?

2011-01-14 Thread Gregor Heinrich

Dear Lucene team,

I am wondering whether there is an updated Lucene release schedule for the v4.0 
stream.


Any earliest/latest alpha/beta/stable date? And if not yet, where to track such 
info?


Thanks in advance from Germany

gregor

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2870) if a segment is 100% deletions, we should just drop it

2011-01-14 Thread Michael McCandless (JIRA)
if a segment is 100% deletions, we should just drop it
--

 Key: LUCENE-2870
 URL: https://issues.apache.org/jira/browse/LUCENE-2870
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
 Fix For: 3.1, 4.0


I think in IndexWriter if the delCount ever == maxDoc() for a segment we should 
just drop it?

We don't, today, and so we force it to be merged, which is silly.

I won't have time for this any time soon so if someone wants to take it, please 
do!!  Should be simple.
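
A minimal sketch of the check being proposed (illustrative only; the real change would live inside IndexWriter at the point where new deletes are flushed):

{code}
public class DropFullyDeletedSegments {
  // A segment whose delete count equals its doc count holds no live documents,
  // so it could simply be dropped instead of being merged away later.
  static boolean isFullyDeleted(int delCount, int maxDoc) {
    return maxDoc > 0 && delCount == maxDoc;
  }
}
{code}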

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981839#action_12981839
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

bq. The threadBindings hashmap is a ConcurrentHashMap, and 
getActivePerThreadsIterator() is threadsafe, I believe.

Sorry, yes, CHM is used and it all looks thread safe, but there must be multiple 
threads accessing a single DWPT at the same time for some of these errors to be 
occurring.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, 
 test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs

2011-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981843#action_12981843
 ] 

Michael McCandless commented on LUCENE-2666:


Can you run CheckIndex on this index and post the result?  And, enable 
assertions.

And if possible turn on IndexWriter's infoStream and capture/post the output 
leading up to the corruption.

Many updates during indexing are just fine... and I don't know whether rolling back to 
older Lucene releases will help (until we've isolated the issue).  But: maybe 
try rolling forward to 3.0.3?  It's possible you're hitting a bug fixed in 
3.0.3 (though this doesn't ring a bell for me).
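
A minimal sketch of the diagnostics being asked for, against the 3.0-era API (run the JVM with -ea so assertions are enabled; on the writer side, IndexWriter.setInfoStream(System.out) captures the indexing log leading up to the corruption):

{code}
import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class DiagnoseIndex {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File(args[0])); // path to the suspect index
    CheckIndex checker = new CheckIndex(dir);
    checker.setInfoStream(System.out);        // print per-segment details as they are checked
    CheckIndex.Status status = checker.checkIndex();
    System.out.println("clean? " + status.clean);
    dir.close();
  }
}
{code}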

 ArrayIndexOutOfBoundsException when iterating over TermDocs
 ---

 Key: LUCENE-2666
 URL: https://issues.apache.org/jira/browse/LUCENE-2666
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.2
Reporter: Shay Banon

 A user got this very strange exception, and I managed to get the index that 
 it happens on. Basically, iterating over the TermDocs causes an AIOOB 
 exception. I easily reproduced it using the FieldCache, which does exactly 
 that (the field in question is indexed as numeric). Here is the exception:
 Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
   at 
 org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
   at 
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
   at 
 org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
   at TestMe.main(TestMe.java:56)
 It happens on the following segment: _26t docCount: 914 delCount: 1 
 delFileName: _26t_1.del
 And as you can see, it smells like a corner case (it fails for document 
 number 912, the AIOOB happens from the deleted docs). The code to recreate it 
 is simple:
 FSDirectory dir = FSDirectory.open(new File(index));
 IndexReader reader = IndexReader.open(dir, true);
 IndexReader[] subReaders = reader.getSequentialSubReaders();
 for (IndexReader subReader : subReaders) {
   Field field = subReader.getClass().getSuperclass().getDeclaredField("si");
   field.setAccessible(true);
   SegmentInfo si = (SegmentInfo) field.get(subReader);
   System.out.println("-- " + si);
   if (si.getDocStoreSegment().contains("_26t")) {
     // this is the problematic one...
     System.out.println("problematic one...");
     FieldCache.DEFAULT.getLongs(subReader, "__documentdate",
         FieldCache.NUMERIC_UTILS_LONG_PARSER);
   }
 }
 Here is the result of a check index on that segment:
   8 of 10: name=_26t docCount=914
 compound=true
 hasProx=true
 numFiles=2
 size (MB)=1.641
 diagnostics = {optimize=false, mergeFactor=10, 
 os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, 
 lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, 
 os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
 has deletions [delFileName=_26t_1.del]
 test: open reader.OK [1 deleted docs]
 test: fields..OK [32 fields]
 test: field norms.OK [32 fields]
 test: terms, freq, prox...ERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
   at 
 org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
   at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
   at TestMe.main(TestMe.java:47)
 test: stored fields...ERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
   at 
 org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
   at TestMe.main(TestMe.java:47)
 test: term vectorsERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 

[jira] Resolved: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2011-01-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1821.
---

Resolution: Fixed
  Assignee: Simon Willnauer

This is resolved by adding AtomicReaderContext in 4.0 (LUCENE-2831).

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per-segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment).
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer, because the scorer is not passed the offset needed to calculate the 
 real docid.
 Suggest having the Weight.scorer() method also take an integer for the doc offset.
 The abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset.
 All Weights that have sub-weights must pass this offset down to created 
 sub-weights.
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of 
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-2439) Composite readers (Multi/DirIndexReader) should not subclass IndexReader

2011-01-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed LUCENE-2439.
-

Resolution: Duplicate

Duplicate of LUCENE-2858.

 Composite readers (Multi/DirIndexReader) should not subclass IndexReader
 

 Key: LUCENE-2439
 URL: https://issues.apache.org/jira/browse/LUCENE-2439
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
 Fix For: 4.0


 I'd like to change Multi/DirIndexReader so that they no longer implement the 
 low level methods of IndexReader, and instead act more like an ordered 
 collection of sub readers.  I think to do this we'd need a new interface, 
 common to atomic readers (SegmentReader) and the composite readers, which 
 IndexSearcher would accept.
 We should also require that the core Query scorers always receive an atomic 
 reader.
 We've taken strong initial steps here with flex, by forcing users to use 
 separate MultiFields static methods to obtain Fields/Terms/etc. from a 
 composite reader.  This issue is to finish this cutover.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2010) Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.

2011-01-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2010:
--

Fix Version/s: 4.0
   3.1

 Remove segments with all documents deleted in commit/flush/close of 
 IndexWriter instead of waiting until a merge occurs.
 

 Key: LUCENE-2010
 URL: https://issues.apache.org/jira/browse/LUCENE-2010
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Uwe Schindler
 Fix For: 3.1, 4.0


 I do not know if this is a bug in 2.9.0, but it seems that segments with all 
 documents deleted are not automatically removed:
 {noformat}
 4 of 14: name=_dlo docCount=5
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=0.059
   diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 
 2009-09-21 10:25:09, os=SunOS,
  os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, 
 source=flush}
   has deletions [delFileName=_dlo_1.del]
   test: open reader.OK [5 deleted docs]
   test: fields..OK [136 fields]
   test: field norms.OK [136 fields]
   test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
   test: stored fields...OK [0 total field count; avg ? fields per doc]
   test: term vectorsOK [0 total vector count; avg ? term/freq vector 
 fields per doc]
 {noformat}
 Shouldn't such segments be removed automatically during the next 
 commit/close of IndexWriter?
 *Mike McCandless:*
 Lucene doesn't actually short-circuit this case, ie, if every single doc in a 
 given segment has been deleted, it will still merge it [away] like normal, 
 rather than simply dropping it immediately from the index, which I agree 
 would be a simple optimization. Can you open a new issue? I would think IW 
 can drop such a segment immediately (ie not wait for a merge or optimize) on 
 flushing new deletes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-2870) if a segment is 100% deletions, we should just drop it

2011-01-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed LUCENE-2870.
-

Resolution: Duplicate

Duplicate of LUCENE-2010.

 if a segment is 100% deletions, we should just drop it
 --

 Key: LUCENE-2870
 URL: https://issues.apache.org/jira/browse/LUCENE-2870
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
 Fix For: 3.1, 4.0


 I think in IndexWriter if the delCount ever == maxDoc() for a segment we 
 should just drop it?
 We don't, today, and so we force it to be merged, which is silly.
 I won't have time for this any time soon so if someone wants to take it, 
 please do!!  Should be simple.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Created: (LUCENE-2863) Updating a document loses its fields that are only indexed; also, NumericField trie tokens are completely lost

2011-01-14 Thread Erick Erickson
This is behaving as intended, if I'm reading this correctly. Lucene has
never fetched fields that aren't stored, and that's what you're
asking it to do. To see why, consider indexing but not storing
a normal text field with, say, stop word removal and stemming. The
*only* data kept in the index is the analyzed data, so even if you
did reconstruct the field (no easy task, BTW), you'd have something that
was not the original text and would be pretty unsatisfactory.
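
To make that concrete, a minimal sketch against the 3.0-era API (field names below are made up for illustration): only stored fields come back when you fetch a Document, so an indexed-only field simply is not there to copy into an updated document.

{code}
import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class StoredVsIndexedOnly {
  public static void main(String[] args) throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("body", "hello world", Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir, true);
    Document fetched = searcher.doc(0);
    System.out.println(fetched.get("id"));   // "1" -- stored, so it comes back
    System.out.println(fetched.get("body")); // null -- indexed only, never stored
    searcher.close();
  }
}
{code}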

Kudos for providing the test case by the way, that makes figuring out
what the answer is much easier...

If this makes sense, could you close the JIRA? If not we can hash
it out a bit more...

Best
Erick

On Wed, Jan 12, 2011 at 2:12 PM, Tamas Sandor (JIRA) j...@apache.orgwrote:

 Updating a document loses its fields that are only indexed; also,
 NumericField trie tokens are completely lost

 ---

 Key: LUCENE-2863
 URL: https://issues.apache.org/jira/browse/LUCENE-2863
 Project: Lucene - Java
  Issue Type: Bug
  Components: Store
Affects Versions: 3.0.3, 3.0.2
 Environment: WindowsXP, Java1.6.20 using a RamDirectory
Reporter: Tamas Sandor


 I have a code snippet (see below) which creates a new document with
 standard (stored, indexed), *not-stored, indexed-only* and some
 *NumericField* fields. Then it updates the document by adding a new string field.
 The result is that all the fields that are not stored but indexed-only, and
 especially the NumericField trie tokens, are completely lost from the index after
 an update or delete/add.
 {code:java}
 Directory ramDir = new RAMDirectory();
 IndexWriter writer = new IndexWriter(ramDir, new WhitespaceAnalyzer(),
     MaxFieldLength.UNLIMITED);
 Document doc = new Document();
 doc.add(new Field("ID", "HO1234", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
 doc.add(new Field("PATTERN", "HELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS));
 doc.add(new NumericField("LAT", Store.YES, true).setDoubleValue(51.48826603066d));
 doc.add(new NumericField("LNG", Store.YES, true).setDoubleValue(-0.08913399651646614d));
 writer.addDocument(doc);
 doc = new Document();
 doc.add(new Field("ID", "HO", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
 doc.add(new Field("PATTERN", "BELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS));
 doc.add(new NumericField("LAT", Store.YES, true).setDoubleValue(101.48826603066d));
 doc.add(new NumericField("LNG", Store.YES, true).setDoubleValue(-100.08913399651646614d));
 writer.addDocument(doc);

 Term t = new Term("ID", "HO1234");
 Query q = new TermQuery(t);
 IndexSearcher searcher = new IndexSearcher(writer.getReader());
 TopDocs hits = searcher.search(q, 1);
 if (hits.scoreDocs.length > 0) {
   Document ndoc = searcher.doc(hits.scoreDocs[0].doc);
   ndoc.add(new Field("FINAL", "FINAL", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
   writer.updateDocument(t, ndoc);
 //  writer.deleteDocuments(q);
 //  writer.addDocument(ndoc);
 } else {
   LOG.info("Couldn't find the document via the query");
 }

 searcher = new IndexSearcher(writer.getReader());
 hits = searcher.search(new TermQuery(new Term("PATTERN", "HELLO")), 1);
 LOG.info("_hits HELLO: " + hits.totalHits); // should be 1 but it's 0

 writer.close();
 {code}

 And I have a bounding-box query based on *NumericRangeQuery*. After the
 document update it doesn't return any hits.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981895#action_12981895
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Also, why are we always (well, likely) assigning the DWPT to a different thread 
state if tryLock returns false?  If there's a lot of contention (eg, far more 
incoming threads than DWPTs), then won't the thread assignment code become a 
hotspot?

In ThreadAffinityDocumentsWriterThreadPool.clearThreadBindings(ThreadState 
perThread) we're actually clearing the entire map.  When this is called in 
IW.flush (which is unsynced on IW), if there are multiple concurrent flushes, 
then perhaps a single DWPT is in use by multiple threads.  To safeguard against 
this, and perhaps to more easily add an assertion, maybe we should lock on the 
DWPT rather than the ThreadState?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, 
 test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1301) Solr + Hadoop

2011-01-14 Thread Alexander Kanarsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981930#action_12981930
 ] 

Alexander Kanarsky commented on SOLR-1301:
--

Note for the Hadoop 0.21 users: the current patch can be used as is with 
0.21, but you will need to make sure to compile it with appropriate jars 
(hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of 
hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant 
jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e.  
apache-solr-hadoop-1.4.x-dev.jar) to avoid 
InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 
0.20.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: Next

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
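
 As a rough illustration of the converter role described above (the real SolrDocumentConverter interface ships with the patch; the class and method below are only a hypothetical sketch of turning a Hadoop (key, value) pair into a SolrInputDocument):

{code}
import org.apache.solr.common.SolrInputDocument;

public class CsvLineConverter {
  // Hypothetical: key is a record id, value is one CSV line emitted by the reducer.
  public SolrInputDocument convert(String key, String csvLine) {
    String[] cols = csvLine.split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key);
    doc.addField("title", cols.length > 0 ? cols[0] : "");
    doc.addField("body", cols.length > 1 ? cols[1] : "");
    return doc;
  }
}
{code}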

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-1301) Solr + Hadoop

2011-01-14 Thread Alexander Kanarsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981930#action_12981930
 ] 

Alexander Kanarsky edited comment on SOLR-1301 at 1/14/11 4:27 PM:
---

Note for the Hadoop 0.21 users: the current patch can be used as is with 
0.21, but you will need to make sure to compile it with appropriate jars 
(hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of 
hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant 
jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e.  
apache-solr-hadoop-xxx-dev.jar) to avoid 
InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 
0.20.

  was (Author: kanarsky):
Note for the Hadoop 0.21 users: the current patch can be used as is with 
0.21, but you will need to make sure to compile it with appropriate jars 
(hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of 
hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant 
jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e.  
apache-solr-hadoop-1.4.x-dev.jar) to avoid 
InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 
0.20.
  
 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: Next

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-01-14 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981967#action_12981967
 ] 

Steven Rowe commented on LUCENE-2611:
-

bq. And perhaps the copyright setup should be set up for ASL.

bq. I've used the copyright plugin a lot and its a great way to ensure that the 
ASL is added to any new files. Might be useful to add it to reduce the hassle 
for new contributors.

Committed IntelliJ IDEA Copyright Plugin configuration for the Apache Software 
License: trunk rev. 1059199, branch_3x rev. 1059200.

 IntelliJ IDEA and Eclipse setup
 ---

 Key: LUCENE-2611
 URL: https://issues.apache.org/jira/browse/LUCENE-2611
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Build
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2611-branch-3x-part2.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test_2.patch


 Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
 The attached patches add a new top level directory {{dev-tools/}} with 
 sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
 as well as top-level ant targets named idea and eclipse that copy these 
 files into the proper locations.  This arrangement avoids the messiness 
 attendant to in-place project configuration files directly checked into 
 source control.
 The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
 Solr contrib, and each analysis module.  A JUnit run configuration per module 
 is included.
 The Eclipse configuration includes a source entry for each 
 source/test/resource location and classpath setup: a library entry for each 
 jar.
 For IDEA, once {{ant idea}} has been run, the only configuration that must be 
 performed manually is configuring the project-level JDK.  For Eclipse, once 
 {{ant eclipse}} has been run, the user has to refresh the project 
 (right-click on the project and choose Refresh).
 If these patches are committed, Subversion svn:ignore properties should be 
 added/modified to ignore the destination IDEA and Eclipse configuration 
 locations.
 Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
 for applying the 3.X branch patch for IDEA: 
 http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-3.x - Build # 242 - Failure

2011-01-14 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/242/

All tests passed

Build Log (for compile errors):
[...truncated 21064 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (SOLR-975) admin-extra.html not correctly displayed when using multicore configuration

2011-01-14 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-975.
---

   Resolution: Fixed
Fix Version/s: 4.0
 Assignee: Yonik Seeley

Thanks for verifying Edward

 admin-extra.html not correctly displayed when using multicore configuration
 -

 Key: SOLR-975
 URL: https://issues.apache.org/jira/browse/SOLR-975
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.4
 Environment: Jetty openjdk 1.6.0 1.0.b12 (EPEL package for EL5)
Reporter: Edward Rudd
Assignee: Yonik Seeley
 Fix For: 4.0


 I'm having cross-talk issues when using the Solr nightlies (and probably with the 
 1.3.0 release, but I have not tested it, as I needed newer features of the 
 DataImportHandler in the nightlies).
 The basic scenario for this bug is as follows:
 I have two cores configured and BOTH have a customized admin-extra.html; 
 however, the admin pages use the SAME admin-extra.html for all 
 cores. The one used is whichever core is browsed first. This looks like 
 a caching bug where the cache is not taking the core into account.
 Basically, my admin-extra.html has a link to the data importer script and a 
 link to reload the core (which has to have the core name explicitly in the 
 per-core admin-extra.html).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2315) analysis.jsp highlight matches no longer works

2011-01-14 Thread Hoss Man (JIRA)
analysis.jsp highlight matches no longer works


 Key: SOLR-2315
 URL: https://issues.apache.org/jira/browse/SOLR-2315
 Project: Solr
  Issue Type: Bug
  Components: web gui
Reporter: Hoss Man
 Fix For: 3.1, 4.0


As noted by Teruhiko Kurosaka on the mailing list, at some point since Solr 
1.4, highlight matches stopped working on analysis.jsp -- on both the 3x 
and trunk branches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982028#action_12982028
 ] 

Shai Erera commented on LUCENE-1540:


Patch looks good !

Can you make TrecContentSource.read() public and not package-private? That way 
people can use it outside benchmark's package as well, supporting 
other/newer/older TREC formats.

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Attachments: LUCENE-1540.patch


 The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
 are quite limited and do not support some of the variations in format of 
 older TREC collections.  
 I have been doing some benchmarking work with Lucene and have had to modify 
 the package to support:
 * Older TREC document formats, which the current parser fails on due to 
 missing document headers.
 * Variations in query format - newlines after title tag causing the query 
 parser to get confused.
 * Ability to detect and read in uncompressed text collections
 * Storage of document numbers by default without storing full text.
 I can submit a patch if there is interest, although I will probably want to 
 write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Solr-3.x - Build # 228 - Failure

2011-01-14 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-3.x/228/

All tests passed

Build Log (for compile errors):
[...truncated 20279 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 3783 - Failure

2011-01-14 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3783/

1 tests failed.
REGRESSION:  org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration

Error Message:
null

Stack Trace:
junit.framework.AssertionFailedError: 
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1127)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1059)
at 
org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:227)




Build Log (for compile errors):
[...truncated 8229 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org