Contributing code
All,

Now that we've moved past the proposal stage and defined our Initial Committers list, I'd like to address how to be a Contributor to Lucene.Net.

Some quick things to note upfront about roles. Previously I made a point of distinguishing between Contributors and Committers at ASF. This was meant to help motivated individuals decide what level of commitment they wanted to make to the project. I did not intend to suggest that there is a special status of being a Contributor. I listed those who had come forward offering support in the proposal mostly to show that the community around the project was vital, with a lot of motivated individuals. I hope that this wasn't interpreted as implying a special status for those people, or implying that others, not on that list, could not be contributors. There is no special status of Contributor that someone must gain prior to submitting code. Anyone can write and submit code patches at any time. As soon as you have done that, you are a Contributor.

All code contributions to ASF projects follow the same pattern. First, a JIRA issue is created for the patch, with a description of the change and with the patch file attached to it. A project Committer will find the issue, review the patch, and commit it to SVN (or reject the patch and provide an explanation).

Here's a quick guideline to the process for committing code to Lucene.Net.

Step-by-Step Example

Suppose I have downloaded the source code and made a change to 'HelloWorld.cs', and suppose I'm using TortoiseSVN.

STEP 1: Make a patch file

From TortoiseSVN, right click on the changed file(s) and select 'Create Patch' from the 'TortoiseSVN' context menu. Save it as 'HelloWorld.cs.patch'.
STEP 2: Create a JIRA issue

Lucene.Net's JIRA issue tracker is located here: https://issues.apache.org/jira/browse/LUCENENET

If you don't have an account in JIRA, you can sign up easily (click 'Login' in the upper right, and from that screen click 'Sign Up'). Once you're logged in to JIRA, you can create a new issue in the issue tracker. For code patches, use issue type 'Improvement' or 'Bug'. Please describe the patch you made with enough information that someone else can understand both the code and the reasons why you wrote it.

STEP 3: Attach the patch file to the JIRA issue

After creating the issue, attach the 'HelloWorld.cs.patch' file to it. For large patches, you may want to compress the source code into a zip file.

STEP 4: A Committer will apply or reject the patch

A Committer will find the new issue, review the patch, and either commit it to SVN or reject it with an explanation. This often involves a discussion in the comments for the issue. Please remain engaged with the conversation to ensure the completion of the issue; perhaps only a small change needs to be made to the patch in order for it to be accepted. An example of an issue that follows this process is here: http://issues.apache.org/jira/browse/LUCENENET-331

I'd like to see a description of this process made available on the project web page. I think this is a point of confusion for a lot of would-be contributors.

Thanks,
Troy
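The patch file described in the steps above is a plain unified diff, the same format produced by `svn diff` on the command line for those not using the TortoiseSVN GUI. A minimal illustration of that format using Python's difflib (the file contents here are hypothetical, just to show what a reviewer sees in the attached .patch):

```python
import difflib

# Hypothetical before/after contents of HelloWorld.cs
old = ["class HelloWorld {\n", "    // TODO\n", "}\n"]
new = ["class HelloWorld {\n", "    // patched\n", "}\n"]

# Produce a unified diff, the same format 'svn diff' and
# TortoiseSVN's 'Create Patch' emit
patch = "".join(difflib.unified_diff(old, new,
                                     fromfile="HelloWorld.cs",
                                     tofile="HelloWorld.cs"))
print(patch)
```

Lines prefixed with `-` are removed and lines prefixed with `+` are added, which is what makes a patch reviewable in isolation before a Committer applies it.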
Small change in one of the sample files, i.e., samples/mansearch.py
Hello,

I have just installed pylucene and tested some of the sample scripts. In samples/mansearch.py, line 68 should be

    parser = QueryParser(Version.LUCENE_CURRENT, keywords, StandardAnalyzer(Version.LUCENE_CURRENT))

rather than

    parser = QueryParser(keywords, StandardAnalyzer(Version.LUCENE_CURRENT))

Maybe you could update that. Many thanks.

Jean-Luc
Re: Small change in one of the sample files, i.e., samples/mansearch.py
On Fri, 14 Jan 2011, Jean Luc Truchtersheim wrote:

> I have just installed pylucene and tested some of the sample scripts. In
> samples/mansearch.py, line 68 should be
>
>     parser = QueryParser(Version.LUCENE_CURRENT, keywords, StandardAnalyzer(Version.LUCENE_CURRENT))
>
> rather than
>
>     parser = QueryParser(keywords, StandardAnalyzer(Version.LUCENE_CURRENT))
>
> Maybe you could update that.

Fixed in rev 1059118 of pylucene_2_9 branch.
Fixed in rev 1059131 of pylucene_3_0 branch.
Fixed in rev 1059134 of branch_3_x branch.

Thanks !

Andi..
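The underlying cause of the fix above is that newer QueryParser constructors take a Version constant as their first argument, so the old two-argument call no longer lines up. A toy sketch of why the call breaks (these classes are stand-ins for illustration, not the real pylucene API):

```python
# Toy stand-ins for the pylucene classes; NOT the real API.
class StandardAnalyzer:
    def __init__(self, version):
        self.version = version

class QueryParser:
    # Newer-style signature: a Version constant comes first.
    def __init__(self, version, field, analyzer):
        self.version, self.field, self.analyzer = version, field, analyzer

LUCENE_CURRENT = "LUCENE_CURRENT"

# Old-style call: missing the leading version argument, so the
# analyzer lands in the wrong slot and Python raises a TypeError.
try:
    QueryParser("keywords", StandardAnalyzer(LUCENE_CURRENT))
except TypeError:
    broken = True

# Corrected call, matching the fix suggested for mansearch.py line 68:
parser = QueryParser(LUCENE_CURRENT, "keywords",
                     StandardAnalyzer(LUCENE_CURRENT))
```

The same shape of breakage applies to any positional-argument API that grows a new leading parameter.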
[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default
[ https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981674#action_12981674 ]

Simon Willnauer commented on LUCENE-2773:
-----------------------------------------

bq. So for 3.x/trunk (which already take deletions into account by default), I'll switch maxMergeMB default to 2 GB. I think this is an OK default given that it means your biggest segments will range from 2GB - 20GB.

Mike, this also means that an optimize will have no effect if all segments are > 2GB with this as default? It seems kind of odd to me, ey?

Don't create compound file for large segments by default

Key: LUCENE-2773
URL: https://issues.apache.org/jira/browse/LUCENE-2773
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9.4, 3.0.3, 3.1, 4.0
Attachments: LUCENE-2773.patch

Spinoff from LUCENE-2762. CFS is useful for keeping the open file count down. But, it costs some added time during indexing to build, and also ties up temporary disk space, causing e.g. a large spike on the final merge of an optimize. Since MergePolicy dictates which segments should be CFS, we can change it to only build CFS for smallish merges. I think we should also set a maxMergeMB by default so that very large merges aren't done.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981682#action_12981682 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

{quote}
When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.
{quote}

First of all, I get nervous when it comes to stuff like this! That said, I can see where this could be useful: for instance, if you have one and the same FuzzyQuery / RegexpQuery, which has a rather large setup cost, in more than one clause of a BooleanQuery, then this would absolutely help. For other queries like TermQuery, the TermState cache in TermsEnum already helps you a lot, so for those this wouldn't make a big difference.

bq. Query rewriteUsingCache(IndexReader indexReader)

I think one major issue here is how you would clear such a cache. WeakReferences would work, but I wouldn't want to put any cache into any query. In general we shouldn't make any query heavyweight or somewhat stateful at all. Yet, if we passed a RewriteCache into Query#rewrite(IR, RC) that has a per-IS#search lifetime, this could actually work. This would also be easy to implement: Query#rewrite(IR, RC) would just forward to Query#rewrite(IR) by default, and combining queries (BooleanQuery) could override the new one. Eventually, MultiTermQuery can provide such an impl and check the cache to see if it needs to rewrite itself or can return an already rewritten version.
It should be easy to make use of TermCache; rewritten queries should be shared automatically

Key: LUCENE-2868
URL: https://issues.apache.org/jira/browse/LUCENE-2868
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Karl Wright

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981685#action_12981685 ]

Karl Wright commented on LUCENE-2868:
-------------------------------------

Fine by me if you have a better way of doing it! Who would create the RewriteCache object? The IndexSearcher?
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981686#action_12981686 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

bq. Who would create the RewriteCache object? The IndexSearcher?

it could.. or it could just be an overloaded IS.search method
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermCache; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981706#action_12981706 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

Actually, I think we need to clarify the description of this issue. This has nothing to do with TermCache at all. It actually reads very scary, though, since caches are really tricky, and this one is mainly about rewrite cost in MTQ. That said, adding a method to Query just for the sake of MTQ rewrite seems kind of odd. We should rather optimize the query structure somehow instead of caching a rewrite method.
[jira] Resolved: (LUCENE-2864) add maxtf to fieldinvertstate
[ https://issues.apache.org/jira/browse/LUCENE-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2864.
---------------------------------

Resolution: Fixed
Assignee: Robert Muir

Committed revision 1058939, 1058944 (3x)

add maxtf to fieldinvertstate
-----------------------------

Key: LUCENE-2864
URL: https://issues.apache.org/jira/browse/LUCENE-2864
Project: Lucene - Java
Issue Type: New Feature
Components: Query/Scoring
Reporter: Robert Muir
Assignee: Robert Muir
Fix For: 3.1, 4.0
Attachments: LUCENE-2864.patch

the maximum within-document TF is a very useful scoring value, we should expose it so that people can use it in scoring

consider the following sim:

{code}
@Override
public float idf(int docFreq, int numDocs) {
  return 1.0F; /* not used */
}

@Override
public float computeNorm(String field, FieldInvertState state) {
  return state.getBoost() / (float) Math.sqrt(state.getMaxTF());
}
{code}

which is surprisingly effective, but more interesting for practical reasons.
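The computeNorm in the sim above boils down to boost / sqrt(maxTF). A re-statement of just that arithmetic in plain Python, with toy values (this is only the formula, not Lucene code):

```python
import math

def compute_norm(boost, max_tf):
    """Norm from the sim above: field boost divided by the square root
    of the maximum within-document term frequency."""
    return boost / math.sqrt(max_tf)

# A document whose most frequent term occurs 4 times, field boost 1.0:
print(compute_norm(1.0, 4))   # 0.5
```

Because maxTF grows with repeated terms, the norm damps documents dominated by a single term, which is one way to read why the sim is "surprisingly effective".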
CorruptIndexException when indexing
hi all,

we have confronted this problem 3 times when testing. The exception stack is:

Exception in thread "Lucene Merge Thread #2"
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: docs out of order (7286 <= 7286 )
	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:355)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:319)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (7286 <= 7286 )
	at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:75)
	at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:880)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:818)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:756)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:187)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5354)
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4937)

Or:

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: 330
	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:355)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:319)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 330
	at org.apache.lucene.util.BitVector.get(BitVector.java:102)
	at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:238)
	at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:168)
	at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
	at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:870)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:818)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:756)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:187)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5354)

We did some minor modification based on Lucene 2.9.1 and Solr 1.4.0: we modified the frq file to store 4 bytes for the positions where the term occurred in these documents (accessing full positions in the prx file is too time consuming to meet our needs). I can't tell whether it's our bug or Lucene's own bug. I searched the mailing list and found the mail "problem during index merge" posted on 2010-10-21. It's similar to our case. It seems the docList in the frq file is wrongly stored. When merging, when it's decoded, the wrong docID may be larger than maxDoc (the BitVector deletedDocs), which causes the second exception; or the docID delta is less than 0 (it reads wrongly), which causes the first exception. We are still continuing testing, turning off our modification and opening infoStream in solrconfig.xml.

We found a strange phenomenon. When we test, it sometimes hits exceptions, but in our production environment it never hits any. The hardware and software environments are the same. We checked carefully and the only difference is this line in solrconfig.xml:

    <ramBufferSizeMB>32</ramBufferSizeMB>    in the testing environment
    <ramBufferSizeMB>256</ramBufferSizeMB>   in the production environment

The number of indexed documents for each machine is also roughly the same: 10M+ documents. I can't be sure the indices in the production env are correct, because even if some terms' docLists are wrong, if the doc deltas are > 0 and there are no deleted documents, it will not hit the two exceptions. We don't find anything strange in the search results in the production env.

Could a too-small ramBufferSizeMB result in index corruption?
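The two failure modes described above (a doc delta decoding to <= 0, or a decoded docID landing past maxDoc) can be sketched generically. This is a toy delta decoder for illustration only, not Lucene's actual frq-reading code, but it raises on the same two conditions as the stack traces:

```python
def decode_doc_ids(deltas, max_doc):
    """Toy decoder for a delta-encoded doc list (the .frq file stores
    gaps between docIDs). A gap <= 0 reproduces the 'docs out of
    order' check; a decoded docID >= max_doc is what would trip the
    deleted-docs BitVector bounds check during a merge."""
    doc_ids, doc = [], -1
    for delta in deltas:
        if delta <= 0:
            raise ValueError("docs out of order (%d <= %d )" % (doc + delta, doc))
        doc += delta
        if doc >= max_doc:
            raise IndexError("docID %d out of bounds (maxDoc=%d)" % (doc, max_doc))
        doc_ids.append(doc)
    return doc_ids

# A well-formed list decodes cleanly:
print(decode_doc_ids([1, 3, 2], max_doc=10))   # [0, 3, 5]
```

A corrupted gap of 0 or a gap that overshoots maxDoc would raise, mirroring the CorruptIndexException and ArrayIndexOutOfBoundsException respectively.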
[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default
[ https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981726#action_12981726 ]

Michael McCandless commented on LUCENE-2773:
--------------------------------------------

bq. Mike, this also means that an optimize will have no effect if all segments are > 2GB with this as default? It seems kind of odd to me, ey?

There was a separate issue for this -- LUCENE-2701. I agree it's debatable... and it's not clear which way we should default it.
[jira] Commented: (LUCENE-2773) Don't create compound file for large segments by default
[ https://issues.apache.org/jira/browse/LUCENE-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981729#action_12981729 ]

Simon Willnauer commented on LUCENE-2773:
-----------------------------------------

bq. There was a separate issue for this - LUCENE-2701.

I think we should reopen and fix this. I expect optimize to have single-segment semantics if I call optimize(), as the javadocs state. However we do that :)
[jira] Reopened: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reopened LUCENE-2701:
-------------------------------------

This change, together with LUCENE-2773, caused a change in the semantics of IW#optimize() and friends. IW#optimize() says:

{code}
/**
 * Requests an optimize operation on an index, priming the index
 * for the fastest available search. Traditionally this has meant
 * merging all segments into a single segment as is done in the
 * default merge policy, but individual merge policies may implement
 * optimize in different ways.
 */
{code}

Which is not entirely true anymore, since the default now uses:

{code}
/** Default maximum segment size. A segment of this size
 * or larger will never be merged. @see setMaxMergeMB */
public static final double DEFAULT_MAX_MERGE_MB = 2048;
{code}

This is not what I would expect from optimize(), even if it were documented that way. A plain optimize call should by default result in a single segment, IMO. Yet, we could make this settable by a flag in LogMergePolicy, maybe something like LogMergePolicy#useMaxMergeSizeForOptimize = false as a default?

Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Key: LUCENE-2701
URL: https://issues.apache.org/jira/browse/LUCENE-2701
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1, 4.0
Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch

LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken into consideration in regular merges, yet ignored by findMergesForOptimize. I think it'd be good if we took that into consideration even when optimizing. This will allow the caller to specify two constraints: maxNumSegments and maxMergeMB.
Obviously both may not be satisfied, and therefore we will guarantee that if there is any segment above the threshold, the threshold constraint takes precedence, and therefore you may end up with more than maxNumSegments segments (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into consideration. As part of this change, I plan to change some methods to protected (from private), and some members as well. I realized that if one wishes to implement his own LMP extension, he needs to either put it under o.a.l.index or copy some code over to his impl. I'll attach a patch shortly.
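Whether optimize still collapses everything to one segment under the proposed flag comes down to how maxMergeMB filters the candidate set. A toy sketch of that selection logic (illustrative only, not LogMergePolicy's real implementation; the flag name mirrors Simon's suggested useMaxMergeSizeForOptimize):

```python
def segments_to_merge(segment_sizes_mb, max_merge_mb, use_max_for_optimize):
    """Toy version of the proposed behaviour: with the flag off,
    optimize considers every segment (classic single-segment
    semantics); with it on, segments at or above max_merge_mb are
    left alone, so optimize may leave more than one segment."""
    if not use_max_for_optimize:
        return list(segment_sizes_mb)
    return [s for s in segment_sizes_mb if s < max_merge_mb]

sizes = [100, 500, 3000]                       # MB; 3000 exceeds the 2048 default
print(segments_to_merge(sizes, 2048, True))    # [100, 500]
print(segments_to_merge(sizes, 2048, False))   # [100, 500, 3000]
```

With the flag on, the 3000 MB segment survives optimize untouched, which is exactly the surprise the reopen describes.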
[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated LUCENE-2868:
--------------------------------

Description:

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().

was:

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java:

    Query rewriteUsingCache(IndexReader indexReader)

... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
Summary: It should be easy to make use of TermState; rewritten queries should be shared automatically
(was: It should be easy to make use of TermCache; rewritten queries should be shared automatically)
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981746#action_12981746 ]

Karl Wright commented on LUCENE-2868:
-------------------------------------

I reworded the description. I think the word "cache" is correct, but what we really need is simply a cache that has the lifetime of a top-level rewrite. I agree that putting the data in the query object itself would not have this characteristic, but on the other hand a second Query method that is cache-aware seems reasonable. For example:

    Query rewriteMinimal(RewriteCache rc, IndexReader ir)

... where RewriteCache is an object that has a lifetime consistent with the highest-level rewrite operation done on the query graph. The rewriteMinimal() method would look for the rewrite of the current query in the RewriteCache and, if found, would return that; otherwise it would call plain old rewrite() and then save the result.

So the patch would include:
(a) the change as specified to Query.java
(b) an implementation of RewriteCache, which *could* just be simplified to Map<Query,Query>
(c) changes to the callers of rewrite(), so that the minimal rewrite is called instead.

Thoughts?
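The rewriteMinimal idea above is plain memoization over the query graph. A self-contained sketch with toy query classes (the RewriteCache here is just a dict, following the Map<Query,Query> simplification; none of this is the real Lucene API):

```python
class TermQuery:
    """Toy leaf query; rewriting is the identity, but we count calls
    to show how many real rewrites happen."""
    rewrites = 0

    def __init__(self, term):
        self.term = term

    def rewrite(self):
        TermQuery.rewrites += 1
        return self

    def rewrite_minimal(self, cache):
        # Look up the already-rewritten form first; fall back to
        # plain rewrite() and remember the result.
        if self not in cache:
            cache[self] = self.rewrite()
        return cache[self]

class BooleanQuery:
    """Toy combining query: pushes the cache down to every clause."""
    def __init__(self, clauses):
        self.clauses = clauses

    def rewrite_minimal(self, cache):
        self.clauses = [c.rewrite_minimal(cache) for c in self.clauses]
        return self

shared = TermQuery("lucene")
q = BooleanQuery([shared, shared, shared])  # same clause three times
q.rewrite_minimal({})                       # cache lives for one top-level rewrite
print(TermQuery.rewrites)                   # 1, not 3: the rewrite was shared
```

Because the dict is created per top-level call, it naturally has the "lifetime of a top-level rewrite" Karl asks for, with nothing stateful stored on the query objects themselves.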
How to submit code?
Hi

I started looking into Lucene, as I might need it on a project. As there was no GermanAnalyzer in the .NET version, I ported the code that was available in the Java version to .NET. As I am new to the open-source world, I do not know exactly how I need to proceed to get this piece of code included. Send it to a contributor? Thanks for any advice.

Regards
Jörg Lang
[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981762#action_12981762 ]

Robert Muir commented on LUCENE-2723:
-------------------------------------

Ok, we are caught up to trunk... but we need to integrate getBulkPostingsEnum with TermState to fix the nocommits in TermQuery. This should also finally allow us to fix the cost of that extra per-segment docFreq.

Speed up Lucene's low level bulk postings read API
--------------------------------------------------

Key: LUCENE-2723
URL: https://issues.apache.org/jira/browse/LUCENE-2723
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0
Attachments: LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch

Spinoff from LUCENE-1410. The flex DocsEnum has a simple bulk-read API that reads the next chunk of docs/freqs. But it's a poor fit for intblock codecs like FOR/PFOR (from LUCENE-1410). This is not unlike sucking coffee through those tiny plastic coffee stirrers they hand out on airplanes that, surprisingly, also happen to function as a straw. As a result we see no perf gain from using FOR/PFOR. I had hacked up a fix for this, described in my blog post at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html and I'm opening this issue to get that work to a committable point.

So... I've worked out a new bulk-read API to address this performance bottleneck. It has some big changes over the current bulk-read API:

* You can now also bulk-read positions (but not payloads), but I have yet to cut over positional queries.
* The buffer contains doc deltas, not absolute values, for docIDs and positions (freqs are absolute).
* Deleted docs are not filtered out. * The doc freq buffers need not be aligned. For fixed intblock codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16, Group varint, etc.) they won't be. It's still a work in progress... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
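The delta-buffer point above ("the buffer contains doc deltas, not absolute values") comes down to a prefix sum on the consumer side. A minimal, self-contained sketch of that decoding step follows; the class and method names are invented for illustration and are not part of the DocsEnum API:

```java
import java.util.Arrays;

// Sketch of delta decoding as used by the proposed bulk-read API:
// the codec fills a buffer with document-ID deltas, and the consumer
// reconstructs absolute docIDs with a running (prefix) sum.
public class DeltaDecode {
    // Turn a buffer of doc deltas into absolute docIDs.
    public static int[] decode(int[] deltas) {
        int[] docIDs = new int[deltas.length];
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];      // each delta is relative to the previous doc
            docIDs[i] = doc;
        }
        return docIDs;
    }

    public static void main(String[] args) {
        // docIDs 5, 8, 10, 17 encoded as deltas 5, 3, 2, 7
        int[] docs = decode(new int[]{5, 3, 2, 7});
        System.out.println(Arrays.toString(docs)); // [5, 8, 10, 17]
    }
}
```

Small deltas are the whole point: they pack into far fewer bits than absolute docIDs under FOR/PFOR or varint encodings.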
[jira] Updated: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2868:
------------------------------------

Attachment: query-rewriter.patch

I just sketched out what I have in mind that could solve this problem and create the infrastructure to do way more than just caching a query#rewrite. This patch (which is just a sketch to show what I have in mind) adds a QueryRewriter class that walks the Query AST and rewrites each query node in the tree. The default implementation does nothing special; it just forwards to the query's rewrite() method, but there seems to be a whole lot of potential in such a tree-walker / visitor. For instance, we could subclass it to optimize certain queries if we fix the coord problem. Yet another use case is to decouple the MTQ rewriter entirely from MTQ (not sure if we want that though), or somebody might want to wrap a query during rewrite. Going even further, somebody could rewrite against the field cache? Maybe this can be made more general still and just be a QueryVisitor so folks can easily walk the tree. I think this is really something that should be solved in general AND in a different issue.

simon

It should be easy to make use of TermState; rewritten queries should be shared automatically
--------------------------------------------------------------------------------------------
Key: LUCENE-2868
URL: https://issues.apache.org/jira/browse/LUCENE-2868
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Karl Wright
Attachments: query-rewriter.patch

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.

This can be most readily (and powerfully) done by introducing a new method to Query.java:

Query rewriteUsingCache(IndexReader indexReader) ...

and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().
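As a rough illustration of the caching idea, a memoizing wrapper might look like the sketch below. The classes are hypothetical stand-ins: the real rewriteUsingCache would take an IndexReader and live on Lucene's own Query class. An identity map is used because the point is to share the rewrite of the *same* query node appearing twice in the tree:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Minimal sketch (hypothetical classes, not the real Lucene API) of the
// rewriteUsingCache idea: memoize each node's rewritten form so that a
// query instance shared across the tree is rewritten only once.
public class RewriteCache {
    public static abstract class Query {
        public abstract Query rewrite(); // stands in for Query.rewrite(IndexReader)
    }

    public static class ExpensiveQuery extends Query {
        public static int rewriteCalls = 0;          // count real rewrites for the demo
        public Query rewrite() { rewriteCalls++; return this; }
    }

    private final Map<Query, Query> cache = new IdentityHashMap<>();

    // Return the cached rewritten query, performing the rewrite on first sight only.
    public Query rewriteUsingCache(Query q) {
        return cache.computeIfAbsent(q, Query::rewrite);
    }

    public static void main(String[] args) {
        RewriteCache rc = new RewriteCache();
        Query shared = new ExpensiveQuery();
        rc.rewriteUsingCache(shared);
        rc.rewriteUsingCache(shared);   // same node appears twice in the tree
        System.out.println(ExpensiveQuery.rewriteCalls); // 1
    }
}
```

A real implementation would also need to scope the cache to a single reader, since a rewrite is only valid against the reader it was computed for.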
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981774#action_12981774 ]

Earwin Burrfoot commented on LUCENE-2868:
-----------------------------------------

We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than the default visitor pattern in Java. Usage looks roughly like:

{code}
class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of the walk result
  String visit(TermQuery q) {
    return "{term: " + q.getTerm() + "}";
  }

  String visit(BooleanQuery q) {
    StringBuffer buf = new StringBuffer();
    buf.append("{boolean: ");
    for (BooleanQuery.Clause clause : q.clauses()) {
      buf.append(dispatch(clause.getQuery())).append(", "); // Here we recurse
    }
    buf.append("}");
    return buf.toString();
  }

  String visit(SpanQuery q) {
    // Runs for all SpanQueries ...
  }

  String visit(Query q) {
    // Runs for all Queries not covered by a more exact visit() method ...
  }
}

Query query = ...;
String stringRepresentation = new ToStringWalker().dispatch(query);
{code}

dispatch() checks its parameter's runtime type, picks the closest visit() overload (according to the Java rules for compile-time overloaded method resolution), and invokes it.
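For readers unfamiliar with the trick, the dispatch() behavior described above can be sketched in plain Java: walk the argument's runtime class up the hierarchy until a visit() overload declared for that exact type exists. ReflectiveDispatch and its nested query classes below are invented stand-ins for illustration, not Earwin's actual code:

```java
import java.lang.reflect.Method;

// Sketch of reflection-based single-argument dispatch: find the visit()
// overload for the argument's runtime type, falling back to supertypes.
public class ReflectiveDispatch {
    // Tiny stand-in hierarchy for the Query classes.
    public static class Query {}
    public static class TermQuery extends Query {}
    public static class FuzzyQuery extends TermQuery {} // no visit() of its own

    public String dispatch(Query q) {
        // Try the exact runtime type first, then each supertype in turn.
        for (Class<?> c = q.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method m = getClass().getMethod("visit", c);
                return (String) m.invoke(this, q);
            } catch (NoSuchMethodException e) {
                // no overload declared for this type; try the supertype next
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        throw new IllegalArgumentException("no visit() overload found");
    }

    public String visit(TermQuery q) { return "term"; }
    public String visit(Query q)     { return "query"; }

    public static void main(String[] args) {
        ReflectiveDispatch v = new ReflectiveDispatch();
        System.out.println(v.dispatch(new FuzzyQuery())); // term (nearest overload)
        System.out.println(v.dispatch(new Query()));      // query
    }
}
```

Note one difference from compile-time overload resolution: this walks only the superclass chain, so interface-typed overloads would need extra handling in a production version.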
[jira] Updated: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2723:
------------------------------------

Attachment: LUCENE-2723.patch

Here is a fix for the nocommit Robert put into TermQuery. All tests pass; I will commit in a bit.
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981778#action_12981778 ]

Simon Willnauer commented on LUCENE-2868:
-----------------------------------------

bq. I can share a generic reflection-based visitor that's somewhat more handy than default visitor pattern in java.

Earwin - I think we should open a new issue and get something like that implemented there, which is more general than what I just sketched out. If you could share your code that would be awesome!
[jira] Created: (LUCENE-2869) remove Query.getSimilarity()
remove Query.getSimilarity()
----------------------------
Key: LUCENE-2869
URL: https://issues.apache.org/jira/browse/LUCENE-2869
Project: Lucene - Java
Issue Type: Task
Reporter: Robert Muir

Spinoff of LUCENE-2854. See LUCENE-2828 and LUCENE-2854 for reference.

In general, the SimilarityDelegator was problematic with regards to back-compat, and if queries want to score differently, trying to runtime-subclass Similarity only causes trouble. The reason we could not fix this in LUCENE-2854 is because:

{noformat}
Michael McCandless added a comment - 08/Jan/11 01:53 PM

bq. Is it possible to remove this method Query.getSimilarity also? I don't understand why we need this method!

I would love to! But I think that's for another day...

I looked into this and got stuck with BoostingQuery, which rewrites to an anon subclass of BQ overriding its getSimilarity to in turn override its coord method. Rather twisted... if we can do this differently I think we could remove Query.getSimilarity.
{noformat}

Here is the method in question:

{noformat}
/** Expert: Returns the Similarity implementation to be used for this query.
 * Subclasses may override this method to specify their own Similarity
 * implementation, perhaps one that delegates through that of the Searcher.
 * By default the Searcher's Similarity implementation is returned. */
public Similarity getSimilarity(IndexSearcher searcher) {
  return searcher.getSimilarity();
}
{noformat}
[jira] Updated: (LUCENE-2869) remove Query.getSimilarity()
[ https://issues.apache.org/jira/browse/LUCENE-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2869:
--------------------------------

Attachment: LUCENE-2869.patch

Here's a patch. To fix the BoostingQuery in contrib, it overrides BooleanWeight. (Also a test that instantiates BooleanScorer with a null weight had to be fixed.)
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981803#action_12981803 ]

Jason Rutherglen commented on LUCENE-2701:
------------------------------------------

I agree that there should not be a default max merge segment size for optimize, though it's good to have the option.

Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
----------------------------------------------------------------
Key: LUCENE-2701
URL: https://issues.apache.org/jira/browse/LUCENE-2701
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1, 4.0
Attachments: LUCENE-2701.patch, LUCENE-2701.patch, LUCENE-2701.patch

LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken into consideration in regular merges, yet ignored by findMergesForOptimize. I think it'd be good if we take that into consideration even when optimizing. This will allow the caller to specify two constraints: maxNumSegments and maxMergeMB. Obviously both may not be satisfied, and therefore we will guarantee that if there is any segment above the threshold, the threshold constraint takes precedence; therefore you may end up w/ more than maxNumSegments (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into consideration.

As part of this change, I plan to change some methods (and members) from private to protected. I realized that if one wishes to implement his own LMP extension, he needs to either put it under o.a.l.index or copy some code over to his impl. I'll attach a patch shortly.
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981813#action_12981813 ]

Shai Erera commented on LUCENE-2701:
------------------------------------

I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VALUE. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it. I expect that if I use IW with an LMP that sets maxMergeMB, then even if I call optimize() this setting will take effect.

BTW, I don't remember introducing this default as part of this issue. This issue only changed LMP to take the already existing setting into account. So maybe reverting this default should be handled within the issue it was changed in?
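The sentinel-default idea above (Long.MAX_VALUE meaning "no limit, so all segments are considered") can be sketched as follows. The names are hypothetical, not LogMergePolicy's actual fields or methods:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names) of defaulting maxMergeSize to Long.MAX_VALUE:
// if the user never sets a limit, every segment is eligible for the optimize
// merge; setting a real limit excludes oversized segments.
public class MergeEligibility {
    public static long maxMergeSize = Long.MAX_VALUE; // "unset" sentinel: no limit

    // Keep only the segments small enough to be merged during optimize.
    public static List<Long> eligibleForOptimize(List<Long> segmentSizes) {
        List<Long> out = new ArrayList<>();
        for (long size : segmentSizes) {
            if (size <= maxMergeSize) out.add(size);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(10L, 5_000L, 9_000_000L);
        System.out.println(eligibleForOptimize(sizes).size()); // 3: no limit set
        maxMergeSize = 1_000_000L;                             // user-set cap
        System.out.println(eligibleForOptimize(sizes).size()); // 2: big segment excluded
    }
}
```

The sentinel avoids a separate useDefaultMaxMergeMb boolean: one field expresses both "no limit" and "this limit".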
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981817#action_12981817 ]

Simon Willnauer commented on LUCENE-2701:
-----------------------------------------

bq. BTW, I don't remember introducing this default as part of this issue. This issue only changed LMP to take the already existing setting into account. So maybe reverting this default should be handled within the issue it was changed in?

True, this was done in LUCENE-2773 - but this seemed to be more related?!

bq. I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VALUE. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it.

I think Mike did that on purpose to prevent large segs from merging during indexing, so what is wrong with disabling that limit during optimize?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981827#action_12981827 ]

Jason Rutherglen commented on LUCENE-2324:
------------------------------------------

I'm taking a guess here, however the ThreadAffinityDocumentsWriterThreadPool.getAndLock method looks a little suspicious, as we're iterating on ThreadStates and calling put on a non-concurrent hashmap while not in a lock?

Per thread DocumentsWriters that write their own private segments
-----------------------------------------------------------------
Key: LUCENE-2324
URL: https://issues.apache.org/jira/browse/LUCENE-2324
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Realtime Branch
Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, test.out

See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293:

Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981830#action_12981830 ]

Jason Rutherglen commented on LUCENE-2324:
------------------------------------------

Also, multiple threads can call DocumentsWriterPerThread.addDocument and that's resulting in this:

{code}
[junit] java.lang.AssertionError: omitTermFreqAndPositions:false postings.docFreqs[termID]:0
[junit]   at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:143)
[junit]   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:234)
[junit]   at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:91)
[junit]   at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:274)
[junit]   at org.apache.lucene.index.DocumentsWriterPerThread.addDocument(DocumentsWriterPerThread.java:184)
[junit]   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:374)
[junit]   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1403)
[junit]   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1375)
{code}
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981832#action_12981832 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

bq. as we're iterating on ThreadStates and on a non-concurrent hashmap calling put while not in a lock?

The threadBindings hashmap is a ConcurrentHashMap, and getActivePerThreadsIterator() is threadsafe, I believe.
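The lock-free binding pattern under discussion (a ConcurrentHashMap mapping threads to per-thread states) can be sketched as below. This illustrates the general idiom, not the actual ThreadAffinityDocumentsWriterThreadPool code:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch (hypothetical names) of thread-affinity binding: each thread claims
// a per-thread state without a global lock, relying on the atomicity of
// ConcurrentHashMap.putIfAbsent to resolve races.
public class ThreadAffinityPool {
    public static class ThreadState {} // stands in for a per-thread DocumentsWriter

    private final ConcurrentHashMap<Thread, ThreadState> threadBindings =
            new ConcurrentHashMap<>();

    // Return this thread's state, creating and binding one on first use.
    public ThreadState getAndLock() {
        Thread t = Thread.currentThread();
        ThreadState s = threadBindings.get(t);
        if (s == null) {
            ThreadState fresh = new ThreadState();
            // Atomic: if two threads race on the same key, exactly one wins.
            s = threadBindings.putIfAbsent(t, fresh);
            if (s == null) s = fresh;   // we won the race; use our fresh state
        }
        return s;
    }

    public static void main(String[] args) {
        ThreadAffinityPool pool = new ThreadAffinityPool();
        // The same thread always gets the same state back.
        System.out.println(pool.getAndLock() == pool.getAndLock()); // true
    }
}
```

Note this only makes the *binding* safe; as the follow-up comment observes, two threads still must not end up sharing one state concurrently, which is a separate invariant.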
[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981836#action_12981836 ]

Michael McCandless commented on LUCENE-2701:
--------------------------------------------

bq. I think mike did that on purpose to prevent large segs from merging during indexing.

Right - large merges are really quite nasty - they mess up searches, NRT turnaround, IW.close() suddenly and unexpectedly takes like an hour, etc. But really the best fix, which I'd love to do at some point, is to fix our merge policy so that insanely large merges are done w/ fewer segments (eg only 2 segments at once). I think BalancedMP does this.
Release schedule Lucene 4?
Dear Lucene team,

I am wondering whether there is an updated Lucene release schedule for the v4.0 stream. Any earliest/latest alpha/beta/stable dates? And if not yet, where can I track such info?

Thanks in advance from Germany,
gregor
[jira] Created: (LUCENE-2870) if a segment is 100% deletions, we should just drop it
if a segment is 100% deletions, we should just drop it
------------------------------------------------------
Key: LUCENE-2870
URL: https://issues.apache.org/jira/browse/LUCENE-2870
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Fix For: 3.1, 4.0

I think in IndexWriter, if the delCount ever == maxDoc() for a segment, we should just drop it? We don't, today, and so we force it to be merged, which is silly. I won't have time for this any time soon, so if someone wants to take it, please do!! Should be simple.
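The proposed check is simple enough to sketch. The Segment record below is a hypothetical stand-in, not the real SegmentInfo:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the LUCENE-2870 idea: a segment whose delCount equals its
// maxDoc holds no live documents, so it can be dropped outright instead
// of being merged away.
public class DropDeadSegments {
    public static class Segment {
        public final int maxDoc, delCount;
        public Segment(int maxDoc, int delCount) {
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }
    }

    // Keep only segments that still contain at least one live document.
    public static List<Segment> dropFullyDeleted(List<Segment> segments) {
        List<Segment> live = new ArrayList<>();
        for (Segment s : segments) {
            if (s.delCount < s.maxDoc) live.add(s);
        }
        return live;
    }

    public static void main(String[] args) {
        List<Segment> segs = new ArrayList<>();
        segs.add(new Segment(914, 1));   // mostly live: keep
        segs.add(new Segment(100, 100)); // 100% deleted: drop
        System.out.println(dropFullyDeleted(segs).size()); // 1
    }
}
```

Dropping the segment record is cheap; the savings come from never scheduling a merge whose only job is to erase already-dead documents.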
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981839#action_12981839 ]

Jason Rutherglen commented on LUCENE-2324:
------------------------------------------

bq. The threadBindings hashmap is a ConcurrentHashMap and the getActivePerThreadsIterator() is threadsafe I believe.

Sorry, yes, CHM is used and it all looks thread safe, but there must be multiple threads accessing a single DWPT at the same time for some of these errors to be occurring.
[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs
[ https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981843#action_12981843 ] Michael McCandless commented on LUCENE-2666: Can you run CheckIndex on this index and post the result? And, enable assertions. And if possible turn on IndexWriter's infoStream and capture/post the output leading up to the corruption. Many updates during indexing are just fine... and I don't know whether rolling back to older Lucene releases will help (until we've isolated the issue). But: maybe try rolling forward to 3.0.3? It's possible you're hitting a bug fixed in 3.0.3 (though this doesn't ring a bell for me). ArrayIndexOutOfBoundsException when iterating over TermDocs --- Key: LUCENE-2666 URL: https://issues.apache.org/jira/browse/LUCENE-2666 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.2 Reporter: Shay Banon A user got this very strange exception, and I managed to get the index that it happens on. Basically, iterating over the TermDocs causes an AIOOBE. I easily reproduced it using the FieldCache, which does exactly that (the field in question is indexed as numeric). Here is the exception:

{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
        at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
        at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
        at TestMe.main(TestMe.java:56)
{noformat}

It happens on the following segment: _26t docCount: 914 delCount: 1 delFileName: _26t_1.del And as you can see, it smells like a corner case (it fails for document number 912; the AIOOBE happens from the deleted docs).
The code to recreate it is simple:

{code:java}
FSDirectory dir = FSDirectory.open(new File("index"));
IndexReader reader = IndexReader.open(dir, true);
IndexReader[] subReaders = reader.getSequentialSubReaders();
for (IndexReader subReader : subReaders) {
  Field field = subReader.getClass().getSuperclass().getDeclaredField("si");
  field.setAccessible(true);
  SegmentInfo si = (SegmentInfo) field.get(subReader);
  System.out.println("-- " + si);
  if (si.getDocStoreSegment().contains("_26t")) {
    // this is the problematic one...
    System.out.println("problematic one...");
    FieldCache.DEFAULT.getLongs(subReader, "__documentdate", FieldCache.NUMERIC_UTILS_LONG_PARSER);
  }
}
{code}

Here is the result of a check index on that segment:

{noformat}
8 of 10: name=_26t docCount=914
  compound=true
  hasProx=true
  numFiles=2
  size (MB)=1.641
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_26t_1.del]
  test: open reader.........OK [1 deleted docs]
  test: fields..............OK [32 fields]
  test: field norms.........OK [32 fields]
  test: terms, freq, prox...ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
        at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)
  test: stored fields.......ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
        at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)
  test: term vectors........ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104) at
{noformat}
[jira] Resolved: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1821. --- Resolution: Fixed Assignee: Simon Willnauer This is resolved by adding AtomicReaderContext in 4.0 (LUCENE-2831). Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-1821.patch Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id of the documents it matches (only the relative doc offset into the segment). If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid. Suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub-weights must pass this offset down to created sub-weights. Details on workaround: In order to work around this, you must do the following: * Subclass IndexSearcher * Add an int getIndexReaderBase(IndexReader) method to your subclass * During Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass) * During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader) * The Scorer can now rebase any collected docids using this offset Example implementation of getIndexReaderBase():

{code}
// NOTE: a more efficient implementation can be done if you cache
// the result of gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getReader()) {
    return 0;
  } else {
    List readers = new ArrayList();
    gatherSubReaders(readers);
    Iterator iter = readers.iterator();
    int maxDoc = 0;
    while (iter.hasNext()) {
      IndexReader r = (IndexReader) iter.next();
      if (r == reader) {
        return maxDoc;
      }
      maxDoc += r.maxDoc();
    }
  }
  return -1; // reader not in searcher
}
{code}

Notes: * This workaround makes it so you cannot serialize your custom Weight implementation
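The rebasing arithmetic in the workaround above is just a running sum: each sub-reader's base is the total maxDoc() of all readers before it in the composite. A self-contained sketch of that computation over plain arrays (names here are hypothetical, not Lucene's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocBase {
    /**
     * Map each sub-reader (identified by name) to its starting docid in the
     * composite index: the sum of maxDoc of every reader that precedes it.
     */
    static Map<String, Integer> computeBases(String[] readerNames, int[] maxDocs) {
        Map<String, Integer> bases = new LinkedHashMap<>();
        int base = 0;
        for (int i = 0; i < readerNames.length; i++) {
            bases.put(readerNames[i], base);
            base += maxDocs[i]; // next segment's docids start after this one's
        }
        return bases;
    }
}
```

A per-segment docid is then rebased to a global one by adding the segment's base, which is exactly what the Scorer in the workaround does with the value returned from getIndexReaderBase().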
[jira] Closed: (LUCENE-2439) Composite readers (Multi/DirIndexReader) should not subclass IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-2439. - Resolution: Duplicate Duplicate of LUCENE-2858. Composite readers (Multi/DirIndexReader) should not subclass IndexReader Key: LUCENE-2439 URL: https://issues.apache.org/jira/browse/LUCENE-2439 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Fix For: 4.0 I'd like to change Multi/DirIndexReader so that they no longer implement the low level methods of IndexReader, and instead act more like an ordered collection of sub readers. I think to do this we'd need a new interface, common to atomic readers (SegmentReader) and the composite readers, which IndexSearcher would accept. We should also require that the core Query scorers always receive an atomic reader. We've taken strong initial steps here with flex, by forcing users to use separate MultiFields static methods to obtain Fields/Terms/etc. from a composite reader. This issue is to finish this cutover. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2010) Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.
[ https://issues.apache.org/jira/browse/LUCENE-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2010: -- Fix Version/s: 4.0 3.1 Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs. Key: LUCENE-2010 URL: https://issues.apache.org/jira/browse/LUCENE-2010 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Uwe Schindler Fix For: 3.1, 4.0 I do not know if this is a bug in 2.9.0, but it seems that segments with all documents deleted are not automatically removed:

{noformat}
4 of 14: name=_dlo docCount=5
  compound=true
  hasProx=true
  numFiles=2
  size (MB)=0.059
  diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, os=SunOS, os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, source=flush}
  has deletions [delFileName=_dlo_1.del]
  test: open reader.........OK [5 deleted docs]
  test: fields..............OK [136 fields]
  test: field norms.........OK [136 fields]
  test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
  test: stored fields.......OK [0 total field count; avg ? fields per doc]
  test: term vectors........OK [0 total vector count; avg ? term/freq vector fields per doc]
{noformat}

Shouldn't such segments be removed automatically during the next commit/close of IndexWriter? *Mike McCandless:* Lucene doesn't actually short-circuit this case, ie, if every single doc in a given segment has been deleted, it will still merge it [away] like normal, rather than simply dropping it immediately from the index, which I agree would be a simple optimization. Can you open a new issue? I would think IW can drop such a segment immediately (ie not wait for a merge or optimize) on flushing new deletes.
[jira] Closed: (LUCENE-2870) if a segment is 100% deletions, we should just drop it
[ https://issues.apache.org/jira/browse/LUCENE-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler closed LUCENE-2870. - Resolution: Duplicate Duplicate of LUCENE-2010. if a segment is 100% deletions, we should just drop it -- Key: LUCENE-2870 URL: https://issues.apache.org/jira/browse/LUCENE-2870 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Fix For: 3.1, 4.0 I think in IndexWriter if the delCount ever == maxDoc() for a segment we should just drop it? We don't, today, and so we force it to be merged, which is silly. I won't have time for this any time soon so if someone wants to take it, please do!! Should be simple.
Re: [jira] Created: (LUCENE-2863) Updating a document loses its fields that are only indexed, also NumericField tries are completely lost
This is behaving as intended if I'm reading this correctly. Lucene has never fetched fields that aren't stored, and that's what you're asking it to do. To see why, consider indexing but not storing a normal text field with, say, stop word removal and stemming. The *only* data kept in the index is the analyzed data, so even if you did reconstruct the field (no easy task, BTW), you'd have something that was not the original text and would be pretty unsatisfactory. Kudos for providing the test case by the way, that makes figuring out what the answer is much easier... If this makes sense, could you close the JIRA? If not, we can hash it out a bit more... Best Erick On Wed, Jan 12, 2011 at 2:12 PM, Tamas Sandor (JIRA) j...@apache.org wrote: Updating a document loses its fields that are only indexed, also NumericField tries are completely lost --- Key: LUCENE-2863 URL: https://issues.apache.org/jira/browse/LUCENE-2863 Project: Lucene - Java Issue Type: Bug Components: Store Affects Versions: 3.0.3, 3.0.2 Environment: WindowsXP, Java 1.6.20, using a RAMDirectory Reporter: Tamas Sandor I have a code snippet (see below) which creates a new document with standard (stored, indexed), *not-stored, indexed-only* and some *NumericField* fields. Then it updates the document by adding a new string field. The result is that all fields that are not stored but indexed-only, and especially the NumericField trie tokens, are completely lost from the index after an update or delete/add.
{code:java}
Directory ramDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(ramDir, new WhitespaceAnalyzer(), MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("ID", "HO1234", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("PATTERN", "HELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new NumericField("LAT", Store.YES, true).setDoubleValue(51.48826603066d));
doc.add(new NumericField("LNG", Store.YES, true).setDoubleValue(-0.08913399651646614d));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("ID", "HO", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("PATTERN", "BELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS));
doc.add(new NumericField("LAT", Store.YES, true).setDoubleValue(101.48826603066d));
doc.add(new NumericField("LNG", Store.YES, true).setDoubleValue(-100.08913399651646614d));
writer.addDocument(doc);
Term t = new Term("ID", "HO1234");
Query q = new TermQuery(t);
IndexSearcher searcher = new IndexSearcher(writer.getReader());
TopDocs hits = searcher.search(q, 1);
if (hits.scoreDocs.length > 0) {
  Document ndoc = searcher.doc(hits.scoreDocs[0].doc);
  ndoc.add(new Field("FINAL", "FINAL", Store.YES, Index.NOT_ANALYZED_NO_NORMS));
  writer.updateDocument(t, ndoc);
  // writer.deleteDocuments(q);
  // writer.addDocument(ndoc);
} else {
  LOG.info("Couldn't find the document via the query");
}
searcher = new IndexSearcher(writer.getReader());
hits = searcher.search(new TermQuery(new Term("PATTERN", "HELLO")), 1);
LOG.info("_hits HELLO: " + hits.totalHits); // should be 1 but it's 0
writer.close();
{code}

And I have a bounding-box query based on *NumericRangeQuery*. After the document update it doesn't return any hit.
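Erick's explanation above can be reduced to a toy model: a retrieved Document contains only its STORED fields, so re-indexing that retrieved copy silently drops every indexed-only field (including NumericField trie terms). The following sketch uses purely illustrative types, not Lucene's:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StoredVsIndexed {
    /** Toy document: stored fields are retrievable, indexed-only fields are not. */
    static final class Doc {
        final Map<String, String> stored = new HashMap<>(); // survives retrieval
        final Set<String> indexedOnly = new HashSet<>();    // searchable, not retrievable
    }

    /** Simulates IndexSearcher.doc(): only stored fields come back. */
    static Doc retrieve(Doc original) {
        Doc copy = new Doc();
        copy.stored.putAll(original.stored);
        // indexed-only fields are NOT copied: no stored form of them exists
        return copy;
    }
}
```

Updating via retrieve-then-updateDocument, as in the snippet above, therefore re-indexes a document that never had the un-stored fields to begin with; the original source data must be kept outside the index to rebuild them.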
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981895#action_12981895 ] Jason Rutherglen commented on LUCENE-2324: -- Also, why are we always (well, likely) assigning the DWPT to a different thread state if tryLock returns false? If there's a lot of contention (eg, far more incoming threads than DWPTs), then won't the thread assignment code become a hotspot? In ThreadAffinityDocumentsWriterThreadPool.clearThreadBindings(ThreadState perThread) we're actually clearing the entire map. When this is called in IW.flush (which is unsynced on IW), if there are multiple concurrent flushes, then perhaps a single DWPT is in use by multiple threads. To safeguard against this, and perhaps more easily add an assertion, maybe we should lock on the DWPT rather than the ThreadState? Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes).
The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981930#action_12981930 ] Alexander Kanarsky commented on SOLR-1301: -- Note for the Hadoop 0.21 users: the current patch can be used as is with 0.21, but you will need to make sure to compile it with appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e. apache-solr-hadoop-1.4.x-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: Next Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. 
Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
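The batching behavior described in the design above (SolrRecordWriter accumulating documents and periodically submitting them to the embedded server, with a final flush and commit on close) can be sketched in isolation. All names here are illustrative stand-ins, not the actual SOLR-1301 classes:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchWriter {
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    // Stand-in for the EmbeddedSolrServer: records each submitted batch.
    final List<List<String>> submitted = new ArrayList<>();

    BatchWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Called once per reduce-task (key, value) pair, analogous to write(). */
    void write(String doc) {
        batch.add(doc);
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    /** Submit the pending batch to the server and clear the buffer. */
    void flush() {
        if (!batch.isEmpty()) {
            submitted.add(new ArrayList<>(batch));
            batch.clear();
        }
    }

    /** On close the final partial batch is submitted, mirroring commit()/optimize(). */
    void close() {
        flush();
    }
}
```

Buffering like this is what lets the real patch avoid a round-trip per document while still guaranteeing, via the close-time flush, that no documents are lost when the reduce task finishes.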
[jira] Issue Comment Edited: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981930#action_12981930 ] Alexander Kanarsky edited comment on SOLR-1301 at 1/14/11 4:27 PM: --- Note for the Hadoop 0.21 users: the current patch can be used as is with 0.21, but you will need to make sure to compile it with appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e. apache-solr-hadoop-xxx-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20. was (Author: kanarsky): Note for the Hadoop 0.21 users: the current patch can be used as is with 0.21, but you will need to make sure to compile it with appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj etc.) to the lib folder of the job's jar file (i.e. apache-solr-hadoop-1.4.x-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: Next Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. 
The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. 
[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981967#action_12981967 ] Steven Rowe commented on LUCENE-2611: - bq. And perhaps the copyright setup should be set up for ASL. bq. I've used the copyright plugin a lot and its a great way to ensure that the ASL is added to any new files. Might be useful to add it to reduce the hassle for new contributors. Committed IntelliJ IDEA Copyright Plugin configuration for the Apache Software Licence: trunk rev. 1059199, branch_3x rev. 1059200 IntelliJ IDEA and Eclipse setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2611-branch-3x-part2.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test_2.patch Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming. The attached patches add a new top level directory {{dev-tools/}} with sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, as well as top-level ant targets named idea and eclipse that copy these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit run configuration per module is included. 
The Eclipse configuration includes a source entry for each source/test/resource location and classpath setup: a library entry for each jar. For IDEA, once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. For Eclipse, once {{ant eclipse}} has been run, the user has to refresh the project (right-click on the project and choose Refresh). If these patches are committed, Subversion svn:ignore properties should be added/modified to ignore the destination IDEA and Eclipse configuration locations. Iam Jambour has written up on the Lucene wiki a detailed set of instructions for applying the 3.X branch patch for IDEA: http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ
Lucene-3.x - Build # 242 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/242/ All tests passed Build Log (for compile errors): [...truncated 21064 lines...]
[jira] Resolved: (SOLR-975) admin-extra.html not correctly displayed when using multicore configuration
[ https://issues.apache.org/jira/browse/SOLR-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-975. --- Resolution: Fixed Fix Version/s: 4.0 Assignee: Yonik Seeley Thanks for verifying, Edward. admin-extra.html not correctly displayed when using multicore configuration - Key: SOLR-975 URL: https://issues.apache.org/jira/browse/SOLR-975 Project: Solr Issue Type: Bug Components: web gui Affects Versions: 1.4 Environment: Jetty openjdk 1.6.0 1.0.b12 (EPEL package for EL5) Reporter: Edward Rudd Assignee: Yonik Seeley Fix For: 4.0 I'm having cross-talk issues with using the Solr nightlies (and probably w/ the 1.3.0 release, but I have not tested, as I needed newer features of the DataImportHandler in the nightlies). Basic scenario for this bug is as follows: I have two cores configured and BOTH have a customized admin-extra.html; however, going to the admin pages uses the SAME admin-extra.html for all cores. The one used is whichever core is browsed first. This looks like a caching bug where the cache is not taking the core into account. Basically my admin-extra.html has a link to the data importer script and a link to reload the core (which has to have the core name explicitly in the per-core admin-extra.html).
[jira] Created: (SOLR-2315) analysis.jsp highlight matches no longer works
analysis.jsp highlight matches no longer works Key: SOLR-2315 URL: https://issues.apache.org/jira/browse/SOLR-2315 Project: Solr Issue Type: Bug Components: web gui Reporter: Hoss Man Fix For: 3.1, 4.0 As noted by Teruhiko Kurosaka on the mailing list, at some point since Solr 1.4, highlight matches stopped working on analysis.jsp -- on both the 3x and trunk branches
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982028#action_12982028 ] Shai Erera commented on LUCENE-1540: Patch looks good! Can you make TrecContentSource.read() public and not package-private? That way people can use it outside benchmark's package as well, supporting other/newer/older TREC formats. Improvements to contrib.benchmark for TREC collections -- Key: LUCENE-1540 URL: https://issues.apache.org/jira/browse/LUCENE-1540 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Tim Armstrong Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-1540.patch The benchmarking utilities for TREC test collections (http://trec.nist.gov) are quite limited and do not support some of the variations in format of older TREC collections. I have been doing some benchmarking work with Lucene and have had to modify the package to support: * Older TREC document formats, which the current parser fails on due to missing document headers. * Variations in query format - newlines after the title tag causing the query parser to get confused. * Ability to detect and read in uncompressed text collections * Storage of document numbers by default without storing full text. I can submit a patch if there is interest, although I will probably want to write unit tests for the new functionality first.
Solr-3.x - Build # 228 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-3.x/228/ All tests passed Build Log (for compile errors): [...truncated 20279 lines...]
Lucene-Solr-tests-only-trunk - Build # 3783 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3783/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1127) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1059) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:227) Build Log (for compile errors): [...truncated 8229 lines...]