RE: [JENKINS] Lucene-trunk - Build # 1548 - Still Failing
Robert: Thanks for fixing, I triggered a new full build!

- Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de

-----Original Message-----
From: Apache Jenkins Server [mailto:hud...@hudson.apache.org]
Sent: Monday, May 02, 2011 4:08 AM
To: dev@lucene.apache.org
Subject: [JENKINS] Lucene-trunk - Build # 1548 - Still Failing

Build: https://builds.apache.org/hudson/job/Lucene-trunk/1548/

No tests ran.

Build Log (for compile errors):
[...truncated 9474 lines...]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3057) LuceneTestCase#newFSDirectoryImpl misses to set LockFactory if ctor call throws exception
[ https://issues.apache.org/jira/browse/LUCENE-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027570#comment-13027570 ]

Simon Willnauer commented on LUCENE-3057:
-----------------------------------------

bq. Hi Simon, I think you meant to set the lockfactory in the finally block?

Thanks Robert for catching this. I removed the return statement in revision 1098375. Backported to 3.x in revision 1098505.

LuceneTestCase#newFSDirectoryImpl misses to set LockFactory if ctor call throws exception

Key: LUCENE-3057
URL: https://issues.apache.org/jira/browse/LUCENE-3057
Project: Lucene - Java
Issue Type: Bug
Components: Tests
Affects Versions: 4.0
Reporter: Simon Willnauer
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-3057.patch, LUCENE-3057_bug.patch

selckin reported on IRC that if you run "ant test -Dtestcase=TestLockFactory -Dtestmethod=testNativeFSLockFactoryPrefix -Dtests.directory=FSDirectory" the test fails. Since FSDirectory is an abstract class it cannot be instantiated, so our code falls back to FSDirectory.open, yet we fail to set the given lockFactory.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
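The shape of the fix can be illustrated with a small, self-contained sketch. This is not the actual LuceneTestCase code; the class and method names below are stand-ins for FSDirectory/LockFactory. The point is the control flow: construct reflectively, fall back when the class is abstract, and set the lock factory on a single exit path so neither branch (and no early return) can skip it.

```java
// Self-contained analog of the LUCENE-3057 fix; names are illustrative,
// not the real Lucene API.
public class NewDirSketch {
    public interface Dir {
        void setLockFactory(String lf);
        String lockFactory();
    }

    public static class SimpleDir implements Dir {
        private String lf;
        public SimpleDir() {}
        public void setLockFactory(String lf) { this.lf = lf; }
        public String lockFactory() { return lf; }
    }

    // Cannot be instantiated reflectively, like the abstract FSDirectory.
    public static abstract class AbstractDir implements Dir {}

    public static Dir newDir(Class<? extends Dir> clazz, String lockFactory) throws Exception {
        Dir d;
        try {
            // Fails with InstantiationException for an abstract class.
            d = clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            d = new SimpleDir(); // fallback path, like FSDirectory.open
        }
        // Single exit point: the lock factory is applied on BOTH paths.
        d.setLockFactory(lockFactory);
        return d;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(newDir(SimpleDir.class, "native").lockFactory());
        System.out.println(newDir(AbstractDir.class, "native").lockFactory());
    }
}
```

With the removed early return, both calls above report the requested lock factory; before the fix, the fallback path would have silently dropped it.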
[jira] [Assigned] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-3041:
---------------------------------------

Assignee: Simon Willnauer

Support Query Visting / Walking

Key: LUCENE-3041
URL: https://issues.apache.org/jira/browse/LUCENE-3041
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Chris Male
Assignee: Simon Willnauer
Priority: Minor
Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch

Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:

{code}
public interface QueryVisitor {
  Query visit(Query query);
}
{code}

and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys that they are interested in.
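A minimal, self-contained sketch of the reflection-based dispatch idea (the classes below are toy stand-ins, not the patch's actual API or Lucene's Query types): the visitor declares overloads only for the subclasses it cares about, and the dispatcher resolves the most specific visit(...) at runtime, walking up the class hierarchy to the generic visit(Query) fallback.

```java
import java.lang.reflect.Method;

// Hedged sketch of reflection-based visitor dispatch; illustrative names only.
public class VisitorSketch {
    public static class Query {}
    public static class TermQuery extends Query {}
    public static class BooleanQuery extends Query {}

    public interface QueryVisitor { Query visit(Query query); }

    // Walk up the hierarchy looking for a matching visit(...) overload;
    // the interface guarantees visit(Query) exists as a last resort.
    public static Query dispatch(QueryVisitor visitor, Query query) throws Exception {
        for (Class<?> c = query.getClass(); c != Object.class; c = c.getSuperclass()) {
            try {
                Method m = visitor.getClass().getMethod("visit", c);
                return (Query) m.invoke(visitor, query);
            } catch (NoSuchMethodException ignored) {
                // no overload for this exact type: try the superclass next
            }
        }
        return visitor.visit(query);
    }

    public static class Rewriter implements QueryVisitor {
        public Query visit(Query q) { return q; }                      // generic fallback
        public Query visit(TermQuery q) { return new BooleanQuery(); } // specific overload
    }

    public static void main(String[] args) throws Exception {
        QueryVisitor v = new Rewriter();
        Query t = new TermQuery();
        Query b = new BooleanQuery();
        System.out.println(dispatch(v, t) == t); // false: TermQuery overload rewrote it
        System.out.println(dispatch(v, b) == b); // true: fallback left it untouched
    }
}
```

An implementor here only writes visit(TermQuery); every other Query type silently falls through to the generic method, which is the appeal of the approach.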
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027575#comment-13027575 ]

Simon Willnauer commented on LUCENE-3041:
-----------------------------------------

bq. New patch that implements what I said in the previous comments (except for the IS changes).

Chris, patch looks good! Are you going to add the IS changes here too? I wonder if we could move the MethodDispatchException into InvocationDispatcher as a nested class; I don't think we need an extra file for this class.
[jira] [Updated] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3056:
------------------------------------

Component/s: Search
Lucene Fields: [New, Patch Available] (was: [New])
Affects Version/s: 4.0
Fix Version/s: 4.0

Support Query Rewriting Caching

Key: LUCENE-3056
URL: https://issues.apache.org/jira/browse/LUCENE-3056
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 4.0
Reporter: Chris Male
Fix For: 4.0
Attachments: LUCENE-3056.patch

Out of LUCENE-3041, it's become apparent that using a Visitor / Walker isn't right for caching the rewrites of Querys. Although we still intend to introduce the Query Visitor / Walker for advanced query transformations, rewriting still serves a purpose for very specific implementation-detail rewriting. As such, it can be very expensive, so I think we should introduce first-class support for rewrite caching. I also feel the key is to make the caching as transparent as possible, to reduce the strain on Query implementors. The TermState idea gave me the idea of maybe making a RewriteState / RewriteCache / RewriteInterceptor, which would be consulted for rewritten Querys. It would maintain an internal cache that it would check; if a value wasn't found, it'd then call Query#rewrite and cache the result. By having this external rewrite source, people could 'pre' rewrite Querys if they were particularly expensive but also common.
[jira] [Updated] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3041:
------------------------------------

Lucene Fields: [New, Patch Available] (was: [New])
Affects Version/s: 4.0
Fix Version/s: 4.0
[jira] [Commented] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027578#comment-13027578 ]

Simon Willnauer commented on LUCENE-3056:
-----------------------------------------

Hey Chris, here are some comments:

* I like that you only have to change BooleanQuery to enable this!! Nice!
* Can we rename RewriteState to RewriteContext? It's just more consistent with all the other ctxs we pass to query and scorer.
* Can we rename DefaultRewriteState to CachingRewriteContext and make a RewriteContext that simply does query.rewrite()? That way nothing changes by default, and we can use a static instance in Query#rewrite(IndexReader), maybe as an anonymous inner class in Query.
* Can we move CachingRewriteContext into lucene/src/java/org/apache/lucene/util?

This change somewhat depends on LUCENE-3041, since we might wanna pass that RewriteContext on a per-segment level, right? So maybe we should link those issues.
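The "external rewrite source" idea can be sketched in a few lines. This is a minimal, self-contained analog, not the patch's code: RewriteCacheSketch stands in for the proposed CachingRewriteContext, and the toy Query's rewrite() plays the role of an expensive Query#rewrite. The context consults its cache first and only delegates on a miss.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a caching rewrite context; names are illustrative.
public class RewriteCacheSketch {
    public static class Query {
        final String repr;
        public int rewrites = 0; // counts how often the expensive path ran
        public Query(String repr) { this.repr = repr; }
        public Query rewrite() { rewrites++; return new Query(repr + "'"); }
    }

    public static class CachingRewriteContext {
        // A real cache would key on the query's equals/hashCode; this sketch
        // relies on instance identity, which is enough to show the flow.
        private final Map<Query, Query> cache = new HashMap<>();

        public Query rewrite(Query q) {
            // miss: call Query#rewrite once and remember the result
            return cache.computeIfAbsent(q, Query::rewrite);
        }
    }

    public static void main(String[] args) {
        Query q = new Query("a:b");
        CachingRewriteContext ctx = new CachingRewriteContext();
        Query r1 = ctx.rewrite(q);
        Query r2 = ctx.rewrite(q);
        System.out.println(r1 == r2);   // second call is served from the cache
        System.out.println(q.rewrites); // the expensive rewrite ran only once
    }
}
```

Pre-rewriting, as the issue suggests, would just mean populating the map before search time.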
[jira] [Updated] (SOLR-2480) Text extraction of password protected files
[ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinichiro Abe updated SOLR-2480:
---------------------------------

Attachment: SOLR-2480-idea1.patch

Text extraction of password protected files

Key: SOLR-2480
URL: https://issues.apache.org/jira/browse/SOLR-2480
Project: Solr
Issue Type: Improvement
Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 3.1
Reporter: Shinichiro Abe
Priority: Minor
Attachments: SOLR-2480-idea1.patch

Proposal: There are password-protected files: PDF and Office documents in the 2007 and 97 formats. These files are posted using SolrCell. We cannot read these files if we do not know their reading password, so no text can be extracted from them. My requirement is that these files should be processed normally, without extracting text and without throwing an exception.

Background: Currently, when you post a password-protected file, Solr returns a 500 server error: Solr catches the error in ExtractingDocumentLoader and throws a TikaException. I use ManifoldCF. If the Solr server responds with a 500, ManifoldCF judges that the document should be retried, since it has no idea what happened, and it attempts to retry posting many times without getting the password. In another case, my customer posts files with embedded images. Sometimes Solr seems to throw a TikaException of unknown cause. He wants to post just the metadata without extracting text, but the exception stops him from posting.
[jira] [Commented] (SOLR-2480) Text extraction of password protected files
[ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027579#comment-13027579 ]

Shinichiro Abe commented on SOLR-2480:
--------------------------------------

{quote}
But I think you want Solr to skip the content field because Tika cannot extract it for some reason, but add metadata fields, right?
{quote}

Yes, I want to post the metadata without the contents that throw the parse error. ExtractingDocumentLoader also should be fixed. This patch expresses improvement idea (1), and I think SOLR-445 can resolve improvement idea (2).
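The requested behavior boils down to one try/catch around the extraction step. The following self-contained sketch (stand-in names, not the SOLR-2480 patch or the Tika API) shows the intent: metadata is always collected, and when extraction fails with an ignore flag set, the content field is skipped instead of surfacing a 500 to the client.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of "skip content, keep metadata" on extraction failure.
public class SkipContentSketch {
    public static class ExtractionException extends Exception {}

    // Stand-in for Tika: throws for password-protected input.
    static String extractText(byte[] bytes, boolean passwordProtected) throws ExtractionException {
        if (passwordProtected) throw new ExtractionException();
        return "full text";
    }

    public static Map<String, String> load(byte[] bytes, boolean passwordProtected,
                                           boolean ignoreExtractionError) throws ExtractionException {
        Map<String, String> doc = new HashMap<>();
        doc.put("stream_size", String.valueOf(bytes.length)); // metadata is always available
        try {
            doc.put("content", extractText(bytes, passwordProtected));
        } catch (ExtractionException e) {
            if (!ignoreExtractionError) throw e; // old behavior: propagates as HTTP 500
            // new behavior: skip the content field, keep the metadata
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> doc = load(new byte[10], true, true);
        System.out.println(doc.containsKey("content")); // false: content skipped
        System.out.println(doc.get("stream_size"));     // metadata survived
    }
}
```

With the flag off, the loader behaves like today's ExtractingDocumentLoader and the exception propagates; with it on, crawlers like ManifoldCF get a successful response and stop retrying.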
[jira] [Created] (LUCENE-3060) Revise ThreadAffinityDocumentsWriterThreadPool queue handling
Revise ThreadAffinityDocumentsWriterThreadPool queue handling

Key: LUCENE-3060
URL: https://issues.apache.org/jira/browse/LUCENE-3060
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Priority: Minor
Fix For: 4.0

Spin-off from LUCENE-3023... In ThreadAffinityDocumentsWriterThreadPool#getAndLock() we had talked about switching from a per-threadstate queue (safeway model) to a single queue (whole foods).
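The trade-off behind the two models can be shown with a toy calculation (this is not the DWPT thread-pool code; it only illustrates the queueing argument). With per-threadstate queues a doc pinned to a busy state waits behind that state's backlog even while other states idle; with one shared queue, any free thread takes the next doc.

```java
// Hedged toy model of the "safeway" vs "whole foods" queueing discussion.
public class QueueModelSketch {
    // Per-threadstate model: each doc is pinned to one queue, so drain time
    // is bounded by the longest queue even if other workers sit idle.
    public static int perThreadStateRounds(int[] queueSizes) {
        int max = 0;
        for (int s : queueSizes) max = Math.max(max, s);
        return max;
    }

    // Single shared queue: whichever worker frees up takes the head, so
    // drain time is total work spread evenly across the workers.
    public static int singleQueueRounds(int[] queueSizes, int workers) {
        int total = 0;
        for (int s : queueSizes) total += s;
        return (total + workers - 1) / workers; // ceiling division
    }

    public static void main(String[] args) {
        int[] pinned = {3, 1}; // worker 1 has 3 docs queued, worker 2 has 1
        System.out.println(perThreadStateRounds(pinned)); // 3 rounds, worker 2 idles
        System.out.println(singleQueueRounds(pinned, 2)); // 2 rounds, no idling
    }
}
```

The real decision also weighs the affinity benefit (per-thread buffers stay warm), which this toy deliberately ignores.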
[jira] [Commented] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027583#comment-13027583 ]

Simon Willnauer commented on LUCENE-3023:
-----------------------------------------

bq. In ThreadAffinityDocumentsWriterThreadPool#getAndLock() we had talked about switching from a per-threadstate queue (safeway model) to a single queue (whole foods). I'm wondering if we should do that before we commit or change that later as a separate patch?

I opened LUCENE-3060 for this. @buschmi maybe you can add some more info to that issue if you recall the discussion?

{quote}
Committed merged branch to trunk revision: 1098427
Moved branch away as tag in revision: 1098428
{quote}

AWESOME! :)

Land DWPT on trunk

Key: LUCENE-3023
URL: https://issues.apache.org/jira/browse/LUCENE-3023
Project: Lucene - Java
Issue Type: Task
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Fix For: 4.0
Attachments: LUCENE-3023-svn-diff.patch, LUCENE-3023-ws-changes.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_iw_iwc_jdoc.patch, LUCENE-3023_simonw_review.patch, LUCENE-3023_svndiff.patch, LUCENE-3023_svndiff.patch, diffMccand.py, diffSources.patch, diffSources.patch, realtime-TestAddIndexes-3.txt, realtime-TestAddIndexes-5.txt, realtime-TestIndexWriterExceptions-assert-6.txt, realtime-TestIndexWriterExceptions-npe-1.txt, realtime-TestIndexWriterExceptions-npe-2.txt, realtime-TestIndexWriterExceptions-npe-4.txt, realtime-TestOmitTf-corrupt-0.txt

With LUCENE-2956 we have resolved the last remaining issue for LUCENE-2324, so we can proceed with landing the DWPT development on trunk soon. I think one of the bigger issues here is to make sure that all JavaDocs for IW etc. are still correct; I will start going through that first.
[jira] [Commented] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027587#comment-13027587 ]

Uwe Schindler commented on LUCENE-3023:
---------------------------------------

The first full Jenkins build also succeeded. When reviewing the first Clover build report, I noticed 2 new final classes that have no code coverage at all (see [https://builds.apache.org/hudson/job/Lucene-trunk/1549/clover-report/org/apache/lucene/index/pkg-summary.html]):
- DocFieldConsumers
- DocFieldConsumersPerField

I am not sure if those are old relics (dead code) or newly added ones that are not yet used.
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027589#comment-13027589 ]

Chris Male commented on LUCENE-3041:
------------------------------------

bq. Are you going to add the IS changes here too?

Yup, I'm just working through the best way to expose the API in the IS while supporting per-segment walking. I'll have something together in the next day or two.

bq. I wonder if we could move the MethodDispatchException into InvocationDispatcher as a nested class

Good call. I'll make the change and upload something immediately.
[jira] [Commented] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027592#comment-13027592 ]

Chris Male commented on LUCENE-3056:
------------------------------------

bq. This change somewhat depends on LUCENE-3041 since we might wanna pass that RewriteContext on a per segment level right?

Yeah, that's very true. I'm wondering whether it's best to rethink the signatures of the #search methods in IS, since we need to incorporate both this and LUCENE-3041. I'll upload a patch shortly addressing the other improvements.
[jira] [Updated] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Male updated LUCENE-3056:
-------------------------------

Attachment: LUCENE-3056.patch

Patch implementing Simon's suggestions:
- RewriteState -> RewriteContext
- DefaultRewriteState -> org.apache.lucene.util.CachingRewriteContext
- Query now has a static anonymous inner class instance which does a simple rewrite.
How should one impl own MergeScheduler
Hi,

I wanted to impl my own MergeScheduler (a variation of SerialMergeScheduler which does minor additional work), and found out I cannot really, for lack of visible API on IndexWriter, such as getNextMerge() and merge(OneMerge) -- both exist, but are package-private.

It got me thinking -- how can anyone impl his own MergeScheduler today? Perhaps people impl MergePolicy only? Would it make sense to open this API to our users? Is there other API we should consider opening w.r.t. MergeScheduler/Policy?

Shai
[jira] [Updated] (SOLR-2472) StatsComponent should support hierarchical facets
[ https://issues.apache.org/jira/browse/SOLR-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Drozdov updated SOLR-2472:
---------------------------------

Affects Version/s: 4.0

StatsComponent should support hierarchical facets

Key: SOLR-2472
URL: https://issues.apache.org/jira/browse/SOLR-2472
Project: Solr
Issue Type: New Feature
Affects Versions: 3.1, 4.0
Reporter: Dmitry Drozdov
Attachments: SOLR-2472.patch
Original Estimate: 24h
Remaining Estimate: 24h

It is currently possible to get only a single layer of faceting in StatsComponent. The proposal is to make it possible to specify the stats.facet parameter like this:

stats=true&stats.field=sField&stats.facet=fField1,fField2

and get a response like this:

<lst name="stats">
 <lst name="stats_fields">
  <lst name="sField">
   <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">4.0</double> <long name="count">4</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
   <lst name="facets">
    <lst name="fField1">
     <lst name="fField1Value1">
      <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">2.0</double> <long name="count">2</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
      <lst name="facets">
       <lst name="fField2">
        <lst name="fField2Value1">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
        <lst name="fField2Value2">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
       </lst>
      </lst>
     </lst>
     <lst name="fField1Value2">
      <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">2.0</double> <long name="count">2</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
      <lst name="facets">
       <lst name="fField2">
        <lst name="fField2Value1">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
        <lst name="fField2Value2">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
       </lst>
      </lst>
     </lst>
    </lst>
   </lst>
  </lst>
 </lst>
</lst>
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7642/

1 tests failed.

REGRESSION: org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message: Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2894)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
    at java.lang.StringBuffer.append(StringBuffer.java:337)
    at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
    at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
    at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
    at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1091)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1023)

Build Log (for compile errors):
[...truncated 9253 lines...]
Re: How should one impl own MergeScheduler
I think we should open up these APIs. And we should make a test case that lives outside of oal.index, to assert that all the needed APIs are in fact not package-private.

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 5:09 AM, Shai Erera ser...@gmail.com wrote:
> Hi
> I wanted to impl my own MergeScheduler (a variation of SerialMergeScheduler, which does minor additional work), and found out I cannot really, for lack of visible API on IndexWriter, such as getNextMerge() and merge(OneMerge) -- both exist, but are package-private.
> It got me thinking -- how can anyone impl his own MergeScheduler today? Perhaps people impl MergePolicy only? Would it make sense to open this API to our users? Is there other API we should consider opening w.r.t. MergeScheduler/Policy?
> Shai
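The visibility test Mike suggests could look something like the sketch below. This is a hedged, self-contained illustration, not an actual Lucene test: the helper reflects over a class's declared methods and reports whether a named method is callable from outside the package; a real version would point at org.apache.lucene.index.IndexWriter and the methods (getNextMerge, merge) once they are opened up.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hedged sketch of an out-of-package API-visibility check.
public class VisibilityCheckSketch {
    // True if the first declared method with this name is public or protected,
    // i.e. reachable by a subclass or caller outside the package.
    public static boolean isCallableFromOutside(Class<?> clazz, String methodName) {
        for (Method m : clazz.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                int mod = m.getModifiers();
                return Modifier.isPublic(mod) || Modifier.isProtected(mod);
            }
        }
        return false; // not declared at all
    }

    public static void main(String[] args) {
        // Demonstrated on a JDK class since this sketch has no Lucene on the
        // classpath: StringBuilder.append is public...
        System.out.println(isCallableFromOutside(StringBuilder.class, "append"));
        // ...while an undeclared name reports false.
        System.out.println(isCallableFromOutside(StringBuilder.class, "getNextMerge"));
    }
}
```

Such a test would fail the build if a later refactor quietly dropped a method back to package-private, which is exactly the regression to guard against.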
[jira] [Created] (SOLR-2483) DIH - an uppercase problem in query parameters
DIH - an uppercase problem in query parameters

Key: SOLR-2483
URL: https://issues.apache.org/jira/browse/SOLR-2483
Project: Solr
Issue Type: Bug
Components: clients - java, contrib - DataImportHandler
Affects Versions: 3.1
Environment: Windows Vista, Java 1.6
Reporter: Lubo Torok

I have two tables called PROBLEM and KOMENTAR (means 'comment' in English) in the DB. One problem can have more comments. I want to index them all.

schema.xml looks as follows:

... some fields ...
<field name="problem_id" type="string" stored="true" required="true"/>
... some fields ...

data-config.xml:

<document name="problemy">
  <entity name="problem" pk="problem_id" query="select to_char(id) as problem_id, nazov as problem_nazov, cislo as problem_cislo, popis as problem_popis from problem">
    <entity name="komentar" query="select id as komentar_id, nazov as komentar_nazov, text as komentar_text from komentar where to_char(fk_problem)='${problem.PROBLEM_ID}'"/>
  </entity>
</document>

If you write '${problem.PROBLEM_ID}' in lower case, i.e. '${problem.problem_id}', Solr will not import the inner entity. This seems strange to me, and it took me some time to figure out. Note that the primary key in PROBLEM is called ID; I defined the alias problem_id (yes, lower case) in SQL, and in the schema this field is defined as problem_id, again in lower case. But when I run

http://localhost:8983/solr/dataimport?command=full-import&debug=true&verbose=on

so I can see some debug information, there is this part:

<lst name="verbose-output">
  <lst name="entity:problem">
    <lst name="document#1">
      <str name="query">select to_char(id) as problem_id, nazov as problem_nazov, cislo as problem_cislo, popis as problem_popis from problem</str>
      <str name="time-taken">0:0:0.465</str>
      <str>--- row #1-</str>
      <str name="PROBLEM_NAZOV">test zodpovedneho</str>
      <str name="PROBLEM_ID">2533274790395945</str>
      <str name="PROBLEM_CISLO">201009304</str>
      <str name="PROBLEM_POPIS">csfdewafedewfw</str>
      <str>-</str>
      <lst name="entity:komentar">
        <str name="query">select id as komentar_id, nazov as komentar_nazov, text as komentar_text from komentar where to_char(fk_problem)='2533274790395945'</str>
...

where you can see that, internally, the fields of PROBLEM are represented in uppercase even though the user (me) did not define them that way. My conclusion, I guess, is that a parameter referring to the parent entity, ${entity.field}, must always be written in uppercase, i.e. ${entity.FIELD}. Here is an example of the indexed entity as written after a full-import command with debug and verbose on:

<arr name="documents">
  <lst>
    <arr name="problem_nazov"><str>test zodpovedneho</str></arr>
    <arr name="problem_id"><str>2533274790395945</str></arr>
    <arr name="problem_cislo"><str>201009304</str></arr>
    <arr name="problem_popis"><str>csfdewafedewfw</str></arr>
    <arr name="komentar_id"><str>java.math.BigDecimal:5066549580791985</str></arr>
    <arr name="komentar_text"><str>a.TXT</str></arr>
  </lst>
</arr>

Here the field names are in lower case. I consider this a bug. Maybe I am wrong and it's a feature; I have worked with Solr for only a few days.
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot commented on LUCENE-3041:
-----------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

> Support Query Visting / Walking
> -------------------------------
>
>         Key: LUCENE-3041
>         URL: https://issues.apache.org/jira/browse/LUCENE-3041
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
> Affects Versions: 4.0
>    Reporter: Chris Male
>    Assignee: Simon Willnauer
>    Priority: Minor
>     Fix For: 4.0
> Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch
>
> Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:
>
> {code}
> public interface QueryVisitor {
>   Query visit(Query query);
> }
> {code}
>
> and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys they are interested in.
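For reference, the "JDK reflection and CHM" combination Earwin mentions can be sketched in a few lines. This is an illustrative toy, not the patch's actual code: a ConcurrentHashMap caches the resolved visit method per (visitor class, argument class) pair so concurrent lookups stay safe, and the visitor's own unchecked exceptions are rethrown transparently rather than wrapped:

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy thread-safe reflective dispatcher (illustrative only). */
class VisitorDispatcher {
    // One resolved Method per (visitor class, argument class) pair.
    private static final ConcurrentMap<String, Method> CACHE = new ConcurrentHashMap<>();

    static Object dispatch(Object visitor, Object arg) throws Exception {
        String key = visitor.getClass().getName() + "#" + arg.getClass().getName();
        Method m = CACHE.get(key);
        if (m == null) {
            // Walk up the argument's class hierarchy looking for visit(X).
            for (Class<?> c = arg.getClass(); c != null; c = c.getSuperclass()) {
                try {
                    m = visitor.getClass().getMethod("visit", c);
                    break;
                } catch (NoSuchMethodException ignored) {
                    // keep climbing
                }
            }
            if (m == null) {
                throw new IllegalArgumentException("no visit() method for " + arg.getClass());
            }
            m.setAccessible(true);           // allow invoking through non-public classes
            CACHE.putIfAbsent(key, m);
        }
        try {
            return m.invoke(visitor, arg);
        } catch (InvocationTargetException e) {
            // Rethrow the visitor's own unchecked exceptions transparently.
            Throwable t = e.getCause();
            if (t instanceof RuntimeException) throw (RuntimeException) t;
            if (t instanceof Error) throw (Error) t;
            throw e;
        }
    }
}
```

A real implementation would also need the ambiguity diagnostics Earwin refers to (two applicable visit methods for one argument type); this sketch only handles the happy path and the no-match case.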
[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM:
------------------------------------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

Same can be said for tests.

What about throwing the original invocation exception instead of the wrapper? Since we're emulating a language feature, a simple method call, it's logical to only throw custom exceptions in .. well .. exceptional cases, like ambiguity / no matching method. If client code throws Errors/RuntimeExceptions, they should be transparently rethrown.

was (Author: earwin):
The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

> Support Query Visting / Walking
> -------------------------------
>
>         Key: LUCENE-3041
>         URL: https://issues.apache.org/jira/browse/LUCENE-3041
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
> Affects Versions: 4.0
>    Reporter: Chris Male
>    Assignee: Simon Willnauer
>    Priority: Minor
>     Fix For: 4.0
> Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch
>
> Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:
>
> {code}
> public interface QueryVisitor {
>   Query visit(Query query);
> }
> {code}
>
> and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys they are interested in.
[jira] [Created] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
Open IndexWriter API to allow custom MergeScheduler implementation
------------------------------------------------------------------

        Key: LUCENE-3061
        URL: https://issues.apache.org/jira/browse/LUCENE-3061
    Project: Lucene - Java
 Issue Type: Improvement
 Components: Index
   Reporter: Shai Erera
   Assignee: Shai Erera
   Priority: Minor
    Fix For: 3.2, 4.0

IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these APIs, as well as any others that can be useful for custom MS implementations.
Re: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
I slurped this hprof down and opened it w/ YourKit... Something weird is going on, because there is a single massive (151 MB) string, stack-local to one of the threads, filled with character U+00B2.

The test itself looks innocuous; I don't think it creates any massive stack-local strings. I'm baffled. Robert, maybe something crazy is happening in RuleBasedCollator?

Mike
http://blog.mikemccandless.com

On Mon, May 2, 2011 at 5:53 AM, Apache Jenkins Server <hud...@hudson.apache.org> wrote:
> Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7642/
>
> 1 tests failed.
>
> REGRESSION: org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe
>
> Error Message:
> Java heap space
>
> Stack Trace:
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2894)
>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
>         at java.lang.StringBuffer.append(StringBuffer.java:337)
>         at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
>         at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
>         at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
>         at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
>         at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1091)
>         at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1023)
>
> Build Log (for compile errors):
> [...truncated 9253 lines...]
Re: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
On Mon, May 2, 2011 at 6:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> I slurped this hprof down and opened it w/ YourKit... Something weird is going on, because there is a single massive (151 MB) string, stack local to one of the threads, filled with character U+00B2.
>
> The test itself looks innocuous; I don't think it creates any massive stack local strings. I'm baffled. Robert maybe something crazy is happening in RuleBasedCollator?

thanks for debugging this... at first I thought it must be a JRE bug, just because the test has been toned down so many times. but this test doesn't OOM intermittently in trunk, right? So before disabling the test and saying it's out of our hands, it would be good to check that it's not a bug in the encoder (IndexableBinaryStringTools).
[jira] [Updated] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3061:
-------------------------------

Attachment: LUCENE-3061.patch

Open up the necessary API + add TestCustomMergeScheduler under src/test/o.a.l/index/publicapi. The changes are very trivial. If you would like to suggest an alternative package I should put the test in, I will gladly move it.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027621#comment-13027621 ]

Uwe Schindler commented on LUCENE-3061:
---------------------------------------

All of the public API tests are directly under o.a.lucene at the moment.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Updated] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3061:
-------------------------------

Attachment: LUCENE-3061.patch

Thanks Uwe! Following your comment, I noticed there is a TestMergeSchedulerExternal under o.a.l, which covers extending ConcurrentMergeScheduler. So I moved my MS impl + test case there. I think this is ready to commit.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027626#comment-13027626 ]

Earwin Burrfoot commented on LUCENE-3061:
-----------------------------------------

Mark these as @experimental?

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027627#comment-13027627 ]

Shai Erera commented on LUCENE-3061:
------------------------------------

I don't think they are experimental though -- they have existed for ages; we only made them public. I get your point -- you don't think we should commit to this API signature -- but IMO we should: if MS is a valid extension point for applications, we must support this API, otherwise MS cannot be extended at all. Also, getNextMerge()'s javadoc specifies "Expert: the MergeScheduler calls this method ..." -- that kind of made this API public a long time ago, only it wasn't.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
Re: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
On Mon, May 2, 2011 at 6:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> I slurped this hprof down and opened it w/ YourKit... Something weird is going on, because there is a single massive (151 MB) string, stack local to one of the threads, filled with character U+00B2.
>
> The test itself looks innocuous; I don't think it creates any massive stack local strings. I'm baffled. Robert maybe something crazy is happening in RuleBasedCollator?

upon further investigation, I think it must be a JRE bug. For one, I cannot (and was never able to) repro this locally. For now I'd like to change the test to use randomSimpleString -- hopefully this is enough to dodge the bug!
[jira] [Issue Comment Edited] (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027629#comment-13027629 ]

Matthias Pigulla edited comment on SOLR-42 at 5/2/11 12:02 PM:
---------------------------------------------------------------

I don't think it's a duplicate, and the issue is still unresolved, at least in regard to [#comment-12625835] and the 1.4.1 release.

The input string ??xx yy xx will have the start offsets for xx, yy and xx at 3, 6 and 9 respectively, and is off by one. ? ?? ?xx yy xx [spaces added between question marks for JIRA display] will even have 6, 9 and 12; that is, every ?? (as a special degenerate kind of XML PI) shifts the offset by one.

was (Author: mpdude):
I don't think it's a duplicate and the issue is still unresolved at least in regard to [#comment-12625835] and the 1.4.1 release. The input string ??xx yy xx will have the start offsets for xx, yy and xx at 3, 6 and 9 respectively and is off by one. xx yy xx will even have 6, 9 and 12, that is, every ?? (as a special degenerated kind of XML PI) will shift the offset by one.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>         Key: SOLR-42
>         URL: https://issues.apache.org/jira/browse/SOLR-42
>     Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>    Reporter: Andrew May
>    Assignee: Grant Ingersoll
>    Priority: Minor
> Attachments: HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java, htmlStripReaderTest.html
>
> Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
>
> Example title field: <sup>40</sup>Ar/<sup>39</sup>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
>
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
>
> Response from Yonik on the solr-user mailing-list:
>
> HTMLStripWhitespaceTokenizerFactory works in two phases... HTMLStripReader removes the HTML and passes the result to WhitespaceTokenizer... at that point, Tokens are generated, but the offsets will correspond to the text after HTML removal, not before.
>
> I did it this way so that HTMLStripReader could go before any tokenizer (like StandardTokenizer).
>
> Can you open a JIRA bug for this? The fix would be a special version of HTMLStripReader integrated with a WhitespaceTokenizer to keep offsets correct.
Re: Index searcher can't find the doc of any field value
First, this kind of question is better suited for the Lucene users' list; this list is intended for people actively developing the Lucene code itself.

That said, your problem most likely is that you are indexing your fields UN_TOKENIZED, which means that the information isn't split into words. Try using TOKENIZED. By the way, what version are you using? UN_TOKENIZED has been deprecated for quite some time. You would probably get a lot of value from Luke.

Best
Erick

On Fri, Apr 29, 2011 at 10:44 PM, soheila dehghanzadeh <sally...@gmail.com> wrote:
> Hi Friends,
>
> I'm using Lucene to index a file where each line contains 4 elements separated by spaces. Because I want to retrieve any line with specific text in a specific part, I add each line to the index as a separate document with 4 fields, named A, B, C and D. This is the code I use to index my file:
>
>     try {
>         File file = new File("e://data3");
>         BufferedReader reader = new BufferedReader(new FileReader(file));
>         IndexWriter writer = new IndexWriter(indexDirectory, new SimpleAnalyzer(), true);
>         writer.setUseCompoundFile(true);
>         String line;
>         while ((line = reader.readLine()) != null) {
>             String[] index = line.split(" ");
>             Document document = new Document();
>             document.add(new Field("A", index[0], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             document.add(new Field("B", index[1], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             document.add(new Field("C", index[2], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             document.add(new Field("D", index[3], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             writer.addDocument(document);
>             System.out.println(writer.docCount());
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
>
> But when I try to search this index for some text that exists in, for example, field A, it fails to find the document (line). My search code is as follows:
>
>     try {
>         IndexSearcher is = new IndexSearcher(FSDirectory.getDirectory(indexDirectory, false));
>         Query q = new TermQuery(new Term("A", "hello"));
>         Hits hits = is.search(q);
>         for (int i = 0; i < hits.length(); i++) {
>             Document doc = hits.doc(i);
>             System.out.println("A: " + doc.get("A") + " B: " + doc.get("B")
>                 + " C: " + doc.get("C") + " D: " + doc.get("D"));
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
>
> Kindly let me know if there is any error in my code. Thanks in advance.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027631#comment-13027631 ]

Michael McCandless commented on LUCENE-3061:
--------------------------------------------

I think they should be @experimental? (Eg, MS itself is.)

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
MergePolicy Thresholds
Hi

Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.

However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold.

So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set, from the application level, than the current thresholds? I.e., wouldn't it be a simpler threshold to set, instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor?

Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges)? It could pick segments for merge such that it maximizes the result segment size (i.e., not necessarily merging in sequential order), but never more than mergeFactor at a time.

I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things.

What do you think of this? Am I trying to optimize too much? :)

Shai
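The arithmetic in the first two paragraphs is worth making explicit. A tiny sketch of the back-of-envelope model (this is illustrative arithmetic about the behavior described above, not Lucene code):

```java
/** Back-of-envelope model of the LogMP thresholds described above (illustrative). */
class LogMergePolicyMath {
    /** Only segments at or below maxMergeMB are merge candidates, so the
     *  largest segment a merge can produce is about maxMergeMB * mergeFactor. */
    static double largestResultMB(double maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    /** A segment already above the threshold is never picked for merging again. */
    static boolean eligibleForMerge(double segmentMB, double maxMergeMB) {
        return segmentMB <= maxMergeMB;
    }
}
```

With maxMergeMB = 2048 (2 GB) and mergeFactor = 10 this gives the ~20 GB ceiling from the example, while a pre-existing 5 GB or 7 GB segment is already over the 2 GB threshold and is never merged again -- exactly the gap Shai is pointing at.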
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027640#comment-13027640 ]

Robert Muir commented on LUCENE-3054:
-------------------------------------

{quote}
I propose to change SorterTemplate to fall back to mergeSort once it checks that the number of iterations grows beyond e.g. 20 (have to test a little bit).
{quote}

I like the idea of some guard here to prevent the stack overflow, and hopefully keep the quickSort performance for the places where we know it's better than mergesort.

> SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
> --------------------------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-3054
>         URL: https://issues.apache.org/jira/browse/LUCENE-3054
>     Project: Lucene - Java
>  Issue Type: Task
> Affects Versions: 3.1
>    Reporter: Robert Muir
>    Priority: Critical
> Attachments: LUCENE-3054.patch, LUCENE-3054.patch
>
> Looking at Otis's sort problem on the mailing list, he said:
> {noformat}
> * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump
> {noformat}
> I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
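The guard being discussed is essentially the introsort idea: give quicksort a recursion budget and fall back to a safe O(n log n) sort when degenerate input (or a broken comparator) exhausts it. A self-contained sketch on int[] follows; this is illustrative, not SorterTemplate's actual code, and the exact budget would need the benchmarking Uwe mentions:

```java
import java.util.Arrays;

/** Depth-guarded quicksort: falls back to a safe sort past a recursion budget. */
class GuardedSort {
    static void sort(int[] a) {
        quick(a, 0, a.length - 1, 2 * log2(a.length) + 20);
    }

    private static int log2(int n) {
        return n <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(n);
    }

    private static void quick(int[] a, int lo, int hi, int budget) {
        if (lo >= hi) return;
        if (budget <= 0) {
            // Degenerate recursion: bail out before the stack overflows.
            fallbackSortRange(a, lo, hi);
            return;
        }
        // Classic Hoare-style partition around the middle element.
        int p = a[lo + (hi - lo) / 2], i = lo, j = hi;
        while (i <= j) {
            while (a[i] < p) i++;
            while (a[j] > p) j--;
            if (i <= j) { int t = a[i]; a[i++] = a[j]; a[j--] = t; }
        }
        quick(a, lo, j, budget - 1);
        quick(a, i, hi, budget - 1);
    }

    private static void fallbackSortRange(int[] a, int lo, int hi) {
        // Stand-in for a real mergesort: copy out, sort, copy back.
        int[] copy = Arrays.copyOfRange(a, lo, hi + 1);
        Arrays.sort(copy);
        System.arraycopy(copy, 0, a, lo, copy.length);
    }
}
```

The attraction is exactly what Robert describes: the fast path keeps quicksort's behavior, and the budget only kicks in on the pathological inputs that would otherwise recurse too deep.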
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3054:
----------------------------------

Attachment: LUCENE-3054-stackoverflow.patch

Patch that shows the issue.

> SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
> --------------------------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-3054
>         URL: https://issues.apache.org/jira/browse/LUCENE-3054
>     Project: Lucene - Java
>  Issue Type: Task
> Affects Versions: 3.1
>    Reporter: Robert Muir
>    Priority: Critical
> Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch
>
> Looking at Otis's sort problem on the mailing list, he said:
> {noformat}
> * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump
> {noformat}
> I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027643#comment-13027643 ]

Uwe Schindler commented on LUCENE-3054:
---------------------------------------

As quicksort gets insanely slow when this type of data gets sorted, this also explains Otis's slowdown.

> SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
> --------------------------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-3054
>         URL: https://issues.apache.org/jira/browse/LUCENE-3054
>     Project: Lucene - Java
>  Issue Type: Task
> Affects Versions: 3.1
>    Reporter: Robert Muir
>    Priority: Critical
> Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch
>
> Looking at Otis's sort problem on the mailing list, he said:
> {noformat}
> * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump
> {noformat}
> I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Created] (LUCENE-3062) TestBytesRefHash#testCompact is broken
TestBytesRefHash#testCompact is broken
--------------------------------------

        Key: LUCENE-3062
        URL: https://issues.apache.org/jira/browse/LUCENE-3062
    Project: Lucene - Java
 Issue Type: Bug
Affects Versions: 4.0
   Reporter: Simon Willnauer
   Assignee: Simon Willnauer
    Fix For: 4.0
Attachments: LUCENE-3062.patch

TestBytesRefHash#testCompact fails when run with ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360

{noformat}
[junit] Testsuite: org.apache.lucene.util.TestBytesRefHash
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.454 sec
[junit]
[junit] ------------- Standard Error -------------
[junit] NOTE: reproduce with: ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360
[junit] NOTE: test params are: codec=PreFlex, locale=et, timezone=Pacific/Tahiti
[junit] NOTE: all tests run in this JVM: [TestBytesRefHash]
[junit] NOTE: Linux 2.6.35-28-generic amd64/Sun Microsystems Inc. 1.6.0_24 (64-bit)/cpus=12,threads=1,free=363421800,total=379322368
[junit] ------------------------------------------
[junit] Testcase: testCompact(org.apache.lucene.util.TestBytesRefHash): Caused an ERROR
[junit] bitIndex < 0: -27
[junit] java.lang.IndexOutOfBoundsException: bitIndex < 0: -27
[junit]     at java.util.BitSet.set(BitSet.java:262)
[junit]     at org.apache.lucene.util.TestBytesRefHash.testCompact(TestBytesRefHash.java:146)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)
[junit]
[junit] Test org.apache.lucene.util.TestBytesRefHash FAILED
{noformat}

The test expects that _TestUtil.randomRealisticUnicodeString(random, 1000) will never return the same string twice. I will upload a patch soon.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
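Simon's diagnosis amounts to: the test assumed every call to the random-string helper returns a distinct string, which isn't guaranteed. A common fix for this class of test bug is to deduplicate through a Set. A hedged sketch follows -- `randomString` is a hypothetical stand-in for _TestUtil.randomRealisticUnicodeString, and this is not the attached patch:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class UniqueRandomStrings {
    /** Generate count distinct strings, silently skipping duplicates, so later
     *  bookkeeping that assumes uniqueness (e.g. a BitSet of ids) holds. */
    static Set<String> generate(Random random, int count) {
        Set<String> seen = new HashSet<>();
        while (seen.size() < count) {
            seen.add(randomString(random));  // adding a duplicate is a no-op
        }
        return seen;
    }

    // Hypothetical stand-in for _TestUtil.randomRealisticUnicodeString.
    static String randomString(Random random) {
        int len = 1 + random.nextInt(8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }
}
```

The Set absorbs repeats from the generator, so downstream assertions about "each string seen exactly once" cannot be tripped by an unlucky seed.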
[jira] [Updated] (LUCENE-3062) TestBytesRefHash#testCompact is broken
[ https://issues.apache.org/jira/browse/LUCENE-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3062: Attachment: LUCENE-3062.patch here is a patch
Re: MergePolicy Thresholds
Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges)? It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much?
:) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785
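The arithmetic in the message above can be made concrete. This is an illustrative sketch only (the helper names are invented, not LogMergePolicy API): the LogMP-style settings cap each merge *input* at maxMergeMB, so the largest produced segment is roughly maxMergeMB * mergeFactor, and any segment already over the input cap is never merged again.

```java
public class MergeMath {
    // Largest segment a LogMP-style policy can produce:
    // mergeFactor inputs, each at most maxMergeMB.
    static double largestProducedMB(double maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    // A segment above the input cap is never selected for merging again --
    // this is exactly the 5/7 GB problem described in the thread.
    static boolean eligibleForMerge(double segmentMB, double maxMergeMB) {
        return segmentMB <= maxMergeMB;
    }

    public static void main(String[] args) {
        // Shai's example: 2 GB threshold, mergeFactor=10 => ~20 GB output.
        System.out.println(largestProducedMB(2048, 10)); // 20480.0
        // ...but a 5 GB segment already exceeds the 2 GB input cap,
        // so it is stuck well below the 20 GB goal.
        System.out.println(eligibleForMerge(5 * 1024, 2048)); // false
    }
}
```

A single maxResultSegmentSizeMB knob, as proposed, would invert this: eligibility would be decided by the projected *output* size rather than each input's size.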
[jira] [Assigned] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-3054: - Assignee: Uwe Schindler SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
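Robert's observation -- a comparator that returns 0 for many distinct elements drives quicksort into unbalanced partitions and deep recursion -- can be sketched as follows. The fields of this simplified PostingsAndFreq are an illustration, not Lucene's actual class; the point is the missing tiebreaker.

```java
import java.util.Comparator;

public class TieBreaker {
    // Simplified stand-in for the PostingsAndFreq mentioned in the issue.
    static class PostingsAndFreq {
        final int docFreq;
        final int position;
        PostingsAndFreq(int docFreq, int position) {
            this.docFreq = docFreq;
            this.position = position;
        }
    }

    // Broken: in a large array, thousands of entries sharing one docFreq
    // all compare equal, giving quicksort pathological partitions.
    static final Comparator<PostingsAndFreq> BROKEN =
            (a, b) -> Integer.compare(a.docFreq, b.docFreq);

    // Fixed: break ties on a second field so the ordering is total.
    static final Comparator<PostingsAndFreq> TIEBROKEN = (a, b) -> {
        int cmp = Integer.compare(a.docFreq, b.docFreq);
        return cmp != 0 ? cmp : Integer.compare(a.position, b.position);
    };

    public static void main(String[] args) {
        PostingsAndFreq x = new PostingsAndFreq(3, 1);
        PostingsAndFreq y = new PostingsAndFreq(3, 2);
        System.out.println(BROKEN.compare(x, y));    // 0: indistinguishable
        System.out.println(TIEBROKEN.compare(x, y)); // -1: now ordered
    }
}
```

This also explains Otis's workaround: switching the call site to mergeSort hides the symptom (merge sort's recursion depth is bounded regardless of the comparator), but the tiebreaker fixes the cause.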
Re: MergePolicy Thresholds
I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what the right combination is. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) [...]
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027662#comment-13027662 ] Uwe Schindler commented on LUCENE-3054: --- Due to the realtime merge (LUCENE-3023), suddenly DocFieldProcessor got a reincarnation of quicksort again... will remove, too
[jira] [Resolved] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-3061. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Committed revision 1098543 (3x). Committed revision 1098576 (trunk). Open IndexWriter API to allow custom MergeScheduler implementation -- Key: LUCENE-3061 URL: https://issues.apache.org/jira/browse/LUCENE-3061 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3061.patch, LUCENE-3061.patch IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3023: -- Attachment: LUCENE-3023-quicksort-reincarnation.patch Here is the patch. Will commit soon. Land DWPT on trunk -- Key: LUCENE-3023 URL: https://issues.apache.org/jira/browse/LUCENE-3023 Project: Lucene - Java Issue Type: Task Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3023-quicksort-reincarnation.patch, LUCENE-3023-svn-diff.patch, LUCENE-3023-ws-changes.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_iw_iwc_jdoc.patch, LUCENE-3023_simonw_review.patch, LUCENE-3023_svndiff.patch, LUCENE-3023_svndiff.patch, diffMccand.py, diffSources.patch, diffSources.patch, realtime-TestAddIndexes-3.txt, realtime-TestAddIndexes-5.txt, realtime-TestIndexWriterExceptions-assert-6.txt, realtime-TestIndexWriterExceptions-npe-1.txt, realtime-TestIndexWriterExceptions-npe-2.txt, realtime-TestIndexWriterExceptions-npe-4.txt, realtime-TestOmitTf-corrupt-0.txt With LUCENE-2956 we have resolved the last remaining issue for LUCENE-2324, so we can proceed with landing the DWPT development on trunk soon. I think one of the bigger issues here is to make sure that all JavaDocs for IW etc. are still correct though. I will start going through that first. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened LUCENE-3023: --- I reopen this one, as the merge added a reincarnation of quicksort in DocFieldProcessor (which was previously removed in the corresponding *PerThread class, but lost during the merge). I will fix soon.
Re: MergePolicy Thresholds
Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). [...] -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785
[jira] [Resolved] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-3023. --- Resolution: Fixed Removed quicksort in revision 1098592
Re: MergePolicy Thresholds
> The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/

I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

> It neatly avoids uber-merges

I didn't see that I can define what an uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create a 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: [...]
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Here is the patch that combines Robert's optimization for PhraseQuery (a term with a lower docFreq will also have fewer positions) and the general safety net for quickSort.
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Fix Version/s: 4.0, 3.2, 3.1.1 Set fix versions (also backport to 3.1.1, as it's serious for some large PhraseQueries and a serious slowdown then).
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Sorry, the safety net is only needed at 40 (from my tests); otherwise it may affect BytesRefHash performance. I will commit later!
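One way to read the "safety net" under discussion is an introsort-style depth cap: let quicksort recurse normally, but past a fixed depth (40 is the value mentioned above) fall back to a sort whose recursion is bounded regardless of the comparator. The sketch below is an assumption about the approach, not the committed patch, and Arrays.sort stands in for the mergeSort fallback.

```java
import java.util.Arrays;

public class SafeQuickSort {
    static final int DEPTH_LIMIT = 40; // threshold mentioned in the thread

    static void sort(int[] a, int from, int to, int depth) {
        if (to - from <= 1) return;
        if (depth > DEPTH_LIMIT) {
            // Safety net: a broken comparator can now only slow things
            // down, never overflow the stack.
            Arrays.sort(a, from, to);
            return;
        }
        // Plain Hoare-partition quicksort on the middle element.
        int pivot = a[from + (to - from) / 2];
        int i = from, j = to - 1;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        sort(a, from, j + 1, depth + 1);
        sort(a, i, to, depth + 1);
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 3, 3, 1, 4, 3, 2};
        sort(data, 0, data.length, 0);
        System.out.println(Arrays.toString(data));
    }
}
```

The cap is deliberately generous: well-behaved inputs recurse to roughly log2(n) and never hit it, which matches the remark that a lower threshold would start costing BytesRefHash performance.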
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Better test that fails faster in case of the quickSort bug
[Lucene.Net] fw: resolving github mirror issues
Is there any reason not to replace the old mirror with the newly created one? - Michael -- Hi, On Tue, Apr 26, 2011 at 7:51 PM, Michael Herndon mhern...@wickedsoftware.net wrote: Would it be possible to get the git mirror to reflect that or at least create a new mirror for the lucene.net repo that is under incubator? Unfortunately our mirroring scripts can't handle an svn move that wasn't done as a single commit (svn move .../lucene/lucene.net .../incubator/lucene.net), so I'll need to recreate the mirror. If and when you move back to Lucene or to a TLP, I suggest you move the full svn tree in a single commit. Do you still need the old mirror repository, or is it OK if I simply replace it with the newly created one? BR, Jukka Zitting
[jira] [Resolved] (LUCENE-3062) TestBytesRefHash#testCompact is broken
[ https://issues.apache.org/jira/browse/LUCENE-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3062. - Resolution: Fixed TestBytesRefHash#testCompact is broken -- Key: LUCENE-3062 URL: https://issues.apache.org/jira/browse/LUCENE-3062 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3062.patch TestBytesRefHash#testCompact fails when run with ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360 {noformat} [junit] Testsuite: org.apache.lucene.util.TestBytesRefHash [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.454 sec [junit] [junit] - Standard Error - [junit] NOTE: reproduce with: ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360 [junit] NOTE: test params are: codec=PreFlex, locale=et, timezone=Pacific/Tahiti [junit] NOTE: all tests run in this JVM: [junit] [TestBytesRefHash] [junit] NOTE: Linux 2.6.35-28-generic amd64/Sun Microsystems Inc. 1.6.0_24 (64-bit)/cpus=12,threads=1,free=363421800,total=379322368 [junit] - --- [junit] Testcase: testCompact(org.apache.lucene.util.TestBytesRefHash): Caused an ERROR [junit] bitIndex 0: -27 [junit] java.lang.IndexOutOfBoundsException: bitIndex 0: -27 [junit] at java.util.BitSet.set(BitSet.java:262) [junit] at org.apache.lucene.util.TestBytesRefHash.testCompact(TestBytesRefHash.java:146) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) [junit] [junit] [junit] Test org.apache.lucene.util.TestBytesRefHash FAILED {noformat} the test expects that _TestUtil.randomRealisticUnicodeString(random, 1000); will never return the same string. I will upload a patch soon. 
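The TestBytesRefHash failure above comes down to assuming random strings are always distinct. A minimal sketch of the fix direction, collecting generated strings through a Set so duplicates are handled explicitly (the simple digit generator below is just a stand-in for _TestUtil.randomRealisticUnicodeString, which is the actual Lucene test utility):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch only: random string generators can repeat values, so a test that
// maps strings to slots must deduplicate first instead of assuming uniqueness.
public class DistinctStrings {
    static Set<String> distinct(Random random, int count) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < count; i++) {
            // Tiny alphabet on purpose: repeats are near-certain, like the
            // collisions that broke testCompact with some seeds.
            seen.add(Integer.toString(random.nextInt(10)));
        }
        return seen;
    }
}
```

The point is that the number of distinct values can be far smaller than the number of calls to the generator, which is exactly what the failing seed exposed.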
Re: MergePolicy Thresholds
Actually the new TieredMergePolicy (only on trunk currently, but I plan to backport it for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges)? It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai
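The threshold arithmetic Shai describes can be sketched in plain Java. The method names here are hypothetical helpers, not part of the Lucene API; they just model why a LogMergePolicy-style cap of 2GB with mergeFactor=10 yields roughly a 20GB ceiling, while pre-existing 5GB and 7GB segments above the cap are never merged again:

```java
// Illustrative model of LogMP-style sizing, not actual Lucene code.
public class LogMpMath {
    // The largest segment the index ends up holding is roughly
    // threshold * mergeFactor (merging mergeFactor segments at the cap).
    static long approxLargestSegmentMB(long maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    // A segment already above maxMergeMB is never considered for merging,
    // which is why the 5 GB and 7 GB segments in the example get stuck.
    static boolean isMergeCandidate(long segmentSizeMB, long maxMergeMB) {
        return segmentSizeMB <= maxMergeMB;
    }
}
```

For example, approxLargestSegmentMB(2048, 10) gives 20480 (about 20GB), while isMergeCandidate(5 * 1024, 2048) is false: the existing 5GB segment sits above the 2GB threshold and is simply skipped.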
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Final patch. After some discussion with Robert: the use of quickSort is fine after the comparator was fixed to not sort only by docFreq.
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027702#comment-13027702 ] Uwe Schindler commented on LUCENE-3054: --- Committed trunk revision: 1098633 Now merging...
[jira] [Resolved] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-3054. --- Resolution: Fixed Merged 3.x revision: 1098639 Merged 3.1 revision: 1098641
[jira] [Updated] (LUCENE-3063) factor CharTokenizer/CharacterUtils into analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3063: Attachment: LUCENE-3063.patch factor CharTokenizer/CharacterUtils into analyzers module - Key: LUCENE-3063 URL: https://issues.apache.org/jira/browse/LUCENE-3063 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3063.patch Currently these analysis components are in the lucene core, but should really be .util in the analyzers module. Also, with MockTokenizer extending Tokenizer directly, we can add some additional checks in the future to try to ensure our consumers are being good consumers (e.g. calling reset). This is mentioned in http://wiki.apache.org/lucene-java/TestIdeas, I didn't implement it here yet, this is just the factoring. I think we should try to do this before LUCENE-3040. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3063) factor CharTokenizer/CharacterUtils into analyzers module
factor CharTokenizer/CharacterUtils into analyzers module - Key: LUCENE-3063 URL: https://issues.apache.org/jira/browse/LUCENE-3063 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3063.patch Currently these analysis components are in the lucene core, but should really be .util in the analyzers module. Also, with MockTokenizer extending Tokenizer directly, we can add some additional checks in the future to try to ensure our consumers are being good consumers (e.g. calling reset). This is mentioned in http://wiki.apache.org/lucene-java/TestIdeas, I didn't implement it here yet, this is just the factoring. I think we should try to do this before LUCENE-3040. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-3054: Reopening so we can discuss things further: QuickSort is dangerous! Yet, it's definitely faster than MergeSort for some cases (~20% faster when sorting terms for writing a segment, in a quick test I ran on Wikipedia content).

So the core issue is we should not use QS when there's a risk of any ties, because in that case it can run really slowly or hit infinite recursion. And we (well, Otis; thank you!) found one such place today (where MultiPhraseQuery sorts its terms) where we could have many ties and thus run very slowly / hit stack overflow.

I appreciate the motivation for the safety net, but it makes me nervous... because, say we had done this a few months back... then Otis likely would not have reported the issue? Ie, the MultiPhraseQuery would run slowly... which could evade detection (people may just think it's slow). I prefer brittle fails over silent slowdowns because the brittle fail gets your attention and you get a real fix in. Silent slowdowns evade detection. Sort of like the difference between a virus and spyware...

Also, what's preventing us from accidentally using QS somewhere in the future, where we shouldn't? What's going to catch us? Robert's first patch would catch this and protect us going forward? Or, maybe we could strengthen that approach and assert cmp != 0 inside QS (ie, no ties are allowed to be passed to QS)? Though, using asserts only is risky, because it could be that the comparator may return 0, but none of our test cases tickled it.

Maybe instead we could do this in a type-safe way: make a new NoTiesComparator whose compare method can only return LESS_THAN or GREATER_THAN? And then QS would require NoTiesComparator. Could that work?
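The type-safe idea floated in this reopen comment can be sketched in a few lines: give the comparator a return type that has no value meaning "equal", so a tie literally cannot be expressed. All names below are hypothetical, not actual Lucene classes:

```java
// Sketch of a "no ties" comparator contract: the enum has no EQUAL value,
// so an implementation is forced to break ties somehow (here, by ord).
public class NoTiesSketch {
    enum Order { LESS_THAN, GREATER_THAN }

    interface NoTiesComparator<T> {
        Order compare(T a, T b); // caller guarantees a and b are distinct entries
    }

    // Entries that may carry equal terms are disambiguated by their ord
    // (position in the original array), which is unique by construction.
    static class Entry {
        final String term;
        final int ord;
        Entry(String term, int ord) { this.term = term; this.ord = ord; }
    }

    static final NoTiesComparator<Entry> BY_TERM_THEN_ORD = (a, b) -> {
        int cmp = a.term.compareTo(b.term);
        if (cmp != 0) return cmp < 0 ? Order.LESS_THAN : Order.GREATER_THAN;
        // terms tie: the unique ord always decides
        return a.ord < b.ord ? Order.LESS_THAN : Order.GREATER_THAN;
    };
}
```

A quickSort that only accepts a NoTiesComparator would then reject tie-prone comparators at compile time rather than relying on asserts that a test seed may never tickle.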
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027730#comment-13027730 ] Michael McCandless commented on LUCENE-3054: Also, I think PQ.PostingsAndFreq.compare is still able to return ties, if the app puts the same term at the same position (which is a silly thing to do... but, still possible). I think instead of disambiguating by Term, we should disambiguate by ord (ie, position of this term in the array of the query itself), since that can never be the same for entries in the array?
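The ord-based tiebreak suggested here can be sketched with a plain Comparator. The class and field names are illustrative stand-ins, not the actual MultiPhraseQuery internals; the point is that since each entry's ord (its index in the query's own array) is unique, compare() can return 0 only for an entry against itself:

```java
import java.util.Comparator;

// Illustrative PostingsAndFreq-style entry with an ord tiebreak.
public class OrdTieBreak {
    static class PostingsAndFreqLike {
        final int position; // term position in the phrase
        final int ord;      // index of this entry in the query's own array
        PostingsAndFreqLike(int position, int ord) {
            this.position = position;
            this.ord = ord;
        }
    }

    // Compare by position first; equal positions fall through to the
    // unique ord, so no ties can reach the sort.
    static final Comparator<PostingsAndFreqLike> BY_POSITION_THEN_ORD =
        Comparator.<PostingsAndFreqLike>comparingInt(e -> e.position)
                  .thenComparingInt(e -> e.ord);
}
```

Unlike disambiguating by Term, this stays total even when the application adds the same term at the same position twice.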
Re: MergePolicy Thresholds
Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai On Mon, May 2, 2011 at 6:34 PM, Michael McCandless luc...@mikemccandless.com wrote: Actually the new TieredMergePolicy (only on trunk currently but I plan to backport for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com
Re: MergePolicy Thresholds
I think it should be an easy port... Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote: Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai
[jira] [Commented] (LUCENE-2945) Surround Query doesn't properly handle equals/hashcode
[ https://issues.apache.org/jira/browse/LUCENE-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027755#comment-13027755 ] Paul Elschot commented on LUCENE-2945: -- Does the latest patch solve the original problem as expected? Surround Query doesn't properly handle equals/hashcode -- Key: LUCENE-2945 URL: https://issues.apache.org/jira/browse/LUCENE-2945 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 3.1.1, 4.0 Attachments: LUCENE-2945-partial1.patch, LUCENE-2945.patch, LUCENE-2945.patch, LUCENE-2945.patch, LUCENE-2945c.patch, LUCENE-2945d.patch, LUCENE-2945d.patch In looking at using the surround queries with Solr, I am hitting issues caused by collisions due to equals/hashcode not being implemented on the anonymous inner classes that are created by things like DistanceQuery (branch 3.x, near line 76) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: MergePolicy Thresholds
Hi Shai and Mike, Testing the TieredMP on our large indexes has been on my todo list since I read Mike's blog post http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html. If you port it to the 3.x branch Shai, I'll be more than happy to test it with our very large (300GB+) indexes. Besides being able to set the max merged segment size, I'm especially interested in using the maxSegmentsPerTier parameter. From Mike's blog post: ...maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027772#comment-13027772 ] Dawid Weiss commented on LUCENE-3054: - I'm sure many of you know this, but there is a new implementation of mergesort in java.util.Collections -- it is based on a few clever heuristics (so it is a merge sort, only a finely tuned one) and was ported from / partially inspired by the sort in Python, as far as I recall. Maybe it'd be sensible to compare against this and see what happens. I know Lucene/Solr would rather have its own implementation so that it doesn't rely on the standard library, but in my benchmarks the implementation in Collections.sort() was hard to beat...
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027774#comment-13027774 ] Uwe Schindler commented on LUCENE-3054: --- Dawid: There are two problems we have seen with the native sort: - it always copies the array/collection first, which caused slowdowns in lots of places, especially in automaton -- so it never sorts in place; - we sometimes need to sort multiple arrays in parallel, one as the sort key, especially in TermsHash/BytesRefHash. This is where SorterTemplate comes into play: it supports separate swap(i,j) and compare(i,j) operations. Uwe
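The parallel-array point Uwe makes can be sketched with a toy sort: the algorithm only ever calls compare(i, j) and swap(i, j), so a payload array stays aligned with the keys and nothing is copied. The real SorterTemplate lives in org.apache.lucene.util and provides quickSort/mergeSort; the insertion sort below just stands in for those:

```java
// Toy SorterTemplate-style sort: the sort body knows only compare/swap,
// so parallel arrays are kept in sync and the sort happens in place.
public class ParallelSort {
    final int[] keys;
    final String[] payload; // swapped along with keys

    ParallelSort(int[] keys, String[] payload) {
        this.keys = keys;
        this.payload = payload;
    }

    int compare(int i, int j) { return Integer.compare(keys[i], keys[j]); }

    void swap(int i, int j) {
        int k = keys[i]; keys[i] = keys[j]; keys[j] = k;
        String p = payload[i]; payload[i] = payload[j]; payload[j] = p;
    }

    // Stand-in for SorterTemplate's quickSort/mergeSort entry points.
    void insertionSort() {
        for (int i = 1; i < keys.length; i++)
            for (int j = i; j > 0 && compare(j - 1, j) > 0; j--)
                swap(j - 1, j);
    }
}
```

This is also why Collections.sort() is a poor fit here: it sorts a copy of a single collection, while TermsHash/BytesRefHash need the key array and its companion arrays permuted together, in place.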
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 7659 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7659/ 1 tests failed. REGRESSION: org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple Error Message: expected:<3> but was:<2> Stack Trace: junit.framework.AssertionFailedError: expected:<3> but was:<2> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1112) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1040) at org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple(TestLBHttpSolrServer.java:127) Build Log (for compile errors): [...truncated 10762 lines...]
[jira] [Resolved] (LUCENE-3059) PulsingTermState.clone leaks memory
[ https://issues.apache.org/jira/browse/LUCENE-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3059. Resolution: Fixed PulsingTermState.clone leaks memory --- Key: LUCENE-3059 URL: https://issues.apache.org/jira/browse/LUCENE-3059 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3059.patch I looked at the heap dump from the OOME this morning (thank you Uwe for turning this on!), and I think it's a real memory leak. Well, not really a leak; rather, the cloned PulsingTermState, which we cache in the terms dict cache, is hanging onto large byte[] unnecessarily. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027780#comment-13027780 ] Dawid Weiss commented on LUCENE-3054: - Thanks Uwe, I didn't know about it. Still, the algorithm the OpenJDK folks have implemented is public, so an improvement can be filed -- maybe somebody will find the time to implement it in a version suitable for Lucene. http://en.wikipedia.org/wiki/Timsort
RE: Link to nightly build test reports on main Lucene site needs updating
Thanks for fixing++ Tom -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Sunday, May 01, 2011 6:05 AM To: dev@lucene.apache.org; simon.willna...@gmail.com; java-u...@lucene.apache.org Subject: RE: Link to nightly build test reports on main Lucene site needs updating I fixed the nightly docs, once the webserver mirrors them from SVN they should appear. The developer-resources page was completely broken. It now also contains references to the stable 3.x branch as most users would prefer that one to fix latest bugs but don’t want to have a backwards-incompatible version. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
[jira] [Resolved] (SOLR-2467) Custom analyzer load exceptions are not logged.
[ https://issues.apache.org/jira/browse/SOLR-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-2467. Resolution: Fixed Fix Version/s: 4.0 3.2 Thanks for reporting this Alex Committed revision 1098760. Committed revision 1098764. Custom analyzer load exceptions are not logged. --- Key: SOLR-2467 URL: https://issues.apache.org/jira/browse/SOLR-2467 Project: Solr Issue Type: Bug Affects Versions: 3.1 Reporter: Alexander Kistanov Priority: Minor Fix For: 3.2, 4.0 If any exception occurs while loading a custom analyzer, the following catch block runs: {code:title=solr/src/java/org/apache/solr/schema/IndexSchema.java} } catch (Exception e) { throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, "Cannot load analyzer: " + analyzerName ); } {code} The analyzer load exception e is not logged at all. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
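The fix pattern here is to keep the original exception chained as the cause (and log it) instead of swallowing it. A minimal standalone sketch of that pattern, with stand-in class names (ServerException substitutes for SolrException, and the analyzer class name is made up):

```java
// Standalone sketch of the SOLR-2467 fix pattern: chain the caught
// exception as the cause so its stack trace survives for logging.
// ServerException is a stand-in for SolrException; names are illustrative.
class AnalyzerLoadDemo {
  static class ServerException extends RuntimeException {
    ServerException(String msg, Throwable cause) { super(msg, cause); }
  }

  static Object loadAnalyzer(String analyzerName) {
    try {
      // simulate the reflective class load failing
      throw new ClassNotFoundException(analyzerName);
    } catch (Exception e) {
      // before the fix: new ServerException("Cannot load analyzer: " + analyzerName)
      // -- e and its stack trace were dropped entirely
      throw new ServerException("Cannot load analyzer: " + analyzerName, e);
    }
  }

  public static void main(String[] args) {
    try {
      loadAnalyzer("com.example.MyAnalyzer");  // hypothetical class name
    } catch (ServerException e) {
      // the root cause now survives for any logging framework to print
      System.out.println(e.getCause().getClass().getSimpleName());  // ClassNotFoundException
    }
  }
}
```

In Solr itself the equivalent is logging via the schema's logger and/or passing `e` to a SolrException constructor that accepts a cause.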
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027808#comment-13027808 ] Michael McCandless commented on LUCENE-3054: So, there are two known improvements to our QS, to try to avoid the O(N^2) worst-case, both from Robert Sedgewick. First, it's better to select the median of low/mid/high as the pivot (http://en.wikipedia.org/wiki/Quicksort#Choice_of_pivot). Second, we should handle equal values better (http://www.angelfire.com/pq/jamesbarbetti/articles/sorting/001_QuicksortIsBroken.htm#Duplicates). See also Lucy's nice QS impl: http://svn.apache.org/viewvc/incubator/lucy/trunk/core/Lucy/Util/SortUtils.c?revision=1098445&view=markup#l331 which I think addresses the above two issues, and goes even further (eq-to-pivot values are explicitly moved to the middle and then not recursed on). The thing is, fixing these will make our QS more general, at the expense of some added cost for the cases we know work fine today (eg sorting terms before flushing a segment). Maybe we leave our QS as is (except, changing the 40 to be dynamic depending on input length), noting that you should not use it if your comparator does not break ties, and even if it does there are still risks because of potentially bad pivot selection? Or, maybe we remove QS and always use MS? Yes, there's a hit to the sort when flushing the segment, but this is a tiny cost compared to the rest of segment flushing... 
Separately we can look into whether timsort is faster for sorting terms for flush. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
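The two improvements described above (median-of-3 pivot selection, better handling of equal values) can be sketched together as a three-way "fat pivot" quicksort, where the run of pivot-equal values is placed in the middle and never recursed on. This is an illustrative standalone version with made-up names, not Lucene's SorterTemplate.

```java
import java.util.Arrays;

// Sketch of the two Sedgewick-style fixes: median-of-3 pivot selection
// plus a three-way (Dutch-flag) partition, so arrays with only a few
// distinct values -- the broken-comparator case that blew the stack --
// never trigger deep recursion on the equal run.
class ThreeWayQuickSort {
  static void sort(int[] a) {
    sort(a, 0, a.length - 1);
  }

  private static void sort(int[] a, int lo, int hi) {
    if (lo >= hi) return;
    // median-of-3: order a[lo], a[mid], a[hi]; a[mid] is then the median
    int mid = (lo + hi) >>> 1;
    if (a[mid] < a[lo]) swap(a, mid, lo);
    if (a[hi] < a[lo]) swap(a, hi, lo);
    if (a[hi] < a[mid]) swap(a, hi, mid);
    int pivot = a[mid];
    // three-way partition: a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot
    int lt = lo, gt = hi, i = lo;
    while (i <= gt) {
      if (a[i] < pivot) swap(a, lt++, i++);
      else if (a[i] > pivot) swap(a, i, gt--);
      else i++;
    }
    sort(a, lo, lt - 1);  // the equal run a[lt..gt] is already in place
    sort(a, gt + 1, hi);
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i]; a[i] = a[j]; a[j] = t;
  }

  public static void main(String[] args) {
    int[] a = new int[200_000];
    for (int i = 0; i < a.length; i++) a[i] = i % 3;  // only 3 distinct values
    sort(a);  // completes without deep recursion
    System.out.println(Arrays.toString(Arrays.copyOf(a, 5)));  // [0, 0, 0, 0, 0]
  }
}
```

With a naive quicksort, the same few-distinct-values input degrades toward O(N^2) and deep recursion; here each level strips out the entire equal run before recursing.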
[jira] [Resolved] (LUCENE-3029) MultiPhraseQuery assigns different scores to identical docs when using 0 pos-incr
[ https://issues.apache.org/jira/browse/LUCENE-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3029. Resolution: Fixed MultiPhraseQuery assigns different scores to identical docs when using 0 pos-incr - Key: LUCENE-3029 URL: https://issues.apache.org/jira/browse/LUCENE-3029 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.0.4, 3.2, 4.0 Attachments: LUCENE-3029.patch If you have two identical docs with tokens a b c all zero pos-incr (ie they occur on the same position), and you run a MultiPhraseQuery with [a, b] and [c] (all pos incr 0)... then the two docs will get different scores despite being identical. Admittedly it's a strange query... but I think the scorer ought to count the phrase as having tf=1 for each doc. The problem is that we are missing a tie-breaker for the PhraseQuery used by ExactPhraseScorer, and so the PQ ends up flip/flopping such that every other document gets the same score. Ie, even docIDs all get one score and odd docIDs all get another score. Once I added the hard tie-breaker (ord) the scores are the same. However... there's a separate bug, that can over-count the tf, such that if I create the MPQ like this: {noformat} mpq.add(new Term[] {new Term("field", "a")}, 0); mpq.add(new Term[] {new Term("field", "b"), new Term("field", "c")}, 0); {noformat} I get tf=2 per doc, but if I create it like this: {noformat} mpq.add(new Term[] {new Term("field", "b"), new Term("field", "c")}, 0); mpq.add(new Term[] {new Term("field", "a")}, 0); {noformat} I get tf=1 (which I think is correct?). This happens because MultipleTermPositions freely returns the same position more than once: it just unions the positions of the two streams, so when both have their term at pos=0, you'll get pos=0 twice, which is not good and leads to over-counting tf. Unfortunately, I don't see a performant way to fix that... 
and I'm not sure that it really matters that much in practice. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
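The over-counting Mike describes is easy to reproduce in isolation: a plain merge-union of two sorted position streams keeps duplicates, so when two terms both sit at position 0 the union yields 0 twice and tf is over-counted. This is a hypothetical sketch of the behavior, not the real MultipleTermPositions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why unioning two position streams over-counts tf: a plain
// sorted merge keeps duplicates, so co-located terms (pos-incr 0) emit
// the same position twice. Illustrative only.
class PositionUnionDemo {
  // merge two sorted position arrays, keeping duplicates (the buggy behavior)
  static List<Integer> union(int[] a, int[] b) {
    List<Integer> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length || j < b.length) {
      if (j == b.length || (i < a.length && a[i] <= b[j])) out.add(a[i++]);
      else out.add(b[j++]);
    }
    return out;
  }

  public static void main(String[] args) {
    // terms "b" and "c" both indexed at position 0 (zero pos-incr)
    System.out.println(union(new int[]{0}, new int[]{0}));  // [0, 0] -> tf over-counted
  }
}
```

De-duplicating during the merge would fix the count, but as the message notes, doing that without slowing down the common case is the hard part.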
[jira] [Updated] (SOLR-2484) Make SynonymFilterFactory more extendable
[ https://issues.apache.org/jira/browse/SOLR-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McKinley updated SOLR-2484: Attachment: SOLR-2484-SynonymFilterFactory.patch patch with a simple test Make SynonymFilterFactory more extendable - Key: SOLR-2484 URL: https://issues.apache.org/jira/browse/SOLR-2484 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2484-SynonymFilterFactory.patch As is, reading rules from the ResourceLoader is baked into inform(ResourceLoader loader). It would be nice to be able to load custom rules w/o needing to rewrite the whole thing. This issue changes two things: # Changes ListString to IterableString because we don't really need a list # adds protected IterableString loadRules( String synonyms, ResourceLoader loader ) -- so subclasses could fill their own -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2484) Make SynonymFilterFactory more extendable
Make SynonymFilterFactory more extendable - Key: SOLR-2484 URL: https://issues.apache.org/jira/browse/SOLR-2484 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2484-SynonymFilterFactory.patch As is, reading rules from the ResourceLoader is baked into inform(ResourceLoader loader). It would be nice to be able to load custom rules w/o needing to rewrite the whole thing. This issue changes two things: # Changes ListString to IterableString because we don't really need a list # adds protected IterableString loadRules( String synonyms, ResourceLoader loader ) -- so subclasses could fill their own -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
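The extension point this issue proposes -- a protected `loadRules(String, ResourceLoader)` hook returning `Iterable<String>` -- would let a subclass supply rules from any source without rewriting `inform`. A sketch with the Solr types stubbed out; the subclass and its rule strings are made up for illustration.

```java
import java.util.Arrays;

// Sketch of the SOLR-2484 extension point: subclasses override loadRules
// to supply synonym rules from somewhere other than a resource file.
// ResourceLoader is stubbed; the real one is Solr's.
class CustomRulesDemo {
  interface ResourceLoader {}  // stand-in for org.apache.solr.common.ResourceLoader

  static class SynonymFilterFactory {
    // the hook the issue adds: Iterable, because callers only iterate
    protected Iterable<String> loadRules(String synonyms, ResourceLoader loader) {
      // default behavior: read the named resource via the loader (omitted here)
      throw new UnsupportedOperationException("resource loading not shown");
    }
  }

  // hypothetical subclass feeding rules from, say, a database
  static class DatabaseSynonymFactory extends SynonymFilterFactory {
    @Override
    protected Iterable<String> loadRules(String synonyms, ResourceLoader loader) {
      return Arrays.asList("couch,sofa", "tv,television");
    }
  }

  public static void main(String[] args) {
    for (String rule : new DatabaseSynonymFactory().loadRules("ignored", null)) {
      System.out.println(rule);
    }
  }
}
```

Returning `Iterable<String>` rather than `List<String>` keeps the contract minimal, which is the second change the issue describes.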
[jira] [Updated] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Matheis (steffkes) updated SOLR-2399: Description: *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin Quick Tour: [Dashboard|http://files.mathe.is/solr-admin/01_dashboard.png], [Query-Form|http://files.mathe.is/solr-admin/02_query.png], [Plugins|http://files.mathe.is/solr-admin/05_plugins.png], [Logging|http://files.mathe.is/solr-admin/07_logging.png], [Analysis|http://files.mathe.is/solr-admin/04_analysis.png], [Schema-Browser|http://files.mathe.is/solr-admin/06_schema-browser.png], [Dataimport|http://files.mathe.is/solr-admin/08_dataimport.png], [Core-Admin|http://files.mathe.is/solr-admin/09_coreadmin.png] Newly created Wiki-Page: http://wiki.apache.org/solr/ReworkedSolrAdminGUI was: *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin Quick Tour: [Dashboard|http://files.mathe.is/solr-admin/01_dashboard.png], [Query-Form|http://files.mathe.is/solr-admin/02_query.png], [Plugins|http://files.mathe.is/solr-admin/05_plugins.png], [Logging|http://files.mathe.is/solr-admin/07_logging.png], [Analysis|http://files.mathe.is/solr-admin/04_analysis.png], [Schema-Browser|http://files.mathe.is/solr-admin/06_schema-browser.png], [Dataimport|http://files.mathe.is/solr-admin/08_dataimport.png] Newly created Wiki-Page: http://wiki.apache.org/solr/ReworkedSolrAdminGUI Solr Admin 
Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor Fix For: 4.0 *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin Quick Tour: [Dashboard|http://files.mathe.is/solr-admin/01_dashboard.png], [Query-Form|http://files.mathe.is/solr-admin/02_query.png], [Plugins|http://files.mathe.is/solr-admin/05_plugins.png], [Logging|http://files.mathe.is/solr-admin/07_logging.png], [Analysis|http://files.mathe.is/solr-admin/04_analysis.png], [Schema-Browser|http://files.mathe.is/solr-admin/06_schema-browser.png], [Dataimport|http://files.mathe.is/solr-admin/08_dataimport.png], [Core-Admin|http://files.mathe.is/solr-admin/09_coreadmin.png] Newly created Wiki-Page: http://wiki.apache.org/solr/ReworkedSolrAdminGUI -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027846#comment-13027846 ] Stefan Matheis (steffkes) commented on SOLR-2399: - Just because I had a quick idea for it this morning -- the [Core-Admin Screen|http://files.mathe.is/solr-admin/09_coreadmin.png]. Add Core will open an additional Layer with a Form where you could type all required information. Actually the Functionality for the Buttons is missing; it will be added tomorrow. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027850#comment-13027850 ] Otis Gospodnetic commented on SOLR-2399: Thanks for doing all this, Stefan! I looked at the Analysis screenshot and found it a bit hard to eyeball quickly because the whole thing feels very pale, which makes it hard for an eye to quickly jump from tokenizer, to token filter, to next token filter, etc. It's also not immediately obvious what the left side vs. right side are, so maybe a more visible "Index-time Analysis" and "Query-time Analysis" may help. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2485) Remove BaseResponseWriter, GenericBinaryResponseWriter, and GenericTextResponseWriter
Remove BaseResponseWriter, GenericBinaryResponseWriter, and GenericTextResponseWriter - Key: SOLR-2485 URL: https://issues.apache.org/jira/browse/SOLR-2485 Project: Solr Issue Type: Task Components: Response Writers Reporter: Ryan McKinley Fix For: 4.0 In SOLR-1566, we dramatically refactored the response writer framework -- BaseResponseWriter, GenericBinaryResponseWriter, and GenericTextResponseWriter got left out in the cold because they don't have any tests and it is unclear how they are supposed to work. With the new refactoring, I think the goals of these classes are better supported by extending BinaryResponseWriter and TextResponseWriter. in 3.x, these classes should be deprecated and suggest using BinaryResponseWriter and TextResponseWriter -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: MergePolicy Thresholds
The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ I agree. I wonder though if the knobs we give on LogMP are intuitive enough. It neatly avoids uber-merges I didn't see that I can define what uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create a 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. No, you can't. But you can tell it to have exactly (not 'at most') N top-tier segments and try to keep their sizes close with merges. Whatever that size may be. And this is exactly what I want. And defining a max cap on segment size is not what I want. So the same set of knobs can be intuitive and meaningful for one person, and useless for another. And you can't pick the best one. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. 
Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. 
I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail:
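The arithmetic behind the thread can be sketched with the hypothetical numbers Shai uses: under a LogMP-style cutoff, a segment above maxMergeMB is never a merge candidate, so the largest segment a merge can produce is roughly maxMergeMB * mergeFactor, and existing 5-7 GB segments sit forever above a 2 GB threshold. This is just the size arithmetic, not a MergePolicy implementation.

```java
// Back-of-envelope for the LogMP thresholds discussed above (sizes in MB).
// Hypothetical numbers from the thread; not an actual MergePolicy.
class MergeThresholdDemo {
  // LogMP-style cutoff: segments above maxMergeMB are never merge candidates
  static boolean mergeCandidate(double segMB, double maxMergeMB) {
    return segMB <= maxMergeMB;
  }

  // rough ceiling on the segment size a single merge can produce
  static double maxProducedMB(double maxMergeMB, int mergeFactor) {
    return maxMergeMB * mergeFactor;
  }

  public static void main(String[] args) {
    double maxMergeMB = 2048;  // 2 GB threshold
    int mergeFactor = 10;
    System.out.println(maxProducedMB(maxMergeMB, mergeFactor) / 1024 + " GB");  // 20.0 GB
    System.out.println(mergeCandidate(5 * 1024, maxMergeMB));  // false: 5 GB segment never merges
    System.out.println(mergeCandidate(7 * 1024, maxMergeMB));  // false: 7 GB segment never merges
  }
}
```

A maxResultSegmentSizeMB-style knob would instead bound the *output* size, letting the 5 GB and 7 GB segments merge as long as their sum stays under the cap.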
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7666 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7666/ All tests passed Build Log (for compile errors): [...truncated 7968 lines...] [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> whitetok = fieldNames.get("whitetok"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] indexPart = whitetok.get("index"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] queryPart = whitetok.get("query"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:288: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> keywordtok = fieldNames.get("keywordtok"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:291: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] indexPart = 
keywordtok.get("index"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:299: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] queryPart = keywordtok.get("query"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:320: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> fieldTypes = result.get("field_types"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:322: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> textType = fieldTypes.get("charfilthtmlmap"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:331: warning: [unchecked] unchecked cast [javac] found : java.lang.Object [javac] required: java.util.List<org.apache.solr.common.util.NamedList> [javac] List<NamedList> tokenList = (List<NamedList>) indexPart.get("org.apache.lucene.analysis.core.WhitespaceTokenizer"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:154: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList [javac] spellchecker.add(AbstractLuceneSpellChecker.DICTIONARY_NAME, "default"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:155: warning: [unchecked]
Re: modularization discussion
On Apr 27, 2011, at 11:45 PM, Greg Stein wrote: On Wed, Apr 27, 2011 at 09:25:14AM -0400, Yonik Seeley wrote: ... But as I said... it seems only fair to meet half way and use the solr namespace for some modules and the lucene namespace for others. Please explain this part to me... I really don't understand. At the risk of speaking for someone else, I think it has to do w/ wanting to maintain brand awareness for Solr. We, as the PMC, currently produce two products: Apache Lucene and Apache Solr. I believe Yonik's concern is that if everything is just labeled Lucene, then Solr is just seen as a very thin shell around Lucene (which, IMO, would still not be the case, since wiring together a server app like Solr is non-trivial, but that is my opinion and I'm not sure if Yonik shares it). Solr has never been a thin shell around Lucene and never will be. However, in some ways, this gets at why I believe Yonik was interested in a Solr TLP: so that Solr could stand on its own as a brand and as a first-class Apache product steered by a PMC that is aligned solely w/ producing the Solr (i.e. as a TLP) product as opposed to the two products we produce now. (Note, my vote on such a TLP was -1, so please don't confuse me as arguing for the point; I'm just trying to, hopefully, explain it.) That being said, 99% of consumers of Solr never even know what is in the underlying namespace b/c they only ever interact w/ Solr via HTTP (which has solr in the namespace by default) at the server API level, so at least in my mind, I don't care what the namespace used underneath is. Call it lusolr for all I care. What does fairness have to do with the codebase? I can't speak to this, but perhaps it's just the wrong choice of words and would have been better said: please don't take this as a reason to gut Solr and call everything Lucene. Isn't the whole point of the Lucene project to create the best code possible, for the benefit of our worldwide users? It is. 
We do that primarily through the release of two products: Lucene and Solr. Lucene is a Java class library. A good deal of programming is required to create anything meaningful in terms of a production-ready search server. Solr is a server that takes most things that are programming tasks in Lucene and makes them configuration tasks, as well as adds a fair bit of functionality (distributed search, replication, faceting, auto-suggest, etc.) and is thus that much easier to put in production (I've seen people be in production on Solr in a matter of days/weeks; I've never seen that with Lucene). The crux of this debate is whether these additional pieces are better served as modules (I think they are) or tightly coupled inside of Solr (which does have a few benefits from a dev. point of view, even though I firmly believe they are outweighed by the positives of modularization.) And, while I think most of us agree that modularization makes sense, that doesn't mean there aren't reasons against it. I also believe we need to take it on a case by case basis. I also don't think every patch has to be in its final place on first commit. As Otis so often says, it's just software. If it doesn't work, change it. Thus, if people contribute and it lands in Solr, the committer who commits it need not immediately move it (although, hopefully they will) or ask the contributor to do so, as that will likely dampen contributions. Likewise for Lucene. Along with that, if and when others wish to refactor, then they should by all means be allowed to do so, assuming of course, all tests across both products still pass. In short, I believe people should still contribute where they see they can add the most value and according to their time schedules. Additionally, others who have more time or the ability to refactor for reusability should be free to do so as well. 
I don't know what the outcome of this thread should be, so I guess we need to just move forward and keep coding away and working to make things better. Do others see anything broader here? A vote? That would be symbolic, I guess, but doesn't force anyone to do anything since there isn't a specific issue at hand other than a broad concept that is seen as good. -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027878#comment-13027878 ] Ryan McKinley commented on SOLR-2399: - Stefan -- this stuff is looking great! Would you mind uploading a snapshot of your repo to this issue? I would like to start a branch in the apache repo, but we need to have the proper Apache release boxes ticked (part of the process when you upload a patch). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7667 - Still Failing
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7667/

All tests passed

Build Log (for compile errors):
[...truncated 7958 lines...]
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> whitetok = fieldNames.get("whitetok");
    [javac]                                                    ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     indexPart = whitetok.get("index");
    [javac]                              ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     queryPart = whitetok.get("query");
    [javac]                              ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:288: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> keywordtok = fieldNames.get("keywordtok");
    [javac]                                                      ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:291: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     indexPart = keywordtok.get("index");
    [javac]                                ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:299: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     queryPart = keywordtok.get("query");
    [javac]                                ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:320: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> fieldTypes = result.get("field_types");
    [javac]                                                  ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:322: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> textType = fieldTypes.get("charfilthtmlmap");
    [javac]                                                    ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:331: warning: [unchecked] unchecked cast
    [javac] found   : java.lang.Object
    [javac] required: java.util.List<org.apache.solr.common.util.NamedList>
    [javac]     List<NamedList> tokenList = (List<NamedList>)indexPart.get("org.apache.lucene.analysis.core.WhitespaceTokenizer");
    [javac]                                                               ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:154: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
    [javac]     spellchecker.add(AbstractLuceneSpellChecker.DICTIONARY_NAME, "default");
    [javac]                 ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:155: warning: [unchecked]
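All of the warnings above are one pattern: a raw `NamedList` flowing into a parameterized type, which javac cannot verify. A self-contained illustration with plain `java.util.List` (a hypothetical demo class, not the actual Solr test code) showing both the warning and the conventional fix:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: the same [unchecked] conversion warning javac
// emits in the log above, reproduced with a plain raw List.
public class UncheckedDemo {
    static List<String> viaRawType() {
        List raw = new ArrayList();    // raw type, like a raw NamedList
        raw.add("whitetok");
        List<String> names = raw;      // javac: warning: [unchecked] unchecked conversion
        return names;
    }

    static List<String> fixed() {
        List<String> typed = new ArrayList<>(); // parameterize the source: no warning
        typed.add("whitetok");
        return typed;
    }
}
```

The fix is mechanical but has to happen at the producing side (here, the declaration of `raw`); casting at the consuming side only trades the conversion warning for an unchecked-cast warning, as the `tokenList` line in the log shows.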
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027892#comment-13027892 ] Uwe Schindler commented on LUCENE-3054:

{quote} Maybe we leave our QS as is (except, changing the 40 to be dynamic depending on input length), noting that you should not use it if your comparator does not break ties, and even if it does there are still risks because of potentially bad pivot selection? {quote}

That looks like this: http://en.wikipedia.org/wiki/Introsort We only need a good recursion depth at which to switch!

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7668 - Still Failing
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7668/ All tests passed Build Log (for compile errors): [...truncated 7958 lines...] [javac] (identical [unchecked] warnings in FieldAnalysisRequestHandlerTest.java and SpellCheckComponentTest.java as build 7667) [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:155: warning: [unchecked]
[jira] [Commented] (SOLR-2191) Change SolrException cstrs that take Throwable to default to alreadyLogged=false
[ https://issues.apache.org/jira/browse/SOLR-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027915#comment-13027915 ] Hoss Man commented on SOLR-2191:

Is anyone else interested in entertaining the notion that the alreadyLogged concept is more trouble than it's worth and we should just rip the whole damn thing out? (deprecate logOnce, etc...) is there such a thing as logging an exception too much? and if there is, couldn't we fix those code paths to be less chatty?

Change SolrException cstrs that take Throwable to default to alreadyLogged=false Key: SOLR-2191 URL: https://issues.apache.org/jira/browse/SOLR-2191 Project: Solr Issue Type: Bug Reporter: Mark Miller Fix For: Next Attachments: SOLR-2191.patch

Because of misuse, many exceptions are now not logged at all - can be painful when doing dev. I think we should flip this setting and work at removing any double logging - losing logging is worse (and it almost looks like we lose more logging than we would get in double logging) - and bad solrexception/logging patterns are proliferating.
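For context, the alreadyLogged concept under debate is essentially a flag carried on the exception so catch sites can skip duplicate logging. A self-contained sketch of that general pattern (hypothetical class and method names, not the real SolrException API):

```java
// Sketch of the "log once" pattern SOLR-2191 debates: the exception
// carries a flag so each catch site can claim the right to log it
// at most once. Hypothetical stand-in, not the actual Solr code.
public class LoggedOnceException extends RuntimeException {
    private boolean alreadyLogged;

    public LoggedOnceException(String msg, Throwable cause, boolean alreadyLogged) {
        super(msg, cause);
        this.alreadyLogged = alreadyLogged;
    }

    // Returns true exactly once; subsequent callers should not log again.
    public synchronized boolean markLogged() {
        if (alreadyLogged) return false;
        alreadyLogged = true;
        return true;
    }
}
```

The failure mode Mark describes is visible in this sketch: if a constructor defaults the flag to true, no catch site ever logs, and the exception silently disappears -- which is why flipping the default (or removing the mechanism entirely, as Hoss suggests) is on the table.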
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054-dynamic.patch

Here is a patch that implements what introsort does: if the depth of recursion is 75% of log2(n), switch to mergeSort. This patch also moves all remaining quickSort calls to mergeSort on the search side, where the comparators are not good. A few quickSort calls remain in the indexer, but those all sort unique sets of terms or field names (needs some more review tomorrow). Mike: What do you think, maybe you can do some benchmarking?

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-dynamic.patch, LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Commented] (SOLR-2484) Make SynonymFilterFactory more extendable
[ https://issues.apache.org/jira/browse/SOLR-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027920#comment-13027920 ] Steven Rowe commented on SOLR-2484:

Ryan, [Jenkins|https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7666/] is unhappy with {{import visad.UnimplementedException}}:

{noformat} compileTests: [mkdir] Created dir: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/tests [javac] Compiling 264 source files to /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/tests [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:32: package visad does not exist [javac] import visad.UnimplementedException; [javac] ^ {noformat}

Make SynonymFilterFactory more extendable - Key: SOLR-2484 URL: https://issues.apache.org/jira/browse/SOLR-2484 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2484-SynonymFilterFactory.patch

As is, reading rules from the ResourceLoader is baked into inform(ResourceLoader loader). It would be nice to be able to load custom rules w/o needing to rewrite the whole thing. This issue changes two things: # Changes List<String> to Iterable<String> because we don't really need a list # adds protected Iterable<String> loadRules( String synonyms, ResourceLoader loader ) -- so subclasses could fill their own
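The two changes in the patch amount to a template-method hook: rule loading becomes a protected method that subclasses override. A self-contained sketch of that shape, using hypothetical stand-in classes rather than the real Solr SynonymFilterFactory/ResourceLoader API:

```java
import java.util.Arrays;

// Stand-in for a resource loader: the default rule source.
class ResourceLoaderStub {
    Iterable<String> readLines(String resource) {
        // Canned rules for the sketch; the real loader reads the file.
        return Arrays.asList("foo=>bar", "baz=>qux");
    }
}

// Sketch of the SOLR-2484 idea: loadRules is a protected hook,
// returning Iterable<String> since callers only need to iterate.
class SynonymFactorySketch {
    protected Iterable<String> loadRules(String synonyms, ResourceLoaderStub loader) {
        return loader.readLines(synonyms); // default: delegate to the loader
    }
}

// A subclass can now supply rules from anywhere (database, network, ...)
// without reimplementing the rest of the factory's setup logic.
class FixedRulesFactory extends SynonymFactorySketch {
    @Override
    protected Iterable<String> loadRules(String synonyms, ResourceLoaderStub loader) {
        return Arrays.asList("a=>b");
    }
}
```

Returning `Iterable<String>` rather than `List<String>` is the smaller but telling change: it lets an override stream rules lazily instead of materializing them all up front.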
Re: modularization discussion
In short, I believe people should still contribute where they see they can add the most value and according to their time schedules. Additionally, others who have more time or the ability to refactor for reusability should be free to do so as well.

I agree that people should be able to contribute where they can; at the same time, as a single unified project (lucene+solr) I think there is an objective 'right' place for things -- code designed to have maximum utility and reusability (minimum dependencies without sacrificing functionality). Starting things in the right place is often easier than refactoring later -- that said, i don't think it should be a requirement as long as we all agree that things can (and should) be moved to a more reusable place if someone is willing to do the work.

Thinking about the issue that triggered this debate... in SOLR-2272 (the pseudo-join stuff), I think the heart of the problem was the idea that once committed, this new feature could not be moved around. With this discussion, I think we agree that it should be refactored if someone is willing to do the work. It may even be reasonable for someone to mark it as @lucene.experimental if there is serious concern about how hard it is to refactor (and that person is planning to put in some effort to move things in the right direction)

ryan
Re: jira issues falling off the radar -- Next JIRA version
: It'd be nice if Jira could auto-magically treat Next as whatever
: release really is next. EG, say we all agree 3.2 is our next
: release, then ideally Jira would treat all Next issues as if they were
: marked with 3.2.

FWIW: you can rename jira versions w/o losing information about what issues were associated with that version. (It's useful when you have release code names before you know what the version will actually be) but i don't really think we need to utilize that -- our release process makes it pretty self-evident what the next release on any given branch will be

: But... lacking that, maybe we really shouldn't use Next at all, and

Agreed. I take most of the blame for introducing the concept of Next... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3calpine.deb.1.10.1005251052040.24...@radix.cryptio.net%3E ... but in my defense: no one said they thought it was a bad idea. The way things shook out after the 3x branch was created just evolved differently than i anticipated, resulting in no *need* to track this concept as an abstract version -- we have feature releases from both branches, and people use their judgment to decide which features should be backported, updating Jira as they go. We should definitely kill off Next ... i would suggest just removing it, and not bulk applying a new version (there is no requirement that issues have a version)

: On a related note, I don't know what to make of the 1.5 version, nor what
: to make of issues marked as Closed for Next. Some house cleaning is in
: order.
:
: We should clean these up. Should we just roll them over to 3.2?

see the above link about the clean up i already did for the 1.5 Fix (an easier view of the full work log is via markmail: http://markmail.org/thread/7r4lfqddmjkqa3qy ).

I made a conscious decision at that time not to *remove* the 1.5 version from any issue, because those fixes/features do in fact exist on the 1.5 branch, which still exists; instead i focused on trying to ensure that all those issues had the *other* version info (ie: 3.1 or 4.0) tracked on them properly (as best i could based on CHANGES.txt)

I still don't think it makes sense to *remove* the 1.5 version completely, but I went ahead and updated Jira to change the status of 1.5 to Archived -- so it no longer shows up as an option when editing or searching for issues, but if you look at an issue that is mapped to 1.5 that metadata is still there.

As far as the 27 issues that are Closed|Resolved AND Next ... it looks like most of them are issues that were either duplicates, or abandoned (in which case people rarely remember to unset the version info)...

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+SOLR+AND+resolution+in+%28Fixed%2C+%22Won%27t+Fix%22%2C+Duplicate%2C+Invalid%2C+Incomplete%2C+%22Cannot+Reproduce%22%2C+Later%2C+%22Not+A+Problem%22%29+AND+fixVersion+%3D+12315093

...it looks like only 4 of them were genuinely Fixed and need their version info updated (which just means auditing the commits to see what branches and when) ... the rest can probably be ignored if we just delete Next as a version.

-Hoss
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: (was: LUCENE-3054-dynamic.patch)

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-dynamic.patch, LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027948#comment-13027948 ] Uwe Schindler commented on LUCENE-3054:

Studying the C++ STL code showed that they use 2 * log2(n) as the depth limit. I implemented that. It showed that for most cases in Lucene (BytesRefHash), it uses quicksort (so no change to performance). The other cases already use mergeSort, and the bad test in TestArrayUtil successfully switches to mergeSort.

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-dynamic.patch, LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
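The scheme described in this thread (quicksort until the recursion depth exceeds 2 * log2(n), then switch to a stack-safe merge sort) is introsort. A minimal self-contained sketch of the idea, not the actual SorterTemplate code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Introsort-style driver: quicksort until the recursion depth exceeds
// 2*log2(n), then fall back to merge sort, so a broken comparator (or
// an unlucky pivot) cannot overflow the stack.
public class IntroSortSketch {
    public static <T> void sort(T[] a, Comparator<T> cmp) {
        // 2 * ceil(log2(n)): the C++ STL's depth limit mentioned above.
        int maxDepth = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(a.length, 1)));
        quickSort(a, 0, a.length - 1, cmp, maxDepth);
    }

    private static <T> void quickSort(T[] a, int lo, int hi, Comparator<T> cmp, int depth) {
        if (hi - lo < 1) return;
        if (depth <= 0) {              // depth limit hit: degenerate partitioning
            mergeSort(a, lo, hi, cmp); // would otherwise recurse too deeply
            return;
        }
        // Hoare-style partition around the middle element.
        T pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {
            while (cmp.compare(a[i], pivot) < 0) i++;
            while (cmp.compare(a[j], pivot) > 0) j--;
            if (i <= j) { T t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        quickSort(a, lo, j, cmp, depth - 1);
        quickSort(a, i, hi, cmp, depth - 1);
    }

    private static <T> void mergeSort(T[] a, int lo, int hi, Comparator<T> cmp) {
        // Arrays.sort on objects is a merge sort variant: O(n log n) worst case.
        Arrays.sort(a, lo, hi + 1, cmp);
    }
}
```

The point of the depth limit is that for well-behaved input quicksort runs unmodified (no performance change, as observed for BytesRefHash), while pathological inputs, such as large arrays with only a few distinct values under a tie-breaking-free comparator, hit the limit and finish in the merge sort in guaranteed O(n log n) and O(log n) stack.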
[jira] [Resolved] (LUCENE-3063) factor CharTokenizer/CharacterUtils into analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3063. - Resolution: Fixed Committed revision 1098871. If there are any problems with hudson i'll yank it... for now I'll open a followup issue to add the additional checks to MockTokenizer factor CharTokenizer/CharacterUtils into analyzers module - Key: LUCENE-3063 URL: https://issues.apache.org/jira/browse/LUCENE-3063 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3063.patch Currently these analysis components are in the lucene core, but should really be .util in the analyzers module. Also, with MockTokenizer extending Tokenizer directly, we can add some additional checks in the future to try to ensure our consumers are being good consumers (e.g. calling reset). This is mentioned in http://wiki.apache.org/lucene-java/TestIdeas, I didn't implement it here yet, this is just the factoring. I think we should try to do this before LUCENE-3040.
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7670 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7670/ 2 tests failed. FAILED: org.apache.lucene.util.automaton.TestLevenshteinAutomata.testUpdateSameDoc Error Message: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. Stack Trace: junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. at java.lang.Thread.run(Thread.java:636) FAILED: TEST-org.apache.lucene.index.TestRollingUpdates.xml.init Error Message: Stack Trace: Test report file /home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/TEST-org.apache.lucene.index.TestRollingUpdates.xml was length 0 Build Log (for compile errors): [...truncated 3174 lines...]
[jira] [Created] (LUCENE-3064) add checks to MockTokenizer to enforce proper consumption
add checks to MockTokenizer to enforce proper consumption - Key: LUCENE-3064 URL: https://issues.apache.org/jira/browse/LUCENE-3064 Project: Lucene - Java Issue Type: Test Reporter: Robert Muir Fix For: 4.0 we can enforce things like consumer properly iterates through the tokenstream lifecycle via MockTokenizer. this could catch bugs in consumers that don't call reset(), etc.
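The checks this issue describes amount to a small state machine: the test tokenizer tracks where the consumer is in the lifecycle and fails fast when a step is skipped. A self-contained sketch of that idea with hypothetical stand-in classes, not the real Lucene TokenStream/MockTokenizer API:

```java
// Sketch of the lifecycle enforcement LUCENE-3064 proposes: track a
// state machine and throw when a consumer calls methods out of order
// (e.g. incrementToken() without reset()). Hypothetical stand-in.
public class CheckedTokenStream {
    private enum State { CREATED, RESET, INCREMENTING, END_CALLED, CLOSED }

    private State state = State.CREATED;
    private int remainingTokens = 3; // stand-in for real tokenization

    public void reset() {
        state = State.RESET;
        remainingTokens = 3;
    }

    public boolean incrementToken() {
        if (state != State.RESET && state != State.INCREMENTING) {
            // This is the bug class the issue wants to catch in tests.
            throw new IllegalStateException("incrementToken() without reset(); state=" + state);
        }
        state = State.INCREMENTING;
        return remainingTokens-- > 0;
    }

    public void end() {
        if (state != State.INCREMENTING && state != State.RESET) {
            throw new IllegalStateException("end() called out of order; state=" + state);
        }
        state = State.END_CALLED;
    }

    public void close() {
        state = State.CLOSED;
    }
}
```

A well-behaved consumer calls reset(), loops on incrementToken(), then end() and close(); a consumer that forgets reset() gets an immediate IllegalStateException instead of silently-wrong analysis output.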