Lucene-Solr-tests-only-3.x - Build # 3533 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3533/

1 tests failed.

FAILED: org.apache.lucene.util.TestVersion.testFilter

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.
        at java.lang.Thread.run(Thread.java:636)

Build Log (for compile errors):
[...truncated 8470 lines...]
[jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979264#action_12979264 ]

Lance Norskog commented on SOLR-2129:
--------------------------------------

bq. Don't want to at least log this?

} catch (AnalysisEngineProcessException e) {
  // do nothing
}

bq. I wanted the UIMA enrichment pipeline to be error-safe, but I agree it'd be reasonable to log the error in this case (even if I don't like logging exceptions in general).

Please do not hide errors in any way. Nobody reads logs. If it fails in production, I want to know immediately and fix it. Please just throw all exceptions up the stack.

> Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
> --------------------------------------------------------------------------------
>
> Key: SOLR-2129
> URL: https://issues.apache.org/jira/browse/SOLR-2129
> Project: Solr
> Issue Type: New Feature
> Reporter: Tommaso Teofili
> Assignee: Robert Muir
> Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
>
> Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents.
> The purpose of this is to get unstructured information "inside" a document and create structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and a hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended by adding or selecting different UIMA analysis engines, either from UIMA repositories on the web or by creating new ones from scratch.
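For illustration, a minimal sketch of the throw-up-the-stack approach Lance asks for; the helper method, the AnalysisEngine call, and the SolrException wrapping are assumptions for the sketch, not the actual UIMAUpdateRequestProcessor code:

{code:java}
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

// Hypothetical helper, not the actual update processor code:
void processText(AnalysisEngine ae, JCas jcas) {
  try {
    ae.process(jcas);  // run the UIMA pipeline
  } catch (AnalysisEngineProcessException e) {
    // Don't swallow: fail the update request loudly so problems
    // surface in production instead of silently skipping enrichment.
    throw new SolrException(ErrorCode.SERVER_ERROR, e);
  }
}
{code}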
[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key
[ https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979255#action_12979255 ]

Uwe Schindler commented on LUCENE-2855:
---------------------------------------

One thing in your patch: Lucene tests should always extend LuceneTestCase (which is JUnit 4).

> Contrib queryparser should not use CharSequence as Map key
> -----------------------------------------------------------
>
> Key: LUCENE-2855
> URL: https://issues.apache.org/jira/browse/LUCENE-2855
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 3.0.3
> Reporter: Adriano Crestani
> Assignee: Adriano Crestani
> Fix For: 3.0.4
>
> Attachments: lucene_2855_adriano_crestani_2011_01_08.patch
>
> Today, contrib query parser uses Map<CharSequence, ...> in many different places, which may lead to problems, since the CharSequence interface does not enforce the implementation of hashCode and equals methods. Today, it's causing a problem with the QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder) method, which does not work as expected.
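For reference, a bare-bones shape of Uwe's suggestion; the class and test names are hypothetical, the point is extending LuceneTestCase instead of JUnit's TestCase:

{code:java}
import org.apache.lucene.util.LuceneTestCase;

public class TestQueryTreeBuilderKeys extends LuceneTestCase {
  // LuceneTestCase is JUnit 4 based and wires in Lucene's test
  // infrastructure (randomization, leak checks), so tests should
  // extend it instead of junit.framework.TestCase.
  public void testStringAndCharSequenceKeys() throws Exception {
    assertEquals("content", new StringBuilder("content").toString());
  }
}
{code}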
[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key
[ https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979253#action_12979253 ]

Uwe Schindler commented on LUCENE-2855:
---------------------------------------

+1 to commit.

In general, one should never use interfaces as keys in maps (as long as they don't declare the equals and hashCode methods inside the interface).
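A self-contained illustration of Uwe's point: CharSequence does not pin down equals()/hashCode(), so two CharSequences with identical characters need not match as HashMap keys.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class CharSequenceKeyDemo {
  public static void main(String[] args) {
    Map<CharSequence, String> builders = new HashMap<CharSequence, String>();
    builders.put(new StringBuilder("field"), "registered");

    // StringBuilder inherits Object's identity-based equals()/hashCode(),
    // so neither lookup below finds the entry:
    System.out.println(builders.get("field"));                    // null
    System.out.println(builders.get(new StringBuilder("field"))); // null
  }
}
{code}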
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979252#action_12979252 ]

Jason Rutherglen commented on LUCENE-2324:
-------------------------------------------

{quote}I think segment 1 shouldn't be committed, ie. a global flush should be all or nothing. This means we would have to delay the commit of the segments until all DWPTs flushed successfully.{quote}

If a DWPT aborts during flush, we simply throw an exception, however we still keep the successfully flushed segment(s). If there's an abort on any DWPT during commit then we throw away any successfully flushed segments as well. I think that makes sense, eg, all or nothing.

> Per thread DocumentsWriters that write their own private segments
> -------------------------------------------------------------------
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: Realtime Branch
>
> Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
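A sketch of the all-or-nothing commit semantics Jason describes above; every name here (checkoutAllDWPTs, AbortingException, publishSegments) is invented for illustration and is not the realtime-branch code:

{code:java}
// Commit: flushed segments are only published once every DWPT has
// flushed without an aborting exception. (Sketch only.)
List<SegmentInfo> flushed = new ArrayList<SegmentInfo>();
try {
  for (DocumentsWriterPerThread dwpt : checkoutAllDWPTs()) {
    flushed.add(dwpt.flush());        // may throw an aborting exception
  }
} catch (AbortingException e) {
  for (SegmentInfo si : flushed) {
    deleteSegmentFiles(si);           // discard even the successful flushes
  }
  throw e;                            // all or nothing: publish no segments
}
publishSegments(flushed);             // every DWPT flushed cleanly
{code}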
[jira] Updated: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key
[ https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adriano Crestani updated LUCENE-2855:
-------------------------------------

Attachment: lucene_2855_adriano_crestani_2011_01_08.patch

Here is the fix for the problem raised in thread [1]. The patch also includes a JUnit test to make sure the problem doesn't show up again.

If there are no concerns in two days, I will go ahead and commit the patch.

[1] - http://lucene.markmail.org/thread/mbb5wlxttsa6sges
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979248#action_12979248 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

{quote}
I think start simple - the addDocument always happens? Ie it's never coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since flush is single threaded, there should always be a free DWPT?
{quote}

Yeah I agree. The change I'll make then is to not have the global lock, and to return a DWPT to the pool and set it to 'idle' as soon as its flush completes.

{quote}
I think we should continue what we do today? Ie, if it's an 'aborting' exception, then the entire segment held by that DWPT is discarded? And we then throw this exc back to the caller (and don't try to flush any other segments)?
{quote}

What I meant was the following situation: Suppose we have two DWPTs and IW.commit() is called. The first DWPT finishes flushing successfully, is returned to the pool and idle again. The second DWPT flush fails with an aborting exception. Should the segment of the first DWPT make it into the index or not?

I think segment 1 shouldn't be committed, ie. a global flush should be all or nothing. This means we would have to delay the commit of the segments until all DWPTs flushed successfully.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979247#action_12979247 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

bq. I think the risk is a new DWPT likely will have been created during flush, which'd make the returning DWPT inutile.

The DWPT will not be removed from the pool, just marked as busy during flush, just as its state is busy (or currently called "non-idle" in the code) during addDocument(). So no new DWPT would be created during flush if the maxThreadState limit was already reached.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979243#action_12979243 ]

Jason Rutherglen commented on LUCENE-2324:
-------------------------------------------

To further clarify, we also no longer have global aborts? Each abort only applies to an individual DWPT?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979229#action_12979229 ]

Jason Rutherglen commented on LUCENE-2324:
-------------------------------------------

{quote}the "flush the world" case? (Ie the app calls IW.commit or IW.getReader). In this case the thread just one by one pulls all DWPTs that have any indexed docs out of production, flushes them, clears them, and returns them to production?{quote}

The 2 cases are: A) flush every DWPT sequentially (aka flush the world) and B) flush by RAM usage when adding docs or deleting. A is clear! I think with B we're saying even if the calling thread is bound to DWPT #1, if DWPT #2 is greater in size and the aggregate RAM usage exceeds the max, then, using the calling thread, we take DWPT #2 out of production, flush it, and return it?

{quote}The behavior of calling IW.close while other threads are still adding docs has never been defined (and, shouldn't be) except that we won't corrupt your index, and we'll get all docs indexed before .close was called, committed. So I think even for this case we don't need a global lock.{quote}

Great, that simplifies and clarifies that we do not require a global lock.

{quote}But, you're right: maybe we should sometimes "prune" DWPTs. Or simply stop recycling any RAM, so that a just-flushed DWPT is an empty shell.{quote}

I'm not sure how we'd prune; typically object pools have a separate eviction thread, and I think that's going overboard? Maybe we can simply throw out the DWPT and put recycling byte[]s and/or pooling DWPTs back in later if it's necessary?
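A sketch of case B (flush by RAM usage) as Jason describes it above, with all names assumed for illustration:

{code:java}
// The indexing thread, even if bound to DWPT #1, flushes the largest
// consumer once aggregate RAM crosses the configured limit. (Sketch only.)
if (totalBytesUsed() > maxRamBufferBytes()) {
  DocumentsWriterPerThread biggest = maxRamConsumer();  // may belong to another thread
  if (biggest.tryCheckoutForFlush()) {                  // take it out of production
    try {
      publish(biggest.flush());
    } finally {
      biggest.returnToPool();                           // back into rotation afterwards
    }
  }
}
{code}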
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979193#action_12979193 ]

Samuel García Martínez commented on SOLR-236:
----------------------------------------------

The NPE noticed by Shekhar Nirkhe is caused by errors in the filter query cache and the signature key that is used to store cached results.

To sum up, if you perform a filter query and then perform the same query using a collapse field, the query result is already cached, but not in the form expected by this component. As a result, the DocSet implementation is not the expected one and, since the result comes from the cache, the DocumentCollector is never executed.

As soon as I can, I'll post a patch that caches results under a combined key formed by the collector class and the query itself.

Colbenson - Findability Experts
http://www.colbenson.es/

> Field collapsing
> -----------------
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Emmanuel Keller
> Assignee: Shalin Shekhar Mangar
> Fix For: Next
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, SOLR-236-distinctFacet.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)
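A sketch of the combined cache key Samuel describes; the class name and shape are assumptions, not the forthcoming patch:

{code:java}
import org.apache.lucene.search.Query;

// Composite key so collapsed and non-collapsed results of the same
// query never collide in the filter cache.
public final class CollapseCacheKey {
  private final Class<?> collectorClass;  // e.g. a DocSetScoreCollector class
  private final Query query;

  public CollapseCacheKey(Class<?> collectorClass, Query query) {
    this.collectorClass = collectorClass;
    this.query = query;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof CollapseCacheKey)) return false;
    CollapseCacheKey other = (CollapseCacheKey) o;
    return collectorClass.equals(other.collectorClass) && query.equals(other.query);
  }

  @Override
  public int hashCode() {
    return 31 * collectorClass.hashCode() + query.hashCode();
  }
}
{code}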
[jira] Updated: (LUCENE-2829) improve termquery "pk lookup" performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2829:
---------------------------------------

Attachment: LUCENE-2829.patch

New patch. I added VirtualMethods to Sim to make sure Sim subclasses that don't override the idfExplain that takes docFreq are still called.

> improve termquery "pk lookup" performance
> -------------------------------------------
>
> Key: LUCENE-2829
> URL: https://issues.apache.org/jira/browse/LUCENE-2829
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Robert Muir
> Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2829.patch, LUCENE-2829.patch, LUCENE-2829.patch
>
> For things that are like primary keys and don't exist in some segments (worst case is a primary/unique key that only exists in 1) we do wasted seeks.
> While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned whether we could ever backport that to 3.1, for example.
> This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x
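For context, a hedged sketch of how Lucene's VirtualMethod utility can detect such overrides; the exact legacy idfExplain signature here is an assumption based on this discussion, and the surrounding placement is hypothetical:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.util.VirtualMethod;

// Detects whether a Similarity subclass overrides the legacy
// idfExplain(Term, Searcher), so that variant can still be invoked for it:
static final VirtualMethod<Similarity> legacyIdfExplain =
    new VirtualMethod<Similarity>(Similarity.class, "idfExplain",
                                  Term.class, Searcher.class);

boolean prefersLegacy(Similarity sim) {
  return legacyIdfExplain.isOverriddenAsOf(sim.getClass());
}
{code}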
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979189#action_12979189 ]

Michael McCandless commented on LUCENE-2324:
---------------------------------------------

bq. The proposed change is simply that the thread calling add doc will flush its DWPT if needed, take it offline while doing so, and return it when completed.

Wait -- this is the "addDocument" case right? (I thought we were still talking about the "flush the world" case...).

bq. I think the risk is a new DWPT likely will have been created during flush, which'd make the returning DWPT inutile?

A new DWPT will have been created only if more than one thread is indexing docs right? In which case this is fine? Ie the old DWPT (just flushed) will just go back into rotation, and when another thread comes in it can take it?

But, you're right: maybe we should sometimes "prune" DWPTs. Or simply stop recycling any RAM, so that a just-flushed DWPT is an empty shell.

bq. However I think we may still need the global lock for close, eg, today we're preventing the user from adding docs during close; after this issue is merged that behavior would change?

Well, the threads still adding docs will hit AlreadyClosedException? (But, that's just "best effort"). The behavior of calling IW.close while other threads are still adding docs has never been defined (and, shouldn't be) except that we won't corrupt your index, and we'll get all docs indexed before .close was called, committed. So I think even for this case we don't need a global lock.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979190#action_12979190 ]

Michael McCandless commented on LUCENE-2324:
---------------------------------------------

{quote}
And there's the case where the thread calling flush doesn't yet have a DWPT: it's going to need to get one assigned to it, however the one assigned may not be the max ram consumer. What'll we do then? If the user explicitly called flush we can a) do nothing, or b) flush the (max ram consumer) thread's DWPT, however that gets hairy with wait notifies (almost like the global lock?).
{quote}

Wait -- why would the thread calling flush need to have a DWPT assigned to it? You're talking about the "flush the world" case? (Ie the app calls IW.commit or IW.getReader). In this case the thread just one by one pulls all DWPTs that have any indexed docs out of production, flushes them, clears them, and returns them to production?
[jira] Commented: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979188#action_12979188 ]

Hoss Man commented on SOLR-2288:
--------------------------------

Reminder to self: feedback from rmuir on the mailing list to replace the static EMPTY set/map refs w/ type info that i added with direct usage like this...

- this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
+ this(fieldName, fieldType, analyzer, Collections.<String>emptySet());

> clean up compiler warnings
> ---------------------------
>
> Key: SOLR-2288
> Project: Solr
> Issue Type: Improvement
> Reporter: Hoss Man
> Assignee: Hoss Man
> Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch
>
> there's a ton of compiler warnings in the solr tree, and it's high time we cleaned them up, or annotated them to be suppressed, so we can start making a bigger stink when/if code is added to the tree that produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones)
> Using this issue to track related commits
> The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way:
> * fix generic declarations
> * add a SuppressWarnings annotation if it's safe in context
Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/
: > +  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();
:
: I don't know about this commit... i see a lot of EMPTY set's and maps
: defined statically here.
...
: I think we should be using the Collection methods, for example on your
: first file:

Hmmm... i am using the Collections method, it's the same set/map in each case, i'm just creating static refs to them with the type information.

My reading of the javadocs was that the implementation of emptySet() was going to just return the same immutable instance every time anyway, so there didn't seem to be any functional diff in reusing it like this -- it seemed like the natural way to migrate from using Collections.EMPTY_SET: use our own local ref of the same object w/ type info.

: -  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
: +  this(fieldName, fieldType, analyzer, Collections.<String>emptySet());

Ah... see, i didn't even know that syntax was valid to bind the generic on a static method. I'd only ever done the binding in the assignment.

yeah, sure -- i'll make a note to myself to go back and clean those up.

-Hoss
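A short, self-contained demo of the type-witness syntax discussed above:

{code:java}
import java.util.Collections;
import java.util.Set;

public class EmptySetDemo {
  static void takesStrings(Set<String> s) {}

  public static void main(String[] args) {
    // The explicit type witness binds emptySet()'s type parameter at the
    // call site, so no shared static constant is needed:
    takesStrings(Collections.<String>emptySet());

    // In an assignment the parameter is inferred from the target type:
    Set<String> none = Collections.emptySet();
    takesStrings(none);
  }
}
{code}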
[jira] Resolved: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2854.
----------------------------------------

Resolution: Fixed

> Deprecate SimilarityDelegator and Similarity.lengthNorm
> ---------------------------------------------------------
>
> Key: LUCENE-2854
> URL: https://issues.apache.org/jira/browse/LUCENE-2854
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch
>
> SimilarityDelegator is a back compat trap (see LUCENE-2828).
> Apps should just [statically] subclass Sim or DefaultSim; if they really need "runtime subclassing" then they can make their own app-level delegator.
> Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm in favor of computeNorm.
[jira] Commented: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979178#action_12979178 ]

Michael McCandless commented on LUCENE-2828:
---------------------------------------------

We won't fix this for 3.x or 4.0, since we've deprecated SimilarityDelegator and forced a hard cutover from Sim.lengthNorm -> Sim.computeNorm (LUCENE-2854). But I'll leave this open in case we do another 2.9/3.0 release.

> SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
> ----------------------------------------------------------------------------
>
> Key: LUCENE-2828
> URL: https://issues.apache.org/jira/browse/LUCENE-2828
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3
> Reporter: Michael McCandless
> Fix For: 2.9.5, 3.0.4
>
> Attachments: LUCENE-2828.patch
>
> In LUCENE-1420, we added Similarity.computeNorm to let the norm computation have access to the raw information (length, boost, etc.).
> But this class broke back compat with SimilarityDelegator. We did add computeNorm there, but its impl just forwards to the delegee's computeNorm. In the case where a subclass of SimilarityDelegator overrides lengthNorm, that method will no longer be invoked.
> Not quite sure how to fix this since, somehow, we have to determine whether the delegee's impl of computeNorm should be favored over the subclass's impl of the "legacy" lengthNorm.
[jira] Updated: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2828:
---------------------------------------

Fix Version/s: 3.0.4
               2.9.5
[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979174#action_12979174 ]

Michael McCandless commented on LUCENE-2854:
---------------------------------------------

bq. Is it possible to remove this method Query.getSimilarity also? I don't understand why we need this method!

I would love to! But I think that's for another day...

I looked into this and got stuck on BoostingQuery, which rewrites to an anon subclass of BQ overriding its getSimilarity to, in turn, override its coord method. Rather twisted... if we can do this differently I think we could remove Query.getSimilarity.
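Roughly the pattern being described (a from-memory simplification, not the actual contrib BoostingQuery source; 'match', 'context' and 'boost' are assumed fields):

{code:java}
BooleanQuery result = new BooleanQuery() {
  @Override
  public Similarity getSimilarity(Searcher searcher) {
    return new DefaultSimilarity() {
      @Override
      public float coord(int overlap, int maxOverlap) {
        // full overlap means the demoting 'context' clause also matched
        return overlap == maxOverlap ? boost : 1.0f;
      }
    };
  }
};
result.add(match, BooleanClause.Occur.MUST);
result.add(context, BooleanClause.Occur.SHOULD);
{code}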
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979164#action_12979164 ]

Robert Muir commented on LUCENE-1260:
-------------------------------------

bq. Is there no way to remove this stupid static default and deprecate Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for the case of NormsWriter?

I think this is totally what we should try to do in trunk, especially after LUCENE-2846. In this case, i want to fix the issue in a backwards-compatible way for Lucene 3.x.

The warning is a little crazy I know; really, people shouldn't rely upon their encoder being used for *fake norms*. But i think it's fair to document the corner case, just because it's not really fixable easily in 3.x.

For trunk, here is what i suggest:
* LUCENE-2846: remove all uses of fake norms. We never fill fake norms anymore at all, once we fix this issue. If you have a non-atomic reader with two segments, and one has no norms, then the whole norms[] should be null. This is consistent with omitTF. So, for example, MultiNorms would never create fake norms.
* LUCENE-2854: Mike is working on some issues i think where BooleanQuery uses this static or some other silliness with Similarity; i think we can clean that up there.
* finally at this point, I would like to remove Similarity.getDefault/setDefault altogether. I would prefer instead that IndexSearcher has a single 'DefaultSimilarity' that is the default value if you don't provide one, and likewise with IndexWriterConfig.

> Norm codec strategy in Similarity
> -----------------------------------
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.1
> Reporter: Karl Wettin
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260_defaultsim.patch
>
> The static span and resolution of the 8 bit norms codec might not fit with all applications.
> My use case requires that 100f-250f is discretized in 60 bags instead of the default.. 10?
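The shape Robert proposes, sketched with assumed setter names rather than a committed API (MyCustomSimilarity is hypothetical):

{code:java}
// Per-instance defaults instead of the static Similarity.setDefault(...);
// treat the exact setters and Version constant as assumptions:
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
iwc.setSimilarity(new MyCustomSimilarity());   // falls back to DefaultSimilarity

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MyCustomSimilarity());
{code}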
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979162#action_12979162 ]

Jason Rutherglen commented on LUCENE-2324:
-------------------------------------------

And there's the case where the thread calling flush doesn't yet have a DWPT: it's going to need to get one assigned to it, however the one assigned may not be the max ram consumer. What'll we do then? If the user explicitly called flush we can a) do nothing, or b) flush the (max ram consumer) thread's DWPT, however that gets hairy with wait notifies (almost like the global lock?).
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979160#action_12979160 ]

Uwe Schindler commented on LUCENE-1260:
---------------------------------------

bq. Here's a patch for the general case, and it also adds a warning that you should set your similarity with Similarity.setDefault, especially if you omit norms.

Is there no way to remove this stupid static default and deprecate Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for the case of NormsWriter?
[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1260:
--------------------------------

Attachment: LUCENE-1260_defaultsim.patch

Here's a patch for the general case; it also adds a warning that you should set your similarity with Similarity.setDefault, especially if you omit norms. We can backport this to 3.x.

The other cases involve fake norms, which I think we should completely remove in trunk with LUCENE-2846; then there is no longer an issue and we can remove the warning in trunk.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979149#action_12979149 ]

Jason Rutherglen commented on LUCENE-2324:
-------------------------------------------

{quote}As soon as a DWPT is pulled from production for flushing, it loses all thread affinity and becomes unavailable until its flush finishes. When a thread needs a DWPT, it tries to pick the one it last had (affinity) but if that one's busy, it picks a new one. If none are available but we are below our max DWPT count, it spins up a new one?{quote}

Right.

{quote}With the proposed approach, all docs added (or in the process of being added) will make it into the flushed segments once the flush returns; newly added docs after the flush call started may or may not make it. But this is fine? I mean, if the app has stronger requirements then it should externally sync?{quote}

Ok. The proposed change is simply that the thread calling add doc will flush its DWPT if needed, take it offline while doing so, and return it when completed. I think the risk is a new DWPT likely will have been created during flush, which'd make the returning DWPT inutile?

{quote}Why would we lose them? Wouldn't that DWPT just go back into rotation once the flush is done?{quote}

Yes, we just need to change the existing code a bit then. However I think we may still need the global lock for close, eg, today we're preventing the user from adding docs during close; after this issue is merged that behavior would change?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979146#action_12979146 ]

Michael McCandless commented on LUCENE-2324:
---------------------------------------------

{quote}
What if the user wants a guaranteed hard flush of all state up to the point of the flush call (won't they want this sometimes with getReader)? If we're flushing sequentially (without pausing all threads) we're removing that? Maybe we'll need to give the option of global lock/stop or sequential flush?
{quote}

What's a "hard flush"? With the proposed approach, all docs added (or in the process of being added) will make it into the flushed segments once the flush returns; newly added docs after the flush call started may or may not make it. But this is fine? I mean, if the app has stronger requirements then it should externally sync?

bq. Also I think we need to clear the thread bindings of a DWPT just prior to the flush of the DWPT?

Right. As soon as a DWPT is pulled from production for flushing, it loses all thread affinity and becomes unavailable until its flush finishes. When a thread needs a DWPT, it tries to pick the one it last had (affinity) but if that one's busy, it picks a new one. If none are available but we are below our max DWPT count, it spins up a new one?

{quote}
Then, what happens to reusing the DWPT if we're flushing it, and we spin a new DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[] recycling?
{quote}

Why would we lose them? Wouldn't that DWPT just go back into rotation once the flush is done?
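A sketch of the DWPT checkout policy Mike describes above; all names are invented for illustration, not the realtime-branch code:

{code:java}
// Sketch only: prefer the thread's last DWPT, else any idle one,
// else grow the pool up to the configured maximum.
synchronized DocumentsWriterPerThread checkout(Thread indexingThread) {
  DocumentsWriterPerThread last = affinity.get(indexingThread);
  if (last != null && last.tryLock()) {
    return last;                          // thread affinity: reuse the last DWPT
  }
  for (DocumentsWriterPerThread dwpt : pool) {
    if (dwpt.tryLock()) {                 // otherwise take any idle DWPT
      affinity.put(indexingThread, dwpt);
      return dwpt;
    }
  }
  if (pool.size() < maxThreadStates) {    // none idle: grow the pool if allowed
    DocumentsWriterPerThread fresh = newLockedDWPT();
    pool.add(fresh);
    affinity.put(indexingThread, fresh);
    return fresh;
  }
  return waitForIdleDWPT();               // pool maxed out: wait for a flush to finish
}
{code}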
[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979144#action_12979144 ]

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:
----------------------------------------------------------------

Changes are:
- Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
- Make the getAE method in OverridingParamAEProvider synchronized to support concurrent requests to the provider.
- Make the getAEProvider method in AEProviderFactory synchronized and make the cache "core aware": each core now has an AEProvider for each analysis engine's path.
- The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

was (Author: teofili):
Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the cache "core aware": each core now has an AEProvider for each analysis engine's path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.
[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979144#action_12979144 ]

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:
----------------------------------------------------------------

Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the cache "core aware": each core now has an AEProvider for each analysis engine's path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

was (Author: teofili):
Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the cache "core aware": each core now has an AEProvider for each analysis engine's path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tommaso Teofili updated SOLR-2129: -- Attachment: SOLR-2129-version-5.patch Changes are: # drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor # make the getAE method in OverridingParamAEProvider synchronized to support concurrent requests to the provider # make the getAEProvider method in AEProviderFactory synchronized and make the cache "core aware": each core now has an AEProvider for each analysis engine path # the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter instead of a SolrConfig object I tested it with multiple cores and concurrent updates for each core. > Provide a Solr module for dynamic metadata extraction/indexing with Apache > UIMA > --- > > Key: SOLR-2129 > URL: https://issues.apache.org/jira/browse/SOLR-2129 > Project: Solr > Issue Type: New Feature >Reporter: Tommaso Teofili >Assignee: Robert Muir > Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, > SOLR-2129-version-5.patch, SOLR-2129-version2.patch, > SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch > > > Provide components to enable Apache UIMA automatic metadata extraction to be > exploited when indexing documents. > The purpose of this is to get unstructured information "inside" a document > and create structured metadata (as fields) to enrich each document. > Basically this can be done with a custom UpdateRequestProcessor which > triggers UIMA while indexing documents. > The basic UIMA implementation of UpdateRequestProcessor extracts sentences > (with a tokenizer and a hidden Markov model tagger), named entities, > language, suggested category, keywords and concepts (exploiting external > services from OpenCalais and AlchemyAPI). Such an implementation can be > easily extended by adding or selecting different UIMA analysis engines, either > from UIMA repositories on the web or by creating new ones from scratch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
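A minimal sketch of what the synchronized, "core aware" provider cache described above can look like. The names AEProviderFactory, getAEProvider and OverridingParamAEProvider come from the comment itself, but the signatures, the composite cache key and the stub types below are illustrative assumptions, not the contents of SOLR-2129-version-5.patch:

{noformat}
import java.util.HashMap;
import java.util.Map;

// Illustrative stubs; the real types would hand out configured UIMA analysis engines.
interface AEProvider {}

class OverridingParamAEProvider implements AEProvider {
  OverridingParamAEProvider(String aePath) {}
}

public class AEProviderFactory {
  private static final AEProviderFactory INSTANCE = new AEProviderFactory();

  // Keyed by core name plus analysis engine path, so two cores pointing at
  // the same descriptor still get independent providers.
  private final Map<String, AEProvider> cache = new HashMap<String, AEProvider>();

  private AEProviderFactory() {}

  public static AEProviderFactory getInstance() {
    return INSTANCE;
  }

  // synchronized so concurrent update requests from several cores cannot
  // race while the cache is being populated
  public synchronized AEProvider getAEProvider(String coreName, String aePath) {
    String key = coreName + "/" + aePath;
    AEProvider provider = cache.get(key);
    if (provider == null) {
      provider = new OverridingParamAEProvider(aePath);
      cache.put(key, provider);
    }
    return provider;
  }
}
{noformat}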
Lucene-Solr-tests-only-3.x - Build # 3511 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3511/ 1 tests failed. REGRESSION: org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety Error Message: unable to create new native thread Stack Trace: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:614) at org.apache.lucene.search.TestThreadSafe.doTest(TestThreadSafe.java:133) at org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety(TestThreadSafe.java:152) at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:255) Build Log (for compile errors): [...truncated 8566 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979141#action_12979141 ] Robert Muir commented on LUCENE-2854: - Is it possible to remove this method Query.getSimilarity also? I don't understand why we need this method! {noformat} /** Expert: Returns the Similarity implementation to be used for this query. * Subclasses may override this method to specify their own Similarity * implementation, perhaps one that delegates through that of the Searcher. * By default the Searcher's Similarity implementation is returned.*/ {noformat} > Deprecate SimilarityDelegator and Similarity.lengthNorm > --- > > Key: LUCENE-2854 > URL: https://issues.apache.org/jira/browse/LUCENE-2854 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch > > > SimilarityDelegator is a back compat trap (see LUCENE-2828). > Apps should just [statically] subclass Sim or DefaultSim; if they really need > "runtime subclassing" then they can make their own app-level delegator. > Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm > in favor of computeNorm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
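For context, the hook being questioned is a one-line delegation in 3.x that any Query subclass may override; the snippet below is a hypothetical subclass written to illustrate the override point, not code from Lucene itself (in 3.x, toString(String) is the only method a Query subclass must implement):

{noformat}
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Similarity;

public class ConstantSimQuery extends Query {
  // Override the hook to force a specific Similarity for this query,
  // ignoring whatever the Searcher was configured with; this per-query
  // indirection is why every scoring path must ask the query instead of
  // asking the Searcher directly.
  @Override
  public Similarity getSimilarity(Searcher searcher) {
    return new DefaultSimilarity();
  }

  @Override
  public String toString(String field) {
    return "ConstantSimQuery";
  }
}
{noformat}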
[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2854: Attachment: LUCENE-2854_fuzzylikethis.patch here is the patch for fuzzylikethis for trunk... so you can remove the delegator completely in trunk. > Deprecate SimilarityDelegator and Similarity.lengthNorm > --- > > Key: LUCENE-2854 > URL: https://issues.apache.org/jira/browse/LUCENE-2854 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch > > > SimilarityDelegator is a back compat trap (see LUCENE-2828). > Apps should just [statically] subclass Sim or DefaultSim; if they really need > "runtime subclassing" then they can make their own app-level delegator. > Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm > in favor of computeNorm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979139#action_12979139 ] Jason Rutherglen commented on LUCENE-2324: -- Also, don't we need the global lock for commit/close? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: LICENSE/NOTICE file contents
>> Nope - wasn't me that added the license stuff into NOTICE.txt ;-) But, including Jetty's NOTICE seems appropriate for our NOTICE. It's just the license parts of the HSQLDB and SLF4J that should be moved to LICENSE.txt << The NOTICE text is actually different from the LICENSE text for HSQLDB, which is why I thought it must have come from an HSQLDB NOTICE file. Karl - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979138#action_12979138 ] Jason Rutherglen commented on LUCENE-2324: -- {quote}So all that's guaranteed after the global flush() returns is that all state present prior to when flush() is invoked, is moved to disk. Ie if addDocs are still happening concurrently then the DWPTs will start filling up again even while the "global flush" runs. That's fine.{quote} What if the user wants a guaranteed hard flush of all state up to the point of the flush call (won't they want this sometimes with getReader)? If we're flushing sequentially (without pausing all threads) we're removing that? Maybe we'll need to give the option of global lock/stop or sequential flush? Also I think we need to clear the thread bindings of a DWPT just prior to the flush of the DWPT? Otherwise (when multiple threads are mapped to a single DWPT) the other threads will wait on the [main] DWPT flush when they should be spinning up a new DWPT? Then, what happens to reusing the DWPT if we're flushing it, and we spin a new DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[] recycling? Maybe we need to share and sync the byte[] pooling between DWPTs, or will that noticeably affect indexing performance? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: LICENSE/NOTICE file contents
On Sat, Jan 8, 2011 at 10:06 AM, Yonik Seeley wrote: > There also wasn't any business about "and then add _nothing_ unless > you can find explicit policy documented > somewhere in the ASF that says it is required." I was following > examples from other projects and any docs I could find at the time, > but this was back in '06. > Not sure there is now either; this is likely just someone's opinion. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: LICENSE/NOTICE file contents
On Sat, Jan 8, 2011 at 8:10 AM, wrote: > From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff. > Yonik, do you remember why the HSQLDB and Jetty notice text was included in > Solr's NOTICE.txt? Nope - wasn't me that added the license stuff into NOTICE.txt ;-) But, including Jetty's NOTICE seems appropriate for our NOTICE. It's just the license parts of the HSQLDB and SLF4J that should be moved to LICENSE.txt There also wasn't any business about "and then add _nothing_ unless you can find explicit policy documented somewhere in the ASF that says it is required." I was following examples from other projects and any docs I could find at the time, but this was back in '06. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: LICENSE/NOTICE file contents
Because they are shipped with Solr. I don't see why it hurts to give people information about what's in the download. On Jan 8, 2011, at 8:10 AM, wrote: > From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff. > Yonik, do you remember why the HSQLDB and Jetty notice text was included in > Solr's NOTICE.txt? The incubator won't release ManifoldCF until we answer > this question. ;-) > > Karl > > > From: ext Robert Muir [rcm...@gmail.com] > Sent: Saturday, January 08, 2011 7:11 AM > To: dev@lucene.apache.org > Subject: Re: LICENSE/NOTICE file contents > > You are probably right... the LICENSE.txt also contains many instances > of incorrect capitalization, I noticed that all versions of of this > file I can find anywhere have this problem :) > > On Sat, Jan 8, 2011 at 6:14 AM, wrote: >> This list might be interested to know that the current Solr LICENSE and >> NOTICE file contents are not Apache standard. The ManifoldCF project based >> its LICENSE and NOTICE files on the Solr ones and got the following icy >> reception in the incubator: >> >> The NOTICE file is still incorrect and includes a lot of unnecessary >> stuff. Understanding how to do releases with the correct legal files >> is one of the important parts of incubation and as this is the first >> release for the poddling i think this needs to be sorted out. >> >> For the NOTICE file, start with the following text (between the ---'s): >> >> --- >> Apache ManifestCF >> Copyright 2010 The Apache Software Foundation >> >> This product includes software developed by >> The Apache Software Foundation (http://www.apache.org/). >> --- >> >> and then add _nothing_ unless you can find explicit policy documented >> somewhere in the ASF that says it is required. If someone wants to add >> something ask for the URL where the requirement is documented. The >> NOTICE file should only include required notices, the other text thats >> in the current NOTICE file could go in a README file, see >> http://www.apache.org/legal/src-headers.html#notice >> >> For the LICENSE file, it should start with the AL as the current one >> does, and then include the text for all the other licenses used in the >> distribution. Those license that are currently in the NOTICE file >> should be moved to the LICENSE file and then you need to verify that >> all the 3rd party dependencies in the src and binary distributions are >> also in the LICENSE files of those distributions. >> >> << >> >> Our NOTICE includes the following, which was taken from Solr (because we >> have a similar dependency). I'd like to know whether it is a valid thing to >> include, and where it says that "somewhere in Apache": >> >> = >> == Jetty Notice== >> = >> == >> Jetty Web Container >> Copyright 1995-2006 Mort Bay Consulting Pty Ltd >> == >> >> This product includes some software developed at The Apache Software >> Foundation (http://www.apache.org/). >> >> The javax.servlet package used by Jetty is copyright >> Sun Microsystems, Inc and Apache Software Foundation. It is >> distributed under the Common Development and Distribution License. >> You can obtain a copy of the license at >> https://glassfish.dev.java.net/public/CDDLv1.0.html. >> >> The UnixCrypt.java code ~Implements the one way cryptography used by >> Unix systems for simple password protection. Copyright 1996 Aki Yoshida, >> modified April 2001 by Iris Van den Broeke, Daniel Deville. >> >> The default JSP implementation is provided by the Glassfish JSP engine >> from project Glassfish http://glassfish.dev.java.net. 
Copyright 2005 >> Sun Microsystems, Inc. and portions Copyright Apache Software Foundation. >> >> Some portions of the code are Copyright: >> 2006 Tim Vernum >> 1999 Jason Gilbert. >> >> The jboss integration module contains some LGPL code. >> >> = >> == HSQLDB Notice == >> = >> >> For content, code, and products originally developed by Thomas Mueller and >> the Hypersonic SQL Group: >> >> Copyright (c) 1995-2000 by the Hypersonic SQL Group. >> All rights reserved. >> >> Redistribution and use in source and binary forms, with or without >> modification, are permitted provided that the following conditions are met: >> >> Redistributions of source code must retain the above copyright notice, this >> list of conditions and the following disclaimer. >> >> Redistributions in binary form must reproduce the
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979129#action_12979129 ] Michael McCandless commented on LUCENE-2324: bq. I guess we don't really need the global lock. A thread performing the "global flush" could still acquire each thread state before it starts flushing, but return a threadState to the pool once that particular threadState is done flushing? Good question... we could (in theory) also flush them concurrently? But, since we don't "own" the threads in IW, we can't easily do that, so I think no global lock, go through all DWPTs w/ current thread and flush, sequentially? So all that's guaranteed after the global flush() returns is that all state present prior to when flush() is invoked, is moved to disk. Ie if addDocs are still happening concurrently then the DWPTs will start filling up again even while the "global flush" runs. That's fine. {quote} A related question is: Do we want to piggyback on multiple threads when a global flush happens? Eg. Thread 1 called commit, Thread 2 shortly afterwards addDocument(). When should addDocument() happen? a) After all DWPTs finished flushing? b) After at least one DWPT finished flushing and is available again? c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1? a) is currently implemented, but I think not really what we want. b) is probably best for RT, because it means the lowest indexing latency for the new document to be added. c) probably means the best overall throughput (depending even on hardware like disk speed, etc) {quote} I think start simple -- the addDocument always happens? Ie it's never coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since flush is single threaded, there should always be a free DWPT? Longer term c) would be great, or, if IW has an ES then it'd send multiple flush jobs to the ES. {quote} For whatever option we pick, we'll have to carefully think about error handling. It's quite straightforward for a) (just commit all flushed segments to SegmentInfos when the global flush completed successfully). But for b) and c) it's unclear what should happen if a DWPT flush fails after some completed already successfully before. {quote} I think we should continue what we do today? Ie, if it's an 'aborting' exception, then the entire segment held by that DWPT is discarded? And we then throw this exc back to caller (and don't try to flush any other segments)? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. 
This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
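To make the control flow under discussion concrete, here is a sketch of the "no global lock, flush sequentially with the calling thread" option, including the aborting-exception behavior described above. All names below (DocumentsWriterPerThread, flushSegment, AbortingException) are illustrative stand-ins, not the actual realtime-branch code:

{noformat}
import java.io.IOException;
import java.util.List;

// Stand-in for a DWPT: a fully private in-RAM segment plus its own doc stores.
interface DocumentsWriterPerThread {
  void lock();                  // check this DWPT out of the indexing pool
  void unlock();                // return it so addDocument can pick it again
  void flushSegment() throws IOException, AbortingException;
  void abort();                 // discard this DWPT's entire in-RAM segment
}

class AbortingException extends Exception {}

class SequentialFlusher {
  // The calling thread flushes one DWPT at a time; only the DWPT currently
  // being flushed is unavailable, so concurrent addDocument calls always
  // find a free DWPT and new state keeps accumulating while the flush runs.
  void flushAll(List<DocumentsWriterPerThread> states) throws IOException, AbortingException {
    for (DocumentsWriterPerThread dwpt : states) {
      dwpt.lock();
      try {
        dwpt.flushSegment();
      } catch (AbortingException e) {
        dwpt.abort();  // as today: the whole segment is discarded...
        throw e;       // ...and the remaining DWPTs are not flushed
      } finally {
        dwpt.unlock();
      }
    }
  }
}
{noformat}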
[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979128#action_12979128 ] Michael McCandless commented on LUCENE-2854: The above patch applies to 3.x. For trunk I plan to remove SimilarityDelegator from core, and move it (deprecated) into contrib/queries/... (private to FuzzyLikeThisQ). At some point [later] we can fix FuzzyLikeThisQ to not use it... > Deprecate SimilarityDelegator and Similarity.lengthNorm > --- > > Key: LUCENE-2854 > URL: https://issues.apache.org/jira/browse/LUCENE-2854 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2854.patch > > > SimilarityDelegator is a back compat trap (see LUCENE-2828). > Apps should just [statically] subclass Sim or DefaultSim; if they really need > "runtime subclassing" then they can make their own app-level delegator. > Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm > in favor of computeNorm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets
The weird thing is, all of our collectors, IMO, are optimized for the non-paging scenario, when I would venture to guess that the very large majority of users out there do paging. AFAICT, about the only people who don't do paging are those who do deep, downstream analysis which requires them to retrieve 100's or 1000's or more of results at a time (I've seen as much as a million used in production) as part of a batch job. See https://issues.apache.org/jira/browse/LUCENE-2215 and https://issues.apache.org/jira/browse/SOLR-1726 for the issues tracking this. -Grant On Jan 8, 2011, at 7:11 AM, Earwin Burrfoot wrote: > On Mon, Jan 3, 2011 at 18:18, Yonik Seeley wrote: >> On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / >> Cominvent wrote: >>> The problem with large "start" is probably worse when sharding is involved. >>> Anyone know how the shard component goes about fetching >>> start=100&rows=10 from say 10 shards? Does it have to merge sorted >>> lists of 1mill+10 docsids from each shard which is the worst case? >> >> Yep, that's how it works today. >> > > Technically, if your docs have a non-biased (in regards to their > sort-value) distribution across shards, you can fetch much less than > topN docs from each shard. > I played with the idea, and it worked for me. Though later I dropped > the opto, as it complicated things somewhat and my users aren't > querying gazillions of docs often. > > > -- > Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) > Phone: +7 (495) 683-567-4 > ICQ: 104465785 > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
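To see why large offsets hurt with the current collectors, consider the standard top-N pattern sketched below (a generic illustration with made-up names, not Solr's actual collector code): serving start=S&rows=R means heaping S+R candidates and discarding the first S of them, and with sharding every shard pays that cost per query:

{noformat}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class DeepPagingSketch {
  // Collect the top (start + rows) scores, then throw the first `start` away.
  // For start=1000000, rows=10 the heap holds a million entries even though
  // only ten results are returned.
  static List<Float> page(Iterable<Float> scores, int start, int rows) {
    int heapSize = start + rows;
    PriorityQueue<Float> heap = new PriorityQueue<Float>(heapSize); // min-heap
    for (float score : scores) {
      heap.offer(score);
      if (heap.size() > heapSize) {
        heap.poll(); // evict the current lowest of the top start+rows
      }
    }
    List<Float> best = new ArrayList<Float>(heap);
    Collections.sort(best, Collections.reverseOrder()); // highest first
    return best.subList(Math.min(start, best.size()),
                        Math.min(start + rows, best.size()));
  }
}
{noformat}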
RE: LICENSE/NOTICE file contents
From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff. Yonik, do you remember why the HSQLDB and Jetty notice text was included in Solr's NOTICE.txt? The incubator won't release ManifoldCF until we answer this question. ;-) Karl ________________________________ From: ext Robert Muir [rcm...@gmail.com] Sent: Saturday, January 08, 2011 7:11 AM To: dev@lucene.apache.org Subject: Re: LICENSE/NOTICE file contents You are probably right... the LICENSE.txt also contains many instances of incorrect capitalization; I noticed that all versions of this file I can find anywhere have this problem :) On Sat, Jan 8, 2011 at 6:14 AM, wrote: > This list might be interested to know that the current Solr LICENSE and > NOTICE file contents are not Apache standard. The ManifoldCF project based > its LICENSE and NOTICE files on the Solr ones and got the following icy > reception in the incubator: > >>> > The NOTICE file is still incorrect and includes a lot of unnecessary > stuff. Understanding how to do releases with the correct legal files > is one of the important parts of incubation and as this is the first > release for the podling I think this needs to be sorted out. > > For the NOTICE file, start with the following text (between the ---'s): > > --- > Apache ManifoldCF > Copyright 2010 The Apache Software Foundation > > This product includes software developed by > The Apache Software Foundation (http://www.apache.org/). > --- > > and then add _nothing_ unless you can find explicit policy documented > somewhere in the ASF that says it is required. If someone wants to add > something ask for the URL where the requirement is documented. The > NOTICE file should only include required notices, the other text that's > in the current NOTICE file could go in a README file, see > http://www.apache.org/legal/src-headers.html#notice > > For the LICENSE file, it should start with the AL as the current one > does, and then include the text for all the other licenses used in the > distribution. Those licenses that are currently in the NOTICE file > should be moved to the LICENSE file and then you need to verify that > all the 3rd party dependencies in the src and binary distributions are > also in the LICENSE files of those distributions. > > << > > Our NOTICE includes the following, which was taken from Solr (because we have > a similar dependency). I'd like to know whether it is a valid thing to > include, and where it says that "somewhere in Apache": > >>> > = > == Jetty Notice == > = > == > Jetty Web Container > Copyright 1995-2006 Mort Bay Consulting Pty Ltd > == > > This product includes some software developed at The Apache Software > Foundation (http://www.apache.org/). > > The javax.servlet package used by Jetty is copyright > Sun Microsystems, Inc and Apache Software Foundation. It is > distributed under the Common Development and Distribution License. > You can obtain a copy of the license at > https://glassfish.dev.java.net/public/CDDLv1.0.html. > > The UnixCrypt.java code implements the one way cryptography used by > Unix systems for simple password protection. Copyright 1996 Aki Yoshida, > modified April 2001 by Iris Van den Broeke, Daniel Deville. > > The default JSP implementation is provided by the Glassfish JSP engine > from project Glassfish http://glassfish.dev.java.net. 
> > = > == HSQLDB Notice == > = > > For content, code, and products originally developed by Thomas Mueller and > the Hypersonic SQL Group: > > Copyright (c) 1995-2000 by the Hypersonic SQL Group. > All rights reserved. > > Redistribution and use in source and binary forms, with or without > modification, are permitted provided that the following conditions are met: > > Redistributions of source code must retain the above copyright notice, this > list of conditions and the following disclaimer. > > Redistributions in binary form must reproduce the above copyright notice, > this list of conditions and the following disclaimer in the documentation > and/or other materials provided with the distribution. > > Neither the name of the Hypersonic SQL Group nor the names of its > contributors may be used to endorse or promote products derived from this
[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm
[ https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2854: --- Attachment: LUCENE-2854.patch I think we should simply make a hard break on the Sim.lengthNorm -> computeNorm cutover. Subclassing sim is an expert thing, and, I'd rather apps see a compilation error on upgrade so that they realize their lengthNorm wasn't being called this whole time because of LUCENE-2828 (and that they must now cutover to computeNorm). So I made lengthNorm final (and throws UOE), computeNorm abstract. I deprecated SimilarityDelegator, and fixed BQ to not use it anymore. The only other use is FuzzyLikeThisQuery, but fixing that is a little too involved for today. > Deprecate SimilarityDelegator and Similarity.lengthNorm > --- > > Key: LUCENE-2854 > URL: https://issues.apache.org/jira/browse/LUCENE-2854 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2854.patch > > > SimilarityDelegator is a back compat trap (see LUCENE-2828). > Apps should just [statically] subclass Sim or DefaultSim; if they really need > "runtime subclassing" then they can make their own app-level delegator. > Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm > in favor of computeNorm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
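In code, the hard break described above reads roughly as follows; this is a sketch of the intent rather than the patch itself (the real Similarity has many more methods; FieldInvertState is the existing computeNorm argument):

{noformat}
import org.apache.lucene.index.FieldInvertState;

// Sketch: lengthNorm is frozen so existing subclasses fail to compile
// (or fail fast if somehow called), forcing the cutover to computeNorm.
public abstract class Similarity {
  /** @deprecated Override {@link #computeNorm} instead, which subsumes this. */
  @Deprecated
  public final float lengthNorm(String fieldName, int numTokens) {
    throw new UnsupportedOperationException("please use computeNorm instead");
  }

  /** Now abstract: every Similarity must implement norm computation here. */
  public abstract float computeNorm(String field, FieldInvertState state);
}
{noformat}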
[jira] Commented: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
[ https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979118#action_12979118 ] Michael McCandless commented on LUCENE-2831: bq. It seems we also need to migrate FieldComparator to use ReaderContext (eventually AtomicReaderContext)? +1 And also Collector? > Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context > - > > Key: LUCENE-2831 > URL: https://issues.apache.org/jira/browse/LUCENE-2831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer > Fix For: 4.0 > > Attachments: LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, > LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch > > > Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, > boolean, boolean) we should / could revise the API and pass in a struct that > has parent reader, sub reader, ord of that sub. The ord mapping plus the > context with its parent would make several issues way easier. See > LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: LICENSE/NOTICE file contents
You are probably right... the LICENSE.txt also contains many instances of incorrect capitalization; I noticed that all versions of this file I can find anywhere have this problem :) On Sat, Jan 8, 2011 at 6:14 AM, wrote: > This list might be interested to know that the current Solr LICENSE and > NOTICE file contents are not Apache standard. The ManifoldCF project based > its LICENSE and NOTICE files on the Solr ones and got the following icy > reception in the incubator: > >>> > The NOTICE file is still incorrect and includes a lot of unnecessary > stuff. Understanding how to do releases with the correct legal files > is one of the important parts of incubation and as this is the first > release for the podling I think this needs to be sorted out. > > For the NOTICE file, start with the following text (between the ---'s): > > --- > Apache ManifoldCF > Copyright 2010 The Apache Software Foundation > > This product includes software developed by > The Apache Software Foundation (http://www.apache.org/). > --- > > and then add _nothing_ unless you can find explicit policy documented > somewhere in the ASF that says it is required. If someone wants to add > something ask for the URL where the requirement is documented. The > NOTICE file should only include required notices, the other text that's > in the current NOTICE file could go in a README file, see > http://www.apache.org/legal/src-headers.html#notice > > For the LICENSE file, it should start with the AL as the current one > does, and then include the text for all the other licenses used in the > distribution. Those licenses that are currently in the NOTICE file > should be moved to the LICENSE file and then you need to verify that > all the 3rd party dependencies in the src and binary distributions are > also in the LICENSE files of those distributions. > > << > > Our NOTICE includes the following, which was taken from Solr (because we have > a similar dependency). I'd like to know whether it is a valid thing to > include, and where it says that "somewhere in Apache": > >>> > = > == Jetty Notice == > = > == > Jetty Web Container > Copyright 1995-2006 Mort Bay Consulting Pty Ltd > == > > This product includes some software developed at The Apache Software > Foundation (http://www.apache.org/). > > The javax.servlet package used by Jetty is copyright > Sun Microsystems, Inc and Apache Software Foundation. It is > distributed under the Common Development and Distribution License. > You can obtain a copy of the license at > https://glassfish.dev.java.net/public/CDDLv1.0.html. > > The UnixCrypt.java code implements the one way cryptography used by > Unix systems for simple password protection. Copyright 1996 Aki Yoshida, > modified April 2001 by Iris Van den Broeke, Daniel Deville. > > The default JSP implementation is provided by the Glassfish JSP engine > from project Glassfish http://glassfish.dev.java.net. Copyright 2005 > Sun Microsystems, Inc. and portions Copyright Apache Software Foundation. > > Some portions of the code are Copyright: > 2006 Tim Vernum > 1999 Jason Gilbert. > > The jboss integration module contains some LGPL code. > > = > == HSQLDB Notice == > = > > For content, code, and products originally developed by Thomas Mueller and > the Hypersonic SQL Group: > > Copyright (c) 1995-2000 by the Hypersonic SQL Group. > All rights reserved. 
> > Redistribution and use in source and binary forms, with or without > modification, are permitted provided that the following conditions are met: > > Redistributions of source code must retain the above copyright notice, this > list of conditions and the following disclaimer. > > Redistributions in binary form must reproduce the above copyright notice, > this list of conditions and the following disclaimer in the documentation > and/or other materials provided with the distribution. > > Neither the name of the Hypersonic SQL Group nor the names of its > contributors may be used to endorse or promote products derived from this > software without specific prior written permission. > > THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" > AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE > ARE DISCLAIMED. IN NO EVENT SHALL THE HYPERSONIC SQL GROUP, > OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, > EXEMPLARY, OR CONSEQUE
Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley wrote: > On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / > Cominvent wrote: >> The problem with large "start" is probably worse when sharding is involved. >> Anyone know how the shard component goes about fetching >> start=100&rows=10 from say 10 shards? Does it have to merge sorted lists >> of 1mill+10 docsids from each shard which is the worst case? > > Yep, that's how it works today. > Technically, if your docs have a non-biased (in regards to their sort-value) distribution across shards, you can fetch much less than topN docs from each shard. I played with the idea, and it worked for me. Though later I dropped the opto, as it complicated things somewhat and my users aren't querying gazillions of docs often. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
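The optimization Earwin describes can be sketched as follows (illustrative only; this was never committed): with an unbiased sort-value distribution, each shard's expected contribution to the global top N is N divided by the shard count, so the coordinator can request that share plus a safety margin from each shard and fall back to the full N only when the merge suggests a shard was cut off:

{noformat}
class ShardFetchSketch {
  // How many rows to request from each shard for a global top-N merge.
  // safetyFactor > 1 pads the expected N/numShards share against skew;
  // fully biased data still needs the full topN, hence the cap.
  // E.g. topN=1000010 over 10 shards with factor 1.5 asks each shard
  // for 150002 rows instead of 1000010.
  static int perShardRows(int topN, int numShards, double safetyFactor) {
    int expectedShare = (int) Math.ceil((double) topN / numShards);
    int padded = (int) Math.ceil(expectedShare * safetyFactor);
    return Math.min(topN, padded);
  }
}
{noformat}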
Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/
On Fri, Jan 7, 2011 at 10:47 PM, wrote: > > + public static final Set EMPTY_STRING_SET = Collections.emptySet(); > + I don't know about this commit... I see a lot of EMPTY sets and maps defined statically here. There is no advantage to doing this; even the javadocs explain: Implementation note: Implementations of this method need not create a separate (Set|Map|List) object for each call. Using this method is likely to have comparable cost to using the like-named field. (Unlike this method, the field does not provide type safety.) I think we should be using the Collections methods, for example on your first file: Index: solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java === --- solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java (revision 1056691) +++ solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java (working copy) @@ -47,8 +47,6 @@ */ public abstract class AnalysisRequestHandlerBase extends RequestHandlerBase { - public static final Set EMPTY_STRING_SET = Collections.emptySet(); - public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception { rsp.add("analysis", doAnalysis(req)); } @@ -343,7 +341,7 @@ * */ public AnalysisContext(String fieldName, FieldType fieldType, Analyzer analyzer) { - this(fieldName, fieldType, analyzer, EMPTY_STRING_SET); + this(fieldName, fieldType, analyzer, Collections.emptySet()); } /** I - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
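To make the quoted javadoc concrete, the following is plain JDK behavior, nothing Solr-specific: the generic method hands back a correctly typed view of one shared immutable instance, so a cached EMPTY_STRING_SET constant saves neither an allocation nor any measurable time:

{noformat}
import java.util.Collections;
import java.util.Set;

class EmptySetDemo {
  public static void main(String[] args) {
    // Type-safe: the element type is inferred at the call site.
    Set<String> a = Collections.emptySet();

    // The raw field needs an unchecked cast to reach the same static type.
    @SuppressWarnings("unchecked")
    Set<String> b = (Set<String>) Collections.EMPTY_SET;

    // In the reference implementation both are the same shared immutable
    // instance, so no object is created per call.
    System.out.println(a == b); // true on the Sun/Oracle JDK
  }
}
{noformat}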
LICENSE/NOTICE file contents
This list might be interested to know that the current Solr LICENSE and NOTICE file contents are not Apache standard. The ManifoldCF project based its LICENSE and NOTICE files on the Solr ones and got the following icy reception in the incubator: >> The NOTICE file is still incorrect and includes a lot of unnecessary stuff. Understanding how to do releases with the correct legal files is one of the important parts of incubation and as this is the first release for the podling I think this needs to be sorted out. For the NOTICE file, start with the following text (between the ---'s): --- Apache ManifoldCF Copyright 2010 The Apache Software Foundation This product includes software developed by The Apache Software Foundation (http://www.apache.org/). --- and then add _nothing_ unless you can find explicit policy documented somewhere in the ASF that says it is required. If someone wants to add something ask for the URL where the requirement is documented. The NOTICE file should only include required notices, the other text that's in the current NOTICE file could go in a README file, see http://www.apache.org/legal/src-headers.html#notice For the LICENSE file, it should start with the AL as the current one does, and then include the text for all the other licenses used in the distribution. Those licenses that are currently in the NOTICE file should be moved to the LICENSE file and then you need to verify that all the 3rd party dependencies in the src and binary distributions are also in the LICENSE files of those distributions. << Our NOTICE includes the following, which was taken from Solr (because we have a similar dependency). I'd like to know whether it is a valid thing to include, and where it says that "somewhere in Apache": >> = == Jetty Notice == = == Jetty Web Container Copyright 1995-2006 Mort Bay Consulting Pty Ltd == This product includes some software developed at The Apache Software Foundation (http://www.apache.org/). The javax.servlet package used by Jetty is copyright Sun Microsystems, Inc and Apache Software Foundation. It is distributed under the Common Development and Distribution License. You can obtain a copy of the license at https://glassfish.dev.java.net/public/CDDLv1.0.html. The UnixCrypt.java code implements the one way cryptography used by Unix systems for simple password protection. Copyright 1996 Aki Yoshida, modified April 2001 by Iris Van den Broeke, Daniel Deville. The default JSP implementation is provided by the Glassfish JSP engine from project Glassfish http://glassfish.dev.java.net. Copyright 2005 Sun Microsystems, Inc. and portions Copyright Apache Software Foundation. Some portions of the code are Copyright: 2006 Tim Vernum 1999 Jason Gilbert. The jboss integration module contains some LGPL code. = == HSQLDB Notice == = For content, code, and products originally developed by Thomas Mueller and the Hypersonic SQL Group: Copyright (c) 1995-2000 by the Hypersonic SQL Group. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
Neither the name of the Hypersonic SQL Group nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE HYPERSONIC SQL GROUP, OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. This software cons