[jira] [Commented] (LUCENE-2100) Make contrib analyzers final
[ https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034555#comment-13034555 ]

Esmond Pitt commented on LUCENE-2100:
-------------------------------------

Many thanks.

> Make contrib analyzers final
> ----------------------------
>
>                 Key: LUCENE-2100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 2.9, 2.9.1, 3.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
> The analyzers in contrib/analyzers should all be marked final. None of the
> Analyzers should ever be subclassed - users should build their own analyzers
> if a different combination of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2100) Make contrib analyzers final
[ https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034549#comment-13034549 ]

Robert Muir commented on LUCENE-2100:
-------------------------------------

Esmond: hi, what you are doing here is exactly the reason why we made it final.

By subclassing StandardAnalyzer in this way, the indexer is no longer able to reuse token streams, making analysis very slow and inefficient.

The easiest way to get your PorterStemAnalyzer is to just use EnglishAnalyzer, which does just this. Otherwise, if you really want to do it yourself, do it like this:

{noformat}
Analyzer analyzer = new ReusableAnalyzerBase() {
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(...);
    TokenStream filteredStream = new StandardFilter(tokenizer, ...);
    filteredStream = new LowerCaseFilter(filteredStream, ...);
    filteredStream = new StopFilter(filteredStream, ...);
    filteredStream = new PorterStemFilter(filteredStream, ...);
    return new TokenStreamComponents(tokenizer, filteredStream);
  }
};
{noformat}

Please see LUCENE-3055 for more examples and a more thorough explanation. The good news is that if you implement your analyzer like this, you will see performance improvements!
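Robert's reuse argument can be illustrated without Lucene at all. The following toy model (hypothetical ToyAnalyzer/ReuseDemo classes, not Lucene's actual API) sketches why a final stream-building entry point lets the pipeline be built once and reused, while an overridable tokenStream() would force a rebuild on every call:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model (not Lucene's real API): the base class caches the components
// built by createComponents() and hands them back on every call, instead of
// rebuilding the whole pipeline for each field of each document.
abstract class ToyAnalyzer {
    private Object cached;                        // per-analyzer cached pipeline
    static final AtomicInteger builds = new AtomicInteger();

    protected abstract Object createComponents(); // subclasses define the pipeline

    // Final, so subclasses cannot break the caching by overriding it.
    final Object tokenStream() {
        if (cached == null) {
            cached = createComponents();          // built once, then reused
            builds.incrementAndGet();
        }
        return cached;
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        ToyAnalyzer a = new ToyAnalyzer() {
            protected Object createComponents() { return new Object(); }
        };
        for (int i = 0; i < 1000; i++) a.tokenStream();  // 1000 "documents"
        System.out.println("pipelines built: " + ToyAnalyzer.builds.get());
        // prints "pipelines built: 1"
    }
}
```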
[jira] [Commented] (LUCENE-2100) Make contrib analyzers final
[ https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034544#comment-13034544 ]

Esmond Pitt commented on LUCENE-2100:
-------------------------------------

Steve

Thanks. Maybe you could have a look at this. How do you suggest I recode it? I wrote this 7 years ago and cannot now remember anything about it. Quite possibly the entire thing is now obsolete, but I've been carting it around since before Lucene was even at Apache. All I've ever done is adjust the version number.

{code}
public class PorterStemAnalyzer extends StandardAnalyzer
{
    /**
     * Construct a new instance of PorterStemAnalyzer.
     */
    public PorterStemAnalyzer()
    {
        super(Version.LUCENE_30);
    }

    @Override
    public final TokenStream tokenStream(String fieldName, Reader reader)
    {
        return new PorterStemFilter(super.tokenStream(fieldName, reader));
    }
}
{code}

EJP
[jira] [Commented] (LUCENE-2100) Make contrib analyzers final
[ https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034540#comment-13034540 ]

Steven Rowe commented on LUCENE-2100:
-------------------------------------

Hi Esmond,

Take a look at [the source code for StandardAnalyzer|http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_3_1/lucene/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java?view=markup]. Fewer than 50 lines of code there, if you take out the comments. Copy/paste suddenly seems doable.

Lucene's Analyzers are best thought of as examples.

Steve
[jira] [Resolved] (LUCENE-3107) Binary compatibility broken b/w 3.03 and 3.1.0
[ https://issues.apache.org/jira/browse/LUCENE-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved LUCENE-3107.
---------------------------------
    Resolution: Invalid

From [item #8 in the "Changes in backward compatibility policy" section in the 3.1.0 CHANGES.txt|http://lucene.apache.org/java/3_1_0/changes/Changes.html#3.1.0.changes_in_backwards_compatibility_policy]:

{quote}
LUCENE-2372, LUCENE-2389: StandardAnalyzer, KeywordAnalyzer, PerFieldAnalyzerWrapper, WhitespaceTokenizer are now final. Also removed the now obsolete and deprecated Analyzer.setOverridesTokenStreamMethod(). Analyzer and TokenStream base classes now have an assertion in their ctor, that check subclasses to be final or at least have final implementations of incrementToken(), tokenStream(), and reusableTokenStream().
{quote}

> Binary compatibility broken b/w 3.03 and 3.1.0
> ----------------------------------------------
>
>                 Key: LUCENE-3107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3107
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index, core/other
>    Affects Versions: 3.1
>        Environment: Windows Vista Microsoft Windows [Version 6.1.7600]
>                     java version "1.6.0_24"
>                     Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
>                     Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
>            Reporter: Esmond Pitt
>            Priority: Blocker
>
> StandardAnalyzer became final between 3.0.3 and 3.1.0. Unacceptable binary
> incompatibility. See my comment in LUCENE-2100.
[jira] [Updated] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance
[ https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2736:
-------------------------------
    Attachment: LUCENE-2736.patch

Patch with Javadocs fixes. I will commit it later today.

> Wrong implementation of DocIdSetIterator.advance
> ------------------------------------------------
>
>                 Key: LUCENE-2736
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2736
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 3.2, 4.0
>            Reporter: Hardy Ferentschik
>            Assignee: Shai Erera
>         Attachments: LUCENE-2736.patch
>
> Implementations of {{DocIdSetIterator}} behave differently when advance() is
> called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and
> {{SortedVIntList}}, only {{SortedVIntList}} passes the test:
>
> {code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
> ...
> public void testAdvanceWithOpenBitSet() throws IOException {
>     DocIdSet idSet = new OpenBitSet( new long[] { 1121 }, 1 ); // bits 0, 5, 6, 10
>     assertAdvance( idSet );
> }
>
> public void testAdvanceDocIdBitSet() throws IOException {
>     BitSet bitSet = new BitSet();
>     bitSet.set( 0 );
>     bitSet.set( 5 );
>     bitSet.set( 6 );
>     bitSet.set( 10 );
>     DocIdSet idSet = new DocIdBitSet(bitSet);
>     assertAdvance( idSet );
> }
>
> public void testAdvanceWithSortedVIntList() throws IOException {
>     DocIdSet idSet = new SortedVIntList( 0, 5, 6, 10 );
>     assertAdvance( idSet );
> }
>
> private void assertAdvance(DocIdSet idSet) throws IOException {
>     DocIdSetIterator iter = idSet.iterator();
>     int docId = iter.nextDoc();
>     assertEquals( "First doc id should be 0", 0, docId );
>     docId = iter.nextDoc();
>     assertEquals( "Second doc id should be 5", 5, docId );
>     docId = iter.advance( 5 );
>     assertEquals( "Advancing iterator should return the next doc id", 6, docId );
> }
> {code}
>
> The javadoc for {{advance}} says:
> {quote}
> Advances to the first *beyond* the current whose document number is greater
> than or equal to _target_.
> {quote}
> This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the
> other two don't.
>
> Just looking at the {{DocIdBitSet}} implementation, advance is implemented as:
> {code}
> bitSet.nextSetBit(target);
> {code}
> where the docs of {{nextSetBit}} say:
> {quote}
> Returns the index of the first bit that is set to true that occurs *on or
> after* the specified starting index
> {quote}
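The "on or after" behavior of {{java.util.BitSet.nextSetBit}} quoted above can be checked with plain JDK code: with bits 0, 5, 6 and 10 set, asking from index 5 returns 5 itself, which is why a DocIdBitSet-style advance(5) built on it yields 5 rather than the 6 the test expects (a sketch, hypothetical NextSetBitDemo class):

```java
import java.util.BitSet;

public class NextSetBitDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        for (int b : new int[] { 0, 5, 6, 10 }) bits.set(b);

        // nextSetBit is "on or after": asking from index 5 returns 5 itself,
        // so an advance(5) delegating to it cannot move *beyond* the current doc.
        System.out.println(bits.nextSetBit(5));  // prints 5
        System.out.println(bits.nextSetBit(6));  // prints 6
    }
}
```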
[jira] [Created] (LUCENE-3107) Binary compatibility broken b/w 3.03 and 3.1.0
Binary compatibility broken b/w 3.03 and 3.1.0
----------------------------------------------

                 Key: LUCENE-3107
                 URL: https://issues.apache.org/jira/browse/LUCENE-3107
             Project: Lucene - Java
          Issue Type: Bug
          Components: core/index, core/other
    Affects Versions: 3.1
         Environment: Windows Vista Microsoft Windows [Version 6.1.7600]
                      java version "1.6.0_24"
                      Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
                      Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
            Reporter: Esmond Pitt
            Priority: Blocker

StandardAnalyzer became final between 3.0.3 and 3.1.0. Unacceptable binary incompatibility. See my comment in LUCENE-2100.
Bulk changing issues in JIRA
Hi

If you ever wondered how to bulk change issues in JIRA, here's the procedure:

* View a list of issues, e.g. by query/filter
* At the top-right, click on "Tools" and select the bulk change option (screenshots omitted)
* The screen changes so that next to each issue there's a check box
* Mark all the issues you want to change and click "Next"
* Select the operation (e.g. Edit)
* The next screen (after choosing operation "Edit") lets you edit the issues. Note the option at the bottom: deselect it if you don't want to spam the list :).

FYI,
Shai
[jira] [Commented] (LUCENE-2100) Make contrib analyzers final
[ https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034530#comment-13034530 ]

Esmond Pitt commented on LUCENE-2100:
-------------------------------------

Did somebody implement this for 3.1.0? StandardAnalyzer became final between 3.0.3 and 3.1.0. This is *not acceptable.* Binary compatibility must be preserved, and to be frank I do not give a good goddam how ugly the code inside looks compared to this requirement.
[jira] [Updated] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance
[ https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2736:
-------------------------------
          Component/s:     (was: core/other)
                       core/search
    Affects Version/s:     (was: 3.0.2)
                       3.2
             Assignee: Shai Erera
[jira] [Commented] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance
[ https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034527#comment-13034527 ]

Shai Erera commented on LUCENE-2736:
------------------------------------

Thanks Hardy for reporting that. But I think this works exactly as documented? Note that the javadocs of advance() state "*beyond* the current whose document number is *greater than or equal* to target". Also, there's a note in the javadocs:

{noformat}
 * NOTE: when target ≤ current implementations may opt
 * not to advance beyond their current {@link #docID()}.
{noformat}

I think that the word 'beyond' is confusing here. Perhaps we can modify the javadocs to: "Advances to the first document whose number is greater than or equal to target".

If there are no objections, or better wording, I'll commit this later today, but only to 3.2/4.0 and not 3.0.2.
[jira] [Updated] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false
[ https://issues.apache.org/jira/browse/LUCENE-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3106:
--------------------------------
    Attachment: LUCENE-3106.patch

Here's the obvious solution, but there might be a cleaner way to rewrite its loop...

> commongrams filter calls incrementToken() after it returns false
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3106
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3106
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3106.patch, LUCENE-3106_test.patch
>
> In LUCENE-3064, we beefed up MockTokenizer with assertions, and I started
> cutting over some analysis tests to use MockTokenizer for better coverage.
> The commongrams tests fail, because they call incrementToken() after it
> already returns false.
> In general it's my understanding consumers should not do this (and I know of a
> few tokenizers that will actually throw exceptions if you do this, just like
> Java iterators and such).
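The consumer contract described above is analogous to java.util.Iterator, as the issue text notes. A JDK-only sketch (hypothetical ExhaustedDemo class, not the commongrams code itself) of what "reading past the end" looks like in the iterator world:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ExhaustedDemo {
    public static void main(String[] args) {
        Iterator<String> it = List.of("common", "grams").iterator();
        while (it.hasNext()) it.next();   // consume the stream to the end

        // Calling next() again after exhaustion violates the contract,
        // just as calling incrementToken() after it has returned false does;
        // some producers throw, others misbehave silently.
        try {
            it.next();
        } catch (NoSuchElementException e) {
            System.out.println("caught: consumer read past the end");
        }
    }
}
```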
[jira] [Updated] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false
[ https://issues.apache.org/jira/browse/LUCENE-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3106:
--------------------------------
    Component/s: modules/analysis
[jira] [Resolved] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-2520.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 3.2

Committed to trunk and 3x. Thanks for bringing this to our attention Benson!

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2520
>                 URL: https://issues.apache.org/jira/browse/SOLR-2520
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Benson Margulies
>             Fix For: 3.2
>
>         Attachments: SOLR-2520.patch
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains Unicode characters that are valid in JSON but not
> valid in Javascript, and you use the query option to ask for JSONP
> (json.wrf), Solr does *not* escape them, resulting in content that explodes
> on contact with browsers. That is, there are certain Unicode characters that
> are valid JSON but invalid in Javascript source, and a JSONP response is
> javascript source, to be incorporated in an HTML script tag. Further
> investigation suggests that only one character is a problem here: U+2029
> must be represented as \u2029 instead of left 'as-is'.
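A minimal sketch of the escaping the issue calls for (hypothetical JsonpEscapeDemo class, not Solr's actual JSONResponseWriter code): when the response will be evaluated as JavaScript, the paragraph separator U+2029 (and its sibling line separator U+2028) must be written as a \uXXXX escape, because raw it terminates a line in JavaScript source even though it is legal inside a JSON string.

```java
public class JsonpEscapeDemo {
    // Replace the JS line/paragraph separators with \uXXXX escapes;
    // everything else passes through unchanged.
    static String escapeForJavascript(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u2028' || c == '\u2029') {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String field = "before\u2029after";
        // A JSONP wrapper function call whose payload is now safe JS source:
        System.out.println("wrf({\"v\":\"" + escapeForJavascript(field) + "\"})");
        // prints: wrf({"v":"before\u2029after"})
    }
}
```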
[jira] [Created] (SOLR-2522) Change max() and min() to work on multiValued fields
Change max() and min() to work on multiValued fields
----------------------------------------------------

                 Key: SOLR-2522
                 URL: https://issues.apache.org/jira/browse/SOLR-2522
             Project: Solr
          Issue Type: Improvement
            Reporter: Bill Bell

Switch max() and min() functions to work on multiValued fields so we can do sort=min(fieldname) asc, and the sort would work on multiValued fields...
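For illustration only, a JDK-only sketch of the requested semantics (hypothetical MinSortDemo class, not Solr code): sort=min(fieldname) asc would order each document by the smallest of its field values.

```java
import java.util.*;

public class MinSortDemo {
    // Order document ids by the minimum value of their multiValued field.
    static List<String> sortByMin(Map<String, List<Integer>> docs) {
        List<String> sorted = new ArrayList<>(docs.keySet());
        sorted.sort(Comparator.comparing(d -> Collections.min(docs.get(d))));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> docs = new LinkedHashMap<>();
        docs.put("docA", List.of(7, 3, 9));   // min = 3
        docs.put("docB", List.of(4, 8));      // min = 4
        docs.put("docC", List.of(1, 12));     // min = 1
        System.out.println(sortByMin(docs));  // prints [docC, docA, docB]
    }
}
```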
[jira] [Updated] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false
[ https://issues.apache.org/jira/browse/LUCENE-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3106:
--------------------------------
    Attachment: LUCENE-3106_test.patch

Patch with the test modifications to produce the failure.
[jira] [Created] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false
commongrams filter calls incrementToken() after it returns false
----------------------------------------------------------------

                 Key: LUCENE-3106
                 URL: https://issues.apache.org/jira/browse/LUCENE-3106
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Robert Muir
             Fix For: 3.2, 4.0

In LUCENE-3064, we beefed up MockTokenizer with assertions, and I started cutting over some analysis tests to use MockTokenizer for better coverage. The commongrams tests fail, because they call incrementToken() after it already returns false.

In general it's my understanding consumers should not do this (and I know of a few tokenizers that will actually throw exceptions if you do this, just like Java iterators and such).
[jira] [Commented] (SOLR-2339) No error reported when sorting on a multiValued field
[ https://issues.apache.org/jira/browse/SOLR-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034492#comment-13034492 ]

Bill Bell commented on SOLR-2339:
---------------------------------

Guys,

How are we going to support sorting on multiValued fields? Would a function work for this?

> No error reported when sorting on a multiValued field
> -----------------------------------------------------
>
>                 Key: SOLR-2339
>                 URL: https://issues.apache.org/jira/browse/SOLR-2339
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2339.patch, SOLR-2339.patch
>
> In the past, Solr has relied on the underlying FieldCache to throw an error
> in situations where sorting on a field was not possible. However, LUCENE-2142
> has changed this, so that FieldCache never throws an error.
> In order to maintain the functionality of past Solr releases (ie: error when
> users attempt to sort on a field that we know will produce meaningless
> results) we should add some sort of check at the Solr level.
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034478#comment-13034478 ]

Mark Miller commented on LUCENE-152:
------------------------------------

bq. More specifically: compile time dependencies on compiled BSD libraries are fine, but actually incorporating and releasing code that is under a BSD license is something we aren't supposed to do (last time i checked)

Code is fine too, afaik: http://www.apache.org/legal/3party.html

> [PATCH] KStem for Lucene
> ------------------------
>
>                 Key: LUCENE-152
>                 URL: https://issues.apache.org/jira/browse/LUCENE-152
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: unspecified
>        Environment: Operating System: other
>                     Platform: Other
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara"
>
> Original email:
>
> Hi all,
>
> I have ported the kstem stemmer to Java and incorporated it into
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
> Just click on "KStem Java Implementation" (you will need to register
> your e-mail, for free of course, with the CIIR --Center for Intelligent
> Information Retrieval, UMass-- and get an access code).
>
> Content of Kstem.jar:
>
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
>
> KStemData1.java, ..., KStemData8.java contain several lists of words
> used by KStem.
> KStemmer.java implements the KStem algorithm.
> KStemFilter.java extends TokenFilter, applying KStem.
>
> To compile:
> unjar the file Kstem.jar to Lucene's "src" directory, and compile it
> there.
>
> What is KStem?
> A stemmer designed by Bob Krovetz (for more information see
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf).
>
> Copyright issues:
> This is open source. The actual license agreement is included at the
> top of every source file.
>
> Any comments/questions/suggestions are welcome,
>
> Sergio Guzman-Lara
> Senior Research Fellow
> CIIR UMass
[jira] [Updated] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liam O'Boyle updated SOLR-2424:
-------------------------------
    Attachment: ET2000 Service Manual.pdf

This file has problems which trigger this bug.

> extracted text from tika has no spaces
> --------------------------------------
>
>                 Key: SOLR-2424
>                 URL: https://issues.apache.org/jira/browse/SOLR-2424
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1
>            Reporter: Yonik Seeley
>         Attachments: ET2000 Service Manual.pdf
>
> Try this:
> curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" -F "tutorial=@tutorial.pdf"
> And you get text output w/o spaces:
> "ThisdocumentcoversthebasicsofrunningSolru"...
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034456#comment-13034456 ] Mark Miller commented on LUCENE-152: To extract a bit for clarity: {quote} This form is not for new projects. This is for projects and PMCs that have already been created and are receiving a code donation into an existing codebase. Any code that was developed outside of the ASF SVN repository and our public mailing lists must be processed like this, even if the external developer is already an ASF committer. {quote} > [PATCH] KStem for Lucene > > > Key: LUCENE-152 > URL: https://issues.apache.org/jira/browse/LUCENE-152 > Project: Lucene - Java > Issue Type: Improvement > Components: modules/analysis >Affects Versions: unspecified > Environment: Operating System: other > Platform: Other >Reporter: Otis Gospodnetic >Priority: Minor > > September 10th 2003 contribution from "Sergio Guzman-Lara" > > Original email: > Hi all, > I have ported the kstem stemmer to Java and incorporated it to > Lucene. You can get the source code (Kstem.jar) from the following website: > http://ciir.cs.umass.edu/downloads/ > Just click on "KStem Java Implementation" (you will need to register > your e-mail, for free of course, with the CIIR --Center for Intelligent > Information Retrieval, UMass -- and get an access code). 
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034454#comment-13034454 ] Mark Miller commented on LUCENE-152: bq. Uh... that may be a stretch. It's what the incubator seems to recommend, and the side we've erred on in the past. http://incubator.apache.org/ip-clearance/index.html If it was developed outside of Apache, we don't really know its IP history, and that's something we want to take seriously. -- This message is automatically generated by JIRA.
Re: 3.2.0 (or 3.1.1)
: > I don't disagree, but the devil's advocate argument is "given the relative : > size of the change sets, testing a 3.1.1 release is likely to be easier : > than testing a 3.2 release, and the patches committed to the 3.1.x branch : > are less likely to have introduced new bugs (because they only contain bug : > fixes and not new features)" : that's true, but 3.2 also has better test coverage than 3.1.1 (a couple : TestIdeas were worked off the list), and it's in Hudson's rotation : every half hour. +1 ... no argument. -Hoss
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034450#comment-13034450 ] Hoss Man commented on LUCENE-152: - bq. even if it's Apache 2 licensed code. Uh... that may be a stretch. More specifically: compile-time dependencies on compiled BSD libraries are fine, but actually incorporating and *releasing* code that is under a BSD license is something we aren't supposed to do (last time I checked). -- This message is automatically generated by JIRA.
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034439#comment-13034439 ] Mark Miller commented on LUCENE-152: The general rule is that if it's a fair amount of code, and it was developed outside of the Apache system, we want a software grant - even if it's Apache 2 licensed code. -- This message is automatically generated by JIRA.
Re: 3.2.0 (or 3.1.1)
On Mon, May 16, 2011 at 7:41 PM, Chris Hostetter wrote: > I don't disagree, but the devil's advocate argument is "given the relative > size of the change sets, testing a 3.1.1 release is likely to be easier > than testing a 3.2 release, and the patches committed to the 3.1.x branch > are less likely to have introduced new bugs (because they only contain bug > fixes and not new features)" that's true, but 3.2 also has better test coverage than 3.1.1 (a couple TestIdeas were worked off the list), and it's in Hudson's rotation every half hour. additionally there's at least one or two test coverage things we can backport from trunk to 3.2 just because... which seems more productive than backporting things from branch_3x to a bugfix 3.1.1 branch that isn't even being tested by Hudson.
Re: 3.2.0 (or 3.1.1)
: My vote would be to just spend our time on 3.2. people get bugfixes, : better test coverage, and a couple of new features and optimizations, : too. : Is it really going to be harder to release 3.2 than to release 3.1.1? I don't disagree, but the devil's advocate argument is "given the relative size of the change sets, testing a 3.1.1 release is likely to be easier than testing a 3.2 release, and the patches committed to the 3.1.x branch are less likely to have introduced new bugs (because they only contain bug fixes and not new features)" -Hoss
Re: 3.2.0 (or 3.1.1)
: And also, we should adopt that approach going forward (no more bug fix : releases for the stable branch, except for the last release before 4.0 : is out). That means updating the release TODO with e.g., not creating : a branch for 3.2.x, only tag it. When 4.0 is out, we branch 3.x.y out : of the last 3.x tag. I don't know that we need to box ourselves in ... if someone discovers a massively critical bug the day after 3.2 is released, it's totally reasonable/sensible to do a quick 3.2.1 release. That said: I don't know that we have to create the 3.2.x branch when we create the 3.2 tag ... we can certainly do a lazy instantiation as needed. Bottom line: 3.x.0 releases are still "feature releases on the stable api branch", and as long as we can maintain inertia on relatively rapid turnaround of feature releases then great -- but that doesn't mean we should completely rule out having 3.x.y "bug fix" releases. -Hoss
[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated SOLR-2520: --- Description: Please see http://timelessrepo.com/json-isnt-a-javascript-subset. If a stored field contains Unicode characters that are valid in Json but not valid in Javascript, and you use the query option to ask for JSONP (json.wrf), solr does *not* escape them, resulting in content that explodes on contact with browsers. That is, there are certain Unicode characters that are valid JSON but invalid in Javascript source, and a JSONP response is javascript source, to be incorporated in an HTML script tag. Further investigation suggests that only one character is a problem here: U+2029 must be represented as \u2029 instead of left 'as-is'. was: Please see http://timelessrepo.com/json-isnt-a-javascript-subset. If a stored field contains Unicode characters that are valid in Json but not valid in Javascript, and you use the query option to ask for jsonp (json.wrt), solr does *not* escape them characters, resulting in content that explodes on contact with browsers. That is, there are certain Unicode characters that are valid JSON but invalid in Javascript source, and a JSONP response is javascript source, to be incorporated in an HTML script tag. > JSONResponseWriter w/json.wrf can produce invalid javascript depending on > unicode chars in response data > > > Key: SOLR-2520 > URL: https://issues.apache.org/jira/browse/SOLR-2520 > Project: Solr > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Benson Margulies > Attachments: SOLR-2520.patch > > > Please see http://timelessrepo.com/json-isnt-a-javascript-subset. > If a stored field contains Unicode characters that are valid in Json but not > valid in Javascript, and you use the query option to ask for JSONP > (json.wrf), solr does *not* escape them, resulting in content that explodes > on contact with browsers. 
That is, there are certain Unicode characters that > are valid JSON but invalid in Javascript source, and a JSONP response is > javascript source, to be incorporated in an HTML script tag. Further > investigation suggests that only one character is a problem here: U+2029 > must be represented as \u2029 instead of left 'as-is'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
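The escaping fix described above can be sketched as a small post-processing step over the serialized JSON before it is wrapped in the JSONP callback. The class and method names here are hypothetical (not Solr's actual writer API); it handles U+2029 and its sibling U+2028, the two characters that are legal in JSON strings but are line terminators in JavaScript source:

```java
// Sketch: escape the characters that are valid in JSON string literals
// but invalid (unescaped) in JavaScript source, so a JSONP response
// remains parseable as a script.
public class JsonpEscaper {
    public static String escapeForJavascript(String json) {
        StringBuilder sb = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '\u2028') {
                sb.append("\\u2028"); // LINE SEPARATOR
            } else if (c == '\u2029') {
                sb.append("\\u2029"); // PARAGRAPH SEPARATOR
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Escaping is always safe here because `\u2028`/`\u2029` inside a JSON string decode back to the same characters, so plain-JSON consumers are unaffected.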
[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated SOLR-2520: --- Description: Please see http://timelessrepo.com/json-isnt-a-javascript-subset. If a stored field contains Unicode characters that are valid in Json but not valid in Javascript, and you use the query option to ask for jsonp (json.wrt), solr does *not* escape them characters, resulting in content that explodes on contact with browsers. That is, there are certain Unicode characters that are valid JSON but invalid in Javascript source, and a JSONP response is javascript source, to be incorporated in an HTML script tag. was: Please see http://timelessrepo.com/json-isnt-a-javascript-subset. If a stored field contains invalid Javascript characters, and you use the query option to ask for jsonp, solr does *not* escape some invalid Unicode characters, resulting in strings that explode on contact with browsers. > JSONResponseWriter w/json.wrf can produce invalid javascript depending on > unicode chars in response data > > > Key: SOLR-2520 > URL: https://issues.apache.org/jira/browse/SOLR-2520 > Project: Solr > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Benson Margulies > Attachments: SOLR-2520.patch > > > Please see http://timelessrepo.com/json-isnt-a-javascript-subset. > If a stored field contains Unicode characters that are valid in Json but not > valid in Javascript, and you use the query option to ask for jsonp > (json.wrt), solr does *not* escape them characters, resulting in content that > explodes on contact with browsers. That is, there are certain Unicode > characters that are valid JSON but invalid in Javascript source, and a JSONP > response is javascript source, to be incorporated in an HTML script tag. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-2520: --- Attachment: SOLR-2520.patch Here's a patch w/ simple test. > JSONResponseWriter w/json.wrf can produce invalid javascript depending on > unicode chars in response data > > > Key: SOLR-2520 > URL: https://issues.apache.org/jira/browse/SOLR-2520 > Project: Solr > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Benson Margulies > Attachments: SOLR-2520.patch > > > Please see http://timelessrepo.com/json-isnt-a-javascript-subset. > If a stored field contains invalid Javascript characters, and you use the > query option to ask for jsonp, solr does *not* escape some invalid Unicode > characters, resulting in strings that explode on contact with browsers. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2548) Remove all interning of field names from flex API
[ https://issues.apache.org/jira/browse/LUCENE-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034393#comment-13034393 ] Robert Muir commented on LUCENE-2548: - after seeing LUCENE-3105, i think we should take steps to remove this interning. it looks like this can probably be done safely, according to http://www.cs.umd.edu/~jfoster/papers/issre04.pdf , findbugs, PMD, and JLint all support looking for string equality with == or !=, so we should be able to review all occurrences. > Remove all interning of field names from flex API > - > > Key: LUCENE-2548 > URL: https://issues.apache.org/jira/browse/LUCENE-2548 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Uwe Schindler > Fix For: 4.0 > > > In previous versions of Lucene, interning of fields was important to minimize > string comparison cost when iterating TermEnums, to detect changes in field > name. As we separated field names from terms in flex, no query compares field > names anymore, so the whole performance problematic interning can be removed. > I will start with doing this, but we need to carefully review some places > e.g. in preflex codec. > Maybe before this issue we should remove the Term class completely. :-) > Robert? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
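To make the review concern concrete: `==` on field names is an identity check, so it is only correct while every name passes through `String.intern()`. A minimal standalone illustration (example strings chosen here, not Lucene code):

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = "title";                  // literal: interned by the JVM
        String b = new String("title");      // distinct object, equal contents
        System.out.println(a == b);          // false: different objects
        System.out.println(a == b.intern()); // true: intern() returns the canonical copy
        System.out.println(a.equals(b));     // true: content comparison, no interning needed
    }
}
```

This is why tools like FindBugs and PMD can flag every `==`/`!=` on strings: each occurrence must be audited and switched to `equals()` before the interning can be removed.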
[jira] [Commented] (SOLR-2445) unknown handler: standard
[ https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034394#comment-13034394 ] Koji Sekiguchi commented on SOLR-2445: -- Any objections about applying this trivial patch to 3.1.1? > unknown handler: standard > - > > Key: SOLR-2445 > URL: https://issues.apache.org/jira/browse/SOLR-2445 > Project: Solr > Issue Type: Bug >Affects Versions: 1.4.1, 3.1, 3.2, 4.0 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: SOLR-2445.patch, qt-form-jsp.patch > > > To reproduce the problem using example config, go form.jsp, use standard for > qt (it is default) then click Search. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Kristensson updated LUCENE-3105: - Attachment: LUCENE-3105.patch Patch file to eliminate String.intern() calls while opening indexReaders and closing indexWriters. > String.intern() calls slow down IndexWriter.close() and IndexReader.open() > for index with large number of unique field names > > > Key: LUCENE-3105 > URL: https://issues.apache.org/jira/browse/LUCENE-3105 > Project: Lucene - Java > Issue Type: Bug > Components: core/index >Affects Versions: 3.1 >Reporter: Mark Kristensson > Attachments: LUCENE-3105.patch > > > We have one index with several hundred thousand unique field names (we're > optimistic that Lucene 4.0 is flexible enough to allow us to change our index > design...) and found that opening an index writer and closing an index reader > results in horribly slow performance on that one index. I have isolated the > problem down to the calls to String.intern() that are used to allow for quick > string comparisons of field names throughout Lucene. These String.intern() > calls are unnecessary and can be replaced with a hashmap lookup. In fact, > StringHelper.java has its own hashmap implementation that it uses in > conjunction with String.intern(). Rather than using a one-off hashmap, I've > elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
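The replacement the issue describes can be sketched roughly as follows; the class and method names are hypothetical (the actual patch modifies Lucene's StringHelper), but the mechanism is the same: canonicalize field names through a ConcurrentHashMap so equal-content names map to one shared object without touching the JVM's intern table.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a ConcurrentHashMap-backed canonicalizer standing in for
// String.intern(). Lookups stay cheap even with hundreds of thousands
// of unique field names, and downstream identity (==) checks still work
// because equal names always resolve to the same object.
public class FieldNameCanonicalizer {
    private static final ConcurrentHashMap<String, String> CACHE =
        new ConcurrentHashMap<>();

    public static String canonicalize(String name) {
        String existing = CACHE.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }
}
```

Two calls with equal contents return the same reference, which is the property the interning previously provided.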
[jira] [Created] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson We have one index with several hundred thousand unique field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034368#comment-13034368 ] Steven Rowe commented on LUCENE-152: If the original sources are BSD licensed, is a software grant required to incorporate the sources into the Lucene/Solr source tree? -- This message is automatically generated by JIRA.
[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034352#comment-13034352 ] Benson Margulies commented on SOLR-2520: Yes, that looks like that. > JSONResponseWriter w/json.wrf can produce invalid javascript depending on > unicode chars in response data > > > Key: SOLR-2520 > URL: https://issues.apache.org/jira/browse/SOLR-2520 > Project: Solr > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Benson Margulies > > Please see http://timelessrepo.com/json-isnt-a-javascript-subset. > If a stored field contains invalid Javascript characters, and you use the > query option to ask for jsonp, solr does *not* escape some invalid Unicode > characters, resulting in strings that explode on contact with browsers. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3104) Hook up Automated Patch Checking for Lucene/Solr
Hook up Automated Patch Checking for Lucene/Solr Key: LUCENE-3104 URL: https://issues.apache.org/jira/browse/LUCENE-3104 Project: Lucene - Java Issue Type: Task Reporter: Grant Ingersoll It would be really great if we could get feedback to contributors sooner on many things that are basic (tests exist, patch applies cleanly, etc.) From Nigel Daley on builds@a.o {quote} I revamped the precommit testing in the fall so that it doesn't use Jira email anymore to trigger a build. The process is controlled by https://builds.apache.org/hudson/job/PreCommit-Admin/ which has some documentation up at the top of the job. You can look at the config of the job (do you have access?) to see what it's doing. Any project could use this same admin job -- you just need to ask me to add the project to the Jira filter used by the admin job (https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/12313474/SearchRequest-12313474.xml?tempMax=100 ) once you have the downstream job(s) setup for your specific project. For Hadoop we have 3 downstream builds configured which also have some documentation: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/ https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/ https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/ {quote} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3097) Post grouping faceting
[ https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034312#comment-13034312 ] Martijn van Groningen commented on LUCENE-3097: --- bq. Ie, we just have to insure, at indexing time, that docs within the same "group" are adjacent, if you want to be able to count by unique group values. This means that documents in the same group also need to be in the same segment, right? Or, if we use this mechanism for faceting, documents with the same facet need to be in the same segment? If that is true, it would make the collectors easier. The SentinelIntSet we use in the collectors is not necessary, because we can look up the norm from the DocIndexTerms. We won't find the same group in a different segment. On the other hand, with scalability in mind, it would be complex: documents in the same group need to be in the same segment, which makes indexing complex. > Post grouping faceting > -- > > Key: LUCENE-3097 > URL: https://issues.apache.org/jira/browse/LUCENE-3097 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Martijn van Groningen >Priority: Minor > Fix For: 3.2, 4.0 > > > This issue focuses on implementing post grouping faceting. > * How to handle multivalued fields. What field value to show with the facet. > * Where the facet counts should be based on > ** Facet counts can be based on the normal documents. Ungrouped counts. > ** Facet counts can be based on the groups. Grouped counts. > ** Facet counts can be based on the combination of group value and facet > value. Matrix counts. > And probably more implementation options. > The first two methods are implemented in the SOLR-236 patch. For the first > option it calculates a DocSet based on the individual documents from the > query result. For the second option it calculates a DocSet for all the most > relevant documents of a group. 
Once the DocSet is computed, the FacetComponent > and StatsComponent use the DocSet to create facets and statistics. > This last one is a bit more complex. I think it is best explained with an > example. Let's say we search on travel offers: > ||hotel||departure_airport||duration|| > |Hotel a|AMS|5| > |Hotel a|DUS|10| > |Hotel b|AMS|5| > |Hotel b|AMS|10| > If we group by hotel and have a facet for airport, most end users expect > (in my experience, of course) the following airport facet: > AMS: 2 > DUS: 1 > The above result can't be achieved by the first two methods. You either get > counts AMS:3 and DUS:1, or 1 for both airports.
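The "matrix counts" idea above boils down to counting, per facet value, the number of distinct groups that contain it. Below is a minimal stand-alone sketch of that counting logic in plain Java, using the hotel example as in-memory data; this is an illustration only, not the SOLR-236 patch's implementation, and the field names are hypothetical:

```java
import java.util.*;

public class GroupedFacetCounts {

    // For each facet value, count the number of distinct groups containing it.
    // rows are {groupValue, facetValue} pairs, one per document.
    static Map<String, Integer> matrixCounts(List<String[]> rows) {
        Map<String, Set<String>> groupsPerFacet = new HashMap<>();
        for (String[] row : rows) {
            groupsPerFacet.computeIfAbsent(row[1], k -> new HashSet<>()).add(row[0]);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : groupsPerFacet.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> offers = Arrays.asList(
            new String[] {"Hotel a", "AMS"},
            new String[] {"Hotel a", "DUS"},
            new String[] {"Hotel b", "AMS"},
            new String[] {"Hotel b", "AMS"});
        // AMS appears in 2 distinct hotels, DUS in 1 -- the counts end users expect.
        System.out.println(matrixCounts(offers));
    }
}
```

Counting doc IDs instead of distinct group values here would give the AMS:3 result that the first two methods produce.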
[jira] [Issue Comment Edited] (SOLR-2445) unknown handler: standard
[ https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1303#comment-1303 ] Gabriele Kahlout edited comment on SOLR-2445 at 5/16/11 8:48 PM: - trivial patch to form.jsp that leaves qt empty (useful for setup scripts and those that need to stick to a 3.1.0 revision). was (Author: simpatico): trivial patch to form.jsp that leaves qt empty (useful for setup scripts and those that need to stick to an 3.1.0 revision). > unknown handler: standard > - > > Key: SOLR-2445 > URL: https://issues.apache.org/jira/browse/SOLR-2445 > Project: Solr > Issue Type: Bug >Affects Versions: 1.4.1, 3.1, 3.2, 4.0 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: SOLR-2445.patch, qt-form-jsp.patch > > > To reproduce the problem using example config, go form.jsp, use standard for > qt (it is default) then click Search. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3096) MultiSearcher does not work correctly with Not on NumericRange
[ https://issues.apache.org/jira/browse/LUCENE-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034289#comment-13034289 ] hao yan commented on LUCENE-3096: - Thanks! Uwe! > MultiSearcher does not work correctly with Not on NumericRange > -- > > Key: LUCENE-3096 > URL: https://issues.apache.org/jira/browse/LUCENE-3096 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Affects Versions: 3.0.2 >Reporter: John Wang > Fix For: 3.1 > > > Hi, Keith > My colleague xiaoyang and I just confirmed that this is actually due to a > Lucene bug in MultiSearcher. In particular, > if we search with Not on NumericRange and we use MultiSearcher, we > will get wrong search results (however, if we use IndexSearcher, the > result is correct). Basically the Not on NumericRange has no > effect under MultiSearcher. We suspect it is because of the createWeight() > function in MultiSearcher, and hope you can help us fix this bug in > Lucene. I attached the code to reproduce this case. Please check it > out. > In the attached code, I have two separate functions: > (1) testNumericRangeSingleSearcher(Query query) > where I create 6 documents, with a field called "id" = 1,2,3,4,5,6 > respectively. Then I search by the query which is > +MatchAllDocs -NumericRange(3,3). The expected result then should > be 5 hits since document 3 is MUST_NOT. > (2) testNumericRangeMultiSearcher(Query query) > where I create 2 RAMDirectory(), each of which has 3 documents, > 1,2,3 and 4,5,6. Then I search by the same query as above using > MultiSearcher. The expected result should also be 5 hits. > However, from (1) we get 5 hits = expected results, while in (2) we > get 6 hits != expected results. > We also experimented with this using our zoie/bobo open source tools and got > the same results, because our multi-bobo-browser is built on > MultiSearcher in Lucene. > I already emailed the Lucene community group. Hopefully we can get some > feedback soon. 
> If you have any further concern, please let me know! > Thank you very much! > Code (based on Lucene 3.0.x):
{noformat}
import java.io.IOException;
import java.io.PrintStream;
import java.text.DecimalFormat;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import com.convertlucene.ConvertFrom2To3;

public class TestNumericRange
{
  public final static void main(String[] args)
  {
    try
    {
      BooleanQuery query = new BooleanQuery();
      query.add(NumericRangeQuery.newIntRange("numId", 3, 3, true, true), Occur.MUST_NOT);
      query.add(new MatchAllDocsQuery(), Occur.MUST);
      testNumericRangeSingleSearcher(query);
      testNumericRangeMultiSearcher(query);
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
  }

  public static void testNumericRangeSingleSearcher(Query query)
      throws CorruptIndexException, LockObtainFailedException, IOException
  {
    String[] ids = {"1", "2", "3", "4", "5", "6"};
    Directory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < ids.length; i++)
    {
      Document doc = new Document();
      doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new NumericField("numId").setIntValue(Integer.valueOf(ids[i])));
      writer.addDocument(doc);
    }
    writer.close();
    IndexSearcher searcher = new IndexSearcher(directory);
    TopDocs docs = searcher.search(query, 10);
    System.out.println("SingleSearcher: testNumericRange: hitNum: " + docs.totalHits);
    f
{noformat}
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: --- Component/s: (was: modules/grouping) core/search > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? 
> ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
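The OpenBitSet suggestion above trades one bit per document in the segment against four bytes per cached hit, at the cost of requiring in-order collection (replay recovers doc IDs by scanning set bits from low to high). A rough illustration of both points, with java.util.BitSet standing in for Lucene's OpenBitSet; this is a sketch, not the actual CachingCollector code:

```java
import java.util.BitSet;

public class DocIdCacheSketch {

    // Simulate an in-order collector caching matching doc IDs in a bit set:
    // here every even doc ID in the segment "matches".
    static BitSet collectEvenDocs(int maxDoc) {
        BitSet cache = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc += 2) {
            cache.set(doc); // collect(doc); only meaningful if docs arrive in order
        }
        return cache;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        BitSet cache = collectEvenDocs(maxDoc);
        int numHits = cache.cardinality(); // 500,000 hits

        // int[] cache: 4 bytes per hit; bit set: one bit per doc in the segment.
        System.out.println("int[] bytes:  " + (long) numHits * 4); // 2000000
        System.out.println("bitset bytes: " + maxDoc / 8);         // 125000

        // Replay yields doc IDs strictly in increasing order, which is why a
        // bit set cannot serve an out-of-order wrapped Collector.
        for (int doc = cache.nextSetBit(0); doc >= 0 && doc < 10; doc = cache.nextSetBit(doc + 1)) {
            System.out.print(doc + " "); // 0 2 4 6 8
        }
        System.out.println();
    }
}
```

With dense results like this the bit set wins; for very sparse results (hits far below maxDoc/32) the int[] remains smaller, which may be why the issue frames it as an option rather than a replacement.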
[jira] [Assigned] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera reassigned LUCENE-3102: -- Assignee: Shai Erera > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. 
So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034283#comment-13034283 ] Shai Erera commented on LUCENE-3102: Committed revision 1103870 (3x). Committed revision 1103872 (trunk). What's committed: * Move CachingCollector to core * Fix bugs * Add TestCachingCollector * Some refactoring Moving on to next proposed changes. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: modules/grouping >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? 
> * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034247#comment-13034247 ] Michael McCandless commented on LUCENE-3098: Thanks Martijn!! But, in general, you don't have to do the 3.x backport ;) I can do it too... We want to minimize the effort for people to contribute to Lucene/Solr! But thank you for backporting! > Grouped total count > --- > > Key: LUCENE-3098 > URL: https://issues.apache.org/jira/browse/LUCENE-3098 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Martijn van Groningen >Assignee: Michael McCandless > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3098-3x.patch, LUCENE-3098-3x.patch, > LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, > LUCENE-3098.patch > > > When grouping currently you can get two counts: > * Total hit count. Which counts all documents that matched the query. > * Total grouped hit count. Which counts all documents that have been grouped > in the top N groups. > Since the end user gets groups in his search result instead of plain > documents with grouping. The total number of groups as total count makes more > sense in many situations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034242#comment-13034242 ] Simon Willnauer commented on LUCENE-3092: - mike I attached a patch to LUCENE-3100 and tested with the latest patch on this issue. The test randomly fails (after I close the IW in the test!) here is a trace: {noformat} junit-sequential: [junit] Testsuite: org.apache.lucene.store.TestNRTCachingDirectory [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.16 sec [junit] [junit] - Standard Error - [junit] NOTE: reproduce with: ant test -Dtestcase=TestNRTCachingDirectory -Dtestmethod=testNRTAndCommit -Dtests.seed=-753565914717395747:-1817581638532977526 [junit] NOTE: test params are: codec=RandomCodecProvider: {docid=SimpleText, body=MockFixedIntBlock(blockSize=1993), title=Pulsing(freqCutoff=3), titleTokenized=MockSep, date=SimpleText}, locale=ar_AE, timezone=America/Santa_Isabel [junit] NOTE: all tests run in this JVM: [junit] [TestNRTCachingDirectory] [junit] NOTE: Mac OS X 10.6.7 x86_64/Apple Inc. 
1.6.0_24 (64-bit)/cpus=2,threads=1,free=46213552,total=85000192 [junit] - --- [junit] Testcase: testNRTAndCommit(org.apache.lucene.store.TestNRTCachingDirectory):FAILED [junit] limit=12 actual=16 [junit] junit.framework.AssertionFailedError: limit=12 actual=16 [junit] at org.apache.lucene.index.RandomIndexWriter.doRandomOptimize(RandomIndexWriter.java:165) [junit] at org.apache.lucene.index.RandomIndexWriter.close(RandomIndexWriter.java:199) [junit] at org.apache.lucene.store.TestNRTCachingDirectory.testNRTAndCommit(TestNRTCachingDirectory.java:179) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211) [junit] [junit] [junit] Test org.apache.lucene.store.TestNRTCachingDirectory FAILED {noformat} > NRTCachingDirectory, to buffer small segments in a RAMDir > - > > Key: LUCENE-3092 > URL: https://issues.apache.org/jira/browse/LUCENE-3092 > Project: Lucene - Java > Issue Type: Improvement > Components: core/store >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, > LUCENE-3092.patch, LUCENE-3092.patch > > > I created this simply Directory impl, whose goal is reduce IO > contention in a frequent reopen NRT use case. > The idea is, when reopening quickly, but not indexing that much > content, you wind up with many small files created with time, that can > possibly stress the IO system eg if merges, searching are also > fighting for IO. > So, NRTCachingDirectory puts these newly created files into a RAMDir, > and only when they are merged into a too-large segment, does it then > write-through to the real (delegate) directory. > This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. 
[jira] [Updated] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file
[ https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3100: Attachment: LUCENE-3100.patch here is a patch sync'ing the file on successful write during prepareCommit > IW.commit() writes but fails to fsync the N.fnx file > > > Key: LUCENE-3100 > URL: https://issues.apache.org/jira/browse/LUCENE-3100 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Simon Willnauer > Fix For: 4.0 > > Attachments: LUCENE-3100.patch > > > In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising > bug! > Because the new N.fnx file is written at the "last minute" along with the > segments file, it's not included in the sis.files() that IW uses to figure > out which files to sync. > This bug means one could call IW.commit(), successfully, return, and then the > machine could crash and when it comes back up your index could be corrupted. > We should hopefully first fix TestCrash so that it hits this bug (maybe it > needs more/better randomization?), then fix the bug -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
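For readers unfamiliar with the failure mode behind this issue: a write that returns successfully may still sit only in the OS page cache, so a machine crash can lose it unless the file is explicitly synced. A generic stand-alone sketch of "write then fsync" using java.nio -- this illustrates the concept only and is not Lucene's Directory.sync API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncSketch {

    // Write bytes and force them to stable storage before returning.
    // Without force(), a crash after a "successful" write can still lose
    // the data -- the corruption scenario described in this issue, where
    // the N.fnx file was written but never included in the sync set.
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // fsync: flush file contents and metadata to disk
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("N", ".fnx");
        writeDurably(p, new byte[] {1, 2, 3});
        System.out.println(Files.size(p)); // 3
        Files.delete(p);
    }
}
```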
[jira] [Updated] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-3098: -- Attachment: LUCENE-3098-3x.patch Great! Attached the 3x backport. > Grouped total count > --- > > Key: LUCENE-3098 > URL: https://issues.apache.org/jira/browse/LUCENE-3098 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Martijn van Groningen >Assignee: Michael McCandless > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3098-3x.patch, LUCENE-3098-3x.patch, > LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, > LUCENE-3098.patch > > > When grouping currently you can get two counts: > * Total hit count. Which counts all documents that matched the query. > * Total grouped hit count. Which counts all documents that have been grouped > in the top N groups. > Since the end user gets groups in his search result instead of plain > documents with grouping. The total number of groups as total count makes more > sense in many situations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034224#comment-13034224 ] Michael McCandless commented on LUCENE-3103: +1 -- this is a great test to add, now that we support arbitrary binary terms. > create a simple test that indexes and searches byte[] terms > --- > > Key: LUCENE-3103 > URL: https://issues.apache.org/jira/browse/LUCENE-3103 > Project: Lucene - Java > Issue Type: Test > Components: general/test >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-3103.patch > > > Currently, the only good test that does this is Test2BTerms (disabled by > default) > I think we should test this capability, and also have a simpler example for > how to do this. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034220#comment-13034220 ] Uwe Schindler commented on LUCENE-3103: --- Reflection should work correct. No need to change anything. > create a simple test that indexes and searches byte[] terms > --- > > Key: LUCENE-3103 > URL: https://issues.apache.org/jira/browse/LUCENE-3103 > Project: Lucene - Java > Issue Type: Test > Components: general/test >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-3103.patch > > > Currently, the only good test that does this is Test2BTerms (disabled by > default) > I think we should test this capability, and also have a simpler example for > how to do this. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034217#comment-13034217 ] Robert Muir commented on LUCENE-3103: - one thing i did previously (seemed overkill but maybe good to do) was to clearAttributes(), setBytesRef() on each incrementToken, more like a normal tokenizer. we could still change it to work like this. in this case clear() set the br to null. another thing to inspect is the reflection api so toString prints the bytes... didnt check this. > create a simple test that indexes and searches byte[] terms > --- > > Key: LUCENE-3103 > URL: https://issues.apache.org/jira/browse/LUCENE-3103 > Project: Lucene - Java > Issue Type: Test > Components: general/test >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-3103.patch > > > Currently, the only good test that does this is Test2BTerms (disabled by > default) > I think we should test this capability, and also have a simpler example for > how to do this. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-3098: -- Assignee: Michael McCandless > Grouped total count > --- > > Key: LUCENE-3098 > URL: https://issues.apache.org/jira/browse/LUCENE-3098 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Martijn van Groningen >Assignee: Michael McCandless > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, > LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch > > > When grouping currently you can get two counts: > * Total hit count. Which counts all documents that matched the query. > * Total grouped hit count. Which counts all documents that have been grouped > in the top N groups. > Since the end user gets groups in his search result instead of plain > documents with grouping. The total number of groups as total count makes more > sense in many situations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034214#comment-13034214 ] Michael McCandless commented on LUCENE-3098: Looks great Martijn! I'll commit in a day or two if nobody objects... > Grouped total count > --- > > Key: LUCENE-3098 > URL: https://issues.apache.org/jira/browse/LUCENE-3098 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Martijn van Groningen > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, > LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch > > > When grouping currently you can get two counts: > * Total hit count. Which counts all documents that matched the query. > * Total grouped hit count. Which counts all documents that have been grouped > in the top N groups. > Since the end user gets groups in his search result instead of plain > documents with grouping. The total number of groups as total count makes more > sense in many situations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203 ] Robert Muir commented on SOLR-2519: --- As someone frustrated by this (but who would ultimately like to move past it and try to help with solr's intl), I just wanted to say +1 to Hoss Man's proposal. My only suggestion on what he said is that I would greatly prefer text_en over text_western or whatever for these reasons: 1. the stemming and stopwords and crap here are english. 2. for other western languages, even if you swap these out to be say, french or italian (which is the seemingly obvious way to cut over), the whole WDF+autophrase is still a huge trap (see http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance for an example). in this case use of ElisionFilter can be taken to avoid it. > Improve the defaults for the "text" field type in default schema.xml > > > Key: SOLR-2519 > URL: https://issues.apache.org/jira/browse/SOLR-2519 > Project: Solr > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.2, 4.0 > > Attachments: SOLR-2519.patch > > > Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5 > The text fieldType in schema.xml is unusable for non-whitespace > languages, because it has the dangerous auto-phrase feature (of > Lucene's QP -- see LUCENE-2458) enabled. > Lucene leaves this off by default, as does ElasticSearch > (http://http://www.elasticsearch.org/). > Furthermore, the "text" fieldType uses WhitespaceTokenizer when > StandardTokenizer is a better cross-language default. > Until we have language specific field types, I think we should fix > the "text" fieldType to work well for all languages, by: > * Switching from WhitespaceTokenizer to StandardTokenizer > * Turning off auto-phrase -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-3098: -- Attachment: LUCENE-3098.patch Attached a new patch. * Renamed TotalGroupCountCollector to AllGroupsCollector. This rename better reflects what the collector is actually doing. * Group values are now collected in an ArrayList instead of a LinkedList. The initialSize is now also used for the ArrayList. > Grouped total count > --- > > Key: LUCENE-3098 > URL: https://issues.apache.org/jira/browse/LUCENE-3098 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Martijn van Groningen > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, > LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch > > > When grouping currently you can get two counts: > * Total hit count. Which counts all documents that matched the query. > * Total grouped hit count. Which counts all documents that have been grouped > in the top N groups. > Since the end user gets groups in his search result instead of plain > documents with grouping. The total number of groups as total count makes more > sense in many situations.
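The change described in the patch (unique group values gathered into an ArrayList pre-sized with initialSize, so the "grouped total count" is the number of distinct groups rather than documents) can be sketched without Lucene's collector API. This is a hypothetical, self-contained illustration; GroupCollectorSketch and its method names are not the patch's actual AllGroupsCollector:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of collecting all unique group values:
// a HashSet detects duplicates, an ArrayList (pre-sized with
// initialSize, as in the patch) keeps them in insertion order.
class GroupCollectorSketch {
    private final Set<String> seen;
    private final List<String> groups;

    GroupCollectorSketch(int initialSize) {
        seen = new HashSet<String>(initialSize);
        groups = new ArrayList<String>(initialSize);
    }

    // Called once per matching document with that document's group value.
    void collect(String groupValue) {
        if (seen.add(groupValue)) {
            groups.add(groupValue);
        }
    }

    // The grouped total count: distinct groups, not documents.
    int getGroupCount() {
        return groups.size();
    }

    List<String> getGroups() {
        return groups;
    }
}
```

An ArrayList is a better fit here than a LinkedList because the values are only appended and later iterated; the pre-sizing avoids repeated array growth when the expected group count is known.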
[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034197#comment-13034197 ] Yonik Seeley commented on SOLR-2520: It looks like we already escape \u2028 (see SOLR-1936), so we should just do the same for \u2029? > JSONResponseWriter w/json.wrf can produce invalid javascript depending on > unicode chars in response data > > > Key: SOLR-2520 > URL: https://issues.apache.org/jira/browse/SOLR-2520 > Project: Solr > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Benson Margulies > > Please see http://timelessrepo.com/json-isnt-a-javascript-subset. > If a stored field contains invalid Javascript characters, and you use the > query option to ask for jsonp, solr does *not* escape some invalid Unicode > characters, resulting in strings that explode on contact with browsers.
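Yonik's suggestion boils down to replacing U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) with \uXXXX escapes before the JSON is handed to a json.wrf callback, since both characters are legal inside JSON strings but illegal inside JavaScript string literals. A minimal self-contained sketch of that escaping; this is not Solr's actual JSONWriter code:

```java
// Sketch of escaping the two characters that are valid in JSON strings
// but invalid in JavaScript string literals (the JSONP breakage):
// U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR.
// (Numeric comparisons are used because Java source may not contain
// these code points literally, even as \u escapes.)
class JsonpEscape {
    static String escape(String json) {
        StringBuilder sb = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == 0x2028) {
                sb.append("\\u2028");
            } else if (c == 0x2029) {
                sb.append("\\u2029");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A stored-field value containing U+2028, wrapped as JSONP:
        // after escaping, the result is safe to evaluate in a browser.
        String raw = "{\"v\":\"a" + (char) 0x2028 + "b\"}";
        System.out.println("SOME_PREFIX(" + escape(raw) + ")");
    }
}
```

Because \uXXXX escapes are part of JSON itself, the escaped output remains valid JSON, so the fix is safe even for clients that parse rather than eval the response.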
[jira] [Updated] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3103: Attachment: LUCENE-3103.patch attached is a first patch... maybe Uwe won't be able to resist rewriting it to make it simpler :) > create a simple test that indexes and searches byte[] terms > --- > > Key: LUCENE-3103 > URL: https://issues.apache.org/jira/browse/LUCENE-3103 > Project: Lucene - Java > Issue Type: Test > Components: general/test >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-3103.patch > > > Currently, the only good test that does this is Test2BTerms (disabled by > default) > I think we should test this capability, and also have a simpler example for > how to do this.
[jira] [Created] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
create a simple test that indexes and searches byte[] terms --- Key: LUCENE-3103 URL: https://issues.apache.org/jira/browse/LUCENE-3103 Project: Lucene - Java Issue Type: Test Components: general/test Reporter: Robert Muir Fix For: 4.0 Currently, the only good test that does this is Test2BTerms (disabled by default) I think we should test this capability, and also have a simpler example for how to do this.
Re: Reorganizing JIRA components
I renamed all current components, plus deleted two (contrib/analyzers and contrib/wikipedia). core/codecs core/index core/other core/query/scoring core/queryparser core/search core/store core/termvectors general/build general/javadocs general/test general/website modules/analysis modules/benchmark modules/examples modules/grouping modules/highlighter modules/other modules/queryparser modules/spatial modules/spellchecker Shai On Mon, May 16, 2011 at 7:07 AM, Mark Miller wrote: > > On May 15, 2011, at 10:42 PM, Shai Erera wrote: > > > I was aiming at avoiding that scenario. I think every issue should be > assigned to a specific component, and if there isn't one available, we > should create it. > > > Based on history and how these things normally go, unless you are planning > on spending a *lot* of time curating JIRA for the foreseeable future, this is > an unlikely outcome. Better categories will hopefully mean more compliance, > but I'd bet the standard hodgepodge of JIRA submissions and curation is > going to remain fairly similar to what we have seen. Version is a much more > important field - and even it is not curated even close to this 'ideal' > world level. > > I think every issue should be fully filled out, correctly filled out, cross > linked with all relevant issues, etc, etc. > > But I don't plan on it being the normal scenario ;) > > FWIW: I fill out component sometimes, and other times I'm just not worried > about it. Someone can always come along after us types and random users and > clean up after them, but I surmise that won't last long. > > - Mark Miller > lucidimagination.com > > Lucene/Solr User Conference > May 25-26, San Francisco > www.lucenerevolution.org
[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034187#comment-13034187 ] Benson Margulies commented on SOLR-2520: I'd vote for the latter. I assume that there is some large inventory of people who are currently using json.wrf=foo and who would benefit from the change. However, I have limited context here, so if anyone else knows more about how users are using this stuff I hope they will speak up. Sorry not to have been fully clear on the first attempt.
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034185#comment-13034185 ] Michael McCandless commented on SOLR-2519: -- bq. Bottom line: it's less confusing to remove and add new ones with new names then to make radical changes to existing ones. Ahh, this makes great sense! I really like your proposal Hoss, and that's a great point about emails to the mailing lists. So we'd have no more text fieldType. Just text_en (what text now is) and text_general (basically just StandardAnalyzer, but maybe move/absorb "textgen" over). Over time we can add in more language specific text_XX fieldTypes...
[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-2520: --- Summary: JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data (was: Solr creates invalid jsonp strings) Benson: thanks for the clarification, i've updated the summary to attempt to clarify the root of the issue. Would it make more sense to have a "JavascriptResponseWriter" or to have the JSONResponseWriter do unicode escaping/stripping if/when json.wrf is specified?
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: --- Component/s: (was: contrib/*) modules/grouping > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: modules/grouping >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? 
> ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up.
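The OpenBitSet suggestion in the list above trades 32 bits per cached hit (an int[] entry) for one bit per document in the index, at the cost of losing support for out-of-order collection. A rough self-contained sketch of that math, using java.util.BitSet as a stand-in for Lucene's OpenBitSet:

```java
import java.util.BitSet;

// Sketch of the RAM tradeoff: caching matched doc IDs in a bitset
// (one bit per document in the index) versus an int[] (32 bits per hit).
// java.util.BitSet stands in for Lucene's OpenBitSet here.
// NOTE: a bitset discards collection order, so it only works when the
// wrapped Collector collects in order.
class DocIdCacheSketch {
    // Approximate bytes for an int[] holding numHits doc IDs.
    static long intArrayBytes(long numHits) {
        return numHits * 4L;
    }

    // Approximate bytes for a bitset spanning maxDoc documents.
    static long bitSetBytes(long maxDoc) {
        return maxDoc / 8L;
    }

    public static void main(String[] args) {
        long maxDoc = 10000000L;  // 10M docs in the index
        long numHits = 5000000L;  // half of them match
        // int[]: ~20 MB vs bitset: ~1.25 MB; the bitset wins
        // whenever numHits > maxDoc / 32.
        System.out.println(intArrayBytes(numHits) + " vs " + bitSetBytes(maxDoc));

        BitSet cached = new BitSet((int) maxDoc);
        cached.set(42);  // "collect" doc 42, necessarily in order
        System.out.println(cached.get(42));
    }
}
```

So the break-even point is a hit density of 1 in 32: sparser result sets favor the int[], denser ones favor the bitset.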
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034176#comment-13034176 ] Hoss Man commented on SOLR-2519: bq. Also: existing users would be unaffected by this? They've already copied over / edited their own schema.xml? This is mainly about new users? The trap we've seen with this type of thing in the past (ie: the numeric fields) is that people who tend to use the example configs w/o changing them much refer to the example field types by name when talking about them on the mailing list, not considering that those names can have different meanings depending on version. if we make radical changes to a {{}} but leave the name alone, it could confuse a lot of people, ie: "i tried using the 'text' field but it didn't work"; "which version of solr are you using?"; "Solr 4.1"; "that should work, what exactly does your schema look like"; "..."; "that's the schema from 3.6"; "yeah, i started with 3.6 and then upgraded to 4.1 later", etc... Bottom line: it's less confusing to *remove* {{}} and add new ones with new names than to make radical changes to existing ones.
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172 ] Hoss Man commented on SOLR-2519: I feel like we are convoluting two issues here: the "default" behavior of TextField, and the example configs. i don't have any strong opinions about changing the default behavior of TextField when {{autoGeneratePhraseQueries}} is not specified in the {{}} but if we do make such a change, it should be contingent on the schema version property (which we should bump) so that people who upgrade will get consistent behavior with their existing configs (TextField.init already has an example of this for when we changed the default of {{omitNorms}}) as far as the example configs: i agree with yonik, that changing "text" at this point might be confusing ... i think the best way to iterate moving forward would probably be: * rename {{}} and {{}} to something that makes their purpose more clear (text_en, or text_western, or text_european, or some other more general descriptive word for the types of languages where it makes sense) and switch all existing {{}} declarations that currently use field type "text" to use this new name. * add a new {{}} which is designed (and documented) to be a general purpose field type when the language is unknown (it may make sense to fix/repurpose the existing {{}} for this, since it already suggests that's what it's for) * Audit all {{}} declarations that use "text_en" (or whatever name was chosen above) and the existing sample data for those fields to see if it makes more sense to change them to "text_general". also change any where based on usage it shouldn't matter. The end result being that we have no {{}} named "text" in the example configs, so people won't get it confused with previous versions, and we'll have a new {{}} that works as well as possible with all languages which we use as much as possible with the example data.
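The general-purpose type proposed above could look roughly like this in schema.xml. This is only a sketch of the configuration implied by the discussion (StandardTokenizer, lowercasing, no WordDelimiterFilter, no stemming, auto-phrasing off); the factory and attribute names are standard Solr ones, but the fieldType itself is hypothetical, not a committed example:

```xml
<!-- Sketch of a language-neutral "text_general" fieldType:
     StandardTokenizer, no WDF, no stemming, auto-phrasing disabled. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Language-specific types like text_en would then layer stopwords, stemming, and (if wanted) WDF on top of this baseline under their own names.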
[jira] [Updated] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-3098: -- Attachment: LUCENE-3098.patch Attached patch with the discussed changes. 3x patch follows soon.
[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034159#comment-13034159 ] Benson Margulies commented on SOLR-2520: Fun happens when you specify something in json.wrf. This demands 'jsonp' instead of json, which results in the response being treated as javascript, not json. wt=json&json.wrf=SOME_PREFIX will cause Solr to respond with SOME_PREFIX({whatever it was otherwise going to return}) instead of just {whatever it was otherwise going to return} If there is then an interesting Unicode character in there, Chrome implodes and firefox quietly rejects.
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034158#comment-13034158 ] Michael McCandless commented on SOLR-2519: -- It's also spooky that "text" fieldType has different index time vs query time analyzers? Ie, WDF is configured differently.
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154 ] Michael McCandless commented on SOLR-2519: -- bq. I think maybe there's a misconception that the fieldType named "text" was meant to be generic for all languages. Regardless of what the original intention was, "text" today has become the generic text fieldType new users use on starting with Solr. I mean, it has the perfect name for that :) bq. As I said in the thread, if I had to do it over again, I would have named it "text_en" because that's what its purpose was. Hindsight is 20/20... but, we can still fix this today. We shouldn't lock ourselves into poor defaults. Especially, as things improve and we get better analyzers, etc., we should be free to improve the defaults in schema.xml to take advantage of these improvements. bq. But at this point, it seems like the best way forward is to leave "text" as an english fieldType and simply add other fieldTypes that can support other languages. I think this is a dangerous approach -- the name (ie, missing _en if in fact it has such English-specific configuration) is misleading and traps new users. Ideally, in the future, we wouldn't even have a "text" fieldType, only text_XX per-language examples and then maybe something like text_general, which you use if you cannot find your language. {quote} Some downsides I see to this patch (i.e. trying to make the 'text' fieldType generic): The current WordDelimiterFilter options in the fieldType feel like a trap for non-whitespace-delimited languages. WDF is configured to index catenations as well as splits... so all of the tokens (words?) that are split out are also catenated together and indexed (which seems like it could lead to some truly huge tokens erroneously being indexed.) {quote} Ahh good point. I think we should remove WDF altogether from the generic "text" fieldType.
{quote} You left the english stemmer on the "text" fieldType... but if it's supposed to be generic, couldn't this be bad for some other western languages where it could cause stemming collisions of words not related to each other? {quote} +1, we should remove the stemming too from "text". bq. Taking into account all the existing users (and all the existing documentation, examples, tutorial, etc), I favor a more conservative approach of adding new fieldTypes rather than radically changing the behavior of existing ones. Can you point to specific examples (docs, examples, tutorial)? I'd like to understand how much work it is to fix these... My feeling is we should simply do the work here (I'll sign up to it) and fix any places that actually rely on the specifics of "text" fieldType, eg autophrase. We shouldn't avoid fixing things well because it's gonna be more work today, especially if someone (me) is signing up to do it. Also: existing users would be unaffected by this? They've already copied over / edited their own schema.xml? This is mainly about new users?
[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034151#comment-13034151 ] Hoss Man commented on SOLR-2520: I'm confused here: As far as i can tell, the JSONResponseWriter does in fact output valid JSON (the link mentioned points out that there are control characters valid in JSON which are not valid in javascript, but that's what the response writer produces -- JSON) ... so what is the bug? And what do you mean by "the query option to ask for jsonp" ? ... i don't see that option in the JSONResponseWriter (is this bug about some third party response writer?)
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120 ] Yonik Seeley commented on SOLR-2519: I think maybe there's a misconception that the fieldType named "text" was meant to be generic for all languages. As I said in the thread, if I had to do it over again, I would have named it "text_en" because that's what its purpose was. But at this point, it seems like the best way forward is to leave "text" as an english fieldType and simply add other fieldTypes that can support other languages. Some downsides I see to this patch (i.e. trying to make the 'text' fieldType generic): - The current WordDelimiterFilter options in the fieldType feel like a trap for non-whitespace-delimited languages. WDF is configured to index catenations as well as splits... so all of the tokens (words?) that are split out are also catenated together and indexed (which seems like it could lead to some truly huge tokens erroneously being indexed.) - You left the english stemmer on the "text" fieldType... but if it's supposed to be generic, couldn't this be bad for some other western languages where it could cause stemming collisions of words not related to each other? Taking into account all the existing users (and all the existing documentation, examples, tutorial, etc), I favor a more conservative approach of adding new fieldTypes rather than radically changing the behavior of existing ones. Random question: what are the implications of changing from WhitespaceTokenizer to StandardTokenizer, esp w.r.t. WDF?
> Improve the defaults for the "text" field type in default schema.xml > > > Key: SOLR-2519 > URL: https://issues.apache.org/jira/browse/SOLR-2519 > Project: Solr > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.2, 4.0 > > Attachments: SOLR-2519.patch > > > Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5 > The text fieldType in schema.xml is unusable for non-whitespace > languages, because it has the dangerous auto-phrase feature (of > Lucene's QP -- see LUCENE-2458) enabled. > Lucene leaves this off by default, as does ElasticSearch > (http://www.elasticsearch.org/). > Furthermore, the "text" fieldType uses WhitespaceTokenizer when > StandardTokenizer is a better cross-language default. > Until we have language specific field types, I think we should fix > the "text" fieldType to work well for all languages, by: > * Switching from WhitespaceTokenizer to StandardTokenizer > * Turning off auto-phrase -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
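The split Yonik proposes (keep "text" English, add generic and per-language types) can be sketched in schema.xml terms. The type names, filenames, and exact analysis chains below are illustrative, not the shipped schema; `autoGeneratePhraseQueries="false"` is the per-fieldType switch for the auto-phrase behavior discussed in this issue:

```xml
<!-- Hypothetical sketch, not the actual default schema.xml: a generic
     cross-language type on StandardTokenizer with auto-phrase off, next to
     an explicitly English type that keeps stemming and stop words. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

This is the conservative path from the comment above: existing fields keep their behavior, and new per-language types are added alongside rather than redefining "text" in place.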
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-trunk-only.patch New patch that also has BalancedMergePolicy from contrib refactored to new API (sorry that was missing). > MergePolicy.OneMerge.segments should be List not SegmentInfos > -- > > Key: LUCENE-3084 > URL: https://issues.apache.org/jira/browse/LUCENE-3084 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch > > > SegmentInfos carries a bunch of fields beyond the list of SI, but for merging > purposes these fields are unused. > We should cutover to List instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034103#comment-13034103 ] Simon Willnauer commented on LUCENE-3090: - Thanks Mike for review and testing!! It makes me feel better with those asserts in there now... I will commit tomorrow. > DWFlushControl does not take active DWPT out of the loop on fullFlush > - > > Key: LUCENE-3090 > URL: https://issues.apache.org/jira/browse/LUCENE-3090 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer >Priority: Critical > Fix For: 4.0 > > Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch > > > We have seen several OOM on TestNRTThreads and all of them are caused by > DWFlushControl missing DWPT that are set as flushPending but can't flush due > to a full flush going on. Yet that means that those DWPT are filling up in > the background while they should actually be checked out and blocked until > the full flush finishes. Even further we currently stall on the > maxNumThreadStates while we should stall on the num of active thread states. > I will attach a patch tomorrow. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2027) Deprecate Directory.touchFile
[ https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2027: --- Attachment: LUCENE-2027.patch Patch, removing Dir.touchFile from trunk. For 3.x I'll deprecate. > Deprecate Directory.touchFile > - > > Key: LUCENE-2027 > URL: https://issues.apache.org/jira/browse/LUCENE-2027 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Trivial > Fix For: 4.0 > > Attachments: LUCENE-2027.patch > > > Lucene doesn't use this method, and, FindBugs reports that FSDirectory's impl > shouldn't swallow the returned result from File.setLastModified. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-2027) Deprecate Directory.touchFile
[ https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2027: -- Assignee: Michael McCandless > Deprecate Directory.touchFile > - > > Key: LUCENE-2027 > URL: https://issues.apache.org/jira/browse/LUCENE-2027 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Trivial > Fix For: 4.0 > > Attachments: LUCENE-2027.patch > > > Lucene doesn't use this method, and, FindBugs reports that FSDirectory's impl > shouldn't swallow the returned result from File.setLastModified. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034101#comment-13034101 ] Michael McCandless commented on SOLR-2519: -- I think the attached patch is a good starting point. It fixes the generic "text" fieldType to have good all-around defaults for all languages, so that non-whitespace languages work fine. Then, I think we should iteratively add in custom languages over time (as separate issues). We can e.g. add text_en_autophrase, text_en, text_zh, etc. We should at least do a first sweep of the analyzers module and add fieldTypes for its analyzers. This way we will eventually get to the ideal future when we have text_XX coverage for many languages. > Improve the defaults for the "text" field type in default schema.xml > > > Key: SOLR-2519 > URL: https://issues.apache.org/jira/browse/SOLR-2519 > Project: Solr > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.2, 4.0 > > Attachments: SOLR-2519.patch > > > Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5 > The text fieldType in schema.xml is unusable for non-whitespace > languages, because it has the dangerous auto-phrase feature (of > Lucene's QP -- see LUCENE-2458) enabled. > Lucene leaves this off by default, as does ElasticSearch > (http://www.elasticsearch.org/). > Furthermore, the "text" fieldType uses WhitespaceTokenizer when > StandardTokenizer is a better cross-language default. > Until we have language specific field types, I think we should fix > the "text" fieldType to work well for all languages, by: > * Switching from WhitespaceTokenizer to StandardTokenizer > * Turning off auto-phrase -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file
[ https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-3100: --- Assignee: Simon Willnauer > IW.commit() writes but fails to fsync the N.fnx file > > > Key: LUCENE-3100 > URL: https://issues.apache.org/jira/browse/LUCENE-3100 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Simon Willnauer > Fix For: 4.0 > > > In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising > bug! > Because the new N.fnx file is written at the "last minute" along with the > segments file, it's not included in the sis.files() that IW uses to figure > out which files to sync. > This bug means one could call IW.commit(), successfully, return, and then the > machine could crash and when it comes back up your index could be corrupted. > We should hopefully first fix TestCrash so that it hits this bug (maybe it > needs more/better randomization?), then fix the bug -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2521) TestJoin.testRandom fails
TestJoin.testRandom fails - Key: SOLR-2521 URL: https://issues.apache.org/jira/browse/SOLR-2521 Project: Solr Issue Type: Bug Reporter: Michael McCandless Fix For: 4.0 Hit this random failure; it reproduces on trunk: {noformat} [junit] Testsuite: org.apache.solr.TestJoin [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.512 sec [junit] [junit] - Standard Error - [junit] 2011-05-16 12:51:46 org.apache.solr.TestJoin testRandomJoin [junit] SEVERE: GROUPING MISMATCH: mismatch: '0'!='1' @ response/numFound [junit] request=LocalSolrQueryRequest{echoParams=all&indent=true&q={!join+from%3Dsmall_i+to%3Dsmall3_is}*:*&wt=json} [junit] result={ [junit] "responseHeader":{ [junit] "status":0, [junit] "QTime":0, [junit] "params":{ [junit] "echoParams":"all", [junit] "indent":"true", [junit] "q":"{!join from=small_i to=small3_is}*:*", [junit] "wt":"json"}}, [junit] "response":{"numFound":1,"start":0,"docs":[ [junit] { [junit] "id":"NXEA", [junit] "score_f":87.90162, [junit] "small3_ss":["N", [junit] "v", [junit] "n"], [junit] "small_i":4, [junit] "small2_i":1, [junit] "small2_is":[2], [junit] "small3_is":[69, [junit] 88, [junit] 54, [junit] 80, [junit] 75, [junit] 83, [junit] 57, [junit] 73, [junit] 85, [junit] 52, [junit] 50, [junit] 88, [junit] 51, [junit] 89, [junit] 12, [junit] 8, [junit] 19, [junit] 23, [junit] 53, [junit] 75, [junit] 26, [junit] 99, [junit] 0, [junit] 44]}] [junit] }} [junit] expected={"numFound":0,"start":0,"docs":[]} [junit] model={"NXEA":"Doc(0):[id=NXEA, score_f=87.90162, small3_ss=[N, v, n], small_i=4, small2_i=1, small2_is=2, small3_is=[69, 88, 54, 80, 75, 83, 57, 73, 85, 52, 50, 88, 51, 89, 12, 8, 19, 23, 53, 75, 26, 99, 0, 44]]","JSLZ":"Doc(1):[id=JSLZ, score_f=11.198811, small2_ss=[c, d], small3_ss=[b, R, H, Q, O, f, C, e, Z, u, z, u, w, I, f, _, Y, r, w, u], small_i=6, small2_is=[2, 3], small3_is=[22, 1]]","FAWX":"Doc(2):[id=FAWX, score_f=25.524109, small_s=d, small3_ss=[O, D, X, `, W, z, k, M, j, m, r, [, E, P, w, ^, y, T, e, R, V, H, 
g, e, I], small_i=2, small2_is=[2, 1], small3_is=[95, 42]]","GDDZ":"Doc(3):[id=GDDZ, score_f=8.483642, small2_ss=[b, e], small3_ss=[o, i, y, l, I, O, r, O, f, d, E, e, d, f, b, P], small2_is=[6, 6], small3_is=[36, 48, 9, 8, 40, 40, 68]]","RBIQ":"Doc(4):[id=RBIQ, score_f=97.06258, small_s=b, small2_s=c, small2_ss=[e, e], small_i=2, small2_is=6, small3_is=[13, 77, 96, 45]]","LRDM":"Doc(5):[id=LRDM, score_f=82.302124, small_s=b, small2_s=a, small2_ss=d, small3_ss=[H, m, O, D, I, J, U, D, f, N, ^, m, I, j, L, s, F, h, A, `, c, j], small2_i=2, small2_is=[2, 7], small3_is=[81, 31, 78, 23, 88, 1, 7, 86, 20, 7, 40, 52, 100, 81, 34, 45, 87, 72, 14, 5]]"} [junit] NOTE: reproduce with: ant test -Dtestcase=TestJoin -Dtestmethod=testRandomJoin -Dtests.seed=-4998031941344546449:8541928265064992444 [junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, small2_ss=Standard, small2_is=MockFixedIntBlock(blockSize=1738), small2_s=MockFixedIntBlock(blockSize=1738), small3_is=MockVariableIntBlock(baseBlockSize=77), small_i=MockFixedIntBlock(blockSize=1738), small_s=MockVariableIntBlock(baseBlockSize=77), score_f=MockSep, small2_i=Pulsing(freqCutoff=9), small3_ss=SimpleText}, locale=sr_BA, timezone=America/Barbados [junit] NOTE: all tests run in this JVM: [junit] [TestJoin] [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 
1.6.0_21 (64-bit)/cpus=24,threads=1,free=252342544,total=308084736 [junit] - --- [junit] Testcase: testRandomJoin(org.apache.solr.TestJoin): FAILED [junit] mismatch: '0'!='1' @ response/numFound [junit] junit.framework.AssertionFailedError: mismatch: '0'!='1' @ response/numFound [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211) [junit] at org.apache.solr.TestJoin.testRandomJoin(TestJoin.java:172) [junit] [junit] [junit] Test org.apache.solr.TestJoin FAILED {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira -
[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034095#comment-13034095 ] Michael McCandless commented on LUCENE-3090: Patch looks good but hairy Simon! I ran 144 iters of all (Solr+lucene+lucene-contrib) tests. I hit three fails (one in Solr's TestJoin.testRandomJoin, and two in Solr's HighlighterTest) but I don't think these are related to this patch. > DWFlushControl does not take active DWPT out of the loop on fullFlush > - > > Key: LUCENE-3090 > URL: https://issues.apache.org/jira/browse/LUCENE-3090 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer >Priority: Critical > Fix For: 4.0 > > Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch > > > We have seen several OOM on TestNRTThreads and all of them are caused by > DWFlushControl missing DWPT that are set as flushPending but can't flush due > to a full flush going on. Yet that means that those DWPT are filling up in > the background while they should actually be checked out and blocked until > the full flush finishes. Even further we currently stall on the > maxNumThreadStates while we should stall on the num of active thread states. > I will attach a patch tomorrow. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034093#comment-13034093 ] Michael McCandless commented on LUCENE-3084: Uwe, this looks like a great step forward? Even if there are other things to fix later, we should commit this first (progress not perfection)? Thanks! On backporting, this is an experimental API, and it's rather "expert" for code to be interacting with SegmentInfos, so I think we can just break it (and advertise we did so)? > MergePolicy.OneMerge.segments should be List not SegmentInfos > -- > > Key: LUCENE-3084 > URL: https://issues.apache.org/jira/browse/LUCENE-3084 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084.patch > > > SegmentInfos carries a bunch of fields beyond the list of SI, but for merging > purposes these fields are unused. > We should cutover to List instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034091#comment-13034091 ] Michael McCandless commented on LUCENE-3102: Patch looks great Shai -- +1 to commit!! Yes that is very sneaky about the private fields in inner/outer classes -- it's good you added a comment explaining it! > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? 
> * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
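Shai's note that OpenBitSet only works for in-order collection follows from what a bitset stores: membership, not arrival order. Replaying a bitset always yields ascending doc IDs, so any out-of-order collection order is lost. A small illustration, with `java.util.BitSet` standing in for Lucene's OpenBitSet:

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustration (java.util.BitSet standing in for Lucene's OpenBitSet):
// a bitset records which docs matched but not the order they arrived in,
// so replay() is inherently in-order -- fine when the wrapped Collector
// collected in order, wrong if it collected out of order.
public class BitsetCacheSketch {
    public static int[] replay(BitSet cached) {
        int[] docs = new int[cached.cardinality()];
        int i = 0;
        for (int doc = cached.nextSetBit(0); doc >= 0; doc = cached.nextSetBit(doc + 1)) {
            docs[i++] = doc; // always ascending, regardless of collect() order
        }
        return docs;
    }

    public static void main(String[] args) {
        BitSet cached = new BitSet();
        // simulate out-of-order collection: docs arrive as 7, 2, 5
        cached.set(7);
        cached.set(2);
        cached.set(5);
        System.out.println(Arrays.toString(replay(cached)));
    }
}
```

The RAM trade-off in the comment follows the same logic: one bit per doc in the index versus 32 bits per hit, so the bitset wins once hits are a reasonable fraction of maxDoc.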
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-trunk-only.patch Here is an updated patch that removes some List usage from DirectoryReader and IndexWriter for rollback when commit fails. I am still not happy with IndexWriter code interacting directly with the list, but this should maybe be fixed later. This patch could also be backported to clean up 3.x, but for backwards compatibility, the SegmentInfos class should still extend Vector, but we can make the "segments" field simply point to this. I am not sure how to "deprecate" extension of a class? A possibility would be to add each Vector method as a overridden one-liner and deprecated, but that's a no-brainer and stupid to do :( > MergePolicy.OneMerge.segments should be List not SegmentInfos > -- > > Key: LUCENE-3084 > URL: https://issues.apache.org/jira/browse/LUCENE-3084 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084.patch > > > SegmentInfos carries a bunch of fields beyond the list of SI, but for merging > purposes these fields are unused. > We should cutover to List instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
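Uwe's question about "deprecating" extension of a class has a common workaround: keep the inheritance for binary compatibility in 3.x, but steer new code through an explicit List accessor, then drop the extends-clause in the next major release. A plain-Java sketch of that shape (these are not the actual Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Plain-Java sketch, not Lucene code: the 3.x class keeps extending Vector
// so existing callers compile, while a List accessor gives new code (and
// MergePolicy.OneMerge.segments) the shape the 4.0 API wants. In 4.0 the
// extends-clause is removed and the class holds the list directly.
public class DeprecateExtensionDemo {
    static class InfosSketch extends Vector<String> {
        /** Preferred accessor; in the next major version, the backing list. */
        List<String> asList() {
            return new ArrayList<>(this); // defensive copy; a live view is also possible
        }
    }

    public static void main(String[] args) {
        InfosSketch infos = new InfosSketch();
        infos.add("_0"); // legacy Vector API still works for old callers
        System.out.println(infos.asList());
    }
}
```

There is no annotation that forbids subclass *usage* of inherited methods, which is why the one-liner-override approach Uwe dismisses is the only way to attach @Deprecated warnings to them.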
[jira] [Resolved] (SOLR-2505) Output cluster scores
[ https://issues.apache.org/jira/browse/SOLR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2505. - Resolution: Fixed Committed to trunk and branch_3x. > Output cluster scores > - > > Key: SOLR-2505 > URL: https://issues.apache.org/jira/browse/SOLR-2505 > Project: Solr > Issue Type: Improvement > Components: contrib - Clustering >Reporter: Stanislaw Osinski >Assignee: Stanislaw Osinski >Priority: Minor > Fix For: 3.2, 4.0 > > > Carrot2 algorithms compute cluster scores; we could expose them on the output > from Solr clustering component. Along with scores, we can output a boolean > flag that marks the Other Topics groups. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2448. - Resolution: Fixed Committed to trunk and branch_3x. > Upgrade Carrot2 to version 3.5.0 > > > Key: SOLR-2448 > URL: https://issues.apache.org/jira/browse/SOLR-2448 > Project: Solr > Issue Type: Task > Components: contrib - Clustering >Reporter: Stanislaw Osinski >Assignee: Stanislaw Osinski >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, > SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar > > > Carrot2 version 3.5.0 should be available very soon. After the upgrade, it > will be possible to implement a few improvements to the clustering plugin; > I'll file separate issues for these. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2449) Loading of Carrot2 resources from Solr config directory
[ https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2449. - Resolution: Fixed Committed to trunk and branch_3x. > Loading of Carrot2 resources from Solr config directory > --- > > Key: SOLR-2449 > URL: https://issues.apache.org/jira/browse/SOLR-2449 > Project: Solr > Issue Type: Improvement > Components: contrib - Clustering >Reporter: Stanislaw Osinski >Assignee: Stanislaw Osinski > Fix For: 3.2, 4.0 > > Attachments: SOLR-2449.patch > > > Currently, Carrot2 clustering algorithms read linguistic resources (stop > words, stop labels) from the classpath (Carrot2 JAR), which makes them > difficult to edit/override. The directory from which Carrot2 should read its > resources (absolute, or relative to Solr config dir) could be specified in > the {{engine}} element. By default, the path could be e.g. > {{/clustering/carrot2}}. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
[ https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2450. - Resolution: Fixed Committed to trunk and branch_3x. > Carrot2 clustering should use both its own and Solr's stop words > > > Key: SOLR-2450 > URL: https://issues.apache.org/jira/browse/SOLR-2450 > Project: Solr > Issue Type: Improvement > Components: contrib - Clustering >Reporter: Stanislaw Osinski >Assignee: Stanislaw Osinski >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: SOLR-2450.patch > > > While using only Solr's stop words for clustering isn't a good idea (compared > to indexing, clustering needs more aggressive stop word removal to get > reasonable cluster labels), it would be good if Carrot2 used both its own and > Solr's stop words. > I'm not sure what the best way to implement this would be though. My first > thought was to simply load {{stopwords.txt}} from Solr config dir and merge > them with Carrot2's. But then, maybe a better approach would be to get the > stop words from the StopFilter being used? Ideally, we should also consider > the per-field stop filters configured on the fields used for clustering. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Field should accept BytesRef?
On Mon, May 16, 2011 at 11:29 AM, Jason Rutherglen wrote: >> But when you create an untokenized field (or even a binary field, which is >> stored-only at the moment), you could theoretically index the bytes directly > > Right, if I already have a BytesRef of what needs to be indexed, then > passing the BR into Field/able should reduce garbage collection of > strings? > you can do this with a tokenstream, see http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/Test2BTerms.java for an example (sorry i somehow was confused about your message earlier). - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: --- Attachment: LUCENE-3102.patch bq. Only thing is: I would be careful about directly setting those private fields of the cachedScorer; I think (not sure) this incurs an "access" check on each assignment. Maybe make them package protected? Or use a setter? Good catch Mike. I read about it some and found this nice webpage which explains the implications (http://www.glenmccl.com/jperf/). Indeed, if the member is private (whether it's in the inner or outer class), there is an access check. So the right thing to do is to declare it protected / package-private, which I did. Thanks for the opportunity to get some education! Patch fixes this. I intend to commit this shortly + move the class to core + apply to trunk. Then, I'll continue w/ the rest of the improvements. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? 
Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
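The access-check issue Mike and Shai discuss comes from how javac compiles access to a private member of a nested class from its enclosing class: on older compilers each such read or write goes through a generated synthetic `access$N` method rather than a direct field operation. (Since Java 11's nest-based access control, JEP 181, the synthetic method is gone either way, but package-private remains the conventional fix.) A sketch of the resulting pattern — hypothetical names, not the actual CachingCollector code:

```java
// Hypothetical sketch of the pattern from the patch: a hot inner "cached
// scorer" whose fields the outer collector assigns on every collected doc.
// The fields are deliberately package-private so the outer class writes
// them directly; if they were private, older javac versions would route
// each write through a synthetic access$N accessor method.
public class CachedScorerSketch {
    static final class CachedScorer {
        int doc;      // package-private: enclosing class assigns directly
        float score;
    }

    private final CachedScorer scorer = new CachedScorer();

    public float collect(int doc, float score) {
        scorer.doc = doc;     // direct field write, no accessor call
        scorer.score = score;
        return scorer.score;
    }

    public static void main(String[] args) {
        System.out.println(new CachedScorerSketch().collect(42, 1.5f));
    }
}
```

This is a micro-optimization that only matters on a per-document hot path like collect(); elsewhere a setter would be the cleaner choice.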
Re: Field should accept BytesRef?
> But when you create an untokenized field (or even a binary field, which is > stored-only at the moment), you could theoretically index the bytes directly Right, if I already have a BytesRef of what needs to be indexed, then passing the BR into Field/able should reduce garbage collection of strings? On Sun, May 15, 2011 at 9:59 AM, Uwe Schindler wrote: > Hi, > > I think Jason meant the field value, not the field name. > > Field names should stay Strings, as they are only "identifiers" making them > BytesRefs is not really useful. > > But when you create an untokenized field (or even a binary field, which is > stored-only at the moment), you could theoretically index the bytes directly. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Robert Muir [mailto:rcm...@gmail.com] >> Sent: Sunday, May 15, 2011 6:22 PM >> To: dev@lucene.apache.org >> Subject: Re: Field should accept BytesRef? >> >> On Sun, May 15, 2011 at 12:05 PM, Jason Rutherglen >> wrote: >> > In the Field object a text value must be of type string, however I >> > think we can allow a BytesRef to be passed in? >> > >> >> it would be nice if we sorted them in byte order too? I think right now >> fields >> are sorted in utf-16 order, but terms are sorted in utf-8 order? (if so, >> this is >> confusing) >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional >> commands, e-mail: dev-h...@lucene.apache.org > > > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
We seem to mark every new API as @lucene.experimental these days, so we shouldn't have too much of a problem when 4.0 is out :). Experimental API is subject to change at any time. We can consider that as an option as well (maybe it adds another option to Robert's?). Though personally, I'm not a big fan of this notion - I think we deceive ourselves and our users when we have @experimental on a "stable" branch. Any @experimental API on trunk today falls into this bucket after 4.0 is out, and I'm sure there are a couple in 3.x already. Don't get me wrong - I don't suggest we stop using it. But I think we should review the @experimental API before every "stable" release, and reduce it over time, not increase it.

Shai

On Mon, May 16, 2011 at 4:20 PM, Robert Muir wrote:
> On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer wrote:
> > I have to admit that branch is very rough and the API is super hard to
> > use. For now!
> > Let's not be dragged into a discussion of how this API should look;
> > there will be time for that.
>
> +1, this is what I really meant by "decide how to handle". I don't
> think we will be able to quickly "decide how to fix" the branch
> itself; I think it's really complicated. But we can admit it's really
> complicated and won't be solved very soon, and try to figure out a
> release strategy with this in mind.
>
> (p.s. sorry Simon, you got two copies of this message - I accidentally
> hit reply instead of reply-all)
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034059#comment-13034059 ] Robert Muir commented on SOLR-1942:

OK, thanks Grant. I'll take a look through the patch today and post back what I think.

> Ability to select codec per field
>
> Key: SOLR-1942
> URL: https://issues.apache.org/jira/browse/SOLR-1942
> Project: Solr
> Issue Type: New Feature
> Affects Versions: 4.0
> Reporter: Yonik Seeley
> Assignee: Grant Ingersoll
> Fix For: 4.0
>
> Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch
>
> We should use PerFieldCodecWrapper to allow users to select the codec per-field.
[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034053#comment-13034053 ] Simon Willnauer commented on LUCENE-3090:

I did 150 runs of all Lucene tests incl. contrib - no failure so far. Seems to be good to go.

> DWFlushControl does not take active DWPT out of the loop on fullFlush
>
> Key: LUCENE-3090
> URL: https://issues.apache.org/jira/browse/LUCENE-3090
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Priority: Critical
> Fix For: 4.0
>
> Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch
>
> We have seen several OOMs on TestNRTThreads, and all of them are caused by DWFlushControl missing DWPTs that are set as flushPending but can't flush due to a full flush going on. That means those DWPTs keep filling up in the background while they should actually be checked out and blocked until the full flush finishes. Furthermore, we currently stall on maxNumThreadStates while we should stall on the number of active thread states. I will attach a patch tomorrow.
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034051#comment-13034051 ] Grant Ingersoll commented on SOLR-1942:

I thought I would have time last week, but that turned out not to be the case. If you have time, Robert, feel free; otherwise I might be able to get to it later in the week (pending conf. prep). From the sounds of it, it likely just needs to be updated to trunk and then it should be ready to go (we should also doc it on the wiki).
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034050#comment-13034050 ] Martijn van Groningen commented on LUCENE-3098:

That is true. It is just a simple unordered collection of all values of the group field that match the query. I'll include this as well.

> Grouped total count
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch
>
> When grouping, you can currently get two counts:
> * Total hit count, which counts all documents that matched the query.
> * Total grouped hit count, which counts all documents that have been grouped in the top N groups.
> Since with grouping the end user gets groups in his search result instead of plain documents, the total number of groups as a total count makes more sense in many situations.
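To make the three counts discussed in this issue concrete, here is a self-contained toy sketch (plain Java, not Lucene's grouping API; the group values and class name are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupCounts {
    // Docs-per-group in first-seen order (a stand-in for relevance order).
    static Map<String, Integer> perGroup(String[] groupOf) {
        Map<String, Integer> m = new LinkedHashMap<>();
        for (String g : groupOf) m.merge(g, 1, Integer::sum);
        return m;
    }

    // Total group count: the number this issue proposes to expose.
    static int totalGroups(String[] groupOf) {
        return perGroup(groupOf).size();
    }

    // Total grouped hit count: docs falling into the top N groups.
    static int groupedHits(String[] groupOf, int topN) {
        int hits = 0, taken = 0;
        for (int count : perGroup(groupOf).values()) {
            if (taken++ == topN) break;
            hits += count;
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical matched docs: index = docId, value = group field value.
        String[] groupOf = {"a", "b", "a", "c", "b", "a", "d"};
        System.out.println(groupOf.length);          // total hit count: 7
        System.out.println(totalGroups(groupOf));    // total group count: 4
        System.out.println(groupedHits(groupOf, 2)); // grouped hit count, top 2: 5
    }
}
```

With 7 matching docs spread over 4 groups and the top 2 groups kept, the three counts (7, 4, 5) are all different, which is why exposing the group count separately is useful.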
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034040#comment-13034040 ] Michael McCandless commented on LUCENE-3098:

Right, we'd make it clear the collection is unordered. It just seems that, since we are building up this collection anyway, we may as well give the consumer access to it?
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
Hi Mark,

Thanks for clarifying the difference between contrib and full committers - I was probably too shy to subscribe myself to the latter group right away :-) For the time being, I'll most likely stick with maintaining the clustering bit and will consult you guys if I have something to contribute in the other areas of the code.

S.

On Mon, May 16, 2011 at 15:41, Mark Miller wrote:
> Stanislav - we certainly nominated you in the spirit of maintaining the
> Carrot2 contrib, but you are still a full committer. We have decided to stop
> adding new contrib committers. A full committer may be someone who only
> works on part of the project. IMO, a full committer might be someone who
> only has commit bits so that he can update the website! We trust full
> committers to only mess with what they are comfortable with. So we trust
> that you will stick to Carrot2 or other areas you are strong in, and that if
> you want to move into other code, you will do so intelligently. Essentially,
> by making you a committer, we are mostly just saying: "we trust you".
>
> But you are a full committer and not a contrib committer. We no longer mint
> new contrib committers.
>
> - Mark Miller
> lucidimagination.com
>
> Lucene/Solr User Conference
> May 25-26, San Francisco
> www.lucenerevolution.org
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034025#comment-13034025 ] Martijn van Groningen commented on LUCENE-3098:

Hmmm... so you get a list of all grouped values. That can be useful. Just remember that it doesn't tell you anything about the group head (the most relevant document of a group), since we don't sort inside the groups.
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034018#comment-13034018 ] Robert Muir commented on SOLR-1942:

Any update on this? It would be nice to be able to hook in CodecProviders and Codecs this way.
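At its core, per-field codec selection is a lookup with a default fallback. A hypothetical, self-contained sketch of that dispatch (not the actual PerFieldCodecWrapper API; the class and codec names are all illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldCodecLookup {
    private final Map<String, String> codecPerField = new HashMap<>();
    private final String defaultCodec;

    PerFieldCodecLookup(String defaultCodec) {
        this.defaultCodec = defaultCodec;
    }

    // Pin a specific codec to one field.
    void register(String field, String codecName) {
        codecPerField.put(field, codecName);
    }

    // Resolve the codec for a field, falling back to the default.
    String codecFor(String field) {
        return codecPerField.getOrDefault(field, defaultCodec);
    }

    public static void main(String[] args) {
        PerFieldCodecLookup lookup = new PerFieldCodecLookup("Standard");
        lookup.register("id", "Pulsing");            // hypothetical codec name
        System.out.println(lookup.codecFor("id"));   // Pulsing
        System.out.println(lookup.codecFor("body")); // Standard
    }
}
```

The Solr side of this issue is essentially exposing that mapping in schema configuration, with unmapped fields falling through to the default codec.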
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
On May 16, 2011, at 8:55 AM, Stanislaw Osinski wrote:
>> stanislav you are a full committer afaik?!
>
> I've been working mostly on the clustering plugin for now, so I'm not sure if
> it's right to move me to the core section right away :-)
>
> Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at
> people.apache.org to push the modifications live, but there is an SVN lock on
> that directory. Am I missing anything? I'm assuming that's the right
> directory for the committers list?
>
> S.

Stanislav - we certainly nominated you in the spirit of maintaining the Carrot2 contrib, but you are still a full committer. We have decided to stop adding new contrib committers. A full committer may be someone who only works on part of the project. IMO, a full committer might be someone who only has commit bits so that he can update the website! We trust full committers to only mess with what they are comfortable with. So we trust that you will stick to Carrot2 or other areas you are strong in, and that if you want to move into other code, you will do so intelligently. Essentially, by making you a committer, we are mostly just saying: "we trust you".

But you are a full committer and not a contrib committer. We no longer mint new contrib committers.

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
Hi Steve,

That explains everything, thanks! I somehow failed to locate that wiki page and was looking at http://wiki.apache.org/solr/Website_Update_HOWTO instead.

S.

On Mon, May 16, 2011 at 15:25, Steven A Rowe wrote:
> Hi Stanisław,
>
> You don't need to be logged into people.apache.org to update the website.
>
> Have you seen these instructions? The "unversioned website" section is
> what you want, I think:
>
> http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite
>
> Steve
>
> From: stac...@gmail.com [mailto:stac...@gmail.com] On Behalf Of Stanislaw Osinski
> Sent: Monday, May 16, 2011 8:56 AM
> To: dev@lucene.apache.org; simon.willna...@gmail.com
> Cc: java-...@lucene.apache.org; java-comm...@lucene.apache.org
> Subject: Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
>
> stanislav you are a full committer afaik?!
>
> I've been working mostly on the clustering plugin for now, so I'm not sure
> if it's right to move me to the core section right away :-)
>
> Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at
> people.apache.org to push the modifications live, but there is an SVN lock
> on that directory. Am I missing anything? I'm assuming that's the right
> directory for the committers list?
>
> S.