[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)
[ https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844916#action_12844916 ]

Simon Willnauer commented on LUCENE-2314:
-----------------------------------------

Small comment on the javadoc wording. Maybe like this:

{code}
/**
 * Copies the contents of this AttributeSource to the given AttributeSource.
 * The given instance has to provide all {@link Attribute}s this instance contains.
 * The actual attribute implementations must be identical in both {@link AttributeSource} instances.
 * Ideally, both AttributeSource instances should use the same {@link AttributeFactory}.
 */
{code}

Add AttributeSource.copyTo(AttributeSource)
-------------------------------------------

Key: LUCENE-2314
URL: https://issues.apache.org/jira/browse/LUCENE-2314
Project: Lucene - Java
Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-2314.patch, LUCENE-2314.patch

One problem with AttributeSource at the moment is the missing insight into AttributeSource.State. If you want to create TokenStreams that inspect captured states, you have no chance. Making the contents of State public is a bad idea, as it does not help for inspecting (it's a linked list, so you have to iterate).

AttributeSource currently contains a cloneAttributes() call, which returns a new AttributeSource with all current attributes cloned. This is the (more expensive) captureState. The problem is that you cannot copy back the cloned AS (which is the restoreState). To use this behaviour (by the way, ShingleMatrix can use it), one can alternatively use cloneAttributes and copyTo. You can easily change the cloned attributes, store them in lists, and copy them back. The only problem is the lower performance of these calls (as State is a very optimized class). One use case could be:

{code}
AttributeSource state = cloneAttributes();
// do something ...
state.getAttribute(TermAttribute.class).setTermBuffer("foobar");
// ... more work
state.copyTo(this);
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
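The clone-and-copy-back pattern in the snippet above can be sketched without any Lucene dependency. The AttrSource class below is a hypothetical stand-in that models an AttributeSource as a map from attribute name to value; it is not the Lucene API, just an illustration of the cloneAttributes()/copyTo() round trip.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AttributeSource (NOT the Lucene class):
// attributes are modeled as a simple name -> value map.
class AttrSource {
    final Map<String, String> attributes = new HashMap<>();

    // Like cloneAttributes(): returns a new source with all attributes copied,
    // so the clone can be inspected and modified independently.
    AttrSource cloneAttributes() {
        AttrSource clone = new AttrSource();
        clone.attributes.putAll(attributes);
        return clone;
    }

    // Like the proposed copyTo(): overwrites the target's attribute values
    // with this instance's values (the "restoreState" direction).
    void copyTo(AttrSource target) {
        target.attributes.putAll(attributes);
    }
}
```

The round trip then reads: clone, mutate the clone, copy it back, exactly as in the use case quoted above.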
[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)
[ https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844927#action_12844927 ]

Simon Willnauer commented on LUCENE-2314:
-----------------------------------------

looks good to me!

Key: LUCENE-2314
URL: https://issues.apache.org/jira/browse/LUCENE-2314
Attachments: LUCENE-2314.patch, LUCENE-2314.patch, LUCENE-2314.patch
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844420#action_12844420 ]

Simon Willnauer commented on LUCENE-2309:
-----------------------------------------

The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. Neither of them needs to know about Analyzer or TokenStream at all, and neither needs the analyzer to iterate over tokens. The IndexWriter should instead implement an interface, or use a class, that is called for each successful incrementToken(), no matter how this increment is implemented. I could imagine a really simple interface like:

{code}
interface AttributeConsumer {
  void setAttributeSource(AttributeSource src);
  void next();
  void end();
}
{code}

IW would then pass itself, or an instance it uses (DocInverterPerField), to an API expecting such a consumer, like:

{code}
field.consume(this);
{code}

or something similar. That way we have no dependency on whatever attribute producer is used. The default implementation is for sure based on an analyzer / tokenstream, and alternatives can be exposed via an expert API. Even backwards compatibility could be solved that way easily.

bq. Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence.

+1 - test dependencies should not block modularization; it's just about configuring the classpath, though!

Fully decouple IndexWriter from analyzers
-----------------------------------------

Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

IndexWriter only needs an AttributeSource to do indexing. Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (it's not analyzed; it is analyzed but it's a Reader or String; it's pre-analyzed). I'd like to have IW only interact with attr sources that already arrived with the fields. This would be a powerful decoupling -- it means others are free to make their own attr sources. They need not even use any of Lucene's analysis impls; e.g. they can integrate with other things like [OpenPipeline|http://www.openpipeline.org], or make something completely custom. LUCENE-2302 is already a big step towards this: it makes IW agnostic about which attr is the term, and only requires that it provide a BytesRef (for flex). Then I think LUCENE-2308 would get us most of the remaining way -- i.e., if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat.)
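The consumer inversion proposed in the comment above can be illustrated with a small self-contained sketch. None of these names are Lucene's: the String[] slot is a hypothetical stand-in for a shared AttributeSource, and SimpleProducer stands in for whatever drives the analysis chain, so the consumer never sees a TokenStream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed callback interface (not Lucene code).
// The indexer implements this and is notified once per produced token.
interface AttributeConsumer {
    void setAttributeSource(String[] sharedTerm); // stand-in for AttributeSource
    void next();                                  // one successful incrementToken()
    void end();                                   // end of the token sequence
}

// Any attribute producer can drive a consumer; here a plain array of terms
// replaces the analyzer/TokenStream machinery entirely.
class SimpleProducer {
    private final String[] terms;
    SimpleProducer(String[] terms) { this.terms = terms; }

    void consume(AttributeConsumer consumer) {
        String[] slot = new String[1];   // the shared "attribute source"
        consumer.setAttributeSource(slot);
        for (String t : terms) {
            slot[0] = t;                 // update the shared attribute in place
            consumer.next();             // notify: one more token is available
        }
        consumer.end();
    }
}

// A consumer standing in for the indexer: it just records what it saw.
class CollectingConsumer implements AttributeConsumer {
    final List<String> seen = new ArrayList<>();
    boolean ended = false;
    private String[] src;
    public void setAttributeSource(String[] src) { this.src = src; }
    public void next() { seen.add(src[0]); }
    public void end() { ended = true; }
}
```

The point of the design is visible in the types: SimpleProducer depends only on AttributeConsumer, never on how tokens are produced or consumed.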
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844464#action_12844464 ]

Simon Willnauer commented on LUCENE-2309:
-----------------------------------------

bq. [Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the IRC traffic, and I don't get the chance to participate directly unless I watch IRC constantly. We should really move back to JIRA / the dev list for such discussions. There is too much going on in IRC.

{quote}
Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...
{quote}

But that binds the indexer to a TokenStream, which is unnecessary IMO. What if I want to implement something besides the TokenStream delegator API?

Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844523#action_12844523 ]

Simon Willnauer commented on LUCENE-2309:
-----------------------------------------

bq. Then people could freely use Lucene to index off a foreign analysis chain...

That is what I was talking about!

{quote}
I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW, seem too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like real unnecessary overhead and complexity.
{quote}

We can surely hide this implementation completely from Field. I consider this similar to Collector, where you pass it explicitly to the search method if you want different behavior. Maybe something like an AttributeProducer. I don't think adding this to Field makes a lot of sense at all, and it is not worth the complexity.

bq. Will the Field also control how stored fields are added? Or only AttributeSourced ones?

IMO this is only about inverted fields.

bq. We (IW) control the indexing flow, and not the user.

The user only gets the possibility to exchange the analysis chain, not the control flow. The user can already mess around with stuff in incrementToken(); the only thing we change / invert is that the indexer does not know about TokenStreams anymore. It does not change the control flow, though.

Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
[jira] Commented: (LUCENE-2277) QueryNodeImpl throws ConcurrentModificationException on add(List<QueryNode>)
[ https://issues.apache.org/jira/browse/LUCENE-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842332#action_12842332 ]

Simon Willnauer commented on LUCENE-2277:
-----------------------------------------

Robert, should the CHANGES text rather say something about the argument that was completely ignored? This was simply a bug due to ignoring the argument but calling a similarly named method. Could be a bit picky, but I thought I should mention it. Simon

QueryNodeImpl throws ConcurrentModificationException on add(List<QueryNode>)
----------------------------------------------------------------------------

Key: LUCENE-2277
URL: https://issues.apache.org/jira/browse/LUCENE-2277
Project: Lucene - Java
Issue Type: Bug
Components: contrib/*
Affects Versions: 3.0
Environment: all
Reporter: Frank Wesemann
Assignee: Robert Muir
Priority: Critical
Fix For: 3.1
Attachments: addChildren.patch, LUCENE-2277.patch

On adding a List of children to a QueryNodeImpl implementation, a ConcurrentModificationException is thrown. This is due to the fact that QueryNodeImpl, instead of iterating over the supplied list, iterates over its internal clauses list. Patch:

{code}
Index: QueryNodeImpl.java
===================================================================
--- QueryNodeImpl.java  (revision 911642)
+++ QueryNodeImpl.java  (working copy)
@@ -74,7 +74,7 @@
           .getLocalizedMessage(QueryParserMessages.NODE_ACTION_NOT_SUPPORTED));
     }

-    for (QueryNode child : getChildren()) {
+    for (QueryNode child : children) {
       add(child);
     }
{code}
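The failure mode described above is easy to reproduce in isolation. NodeList below is a hypothetical stand-in for QueryNodeImpl, not the actual Lucene class: the buggy method iterates its own list while appending to it, so ArrayList's fail-fast iterator throws ConcurrentModificationException; the fixed method iterates the supplied argument, as in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for QueryNodeImpl's children handling (not Lucene code).
class NodeList {
    final List<String> children = new ArrayList<>();

    // Buggy variant: ignores the argument and iterates the internal list,
    // while add() structurally modifies that same list -> CME from the
    // fail-fast iterator (as soon as children is non-empty).
    void addAllBuggy(List<String> newChildren) {
        for (String child : children) {   // wrong list!
            children.add(child);          // concurrent structural change
        }
    }

    // Fixed variant: iterates the supplied list, as in the patch above.
    void addAllFixed(List<String> newChildren) {
        for (String child : newChildren) {
            children.add(child);
        }
    }
}
```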
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837752#action_12837752 ]

Simon Willnauer commented on LUCENE-2279:
-----------------------------------------

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream?

I would totally agree with that, but I guess we cannot remove this method until Lucene 4.0, which will be, hmm, in 2020 :) - just joking

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence, but the problem with ReusableAnalyzerBase is that it will break backwards compat if moved to Analyzer. It assumes both #reusableTokenStream and #tokenStream to be final and introduces a new factory method. Yet, as an analyzer developer you really want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the cases: it requires you to write half the code, plus gives you reusability of the token stream.

bq. I think Lucene/Solr/Nutch need to eventually get to this point

Huge +1 from my side. This could also unify the factory pattern Solr uses to build token streams. I would stop right here and ask to discuss it on the dev list. Thoughts, Mike?!

eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
-------------------------------------------------------------------------------------------------

Key: LUCENE-2279
URL: https://issues.apache.org/jira/browse/LUCENE-2279
Project: Lucene - Java
Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter. This is because for each document, Analyzer.tokenStream() is called, which ends up calling the StopFilter (if used). And if a regular Set<String> is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:

{code}
public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase) {
  super(input);
  if (stopWords instanceof CharArraySet) {
    this.stopWords = (CharArraySet) stopWords;
  } else {
    this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
    this.stopWords.addAll(stopWords);
  }
  this.enablePositionIncrements = enablePositionIncrements;
  init();
}
{code}

I feel we should make the StopFilter signature specific, as in specifying CharArraySet vs. Set, and there should be a JavaDoc warning on using the other variants of the StopFilter, as they all result in a copy for each invocation of Analyzer.tokenStream().
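The cast-or-copy branch in the ctor above can be sketched independently of Lucene. FastStopSet is a hypothetical stand-in for CharArraySet: if the caller already supplies the optimized type, the instance is reused with no copy; any other Set pays a copy, which is what makes constructing a StopFilter from a plain Set<String> on every tokenStream() call pathological.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for CharArraySet (not the Lucene class), showing
// the cast-or-copy decision the StopFilter ctor makes.
class FastStopSet extends HashSet<String> {
    static FastStopSet castOrCopy(Set<String> stopWords) {
        if (stopWords instanceof FastStopSet) {
            // Already the optimized type: reuse the caller's instance, no copy.
            return (FastStopSet) stopWords;
        }
        // Any other Set: pay a full copy into the optimized type. Done once
        // this is fine; done per tokenStream() call it dominates runtime.
        FastStopSet copy = new FastStopSet();
        copy.addAll(stopWords);
        return copy;
    }
}
```

The fix suggested in the issue amounts to pushing callers toward the first branch: build the optimized set once and pass that instance everywhere.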
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837465#action_12837465 ]

Simon Willnauer commented on LUCENE-2279:
-----------------------------------------

I don't consider this an issue at all. Each analyzer creating StopFilter instances uses CharArraySet internally, and if you write your own you should do so too. The JavaDoc of StopFilter clearly describes what is going on if you use a Set in favor of a CharArraySet. You should really use reusableTokenStream AND a CharArraySet instance. Have a look at how all the analyzers in current trunk handle stopwords. Once 3.1 is out you will also be able to subclass ReusableAnalyzerBase, which enables reusableTokenStream on the fly in 99% of the cases. I tend to close this issue, though. Robert?

Key: LUCENE-2279
URL: https://issues.apache.org/jira/browse/LUCENE-2279
[jira] Commented: (LUCENE-2255) IndexWriter.getReader() allocates file handles
[ https://issues.apache.org/jira/browse/LUCENE-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831107#action_12831107 ]

Simon Willnauer commented on LUCENE-2255:
-----------------------------------------

I see this coming up multiple times; we should document this properly in the javadoc and on the wiki. Jason, aren't you the NRT specialist here? What keeps you from attaching a patch for the IW javadoc? simon

IndexWriter.getReader() allocates file handles
----------------------------------------------

Key: LUCENE-2255
URL: https://issues.apache.org/jira/browse/LUCENE-2255
Project: Lucene - Java
Issue Type: Bug
Components: Index
Environment: Ubuntu 9.10, Java 6
Reporter: Mikkel Kamstrup Erlandsen
Attachments: LuceneManyCommits.java

I am not sure if this is a bug or really just me not reading the Javadocs right... The IR returned by IW.getReader() leaks file handles if you do not close() it, leading to exhaustion of the file handles available per process. If it were clear from the docs that this is a *new* reader and not some reference owned by the writer, then this would probably be OK. But as I read the docs, the reader is internally managed by the IW, which at first led me to believe that I shouldn't close it. So perhaps the docs should be amended to clearly state that this is a caller-owns reader that *must* be closed? Attaching a simple app that illustrates the problem.
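The caller-owns contract discussed above boils down to the following pattern. HandlePool and FileHandle are hypothetical stand-ins, not Lucene classes; FileHandle models the reader returned by IW.getReader(), and implementing AutoCloseable lets try-with-resources make the required close() hard to forget.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a caller-owns resource contract (not Lucene code).
class HandlePool {
    // Tracks how many handles are currently open, for illustration.
    static final AtomicInteger openHandles = new AtomicInteger();

    static class FileHandle implements AutoCloseable {
        FileHandle() { openHandles.incrementAndGet(); }
        @Override public void close() { openHandles.decrementAndGet(); }
    }

    // Like IW.getReader(): every call allocates a handle that the CALLER
    // must close; the factory does not manage the handle's lifetime.
    static FileHandle getReader() { return new FileHandle(); }
}
```

With a caller-owns contract, each acquisition pairs with a close, e.g. `try (HandlePool.FileHandle r = HandlePool.getReader()) { ... }`; forgetting the close is exactly the leak the reporter hit.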
[jira] Updated: (LUCENE-2253) Lucene 3.0 - Deprecated QueryParser Constructor in Demo Code [new QueryParser( contents, analyzer)]
[ https://issues.apache.org/jira/browse/LUCENE-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2253:
------------------------------------

Component/s: Examples
Priority: Trivial (was: Major)
Issue Type: Task (was: Bug)

Changed issue to Task / Trivial. Thanks for reporting this.

Lucene 3.0 - Deprecated QueryParser Constructor in Demo Code [new QueryParser(contents, analyzer)]
--------------------------------------------------------------------------------------------------

Key: LUCENE-2253
URL: https://issues.apache.org/jira/browse/LUCENE-2253
Project: Lucene - Java
Issue Type: Task
Components: Examples
Affects Versions: 2.9.1, 3.0
Reporter: Lock Levels
Priority: Trivial
Original Estimate: 1h
Remaining Estimate: 1h

Found this issue when following the getting-started tutorial with Lucene 3.0. It appears the QueryParser constructor was deprecated. The code in results.jsp should be changed from:

{code}
new QueryParser(contents, analyzer)
{code}

to:

{code}
new QueryParser(Version.LUCENE_CURRENT, contents, analyzer)
{code}

http://www.locklevels.com
[jira] Commented: (LUCENE-2080) Improve the documentation of Version
[ https://issues.apache.org/jira/browse/LUCENE-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830760#action_12830760 ]

Simon Willnauer commented on LUCENE-2080:
-----------------------------------------

I like this extension and I think it is important! Yet, I would use the following wording instead:

{quote}Additionally, you may need to re-test your entire application to ensure it behaves as expected, as some defaults may have changed and may break functionality in your application.{quote}

Improve the documentation of Version
------------------------------------

Key: LUCENE-2080
URL: https://issues.apache.org/jira/browse/LUCENE-2080
Project: Lucene - Java
Issue Type: Task
Components: Javadocs
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
Fix For: 2.9.2, 3.0, 3.1
Attachments: LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch

In my opinion, we should elaborate more on the effects of changing the Version parameter. Particularly, changing this value, even if you recompile your code, likely involves reindexing your data. I do not think this is adequately clear from the current javadocs.
[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts
[ https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830550#action_12830550 ]

Simon Willnauer commented on LUCENE-2248:
-----------------------------------------

bq. Simon, if you like you can use it as basis and start with contrib.

will do...

Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts
-------------------------------------------------------------------------------------------------------------

Key: LUCENE-2248
URL: https://issues.apache.org/jira/browse/LUCENE-2248
Project: Lucene - Java
Issue Type: Test
Components: Analysis, contrib/*, contrib/analyzers, contrib/benchmark, contrib/highlighter, contrib/spatial, contrib/spellchecker, contrib/wikipedia, Index, Javadocs, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-2248.patch, LUCENE-2248.patch

A lot of tests for the most recent functionality in Lucene use Version.LUCENE_CURRENT, which is fine in trunk, as we use the most recent version without hassle, changing this in later versions. The problem is: if we copy these tests to the backwards branch after 3.1 is out and then start to improve analyzers, we will have a maintenance hell for the backwards tests. And we lose backwards-compatibility testing for older versions. If we specified a concrete version like LUCENE_31 in our tests, they would have to work without any changes after moving to backwards! To avoid modifying all tests every time a new version comes out (e.g. after switching to 3.2 dev), I propose the following:

- declare a static final Version TEST_VERSION = Version.LUCENE_CURRENT (or better, Version.LUCENE_31) in LuceneTestCase(4J).
- change all tests that use Version.LUCENE_CURRENT (using an Eclipse refactoring) to use this constant, and remove unneeded import statements.

When we then move the tests to backwards, we must only change one line, depending on how we define this constant:

- If in trunk LuceneTestCase it's Version.LUCENE_CURRENT, we just change the backwards branch to use the version number of the released version.
- If trunk already uses the LUCENE_31 constant (I prefer this), we do not need to change backwards at all; instead, when switching version numbers, we just move trunk forward to the next major version (after it is added to the Version enum).
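The proposal above can be sketched as follows. The names mirror the proposal, but the code is illustrative rather than the actual Lucene test infrastructure; in particular, the local Version enum stands in for org.apache.lucene.util.Version.

```java
// Illustrative sketch of the proposal (not actual Lucene test code):
// pin the version under test in ONE constant in the shared base class, so
// moving tests to the backwards branch means changing a single line.
enum Version { LUCENE_30, LUCENE_31, LUCENE_CURRENT }

abstract class LuceneTestCase {
    // Trunk pins the concrete release here; the backwards branch keeps it.
    protected static final Version TEST_VERSION = Version.LUCENE_31;
}

class SomeAnalyzerTest extends LuceneTestCase {
    Version versionUsed() {
        // Tests reference the shared constant instead of inlining
        // Version.LUCENE_CURRENT everywhere.
        return TEST_VERSION;
    }
}
```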
[jira] Commented: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones
[ https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830195#action_12830195 ]

Simon Willnauer commented on LUCENE-2245:
-----------------------------------------

According to rmuir this will not interrupt LUCENE-2055, therefore I will commit this in a bit if nobody objects.

Remaining contrib testcases should use Version based ctors instead of deprecated ones
-------------------------------------------------------------------------------------

Key: LUCENE-2245
URL: https://issues.apache.org/jira/browse/LUCENE-2245
Project: Lucene - Java
Issue Type: Task
Components: contrib/*
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-2245.patch

Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / Analyzer etc.
[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts
[ https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829569#action_12829569 ]

Simon Willnauer commented on LUCENE-2248:
-----------------------------------------

Uwe, as I already said while we were discussing this, I would add the version to LuceneTestCase (or the equivalent for JUnit 4) and then do the tests in sub-issues, which prevents those super-huge patches. Thoughts?!

Key: LUCENE-2248
URL: https://issues.apache.org/jira/browse/LUCENE-2248
[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality
[ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829687#action_12829687 ] Simon Willnauer commented on LUCENE-2055: - Robert, nice work! I have one comment on StemmerOverrideFilter: the ctor should not always copy the given dictionary - if it is created with such a map we should use the given instance. This is similar to StopFilter vs. StopAnalyzer. Maybe a CharArrayMap.castOrCopy(Map<?, String>) would be handy in that case. One minor thing: the null check in DutchAnalyzer seems to be unnecessary, but anyway that's fine. {code} if (stemdict != null && !stemdict.isEmpty()) {code} DutchAnalyzer also has an unused import {code} import java.util.Arrays; {code} Except for those, +1 from my side. Fix buggy stemmers and Remove duplicate analysis functionality -- Key: LUCENE-2055 URL: https://issues.apache.org/jira/browse/LUCENE-2055 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Reporter: Robert Muir Fix For: 3.1 Attachments: LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch would like to remove stemmers in the following packages, and instead use a SnowballStemFilter in their analyzers: * analyzers/fr * analyzers/nl * analyzers/ru below are excerpts from this code where they proudly proclaim they use the snowball algorithm. I think we should delete all of this custom stemming code in favor of the actual snowball package.
{noformat}
/**
 * A stemmer for French words.
 * <p>
 * The algorithm is based on the work of
 * Dr Martin Porter on his snowball project<br>
 * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
 * (French stemming algorithm) for details
 * </p>
 */
public class FrenchStemmer {

/**
 * A stemmer for Dutch words.
 * <p>
 * The algorithm is an implementation of
 * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
 * algorithm in Martin Porter's snowball project.
 * </p>
 */
public class DutchStemmer {

/**
 * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
 */
class RussianStemmer
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
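The castOrCopy helper Simon proposes does not exist in Lucene at this point; a minimal sketch of its intended semantics, using a hypothetical stand-in map class rather than Lucene's actual CharArrayMap:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Lucene's CharArrayMap; the real class stores char[] keys.
public class SpecializedMap<V> extends HashMap<String, V> {
    public SpecializedMap() {}

    public SpecializedMap(Map<String, V> source) {
        super(source);
    }

    // Sketch of the proposed castOrCopy: reuse the given instance when it is
    // already the specialized type, otherwise make a defensive copy. This
    // mirrors the StopFilter vs. StopAnalyzer behavior the comment refers to.
    @SuppressWarnings("unchecked")
    public static <V> SpecializedMap<V> castOrCopy(Map<String, V> input) {
        if (input instanceof SpecializedMap) {
            return (SpecializedMap<V>) input; // no copy needed
        }
        return new SpecializedMap<V>(input);
    }
}
```

A ctor built on such a helper avoids copying a dictionary that was already handed in as the specialized type.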
[jira] Commented: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones
[ https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829794#action_12829794 ] Simon Willnauer commented on LUCENE-2245: - I will hold off with this patch until LUCENE-2055 is committed; I don't want to interrupt Robert's work with this cleanup. Remaining contrib testcases should use Version based ctors instead of deprecated ones - Key: LUCENE-2245 URL: https://issues.apache.org/jira/browse/LUCENE-2245 Project: Lucene - Java Issue Type: Task Components: contrib/* Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2245.patch Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / Analyzer etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors
[ https://issues.apache.org/jira/browse/LUCENE-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806839#action_12806839 ] Simon Willnauer commented on LUCENE-2242: - I will commit this in a bit if nobody objects Contrib CharTokenizer classes should be instantiated using their new Version based ctors Key: LUCENE-2242 URL: https://issues.apache.org/jira/browse/LUCENE-2242 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2242.patch Contrib CharTokenizer classes should be instantiated using their new Version based ctors introduced by LUCENE-2183 and LUCENE-2240 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones
Remaining contrib testcases should use Version based ctors instead of deprecated ones - Key: LUCENE-2245 URL: https://issues.apache.org/jira/browse/LUCENE-2245 Project: Lucene - Java Issue Type: Task Components: contrib/* Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / Analyzer etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones
[ https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2245: Attachment: LUCENE-2245.patch this patch fixes the remaining testcases in contrib. Remaining contrib testcases should use Version based ctors instead of deprecated ones - Key: LUCENE-2245 URL: https://issues.apache.org/jira/browse/LUCENE-2245 Project: Lucene - Java Issue Type: Task Components: contrib/* Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2245.patch Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / Analyzer etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones
[ https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2245: --- Assignee: Simon Willnauer Remaining contrib testcases should use Version based ctors instead of deprecated ones - Key: LUCENE-2245 URL: https://issues.apache.org/jira/browse/LUCENE-2245 Project: Lucene - Java Issue Type: Task Components: contrib/* Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2245.patch Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / Analyzer etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors
[ https://issues.apache.org/jira/browse/LUCENE-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2242. - Resolution: Fixed Committed revision 905065. Contrib CharTokenizer classes should be instantiated using their new Version based ctors Key: LUCENE-2242 URL: https://issues.apache.org/jira/browse/LUCENE-2242 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2242.patch Contrib CharTokenizer classes should be instantiated using their new Version based ctors introduced by LUCENE-2183 and LUCENE-2240 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2240) SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors
[ https://issues.apache.org/jira/browse/LUCENE-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806658#action_12806658 ] Simon Willnauer commented on LUCENE-2240: - bq. Patch looks good, I will commit this with LUCENE-2241 in a day or two. cool, I will go on with LUCENE-2242 and rest of contrib once this is committed SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors --- Key: LUCENE-2240 URL: https://issues.apache.org/jira/browse/LUCENE-2240 Project: Lucene - Java Issue Type: Task Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Uwe Schindler Priority: Minor Fix For: 3.1 Attachments: LUCENE-2240.patch Due to the Changes to CharTokenizer ( LUCENE-2183 ) WhitespaceAnalyzer and SimpleAnalyzer need a Version ctor. Default ctors must be deprecated -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2243) FastVectorHighlighter: support DisjunctionMaxQuery
[ https://issues.apache.org/jira/browse/LUCENE-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806659#action_12806659 ] Simon Willnauer commented on LUCENE-2243: - Koji, could you use a foreach loop instead of the iterator... just my 0.02$ {code} DisjunctionMaxQuery dmq = (DisjunctionMaxQuery)sourceQuery; for (Query query : dmq) { flatten(query, flatQueries); } {code} simon FastVectorHighlighter: support DisjunctionMaxQuery -- Key: LUCENE-2243 URL: https://issues.apache.org/jira/browse/LUCENE-2243 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Affects Versions: 2.9, 2.9.1, 3.0 Reporter: Koji Sekiguchi Priority: Minor Fix For: 3.1 Attachments: LUCENE-2243.patch Add DisjunctionMaxQuery support in FVH. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2238) deprecate ChineseAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2238. - Resolution: Fixed committed in revision 904521 thanks robert deprecate ChineseAnalyzer - Key: LUCENE-2238 URL: https://issues.apache.org/jira/browse/LUCENE-2238 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2238.patch The ChineseAnalyzer, ChineseTokenizer, and ChineseFilter (not the smart one, or CJK) indexes chinese text as individual characters and removes english stopwords, etc. In my opinion we should simply deprecate all of this in favor of StandardAnalyzer, StandardTokenizer, and StopFilter, which does the same thing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt -- Key: LUCENE-2239 URL: https://issues.apache.org/jira/browse/LUCENE-2239 Project: Lucene - Java Issue Type: Task Reporter: Simon Willnauer I created this issue as a spin-off from http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e We should decide what to do with NIOFSDirectory, if we want to keep it as the default on non-Windows platforms, and how we want to document this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
[ https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2239: Component/s: Store Affects Version/s: 2.4 2.4.1 2.9 2.9.1 3.0 Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt -- Key: LUCENE-2239 URL: https://issues.apache.org/jira/browse/LUCENE-2239 Project: Lucene - Java Issue Type: Task Components: Store Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer I created this issue as a spin-off from http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e We should decide what to do with NIOFSDirectory, if we want to keep it as the default on non-Windows platforms, and how we want to document this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
[ https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2239: Attachment: LUCENE-2239.patch This patch adds documentation to NIOFSDirectory and provides a testcase triggering the behavior. This might be a little out of date now, but I thought I'd add it for completeness. Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt -- Key: LUCENE-2239 URL: https://issues.apache.org/jira/browse/LUCENE-2239 Project: Lucene - Java Issue Type: Task Components: Store Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Attachments: LUCENE-2239.patch I created this issue as a spin-off from http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e We should decide what to do with NIOFSDirectory, if we want to keep it as the default on non-Windows platforms, and how we want to document this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
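The NIO limitation LUCENE-2239 is about can be demonstrated outside Lucene: when a thread's interrupt status is set while it performs blocking I/O on an interruptible channel, the JDK closes the whole channel and throws ClosedByInterruptException, invalidating the channel for every other thread sharing it. A self-contained sketch (plain JDK, not Lucene code; the modern java.nio.file API is used for brevity):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioInterruptDemo {
    // Returns true when an interrupt during channel I/O closed the channel.
    static boolean interruptClosesChannel() throws IOException {
        Path tmp = Files.createTempFile("nio-demo", ".bin");
        try {
            Files.write(tmp, new byte[1024]);
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                Thread.currentThread().interrupt(); // pending interrupt before the read
                try {
                    ch.read(ByteBuffer.allocate(16));
                } catch (ClosedByInterruptException expected) {
                    // NIO closed the entire channel, not just this one read.
                    return !ch.isOpen();
                } finally {
                    Thread.interrupted(); // clear the flag so cleanup is unaffected
                }
            }
            return false;
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This is why an NIOFSDirectory shared across threads is fragile in code that uses Thread.interrupt to cancel work: one interrupted searcher can break the index input for all others.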
[jira] Created: (LUCENE-2240) SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors
SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors --- Key: LUCENE-2240 URL: https://issues.apache.org/jira/browse/LUCENE-2240 Project: Lucene - Java Issue Type: Task Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 Due to the Changes to CharTokenizer ( LUCENE-2183 ) WhitespaceAnalyzer and SimpleAnalyzer need a Version ctor. Default ctors must be deprecated -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2240) SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors
[ https://issues.apache.org/jira/browse/LUCENE-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2240: Attachment: LUCENE-2240.patch This patch adds the new Version ctors and deprecates the default ctor. I did not change any references, as I want to split those up into smaller issues. I had already changed all references once, which resulted in a 400k patch; we should rather do it step by step. SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors --- Key: LUCENE-2240 URL: https://issues.apache.org/jira/browse/LUCENE-2240 Project: Lucene - Java Issue Type: Task Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2240.patch Due to the Changes to CharTokenizer ( LUCENE-2183 ) WhitespaceAnalyzer and SimpleAnalyzer need a Version ctor. Default ctors must be deprecated -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
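The Version-ctor pattern behind LUCENE-2240 can be sketched without Lucene on the classpath. The class and enum below are hypothetical stand-ins that only illustrate the idiom: a Version-taking ctor gates behavior changes per release, while the deprecated no-arg ctor preserves the old behavior.

```java
public class VersionedAnalyzer {
    // Stand-in for Lucene's org.apache.lucene.util.Version enum.
    public enum Version { LUCENE_24, LUCENE_30, LUCENE_31 }

    private final Version matchVersion;

    // New-style ctor: callers state which release's behavior they expect,
    // so behavior can change in 3.1 without silently breaking 3.0 indexes.
    public VersionedAnalyzer(Version matchVersion) {
        this.matchVersion = matchVersion;
    }

    // Deprecated default ctor keeps the pre-Version (3.0) behavior.
    @Deprecated
    public VersionedAnalyzer() {
        this(Version.LUCENE_30);
    }

    // Example of version-gated behavior, in the spirit of what LUCENE-2183
    // does for CharTokenizer subclasses.
    public boolean supplementaryAware() {
        return matchVersion.compareTo(Version.LUCENE_31) >= 0;
    }
}
```

The test cleanups in LUCENE-2241 and LUCENE-2245 then amount to replacing calls to the deprecated ctor with the Version-taking one.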
[jira] Created: (LUCENE-2241) Core Tests should call Version based ctors instead of deprecated default ctors
Core Tests should call Version based ctors instead of deprecated default ctors -- Key: LUCENE-2241 URL: https://issues.apache.org/jira/browse/LUCENE-2241 Project: Lucene - Java Issue Type: Task Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 LUCENE-2183 introduced new ctors for all CharTokenizer subclasses. Core tests should use those ctors with Version.LUCENE_CURRENT instead of the deprecated ctors. Further, LUCENE-2240 introduces more Version ctors for WhitespaceAnalyzer and SimpleAnalyzer. Tests should also use their Version ctors instead of the default ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2241) Core Tests should call Version based ctors instead of deprecated default ctors
[ https://issues.apache.org/jira/browse/LUCENE-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2241: Attachment: LUCENE-2241.patch Converted all core tests to use Version ctors. Core Tests should call Version based ctors instead of deprecated default ctors -- Key: LUCENE-2241 URL: https://issues.apache.org/jira/browse/LUCENE-2241 Project: Lucene - Java Issue Type: Task Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2241.patch LUCENE-2183 introduced new ctors for all CharTokenizer subclasses. Core tests should use those ctors with Version.LUCENE_CURRENT instead of the deprecated ctors. Further, LUCENE-2240 introduces more Version ctors for WhitespaceAnalyzer and SimpleAnalyzer. Tests should also use their Version ctors instead of the default ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors
Contrib CharTokenizer classes should be instantiated using their new Version based ctors Key: LUCENE-2242 URL: https://issues.apache.org/jira/browse/LUCENE-2242 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Contrib CharTokenizer classes should be instantiated using their new Version based ctors introduced by LUCENE-2183 and LUCENE-2240 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors
[ https://issues.apache.org/jira/browse/LUCENE-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2242: Attachment: LUCENE-2242.patch converted contrib/analyzers Contrib CharTokenizer classes should be instantiated using their new Version based ctors Key: LUCENE-2242 URL: https://issues.apache.org/jira/browse/LUCENE-2242 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2242.patch Contrib CharTokenizer classes should be instantiated using their new Version based ctors introduced by LUCENE-2183 and LUCENE-2240 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2183: Attachment: LUCENE-2183.patch Added CHANGES.TXT entry and fixed 2 supplementary chars related bugs in the new version of incrementToken(). Testcases added for the bugs. Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch CharTokenizer is an abstract base class for all Tokenizers operating on a character level. Yet, those tokenizers still use char primitives instead of int codepoints. CharTokenizer should operate on codepoints and preserve bw compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805909#action_12805909 ] Simon Willnauer commented on LUCENE-2183: - I did run the following benchmark alg file against the latest patch (specialized old and new methods), the patch with the proxy methods and the old 3.0 code. The outcome shows that the specialized code is about ~8% faster than the proxy class based code, so I would rather keep the specialized code as this class is performance sensitive. The .alg file:
{quote}
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
content.source.forever=false
{ Rounds { ReadTokens ReadTokens : * NewRound ResetSystemErase } : 10
RepAll
{quote}
10 Rounds with the latest patch
{quote}
Report All (11 out of 12)
Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Rounds_10               0       1           0   0.00       14.83   5,049,432   66,453,504
ReadTokens_Exhaust      0       1           0   0.00        2.07  34,558,000   55,705,600
ReadTokens_Exhaust      1       1           0   0.00        1.40  41,865,312   60,555,264
ReadTokens_Exhaust      2       1           0   0.00        1.22  34,393,904   63,176,704
ReadTokens_Exhaust      3       1           0   0.00        1.24  15,440,624   64,487,424
ReadTokens_Exhaust      4       1           0   0.00        1.22   7,540,512   65,601,536
ReadTokens_Exhaust      5       1           0   0.00        1.21  50,174,760   67,239,936
ReadTokens_Exhaust      6       1           0   0.00        1.19  22,202,768   67,174,400
ReadTokens_Exhaust      7       1           0   0.00        1.19  20,591,672   68,812,800
ReadTokens_Exhaust      8       1           0   0.00        1.18  63,749,984   69,009,408
ReadTokens_Exhaust      9       1           0   0.00        1.19  22,331,600   68,943,872
{quote}
10 rounds with Proxy Class
{quote}
Report All (11 out of 12)
Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Rounds_10               0       1           0   0.00       16.33   5,021,144   67,436,544
ReadTokens_Exhaust      0       1           0   0.00        2.34  44,649,496   59,244,544
ReadTokens_Exhaust      1       1           0   0.00        1.53  36,681,952   61,472,768
ReadTokens_Exhaust      2       1           0   0.00        1.37  13,863,688   64,094,208
ReadTokens_Exhaust      3       1           0   0.00        1.34  50,247,864   65,470,464
ReadTokens_Exhaust      4       1           0   0.00        1.36  14,922,888   66,322,432
ReadTokens_Exhaust      5       1           0   0.00        1.36   5,718,296   67,371,008
ReadTokens_Exhaust      6       1           0   0.00        1.32  54,583,776   67,502,080
ReadTokens_Exhaust      7       1           0   0.00        1.33  35,739,800   68,943,872
ReadTokens_Exhaust      8       1           0   0.00        1.32  24,985,688   69,861,376
ReadTokens_Exhaust      9       1           0   0.00        1.29  64,138,112   69,730,304
{quote}
10 rounds with current trunk
{quote}
Report All (11 out of 12)
Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Rounds_10               0       1           0   0.00       15.19   5,040,928   66,256,896
ReadTokens_Exhaust      0       1           0   0.00        2.15  39,548,440   55,443,456
ReadTokens_Exhaust      1       1           0   0.00        1.43  28,088,544   60,096,512
ReadTokens_Exhaust      2       1           0   0.00        1.27  16,004,088   61,800,448
ReadTokens_Exhaust      3       1           0   0.00        1.25  51,034,016   63,045,632
ReadTokens_Exhaust      4       1           0   0.00        1.24  23,371,056   63,504,384
ReadTokens_Exhaust      5       1           0   0.00        1.24  12,964,368   65,208,320
ReadTokens_Exhaust      6       1           0   0.00        1.25   6,598,128   65,601,536
ReadTokens_Exhaust      7       1           0   0.00
[jira] Issue Comment Edited: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805909#action_12805909 ] Simon Willnauer edited comment on LUCENE-2183 at 1/28/10 1:16 PM: -- I did run the following benchmark alg file against the latest patch (specialized old and new methods), the patch with the proxy methods and the old 3.0 code. The outcome shows that the specialized code is about ~8% faster than the proxy class based code, so I would rather keep the specialized code as this class is performance sensitive. The .alg file:
{code}
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
content.source.forever=false
{ Rounds { ReadTokens ReadTokens : * NewRound ResetSystemErase } : 10
RepAll
{code}
10 Rounds with the latest patch
{code}
Report All (11 out of 12)
Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Rounds_10               0       1           0   0.00       14.83   5,049,432   66,453,504
ReadTokens_Exhaust      0       1           0   0.00        2.07  34,558,000   55,705,600
ReadTokens_Exhaust      1       1           0   0.00        1.40  41,865,312   60,555,264
ReadTokens_Exhaust      2       1           0   0.00        1.22  34,393,904   63,176,704
ReadTokens_Exhaust      3       1           0   0.00        1.24  15,440,624   64,487,424
ReadTokens_Exhaust      4       1           0   0.00        1.22   7,540,512   65,601,536
ReadTokens_Exhaust      5       1           0   0.00        1.21  50,174,760   67,239,936
ReadTokens_Exhaust      6       1           0   0.00        1.19  22,202,768   67,174,400
ReadTokens_Exhaust      7       1           0   0.00        1.19  20,591,672   68,812,800
ReadTokens_Exhaust      8       1           0   0.00        1.18  63,749,984   69,009,408
ReadTokens_Exhaust      9       1           0   0.00        1.19  22,331,600   68,943,872
{code}
10 rounds with Proxy Class
{code}
Report All (11 out of 12)
Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Rounds_10               0       1           0   0.00       16.33   5,021,144   67,436,544
ReadTokens_Exhaust      0       1           0   0.00        2.34  44,649,496   59,244,544
ReadTokens_Exhaust      1       1           0   0.00        1.53  36,681,952   61,472,768
ReadTokens_Exhaust      2       1           0   0.00        1.37  13,863,688   64,094,208
ReadTokens_Exhaust      3       1           0   0.00        1.34  50,247,864   65,470,464
ReadTokens_Exhaust      4       1           0   0.00        1.36  14,922,888   66,322,432
ReadTokens_Exhaust      5       1           0   0.00        1.36   5,718,296   67,371,008
ReadTokens_Exhaust      6       1           0   0.00        1.32  54,583,776   67,502,080
ReadTokens_Exhaust      7       1           0   0.00        1.33  35,739,800   68,943,872
ReadTokens_Exhaust      8       1           0   0.00        1.32  24,985,688   69,861,376
ReadTokens_Exhaust      9       1           0   0.00        1.29  64,138,112   69,730,304
{code}
10 rounds with current trunk
{code}
Report All (11 out of 12)
Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Rounds_10               0       1           0   0.00       15.19   5,040,928   66,256,896
ReadTokens_Exhaust      0       1           0   0.00        2.15  39,548,440   55,443,456
ReadTokens_Exhaust      1       1           0   0.00        1.43  28,088,544   60,096,512
ReadTokens_Exhaust      2       1           0   0.00        1.27  16,004,088   61,800,448
ReadTokens_Exhaust      3       1           0   0.00        1.25  51,034,016   63,045,632
ReadTokens_Exhaust      4       1           0   0.00        1.24  23,371,056   63,504,384
ReadTokens_Exhaust      5       1           0   0.00        1.24  12,964,368   65,208,320
ReadTokens_Exhaust      6       1           0   0.00        1.25   6,598,128   65,601,536
ReadTokens_Exhaust      7
[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806028#action_12806028 ] Simon Willnauer commented on LUCENE-2183: - bq. For that a link using javadoc {@link Character#supplementary} would be good. I will fix this here, as I already have the patch applied and will commit it later. Uwe, I will take care of it and upload another patch. Thanks for being picky rob! Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch CharTokenizer is an abstract base class for all Tokenizers operating on a character level. Yet, those tokenizers still use char primitives instead of int codepoints. CharTokenizer should operate on codepoints and preserve bw compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
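The supplementary-character problem LUCENE-2183 addresses can be shown with plain JDK calls: a char-by-char loop (the old CharTokenizer approach) sees a letter outside the BMP as two surrogate chars, neither of which is classified as a letter, while a codepoint-aware loop handles it correctly. This is an illustrative sketch, not Lucene's actual tokenizer code:

```java
public class SupplementaryDemo {
    // Old style: classify one char at a time; surrogate halves of a
    // supplementary letter fall into category Cs and are not letters.
    static int lettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) n++;
        }
        return n;
    }

    // New style: advance by codepoint, so a surrogate pair is classified
    // as the single supplementary character it encodes.
    static int lettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) n++;
            i += Character.charCount(cp);
        }
        return n;
    }
}
```

For input containing U+10400 (DESERET CAPITAL LETTER LONG I, encoded as the pair \uD801\uDC00), the two loops disagree, which is exactly the class of bug the patch's testcases target.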
[jira] Assigned: (LUCENE-2238) deprecate ChineseAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2238: --- Assignee: Simon Willnauer deprecate ChineseAnalyzer - Key: LUCENE-2238 URL: https://issues.apache.org/jira/browse/LUCENE-2238 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2238.patch The ChineseAnalyzer, ChineseTokenizer, and ChineseFilter (not the smart one, or CJK) indexes chinese text as individual characters and removes english stopwords, etc. In my opinion we should simply deprecate all of this in favor of StandardAnalyzer, StandardTokenizer, and StopFilter, which does the same thing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2238) deprecate ChineseAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806269#action_12806269 ] Simon Willnauer commented on LUCENE-2238: - +1 I will commit this later today if nobody objects deprecate ChineseAnalyzer - Key: LUCENE-2238 URL: https://issues.apache.org/jira/browse/LUCENE-2238 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-2238.patch The ChineseAnalyzer, ChineseTokenizer, and ChineseFilter (not the smart one, or CJK) indexes chinese text as individual characters and removes english stopwords, etc. In my opinion we should simply deprecate all of this in favor of StandardAnalyzer, StandardTokenizer, and StopFilter, which does the same thing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805508#action_12805508 ] Simon Willnauer commented on LUCENE-2183: - Short update: I found a bug in the latest version which was untested. I will update soon with a speed comparison between the current version and the version using the proxy class. Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch CharTokenizer is an abstract base class for all Tokenizers operating on a character level. Yet, those tokenizers still use char primitives instead of int codepoints. CharTokenizer should operate on codepoints and preserve bw compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1845) if the build fails to download JARs for contrib/db, just skip its tests
[ https://issues.apache.org/jira/browse/LUCENE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-1845: Attachment: LUCENE-1845.patch I haven't looked at this issue for a while, but I noticed today that the version we are using is not available for download anymore on the Oracle pages. If you follow the link in the build file you will be able to download the zip file, but I guess we should upgrade to the latest 3.3 version of BDB-JE. (see http://www.oracle.com/technology/software/products/berkeley-db/je/index.html - version 3.3.69) There is also another mirror that serves the jar directly (a Maven repository) that might be more reliable. I updated the patch to load the 3.3.93 version of the jar directly and skip the unzip step, as we now download only the jar file. I also updated the Maven POM template files to reference the right version of BDB-JE, which wasn't the case before. I think we should give the Maven-repo mirror a chance. if the build fails to download JARs for contrib/db, just skip its tests --- Key: LUCENE-1845 URL: https://issues.apache.org/jira/browse/LUCENE-1845 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-1845.patch, LUCENE-1845.patch, LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt Every so often our nightly build fails because contrib/db is unable to download the necessary BDB JARs from http://downloads.osafoundation.org. I think in such cases we should simply skip contrib/db's tests, if it's the nightly build that's running, since it's a false positive failure.
[jira] Commented: (LUCENE-1845) if the build fails to download JARs for contrib/db, just skip its tests
[ https://issues.apache.org/jira/browse/LUCENE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801404#action_12801404 ] Simon Willnauer commented on LUCENE-1845: - Mike, can you take this issue? It unfortunately touches core stuff. :/ Simon if the build fails to download JARs for contrib/db, just skip its tests --- Key: LUCENE-1845 URL: https://issues.apache.org/jira/browse/LUCENE-1845 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 3.1 Attachments: LUCENE-1845.patch, LUCENE-1845.patch, LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt
[jira] Created: (LUCENE-2220) Stackoverflow when calling deprecated CharArraySet.copy()
Stackoverflow when calling deprecated CharArraySet.copy() - Key: LUCENE-2220 URL: https://issues.apache.org/jira/browse/LUCENE-2220 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Calling CharArraySet#copy(set) without the version argument (deprecated) with an instance of CharArraySet results in a stack overflow, as this method checks whether the given set is a CharArraySet and then calls itself again. This was accidentally introduced due to an overloaded alternative method during LUCENE-2169 which was not used in the final patch.
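To illustrate the bug pattern described above, here is a simplified, hypothetical sketch (a toy `SimpleCharSet`, not the real CharArraySet code): the deprecated copy method special-cases its own type but then re-enters itself instead of delegating to an actual copy, so it recurses until the stack overflows.

```java
import java.util.HashSet;
import java.util.Set;

// Toy reproduction of the self-recursive copy() bug; names are illustrative.
public class CopyOverflowSketch {
    static class SimpleCharSet extends HashSet<String> {
        SimpleCharSet(Set<String> other) { super(other); }
    }

    // Buggy variant: infinite self-recursion for SimpleCharSet arguments.
    static SimpleCharSet copyBuggy(Set<String> set) {
        if (set instanceof SimpleCharSet) {
            return copyBuggy(set); // BUG: calls itself with the same argument
        }
        return new SimpleCharSet(set);
    }

    // Fixed variant: the special case performs an actual copy.
    static SimpleCharSet copyFixed(Set<String> set) {
        return new SimpleCharSet(set); // toy stand-in for the optimized copy
    }

    public static void main(String[] args) {
        SimpleCharSet s = new SimpleCharSet(Set.of("foo", "bar"));
        SimpleCharSet copy = copyFixed(s);
        System.out.println(copy.contains("foo") && copy != s); // true
        try {
            copyBuggy(s);
        } catch (StackOverflowError expected) {
            System.out.println("stack overflow reproduced");
        }
    }
}
```

The fix in the real patch is simply to make the `instanceof` branch delegate to the optimized per-instance copy instead of re-entering the same method.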
[jira] Updated: (LUCENE-2220) Stackoverflow when calling deprecated CharArraySet.copy()
[ https://issues.apache.org/jira/browse/LUCENE-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2220: Attachment: LUCENE-2220.patch Here is a patch and the extended test case. Stackoverflow when calling deprecated CharArraySet.copy() - Key: LUCENE-2220 URL: https://issues.apache.org/jira/browse/LUCENE-2220 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2220.patch
[jira] Commented: (LUCENE-1845) if the build fails to download JARs for contrib/db, just skip its tests
[ https://issues.apache.org/jira/browse/LUCENE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801416#action_12801416 ] Simon Willnauer commented on LUCENE-1845: - Mike, thanks for resolving this. I already replied to the commit mail, but I mention it here again for completeness: we should add a CHANGES.txt entry to notify users that we upgraded the version. Simon if the build fails to download JARs for contrib/db, just skip its tests --- Key: LUCENE-1845 URL: https://issues.apache.org/jira/browse/LUCENE-1845 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-1845.patch, LUCENE-1845.patch, LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt
[jira] Updated: (LUCENE-2198) support protected words in Stemming TokenFilters
[ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2198: Attachment: LUCENE-2198.patch This patch ports all stemmers in core and contrib/analyzers to make use of the KeywordAttribute. I did not include snowball yet. support protected words in Stemming TokenFilters Key: LUCENE-2198 URL: https://issues.apache.org/jira/browse/LUCENE-2198 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.0 Reporter: Robert Muir Priority: Minor Attachments: LUCENE-2198.patch, LUCENE-2198.patch This is from LUCENE-1515. I propose that all stemming TokenFilters have an 'exclusion set' that bypasses any stemming for words in this set. Some stemming TokenFilters have this, some do not. This would be one way for Karl to implement his new Swedish stemmer (as a text file of ignore words). Additionally, it would remove duplication between Lucene and Solr, as they reimplement SnowballFilter since it does not have this functionality. Finally, I think this is a pretty common use case, where people want to ignore things like proper nouns in the stemming. As an alternative design I considered a case where we generalized this to CharArrayMap (and ignoring words would mean mapping them to themselves), which would also provide a mechanism to override the stemming algorithm. But I think this is too expert, could be its own filter, and the only example of this I can find is in the Dutch stemmer. So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment.
[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters
[ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801446#action_12801446 ] Simon Willnauer commented on LUCENE-2198: - I kind of agree with both of you. When I started implementing this attribute I had FlagsAttribute in mind, but I didn't choose it because users can randomly choose a bit of the word, which might lead to unexpected behavior. Another solution I had in mind is to introduce another Attribute (or extend FlagsAttribute) holding a Lucene-private (not the Java visibility keyword) enum that can be extended in the future. Internally this could use a word or a BitSet (a word will do, I guess) where bits can be set according to the enum ordinal. That way we could encode far more than a single boolean, and the cost of adding new flags / enum values would be minimal. {code} booleanAttribute.isSet(BooleanAttributeEnum.Keyword) {code} Something like that - thoughts? support protected words in Stemming TokenFilters Key: LUCENE-2198 URL: https://issues.apache.org/jira/browse/LUCENE-2198 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.0 Reporter: Robert Muir Priority: Minor Attachments: LUCENE-2198.patch, LUCENE-2198.patch
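The enum-plus-bit-word idea above could be sketched roughly as follows. This is a speculative illustration of the proposal, not Lucene code; the names `BooleanAttributeEnum` and `BooleanAttributeImpl` are invented for the example, and each enum ordinal owns one bit of a single long word.

```java
// Hypothetical sketch of a flag attribute backed by one machine word.
public class FlagWordSketch {
    enum BooleanAttributeEnum { KEYWORD, STOPWORD } // extensible later

    static class BooleanAttributeImpl {
        private long bits; // one long holds up to 64 flags

        boolean isSet(BooleanAttributeEnum flag) {
            return (bits & (1L << flag.ordinal())) != 0;
        }
        void set(BooleanAttributeEnum flag, boolean value) {
            if (value) bits |= 1L << flag.ordinal();
            else bits &= ~(1L << flag.ordinal());
        }
        void clear() { bits = 0; } // cheap reset between tokens
    }

    public static void main(String[] args) {
        BooleanAttributeImpl attr = new BooleanAttributeImpl();
        attr.set(BooleanAttributeEnum.KEYWORD, true);
        System.out.println(attr.isSet(BooleanAttributeEnum.KEYWORD));  // true
        System.out.println(attr.isSet(BooleanAttributeEnum.STOPWORD)); // false
    }
}
```

Adding a new flag would then only mean adding an enum constant, which is the low-cost extensibility the comment argues for.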
[jira] Commented: (LUCENE-2206) integrate snowball stopword lists
[ https://issues.apache.org/jira/browse/LUCENE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801163#action_12801163 ] Simon Willnauer commented on LUCENE-2206: - Robert, patch looks good except for one thing. {code} public static HashSet<String> getSnowballWordSet(Reader reader) {code} It returns a HashSet but should really return a Set<String>. We plan to change all return types to the interface instead of the implementation. integrate snowball stopword lists - Key: LUCENE-2206 URL: https://issues.apache.org/jira/browse/LUCENE-2206 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2206.patch The snowball project creates stopword lists as well as stemmers, example: http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?view=markup This patch includes the following: * snowball stopword lists for 13 languages in contrib/snowball/resources * all stoplists are unmodified, only added license header and converted each one from whatever encoding it was in to UTF-8 * added getSnowballWordSet to WordListLoader, this is because the format of these files is very different, for example it supports multiple words per line and embedded comments. I did not add any changes to SnowballAnalyzer to actually automatically use these lists yet, I would like us to discuss this in a future issue proposing integrating snowball with contrib/analyzers.
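A rough sketch of such a loader, declared to return the `Set` interface as the review comment asks. This is not Lucene's actual WordlistLoader; it assumes the snowball stoplist format described above (multiple whitespace-separated words per line, with `|` starting a comment that runs to the end of the line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical snowball-style stoplist parser, for illustration only.
public class SnowballWordSetSketch {
    // Returns Set<String> (the interface), not HashSet<String>.
    public static Set<String> getSnowballWordSet(Reader reader) throws IOException {
        Set<String> result = new HashSet<>();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            int comment = line.indexOf('|');
            if (comment >= 0) line = line.substring(0, comment); // strip comment
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) result.add(word); // several words per line
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String stoplist = "the and  | common words\n| full-line comment\nof to\n";
        Set<String> words = getSnowballWordSet(new StringReader(stoplist));
        System.out.println(words.size()); // the, and, of, to
    }
}
```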
[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter
[ https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801181#action_12801181 ] Simon Willnauer commented on LUCENE-2212: - Nice, Robert. I was adding a test class for PorterStemFilter during LUCENE-2198 to test the KeywordAttribute; this looks very good. I wonder if we should use getResourceAsStream rather than the system property - the resources should always be on the classpath. add a test for PorterStemFilter --- Key: LUCENE-2212 URL: https://issues.apache.org/jira/browse/LUCENE-2212 Project: Lucene - Java Issue Type: Test Components: Analysis Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: LUCENE-2212.patch, porterTestData.zip There are no tests for PorterStemFilter, yet svn history reveals some (very minor) cleanups, etc. The only thing executing its code in tests is a test or two in SmartChinese tests. This patch runs the StemFilter against Martin Porter's test data set for this stemmer, checking for expected output. The zip file is 100KB added to src/test, if this is too large I can change it to download the data instead.
[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter
[ https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801243#action_12801243 ] Simon Willnauer commented on LUCENE-2212: - bq. updated patch with getResource() + ZipFile :) Thanks. bq. will commit this test at the end of the day unless anyone objects. +1, go ahead. add a test for PorterStemFilter --- Key: LUCENE-2212 URL: https://issues.apache.org/jira/browse/LUCENE-2212 Project: Lucene - Java Issue Type: Test Components: Analysis Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: LUCENE-2212.patch, LUCENE-2212.patch, porterTestData.zip
[jira] Commented: (LUCENE-2195) Speedup CharArraySet if set is empty
[ https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801252#action_12801252 ] Simon Willnauer commented on LUCENE-2195: - bq. I do not think unmodifiableset should have a no-arg ctor, so instead i pushed this up to emptychararrayset OK, I'm fine with that. {quote} i do not think emptychararrayset need override and throw uoe for removeAll or retainAll, and i don't think the tests were correct in assuming it will throw uoe. it will not throw uoe for say, removeAll only because it is empty. it will just do nothing. {quote} You are right - it should only throw that exception if the collection contains the object and the iterator does not implement remove(): {code} * Note that this implementation throws an * <tt>UnsupportedOperationException</tt> if the iterator returned by this * collection's iterator method does not implement the <tt>remove</tt> * method and this collection contains the specified object. {code} The same is true for AbstractSet#removeAll() / retainAll(). Thanks for updating it. I think this is good to go! Speedup CharArraySet if set is empty Key: LUCENE-2195 URL: https://issues.apache.org/jira/browse/LUCENE-2195 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch CharArraySet#contains(...) always creates a hash code of the String, char[] or CharSequence even if the set is empty. contains() should return false if the set is empty.
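The optimization this issue is about can be shown with a minimal sketch: bail out of `contains()` before computing any hash when the set holds no entries. This is a toy wrapper for illustration, not the real CharArraySet (which stores char[] keys directly rather than delegating to a HashSet).

```java
import java.util.HashSet;
import java.util.Set;

// Toy set with the empty-set fast path described in LUCENE-2195.
public class EmptyFastPathSet {
    private final Set<String> inner = new HashSet<>();

    public void add(String s) { inner.add(s); }

    public boolean contains(char[] text, int off, int len) {
        if (inner.isEmpty()) return false; // fast path: no hashing of text at all
        return inner.contains(new String(text, off, len)); // slow path hashes
    }

    public static void main(String[] args) {
        EmptyFastPathSet set = new EmptyFastPathSet();
        char[] buf = "foobar".toCharArray();
        System.out.println(set.contains(buf, 0, 3)); // false, via fast path
        set.add("foo");
        System.out.println(set.contains(buf, 0, 3)); // true
    }
}
```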
[jira] Updated: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2183: Attachment: LUCENE-2183.patch I updated the patch to make use of the nice reflection utils and ported all subclasses of CharTokenizer to the int-based API. Due to the addition of Version to the CharTokenizer ctors, this patch creates a lot of usage of deprecated API. Yet, I haven't changed all the usages of the deprecated ctors; this should be done in another issue, IMO. Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2183.patch, LUCENE-2183.patch
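The core of the int-based API this issue moves CharTokenizer toward can be sketched independently of Lucene: scan text as int codepoints with `Character.codePointAt`/`charCount`, so that supplementary characters (outside the BMP) are seen as one unit instead of two surrogate chars. The helper below is illustrative, not part of the patch.

```java
// Sketch of codepoint-based scanning for supplementary character support.
public class CodePointScanSketch {
    public static int countLetterCodePoints(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // full codepoint, may be > 0xFFFF
            if (Character.isLetter(cp)) letters++;
            i += Character.charCount(cp);     // advance by 1 or 2 chars
        }
        return letters;
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is a supplementary letter:
        // two Java chars, but one codepoint.
        String s = "a" + new String(Character.toChars(0x10400));
        System.out.println(s.length());               // 3 chars
        System.out.println(countLetterCodePoints(s)); // 2 letters
    }
}
```

A char-based loop over the same string would test each surrogate separately and miscount, which is exactly the behavior the patch fixes.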
[jira] Created: (LUCENE-2214) Remove deprecated StemExclusionSet setters in contrib/analyzers
Remove deprecated StemExclusionSet setters in contrib/analyzers --- Key: LUCENE-2214 URL: https://issues.apache.org/jira/browse/LUCENE-2214 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Simon Willnauer Priority: Minor Fix For: 3.1 Lots of stem exclusion sets have been deprecated in 3.0. As we are in contrib land we could now remove them.
[jira] Updated: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2183: Attachment: LUCENE-2183.patch Uwe, using an interface doesn't work, as I can not reduce the public visibility in CharTokenizer. Yet, this patch tries to solve it with an abstract class. To be honest, I would rather say we duplicate the code and use a simple boolean switch in incrementToken. Not that nice, but definitely faster. What do you think? Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch
[jira] Commented: (LUCENE-2195) Speedup CharArraySet if set is empty
[ https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800405#action_12800405 ] Simon Willnauer commented on LUCENE-2195: - Any comments on the latest patch? Speedup CharArraySet if set is empty Key: LUCENE-2195 URL: https://issues.apache.org/jira/browse/LUCENE-2195 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch
[jira] Updated: (LUCENE-2198) support protected words in Stemming TokenFilters
[ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2198: Attachment: LUCENE-2198.patch This patch contains an initial design proposal. I tried to name the new attribute a little bit more generically, as this could easily be used outside of the stemming domain. All tests pass -- comments welcome. support protected words in Stemming TokenFilters Key: LUCENE-2198 URL: https://issues.apache.org/jira/browse/LUCENE-2198 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.0 Reporter: Robert Muir Priority: Minor Attachments: LUCENE-2198.patch
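The pattern the patch introduces can be sketched in miniature: a marker filter flags tokens found in a protected-word set, and the stemmer leaves flagged tokens untouched. This toy version is an illustration only; the names `stem` and `filter` and the naive plural-stripping stemmer are invented here, while the real patch works through Lucene's KeywordAttribute on a TokenStream.

```java
import java.util.Set;

// Toy sketch of keyword-protected stemming.
public class KeywordStemSketch {
    // Stand-in for a stemming step: naive English plural stripping.
    static String stem(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }

    // Stand-in for the filter chain: skip stemming when the token is protected,
    // as if an upstream filter had set the keyword flag from a CharArraySet.
    static String filter(String term, Set<String> protectedWords) {
        boolean keyword = protectedWords.contains(term);
        return keyword ? term : stem(term);
    }

    public static void main(String[] args) {
        Set<String> protect = Set.of("Lucas"); // e.g. proper nouns
        System.out.println(filter("cats", protect));  // stemmed: cat
        System.out.println(filter("Lucas", protect)); // protected: Lucas
    }
}
```

Decoupling the "should this token be stemmed" decision from the stemming algorithm is what lets every stemmer share one exclusion mechanism instead of each reimplementing its own set.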
[jira] Commented: (LUCENE-2203) improved snowball testing
[ https://issues.apache.org/jira/browse/LUCENE-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799868#action_12799868 ] Simon Willnauer commented on LUCENE-2203: - Looks good to me. I haven't applied it yet, but +1 from my side. improved snowball testing - Key: LUCENE-2203 URL: https://issues.apache.org/jira/browse/LUCENE-2203 Project: Lucene - Java Issue Type: Test Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Attachments: LUCENE-2203.patch Snowball project has test vocabulary files for each language in their svn repository, along with expected output. We should use these tests to ensure all languages are working correctly, and it might be helpful in the future for identifying backwards breaks/changes if we ever want to upgrade snowball, etc.
[jira] Commented: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods
[ https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799387#action_12799387 ] Simon Willnauer commented on LUCENE-2188: - Good stuff, Uwe. I will fix LUCENE-2183 now. A handy utility class for tracking deprecated overridden methods Key: LUCENE-2188 URL: https://issues.apache.org/jira/browse/LUCENE-2188 Project: Lucene - Java Issue Type: New Feature Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch This issue provides a new handy utility class that keeps track of overridden deprecated methods in non-final sub classes. This class can be used in new deprecations. See the javadocs for an example.
[jira] Commented: (LUCENE-2203) improved snowball testing
[ https://issues.apache.org/jira/browse/LUCENE-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798795#action_12798795 ] Simon Willnauer commented on LUCENE-2203: - Robert, those tests seem to be very extensive - that's good! But I honestly think we should make those tests optional in some way. The files you are downloading are very large and might be an issue for some folks: the file size is over 70MB, which is a lot for a test. I need to think about this a little and come up with some suggestions. improved snowball testing - Key: LUCENE-2203 URL: https://issues.apache.org/jira/browse/LUCENE-2203 Project: Lucene - Java Issue Type: Test Components: contrib/analyzers Reporter: Robert Muir Priority: Minor Attachments: LUCENE-2203.patch
[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false
[ https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798474#action_12798474 ] Simon Willnauer commented on LUCENE-2199: - I plan to commit this today or tomorrow. Somebody volunteering to backport? Simon ShingleFilter skips over trie-shingles if outputUnigram is set to false --- Key: LUCENE-2199 URL: https://issues.apache.org/jira/browse/LUCENE-2199 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2199.patch, LUCENE-2199.patch Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa {quote} I noticed that if I set outputUnigrams to false it gives me the same output for maxShingleSize=2 and maxShingleSize=3: please divide / divide this / this sentence. When I set maxShingleSize to 4 the output is: please divide / please divide this sentence / divide this / this sentence. I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false: please divide this / divide this sentence {quote}
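The behavior the reporter expects can be sketched with a plain n-gram generator: for max shingle size N (and the default minimum of 2 when unigrams are off), every contiguous word n-gram with 2 <= n <= N should be emitted. The bug was that the size-3 shingles were skipped. This is a standalone illustration of the expected output, not ShingleFilter's actual streaming implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy shingle generator: all word n-grams with 2 <= n <= maxShingleSize.
public class ShingleSketch {
    public static List<String> shingles(String[] tokens, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder(tokens[start]);
            for (int n = 2; n <= maxShingleSize && start + n <= tokens.length; n++) {
                sb.append(' ').append(tokens[start + n - 1]); // extend by one token
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"please", "divide", "this", "sentence"};
        // With maxShingleSize=3 this includes the tri-shingles
        // "please divide this" and "divide this sentence" that the bug dropped.
        for (String s : shingles(tokens, 3)) System.out.println(s);
    }
}
```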
[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798516#action_12798516 ] Simon Willnauer commented on LUCENE-2200: - Robert, when you commit this, make sure you mark the Attributes in EdgeNGramTokenFilter.java final - thanks. Steve, thanks for the patch; such work is always appreciated. Simon Several final classes have non-overriding protected members --- Key: LUCENE-2200 URL: https://issues.apache.org/jira/browse/LUCENE-2200 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Reporter: Steven Rowe Assignee: Robert Muir Priority: Trivial Attachments: LUCENE-2200.patch, LUCENE-2200.patch Protected member access in final classes, except where a protected method overrides a superclass's protected method, makes little sense. The attached patch converts final classes' protected access on fields to private, removes two final classes' unused protected constructors, and converts one final class's protected final method to private.
[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798517#action_12798517 ] Simon Willnauer commented on LUCENE-2197: - Yonik, would you please commit this issue? I think we agreed on your solution. Simon StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet --- Key: LUCENE-2197 URL: https://issues.apache.org/jira/browse/LUCENE-2197 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Critical Fix For: 3.1 Attachments: LUCENE-2197.patch, LUCENE-2197.patch With LUCENE-2094 a new CharArraySet is created no matter what type of set is passed to StopFilter. This does not behave as documented and could introduce serious performance problems. Yet, according to the javadoc, the instance of CharArraySet should be passed to CharArraySet.copy() (which is very fast for CharArraySet instances) instead of copied via new CharArraySet().
[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false
[ https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798522#action_12798522 ] Simon Willnauer commented on LUCENE-2199: - I committed this in revision 897672 Robert, would you please backport this to 2.9 / 3.0 - thanks for the offer! simon ShingleFilter skips over trie-shingles if outputUnigram is set to false --- Key: LUCENE-2199 URL: https://issues.apache.org/jira/browse/LUCENE-2199 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2199.patch, LUCENE-2199.patch Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa {quote} I noticed that if I set outputUnigrams to false it gives me the same output for maxShingleSize=2 and maxShingleSize=3. please divide divide this this sentence when i set maxShingleSize to 4 output is: please divide please divide this sentence divide this this sentence I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false : please divide this divide this sentence {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters
[ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798388#action_12798388 ] Simon Willnauer commented on LUCENE-2198: - bq. So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment. While I read your proposal, a possibly more flexible design came to mind. We could introduce a StemAttribute that has a method public boolean stem() used by every stemmer to decide if a token should be stemmed. That way we decouple the decision whether a token should be stemmed from the stemming algorithm. This also enables custom filters to set the value for reasons other than a term being in a set. The default value is of course true but can be set on any condition. Inside an analyzer we can add a filter right before the stemmer based on a CharArraySet. Yet if the set is empty or null we simply leave the filter out. support protected words in Stemming TokenFilters Key: LUCENE-2198 URL: https://issues.apache.org/jira/browse/LUCENE-2198 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.0 Reporter: Robert Muir Priority: Minor This is from LUCENE-1515 I propose that all stemming TokenFilters have an 'exclusion set' that bypasses any stemming for words in this set. Some stemming tokenfilters have this, some do not. This would be one way for Karl to implement his new Swedish stemmer (as a text file of ignore words). Additionally, it would remove duplication between Lucene and Solr, as they reimplement SnowballFilter since it does not have this functionality. Finally, I think this is a pretty common use case, where people want to ignore things like proper nouns in the stemming. As an alternative design I considered a case where we generalized this to CharArrayMap (and ignoring words would mean mapping them to themselves), which would also provide a mechanism to override the stemming algorithm. 
But I think this is too expert, could be its own filter, and the only example of this i can find is in the Dutch stemmer. So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
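Simon's StemAttribute idea can be illustrated without Lucene's TokenStream machinery. Everything below is a hypothetical stand-in (Lucene later shipped a very similar mechanism as KeywordAttribute / KeywordMarkerFilter); the point is only that the stemmer consults a per-token flag instead of owning an exclusion set:

```java
import java.util.Set;

// Illustrative sketch of the proposed decoupling: a marker filter placed
// before the stemmer flips a per-token attribute, and the stemmer checks
// that attribute. Types here are stand-ins, not Lucene's TokenStream API.
public class StemAttributeSketch {

    // The proposed attribute: default true, any filter may flip it.
    static class Token {
        final String term;
        boolean stem = true;   // the "public boolean stem()" from the proposal
        Token(String term) { this.term = term; }
    }

    // A filter placed right before the stemmer: marks protected words.
    static void markProtected(Token t, Set<String> protectedWords) {
        if (protectedWords.contains(t.term)) t.stem = false;
    }

    // The stemmer checks the attribute instead of owning an exclusion set.
    static String stemIfAllowed(Token t) {
        if (!t.stem) return t.term;              // bypass stemming entirely
        return t.term.replaceAll("ing$", "");    // toy stemmer for the sketch
    }

    public static void main(String[] args) {
        Set<String> prot = Set.of("boeing");     // proper noun we protect
        Token running = new Token("running");
        Token boeing = new Token("boeing");
        markProtected(running, prot);
        markProtected(boeing, prot);
        System.out.println(stemIfAllowed(running)); // runn
        System.out.println(stemIfAllowed(boeing));  // boeing
    }
}
```

The design win Simon describes is visible here: the stemmer has no knowledge of the set, so any filter (or any other condition) can decide which tokens skip stemming.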
[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798389#action_12798389 ] Simon Willnauer commented on LUCENE-2200: - Steve, I briefly looked at your patch. Could we make some of the member vars final too? The reader in CharReader or the defaultAnalyzer in ShingleAnalyzerWrapper for instance. simon Several final classes have non-overriding protected members --- Key: LUCENE-2200 URL: https://issues.apache.org/jira/browse/LUCENE-2200 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Reporter: Steven Rowe Priority: Trivial Attachments: LUCENE-2200.patch Protected member access in final classes, except where a protected method overrides a superclass's protected method, makes little sense. The attached patch converts final classes' protected access on fields to private, removes two final classes' unused protected constructors, and converts one final class's protected final method to private. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797952#action_12797952 ] Simon Willnauer commented on LUCENE-2197: - bq. Here's a patch that reverts to the previous behavior of using the set provided. Doesn't seem to lead anywhere to discuss with the performance police when I look at the average size of your comments. :) This was actually meant to be a pattern for analyzer subclasses so I won't be the immutability police here. Yonik, will you take this issue and commit?! bq. We should really avoid this type of nannyism in Lucene. oh well this seems to me like a void * is / isn't evil discussion - nevermind. StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet --- Key: LUCENE-2197 URL: https://issues.apache.org/jira/browse/LUCENE-2197 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Critical Fix For: 3.1 Attachments: LUCENE-2197.patch, LUCENE-2197.patch With LUCENE-2094 a new CharArraySet is created no matter what type of set is passed to StopFilter. This does not behave as documented and could introduce serious performance problems. Yet, according to the javadoc, the instance of CharArraySet should be passed to CharArraySet.copy (which is very fast for CharArraySet instances) instead of copied via new CharArraySet() -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1967) make it easier to access default stopwords for language analyzers
[ https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer closed LUCENE-1967. --- Resolution: Fixed incorporated in LUCENE-2034 make it easier to access default stopwords for language analyzers - Key: LUCENE-1967 URL: https://issues.apache.org/jira/browse/LUCENE-1967 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Assignee: Simon Willnauer Priority: Minor DM Smith made the following comment: (sometimes it is hard to dig out the stop set from the analyzers) Looking around, some of these analyzers have very different ways of storing the default list. One idea is to consider generalizing something like what Simon did with LUCENE-1965, LUCENE-1962, and having all stopwords lists stored as .txt files in resources folder. {code} /** * Returns an unmodifiable instance of the default stop-words set. * @return an unmodifiable instance of the default stop-words set. */ public static Set<String> getDefaultStopSet() {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
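The accessor in the {code} fragment above can be sketched as follows; the class name and the word list are made up for illustration, only the unmodifiable-getter pattern is the point:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the uniform accessor the issue proposes: each analyzer exposes
// its default stopword list through one unmodifiable getter. Class and
// field names are illustrative, not Lucene's.
public class DefaultStopSetSketch {

    private static final Set<String> DEFAULT_STOP_SET;
    static {
        Set<String> s = new HashSet<>();
        Collections.addAll(s, "a", "an", "the");   // stand-in word list
        DEFAULT_STOP_SET = Collections.unmodifiableSet(s);
    }

    /** Returns an unmodifiable instance of the default stop-words set. */
    public static Set<String> getDefaultStopSet() {
        return DEFAULT_STOP_SET;
    }

    public static void main(String[] args) {
        System.out.println(getDefaultStopSet().contains("the")); // true
        try {
            getDefaultStopSet().add("oops");       // callers cannot mutate it
        } catch (UnsupportedOperationException expected) {
            System.out.println("unmodifiable");
        }
    }
}
```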
[jira] Created: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false
ShingleFilter skips over trie-shingles if outputUnigram is set to false --- Key: LUCENE-2199 URL: https://issues.apache.org/jira/browse/LUCENE-2199 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 3.0, 2.9.1, 2.9, 2.4.1, 2.4 Reporter: Simon Willnauer Fix For: 3.1 Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa {quote} I noticed that if I set outputUnigrams to false it gives me the same output for maxShingleSize=2 and maxShingleSize=3. please divide divide this this sentence when i set maxShingleSize to 4 output is: please divide please divide this sentence divide this this sentence I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false : please divide this divide this sentence {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false
[ https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2199: Attachment: LUCENE-2199.patch This patch adds test for trigram and fourgram with and without outputUnigram. All tests pass ShingleFilter skips over trie-shingles if outputUnigram is set to false --- Key: LUCENE-2199 URL: https://issues.apache.org/jira/browse/LUCENE-2199 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2199.patch Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa {quote} I noticed that if I set outputUnigrams to false it gives me the same output for maxShingleSize=2 and maxShingleSize=3. please divide divide this this sentence when i set maxShingleSize to 4 output is: please divide please divide this sentence divide this this sentence I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false : please divide this divide this sentence {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false
[ https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798162#action_12798162 ] Simon Willnauer commented on LUCENE-2199: - We should likely backport this to 2.9 / 3.0 too ShingleFilter skips over trie-shingles if outputUnigram is set to false --- Key: LUCENE-2199 URL: https://issues.apache.org/jira/browse/LUCENE-2199 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2199.patch Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa {quote} I noticed that if I set outputUnigrams to false it gives me the same output for maxShingleSize=2 and maxShingleSize=3. please divide divide this this sentence when i set maxShingleSize to 4 output is: please divide please divide this sentence divide this this sentence I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false : please divide this divide this sentence {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
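For reference, one plausible reading of the token stream the reporter expected can be reproduced with a plain sliding-window sketch. This is not ShingleFilter's code, just an executable statement of the expected output (bigrams and trigrams, no unigrams, for maxShingleSize=3):

```java
import java.util.ArrayList;
import java.util.List;

// Plain sliding-window sketch of shingle generation: for each start
// position, emit every shingle of 2..maxSize tokens, optionally preceded
// by the unigram itself. Not ShingleFilter's implementation.
public class ShingleSketch {

    static List<String> shingles(List<String> tokens, int maxSize, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            if (outputUnigrams) out.add(tokens.get(start));
            StringBuilder sb = new StringBuilder(tokens.get(start));
            for (int end = start + 1; end < tokens.size() && end - start < maxSize; end++) {
                sb.append(' ').append(tokens.get(end));
                out.add(sb.toString());            // shingle of size end-start+1
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("please", "divide", "this", "sentence");
        // maxShingleSize=3, outputUnigrams=false
        System.out.println(shingles(tokens, 3, false));
    }
}
```

The reported bug was precisely that the filter emitted only a subset of this stream when outputUnigrams was false.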
[jira] Assigned: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false
[ https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2199: --- Assignee: Simon Willnauer ShingleFilter skips over trie-shingles if outputUnigram is set to false --- Key: LUCENE-2199 URL: https://issues.apache.org/jira/browse/LUCENE-2199 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2199.patch, LUCENE-2199.patch Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa {quote} I noticed that if I set outputUnigrams to false it gives me the same output for maxShingleSize=2 and maxShingleSize=3. please divide divide this this sentence when i set maxShingleSize to 4 output is: please divide please divide this sentence divide this this sentence I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false : please divide this divide this sentence {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798189#action_12798189 ] Simon Willnauer commented on LUCENE-2197: - bq. Sorry Simon... I think I just got fed up with stuff like this in the JDK over the years (that forces people to write their own implementations for best performance), and you happened to be the closest person at the time :) no worries, thanks for the reply! bq. To the software pedant, that's not safe and would probably be called bad design - ... I understand and I can totally see your point. I was kind of separated due to the kind of short rants (don't get me wrong). I agree with you that we should not do that in a filter as this constructor could be called very very frequently especially if an analyzer does not implement reusableTokenStream. I would still argue that for an analyzer this is a different story and I would want to keep the code in analyzers copying the set. Classes, instantiated so frequently as filters should not introduce possible bottlenecks while analyzers are usually shared that won't be much of a hassle - any performance police issues with this? :) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet --- Key: LUCENE-2197 URL: https://issues.apache.org/jira/browse/LUCENE-2197 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Critical Fix For: 3.1 Attachments: LUCENE-2197.patch, LUCENE-2197.patch With LUCENE-2094 a new CharArraySet is created no matter what type of set is passed to StopFilter. This does not behave as documented and could introduce serious performance problems. Yet, according to the javadoc, the instance of CharArraySet should be passed to CharArraySet.copy (which is very fast for CharArraySet instances) instead of copied via new CharArraySet() -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2195) Speedup CharArraySet if set is empty
Speedup CharArraySet if set is empty Key: LUCENE-2195 URL: https://issues.apache.org/jira/browse/LUCENE-2195 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 CharArraySet#contains(...) always computes a hash code of the String, char[] or CharSequence even if the set is empty. contains should return false if the set is empty -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty
[ https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2195: Attachment: LUCENE-2195.patch here is a patch Speedup CharArraySet if set is empty Key: LUCENE-2195 URL: https://issues.apache.org/jira/browse/LUCENE-2195 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2195.patch CharArraySet#contains(...) always computes a hash code of the String, char[] or CharSequence even if the set is empty. contains should return false if the set is empty -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty
[ https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2195: Attachment: LUCENE-2195.patch Updated patch. This patch does not do the count==0 check in contains(Object), as o.toString() could return null and the NPE would be silently swallowed if the set is empty. The null check and NPE are necessary to yield consistent behavior no matter if the set is empty or not. Speedup CharArraySet if set is empty Key: LUCENE-2195 URL: https://issues.apache.org/jira/browse/LUCENE-2195 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2195.patch, LUCENE-2195.patch CharArraySet#contains(...) always computes a hash code of the String, char[] or CharSequence even if the set is empty. contains should return false if the set is empty -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
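The trade-off described in the patch comment can be shown in miniature: null-check first (so an NPE is never swallowed), then return false immediately for the empty set, skipping hashing entirely. The contains(Object) subtlety, where o.toString() itself may return null, is why that one overload keeps the old order. This is a self-contained sketch, not CharArraySet's actual code:

```java
// Self-contained miniature of the optimization: contains(...) null-checks
// first for behavior consistent with a non-empty set, then short-circuits
// for the empty set before paying for toString()/hashing. The backing
// array is a stand-in for CharArraySet's hash table.
public class EmptyFastPathSketch {

    private final String[] entries;

    public EmptyFastPathSketch(String... entries) { this.entries = entries; }

    public boolean contains(CharSequence cs) {
        if (cs == null) throw new NullPointerException(); // consistent NPE behavior
        if (entries.length == 0) return false;            // fast path: no hashing
        String probe = cs.toString();                     // only pay this when non-empty
        for (String e : entries) {                        // stand-in for a hash lookup
            if (e.equals(probe)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        EmptyFastPathSketch empty = new EmptyFastPathSketch();
        EmptyFastPathSketch stops = new EmptyFastPathSketch("the", "a");
        System.out.println(empty.contains("the")); // false, no hashing done
        System.out.println(stops.contains("the")); // true
    }
}
```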
[jira] Updated: (LUCENE-2196) Spellchecker should implement java.io.Closeable
[ https://issues.apache.org/jira/browse/LUCENE-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2196: Attachment: LUCENE-2196.patch Spellchecker should implement java.io.Closeable -- Key: LUCENE-2196 URL: https://issues.apache.org/jira/browse/LUCENE-2196 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2196.patch As most of the Lucene classes implement Closeable (e.g. IndexWriter), Spellchecker should do so too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2196) Spellchecker should implement java.io.Closeable
Spellchecker should implement java.io.Closeable -- Key: LUCENE-2196 URL: https://issues.apache.org/jira/browse/LUCENE-2196 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2196.patch As most of the Lucene classes implement Closeable (e.g. IndexWriter), Spellchecker should do so too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally
[ https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797697#action_12797697 ] Simon Willnauer commented on LUCENE-2108: - Created sep. issue for that purpose LUCENE-2196 SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally - Key: LUCENE-2108 URL: https://issues.apache.org/jira/browse/LUCENE-2108 Project: Lucene - Java Issue Type: Bug Components: contrib/spellchecker Affects Versions: 3.0 Reporter: Eirik Bjorsnos Assignee: Simon Willnauer Fix For: 3.0.1, 3.1 Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108_Lucene_2_9_branch.patch, LUCENE-2108_test_java14.patch I can't find any way to close the IndexSearcher (and IndexReader) that is being used by SpellChecker internally. I've worked around this issue by keeping a single SpellChecker open for each index, but I'd really like to be able to close it and reopen it on demand without leaking file descriptors. Could we add a close() method to SpellChecker that will close the IndexSearcher and null the reference to it? And perhaps add some code that reopens the searcher if the reference to it is null? Or would that break thread safety of SpellChecker? The attached patch adds a close method but leaves it to the user to call setSpellIndex to reopen the searcher if desired. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2196) Spellchecker should implement java.io.Closeable
[ https://issues.apache.org/jira/browse/LUCENE-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2196. - Resolution: Fixed Fix Version/s: 3.0.1 committed in revision 896934, thanks Uwe Spellchecker should implement java.io.Closeable -- Key: LUCENE-2196 URL: https://issues.apache.org/jira/browse/LUCENE-2196 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Reporter: Simon Willnauer Fix For: 3.0.1, 3.1 Attachments: LUCENE-2196.patch As most of the Lucene classes implement Closeable (e.g. IndexWriter), Spellchecker should do so too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
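The shape of the requested change can be sketched as below. The class and its internals are stand-ins (the real SpellChecker holds an IndexSearcher whose close() releases file descriptors, see LUCENE-2108); only the close()/ensureOpen() pattern is the point:

```java
import java.io.Closeable;

// Stand-in sketch of a spell checker that implements java.io.Closeable:
// close() releases the internal searcher and ensureOpen() guards later
// calls. Not Lucene's SpellChecker; names and logic are illustrative.
public class CloseableSpellCheckerSketch implements Closeable {

    private Object searcher = new Object();   // stand-in for the internal IndexSearcher

    public String suggestSimilar(String word) {
        ensureOpen();
        return word;                          // toy suggestion logic for the sketch
    }

    private void ensureOpen() {
        if (searcher == null) throw new IllegalStateException("Spellchecker has been closed");
    }

    @Override
    public void close() {
        searcher = null;                      // the real class would close the searcher here
    }

    public static void main(String[] args) {
        CloseableSpellCheckerSketch sc = new CloseableSpellCheckerSketch();
        System.out.println(sc.suggestSimilar("lucene"));
        sc.close();
        try {
            sc.suggestSimilar("lucene");
        } catch (IllegalStateException expected) {
            System.out.println("closed");
        }
    }
}
```

Implementing Closeable (rather than just adding an ad-hoc close()) also lets callers manage the checker with the same idioms they already use for IndexWriter and friends.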
[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty
[ https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2195: Attachment: LUCENE-2195.patch I changed my patch to please Yonik, who has performance concerns, as well as Robert, who wants to use EMPTY_SET instead of set == null checks. I agree with Robert that I would rather have an empty set instead of null assigned if the set is omitted or if the default set is empty. Yet, I subclassed UnmodifiableCharArraySet and added a specialized implementation for EMPTY_SET that checks for null to throw the NPE and otherwise always returns false for all contains(...) methods. This class is final and the overhead for the method call should be very small. Speedup CharArraySet if set is empty Key: LUCENE-2195 URL: https://issues.apache.org/jira/browse/LUCENE-2195 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch CharArraySet#contains(...) always computes a hash code of the String, char[] or CharSequence even if the set is empty. contains should return false if the set is empty -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
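The specialized EMPTY_SET subclass described in the comment boils down to a final singleton whose contains(...) still null-checks (so callers see the same NPE behavior as a non-empty set) but then answers false without computing any hash. A stand-in sketch, not Lucene's class:

```java
// Stand-in sketch of the EMPTY_SET specialization: a final singleton with a
// trivially cheap contains(...). Not Lucene's UnmodifiableCharArraySet.
public final class EmptySetSingletonSketch {

    public static final EmptySetSingletonSketch EMPTY_SET = new EmptySetSingletonSketch();

    private EmptySetSingletonSketch() {}   // singleton: no other instances

    public boolean contains(CharSequence cs) {
        if (cs == null) throw new NullPointerException(); // same NPE contract
        return false;                      // nothing to hash, nothing to find
    }

    public static void main(String[] args) {
        System.out.println(EMPTY_SET.contains("anything")); // false
    }
}
```

Being final, the class gives the JIT a monomorphic call site, which is why the per-call overhead Simon mentions should stay negligible.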
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797758#action_12797758 ] Simon Willnauer commented on LUCENE-2094: - Hi Yonik, bq. It looks like it was committed as part of this issue, but I can't find any comments here about either the need to make a copy or the need to make an unmodifiable set. I'll try to help you reconstruct the whole thing a bit. UnmodifiableCharArraySet was introduced with LUCENE-1688, as far as I recall, to replace the static string array (stopwords) in StopAnalyzer. During the refactoring / improvements in contrib/analyzers we decided to make analyzers and tokenfilters immutable and use CharArraySet wherever we can. To prevent a provided set from being modified while it is in use in a filter, the given set is copied and wrapped in an immutable instance of CharArraySet. At the same time (still ongoing) we try to convert every set which is likely to be used in a TokenFilter into a CharArraySet. WordlistLoader is not done yet but on the list; the plan is to change the return values from HashSet<String> into Set<String> and create CharArraySet instances internally. With LUCENE-2034 we introduced StopwordAnalyzerBase, which also uses the UnmodifiableCharArraySet with a copy of the given set. The copy of a CharArraySet is very fast even for large sets, and the creation of an UnmodifiableCharArraySet from a CharArraySet instance is basically just an object creation. The background is, again, to prevent any modification to those sets while they are in use. bq. This new behavior also no longer matches the javadoc for the constructor. I agree we should adjust the javadoc for ctors expecting stopwords to reflect the behavior. 
Prepare CharArraySet for Unicode 4.0 Key: LUCENE-2094 URL: https://issues.apache.org/jira/browse/LUCENE-2094 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.0 Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt CharArraySet does lowercasing if created with the corresponding flag. As a result, a String / char[] with Unicode 4.0 chars which is in the set can not be retrieved in ignore-case mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
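The copy-then-wrap idiom Simon describes (an unmodifiable view over a defensively copied set) can be sketched with plain collections. Names are illustrative; Lucene's UnmodifiableCharArraySet wraps CharArraySet specifically:

```java
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.Set;

// Stand-in sketch of an unmodifiable view: reads delegate to the wrapped
// set, every mutation path throws. Creating the view is just one object
// allocation, which is why wrapping is cheap.
public class UnmodifiableSetSketch<E> extends AbstractSet<E> {

    private final Set<E> delegate;

    public UnmodifiableSetSketch(Set<E> delegate) { this.delegate = delegate; }

    @Override public Iterator<E> iterator() {
        Iterator<E> it = delegate.iterator();
        return new Iterator<E>() {                 // hide the delegate's remove()
            public boolean hasNext() { return it.hasNext(); }
            public E next() { return it.next(); }
        };
    }
    @Override public int size() { return delegate.size(); }
    @Override public boolean contains(Object o) { return delegate.contains(o); }
    @Override public boolean add(E e) { throw new UnsupportedOperationException(); }

    public static void main(String[] args) {
        Set<String> stops = new java.util.HashSet<>();
        stops.add("the");
        UnmodifiableSetSketch<String> view = new UnmodifiableSetSketch<>(stops);
        System.out.println(view.contains("the"));  // reads go to the delegate
        try {
            view.add("oops");
        } catch (UnsupportedOperationException expected) {
            System.out.println("unmodifiable");
        }
    }
}
```

The defensive copy taken before wrapping is what guarantees the filter's view can never change underneath it, even if the caller keeps mutating the original set.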
[jira] Commented: (LUCENE-2194) improve efficiency of snowballfilter
[ https://issues.apache.org/jira/browse/LUCENE-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797762#action_12797762 ] Simon Willnauer commented on LUCENE-2194: - looks good robert. Nice improvement. improve efficiency of snowballfilter Key: LUCENE-2194 URL: https://issues.apache.org/jira/browse/LUCENE-2194 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: LUCENE-2194.patch snowball stemming currently creates 2 new strings and 1 new stringbuilder for every word. all of this is unnecessary, so don't do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797772#action_12797772 ] Simon Willnauer commented on LUCENE-2094: - bq. Simon, I think yonik refers to this code in stopfilter itself: I see, the problem with this piece of code is that it has the case-insensitive flag, which would be ignored if I did not create such a set. As far as I can see, even previous versions did not really do what the javadoc says. {code} if (stopWords instanceof CharArraySet) { this.stopWords = (CharArraySet)stopWords; } else { this.stopWords = new CharArraySet(stopWords.size(), ignoreCase); this.stopWords.addAll(stopWords); } {code} I agree we should prevent this costly operation, but it doesn't seem to be easy. My first impression is to deprecate the ctors which have the ignoreCase boolean and fix the documentation to use CharArraySet if case should be ignored. At the same time we should introduce a getter for CharArraySet and only create a new set if the given boolean and the ignoreCase member in CharArraySet do not match, provided it is an instance of CharArraySet. Prepare CharArraySet for Unicode 4.0 Key: LUCENE-2094 URL: https://issues.apache.org/jira/browse/LUCENE-2094 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.0 Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt CharArraySet does lowercasing if created with the corresponding flag. As a result, a String / char[] with Unicode 4.0 chars which is in the set can not be retrieved in ignore-case mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797772#action_12797772 ] Simon Willnauer edited comment on LUCENE-2094 at 1/7/10 7:53 PM: - bq. Simon, I think yonik refers to this code in stopfilter itself: I see, the problem with this piece of code is that it has the caseinsensitive flag which would be ignored if I would not create such a set though. As far as I can see even previous version did not really do what the javadoc says. {code} if (stopWords instanceof CharArraySet) { this.stopWords = (CharArraySet)stopWords; } else { this.stopWords = new CharArraySet(stopWords.size(), ignoreCase); this.stopWords.addAll(stopWords); } {code} I agree we should prevent this costly operation but it doesn't seem to be easy though. My first impression is to deprecate the ctors which have the ignorecase boolean and fix documentation to use charArraySet if case should be ignored. At the same time we should introduce a getter to charArraySet and only create a new set if the boolean given and the ignorecase member in CharArraySet does not match, provided it is an instance of charArraySet. This should also be backported to 2.9 / 3.0 to enable solr to at least fix things where possible. was (Author: simonw): bq. Simon, I think yonik refers to this code in stopfilter itself: I see, the problem with this piece of code is that it has the caseinsensitive flag which would be ignored if I would not create such a set though. As far as I can see even previous version did not really do what the javadoc says. {code} if (stopWords instanceof CharArraySet) { this.stopWords = (CharArraySet)stopWords; } else { this.stopWords = new CharArraySet(stopWords.size(), ignoreCase); this.stopWords.addAll(stopWords); } {code} I agree we should prevent this costly operation but it doesn't seem to be easy though. 
My first impression is to deprecate the ctors which take the ignoreCase boolean and fix the documentation to use CharArraySet if case should be ignored. At the same time we should introduce a getter on CharArraySet and only create a new set if the given boolean and the ignoreCase member in CharArraySet do not match, provided it is an instance of CharArraySet. Prepare CharArraySet for Unicode 4.0 Key: LUCENE-2094 URL: https://issues.apache.org/jira/browse/LUCENE-2094 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.0 Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt CharArraySet does lowercasing if created with the corresponding flag. This causes String / char[] entries with Unicode 4.0 characters which are in the set to not be retrievable in ignore-case mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797772#action_12797772 ] Simon Willnauer edited comment on LUCENE-2094 at 1/7/10 8:07 PM: - bq. Simon, I think yonik refers to this code in stopfilter itself: Thank god jira lets me edit my comments :) My X60 was too small to spot the comment about CharArraySet and ignoreCase. This is absolutely true - this issue introduced this change, and it should 100% use CharArraySet.copy instead of constructing a new CharArraySet. I will create a new issue and upload a patch. was (Author: simonw): bq. Simon, I think yonik refers to this code in stopfilter itself: I see, the problem with this piece of code is that it has the case-insensitive flag, which would be ignored if I did not create such a set. As far as I can see, even previous versions did not really do what the javadoc says. {code} if (stopWords instanceof CharArraySet) { this.stopWords = (CharArraySet)stopWords; } else { this.stopWords = new CharArraySet(stopWords.size(), ignoreCase); this.stopWords.addAll(stopWords); } {code} I agree we should prevent this costly operation, but it doesn't seem to be easy. My first impression is to deprecate the ctors which take the ignoreCase boolean and fix the documentation to use CharArraySet if case should be ignored. At the same time we should introduce a getter on CharArraySet and only create a new set if the given boolean and the ignoreCase member in CharArraySet do not match, provided it is an instance of CharArraySet. This should also be backported to 2.9 / 3.0 to enable Solr to at least fix things where possible.
Prepare CharArraySet for Unicode 4.0 Key: LUCENE-2094 URL: https://issues.apache.org/jira/browse/LUCENE-2094 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.0 Reporter: Simon Willnauer Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt CharArraySet does lowercasing if created with the corresponding flag. This causes String / char[] entries with Unicode 4.0 characters which are in the set to not be retrievable in ignore-case mode.
[jira] Created: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet
StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet --- Key: LUCENE-2197 URL: https://issues.apache.org/jira/browse/LUCENE-2197 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Critical Fix For: 3.1 With LUCENE-2094 a new CharArraySet is created no matter what type of set is passed to StopFilter. This does not behave as documented and could introduce serious performance problems. Instead, according to the javadoc, a CharArraySet instance should be passed to CharArraySet.copy (which is very fast for CharArraySet instances) rather than being copied via new CharArraySet().
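The reuse-instead-of-rebuild behavior described in this issue can be sketched with a simplified stand-in class. Note this is illustrative only: `CharSet` and its `copy(Set, boolean)` signature are hypothetical, not Lucene's actual `CharArraySet` API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the fast path: only rebuild the set
// when the input is not already of the desired type and configuration.
class CharSet extends HashSet<String> {
    final boolean ignoreCase;

    CharSet(boolean ignoreCase) { this.ignoreCase = ignoreCase; }

    // Mirrors the idea behind CharArraySet.copy: cheap for matching
    // CharSet inputs, a full (lowercasing) rebuild otherwise.
    static CharSet copy(Set<String> input, boolean ignoreCase) {
        if (input instanceof CharSet && ((CharSet) input).ignoreCase == ignoreCase) {
            return (CharSet) input; // fast path: no per-element copy
        }
        CharSet result = new CharSet(ignoreCase);
        for (String s : input) result.add(ignoreCase ? s.toLowerCase() : s);
        return result;
    }
}
```

A StopFilter-like consumer would then call `CharSet.copy(stopWords, ignoreCase)` unconditionally and still avoid the costly rebuild for sets that already match.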
[jira] Updated: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2197: Attachment: LUCENE-2197.patch StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet --- Key: LUCENE-2197 URL: https://issues.apache.org/jira/browse/LUCENE-2197 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Simon Willnauer Priority: Critical Fix For: 3.1 Attachments: LUCENE-2197.patch With LUCENE-2094 a new CharArraySet is created no matter what type of set is passed to StopFilter. This does not behave as documented and could introduce serious performance problems. Instead, according to the javadoc, a CharArraySet instance should be passed to CharArraySet.copy (which is very fast for CharArraySet instances) rather than being copied via new CharArraySet().
[jira] Resolved: (LUCENE-2147) Improve Spatial Utility like classes
[ https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-2147. - Resolution: Fixed Fix Version/s: 3.1 Committed in revision 896240. Thanks, Chris. Improve Spatial Utility like classes Key: LUCENE-2147 URL: https://issues.apache.org/jira/browse/LUCENE-2147 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Assignee: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch - DistanceUnits can be improved by giving functionality to the enum, such as being able to convert between different units, and adding tests. - GeoHashUtils can be improved through some code tidying, documentation, and tests. - SpatialConstants allows us to move all constants, such as the radii and circumferences of Earth, to a single consistent location that we can then use throughout the contrib. This also allows us to improve the transparency of calculations done in the contrib, as users of the contrib can easily see the values being used. Currently this issue does not migrate classes to use these constants; that will happen in issues related to the appropriate classes.
[jira] Commented: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods
[ https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796910#action_12796910 ] Simon Willnauer commented on LUCENE-2188: - Uwe, I'm not sure if I have a really good replacement for your names; none of the following suggestions seem to be a 100% match though. For getOverrideDistance() you could call it: * getDefinitionDistanceFrom(Class) * getImplementationDistanceFrom(Class) The term distance is fine IMO; I would rather extend the javadoc a little and explain that this is the distance between the given class and the next class implementing the method on the path from the given class to the base class where the method was initially declared / defined. For isOverriddenBy() you could call it: * isDefinedBy() * isImplementedBy() I also want to mention an option for the class name: VirtualMethod pretty much matches what this class represents. :) A handy utility class for tracking deprecated overridden methods Key: LUCENE-2188 URL: https://issues.apache.org/jira/browse/LUCENE-2188 Project: Lucene - Java Issue Type: New Feature Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch This issue provides a new handy utility class that keeps track of overridden deprecated methods in non-final subclasses. This class can be used in new deprecations. See the javadocs for an example.
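The "distance" notion discussed in this comment can be illustrated with a small reflection-based sketch: count how many steps up the class hierarchy you must walk from a given class to the nearest class that declares the method. This is hypothetical code, not Lucene's actual VirtualMethod implementation.

```java
// Hypothetical sketch of an "override distance": 0 means the class itself
// declares the method, 1 means its direct superclass does, and so on.
class OverrideDistance {
    static int distance(Class<?> clazz, String name, Class<?>... params) {
        int d = 0;
        for (Class<?> c = clazz; c != null; c = c.getSuperclass(), d++) {
            try {
                c.getDeclaredMethod(name, params);
                return d; // method is (re)declared here
            } catch (NoSuchMethodException ignored) {
                // keep walking up the hierarchy
            }
        }
        return -1; // not declared anywhere in the hierarchy
    }

    // small hierarchy for demonstration
    static class Base { void foo() {} }
    static class Mid extends Base {}
    static class Leaf extends Mid { @Override void foo() {} }
}
```

Under this sketch, `distance(Leaf.class, "foo")` is 0 (Leaf overrides it) while `distance(Mid.class, "foo")` is 1 (inherited from Base), which matches the "distance between the given class and the next class implementing the method" wording above.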
[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795974#action_12795974 ] Simon Willnauer commented on LUCENE-2183: - bq. This issue is blocked by: ... I give up... Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2183.patch CharTokenizer is an abstract base class for all Tokenizers operating on a character level. Yet, those tokenizers still use char primitives instead of int codepoints. CharTokenizer should operate on codepoints and preserve bw compatibility.
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795882#action_12795882 ] Simon Willnauer commented on LUCENE-2034: - Robert, I see what you are alluding to. I agree this is a new issue and should be handled separately. The issue would require some changes to the API, I guess, or rather additions. Yet, we should commit this regardless! I would be happy to make additions to StopwordAnalyzerBase in another issue; as long as we haven't released this code we can still change the API, though I don't think we have to. #getStopwordSet will always return the set in use, while setting the stopword set depending on the version is internal to the class. Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors - Key: LUCENE-2034 URL: https://issues.apache.org/jira/browse/LUCENE-2034 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Simon Willnauer Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt Due to the various tokenStream APIs we had in Lucene, analyzer subclasses need to implement at least one of the methods returning a TokenStream. When you look at the code, it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defines the same inner class (SavedStreams), which is unnecessary. In contrib almost every analyzer uses stopwords, and each of them creates its own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc. Those ctors should be deprecated and eventually removed. -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2147) Improve Spatial Utility like classes
[ https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795892#action_12795892 ] Simon Willnauer commented on LUCENE-2147: - {quote} I'd say that we remove the flux warnings, but instead put a note in the top level that since this is a contrib module, it will not adhere to Lucene core's strict back compat. policy. {quote} That sounds good; I will put it into a package.html doc and will also add a readme to the project itself. I think this issue is good to go. I will commit this in a few days if nobody objects. Improve Spatial Utility like classes Key: LUCENE-2147 URL: https://issues.apache.org/jira/browse/LUCENE-2147 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Assignee: Simon Willnauer Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch - DistanceUnits can be improved by giving functionality to the enum, such as being able to convert between different units, and adding tests. - GeoHashUtils can be improved through some code tidying, documentation, and tests. - SpatialConstants allows us to move all constants, such as the radii and circumferences of Earth, to a single consistent location that we can then use throughout the contrib. This also allows us to improve the transparency of calculations done in the contrib, as users of the contrib can easily see the values being used. Currently this issue does not migrate classes to use these constants; that will happen in issues related to the appropriate classes.
[jira] Commented: (LUCENE-2147) Improve Spatial Utility like classes
[ https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795277#action_12795277 ] Simon Willnauer commented on LUCENE-2147: - Since this is the first issue which comes close to being committed, some questions arise from my side about whether we should mark the new API as experimental like the function API in o.a.l.s.function. I think it would make sense to keep a warning that contrib/spatial might slightly change in the future. On the other hand we should try to put more confidence into contrib/spatial for more user acceptance. I currently work for customers that moved away from spatial due to its early stage and flux warnings, which is quite understandable. I would like to hear other opinions regarding this topic - especially opinions of more experienced committers would be appreciated. Improve Spatial Utility like classes Key: LUCENE-2147 URL: https://issues.apache.org/jira/browse/LUCENE-2147 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Assignee: Simon Willnauer Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch - DistanceUnits can be improved by giving functionality to the enum, such as being able to convert between different units, and adding tests. - GeoHashUtils can be improved through some code tidying, documentation, and tests. - SpatialConstants allows us to move all constants, such as the radii and circumferences of Earth, to a single consistent location that we can then use throughout the contrib. This also allows us to improve the transparency of calculations done in the contrib, as users of the contrib can easily see the values being used. Currently this issue does not migrate classes to use these constants; that will happen in issues related to the appropriate classes.
[jira] Assigned: (LUCENE-2173) Simplify and tidy Cartesian Tier Code in Spatial
[ https://issues.apache.org/jira/browse/LUCENE-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2173: --- Assignee: Simon Willnauer Simplify and tidy Cartesian Tier Code in Spatial Key: LUCENE-2173 URL: https://issues.apache.org/jira/browse/LUCENE-2173 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Assignee: Simon Willnauer Attachments: LUCENE-2173.patch, LUCENE-2173.patch, LUCENE-2173.patch The Cartesian Tier filtering code in the spatial code can be simplified, tidied and generally improved. Improvements include removing default field name support which isn't the responsibility of the code, adding javadoc, making method names more intuitive and trying to make the complex code in CartesianPolyFilterBuilder more understandable. A few deprecations have to occur as part of this work, but some public methods in CartesianPolyFilterBuilder will be made private where possible so future improvements of this class can occur.
[jira] Assigned: (LUCENE-2174) Add new SpatialFilter and DistanceFieldComparatorSource to Spatial
[ https://issues.apache.org/jira/browse/LUCENE-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2174: --- Assignee: Simon Willnauer Add new SpatialFilter and DistanceFieldComparatorSource to Spatial -- Key: LUCENE-2174 URL: https://issues.apache.org/jira/browse/LUCENE-2174 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Assignee: Simon Willnauer Attachments: LUCENE-2174.patch The current DistanceQueryBuilder and DistanceFieldComparatorSource in Spatial are based on the old filtering process, most of which has been deprecated in previous issues. These will be replaced by a new SpatialFilter class, which is a proper Lucene filter, and a new DistanceFieldComparatorSource which will be relocated and will use the new DistanceFilter interface.
[jira] Commented: (LUCENE-2152) Abstract Spatial distance filtering process and supported field formats
[ https://issues.apache.org/jira/browse/LUCENE-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795282#action_12795282 ] Simon Willnauer commented on LUCENE-2152: - Chris, indeed this is a tricky one. One problem related to the map used for distance caching arises when you want to use spatial with a filter and sort in contrib/remote. At least in the current code (not your patch - haven't looked at it yet) the sort instance is obtained from the filter and depends on the map instance filled by the filter. After serialization the instance disappears and sort doesn't work anymore on the remote side. If we could decouple the distance storage from the filter implementation, we could also come up with a solution for the sorting problem, like providing a remote collector that has a key-value lookup function (internally) that can be used by the sort function to look up the calculated values. I personally would go one step further and introduce an exchangeable distance calculation function in the first step and a collector in the second. It is even possible to introduce a delegation approach like the following example: {code} DistanceFunction func = new MapCachingDistFunc(new DefaultDistanceFunc(new CustomFieldDecoder())); for docId in docs: if func(docId, reader, point) <= dist: bitSet.set(docId) {code} That way we could completely separate the problem into a function interface / abstract class and provide several implementations. It would also be possible to solve our problem with sorting, where we can pass a special RemoteDistanceFunction to both the sort and filter impl. I don't know how it would look in the impl though. Maybe we can even use this function interface in the CustomScoreQuery as well.
Just some random ideas. Abstract Spatial distance filtering process and supported field formats --- Key: LUCENE-2152 URL: https://issues.apache.org/jira/browse/LUCENE-2152 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Attachments: LUCENE-2152.patch, LUCENE-2152.patch, LUCENE-2152.patch Currently the second stage of the filtering process in the spatial contrib involves calculating the exact distance for the remaining documents, and filtering out those that fall out of the search radius. Currently this is done through the 2 impls of DistanceFilter, LatLngDistanceFilter and GeoHashDistanceFilter. The main difference between these 2 impls is the format of data they support, the former supporting lat/lngs being stored in 2 distinct fields, while the latter supports geohashed lat/lngs through the GeoHashUtils. This difference should be abstracted out so that the distance filtering process is data format agnostic. The second issue is that the distance filtering algorithm can be considerably optimized by using multiple threads. Therefore it makes sense to have an abstraction of DistanceFilter which has different implementations, one being a multi-threaded implementation and the other being a blank implementation that can be used when no distance filtering is to occur.
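The delegation idea from the pseudocode in the comment above could be sketched like this. The names (`DistanceFunction`, `MapCachingDistFunc`) follow the pseudocode and are hypothetical, not actual contrib/spatial classes; the distance signature is simplified to take only a doc id.

```java
import java.util.HashMap;
import java.util.Map;

// A distance "function" interface plus a map-caching decorator, so filters,
// sorts and score queries can all consume the same abstraction without
// knowing whether caching happens underneath.
interface DistanceFunction {
    double distance(int docId);
}

class MapCachingDistFunc implements DistanceFunction {
    private final DistanceFunction delegate;
    private final Map<Integer, Double> cache = new HashMap<>();

    MapCachingDistFunc(DistanceFunction delegate) {
        this.delegate = delegate;
    }

    @Override
    public double distance(int docId) {
        // transparent caching: callers never write "if cached ... else ..." logic
        return cache.computeIfAbsent(docId, delegate::distance);
    }
}
```

Stacking decorators would then give the second-level-cache behavior mentioned later in the thread: wrap an existing caching function in another one without touching the filter or sort code.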
[jira] Commented: (LUCENE-2152) Abstract Spatial distance filtering process and supported field formats
[ https://issues.apache.org/jira/browse/LUCENE-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795295#action_12795295 ] Simon Willnauer commented on LUCENE-2152: - bq. Given that, I'm still sort of favouring separating the distance calculation function from the storage mechanism. The actual reasons why I proposed it that way are kind of special. Imagine you do a search 1 mile around point X, and the next search is 2 miles around point X. For such a case you could simply wrap the function in another cache function, using the already existing cache as a second-level cache. All the logic for that would be encapsulated in a simple function. None of the logic would be necessary in any of the implementations like CustomScoreQuery, Sort or Filter. Yet if you separate them into two interfaces (not necessarily the java interface) you would have to have some logic which checks if the value is already cached somewhere. I'm not bound to this solution; just throwing in random thoughts which could be useful for users to some extent. For me a distance is just a function and I don't care if it is cached or not. The logic which takes care of caching should be completely transparent IMO. If possible we should prevent calls inside the filter etc. like: {code} if(cached): getFromCache() else: getFromFunc() {code} Abstract Spatial distance filtering process and supported field formats --- Key: LUCENE-2152 URL: https://issues.apache.org/jira/browse/LUCENE-2152 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Attachments: LUCENE-2152.patch, LUCENE-2152.patch, LUCENE-2152.patch Currently the second stage of the filtering process in the spatial contrib involves calculating the exact distance for the remaining documents, and filtering out those that fall out of the search radius.
Currently this is done through the 2 impls of DistanceFilter, LatLngDistanceFilter and GeoHashDistanceFilter. The main difference between these 2 impls is the format of data they support, the former supporting lat/lngs being stored in 2 distinct fields, while the latter supports geohashed lat/lngs through the GeoHashUtils. This difference should be abstracted out so that the distance filtering process is data format agnostic. The second issue is that the distance filtering algorithm can be considerably optimized by using multiple threads. Therefore it makes sense to have an abstraction of DistanceFilter which has different implementations, one being a multi-threaded implementation and the other being a blank implementation that can be used when no distance filtering is to occur.
[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795043#action_12795043 ] Simon Willnauer commented on LUCENE-2183: - Hey guys, thanks for your comments. When I started thinking about this issue I had a quick chat with Robert and we figured that his solution could work, so I implemented it. Yet, I found two problems with it. 1. If a user calls super.isTokenChar(char) and the superclass has implemented the int method, the UOE will never be thrown and the code does not behave as expected from the user perspective. - This is what Robert explained above. We could solve this problem with reflection, which leads to the second problem. 2. If a Tokenizer like LowerCaseTokenizer only overrides normalize(char|int), it relies on the superclass implementation of isTokenChar. Yet if we solve problem 1, the user would be forced to override isTokenChar to just call super.isTokenChar; otherwise the reflection code will raise an exception that the int method is not implemented in the concrete class, or will use the char API - either way it will not do what is expected. Working around those two problems was the reason for a new API for CharTokenizer. My personal opinion is that inheritance is the wrong tool for changing behavior. I used delegation (like a strategy) to, on the one hand, define a clear new API and, on the other, decouple the code changing the behavior of the Tokenizer from the tokenizer itself. Inheritance for me is for extending a class, and delegation is for changing behavior in this particular problem. Decoupling the old from the new has several advantages over a reflection / inheritance based solution: 1. if a user does not provide a delegation impl, he wants to use the old API 2. if a user does provide a delegation impl, he still has the ability to choose between char processing in 3.0 style or 3.1 style 3.
no matter what is provided, a user has full flexibility to choose the combination of their choice - old char processing or the new int-based API (maybe minor though) 4. we can leave all tokenizer subclasses as they are and define new functions that implement their behavior in parallel. Those functions can be made final from the beginning, which prevents users from subclassing them. (All of the existing ones should be final in my opinion - like LowerCaseTokenizer, which should call Character.isLetter in isTokenCodePoint(int) directly instead of subclassing another function.) As a user I would expect Lucene to revise design decisions made years ago when there is a need for it, like we have in this issue. It is easier to change behavior in user code by swapping to a new API instead of digging into a workaround implementation of an old API silently calling a new API. Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2183.patch CharTokenizer is an abstract base class for all Tokenizers operating on a character level. Yet, those tokenizers still use char primitives instead of int codepoints. CharTokenizer should operate on codepoints and preserve bw compatibility.
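The delegation (strategy) approach argued for above can be sketched in miniature: the per-character decisions move out of the tokenizer into a strategy working on int codepoints, so supplementary characters outside the BMP are handled as one unit. This is an illustrative sketch of the idea, not the actual CharTokenizer API; `CodePointTokenizer` and `CharProcessor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointTokenizer {
    // the strategy: decides token membership and normalization per codepoint
    interface CharProcessor {
        boolean isTokenCodePoint(int cp);
        int normalizeCodePoint(int cp);
    }

    static List<String> tokenize(String text, CharProcessor proc) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // iterate by codepoint, not by char, so surrogate pairs stay intact
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (proc.isTokenCodePoint(cp)) {
                current.appendCodePoint(proc.normalizeCodePoint(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

A letter tokenizer then becomes a two-method strategy (`Character.isLetter` / `Character.toLowerCase`) instead of a subclass, and a Deseret capital like U+10400 is lowercased as a single codepoint rather than as two chars.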
[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795091#action_12795091 ] Simon Willnauer commented on LUCENE-2183: - {quote} #2 is no problem at all, instead the reflection code to address #1 must be implemented with these conditions * A is the class implementing method isTokenChar(int) * B is the class implementing method isTokenChar(char) * B is a subclass of A * A is not CharTokenizer {quote} OK, here is a scenario: {code} class MySmartDeseretTokenizer extends LetterTokenizer { public boolean isTokenChar(char c) { // we trust that Deseret high/low surrogates are never unpaired return super.isTokenChar(c) || isDeseretHighLowSurrogate(c); } public char normalize(char c) { if(isDeseretHighSurrogate(c)) return c; if(isDeseretLowSurrogate(c)) return lowerCaseDeseret('\ud801', c)[1]; return Character.toLowerCase(c); } public int normalize(int c) { return Character.toLowerCase(c); } } {code} If somebody has code similar to this, they might want to preserve compat because they have different versions of their app. Yet the old app only supports Deseret high surrogates, but the new one accepts all letter supplementary chars due to super.isTokenChar(int). This scenario will break our reflection solution, and users might be disappointed, as the new API is there to bring Unicode support. I don't say this scenario exists, but it could be a valid one for a very special use case. I don't say my proposal is THE way to go, but I really don't want to use reflection - this would make things worse IMO. Let's find a solution that fits all scenarios. bq. in the design you propose under the new api, subclassing is impossible, which I am not sure I like either. Hmm, that is not true. You can still subclass and pass your impl up to the superclass. I haven't implemented that yet, but this is definitely possible.
Supplementary Character Handling in CharTokenizer - Key: LUCENE-2183 URL: https://issues.apache.org/jira/browse/LUCENE-2183 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Simon Willnauer Fix For: 3.1 Attachments: LUCENE-2183.patch CharTokenizer is an abstract base class for all Tokenizers operating on a character level. Yet, those tokenizers still use char primitives instead of int codepoints. CharTokenizer should operate on codepoints and preserve bw compatibility.
[jira] Updated: (LUCENE-2147) Improve Spatial Utility like classes
[ https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2147: Attachment: LUCENE-2147.patch Chris, this seems to be ready to be committed soon. I removed the flux warnings from the class javadocs, converted the tests to JUnit 4 and added a CHANGES.txt entry to make it ready to be committed. Improve Spatial Utility like classes Key: LUCENE-2147 URL: https://issues.apache.org/jira/browse/LUCENE-2147 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 3.1 Reporter: Chris Male Assignee: Simon Willnauer Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch - DistanceUnits can be improved by giving functionality to the enum, such as being able to convert between different units, and adding tests. - GeoHashUtils can be improved through some code tidying, documentation, and tests. - SpatialConstants allows us to move all constants, such as the radii and circumferences of Earth, to a single consistent location that we can then use throughout the contrib. This also allows us to improve the transparency of calculations done in the contrib, as users of the contrib can easily see the values being used. Currently this issue does not migrate classes to use these constants; that will happen in issues related to the appropriate classes.