[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865803#comment-13865803 ]

ASF subversion and git services commented on LUCENE-5369:

Commit 1556617 from [~ryantxu] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1556617 ]

LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens

Add an UpperCaseFilter
----------------------

Key: LUCENE-5369
URL: https://issues.apache.org/jira/browse/LUCENE-5369
Project: Lucene - Core
Issue Type: New Feature
Reporter: Ryan McKinley
Assignee: Ryan McKinley
Priority: Minor
Attachments: LUCENE-5369-uppercase-filter.patch

We should offer a standard way to force upper-case tokens. I understand that lowercase is safer for general search quality because some uppercase characters can represent multiple lowercase ones. However, having upper-case tokens is often nice for faceting (consider normalizing to standard acronyms)

--
This message was sent by Atlassian JIRA (v6.1.5#6160)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
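Editorial illustration of the point in the issue description that some uppercase characters can represent multiple lowercase ones. This is a minimal plain-Java sketch, not code from the attached patch. The class and method names are hypothetical, and Greek sigma is chosen here only as one well-known case.

```java
// Hypothetical demo class; not part of Lucene or the LUCENE-5369 patch.
final class CaseRoundTripDemo {

    /** Upper-cases each UTF-16 char independently with Character.toUpperCase(char). */
    public static String upper(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toUpperCase(buf[i]);
        }
        return new String(buf);
    }

    /** Lower-cases each UTF-16 char independently with Character.toLowerCase(char). */
    public static String lower(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        String medialSigma = "\u03C3"; // GREEK SMALL LETTER SIGMA
        String finalSigma = "\u03C2";  // GREEK SMALL LETTER FINAL SIGMA

        // Both lowercase forms map to the same capital sigma (U+03A3).
        System.out.println(upper(medialSigma).equals(upper(finalSigma))); // true

        // Going back down cannot recover the final form: U+03A3 lowers to U+03C3.
        System.out.println(lower(upper(finalSigma)).equals(finalSigma)); // false
    }
}
```

Once tokens are upper-cased, the distinction between the two lowercase forms is gone, which is harmless for display-oriented facet labels but is the kind of loss the description has in mind for general search.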
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865807#comment-13865807 ]

ASF subversion and git services commented on LUCENE-5369:

Commit 1556618 from [~ryantxu] in branch 'dev/trunk'
[ https://svn.apache.org/r1556618 ]

LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens (merge from 4x)
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865902#comment-13865902 ]

Shawn Heisey commented on LUCENE-5369:

[~ryantxu], this fails precommit because the new files are missing svn:eol-style. I actually ran the precommit because I was worried that it would fail the forbidden-apis check. Looks like that only fails on String#toUpperCase if you don't include a locale. Javadocs for Character say that Character#toUpperCase uses Unicode information, so I guess it's OK -- and precommit passed just fine after I added svn:eol-style native to the new files.
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865911#comment-13865911 ]

Uwe Schindler commented on LUCENE-5369:

Yes Character.toUpperCase is fine and locale invariant.
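Editorial illustration of the locale point discussed in this comment and the one before it. The following is a minimal plain-Java sketch, assuming only the JDK; the class and method names are hypothetical and it is not the filter from the patch. It contrasts per-character Character.toUpperCase, which takes no locale, with String#toUpperCase, whose result can change with the locale passed in (or with the JVM default when none is passed).

```java
import java.util.Locale;

// Hypothetical demo class; not part of Lucene or the LUCENE-5369 patch.
final class UpperCaseLocaleDemo {

    /** Upper-cases each UTF-16 char with Character.toUpperCase(char); no locale is consulted. */
    public static String upperPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toUpperCase(buf[i]);
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        String token = "title";
        Locale turkish = Locale.forLanguageTag("tr");

        // Per-character mapping gives the same answer on every machine.
        System.out.println(upperPerChar(token)); // TITLE

        // With an explicit root locale, String#toUpperCase agrees.
        System.out.println(token.toUpperCase(Locale.ROOT)); // TITLE

        // Under Turkish rules, 'i' maps to dotted capital I (U+0130) instead.
        System.out.println(token.toUpperCase(turkish).equals("T\u0130TLE")); // true
    }
}
```

The no-argument String#toUpperCase() uses the JVM default locale, so its output can differ between machines; that is presumably why the forbidden-apis check mentioned above flags it.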
Re: [jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
fixing now... sorry

On Wed, Jan 8, 2014 at 1:28 PM, Uwe Schindler (JIRA) j...@apache.org wrote:

Yes Character.toUpperCase is fine and locale invariant.
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865916#comment-13865916 ]

ASF subversion and git services commented on LUCENE-5369:

Commit 1556643 from [~ryantxu] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1556643 ]

LUCENE-5369: missing eol:style
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865917#comment-13865917 ]

ASF subversion and git services commented on LUCENE-5369:

Commit 1556644 from [~ryantxu] in branch 'dev/trunk'
[ https://svn.apache.org/r1556644 ]

LUCENE-5369: missing eol:style (merge from 4x)
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856910#comment-13856910 ]

Yonik Seeley commented on LUCENE-5369:

+1, looks fine.
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856049#comment-13856049 ]

Ryan McKinley commented on LUCENE-5369:

Unless I hear objections, I would like to commit in the next few weeks.

thanks
ryan
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13849361#comment-13849361 ]

Ryan McKinley commented on LUCENE-5369:

[~thetaphi] or [~rcmuir], any thoughts on this?

thanks
ryan
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13849374#comment-13849374 ]

Robert Muir commented on LUCENE-5369:

My only thoughts are the usual ones: to me, the analysis chain is not really the best tool for cleaning up faceting labels. These tasks typically don't require tokenization, they work on whole values, and they may require things like extracting values from one field into another.

It's true that you can do some of this cleanup (casing, trimming, etc.) in the analysis chain by (ab)using the fact that the FieldCache uninverts indexed values, combined with KeywordTokenizer and filters like this one. But it's not very intuitive, and you can't do all of it that way, whereas something like Solr's update processor chain might be a better place to have this support. There is already overlap, e.g. it can trim field contents as well.
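Editorial illustration of the alternative raised in this comment: normalizing whole field values before indexing rather than in the analysis chain. This is a plain-Java sketch under stated assumptions. It does not use Solr's update processor API; the class, the method names, and the field names "org" and "org_facet" are all hypothetical, and a document is modelled as a simple map.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical pre-indexing step; it only mimics the idea of an update processor.
final class FacetLabelPreprocessor {

    /** Normalizes a whole field value: trims whitespace, then upper-cases with a fixed locale. */
    public static String normalize(String raw) {
        return raw.trim().toUpperCase(Locale.ROOT);
    }

    /**
     * Returns a copy of the document with the normalized source value stored
     * in a separate target field, leaving the source field untouched.
     */
    public static Map<String, String> process(Map<String, String> doc, String source, String target) {
        Map<String, String> out = new HashMap<>(doc);
        String value = doc.get(source);
        if (value != null) {
            out.put(target, normalize(value));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("org", "  nasa ");
        Map<String, String> processed = process(doc, "org", "org_facet");
        System.out.println(processed.get("org_facet")); // NASA
    }
}
```

The design difference is that the stored and displayed facet value is produced once, on the whole value, and the search field can keep its own analysis; the cost is a second field, which is the redundancy objected to later in the thread.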
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13849372#comment-13849372 ]

Uwe Schindler commented on LUCENE-5369:

Maybe add a boolean option in the factory/filter? To remove code duplication?
[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter
[ https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13849582#comment-13849582 ]

Ryan McKinley commented on LUCENE-5369:

bq. Maybe add a boolean option in the factory/filter? To remove code duplication?

Are you suggesting adding a flag to LowerCaseFilter? I think that is more confusing than having a distinct UpperCaseFilter -- and the code duplication is essentially the minimum code required for a functioning Filter.

bq. to me the analysis chain is not really the best tool to do the job of cleaning up faceting labels

I understand and often agree that other tools are more appropriate. But there are lots of cases where the search analysis chain gets you so close to the desired display that duplicating things to a specific facet field seems redundant.

This is the analyzer I am working with:

{code}
<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="normalize-my-field-chars.txt"/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="xxx.UpperCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="path/to/synonyms.txt" ignoreCase="false" expand="false"/>
</analyzer>
{code}
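Editorial illustration of what the analyzer in this comment does to a single whole value. This is a rough plain-Java approximation, not Lucene or Solr code. The class name and the synonym entry are hypothetical, the mapping char filter step is omitted because its mapping file is not shown, and stripping combining marks after NFD normalization is only a stand-in for ASCIIFoldingFilter, which covers many more characters.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the chain: keyword token -> trim -> fold -> upper-case -> synonym.
final class FacetChainSketch {

    // Hypothetical synonym table; the real entries live in the synonyms file, which is not shown.
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("UNITED STATES", "USA");
    }

    /** Treats the input as one token (as KeywordTokenizer would) and applies each step in order. */
    public static String analyze(String raw) {
        // Trim step.
        String value = raw.trim();
        // Approximate ASCII folding: decompose, then drop combining marks.
        value = Normalizer.normalize(value, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        // Upper-case step, with a fixed locale so results do not depend on the JVM default.
        value = value.toUpperCase(Locale.ROOT);
        // Case-sensitive synonym lookup (ignoreCase=false) replacing the value (expand=false).
        return SYNONYMS.getOrDefault(value, value);
    }

    public static void main(String[] args) {
        System.out.println(analyze("  united st\u00E1tes ")); // USA
        System.out.println(analyze("nasa"));                  // NASA
    }
}
```

Because upper-casing runs before the synonym step, a case-sensitive synonym table only needs upper-case keys, which is consistent with ignoreCase=false in the configuration above.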