[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865803#comment-13865803
 ] 

ASF subversion and git services commented on LUCENE-5369:
-

Commit 1556617 from [~ryantxu] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1556617 ]

LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens

 Add an UpperCaseFilter
 --

 Key: LUCENE-5369
 URL: https://issues.apache.org/jira/browse/LUCENE-5369
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Ryan McKinley
Assignee: Ryan McKinley
Priority: Minor
 Attachments: LUCENE-5369-uppercase-filter.patch


 We should offer a standard way to force upper-case tokens.  I understand that 
 lowercase is safer for general search quality because some uppercase 
 characters can represent multiple lowercase ones.
 However, having upper-case tokens is often nice for faceting (consider 
 normalizing to standard acronyms).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865807#comment-13865807
 ] 

ASF subversion and git services commented on LUCENE-5369:
-

Commit 1556618 from [~ryantxu] in branch 'dev/trunk'
[ https://svn.apache.org/r1556618 ]

LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens (merge from 4x)




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865902#comment-13865902
 ] 

Shawn Heisey commented on LUCENE-5369:
--

[~ryantxu], this fails precommit because the new files are missing 
svn:eol-style.

I actually ran the precommit because I was worried that it would fail the 
forbidden-apis check.  Looks like that only fails on String#toUpperCase if you 
don't include a locale.  Javadocs for Character say that Character#toUpperCase 
uses Unicode information, so I guess it's OK -- and precommit passed just fine 
after I added svn:eol-style native to the new files.
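
To illustrate the distinction above, here is a small standalone Java sketch. It is not taken from the patch; the class and method names are hypothetical. It contrasts String#toUpperCase, whose result depends on the locale passed in (or on the JVM default when none is given), with Character#toUpperCase, which uses Unicode data only.

```java
import java.util.Locale;

public class UpperCaseLocaleDemo {

    // Hypothetical helper: upper-cases a string one code point at a time.
    // Character.toUpperCase(int) never consults a locale.
    static String upperPerCodePoint(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            out.appendCodePoint(Character.toUpperCase(cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String token = "title";
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Under Turkish rules, 'i' maps to dotted capital I (U+0130).
        System.out.println(token.toUpperCase(turkish));

        // An explicit, neutral locale gives a stable result.
        System.out.println(token.toUpperCase(Locale.ROOT));

        // The per-code-point variant is stable regardless of locale.
        System.out.println(upperPerCodePoint(token));
    }
}
```

The locale-free String#toUpperCase() call is the one that would behave like the Turkish case on a machine whose default locale is Turkish, which is presumably why the forbidden-apis check rejects it.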





[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865911#comment-13865911
 ] 

Uwe Schindler commented on LUCENE-5369:
---

Yes, Character.toUpperCase is fine and locale-invariant.




Re: [jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread Ryan McKinley
fixing now... sorry


On Wed, Jan 8, 2014 at 1:28 PM, Uwe Schindler (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865911#comment-13865911]

 Uwe Schindler commented on LUCENE-5369:
 ---

 Yes Character.toUpperCase is fine and locale invariant.





[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865916#comment-13865916
 ] 

ASF subversion and git services commented on LUCENE-5369:
-

Commit 1556643 from [~ryantxu] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1556643 ]

LUCENE-5369: missing eol:style




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2014-01-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865917#comment-13865917
 ] 

ASF subversion and git services commented on LUCENE-5369:
-

Commit 1556644 from [~ryantxu] in branch 'dev/trunk'
[ https://svn.apache.org/r1556644 ]

LUCENE-5369: missing eol:style (merge from 4x)




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2013-12-26 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856910#comment-13856910
 ] 

Yonik Seeley commented on LUCENE-5369:
--

+1, looks fine.




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2013-12-23 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856049#comment-13856049
 ] 

Ryan McKinley commented on LUCENE-5369:
---

Unless I hear objections, I would like to commit this in the next few weeks.

thanks
ryan




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2013-12-16 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849361#comment-13849361
 ] 

Ryan McKinley commented on LUCENE-5369:
---

[~thetaphi] or [~rcmuir], any thoughts on this?

thanks
ryan




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2013-12-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849374#comment-13849374
 ] 

Robert Muir commented on LUCENE-5369:
-

My only thoughts are the usual ones: to me, the analysis chain is not really the 
best tool for the job of cleaning up faceting labels.

These tasks typically don't require tokenization and work on whole values, and 
may require things like extracting values from one field into another. While it's 
true you can do some of this cleanup (casing, trimming, etc.) in the analysis 
chain by (ab)using the fact that FieldCache uninverts indexed values, using 
KeywordTokenizer, and using filters like this one, it's not very intuitive, and 
you can't do all of it. Something like Solr's update processor chain might be a 
better place to have this support. There is already overlap: e.g., it can trim 
field contents as well.
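
As a rough sketch of what "working on whole values" means here, the cleanup can be written as a plain function over the raw field value, applied before the document is indexed. This is not Solr's update processor API; the class and method names are hypothetical stand-ins used only for illustration.

```java
import java.util.Locale;

public class FacetLabelNormalizer {

    // Hypothetical whole-value cleanup: trim, then upper-case with a
    // fixed locale so the result does not depend on the JVM default.
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  nasa "));
    }
}
```

Because the value is rewritten before indexing, the stored and indexed forms agree, which is the main difference from doing the same cleanup inside the analysis chain.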




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2013-12-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849372#comment-13849372
 ] 

Uwe Schindler commented on LUCENE-5369:
---

Maybe add a boolean option to the factory/filter, to remove code duplication?




[jira] [Commented] (LUCENE-5369) Add an UpperCaseFilter

2013-12-16 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849582#comment-13849582
 ] 

Ryan McKinley commented on LUCENE-5369:
---

bq. Maybe add a boolean option in the factory/filter? To remove code 
duplication?

Are you suggesting adding a flag to LowerCaseFilter?  I think that is more 
confusing than having a distinct UpperCaseFilter, and the code duplication is 
essentially the minimum code required for a functioning Filter.
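
For a sense of how little code that is, here is a minimal sketch of the per-token work in plain Java. It is not the code from the attached patch, and the class and method names are made up for illustration; a real TokenFilter would presumably apply something like this to the term buffer on each token.

```java
public class UpperCaseBufferSketch {

    // Upper-cases buffer[offset .. offset+length) in place, one code point
    // at a time. Assumption: the upper-case form of each code point occupies
    // the same number of chars as the original, so the buffer never grows.
    static void toUpperCase(char[] buffer, int offset, int length) {
        int end = offset + length;
        for (int i = offset; i < end; ) {
            int cp = Character.codePointAt(buffer, i, end);
            // toChars writes the mapped code point at index i and returns
            // how many chars it wrote (1 or 2).
            i += Character.toChars(Character.toUpperCase(cp), buffer, i);
        }
    }

    public static void main(String[] args) {
        char[] term = "nasa hq".toCharArray();
        toUpperCase(term, 0, term.length);
        System.out.println(new String(term));
    }
}
```

Working on code points rather than individual chars keeps surrogate pairs intact, and since Character.toUpperCase is a one-to-one mapping, a character such as 'ß' is left unchanged rather than expanded.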

bq. to me the analysis chain is not really the best tool to do the job of 
cleaning up faceting labels

I understand and often agree that other tools are more appropriate.  But there 
are lots of cases where the search analysis chain gets you so close to the 
desired display that duplicating things to a specific facet field seems 
redundant.

This is the analyzer I am working with:

{code}
<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="normalize-my-field-chars.txt"/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="xxx.UpperCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="path/to/synonyms.txt" ignoreCase="false" expand="false"/>
</analyzer>
{code}




