[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498492
 ] 

Ryan McKinley commented on SOLR-248:


 
 1) would it make sense for the keep option to refer to a file, using the same 
 format as StopFilter ... that way it's easy to reuse the same file (which 
 seems like it would be a common case).
 

probably.  that is a good idea


 2) what is the point of forceFirstLetter=true ? ... if you want to force 
 capitalization, what's the point of making the keep list?
 

This is one that came of necessity!

with keep=the ...  and input:
 Grand army of the Republic, the arts

I want: Grand Army of the Republic and The Arts

forceFirstLetter only applies to the first character in the token, not to 
each word.
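
A minimal sketch of the behavior described above, not the code from the attached patch. The class and method names are hypothetical, and it assumes that the keep list is matched case-insensitively, that kept words are left untouched, and that the keep list for this input also contains "of".

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the SOLR-248 patch.
class CapitalizeSketch {

    // Upper-case the first character of a word and lower-case the rest.
    static String capitalizeWord(String word) {
        if (word.isEmpty()) {
            return word;
        }
        return word.substring(0, 1).toUpperCase(Locale.ROOT)
                + word.substring(1).toLowerCase(Locale.ROOT);
    }

    // Capitalize every word in the token except those in the keep set.
    // forceFirstLetter then upper-cases only the first character of the
    // whole token, even when the first word is in the keep set.
    static String capitalizeToken(String token, Set<String> keep, boolean forceFirstLetter) {
        String[] words = token.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            boolean kept = keep.contains(words[i].toLowerCase(Locale.ROOT));
            out.append(kept ? words[i] : capitalizeWord(words[i]));
        }
        if (forceFirstLetter && out.length() > 0) {
            out.setCharAt(0, Character.toUpperCase(out.charAt(0)));
        }
        return out.toString();
    }
}
```

With a keep set of {"of", "the"} and forceFirstLetter=true, the token "Grand army of the Republic" becomes "Grand Army of the Republic" and the token "the arts" becomes "The Arts". With forceFirstLetter=false the second token would come out as "the Arts".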


 3) is okPrefix going to force the case for things that have that prefix in an 
 alternate case, or only allow that casing to remain (ie: if I index McKeen, 
 Mckeen, mckeen and MCKEEN what tokens do I wind up with?)
 

As written, if the prefix matches, it assumes the word capitalization is 
correct.  For my input data, this is sufficient -- but it should probably do 
something smarter.

So, if you index McKeen, Mckeen, mckeen, MCKEEN and McKEEN, you would get:

 McKeen, Mckeen, Mckeen, Mckeen And McKEEN

If okPrefix was treated as *the* capitalization for input where the lowercase 
prefix matches mck, it would give:

 McKeen, McKeen, McKeen, McKeen And McKeen
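
The two okPrefix behaviors compared above can be sketched as follows. This is a hedged illustration, not the patch code: the class and method names are hypothetical, and the prefix value "McK" is an assumption inferred from the example outputs.

```java
import java.util.Locale;

// Hypothetical illustration of the two okPrefix interpretations.
class OkPrefixSketch {

    // Upper-case the first character of a word and lower-case the rest.
    static String capitalize(String word) {
        if (word.isEmpty()) {
            return word;
        }
        return word.substring(0, 1).toUpperCase(Locale.ROOT)
                + word.substring(1).toLowerCase(Locale.ROOT);
    }

    // "As written": an exact, case-sensitive prefix match means the word
    // is trusted and returned unchanged; anything else is capitalized.
    static String trustIfPrefixMatches(String word, String okPrefix) {
        return word.startsWith(okPrefix) ? word : capitalize(word);
    }

    // Alternative: a case-insensitive prefix match imposes the prefix's
    // own casing and lower-cases the remainder of the word.
    static String forcePrefixCase(String word, String okPrefix) {
        if (word.regionMatches(true, 0, okPrefix, 0, okPrefix.length())) {
            return okPrefix + word.substring(okPrefix.length()).toLowerCase(Locale.ROOT);
        }
        return capitalize(word);
    }
}
```

Under these assumptions, trustIfPrefixMatches leaves McKeen and McKEEN alone and turns Mckeen, mckeen and MCKEEN into Mckeen, while forcePrefixCase maps all five spellings to McKeen.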



 Capitalization Filter Factory
 -

 Key: SOLR-248
 URL: https://issues.apache.org/jira/browse/SOLR-248
 Project: Solr
  Issue Type: New Feature
Reporter: Ryan McKinley
Priority: Minor
 Attachments: SOLR-248-CapitalizationFilter.patch


 For tokens that are used in faceting, it is nice to have standard 
 capitalization.  
 I want Aerial views and Aerial Views to both be: Aerial Views

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498700
 ] 

Yonik Seeley commented on SOLR-248:
---

Hmmm, this feels slightly strange implementing at the indexing level.
What are the advantages/disadvantages vs just lowercasing for indexing and 
capitalizing at the presentation/application layer?





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498711
 ] 

Ryan McKinley commented on SOLR-248:


It is a little strange, but (in my case anyway) I think it makes sense...  

I am indexing a bunch of metadata from a bunch of libraries (OAI-PMH) -- I want 
to display the data exactly as it came from the source, but for faceted 
browsing I need to normalize capitalization.

Implemented at the indexing level, I can have different values for the stored 
value and indexed terms.  Also, at the indexing level I can leverage existing 
Tokenizers and Filters to build the tokens that need capitalization -- it keeps 
all the configuration in schema.xml and lets the OAI-to-Solr XML conversion be 
a simple transformation.  This way, whoever takes care of this need only learn 
Solr configuration, not ryan+solr configuration. 

If it is not generally useful I can keep it elsewhere - that is why we have the 
nice plugin framework!






[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498717
 ] 

Yonik Seeley commented on SOLR-248:
---

 Implemented at the indexing level, I can have different values for the stored 
 value and indexed terms.
One downside is that it complicates certain things like wildcard or prefix 
queries (capitalizing the first letter and lowercasing the second is something 
that the QueryParser does not support).

You could still store the values verbatim, and index as all lowercase.
Then the application could capitalize the results it gets back as it sees fit.
I do see value pushing this type of logic back to the search engine though.

Of course, I think this might be a more general problem in faceting... what to 
actually use as a label for display purposes vs what the terms in the index 
were (think price formatting, labels for more complex facet queries, etc).





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread J.J. Larrea (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498817
 ] 

J.J. Larrea commented on SOLR-248:
--

While I fully agree that faceting does raise some odd issues stemming from the 
display of normally-invisible indexed values to humans, and that it 
theoretically should be the responsibility of the front-end to translate index 
values into human-readable values, there are great practical advantages in both 
efficiency and convenience to making the indexed values pretty, and to 
centralize as much of that as possible in the Analysis stage.

In particular, I will try this and am very likely to put this into use this 
weekend, so thank you Ryan!  So I'm +1 to adding it to the Solr distribution, 
though to avoid confusing people it should have a JavaDoc comment explaining 
that the main use is in faceting to avoid having to introduce such common logic 
into the presentation-layer.

Regarding the implementation,

1. For 'keep' and 'okPrefix' (and were it not for reverse-compatibility issues, 
for 'words' in StopFilter), it would be nice to have a means to specify either 
a direct list or a filename in the same parameter.  A simple approach might be 
something like keep=word word word... vs. keep=file, or even keep=file 
file word word (with the requirement for backslash-escaping spaces in 
either)...  Or alternately something like txt:filename (vs. xml:filename, 
json:filename, etc.) with an unescaped : being significant.
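
One way the txt:filename variant could look, as a rough sketch only. The class and method names are hypothetical, the file is assumed to hold one word per line in UTF-8, and the backslash-escaping mentioned above is left out for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical parser for a parameter that is either a direct word list
// ("of the and") or a file reference ("txt:keepwords.txt").
class KeepParamSketch {

    static Set<String> parseKeep(String value) {
        Set<String> words = new LinkedHashSet<>();
        String v = value.trim();
        if (v.startsWith("txt:")) {
            // Assumed file format: one word per line, blank lines ignored.
            try {
                for (String line : Files.readAllLines(
                        Paths.get(v.substring("txt:".length())), StandardCharsets.UTF_8)) {
                    String word = line.trim();
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else if (!v.isEmpty()) {
            // Direct list: whitespace-separated words.
            for (String word : v.split("\\s+")) {
                words.add(word);
            }
        }
        return Collections.unmodifiableSet(words);
    }
}
```

A literal word that itself begins with "txt:" could not be expressed in this sketch; that is the case the escaping rule would have to cover.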

2. Why is so much of the logic in the Factory?  This drags Solr-specific stuff 
in when a user might want to use just the Analyzer in a non-Solr context. 
Wouldn't it be better in general for Solr Analyzers to be self-complete, with 
the Factory merely being an adaptor between SolrParams/external resources and 
the Analyzer's constructor?

Also, why is keep in a synchronized map, since there is no mutator?  (I know, 
picky picky...)

Good luck with the deadline!





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498834
 ] 

Yonik Seeley commented on SOLR-248:
---

 Why is so much of the logic in the Factory?

I haven't looked at this specific code, but this is my preference in general.  
Multiple TokenFilters are created per-field instance on the index side, and 
per-query-term on the search side, so it's better to pull all the setup you can 
out of the Filter for performance reasons.





Re: [jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Chris Hostetter

: I haven't looked at this specific code, but this is my preference in
: general.  multiple TokenFilters are created per-field instance on the
: index side, and per-query-term on the search side, so it's better to
: pull all the setup you can out of the Filter for performance reasons.

computation can be done at factory instantiation, but it can make sense to
put the code for the computation in static methods within the Filter class
itself -- so it's more reusable outside of Solr.
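
A rough sketch of that split, using plain stand-in classes rather than the real Lucene TokenFilter or Solr factory base classes. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Stand-in for the filter: no Solr dependency, cheap to construct.
class KeepWordsFilterSketch {
    private final Set<String> keep;

    // The constructor only stores a reference; no per-instance setup.
    KeepWordsFilterSketch(Set<String> keep) {
        this.keep = keep;
    }

    // The setup code lives with the filter as a static method, so code
    // outside Solr can build the same set without going through a factory.
    static Set<String> buildKeepSet(String spaceSeparatedWords) {
        String trimmed = spaceSeparatedWords.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(
                new HashSet<>(Arrays.asList(trimmed.split("\\s+"))));
    }

    boolean isKept(String word) {
        return keep.contains(word);
    }
}

// Stand-in for the factory: runs the computation once, at instantiation,
// and hands the same set to every filter it creates.
class KeepWordsFactorySketch {
    private final Set<String> keep;

    KeepWordsFactorySketch(String keepParam) {
        this.keep = KeepWordsFilterSketch.buildKeepSet(keepParam);
    }

    KeepWordsFilterSketch create() {
        return new KeepWordsFilterSketch(keep);
    }
}
```

Each call to create() is then just an object allocation, while a non-Solr caller can still use KeepWordsFilterSketch.buildKeepSet directly.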



-Hoss



[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498841
 ] 

Ryan McKinley commented on SOLR-248:


 Why is so much of the logic in the Factory? 

It seemed silly to copy the same things over and over for each time the type is 
indexed or queried...  

 why is keep in a synchronized map,

I'm not sure it needs to be, but I was being cautious...   the map is only 
created once (and never edited) but could be accessed by many threads 
simultaneously.
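
For a collection that is fully built before it is shared and never modified afterwards, an unmodifiable wrapper held in a final field is generally enough for concurrent reads without synchronization. A minimal sketch under that assumption, with hypothetical names:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for a read-only keep set shared between threads.
class ReadOnlyKeepSketch {
    // Populated once in the constructor and never changed afterwards.
    private final Set<String> keep;

    ReadOnlyKeepSketch(String... words) {
        this.keep = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words)));
    }

    boolean isKept(String word) {
        return keep.contains(word);
    }

    // Look the word up from several threads at once and count the hits.
    static int countHits(ReadOnlyKeepSketch shared, String word, int threads) {
        AtomicInteger hits = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                if (shared.isKept(word)) {
                    hits.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return hits.get();
    }
}
```

If the set ever needed to change after construction, this reasoning would no longer hold and some form of synchronization or a concurrent collection would be needed.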







[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-23 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498488
 ] 

Hoss Man commented on SOLR-248:
---

1) would it make sense for the keep option to refer to a file, using the same 
format as StopFilter ... that way it's easy to reuse the same file (which seems 
like it would be a common case).

2) what is the point of forceFirstLetter=true ? ... if you want to force 
capitalization, what's the point of making the keep list?

3) is okPrefix going to force the case for things that have that prefix in an 
alternate case, or only allow that casing to remain (ie: if I index McKeen, 
Mckeen, mckeen and MCKEEN what tokens do I wind up with?)
