[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498841
 ] 

Ryan McKinley commented on SOLR-248:


> Why is so much of the logic in the Factory? 

It seemed silly to copy the same things over and over each time the type is 
indexed or queried...  

> why is keep in a synchronized map,

I'm not sure it needs to be, but I was being cautious...   the map is only 
created once (and never edited), but it could be accessed by many threads 
simultaneously.
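
As a rough illustration of the point above (not the code in the attached patch): a collection that is built once, never modified, and held in a final field can generally be read from many threads without a synchronized wrapper. The class name KeepWords and the space-separated input format are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the factory's "keep" list; not the patch's code.
class KeepWords {
    // Assigned once in the constructor and never modified afterwards.
    // A final field is safely published to other threads once the
    // constructor has finished.
    private final Set<String> keep;

    KeepWords(String spaceSeparatedWords) {
        Set<String> words =
            new HashSet<>(Arrays.asList(spaceSeparatedWords.trim().split("\\s+")));
        // The unmodifiable view guards against accidental edits later on.
        this.keep = Collections.unmodifiableSet(words);
    }

    // Read-only lookup; no locking is needed because nothing ever writes.
    boolean contains(String word) {
        return keep.contains(word);
    }
}
```

If the set were ever edited after construction, this reasoning would no longer hold and some form of synchronization would be needed again.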




> Capitalization Filter Factory
> -
>
> Key: SOLR-248
> URL: https://issues.apache.org/jira/browse/SOLR-248
> Project: Solr
>  Issue Type: New Feature
>Reporter: Ryan McKinley
>Priority: Minor
> Attachments: SOLR-248-CapitalizationFilter.patch
>
>
> For tokens that are used in faceting, it is nice to have standard 
> capitalization.  
> I want "Aerial views" and "Aerial Views" to both be: "Aerial Views"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Chris Hostetter

: I haven't looked at this specific code, but this is my preference in
: general.  multiple TokenFilters are created per-field instance on the
: index side, and per-query-term on the search side, so it's better to
: pull all the setup you can out of the Filter for performance reasons.

computation can be done at factory instantiation, but it can make sense to
put the code for the computation in static methods within the Filter class
itself -- so it's more reusable outside of Solr.
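
One way to read that suggestion, as a sketch only: the parsing logic lives in a static method on the filter class, and the factory calls it once and hands the result to every filter it creates. The names CapFilterSketch, CapFilterFactorySketch and parseKeep are hypothetical, and no Lucene or Solr classes are used here.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter: holds only precomputed state.
class CapFilterSketch {
    private final Set<String> keep;

    CapFilterSketch(Set<String> keep) {
        this.keep = keep;
    }

    // The setup logic is a static method on the filter class, so code
    // outside a factory can reuse it as well.
    static Set<String> parseKeep(String words) {
        Set<String> set = new HashSet<>();
        for (String w : words.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                set.add(w);
            }
        }
        return Collections.unmodifiableSet(set);
    }

    boolean isKept(String word) {
        return keep.contains(word);
    }
}

// Hypothetical factory: does the computation once, at instantiation.
class CapFilterFactorySketch {
    private final Set<String> keep;

    CapFilterFactorySketch(String keepParam) {
        this.keep = CapFilterSketch.parseKeep(keepParam);
    }

    // Creating a filter is cheap: nothing is parsed per instance.
    CapFilterSketch create() {
        return new CapFilterSketch(keep);
    }
}
```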



-Hoss



[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498834
 ] 

Yonik Seeley commented on SOLR-248:
---

> Why is so much of the logic in the Factory?

I haven't looked at this specific code, but this is my preference in general.  
Multiple TokenFilters are created per-field instance on the index side, and 
per-query-term on the search side, so it's better to pull all the setup you can 
out of the Filter for performance reasons.





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread J.J. Larrea (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498817
 ] 

J.J. Larrea commented on SOLR-248:
--

While I fully agree that faceting does raise some odd issues stemming from the 
display of normally invisible indexed values to humans, and that it should 
theoretically be the responsibility of the front end to translate index 
values into human-readable values, there are great practical advantages in both 
efficiency and convenience to making the indexed values "pretty", and to 
centralizing as much of that as possible in the Analysis stage.

In particular, I will try this and am very likely to put this into use this 
weekend, so thank you Ryan!  So I'm +1 to adding it to the Solr distribution, 
though to avoid confusing people it should have a JavaDoc comment explaining 
that the main use is in faceting to avoid having to introduce such common logic 
into the presentation-layer.

Regarding the implementation,

1. For 'keep' and 'okPrefix' (and were it not for backward-compatibility issues, 
for 'words' in StopFilter), it would be nice to have a means to specify either 
a direct list or a filename in the same parameter.  A simple approach might be 
something like keep="word word word..." vs. keep="



[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498726
 ] 

Ryan McKinley commented on SOLR-248:


> 
>> Implemented at the indexing level, I can have different values for the 
>> stored value and indexed terms.
> One downside is that it complicates certain things like wildcard or prefix 
> queries
>

Currently I'm using copyField and doing the prefix query on a different 
field... not great, but it works!

> 
> Of course, I think this might be a more general problem in faceting... what 
> to actually use as a label for display purposes vs what the terms in the 
> index were (think price formatting, labels for more complex facet queries, 
> etc).
> 

Interesting.  I could index with a lowercase filter then reformat the facet 
results...  I'll take a look at that after the deadline passes ;)





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498717
 ] 

Yonik Seeley commented on SOLR-248:
---

> Implemented at the indexing level, I can have different values for the stored 
> value and indexed terms.
One downside is that it complicates certain things like wildcard or prefix 
queries (capitalizing the first letter and lowercasing the second is something 
that the QueryParser does not support).

You could still store the values verbatim, and index as all lowercase.
Then the application could capitalize the results it gets back as it sees fit.
I do see value in pushing this type of logic back to the search engine though.
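
A minimal sketch of the alternative described above, assuming terms are indexed all-lowercase and the application capitalizes facet labels for display. The class and method names are hypothetical, and splitting words on single spaces is an assumption.

```java
import java.util.Locale;

// Hypothetical helper for the application layer; not part of Solr.
class FacetLabels {
    // What the index would hold under this approach: a lowercased term.
    static String indexForm(String raw) {
        return raw.toLowerCase(Locale.ROOT);
    }

    // Capitalize the first character of each space-separated word.
    static String capitalizeWords(String term) {
        StringBuilder out = new StringBuilder(term.length());
        boolean startOfWord = true;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            out.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = (c == ' ');
        }
        return out.toString();
    }
}
```

With this, "Aerial views" and "Aerial Views" both index as "aerial views" and both display as "Aerial Views". Exceptions such as a keep list would have to be handled in the application as well.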

Of course, I think this might be a more general problem in faceting... what to 
actually use as a label for display purposes vs what the terms in the index 
were (think price formatting, labels for more complex facet queries, etc).





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498711
 ] 

Ryan McKinley commented on SOLR-248:


It is a little strange, but (in my case anyway) I think it makes sense...  

I am indexing a bunch of metadata from a bunch of libraries (OAI-PMH) -- I want 
to display the data exactly as it came from the source, but for faceted 
browsing I need to normalize capitalization.

Implemented at the indexing level, I can have different values for the stored 
value and indexed terms.  Also, at the indexing level I can leverage existing 
Tokenizers and Filters to build the tokens that need capitalization.  It keeps 
all the configuration in schema.xml and lets the OAI -> Solr XML conversion be 
a simple transformation.  This way, whoever takes care of this need only learn 
Solr configuration, not ryan+solr configuration. 

If it is not generally useful I can keep it elsewhere - that is why we have the 
nice plugin framework!






[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498700
 ] 

Yonik Seeley commented on SOLR-248:
---

Hmmm, implementing this at the indexing level feels slightly strange.
What are the advantages/disadvantages vs. just lowercasing for indexing and 
capitalizing at the presentation/application layer?





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-23 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498492
 ] 

Ryan McKinley commented on SOLR-248:


> 
> 1) would it make sense for the keep option to refer to a file, using the same 
> format as StopFilter ... that way it's easy to reuse the same file (which 
> seems like it would be a common case).
> 

Probably.  That is a good idea.


> 2) what is the point of forceFirstLetter="true" ? ... if you want to force 
> capitalization, what's the point of making the keep list?
> 

This is one that came out of necessity!

With keep="the ..." and input:
 "Grand army of the Republic", "the arts"

I want: "Grand Army of the Republic" and "The Arts"

"forceFirstLetter" only applies to the first character in the token, not to 
each word.
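
To make the interaction concrete, here is a sketch of the behaviour as described above. It is not the code from the patch. The class name is hypothetical, the keep list is assumed to contain "of" as well as "the" (the original elides it as "the ..."), and keep matching is assumed to be case-sensitive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of keep + forceFirstLetter; not the patch itself.
class KeepAwareCapitalizer {
    private final Set<String> keep;
    private final boolean forceFirstLetter;

    KeepAwareCapitalizer(String keepWords, boolean forceFirstLetter) {
        this.keep = new HashSet<>(Arrays.asList(keepWords.trim().split("\\s+")));
        this.forceFirstLetter = forceFirstLetter;
    }

    String capitalize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (String word : token.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (keep.contains(word)) {
                // Words on the keep list pass through unchanged.
                out.append(word);
            } else if (!word.isEmpty()) {
                out.append(Character.toUpperCase(word.charAt(0)))
                   .append(word.substring(1).toLowerCase(Locale.ROOT));
            }
        }
        // Applies to the first character of the whole token only,
        // even if the first word is on the keep list.
        if (forceFirstLetter && out.length() > 0) {
            out.setCharAt(0, Character.toUpperCase(out.charAt(0)));
        }
        return out.toString();
    }
}
```

Under these assumptions, "the arts" comes out as "The Arts" with forceFirstLetter enabled and as "the Arts" without it, which is why both options are useful together.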


> 3) is okPrefix going to force the case for things that have that prefix in an 
> alternate case, or only allow that casing to remain (i.e., if I index McKeen, 
> Mckeen, mckeen and MCKEEN, what tokens do I wind up with?)
> 

As written, if the prefix matches, it assumes the word capitalization is 
correct.  For my input data, this is sufficient -- but it should probably do 
something smarter.

So, if you index "McKeen, Mckeen, mckeen, MCKEEN and McKEEN", you would get:

 "McKeen, Mckeen, Mckeen, Mckeen And McKEEN"

If "okPrefix" was treated as *the* capitalization for input where the lowercase 
prefix matches "mck", it would give:

 "McKeen, McKeen, McKeen, McKeen And McKeen"

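The two behaviours can be compared side by side in a small sketch. This is not the patch's implementation: the class and method names are hypothetical, and the prefix value "McK" is assumed from the examples above.

```java
import java.util.Locale;

// Hypothetical comparison of the two okPrefix interpretations.
class OkPrefixSketch {
    // "As written": a word that starts with the prefix in exactly that
    // casing is assumed correct and left untouched.
    static String trustPrefix(String word, String okPrefix) {
        if (word.startsWith(okPrefix)) {
            return word;
        }
        return simpleCap(word);
    }

    // Alternative: treat the prefix as *the* capitalization whenever it
    // matches ignoring case, and lowercase the remainder of the word.
    static String enforcePrefix(String word, String okPrefix) {
        if (word.regionMatches(true, 0, okPrefix, 0, okPrefix.length())) {
            return okPrefix
                + word.substring(okPrefix.length()).toLowerCase(Locale.ROOT);
        }
        return simpleCap(word);
    }

    // Uppercase the first character, lowercase the rest.
    static String simpleCap(String word) {
        if (word.isEmpty()) {
            return word;
        }
        return Character.toUpperCase(word.charAt(0))
            + word.substring(1).toLowerCase(Locale.ROOT);
    }
}
```

Under the first method "McKEEN" survives unchanged and "MCKEEN" becomes "Mckeen". Under the second, both become "McKeen".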





[jira] Commented: (SOLR-248) Capitalization Filter Factory

2007-05-23 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498488
 ] 

Hoss Man commented on SOLR-248:
---

1) would it make sense for the keep option to refer to a file, using the same 
format as StopFilter ... that way it's easy to reuse the same file (which seems 
like it would be a common case).

2) what is the point of forceFirstLetter="true" ? ... if you want to force 
capitalization, what's the point of making the keep list?

3) is okPrefix going to force the case for things that have that prefix in an 
alternate case, or only allow that casing to remain (i.e., if I index McKeen, 
Mckeen, mckeen and MCKEEN, what tokens do I wind up with?)
