[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498492 ]

Ryan McKinley commented on SOLR-248:

> 1) would it make sense for the keep option to refer to a file, using the same format as StopFilter ... that way it's easy to reuse the same file (which seems like it would be a common case).

Probably. That is a good idea.

> 2) what is the point of forceFirstLetter=true ? ... if you want to force capitalization, what's the point of making the keep list?

This is one that came of necessity! With keep=the ... and input:

  Grand army of the Republic, the arts

I want:

  Grand Army of the Republic and The Arts

forceFirstLetter only applies to the first character in the token, not to each word.

> 3) is okPrefix going to force the case for things that have that prefix in an alternate case, or only allow that casing to remain (ie: if i index McKeen, Mckeen, mckeen and MCKEEN what tokens do i wind up with?)

As written, if the prefix matches, it assumes the word capitalization is correct. For my input data, this is sufficient -- but it should probably do something smarter.

So, if you index McKeen, Mckeen, mckeen, MCKEEN and McKEEN, you would get:

  McKeen, Mckeen, Mckeen, Mckeen and McKEEN

If okPrefix was treated as *the* capitalization for input where the lowercase prefix matches mck, it would give:

  McKeen, McKeen, McKeen, McKeen and McKeen

Capitalization Filter Factory

Key: SOLR-248
URL: https://issues.apache.org/jira/browse/SOLR-248
Project: Solr
Issue Type: New Feature
Reporter: Ryan McKinley
Priority: Minor
Attachments: SOLR-248-CapitalizationFilter.patch

For tokens that are used in faceting, it is nice to have standard capitalization. I want "Aerial views" and "Aerial Views" to both be: Aerial Views

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
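The keep / forceFirstLetter / okPrefix behavior described in the comment above can be sketched in plain Java. This is only an illustration of the rules as stated in the thread, not the code in the attached patch: the class name CapitalizationSketch, its constructor, and the choice to match keep words case-insensitively and leave them unchanged are all assumptions made for the example. The keep list used in the usage note ("of", "the") is also an assumption, since the comment elides the full list.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the filter logic discussed in SOLR-248.
final class CapitalizationSketch {
    private final Set<String> keep;         // lowercase words that are left alone (assumption)
    private final List<String> okPrefixes;  // case-sensitive prefixes whose casing is trusted
    private final boolean forceFirstLetter; // uppercase the first character of the whole token

    CapitalizationSketch(Set<String> keep, List<String> okPrefixes, boolean forceFirstLetter) {
        this.keep = keep;
        this.okPrefixes = okPrefixes;
        this.forceFirstLetter = forceFirstLetter;
    }

    // Processes one token, which may contain several space-separated words.
    String process(String token) {
        String[] words = token.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(processWord(words[i]));
        }
        // Applies to the first character of the token only, not to each word.
        if (forceFirstLetter && out.length() > 0) {
            out.setCharAt(0, Character.toUpperCase(out.charAt(0)));
        }
        return out.toString();
    }

    private String processWord(String word) {
        if (word.isEmpty()) {
            return word;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        if (keep.contains(lower)) {
            return word; // keep words pass through unchanged
        }
        for (String prefix : okPrefixes) {
            if (word.startsWith(prefix)) {
                return word; // prefix matched: assume the existing casing is correct
            }
        }
        return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
    }
}
```

With keep = {of, the}, okPrefixes = [McK] and forceFirstLetter = true, process("the arts") gives "The Arts", and the five McKeen variants come out as McKeen, Mckeen, Mckeen, Mckeen and McKEEN, matching the "as written" behavior described above.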
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498700 ]

Yonik Seeley commented on SOLR-248:

Hmmm, this feels slightly strange to implement at the indexing level. What are the advantages/disadvantages vs. just lowercasing for indexing and capitalizing at the presentation/application layer?
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498711 ]

Ryan McKinley commented on SOLR-248:

It is a little strange, but (in my case anyway) I think it makes sense...

I am indexing a bunch of metadata from a bunch of libraries (OAI-PMH) -- I want to display the data exactly as it came from the source, but for faceted browsing I need to normalize capitalization. Implemented at the indexing level, I can have different values for the stored value and indexed terms.

Also, at the indexing level I can leverage existing Tokenizers and Filters to build the tokens that need capitalization -- it keeps all the configuration in schema.xml and lets the OAI-to-Solr XML step be a simple transformation. This way, whoever takes care of this need only learn Solr configuration, not ryan+solr configuration.

If it is not generally useful I can keep it elsewhere - that is why we have the nice plugin framework!
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498717 ]

Yonik Seeley commented on SOLR-248:

> Implemented at the indexing level, I can have different values for the stored value and indexed terms.

One downside is that it complicates certain things like wildcard or prefix queries (capitalizing the first letter and lowercasing the second is something that the QueryParser does not support).

You could still store the values verbatim, and index as all lowercase. Then the application could capitalize the results it gets back as it sees fit.

I do see value in pushing this type of logic back to the search engine though. Of course, I think this might be a more general problem in faceting... what to actually use as a label for display purposes vs. what the terms in the index were (think price formatting, labels for more complex facet queries, etc.).
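The prefix-query downside mentioned above can be illustrated without any Lucene or Solr classes. The sketch below is a simplification: it stands in for a prefix query with String.startsWith over a list of indexed terms, and the class and method names are hypothetical. It only shows why a lowercased prefix stops matching once the indexed terms carry mixed case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; real prefix queries enumerate terms in the index.
final class PrefixQuerySketch {
    // Returns the indexed terms that start with the given prefix (case-sensitive).
    static List<String> matchPrefix(List<String> indexedTerms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Normalization a query layer can apply easily when the index is all lowercase.
    static String lowercase(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

If the index holds "Aerial Views", the prefix "aerial v" finds nothing, while an index holding "aerial views" matches any user input once it is lowercased. Reproducing per-word capitalization on a partial prefix is the harder case.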
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498817 ]

J.J. Larrea commented on SOLR-248:

While I fully agree that faceting does raise some odd issues stemming from the display of normally-invisible indexed values to humans, and that it theoretically should be the responsibility of the front-end to translate index values into human-readable values, there are great practical advantages in both efficiency and convenience to making the indexed values pretty, and to centralizing as much of that as possible in the Analysis stage. In particular, I will try this and am very likely to put this into use this weekend, so thank you Ryan!

So I'm +1 to adding it to the Solr distribution, though to avoid confusing people it should have a JavaDoc comment explaining that the main use is in faceting, to avoid having to introduce such common logic into the presentation layer.

Regarding the implementation,

1. For 'keep' and 'okPrefix' (and were it not for backward-compatibility issues, for 'words' in StopFilter), it would be nice to have a means to specify either a direct list or a filename in the same parameter. A simple approach might be something like keep=word word word... vs. keep=file, or even keep=file file word word (with the requirement for backslash-escaping spaces in either)... Or alternately something like txt:filename (vs. xml:filename, json:filename, etc.) with an unescaped : being significant.

2. Why is so much of the logic in the Factory? This drags Solr-specific stuff in when a user might want to use just the Analyzer in a non-Solr context. Wouldn't it be better in general for Solr Analyzers to be self-complete, with the Factory merely being an adaptor between SolrParams external resources and the Analyzer's constructor? Also, why is keep in a synchronized map, since there is no mutator? (I know, picky picky...)

Good luck with the deadline!
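The "txt:filename" idea from point 1 above could look roughly like the following. This is a sketch of the proposal only; nothing like it is confirmed to exist in the patch or in Solr. The class WordListParam, the resolve method, and the fileLoader callback are hypothetical names, and backslash-escaping of spaces or colons is deliberately left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: one parameter value, either an inline list or a file reference.
final class WordListParam {
    private static final String FILE_SCHEME = "txt:";

    // "txt:NAME" is handed to fileLoader; anything else is split on whitespace.
    static List<String> resolve(String value, Function<String, List<String>> fileLoader) {
        String trimmed = value.trim();
        if (trimmed.startsWith(FILE_SCHEME)) {
            return fileLoader.apply(trimmed.substring(FILE_SCHEME.length()));
        }
        List<String> words = new ArrayList<>();
        for (String word : trimmed.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

Passing the loader in as a function keeps file access out of the parsing step, so the same method can be exercised with an in-memory stub or with a real resource loader.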
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498834 ]

Yonik Seeley commented on SOLR-248:

> Why is so much of the logic in the Factory?

I haven't looked at this specific code, but this is my preference in general. Multiple TokenFilters are created per-field instance on the index side, and per-query-term on the search side, so it's better to pull all the setup you can out of the Filter for performance reasons.
Re: [jira] Commented: (SOLR-248) Capitalization Filter Factory
: I haven't looked at this specific code, but this is my preference in
: general. multiple TokenFilters are created per-field instance on the
: index side, and per-query-term on the search side, so it's better to
: pull all the setup you can out of the Filter for performance reasons.

Computation can be done at factory instantiation, but it can make sense to put the code for the computation in static methods within the Filter class itself -- so it's more reusable outside of Solr.

-Hoss
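The split suggested here, where the factory runs the setup once but the setup code lives in a static method on the filter class, can be sketched with plain Java stand-ins. Both class names below are hypothetical and neither extends any Lucene or Solr type; the point is only where the parsing code lives versus when it runs.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a filter: cheap to construct, no parsing per instance.
final class KeepWordFilterSketch {
    private final Set<String> keep;

    KeepWordFilterSketch(Set<String> keep) {
        this.keep = keep;
    }

    // Setup logic kept on the filter class so it can be reused without the factory.
    static Set<String> parseKeepWords(String spec) {
        Set<String> words = new HashSet<>();
        for (String word : spec.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return Collections.unmodifiableSet(words);
    }

    boolean isKept(String word) {
        return keep.contains(word.toLowerCase(Locale.ROOT));
    }
}

// Hypothetical stand-in for a factory: parses once, then hands the result to every filter.
final class KeepWordFilterFactorySketch {
    private final Set<String> keep;

    KeepWordFilterFactorySketch(String keepSpec) {
        this.keep = KeepWordFilterSketch.parseKeepWords(keepSpec);
    }

    KeepWordFilterSketch create() {
        return new KeepWordFilterSketch(keep);
    }
}
```

Code outside Solr can call KeepWordFilterSketch.parseKeepWords directly and construct the filter itself, while the factory still pays the parsing cost only once however many filters it creates.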
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498841 ]

Ryan McKinley commented on SOLR-248:

> Why is so much of the logic in the Factory?

It seemed silly to copy the same things over and over for each time the type is indexed or queried...

> why is keep in a synchronized map,

I'm not sure it needs to be, but I was being cautious... the map is only created once (and never edited) but could be accessed by many threads simultaneously.
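On the synchronized-map question: a collection that is fully built in a constructor, stored in a final field, and never modified afterwards can be read by many threads without synchronization. The sketch below shows that pattern with a read-only set; the class ReadOnlyKeepSet and its helper method are hypothetical and not taken from the patch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for a keep list that is built once and only read afterwards.
final class ReadOnlyKeepSet {
    // final field + defensive copy + unmodifiable view: no mutator exists after construction.
    private final Set<String> keep;

    ReadOnlyKeepSet(Set<String> words) {
        this.keep = Collections.unmodifiableSet(new HashSet<>(words));
    }

    boolean contains(String word) {
        return keep.contains(word);
    }

    // Runs several reader threads against the same set and returns the total number of hits.
    static int countHitsConcurrently(ReadOnlyKeepSet set, String word, int threads, int lookupsPerThread) {
        AtomicInteger hits = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < lookupsPerThread; j++) {
                    if (set.contains(word)) {
                        hits.incrementAndGet();
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for readers", e);
        }
        return hits.get();
    }
}
```

The unmodifiable wrapper also makes the "never edited" intent explicit: any later attempt to add to the set through that reference fails with UnsupportedOperationException.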
[jira] Commented: (SOLR-248) Capitalization Filter Factory
[ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498488 ]

Hoss Man commented on SOLR-248:

1) would it make sense for the keep option to refer to a file, using the same format as StopFilter ... that way it's easy to reuse the same file (which seems like it would be a common case).

2) what is the point of forceFirstLetter=true ? ... if you want to force capitalization, what's the point of making the keep list?

3) is okPrefix going to force the case for things that have that prefix in an alternate case, or only allow that casing to remain (ie: if i index McKeen, Mckeen, mckeen and MCKEEN what tokens do i wind up with?)