Chaiyasit (Sit) Manovit created SOLR-5362:
---------------------------------------------
Summary: SolrCell's order of field operation with lowernames=true
Key: SOLR-5362
URL: https://issues.apache.org/jira/browse/SOLR-5362
Project: Solr
Issue Type: Improvement
Components: contrib - Solr Cell (Tika extraction)
Reporter: Chaiyasit (Sit) Manovit
This follows from SOLR-1634.
I am not sure if SOLR-1856 completely fixes SOLR-1634, particularly when
{{lowernames=true}} comes into the picture. Consider a case where:
1. Tika generates the field {{Category=Foo}} for a doc (e.g., from
user-defined document properties).
2. {{literalsOverride=true}} is set.
3. {{lowernames=true}} is set.
4. The user supplies {{literal.category=bar}}.
According to the
[rules|http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations],
{{literalsOverride}} is applied before {{lowernames}} and thus has no effect
here: at that stage the field {{Category}} from Tika and {{literal.category}}
are still considered different fields. When {{lowernames=true}} then kicks in,
it merges {{Category}} into {{category}}, giving it both values {{Foo}} and
{{bar}}.
Adding {{fmap.Category=tika_category}} does not help because {{fmap}} is
applied even later. By that time, {{category}} already contains both {{Foo}}
and {{bar}}.
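To make the scenario concrete, here is a minimal sketch in Python of the order described above. It is a toy model based only on this description, not SolrCell's actual code: the function name {{extract}} and its parameters are hypothetical, and {{lowernames}} is modeled as plain lowercasing of field names.

```python
def extract(tika_fields, literals, literals_override=True,
            lowernames=True, fmap=None):
    """Toy model of the static field-operation order (not SolrCell code)."""
    fmap = fmap or {}
    doc = {}
    # Step 1: collect Tika fields, then literals. literalsOverride only
    # replaces a Tika field whose name matches the literal exactly.
    for name, value in tika_fields.items():
        doc.setdefault(name, []).append(value)
    for name, value in literals.items():
        if literals_override:
            doc[name] = [value]
        else:
            doc.setdefault(name, []).append(value)
    # Step 2: lowernames folds field names, merging names that differ
    # only by case (assumption: modeled as str.lower() only).
    if lowernames:
        lowered = {}
        for name, values in doc.items():
            lowered.setdefault(name.lower(), []).extend(values)
        doc = lowered
    # Step 3: fmap renames fields, matching the names as they are now.
    mapped = {}
    for name, values in doc.items():
        mapped.setdefault(fmap.get(name, name), []).extend(values)
    return mapped


tika = {"Category": "Foo"}
literal = {"category": "bar"}

# The literal does not override: both values end up in "category".
print(extract(tika, literal))
# fmap.Category no longer matches once names have been lowercased.
print(extract(tika, literal, fmap={"Category": "tika_category"}))
```

In this model, both calls leave {{category}} holding {{Foo}} and {{bar}}, which matches the behavior reported here.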
Combining {{fmap.Category=tika_category}} with {{lowernames=false}} would work
(regardless of {{literalsOverride}}), but what if we need {{lowernames=true}},
or the capitalization of {{Category}} can vary (e.g., {{CATEGORY}})?
Would it make sense to have an option to apply the rules in the order that they
are specified in the config file or URL params rather than always in a static
order?
Thanks.
PS. Marking this as Major because there seems to be no easy workaround
(condition for Minor).
------------------------
Response from Jan Høydahl
([link|https://issues.apache.org/jira/browse/SOLR-1634?focusedCommentId=13797273&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13797273]):
bq. To me it sounds like a potential, very simple solution would be to apply
lowercasing at several places if {{lowernames=true}}
Agreed. In particular, {{lowernames=true}} should be applied as soon as Tika
has extracted a field, before {{literalsOverride}} is even considered.
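As a rough illustration of that suggestion, the sketch below lowercases Tika field names first and only then applies {{literalsOverride}}. It is a hypothetical model in Python, not a patch: {{extract_lowercase_first}} and its parameters are made up for this example, and {{lowernames}} is again reduced to plain lowercasing.

```python
def extract_lowercase_first(tika_fields, literals,
                            literals_override=True, lowernames=True):
    """Toy model: apply lowernames to Tika fields before literalsOverride."""
    doc = {}
    # Lowercase Tika field names as soon as they are extracted.
    for name, value in tika_fields.items():
        key = name.lower() if lowernames else name
        doc.setdefault(key, []).append(value)
    # Literals now see the already-lowercased Tika names, so an exact
    # name match is enough for the override to take effect.
    for name, value in literals.items():
        if literals_override:
            doc[name] = [value]
        else:
            doc.setdefault(name, []).append(value)
    return doc


# The literal wins regardless of how Tika capitalized the field name.
print(extract_lowercase_first({"Category": "Foo"}, {"category": "bar"}))
print(extract_lowercase_first({"CATEGORY": "Foo"}, {"category": "bar"}))
```

With this ordering, {{category}} holds only {{bar}} in both calls, and setting {{literals_override=False}} in the model still appends the literal to the Tika value.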