[jira] [Updated] (SOLR-2802) Toolkit of UpdateProcessors for modifying document values

Hoss Man (Updated) (JIRA) Wed, 07 Dec 2011 18:25:03 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-2802:
---------------------------

    Attachment: SOLR-2802_update_processor_toolkit.patch

I had some time to revisit this issue again today.

Improvements in this patch:

* exclude options - you can now specify one or more sets of "exclude" lists, 
which are parsed just like the main list of field specifiers (examples below)
* improved defaults for ConcatFieldUpdateProcessorFactory - default behavior is 
now to only concat values for fields that the schema says are multiValued=false 
and (StrField or TextField)
* new RemoveBlankFieldUpdateProcessorFactory - removes any zero-length 
CharSequence values it finds; by default it looks at all fields
* new FieldLengthUpdateProcessorFactory - replaces any CharSequence values it 
finds with their length; by default it looks at no fields

As part of this work, I tweaked the abstract classes so that the base 
assumption about which fields a subclass should match by default is still "all 
fields", but it's easy for subclasses to override this. The user still has the 
final say, and the abstract class handles that, but if the user doesn't 
configure anything the subclass can easily say "my default should be ___".
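
As a rough illustration of that precedence (user configuration first, then the 
subclass default, with exclusions applied on top), here is a minimal Java 
sketch. Every class and method name in it is hypothetical and is not taken 
from the patch; it only models the selection logic described above.

```java
import java.util.function.Predicate;

/** Hypothetical base class; names are illustrative, not the patch's API. */
abstract class FieldMutatingSketch {
    private final Predicate<String> userSelector;   // null: user configured no selector
    private final Predicate<String> userExclusions; // null: user configured no excludes

    FieldMutatingSketch(Predicate<String> userSelector, Predicate<String> userExclusions) {
        this.userSelector = userSelector;
        this.userExclusions = userExclusions;
    }

    /** Base default is "all fields"; subclasses may override it. */
    protected Predicate<String> defaultSelector() {
        return name -> true;
    }

    /** The user's selector wins if present; exclusions apply either way. */
    final boolean shouldMutate(String fieldName) {
        Predicate<String> selector = (userSelector != null) ? userSelector : defaultSelector();
        boolean excluded = userExclusions != null && userExclusions.test(fieldName);
        return selector.test(fieldName) && !excluded;
    }
}

/** Keeps the inherited "all fields" default. */
class TrimSketch extends FieldMutatingSketch {
    TrimSketch(Predicate<String> sel, Predicate<String> excl) { super(sel, excl); }
}

/** Says "my default should be no fields". */
class LengthSketch extends FieldMutatingSketch {
    LengthSketch(Predicate<String> sel, Predicate<String> excl) { super(sel, excl); }

    @Override
    protected Predicate<String> defaultSelector() {
        return name -> false;
    }
}
```

With no configuration, `new TrimSketch(null, null)` matches every field while 
`new LengthSketch(null, null)` matches none; passing a selector to either one 
replaces the default.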

bq. I think I don't completely follow the explicit ruling

I explained myself really terribly before - I was conflating what should 
really be two orthogonal things:

1) the *field names* that a processor looks at -- the user should have lots of 
options for configuring the field selector explicitly, and if they don't, then 
a sensible default based on the specifics of the processor should be applied, 
and the user should still have the ability to configure exclusion rules on top 
of that default

2) the *value types* that a processor will deal with -- regardless of what 
field names a processor is configured with, it should be logical about the 
types of values it finds in those fields.  The FieldLengthUpdateProcessorFactory 
I just added, for example, only pays attention to values that are CharSequence. 
If the SolrInputField already contained an Integer, it wouldn't make sense to 
toString() that and then find the length of that String value.
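
A minimal Java sketch of that value-type check, operating on a plain list of 
values rather than a real SolrInputField. The class and method names are 
hypothetical and only illustrate the behavior described for the length and 
remove-blank processors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers; they model the described behavior, not the patch's code. */
class ValueTypeSketch {

    /** Replace each CharSequence with its length; leave all other values untouched. */
    static List<Object> lengths(List<Object> values) {
        List<Object> out = new ArrayList<>();
        for (Object v : values) {
            if (v instanceof CharSequence) {
                out.add(((CharSequence) v).length());
            } else {
                out.add(v); // e.g. an Integer is kept as-is, never toString()'d
            }
        }
        return out;
    }

    /** Drop zero-length CharSequence values; keep everything else. */
    static List<Object> removeBlank(List<Object> values) {
        List<Object> out = new ArrayList<>();
        for (Object v : values) {
            boolean blank = v instanceof CharSequence && ((CharSequence) v).length() == 0;
            if (!blank) {
                out.add(v);
            }
        }
        return out;
    }
}
```

Given the values `"abc"`, `42` and `""`, `lengths` yields `3`, `42` and `0`: 
the Integer passes through unchanged instead of becoming `2`.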

bq. I think Date/Number parsing should only be done on compatible fields only. 
I think if a subsequent parser moves / renames fields, then this processor 
should have been configured before the processor that does the Date/Number 
parsing.

But that could easily lead to a chicken-vs-egg problem.  I think ideally you 
should be able to have field names in your SolrInputDocuments (and in your 
processor configurations) that don't exist in your schema at all, so you can 
have "transitory" names that exist purely for passing info around.

Imagine a situation where you want to let clients submit documents containing a 
"publishDate" field, but you want to be able to cleanly accept real Date 
objects (from java clients) or Strings in a variety of formats, and then you 
want the final index to contain two versions of that date: one indexed 
TrieDateField called "pubDate", and one non-indexed StrField called 
"prettyDate" -- i.e., there is no "publishDate" in your schema at all.  You 
could then configure some "ParseDateFieldUpdateProcessor" on the "publishDate" 
even though that field name isn't in your schema, so that you have consistent 
Date objects, and then use a CloneFieldUpdateProcessor and/or 
RenameFieldUpdateProcessor to get that Date object into both your "pubDate" and 
"prettyDate" fields, and then use some sort of FormatDateFieldUpdateProcessor 
on the "prettyDate" field.
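
To make the data flow concrete, here is a rough Java sketch of that chain 
applied to a plain Map standing in for the document. Nothing here is Solr API: 
the class name, the accepted input formats and the "pretty" output format are 
all assumptions chosen for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

/** Hypothetical stand-in for a parse -> clone/rename -> format processor chain. */
class PublishDateSketch {

    private static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.ENGLISH);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        f.setLenient(false);
        return f;
    }

    /** "Parse" step: accept a real Date, or a String in one of the assumed formats. */
    static Date parse(Object raw) {
        if (raw instanceof Date) {
            return (Date) raw;
        }
        // Assumed formats; the most specific pattern is tried first.
        for (String p : new String[] {"yyyy-MM-dd'T'HH:mm:ss'Z'", "yyyy-MM-dd"}) {
            try {
                return utc(p).parse(raw.toString());
            } catch (ParseException e) {
                // fall through and try the next pattern
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + raw);
    }

    static Map<String, Object> process(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        if (!out.containsKey("publishDate")) {
            return out; // nothing to do
        }
        // "publishDate" is transitory: it never reaches the index.
        Date d = parse(out.remove("publishDate"));
        out.put("pubDate", d);                                // "rename" step
        out.put("prettyDate", utc("MMMM d, yyyy").format(d)); // "clone" + "format" steps
        return out;
    }
}
```

A document submitted with `publishDate` set to either a `Date` or the String 
`"2011-12-07"` comes out with a `Date` in `pubDate`, a formatted String in 
`prettyDate`, and no `publishDate` field.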

There may be other solutions to that type of problem, but I guess the bottom 
line from my perspective is: why bother making a processor deliberately fail 
when the user configures it to do something unexpected but still viable?  If 
they want to parse Strings -> Dates on a TrieIntField, why not just let them do 
it?  Maybe they've got another processor later that is going to convert that 
Date to "days since epoch" as an integer?


{panel}
Examples of the exclude configuration...

{code}
<updateRequestProcessorChain name="trim-few">
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">foo.*</str>
    <str name="fieldRegex">bar.*</str>
    <!-- each set of exclusions is checked independently -->
    <lst name="exclude">
      <str name="typeClass">solr.DateField</str>
    </lst>
    <lst name="exclude">
      <str name="fieldRegex">.*HOSS.*</str>
    </lst>
  </processor>
</updateRequestProcessorChain>
<updateRequestProcessorChain name="trim-some">
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">foo.*</str>
    <str name="fieldRegex">bar.*</str>
    <!-- only excluded if it matches all in set -->
    <lst name="exclude">
      <str name="typeClass">solr.DateField</str>
      <str name="fieldRegex">.*HOSS.*</str>
    </lst>
  </processor>
</updateRequestProcessorChain>
{code}

In the "trim-few" case, fields will be excluded if they are DateFields _or_ 
their names match the "HOSS" regex.  In the "trim-some" case, fields will be 
excluded only if they are _both_ a DateField _and_ match the "HOSS" regex.
{panel}
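
The any-set / all-rules semantics of the exclude lists can be sketched in Java 
as follows. This is only a model of the rule described in the panel: the Field 
class, the string comparison on the type class, the whole-name regex matching, 
and the "any include rule matches" behavior are assumptions, not the patch's 
implementation.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical model of include/exclude selection. */
class ExcludeSketch {

    /** Stand-in for a schema field: a name plus the class name of its type. */
    static final class Field {
        final String name;
        final String typeClass;

        Field(String name, String typeClass) {
            this.name = name;
            this.typeClass = typeClass;
        }
    }

    /** Assumption: the regex must match the whole field name. */
    static Predicate<Field> nameMatches(String regex) {
        Pattern p = Pattern.compile(regex);
        return f -> p.matcher(f.name).matches();
    }

    /** Assumption: type classes are compared by name only. */
    static Predicate<Field> typeIs(String typeClass) {
        return f -> typeClass.equals(f.typeClass);
    }

    /** Excluded when ANY exclusion set has ALL of its rules matching. */
    static boolean excluded(Field f, List<List<Predicate<Field>>> excludeSets) {
        return excludeSets.stream()
                .anyMatch(set -> set.stream().allMatch(rule -> rule.test(f)));
    }

    /** Assumption: a field is included when any include rule matches it. */
    static boolean selected(Field f,
                            List<Predicate<Field>> includes,
                            List<List<Predicate<Field>>> excludeSets) {
        return includes.stream().anyMatch(rule -> rule.test(f)) && !excluded(f, excludeSets);
    }
}
```

Modeling "trim-few" as two single-rule sets and "trim-some" as one two-rule 
set, a StrField named "foo_HOSS" is excluded by the former but still selected 
by the latter.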
                
> Toolkit of UpdateProcessors for modifying document values
> ---------------------------------------------------------
>
>                 Key: SOLR-2802
>                 URL: https://issues.apache.org/jira/browse/SOLR-2802
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: SOLR-2802_update_processor_toolkit.patch, 
> SOLR-2802_update_processor_toolkit.patch
>
>
> Users frequently ask questions where the answer is "you could do it with an 
> UpdateProcessor", but the number of out-of-the-box UpdateProcessors is 
> generally lacking, and there aren't even very good base classes for the 
> common case of manipulating field values when adding documents.
