[jira] [Commented] (SOLR-2802) Toolkit of UpdateProcessors for modifying document values

Hoss Man (Commented) (JIRA) Fri, 30 Sep 2011 17:24:09 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118594#comment-13118594
 ]


Hoss Man commented on SOLR-2802:
--------------------------------

bq. I already have a FieldCopy processor which can copy/move fields,

Jan: Yeah ... I designed the base class around the assumption that we would 
come up with a good "clone fields" processor in SOLR-2599, so that these 
processors can simply modify the values "in place" and people can clone/rename 
fields as needed before using them.

bq. With SOLR-2599, I imagine we could take copyField's out of schema.xml,

Erik: I actually consider them very orthogonal.  Supporting cloning/copying in 
an update processor is a way of saying "when docs are added to the index using 
this Update Chain, take these actions on the fields" but copyField in 
schema.xml is a way of saying "no matter where this doc comes from, the value 
of field X should also be put in field Y"

bq. Before we get too carried away, what about making this even more general 
purpose with scripting, ala SOLR-1725 ?

We definitely should get the Script Processor in for people who don't know Java 
but have specific goals, but we shouldn't let support for scripting prevent us 
from implementing some of the more commonly requested actions in Java - there's 
a fine line between "you _can_ write scripts to do _anything_ you want" and 
"you _have_ to write scripts to do _everything_ you want".

bq. There's one other update processor that perhaps could fit within this 
framework and become something generally useful in Solr - SOLR-1280

I actually looked at that one before i started.  Because of the "modify in 
place" nature of this base class, it didn't really seem like a good fit to try 
to refactor that one to be a subclass.

bq. I think in general that processors should match nothing by default. Could 
lead to unexpected behaviour for users in the long run.

Martijn: I kept going back and forth on this while i was working on it.  
Ultimately my thought process was that it didn't really make sense for the 
"default" to be a No-Op because if that's the case then what's the point of 
having a default at all?

And if we're going to require that they provide at least one of the field 
selectors, and we want to offer them syntactic sugar for "match all fields", 
why not make it the shortest sugar possible?

I figured it would make sense for the base class to assume that "no args" meant 
let the subclass see all of the fields/values -- and the subclasses could 
enforce their own default rules as needed, ala...
* implicitly...
** in the TrimFieldUpdateProcessorFactory attached, it ignores anything that 
isn't an instance of String -- regardless of how it's configured (so it doesn't 
call toString() on an Integer and then try to trim that)
* explicitly
** i imagine that Date/Number parsing update processors should default to only 
trying to parse fields where the FieldType extends DateField/TrieField (the 
Concat processor should probably do the same for StrField fields configured to 
be multiValued=false now that i think about it).  But unlike how the Trim 
processor works, if they are explicitly configured to parse fields named 
"foo.*" they should try to do so regardless of what the field type/settings 
might be, because maybe a subsequent processor will rename/move those fields 
in the input docs to something that is expecting a Date/Number (or does support 
multivalued fields)
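
To make the "implicit" rule concrete, here is a rough sketch in plain Java of what "no args means the subclass sees every field, but Trim only touches String values" could look like. It is not the code in the attached patch: the class name {{TrimSketch}}, the {{selected}} helper, and the use of a plain Map in place of a SolrInputDocument are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not the classes in the attached patch. */
final class TrimSketch {

    /** A null pattern stands for "no selector configured", treated here as match-all. */
    static boolean selected(String fieldName, Pattern fieldRegex) {
        return fieldRegex == null || fieldRegex.matcher(fieldName).matches();
    }

    /** Returns a copy of doc where String values of selected fields are trimmed. */
    static Map<String, List<Object>> trim(Map<String, List<Object>> doc, Pattern fieldRegex) {
        Map<String, List<Object>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> e : doc.entrySet()) {
            boolean fieldSelected = selected(e.getKey(), fieldRegex);
            List<Object> values = new ArrayList<>();
            for (Object v : e.getValue()) {
                // Implicit rule: non-String values pass through untouched,
                // so toString() is never called on an Integer just to trim it.
                boolean trimIt = fieldSelected && v instanceof String;
                values.add(trimIt ? ((String) v).trim() : v);
            }
            out.put(e.getKey(), values);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new LinkedHashMap<>();
        doc.put("title", Arrays.<Object>asList("  Hello  ", 42));
        // No selector configured, so every field is offered to the trim logic.
        System.out.println(trim(doc, null));
    }
}
```

Passing a non-null pattern narrows which fields are considered, but the String-only check still applies either way, which is the difference from the "explicit" case described for the Date/Number processors.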

what do you think?

the scenario that still bothers me about all this is that if we put something 
like this in the example schema...

{code}
<updateRequestProcessorChain name="simple" default="true">
 <processor class="solr.TrimFieldUpdateProcessorFactory" />
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}

...(so all strings get trimmed) someone might say "Hey, stop trimming my 
strings!" and it's easy for them to remove that from the example.  But someone 
else might say: "This is exactly what i want _most_ of the time, but I've got 
this one field where whitespace matters, stop trimming that one." -- and now 
they've got to jump through a lot of hoops to keep the trim behavior on all but 
one field (unless we add some sort of exclusion option(s)).  Even if we make 
some field selection args mandatory for the processor and use this instead...

{code}
<updateRequestProcessorChain name="simple" default="true">
 <processor class="solr.TrimFieldUpdateProcessorFactory">
   <str name="fieldRegex">.*</str>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}

...that user still has the same amount of pain to deal with.
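
For the sake of discussion, the "exclusion option(s)" idea could be as small as the sketch below: a selector that matches by regex (or matches everything when no regex is given) and then vetoes a list of exact field names. Everything here is hypothetical, including the class name {{ExcludingSelectorSketch}}, the {{shouldMutate}} method, and the {{raw_text}} field; none of it is in the attached patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical selector for illustration; names are not taken from the patch. */
final class ExcludingSelectorSketch {
    private final Pattern include;       // null means "match every field"
    private final Set<String> excluded;  // exact field names to leave untouched

    ExcludingSelectorSketch(Pattern include, String... excludedNames) {
        this.include = include;
        this.excluded = new HashSet<>(Arrays.asList(excludedNames));
    }

    /** Exclusions win over inclusion, so one field can opt out of a match-all rule. */
    boolean shouldMutate(String fieldName) {
        if (excluded.contains(fieldName)) {
            return false;
        }
        return include == null || include.matcher(fieldName).matches();
    }

    public static void main(String[] args) {
        // Trim everything except the one field where whitespace matters.
        ExcludingSelectorSketch selector = new ExcludingSelectorSketch(null, "raw_text");
        System.out.println("title: " + selector.shouldMutate("title"));
        System.out.println("raw_text: " + selector.shouldMutate("raw_text"));
    }
}
```

With something like this, the "trim all but one field" user only has to name the exception, whether the match-all default is implicit or spelled out as {{.*}}.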


                
> Toolkit of UpdateProcessors for modifying document values
> ---------------------------------------------------------
>
>                 Key: SOLR-2802
>                 URL: https://issues.apache.org/jira/browse/SOLR-2802
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: SOLR-2802_update_processor_toolkit.patch
>
>
> Frequently users ask questions where the answer is "you could do it with an 
> UpdateProcessor", but the number of out-of-the-box UpdateProcessors is 
> generally lacking and there aren't even very good base classes for the 
> common case of manipulating field values when adding documents.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
