Stephan,

This should really go to the Solr user list rather than the general one
- you might get more response over there.

Upayavira

On Tue, Nov 26, 2013, at 01:52 PM, Stephan Müller wrote:
> Hi,
> 
> we are passing a multivalued field to the
> LanguageIdentifierUpdateProcessor. This multivalued field 
> contains arbitrary types (Integer, String, Date).
> Now, LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> doc, String[] fields),
> which, by the way, does not use its fields parameter, is unable to read
> all values of a multivalued field.
> The call "Object content = doc.getFieldValue(fieldName);" does not care
> what type the field is and just 
> delegates to SolrInputDocument which in turn calls getFirstValue.
> 
> So, two issues:
> first - if the first value of the multivalued field is not of type
> String, the field is ignored completely.
> 
> second - the concat method does not concatenate all values of a
> multivalued field.
> While
> http://www.mail-archive.com/[email protected]/msg90530.html
> states:
> "The feature is designed to detect exactly one language per field.
> In case of multValued, it will concatenate all values before detection."
> I don't see how the code could do this.
> 
> Is this a bug? Is this a deliberate design decision? Did we miss some
> configuration that would allow the
> language identification to use all values of a multivalued field?
> We are about to write our own
> LangDetectLanguageIdentifierUpdateProcessorFactory (why is the
> getInstance 
> hardcoded to return LanguageIdentifierUpdateProcessor?) and override
> LanguageIdentifierUpdateProcessor to
> handle all values of a multivalued field, ignoring non-string values.
> 
> Please see configuration below.
> 
> I hope I was able to make myself clear.
> 
> Regards,
> Stephan
> 
> 
> A little background:
> We are using a 3rd-party CMS framework which pulls in some magic SOLR
> configuration (namely the textbody field).
> 
> The field we are passing is defined as 
>     <!--
>       The default text search field.
>       This field and the field name_tokenized are used as default search
>       fields
>       for the /editor and /cmdismax search request handlers in
>       solrconfig.xml.
> 
>       For the Content Feeder the text of all indexed fields of
>       the CoreMedia document is stored in this field.
>       The CAE Feeder by default stores the text of all elements in
>       this field.
>     -->
>     <field name="textbody" type="text_general" stored="false"
>     multiValued="true"/>
> 
> As you can see, it is also used as a search field; therefore we want to
> keep the actual datatypes on the values.
> The field itself is generated by a processor, prior to calling the
> language identification (see processor chain).
> 
> 
> The processor chain:
>   <updateRequestProcessorChain>
>     <!-- Improve error messages -->
>     <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
>     <!-- Blob extraction -->
>     <processor class="3rdpartypackage.BinaryDataProcessorFactory">
>     <!-- some comments -->
>     </processor>
> 
>     <!-- Textbody handling -->
>     <processor class="3rdpartypackage.TextBodyProcessorFactory" />
>     <!-- Copy content of field name to name_tokenized -->
>     <processor class="solr.CloneFieldUpdateProcessorFactory">
>       <str name="source">name</str>
>       <str name="dest">name_tokenized</str>
>     </processor>
>     <!--Language detection -->
>     <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
>       <str name="langid.fl">textbody,name_tokenized</str>
>       <str name="langid.langField">language</str>
>       <str name="langid.fallback">en</str>
>     </processor>
>     <!-- Index into language dependent fields if defined (e.g.
>     textbody_en instead of textbody) -->
>     <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
>       <str name="languageField">language</str>
>       <str name="textFields">textbody,name_tokenized</str>
>     </processor>
> 
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> 
> 
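
For illustration, here is a minimal sketch of the concatenation described in the quoted message: use every value of a multivalued field and skip the non-string ones. It assumes the values are obtained with SolrInputDocument.getFieldValues(fieldName), which returns a Collection<Object>, rather than getFieldValue(fieldName), which only yields the first value. The class and method names (MultiValueConcat, concatStringValues) are hypothetical and not part of Solr, and joining with a single space is an assumption as well.

```java
import java.util.Collection;

// Hypothetical helper, not part of Solr: shows the "all values, strings only"
// behaviour that the quoted message asks for.
final class MultiValueConcat {

    private MultiValueConcat() {
        // static helper only
    }

    /**
     * Joins every String in {@code values} with a single space.
     * Non-String values (Integer, Date, ...) are skipped.
     * A null collection yields an empty string.
     */
    static String concatStringValues(Collection<?> values) {
        StringBuilder sb = new StringBuilder();
        if (values == null) {
            return "";
        }
        for (Object value : values) {
            if (value instanceof String) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }
}
```

In a custom processor this would presumably be called once per field listed in langid.fl, with the per-field results joined before language detection.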
