Ok, I consider this topic on _this_ list closed. I did a repost on the 'user' list.
Regards,
Stephan

> Sent: Tuesday, 26 November 2013 at 23:03
> From: Upayavira <[email protected]>
> To: [email protected]
> Subject: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields
>
> Stephan,
>
> This should really go to the Solr user list rather than the general one
> - you might get more response over there.
>
> Upayavira
>
> On Tue, Nov 26, 2013, at 01:52 PM, Stephan Müller wrote:
> > Hi,
> >
> > we are passing a multivalued field to the
> > LanguageIdentifierUpdateProcessor. This multivalued field
> > contains arbitrary types (Integer, String, Date).
> > Now, LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > doc, String[] fields), which btw does not use the parameter fields,
> > is unable to read all values of a multivalued field.
> > The call "Object content = doc.getFieldValue(fieldName);" does not care
> > what type the field is and just delegates to SolrInputDocument, which
> > in turn calls getFirstValue.
> >
> > So, two issues:
> > first - if the first value of the multivalued field is not of type
> > String, the field is ignored completely.
> >
> > second - the concat method does not concat all values of a multivalued
> > field.
> > While
> > http://www.mail-archive.com/[email protected]/msg90530.html
> > states:
> > "The feature is designed to detect exactly one language per field.
> > In case of multValued, it will concatenate all values before detection."
> > I don't see how the code could do this.
> >
> > Is this a bug? Is this a special design decision? Did we miss a certain
> > configuration that would allow the language identification to use all
> > values of a multivalued field?
> > We are about to write our own
> > LangDetectLanguageIdentifierUpdateProcessorFactory (why is the
> > getInstance
> > hardcoded to return LanguageIdentifierUpdateProcessor?)
> > and override LanguageIdentifierUpdateProcessor to
> > handle all values of a multivalued field, ignoring non-string values.
> >
> > Please see configuration below.
> >
> > I hope I was able to make myself clear.
> >
> > Regards,
> > Stephan
> >
> >
> > A little background:
> > We are using a 3rd-party CMS framework which pulls in some magic Solr
> > configuration (namely the textbody field).
> >
> > The field we are passing is defined as
> > <!--
> >   The default text search field.
> >   This field and the field name_tokenized are used as default search
> >   fields for the /editor and /cmdismax search request handlers in
> >   solrconfig.xml.
> >
> >   For the Content Feeder the text of all indexed fields of
> >   the CoreMedia document is stored in this field.
> >   The CAE Feeder by default stores the text of all elements in
> >   this field.
> > -->
> > <field name="textbody" type="text_general" stored="false"
> >        multiValued="true"/>
> >
> > As you can see, it is also used as a search field, therefore we want to
> > have the actual datatypes on the values.
> > The field itself is generated by a processor, prior to calling the
> > language identification (see processor chain).
> >
> >
> > The processor chain:
> > <updateRequestProcessorChain>
> >   <!-- Improve error messages -->
> >   <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
> >   <!-- Blob extraction -->
> >   <processor class="3rdpartypackage.BinaryDataProcessorFactory">
> >     <!-- some comments -->
> >   </processor>
> >
> >   <!-- Textbody handling -->
> >   <processor class="3rdpartypackage.TextBodyProcessorFactory" />
> >   <!-- Copy content of field name to name_tokenized -->
> >   <processor class="solr.CloneFieldUpdateProcessorFactory">
> >     <str name="source">name</str>
> >     <str name="dest">name_tokenized</str>
> >   </processor>
> >   <!-- Language detection -->
> >   <processor
> >       class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> >     <str name="langid.fl">textbody,name_tokenized</str>
> >     <str name="langid.langField">language</str>
> >     <str name="langid.fallback">en</str>
> >   </processor>
> >   <!-- Index into language dependent fields if defined (e.g.
> >        textbody_en instead of textbody) -->
> >   <processor
> >       class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
> >     <str name="languageField">language</str>
> >     <str name="textFields">textbody,name_tokenized</str>
> >   </processor>
> >
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> >
> >
> > --
> > This email was sent from the "E-Mail made in Germany" security
> > alliance: http://www.gmx.net/e-mail-made-in-germany
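
For readers of the archive: the workaround described in the quoted message (use all values of a multivalued field, skip non-string values) could look roughly like the sketch below. It is an illustration only, not the actual Solr code and not a tested patch. The class MultiValueConcat and the method concatStringValues are hypothetical names, and joining the values with a single space is an assumption. Plain Java collections stand in for the document here; in a custom processor the values would presumably come from SolrInputDocument.getFieldValues(fieldName), which returns every value of the field, rather than from getFieldValue(fieldName), which returns only the first one.

```java
import java.util.Collection;

/** Hypothetical helper for illustration; not part of Solr. */
final class MultiValueConcat {

    private MultiValueConcat() {
        // static helper, no instances
    }

    /**
     * Joins every String value with a single space and skips all other types
     * (Integer, Date, ...). A null collection (field not present) yields "".
     *
     * Assumption: in a real update processor the argument would be the result
     * of SolrInputDocument.getFieldValues(fieldName).
     */
    static String concatStringValues(Collection<?> values) {
        if (values == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Object value : values) {
            if (value instanceof String) {
                if (sb.length() > 0) {
                    sb.append(' '); // separator between consecutive string values
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }
}
```

Called with the values 42, "Hallo Welt", a Date and "second value", this returns "Hallo Welt second value", so a leading non-string value no longer hides the rest of the field from language detection.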
