Stephan,

This should really go to the Solr user list rather than the general one - you might get more response over there.
Upayavira

On Tue, Nov 26, 2013, at 01:52 PM, Stephan Müller wrote:
> Hi,
>
> we are passing a multivalued field to the
> LanguageIdentifierUpdateProcessor. This multivalued field
> contains values of arbitrary types (Integer, String, Date).
> Now, LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> doc, String[] fields), which by the way does not use its fields
> parameter, is unable to read all values of a multivalued field.
> The call "Object content = doc.getFieldValue(fieldName);" does not care
> what type the field is and just delegates to SolrInputDocument, which
> in turn calls getFirstValue.
>
> So, two issues:
> First - if the first value of the multivalued field is not of type
> String, the field is ignored completely.
>
> Second - the concat method does not concatenate all values of a
> multivalued field.
> While
> http://www.mail-archive.com/[email protected]/msg90530.html
> states:
> "The feature is designed to detect exactly one language per field.
> In case of multValued, it will concatenate all values before detection."
> I don't see how the code could do this.
>
> Is this a bug? Is this a deliberate design decision? Did we miss a
> configuration option that would allow the language identification to
> use all values of a multivalued field?
> We are about to write our own
> LangDetectLanguageIdentifierUpdateProcessorFactory (why is
> getInstance hardcoded to return LanguageIdentifierUpdateProcessor?)
> and override LanguageIdentifierUpdateProcessor to handle all values of
> a multivalued field, ignoring non-string values.
>
> Please see the configuration below.
>
> I hope I was able to make myself clear.
>
> Regards,
> Stephan
>
>
> A little background:
> We are using a 3rd-party CMS framework which pulls in some magic Solr
> configuration (namely the textbody field).
>
> The field we are passing is defined as
>    <!--
>       The default text search field.
>       This field and the field name_tokenized are used as default search
>       fields for the /editor and /cmdismax search request handlers in
>       solrconfig.xml.
>
>       For the Content Feeder the text of all indexed fields of
>       the CoreMedia document is stored in this field.
>       The CAE Feeder by default stores the text of all elements in
>       this field.
>    -->
>    <field name="textbody" type="text_general" stored="false"
>           multiValued="true"/>
>
> As you can see, it is also used as a search field, therefore we want to
> keep the actual data types on the values.
> The field itself is generated by a processor, prior to calling the
> language identification (see processor chain).
>
>
> The processor chain:
> <updateRequestProcessorChain>
>     <!-- Improve error messages -->
>     <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
>     <!-- Blob extraction -->
>     <processor class="3rdpartypackage.BinaryDataProcessorFactory">
>         <!-- some comments -->
>     </processor>
>
>     <!-- Textbody handling -->
>     <processor class="3rdpartypackage.TextBodyProcessorFactory" />
>     <!-- Copy content of field name to name_tokenized -->
>     <processor class="solr.CloneFieldUpdateProcessorFactory">
>         <str name="source">name</str>
>         <str name="dest">name_tokenized</str>
>     </processor>
>     <!-- Language detection -->
>     <processor
>         class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
>         <str name="langid.fl">textbody,name_tokenized</str>
>         <str name="langid.langField">language</str>
>         <str name="langid.fallback">en</str>
>     </processor>
>     <!-- Index into language dependent fields if defined (e.g.
>          textbody_en instead of textbody) -->
>     <processor
>         class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
>         <str name="languageField">language</str>
>         <str name="textFields">textbody,name_tokenized</str>
>     </processor>
>
>     <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
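
For illustration, the behaviour Stephan describes wanting (use every value of a multivalued field, skip anything that is not a String) could look roughly like the sketch below. This is not the Solr implementation. To stay self-contained it uses a plain Map<String, Collection<Object>> as a stand-in for SolrInputDocument, and the class and method names (MultiValueConcat, concatStringValues) are hypothetical. In a real processor the per-field values would presumably come from SolrInputDocument.getFieldValues(name), which returns all values of a field rather than only the first.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr. The Map stands in for a
// SolrInputDocument: field name -> all values of that field.
class MultiValueConcat {

    // Joins every String value of the listed fields with single spaces.
    // Non-String values (Integer, Date, ...) and missing fields are skipped,
    // so a non-String first value no longer hides the rest of the field.
    static String concatStringValues(Map<String, Collection<Object>> doc, String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            Collection<Object> values = doc.get(field);
            if (values == null) {
                continue; // field not present in this document
            }
            for (Object value : values) {
                if (value instanceof String) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append((String) value);
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Collection<Object>> doc = new LinkedHashMap<>();
        // Mixed types, with a non-String value first, as in the report above.
        doc.put("textbody", Arrays.<Object>asList(42, "Hallo Welt", new Date(), "zweiter Absatz"));
        doc.put("name_tokenized", Arrays.<Object>asList("Titel"));
        System.out.println(concatStringValues(doc, new String[] {"textbody", "name_tokenized"}));
    }
}
```

Unlike the concatFields method described above, this sketch actually uses its fields argument, so only the fields listed in langid.fl would contribute text to the detection.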
