Consider an update processor: it can operate on any field and has access to all of the document's fields.

You could have one update processor combine all the fields to be processed into a temporary dummy field. Then run a language detection update processor on the combined field. Then process the results and place them in the desired field. Finally, remove any temporary fields.
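
The four steps above can be sketched in plain Java. This is only an illustration under stated assumptions, not Solr code: a Map stands in for the document, the class name, the temporary field name "_lang_tmp", and the method names are all hypothetical, and detectLanguage is a toy placeholder for whatever language detection update processor you actually configure.

```java
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for an update processor chain; a Map plays the role of the document.
class LanguageFieldPipeline {
    // Hypothetical name for the temporary, dummy field.
    static final String TEMP_FIELD = "_lang_tmp";

    // Step 1: combine the fields to process into one temporary field.
    static void combineFields(Map<String, Object> doc, List<String> sourceFields) {
        StringBuilder combined = new StringBuilder();
        for (String name : sourceFields) {
            Object value = doc.get(name);
            if (value == null) continue;              // skip fields the document lacks
            if (combined.length() > 0) combined.append(' ');
            combined.append(value);
        }
        doc.put(TEMP_FIELD, combined.toString());
    }

    // Step 2: toy placeholder detector (assumption): "ru" if any Cyrillic character, else "en".
    // A real chain would delegate to a language detection update processor here.
    static String detectLanguage(String text) {
        boolean hasCyrillic = text.codePoints()
            .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC);
        return hasCyrillic ? "ru" : "en";
    }

    // Runs all four steps in order on one document.
    static void process(Map<String, Object> doc, List<String> sourceFields, String targetField) {
        combineFields(doc, sourceFields);                                 // step 1
        String language = detectLanguage((String) doc.get(TEMP_FIELD));  // step 2
        doc.put(targetField, language);                                   // step 3: store the result
        doc.remove(TEMP_FIELD);                                           // step 4: drop the temporary field
    }
}
```

In a real chain each step would be its own UpdateRequestProcessor, configured in order, rather than one method.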

-- Jack Krupansky

-----Original Message-----
From: David Anthony Troiano
Sent: Monday, October 28, 2013 4:47 PM
To: solr-user@lucene.apache.org
Subject: Single multilingual field analyzed based on other field values

Hello,

First some background...

I am indexing a multilingual document set where documents themselves can
contain multiple languages.  The language(s) within my documents are known
ahead of time.  I have tried separate fields per language, and due to the
poor query performance I'm seeing with that approach (many languages /
fields), I'm trying to create a single multilingual field.

One approach to this problem is given in Section 14.6.4
<https://docs.google.com/a/basistech.com/file/d/0B3NlE_uL0pqwR0hGV0M1QXBmZm8/edit>
of the new Solr In Action book.  The approach is to take the document
content field and prepend it with the list of contained languages followed by
a special delimiter.  A new field type is defined that maps languages to
sub field types, and the new type's tokenizer then runs all of the sub
field type analyzers over the field and merges results, adjusts offsets for
the prepended data, etc.

Due to the tokenizer complexity incurred, I'd like to pursue a more
flexible approach, which is to run the various language-specific analyzers
not based on prepended codes, but instead based on other field values
(i.e., a language field).

I don't see a straightforward way to do this, mostly because a field
analyzer doesn't have access to the rest of the document.  On the flip
side, an UpdateRequestProcessor would have access to the document but
doesn't really give a path to wind up where I want to be (single field with
different analyzers run dynamically).

Finally, my question: is it possible to cache a document's language(s) per thread
during UpdateRequestProcessor execution (where we have access to the full
document), so that the analyzer can then read from the cache to determine
which analyzer(s) to run?  More specifically, if a document is run through
its URP chain on thread T, will its analyzer(s) also run on thread T and
will no other documents be run through the URP on that thread in the
interim?

Thanks,
Dave
