Re: multiple analyzers for one field

Trey Grainger Thu, 10 Apr 2014 19:43:22 -0700

Hi Michael,

It IS possible to utilize multiple Analyzers within a single field, but
it's not a "built in" capability of Solr right now. I wrote something I
called a "MultiTextField" which provides this capability, and you can see
the code here:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

The general idea is that you can pass in a prefix for each piece of your
content and then use that prefix to dynamically select one or more
Analyzers for each piece of content. So, for example, you could pass in
something like this when indexing your document (for a multiValued field):
<field name="someMultiTextField">en|some text</field>
<field name="someMultiTextField">es|some more text</field>
<field name="someMultiTextField">de,fr|some other text</field>

Then, the MultiTextField will parse the prefixes and dynamically grab an
Analyzer based upon the prefix. In this case, the first input will be
processed using an English Analyzer, the second input will use a Spanish
analyzer, and the third input will use both a German and French analyzer,
as defined when the field is defined in the schema.xml:

<fieldType name="multiText"
        class="sia.ch14.MultiTextField" sortMissingLast="true"
        defaultFieldType="text_general"
        fieldMappings="en:text_english,
                       es:text_spanish,
                       fr:text_french,
                       de:text_german"/>

<field name="someMultiTextField" type="multiText" indexed="true"
multiValued="true" />
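
To make the prefix handling concrete, here is a minimal sketch of how a value
such as "de,fr|some other text" could be split into analyzer keys and text.
This is not the actual sia.ch14 code: the class name PrefixedValue and its
parse method are hypothetical, and it assumes the first '|' is always the
delimiter and that a value with no prefix falls back to the default field type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not the sia.ch14 implementation): splits a raw field
// value like "de,fr|some other text" into analyzer keys and the text itself.
final class PrefixedValue {
    final List<String> keys; // e.g. ["de", "fr"]; empty means "use the default field type"
    final String text;       // the content that the selected analyzers would process

    private PrefixedValue(List<String> keys, String text) {
        this.keys = keys;
        this.text = text;
    }

    static PrefixedValue parse(String raw) {
        int bar = raw.indexOf('|');
        if (bar < 0) {
            // Assumption: no prefix means the whole value goes to the default analyzer.
            return new PrefixedValue(new ArrayList<>(), raw);
        }
        List<String> keys = new ArrayList<>();
        for (String key : raw.substring(0, bar).split(",")) {
            String trimmed = key.trim();
            if (!trimmed.isEmpty()) {
                keys.add(trimmed);
            }
        }
        return new PrefixedValue(keys, raw.substring(bar + 1));
    }

    public static void main(String[] args) {
        PrefixedValue value = parse("de,fr|some other text");
        System.out.println(value.keys + " -> " + value.text);
    }
}
```

Each key would then be looked up in the fieldMappings to find the field type
whose Analyzer should process the text. Note that under this sketch an
unprefixed value that itself contains a '|' would be misparsed.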

If you want to automagically map separate fields into one of these dynamic
analyzer (MultiText) fields with prefixes, you could either pass the text
in multiple times from the client to the same field (with different
Analyzer prefixes each time, as shown above), OR you could write an Update
Request Processor that does this for you. I don't think it is possible to
just have the copyField add in prefixes automatically for you, though
someone please correct me if I'm wrong.

If you implement an Update Request Processor, then inside it you would
simply grab the text from each of the relevant fields (e.g. author and
title fields) and then add that field's value to the named MultiText field
with the appropriate Analyzer prefix based upon each field. I made an
example Update Request Processor (see the previous GitHub link and look for
MultiTextFieldLanguageIdentifierUpdateProcessor) that you could look at as
an example of how to supply different analyzer prefixes to different values
within a multiValued field, though you would obviously want to throw away
all the language detection stuff since it doesn't match your specific use
case.
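
As a rough sketch of that prefixing step, the logic could look something like
the following. Plain Java maps stand in for the Solr document so the example
runs without Solr on the classpath; in a real Update Request Processor the same
loop would live inside processAdd and read from and write to the
SolrInputDocument. The class name MultiTextPrefixer and the "kw" and "en"
prefix keys are made up for illustration and would need matching entries in
the fieldType's fieldMappings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the prefixing step, not the code from the book.
final class MultiTextPrefixer {
    // Assumed mapping: source field name -> analyzer prefix to attach to its values.
    private final Map<String, String> prefixBySourceField;

    MultiTextPrefixer(Map<String, String> prefixBySourceField) {
        this.prefixBySourceField = prefixBySourceField;
    }

    // Returns the values that would be added to the multiValued MultiText field.
    List<String> prefixedValues(Map<String, List<String>> doc) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> entry : prefixBySourceField.entrySet()) {
            List<String> values = doc.get(entry.getKey());
            if (values == null) {
                continue; // this document has no value for the source field
            }
            for (String value : values) {
                out.add(entry.getValue() + "|" + value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("author", "kw"); // made-up key for a keyword-style field type
        prefixes.put("title", "kw");
        prefixes.put("text", "en");   // tokenized with the English field type

        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("author", Arrays.asList("Jane Doe"));
        doc.put("title", Arrays.asList("An Example Title"));
        doc.put("text", Arrays.asList("the full text of the document"));

        for (String value : new MultiTextPrefixer(prefixes).prefixedValues(doc)) {
            System.out.println(value);
        }
    }
}
```

Keeping the field-to-prefix mapping in configuration rather than in code would
let the same processor serve other source fields without recompiling.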

All that being said, this solution may end up being overly complicated for
your use case, so your idea of creating a custom analyzer to just handle
your example might be much less complicated. At any rate, that's the
specific answer to your specific question about whether it is possible to
utilize multiple Analyzers within a field based upon multiple inputs.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder

On Thu, Apr 10, 2014 at 9:05 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> The lack of response to this question makes me think that either there is
> no good answer, or maybe the question was too obtuse.  So I'll give it one
> more go with some more detail ...
>
> My main goal is to implement autocompletion with a mix of words and short
> phrases, where the words are drawn from the text of largish documents, and
> the phrases are author names and document titles.
>
> I think the best way to accomplish this is to concoct a single field that
> contains data from these other "source" fields (as usual with copyField),
> but with some of the fields treated as keywords (i.e. with their values
> inserted as single tokens), and others tokenized.  I believe this would be
> possible at the Lucene level by calling Document.addField () with multiple
> fields having the same name: some marked as TOKENIZED and others not.  I
> think the tokenized fields would have to share the same analyzer, but
> that's OK for my case.
>
> I can't see how this could be made to happen in Solr without a lot of
> custom coding though. It seems as if the conversion from Solr fields to
> Lucene fields is not an easy thing to influence.  If anyone has an idea how
> to achieve the subgoal, or perhaps a different way of getting at the main
> goal, I'd love to hear about it.
>
> So far my only other idea is to write some kind of custom analyzer that
> treats short texts as keywords and tokenizes longer ones, which is probably
> what I'll look at if nothing else comes up.
>
> Thanks
>
> Mike
>
>
>
> On 4/9/2014 4:16 PM, Michael Sokolov wrote:
>
>> I think I would like to do something like copyfield from a bunch of
>> fields into a single field, but with different analysis for each source,
>> and I'm pretty sure that's not a thing. Is there some alternate way to
>> accomplish my goal?
>>
>> Which is to have a suggester that suggests words from my full text field
>> and complete phrases drawn from my author and title fields all at the same
>> time.  So if I could index author and title using KeywordAnalyzer, and full
>> text tokenized, that would be the bee's knees.
>>
>> -Mike
>>
>
>
