Re: Highlighting question in a multi-language index

Daniel Alheiros Wed, 01 Aug 2007 06:24:19 -0700

Hi

I've narrowed my problem down: it is related to the way I store and index
my fields in a multi-language index. I'll explain how I'm doing it, and I
hope you can come up with a nice way to solve it:


My schema.xml contains the following definitions:

    <!-- Definition for the "language-agnostic" text field -->
    <fieldtype name="text_basic" class="solr.TextField"
               positionIncrementGap="100">
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        </analyzer>
    </fieldtype>

    <!-- Text definition for "ENGLISH" -->
    <fieldtype name="text_english" class="solr.TextField"
               positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SynonymFilterFactory"
                    synonyms="synonyms-english.txt" ignoreCase="true"
                    expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords-english.txt"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"/>
            <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords-english.txt"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords-english.txt"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"/>
            <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords-english.txt"/>
        </analyzer>
    </fieldtype>

    <!-- Text definition for "SPANISH" -->
    <fieldtype name="text_spanish" class="solr.TextField"
               positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords-spanish.txt"/>
            <filter class="solr.SnowballPorterFilterFactory"
                    language="Spanish"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords-spanish.txt"/>
            <filter class="solr.SnowballPorterFilterFactory"
                    language="Spanish"/>
        </analyzer>
    </fieldtype>

    <field name="body"    type="text_basic"   indexed="false" stored="true"
           multiValued="false" compressed="true" compressionThreshold="1024"/>
    <field name="body_en" type="text_english" indexed="true"  stored="false"/>
    <field name="body_es" type="text_spanish" indexed="true"  stored="false"/>

    <copyField source="body_en" dest="body"/>
    <copyField source="body_es" dest="body"/>

So I'm indexing some fields, but I'm actually storing a different one that
holds a copy of the data used for indexing (independently of the language
of the current document). Each document has only one body_XX field,
depending on its language: an English document has body_en, and a Spanish
one has body_es.
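
To make this layout concrete, here is a minimal sketch of two documents in
Solr's XML update format. The "id" field and the sample text are assumptions
made up for illustration; they are not part of the schema shown above.

```xml
<add>
    <!-- English document: only body_en is supplied; the copyField rule
         copies its raw text into the stored "body" field. -->
    <doc>
        <field name="id">doc-en-1</field> <!-- hypothetical unique key -->
        <field name="body_en">Spread the butter on the toast.</field>
    </doc>
    <!-- Spanish document: only body_es is supplied. -->
    <doc>
        <field name="id">doc-es-1</field> <!-- hypothetical unique key -->
        <field name="body_es">Unta la mantequilla en la tostada.</field>
    </doc>
</add>
```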

When querying, I ask for highlighting on the generic "body" field (as I
need a stored field for highlighting), but it only returns the proper
result if my stored field has the same query analyzer structure as the
language-dependent field. I can't do that, because I'm indexing content in
six completely different languages that don't share much in terms of
analysis...

The idea behind having a generic set of (language-independent) fields is to
avoid different interfaces for the search client (as the same search client
can search in any language). All these documents are in the same index for
deployment and content-management simplicity, and because the number of
documents is not so large that they can't live together (and the update
frequency is low).

Can you help me again with this? Is this solution feasible using Solr/Lucene,
or will I have to change my mind and change the client interface so that it
queries the language-specific fields (which I would then need to make
stored="true")?
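
A minimal sketch of that alternative, assuming the per-language fields are
simply switched to stored="true" and the generic "body" field and its
copyField rules are dropped. It is untested and only illustrates the idea:

```xml
<!-- Per-language fields are both indexed and stored, so highlighting can
     run on the same field (and the same analyzer chain) that was searched. -->
<field name="body_en" type="text_english" indexed="true" stored="true"/>
<field name="body_es" type="text_spanish" indexed="true" stored="true"/>
```

The client would then search and highlight the language-specific field, for
example with request parameters such as q=body_en:butter&hl=true&hl.fl=body_en
(the query term is only an example).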

Thanks again,
Daniel

On 1/8/07 10:43, "Daniel Alheiros" <[EMAIL PROTECTED]> wrote:

> Hi Mike.
> 
> Thanks for your reply, but it seems that I haven't expressed myself clearly.
> Here I go:
> 
> When I search for "butter", I want all words containing "butter" (like
> "buttered", "butters" ...) to be highlighted.
> 
> I'm using the PorterStemmerFilterFactory when indexing but not when
> querying.
> 
> Regards,
> Daniel
> 
> 
> On 31/7/07 18:50, "Mike Klaas" <[EMAIL PROTECTED]> wrote:
> 
>> 
>> On 31-Jul-07, at 9:41 AM, Daniel Alheiros wrote:
>> 
>>> Hi
>>> 
>>> I've started using highlighting and there is something that I consider
>>> a bit odd... It may be caused by the way I'm indexing or querying, I'm
>>> sure, but just to avoid doing a huge number of tests...
>>> 
>>> I'm querying for "butter" and only exact matches of "butter" are
>>> returned highlighted; when I change my query to "butters" it returns
>>> both "butter" and "butters" highlighted. Is it something that considers
>>> the word and its reductions but does not match a word that contains
>>> the word in the query?
>> 
>> This is because the example Solr distribution is configured to do
>> stemming (see the definition for "text" fieldtype in schema.xml).
>> 
>> Remove PorterStemmerFilterFactory to do exact(er) searching/
>> highlighting only.
>> 
>> -Mike
> 
> 
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain personal
> views which are not the views of the BBC unless specifically stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in reliance on
> it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
> 


