Re: How to index and search non-English text in Solr

Erick Erickson Fri, 10 Jun 2011 05:55:48 -0700

Well, no. Specifying both indexed and stored as "false"
is essentially a no-op: the field can be neither searched
nor returned, so you'd never find anything!


But even with indexed="true", this solution has problems.
copyField hands the raw text to "defaultquery", so you are
essentially using a single field, with a single analyzer, to
hold text from different languages. The problem is that
tokenization, stemming, etc. behave differently in different
languages, especially when you contrast CJK
and Western languages.
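
For illustration, here is a minimal sketch of the one-field-per-language
layout suggested earlier in this thread. The names text_en, text_cjk,
news_en, news_zh and news_ja are assumptions made up for the example, and
the analyzer chains are deliberately short; the LanguageAnalysis wiki page
has fuller per-language recommendations.

```xml
<!-- Sketch only, not a tested configuration. All type and field names
     below are hypothetical. The fieldType elements belong inside <types>
     and the field elements inside <fields> in schema.xml. -->

<!-- English: whitespace tokenization, stopwords, lowercasing, stemming -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- Chinese/Japanese: the CJK tokenizer named in this thread, with no
     Western stemmer or stopword list applied -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- One indexed field per language, each using its own analyzer -->
<field name="news_en" type="text_en"  indexed="true" stored="false"/>
<field name="news_zh" type="text_cjk" indexed="true" stored="false"/>
<field name="news_ja" type="text_cjk" indexed="true" stored="false"/>
```

A search can then be spread across the language-specific fields, for
example with the dismax parser and qf=news_en news_zh news_ja, rather than
funnelling everything into one copyField target.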

Best
Erick

On Fri, Jun 10, 2011 at 1:05 AM, Mohammad Shariq <shariqn...@gmail.com> wrote:
> Thanks Erick for your help.
> I have another silly question.
> Suppose I created multiple fieldTypes, e.g. news_English, news_Chinese,
> news_Japnese, etc.
> After creating these fields, can I copy all of them to the copyField
> destination "defaultquery", like below:
>
> <copyField source="news_English" dest="defaultquery"/>
> <copyField source="news_Chinese" dest="defaultquery"/>
> <copyField source="news_Japnese" dest="defaultquery"/>
>
> and my "defaultquery" looks like:
> <field name="defaultquery" type="query_text" indexed="false" stored="false"
> multiValued="true"/>
>
> Is this the right way to deal with multiple-language indexing and searching?
>
>
>
> On 9 June 2011 19:06, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> No, you'd have to create multiple fieldTypes, one for each language....
>>
>> Best
>> Erick
>>
>> On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq <shariqn...@gmail.com>
>> wrote:
>> > Can I specify multiple languages in the filter tags in schema.xml,
>> > like below?
>> >
>> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer type="index">
>> >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >      <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >              words="stopwords.txt" enablePositionIncrements="true"/>
>> >      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> >              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >              catenateAll="0" splitOnCaseChange="1"/>
>> >
>> >      <filter class="solr.SnowballPorterFilterFactory" language="Dutch" />
>> >      <filter class="solr.SnowballPorterFilterFactory" language="English" />
>> >      <filter class="solr.SnowballPorterFilterFactory" language="Chinese" />
>> >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >      <tokenizer class="solr.CJKTokenizerFactory"/>
>> >
>> >      <filter class="solr.LowerCaseFilterFactory"/>
>> >      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
>> >
>> >
>> > On 8 June 2011 18:47, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> This page is a handy reference for individual languages...
>> >> http://wiki.apache.org/solr/LanguageAnalysis
>> >>
>> >> But the usual approach, especially for Chinese/Japanese/Korean
>> >> (CJK) is to index the content in different fields with language-specific
>> >> analyzers then spread your search across the language-specific
>> >> fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
>> >> particularly give "surprising" results if you put words from different
>> >> languages in the same field.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq <shariqn...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> > I had set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles
>> >> > in English, but my requirement now extends to indexing news in other
>> >> > languages too.
>> >> >
>> >> > This is how my schema looks :
>> >> > <field name="news" type="text" indexed="true" stored="false"
>> >> > required="false"/>
>> >> >
>> >> >
>> >> > And the "text" Field in schema.xml looks like :
>> >> >
>> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >> >    <analyzer type="index">
>> >> >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >> >       <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> >               words="stopwords.txt" enablePositionIncrements="true"/>
>> >> >       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> >> >               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> >               catenateAll="0" splitOnCaseChange="1"/>
>> >> >       <filter class="solr.LowerCaseFilterFactory"/>
>> >> >       <filter class="solr.SnowballPorterFilterFactory" language="English"
>> >> >               protected="protwords.txt"/>
>> >> >    </analyzer>
>> >> >    <analyzer type="query">
>> >> >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >> >       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >> >               ignoreCase="true" expand="true"/>
>> >> >       <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> >               words="stopwords.txt" enablePositionIncrements="true"/>
>> >> >       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> >> >               generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>> >> >               catenateAll="0" splitOnCaseChange="1"/>
>> >> >       <filter class="solr.LowerCaseFilterFactory"/>
>> >> >       <filter class="solr.SnowballPorterFilterFactory" language="English"
>> >> >               protected="protwords.txt"/>
>> >> >    </analyzer>
>> >> > </fieldType>
>> >> >
>> >> >
>> >> > My problem is:
>> >> > Now I want to index news articles in other languages too, e.g.
>> >> > Chinese and Japanese.
>> >> > How can I modify my text field so that I can index news in other
>> >> > languages too and make it searchable?
>> >> >
>> >> > Thanks
>> >> > Shariq
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >>
>> http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks and Regards
>> > Mohammad Shariq
>> >
>>
>
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>
