Re: Korean Tokenizer in solr

Alexandre Rafalovitch Mon, 14 Jul 2014 00:31:32 -0700

What happens if you start with a new collection that has the absolute
minimum in it and then add the definition? Start from something like:
https://github.com/arafalov/simplest-solr-config .


Also, is there a longer exception earlier in the log? It may have more clues.
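
One thing that may be worth checking, offered as an assumption rather than
something confirmed in this thread: the quoted definition passes
indexUnigrams="true" to CJKBigramFilterFactory, while the parameter names I
know for that factory are han, hiragana, katakana, hangul and outputUnigrams.
Solr 4.x analysis factories generally reject parameters they do not recognise,
which could surface as "Error instantiating class". A minimal sketch of a type
with that one attribute renamed (the name text_cjk_minimal is made up for
illustration):

```xml
<fieldType name="text_cjk_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizerFactory needs the ICU jars from contrib/analysis-extras on the classpath -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- outputUnigrams instead of indexUnigrams (assumed, not confirmed, to be the culprit) -->
    <filter class="solr.CJKBigramFilterFactory" han="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

If this loads cleanly, adding the WordDelimiterFilterFactory line back should
show whether the error returns.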

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Mon, Jul 14, 2014 at 2:15 PM, Poornima Jay
<poornima...@rocketmail.com> wrote:
> Yes, below is my defined fieldType:
>
> <fieldType name="text_match_phrase_cjk" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type ="index">
>          <tokenizer class="solr.ICUTokenizerFactory"/>
>          <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" 
> han="true"/>
>          <filter class="solr.WordDelimiterFilterFactory" 
> generateWordParts="1" generateNumberParts="1" catenateWords="1" 
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" 
> preserveOriginal="1"/>
>       </analyzer>
>       <analyzer type ="query">
>          <tokenizer class="solr.ICUTokenizerFactory"/>
>          <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" 
> han="true"/>
>          <filter class="solr.WordDelimiterFilterFactory" 
> generateWordParts="1" generateNumberParts="1" catenateWords="0" 
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" 
> preserveOriginal="1"/>
>       </analyzer>
>    </fieldType>
>
> Please correct me if I am doing anything wrong here.
>
> Regards,
> Poornima
>
>
> On Monday, 14 July 2014 12:33 PM, Alexandre Rafalovitch <arafa...@gmail.com> 
> wrote:
>
>
>
> Are you sure it's not a spelling error or something else weird like
> that? Solr ships with that filter in its example schema:
>         <filter class="solr.CJKBigramFilterFactory"/>
>
> So, you can compare what you are doing differently with that.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
>
> On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay
> <poornima...@rocketmail.com> wrote:
>> I have upgraded Solr to version 4.8.1, but after making changes in the 
>> schema file I am getting the error below:
>> Error instantiating class: 
>> 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
>> I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 
>> 4.8.1. Do I need to make any configuration changes to get this working?
>>
>> Please advise.
>>
>> Regards,
>> Poornima
>>
>>
>> On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch 
>> <arafa...@gmail.com> wrote:
>>
>>
>>
>> I would suggest you read through all 12 (?) articles in this series:
>> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
>> . It will probably lay out most of the issues for you.
>>
>> And if you are just starting, I would really suggest using the latest Solr
>> (4.9). A lot more people remember what the latest version has than
>> what was in 3.6. And, as the series above will tell you, some relevant
>> issues have been fixed in more recent Solr versions.
>>
>> Regards,
>>    Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr 
>> proficiency
>>
>>
>>
>> On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
>> <poornima...@rocketmail.com> wrote:
>>> Until now I thought Solr supported KoreanTokenizer. I haven't used 
>>> any third-party one.
>>> The issue I am facing is that I need to integrate English, Chinese, 
>>> Japanese and Korean language search in a single site. Based on the user's 
>>> selected language, the appropriate fields will be queried.
>>>
>>> I tried using CJK for all three languages as below, but only a few search 
>>> terms work for Chinese and Japanese. Nothing works for Korean.
>>>
>>> <fieldtype name="text_cjk" class="solr.TextField" 
>>> positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>      <analyzer>
>>>         <tokenizer class="solr.CJKTokenizerFactory" />
>>>         <filter class="solr.CJKWidthFilterFactory"/>
>>>         <filter 
>>> class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>         <filter class="solr.ICUTransformFilterFactory" 
>>> id="Traditional-Simplified"/>
>>>         <filter class="solr.ICUTransformFilterFactory" 
>>> id="Katakana-Hiragana"/>
>>>         <filter class="solr.ICUFoldingFilterFactory"/>
>>>         <filter class="solr.CJKBigramFilterFactory" han="true" 
>>> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>>       </analyzer>
>>>     </fieldtype>
>>>
>>> So I tried to implement an individual fieldType for each language, as below:
>>>
>>> Chinese
>>>  <fieldType name="text_cjk" class="solr.TextField" 
>>> positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>>      <analyzer>
>>>          <tokenizer class="solr.ICUTokenizerFactory"/>
>>>            <filter class="solr.ICUFoldingFilterFactory"/>
>>>            <filter class="solr.CJKWidthFilterFactory"/>
>>>            <filter class="solr.CJKBigramFilterFactory"/>
>>>        </analyzer>
>>>     </fieldType>
>>>
>>> Japanese
>>> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
>>> autoGeneratePhraseQueries="false">
>>>    <analyzer>
>>>      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>>>       <filter class="solr.JapaneseBaseFormFilterFactory"/>
>>>       <filter class="solr.JapanesePartOfSpeechStopFilterFactory" 
>>> tags="stoptags_ja.txt" />
>>>       <filter class="solr.CJKWidthFilterFactory"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" 
>>> words="stopwords_ja.txt" />
>>>       <filter class="solr.JapaneseKatakanaStemFilterFactory" 
>>> minimumLength="4"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>    </analyzer>
>>> </fieldType>
>>>
>>> Korean
>>> <fieldType name="text_kr" class="solr.TextField" 
>>> positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>>       <analyzer type="index">
>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" 
>>> hasCNoun="true"  bigrammable="true"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" 
>>> words="stopwords_kr.txt"/>
>>>       </analyzer>
>>>       <analyzer type="query">
>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" 
>>> hasCNoun="false"  bigrammable="false"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" 
>>> words="stopwords_kr.txt"/>
>>>       </analyzer>
>>>     </fieldType>
>>>
>>> I am really stuck on how to implement this. Please help me.
>>>
>>> Thanks,
>>> Poornima
>>>
>>>
>>>
>>> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch 
>>> <arafa...@gmail.com> wrote:
>>>
>>>
>>>
>>> I don't think Solr ships with a Korean tokenizer, does it?
>>>
>>> If you are using a third-party one, you need to give the full class name,
>>> not just solr.Korean... And you need the library added via a lib
>>> statement in solrconfig.xml (at least in Solr 4).
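>>>
>>> A rough sketch of what those two pieces might look like. The directory
>>> and the com.example package are placeholders I made up, not the real
>>> location or class name of any particular Korean analyzer; substitute
>>> whatever your third-party library documents.
>>>
>>> ```xml
>>> <!-- solrconfig.xml: load the third-party jars (directory is hypothetical) -->
>>> <lib dir="/path/to/korean-analyzer/lib" regex=".*\.jar"/>
>>>
>>> <!-- schema.xml: fully qualified class name (com.example is a placeholder) -->
>>> <tokenizer class="com.example.analysis.KoreanTokenizerFactory"/>
>>> ```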
>>>
>>> Regards,
>>>    Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr 
>>> proficiency
>>>
>>>
>>>
>>> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
>>> <poornima...@rocketmail.com> wrote:
>>>> I have defined the fieldType inside the fields section. When I checked 
>>>> the error log I found the error below:
>>>>
>>>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>>>
>>>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or 
>>>> tokenizer & filter list
>>>>
>>>>
>>>> Do I need to add any libraries for KoreanTokenizer?
>>>>
>>>> Regards,
>>>> Poornima
>>>>
>>>>
>>>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch 
>>>> <arafa...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Double-check your XML file to make sure you don't - for example - define
>>>> your fieldType outside of the fields section. Or maybe you have an
>>>> exception earlier about some component in the type definition.
>>>>
>>>> This is not about the Korean language, it seems. It is something more
>>>> fundamental about the XML config.
>>>>
>>>> Regards,
>>>>    Alex.
>>>> Personal website: http://www.outerthoughts.com/
>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr 
>>>> proficiency
>>>>
>>>>
>>>>
>>>> On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
>>>> <poornima...@rocketmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Has anyone tried to implement Korean language support in Solr 3.6.1? I
>>>>> defined the field type as below in my schema file, but the fieldType is
>>>>> not working.
>>>>>
>>>>> <fieldType name="text_kr" class="solr.TextField" 
>>>>> positionIncrementGap="1000">
>>>>>       <analyzer type="index">
>>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true"
>>>>> hasCNoun="true"  bigrammable="true"/>
>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>> words="stopwords_kr.txt"/>
>>>>>       </analyzer>
>>>>>       <analyzer type="query">
>>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false"
>>>>> hasCNoun="false"  bigrammable="false"/>
>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>> words="stopwords_kr.txt"/>
>>>>>       </analyzer>
>>>>>     </fieldType>
>>>>>
>>>>> Error: Caused by: org.apache.solr.common.SolrException: Unknown fieldtype
>>>>> 'text_kr' specified on field product_name_kr
>>>>>
>>>>> Regards,
>>>>> Poornima
>>>>>
