Re: Korean Tokenizer in solr

2014-07-14 Thread Poornima Jay
I have upgraded the Solr version to 4.8.1, but after making changes in the 
schema file I am getting the error below:
Error instantiating class: 
'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 
4.8.1. Do I need to make any configuration changes to get this working?

Please advise.

Regards,
Poornima


On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch arafa...@gmail.com 
wrote:
 


I would suggest you read through all 12 (?) articles in this series:
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
. It will probably lay out most of the issues for you.

And if you are starting, I would really suggest using the latest Solr
(4.9). A lot more people remember what the latest version has than
what was in 3.6. And, as the series above will tell you, some relevant
issues have been fixed in more recent Solr versions.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency



On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
poornima...@rocketmail.com wrote:
 Until now I thought Solr supported KoreanTokenizer out of the box; I haven't 
 used any other 3rd-party one.
 Actually, the issue I am facing is that I need to integrate English, Chinese, 
 Japanese and Korean language search in a single site. Based on the user's 
 selected language, the appropriate fields will be queried.

 I tried using the CJK type for all three languages as below, but only a few 
 search terms work for Chinese and Japanese; nothing works for Korean.

 <fieldtype name="text_cjk" class="solr.TextField" 
 positionIncrementGap="1" autoGeneratePhraseQueries="false">
      <analyzer>
         <tokenizer class="solr.CJKTokenizerFactory"/>
         <filter class="solr.CJKWidthFilterFactory"/>
         <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
         <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
         <filter class="solr.ICUFoldingFilterFactory"/>
         <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
       </analyzer>
     </fieldtype>

 So I tried to implement an individual field type for each language, as below:

 Chinese
  <fieldType name="text_cjk" class="solr.TextField" 
 positionIncrementGap="1000" autoGeneratePhraseQueries="false">
      <analyzer>
          <tokenizer class="solr.ICUTokenizerFactory"/>
          <filter class="solr.ICUFoldingFilterFactory"/>
          <filter class="solr.CJKWidthFilterFactory"/>
          <filter class="solr.CJKBigramFilterFactory"/>
      </analyzer>
  </fieldType>

 Japanese
 <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
 autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt"/>
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
 </fieldType>

 Korean
 <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" 
 autoGeneratePhraseQueries="false">
    <analyzer type="index">
      <tokenizer class="solr.KoreanTokenizerFactory"/>
      <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KoreanTokenizerFactory"/>
      <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
    </analyzer>
 </fieldType>

 I am really stuck on how to implement this. Please help me.

 Thanks,
 Poornima



 On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch arafa...@gmail.com 
 wrote:



 I don't think Solr ships with Korean Tokenizer, does it?

 If you are using a 3rd-party one, you need to give the full class name,
 not just solr.Korean..., and you need the library added via a lib
 statement in solrconfig.xml (at least in Solr 4).
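As a sketch of that wiring (the directory and jar names below are placeholders, not the actual paths of any Korean analyzer), solrconfig.xml loads plugin jars with lib directives:

```xml
<!-- solrconfig.xml sketch: load third-party analysis jars so that
     schema.xml can reference their fully qualified class names.
     The dir/regex/path values here are illustrative placeholders. -->
<config>
  <!-- load every jar matching the regex from a directory -->
  <lib dir="../../contrib/analysis-extras/lib" regex=".*\.jar" />
  <!-- or point at one explicitly named jar (placeholder name) -->
  <lib path="../../lib/korean-analyzer.jar" />
  <!-- ... rest of solrconfig.xml ... -->
</config>
```

After adding the jar, the tokenizer must be referenced by its full package-qualified class name in the schema, not a solr.* shorthand.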

 Regards,
    Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency



 On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
 poornima...@rocketmail.com wrote:
 I have defined the fieldtype inside the fields section. When I checked the 
 error log I found the error below:

 Caused by: java.lang.ClassNotFoundException: 

Re: Korean Tokenizer in solr

2014-07-14 Thread Alexandre Rafalovitch
You sure it's not a spelling error or something else weird like
that? Because Solr ships with that filter in its example schema:
<filter class="solr.CJKBigramFilterFactory"/>

So, you can compare what you are doing differently with that.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay
poornima...@rocketmail.com wrote:

Re: Korean Tokenizer in solr

2014-07-14 Thread Poornima Jay
Yes. Below is my defined field type:

<fieldType name="text_match_phrase_cjk" class="solr.TextField" 
positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
        generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
        splitOnCaseChange="0" preserveOriginal="1"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
        generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
        splitOnCaseChange="0" preserveOriginal="1"/>
   </analyzer>
</fieldType>

Please correct me if I am doing anything wrong here.

Regards,
Poornima


On Monday, 14 July 2014 12:33 PM, Alexandre Rafalovitch arafa...@gmail.com 
wrote:
 



Re: Korean Tokenizer in solr

2014-07-14 Thread Alexandre Rafalovitch
What happens if you have a new collection with the absolute minimum in it
and then add the definition? Start from something like:
https://github.com/arafalov/simplest-solr-config

Also, is there a longer exception earlier in the log? It may have more clues.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Mon, Jul 14, 2014 at 2:15 PM, Poornima Jay
poornima...@rocketmail.com wrote:

Re: Korean Tokenizer in solr

2014-07-14 Thread Poornima Jay
When I try to index, the error below appears:

java.io.FileNotFoundException: 
/home/searchuser/multicore/apac_content/data/tlog/tlog.000 (No 
such file or directory)





On Monday, 14 July 2014 2:07 PM, Poornima Jay poornima...@rocketmail.com 
wrote:
 



Re: Solr irregularly having QTime 50000ms, stracing solr cures the problem

2014-07-14 Thread Harald Kirsch
Thanks IJ for the link. I am not sure this can solve my problem, because 
I have only one machine in play anyway.


Harald.

On 12.07.2014 20:49, IJ wrote:

Guess - I had the same issues as you:
http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-td4143681.html

It was resolved by adding an explicit host mapping entry in /etc/hosts for
inter-node Solr communication, thereby bypassing DNS lookups.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-irregularly-having-QTime-5ms-stracing-solr-cures-the-problem-tp4146047p4146858.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Reference numbers for major page fauls per seconds, index size, query throughput

2014-07-14 Thread Harald Kirsch

Hello Erik,

thanks for the reply. Indeed the CPUs are kind of idling during the load 
test. They are not at 20%, but they clearly don't get far beyond 40%.


Changing the number of threads in jmeter has minor effects only on the 
qps, but increases the average latency, as soon as the threads outnumber 
the CPUs --- expected behavior I would say.


I varied the number of results returned between 20 and 10 with no 
remarkable changes in performance.


I restricted to fl=id and even this increased the throughput only 
minimally (meanwhile the index has 16 million documents; an increase from 
2.x qps to 3). Jmeter reported a reduction in average transferred size from 
10 kBytes to 2.5 kBytes. This is not really the issue here, and in the end 
we need more than the IDs in production anyway.


What really bugs me currently is that htop reports an IORR (supposed to 
be read(2) calls) of between 100 and 200 MByte/s during the load test.


This somehow runs contrary to my understanding of why Solr uses mmapped 
files. There should be no read(2) calls and certainly not 200 MB/s :-/ 
And this did not drop when I restricted to fl=id.


I will try to check this with strace to see where it is reading from.

Hints appreciated. With a bit of luck, I'll get more RAM and can compare 
then.


Thanks,
Harald.


On 12.07.2014 17:58, Erick Erickson wrote:

If the stats you're reporting are during the load test, your CPU is
kind of idling along at  20% which supports your theory.

Just to cover all bases, when you bump the number of threads jmeter is
firing does it make any difference? And how many rows are you
returning? This latter is important because to return documents, Solr
needs to go out to disk, possibly generating your page faults
(guessing here).

One note about your index size: it's largely useless to measure the
index on disk, if for no other reason than the _stored_ data doesn't
really count towards memory requirements for search. The *.fdt and
*.fdx segment files contain the stored data, so subtract them out.

Speaking of which, try just returning the id (fl=id). That should
reduce the disk seeks due to assembling the docs.

But 4 qps for simple term queries seems very slow at first blush.

FWIW,
Erick

On Thu, Jul 10, 2014 at 7:30 AM, Harald Kirsch
harald.kir...@raytion.com wrote:

Hi everyone,

currently I am taking some performance measurements on a Solr installation
and I am trying to figure out if what I see mostly fits expectations:

The data is as follows:

- solr 4.8.1
- 8 millon documents
- mostly office documents with real text content, stored
- index size on disk 90G
- full index memory-mapped into virtual memory
- this is on a vmware server, 4 cores, 16 GB RAM

PID PR  NI  VIRT  RES  SHR S   %CPU %MEMTIME+  nFLT
961 20   0 93.9g  10g 6.0g S 19 64.5 718:39.81 757k

When I start running a jmeter query test sending requests as fast as possible
with a few threads, it peaks at about 4 qps with a real-world query replay
of mostly 1, 2, sometimes more terms.

What I see are around 150 to 200 major page faults per second, meaning that
Solr is not really happy with what happens to be in memory at any instant
in time.

My hunch is that this hints at a too small RAM footprint. Much more RAM is
needed to get the number of major page faults down.

Would anyone agree or disagree with this analysis? Or is someone out there
saying that 200 major page faults/second are normal and there must be
another problem?

Thanks,
Harald.






Re: Solr irregularly having QTime 50000ms, stracing solr cures the problem

2014-07-14 Thread Harald Kirsch
This problem seems to completely disappear under load. I started making 
load tests despite fearing they would be useless. It turns out that there 
are no more 50000 ms delays under load.


Harald.

On 09.07.2014 09:50, Harald Kirsch wrote:

Good point. I will see if I can get the necessary access rights on this
machine to run tcpdump.

Thanks for the suggestion,
Harald.

On 09.07.2014 00:32, Steve McKay wrote:

Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr
starts behaving strangely in a socket-related way. Knowing exactly
what's happening at the transport level is worth a month of guessing
and poking.

On Jul 8, 2014, at 3:53 AM, Harald Kirsch harald.kir...@raytion.com
wrote:


Hi all,

This is what happens when I run a regular wget query to log the
current number of documents indexed:

2014-07-08:07:23:28 QTime=20 numFound=5720168
2014-07-08:07:24:28 QTime=12 numFound=5721126
2014-07-08:07:25:28 QTime=19 numFound=5721126
2014-07-08:07:27:18 QTime=50071 numFound=5721126
2014-07-08:07:29:08 QTime=50058 numFound=5724494
2014-07-08:07:30:58 QTime=50033 numFound=5730710
2014-07-08:07:31:58 QTime=13 numFound=5730710
2014-07-08:07:33:48 QTime=50065 numFound=5734069
2014-07-08:07:34:48 QTime=16 numFound=5737742
2014-07-08:07:36:38 QTime=50037 numFound=5737742
2014-07-08:07:37:38 QTime=12 numFound=5738190
2014-07-08:07:38:38 QTime=23 numFound=5741208
2014-07-08:07:40:29 QTime=50034 numFound=5742067
2014-07-08:07:41:29 QTime=12 numFound=5742067
2014-07-08:07:42:29 QTime=17 numFound=5742067
2014-07-08:07:43:29 QTime=20 numFound=5745497
2014-07-08:07:44:29 QTime=13 numFound=5745981
2014-07-08:07:45:29 QTime=23 numFound=5746420

As you can see, the QTime is just over 50 seconds at irregular
intervals.

This happens independently of whether I am indexing documents at
around 20 dps or not. First I thought about a dependence on the
auto-commit of 5 minutes, but the 50-second hits are too irregular.

Furthermore, and this is *really strange*: when hooking strace onto the
solr process, the 50-second QTimes disappear completely and
consistently --- a real Heisenbug.

Nevertheless, strace shows that there is a socket timeout of 50
seconds defined in calls like this:

[pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
5) = 1 ([{fd=96, revents=POLLIN}]) 0.40

where the fd=96 is the result of

[pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
sin_port=htons(57236), sin_addr=inet_addr(ip address of local
host)}, [16]) = 96 0.54

where again fd=122 is the TCP port on which solr was started.

My hunch is that this is communication between the cores of solr.

I tried to search the internet for such a strange connection between
socket timeouts and strace, but could not find anything (the
stackoverflow entry from yesterday is my own :-( ).


This smells a bit like a race condition/deadlock kind of thing which
is broken up by timing differences introduced by stracing the process.

Any hints appreciated.

For completeness, here is my setup:
- solr-4.8.1,
- cloud version running
- 10 shards on 10 cores in one instance
- hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
PATCHLEVEL 2
- hosted on a vmware, 4 CPU cores, 16 GB RAM
- single digit million docs indexed, exact number does not matter
- zero query load


Harald.









Of, To, and Other Small Words

2014-07-14 Thread Teague James
Hello all,

I am working with Solr 4.9.0 and am searching for phrases that contain words
like "of" or "to" that Solr seems to be ignoring at index time. Here's what
I tried:

curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
--data-binary '<add><doc><field name="id">100</field><field
name="content">blah blah blah knowledge of science blah blah
blah</field></doc></add>'

Then, using a browser:

http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

I get zero hits. Search for "knowledge" or "science" and I'll get hits.
"knowledge of" or "of science" and I get zero hits. I don't want to use
proximity if I can avoid it, as this may introduce too many undesirable
results. Stopwords.txt is blank, yet clearly Solr is ignoring "of" and "to"
and possibly more words that I have not discovered through testing yet. Is
there some other configuration file that contains these small words? Is
there any way to force Solr to pay attention to them and not drop them from
the phrase? Any advice is appreciated! Thanks!

-Teague




Re: Of, To, and Other Small Words

2014-07-14 Thread Anshum Gupta
Hi Teague,

The StopFilterFactory (which I think you're using) by default uses
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is stopwords.txt. You could either empty
that file out or change the field type for your field.
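If the goal is to keep small words searchable, one option is a field type with no stop filter at all; a sketch (the type name is illustrative, not from the default schema):

```xml
<fieldType name="text_keepall" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- no StopFilterFactory, so "of", "to", etc. survive analysis -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```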


On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com wrote:
 Hello all,

 I am working with Solr 4.9.0 and am searching for phrases that contain words
 like of or to that Solr seems to be ignoring at index time. Here's what
 I tried:

 curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
 --data-binary '<add><doc><field name="id">100</field><field
 name="content">blah blah blah knowledge of science blah blah
 blah</field></doc></add>'

 Then, using a browser:

 http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

 I get zero hits. Search for knowledge or science and I'll get hits.
 knowledge of or of science and I get zero hits. I don't want to use
 proximity if I can avoid it, as this may introduce too many undesireable
 results. Stopwords.txt is blank, yet clearly Solr is ignoring of and to
 and possibly more words that I have not discovered through testing yet. Is
 there some other configuration file that contains these small words? Is
 there any way to force Solr to pay attention to them and not drop them from
 the phrase? Any advice is appreciated! Thanks!

 -Teague





-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Of, To, and Other Small Words

2014-07-14 Thread Jack Krupansky
Or, if you happen to leave off the "words" attribute of the stop filter (or 
misspell the attribute name), it will use the internal Lucene hardwired list 
of stop words.


-- Jack Krupansky

-Original Message- 
From: Anshum Gupta

Sent: Monday, July 14, 2014 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Hi Teague,

The StopFilterFactory (which I think you're using) by default uses
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is stopwords.txt. You could either empty
that file out or change the field type for your field.


On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com 
wrote:

Hello all,

I am working with Solr 4.9.0 and am searching for phrases that contain 
words
like of or to that Solr seems to be ignoring at index time. Here's 
what

I tried:

curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
--data-binary '<add><doc><field name="id">100</field><field
name="content">blah blah blah knowledge of science blah blah
blah</field></doc></add>'

Then, using a browser:

http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

I get zero hits. Search for knowledge or science and I'll get hits.
knowledge of or of science and I get zero hits. I don't want to use
proximity if I can avoid it, as this may introduce too many undesireable
results. Stopwords.txt is blank, yet clearly Solr is ignoring of and 
to

and possibly more words that I have not discovered through testing yet. Is
there some other configuration file that contains these small words? Is
there any way to force Solr to pay attention to them and not drop them 
from

the phrase? Any advice is appreciated! Thanks!

-Teague






--

Anshum Gupta
http://www.anshumgupta.net 



Strategies for effective prefix queries?

2014-07-14 Thread Hayden Muhl
I'm working on using Solr for autocompleting usernames. I'm running into a
problem with the wildcard queries (e.g. username:al*).

We are tokenizing usernames so that a username like "solr-user" will be
tokenized into "solr" and "user", and will match both "sol" and "use"
prefixes. The problem is when we get "solr-u" as a prefix, I'm having to
split that up on the client side before I construct a query "username:solr*
username:u*". I'm basically using a regex as a poor man's tokenizer.

Is there a better way to approach this? Is there a way to tell Solr to
tokenize a string and use the parts as prefixes?
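The client-side splitting I'm doing looks roughly like this (the split on non-alphanumeric runs is my assumption about what the index analyzer does):

```python
import re

def prefix_clauses(field, text):
    """Split the input the way the index-side tokenizer would (here: on any
    non-alphanumeric run) and turn each part into a wildcard clause."""
    parts = [p for p in re.split(r"[^a-z0-9]+", text.lower()) if p]
    return " ".join("%s:%s*" % (field, p) for p in parts)

print(prefix_clauses("username", "solr-u"))  # username:solr* username:u*
```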

- Hayden


RE: Of, To, and Other Small Words

2014-07-14 Thread Teague James
Hi Anshum,

Thanks for replying and suggesting this, but the field type I am using (a 
modified text_general) in my schema has the file set to 'stopwords.txt'. 

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true"
                expand="false"/> -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- CHANGE: The NGramFilterFactory was added to provide partial
             word search. This can be changed to EdgeNGramFilterFactory
             side="front" to only match front-sided partial searches if
             matching any part of a word is undesirable. -->
        <filter class="solr.NGramFilterFactory" minGramSize="3"
                maxGramSize="10" />
        <!-- CHANGE: The PorterStemFilterFactory was added to allow matches
             for 'cat' and 'cats' by searching for 'cat' -->
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- CHANGE: The PorterStemFilterFactory was added to allow matches
             for 'cat' and 'cats' by searching for 'cat' -->
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>

Just to be double sure I cleared the list in stopwords_en.txt, restarted Solr, 
re-indexed, and searched with still zero results. Any other suggestions on 
where I might be able to control this behavior?

-Teague


-Original Message-
From: Anshum Gupta [mailto:ans...@anshumgupta.net] 
Sent: Monday, July 14, 2014 4:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Hi Teague,

The StopFilterFactory (which I think you're using) by default uses 
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is stopwords.txt. You could either empty that file 
out or change the field type for your field.


On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com wrote:
 Hello all,

 I am working with Solr 4.9.0 and am searching for phrases that contain 
 words like of or to that Solr seems to be ignoring at index time. 
 Here's what I tried:

 curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
 --data-binary '<add><doc><field name="id">100</field><field
 name="content">blah blah blah knowledge of science blah blah
 blah</field></doc></add>'

 Then, using a browser:

 http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

 I get zero hits. Search for knowledge or science and I'll get hits.
 knowledge of or of science and I get zero hits. I don't want to 
 use proximity if I can avoid it, as this may introduce too many 
 undesireable results. Stopwords.txt is blank, yet clearly Solr is ignoring 
 of and to
 and possibly more words that I have not discovered through testing 
 yet. Is there some other configuration file that contains these small 
 words? Is there any way to force Solr to pay attention to them and not 
 drop them from the phrase? Any advice is appreciated! Thanks!

 -Teague





-- 

Anshum Gupta
http://www.anshumgupta.net



Re: Of, To, and Other Small Words

2014-07-14 Thread Alexandre Rafalovitch
Have you tried the Admin UI's Analyze screen? It will show you
what happens to the text as it progresses through the tokenizers and
filters. No need to reindex.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:10 AM, Teague James teag...@insystechinc.com wrote:
 Hi Anshum,

 Thanks for replying and suggesting this, but the field type I am using (a 
 modified text_general) in my schema has the file set to 'stopwords.txt'.

 <fieldType name="text_general" class="solr.TextField"
            positionIncrementGap="100">
     <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true" words="stopwords.txt" />
         <!-- in this example, we will only use synonyms at query time
         <filter class="solr.SynonymFilterFactory"
                 synonyms="index_synonyms.txt" ignoreCase="true"
                 expand="false"/> -->
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- CHANGE: The NGramFilterFactory was added to provide partial
              word search. This can be changed to EdgeNGramFilterFactory
              side="front" to only match front-sided partial searches if
              matching any part of a word is undesirable. -->
         <filter class="solr.NGramFilterFactory" minGramSize="3"
                 maxGramSize="10" />
         <!-- CHANGE: The PorterStemFilterFactory was added to allow
              matches for 'cat' and 'cats' by searching for 'cat' -->
         <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
     <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.SynonymFilterFactory"
                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- CHANGE: The PorterStemFilterFactory was added to allow
              matches for 'cat' and 'cats' by searching for 'cat' -->
         <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
 </fieldType>

 Just to be double sure I cleared the list in stopwords_en.txt, restarted 
 Solr, re-indexed, and searched with still zero results. Any other suggestions 
 on where I might be able to control this behavior?

 -Teague


 -Original Message-
 From: Anshum Gupta [mailto:ans...@anshumgupta.net]
 Sent: Monday, July 14, 2014 4:04 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Of, To, and Other Small Words

 Hi Teague,

 The StopFilterFactory (which I think you're using) by default uses 
 lang/stopwords_en.txt (which wouldn't be empty if you check).
 What you're looking at is stopwords.txt. You could either empty that file 
 out or change the field type for your field.


 On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com 
 wrote:
 Hello all,

 I am working with Solr 4.9.0 and am searching for phrases that contain
 words like of or to that Solr seems to be ignoring at index time.
 Here's what I tried:

 curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
 --data-binary '<add><doc><field name="id">100</field><field
 name="content">blah blah blah knowledge of science blah blah
 blah</field></doc></add>'

 Then, using a browser:

 http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

 I get zero hits. Search for knowledge or science and I'll get hits.
 knowledge of or of science and I get zero hits. I don't want to
 use proximity if I can avoid it, as this may introduce too many
 undesireable results. Stopwords.txt is blank, yet clearly Solr is ignoring 
 of and to
 and possibly more words that I have not discovered through testing
 yet. Is there some other configuration file that contains these small
 words? Is there any way to force Solr to pay attention to them and not
 drop them from the phrase? Any advice is appreciated! Thanks!

 -Teague





 --

 Anshum Gupta
 http://www.anshumgupta.net



Re: Strategies for effective prefix queries?

2014-07-14 Thread Alexandre Rafalovitch
Search against both fields (one split, one not split)? Keep original
and tokenized form? I am doing something similar with class name
autocompletes here:
https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24
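The two-field idea can be sketched in schema terms (field and type names here are hypothetical, not from the linked schema):

```xml
<!-- untokenized original, for whole-username prefix matching -->
<field name="username_raw" type="string" indexed="true" stored="true"/>
<!-- tokenized copy, for per-part prefix matching -->
<field name="username_parts" type="text_general" indexed="true" stored="false"/>
<copyField source="username_raw" dest="username_parts"/>
```

A query could then hit both, e.g. username_raw:solr-u* OR username_parts:u*, so the whole-string prefix and the per-token prefixes each get a chance to match.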

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:04 AM, Hayden Muhl haydenm...@gmail.com wrote:
 I'm working on using Solr for autocompleting usernames. I'm running into a
 problem with the wildcard queries (e.g. username:al*).

 We are tokenizing usernames so that a username like "solr-user" will be
 tokenized into "solr" and "user", and will match both "sol" and "use"
 prefixes. The problem is when we get "solr-u" as a prefix, I'm having to
 split that up on the client side before I construct a query "username:solr*
 username:u*". I'm basically using a regex as a poor man's tokenizer.

 Is there a better way to approach this? Is there a way to tell Solr to
 tokenize a string and use the parts as prefixes?

 - Hayden


RE: Of, To, and Other Small Words

2014-07-14 Thread Teague James
Jack,

Thanks for replying and the suggestion. I replied to another suggestion with my 
field type, and I do have <filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt" />. There's nothing in 
stopwords.txt. I even cleaned out stopwords_en.txt just to be certain. Any 
other suggestions on how to control this behavior?

-Teague

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Monday, July 14, 2014 4:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Or, if you happen to leave off the "words" attribute of the stop filter (or
misspell the attribute name), it will use the internal Lucene hardwired list of
stop words.

-- Jack Krupansky

-Original Message-
From: Anshum Gupta
Sent: Monday, July 14, 2014 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Hi Teague,

The StopFilterFactory (which I think you're using) by default uses 
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is stopwords.txt. You could either empty that file 
out or change the field type for your field.


On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com
wrote:
 Hello all,

 I am working with Solr 4.9.0 and am searching for phrases that contain 
 words like of or to that Solr seems to be ignoring at index time. 
 Here's what I tried:

 curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
 --data-binary '<add><doc><field name="id">100</field><field
 name="content">blah blah blah knowledge of science blah blah
 blah</field></doc></add>'

 Then, using a browser:

 http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

 I get zero hits. Search for knowledge or science and I'll get hits.
 knowledge of or of science and I get zero hits. I don't want to 
 use proximity if I can avoid it, as this may introduce too many 
 undesireable results. Stopwords.txt is blank, yet clearly Solr is 
 ignoring of and to
 and possibly more words that I have not discovered through testing 
 yet. Is there some other configuration file that contains these small 
 words? Is there any way to force Solr to pay attention to them and not 
 drop them from the phrase? Any advice is appreciated! Thanks!

 -Teague





-- 

Anshum Gupta
http://www.anshumgupta.net 



RE: Of, To, and Other Small Words

2014-07-14 Thread Teague James
Alex,

Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. 
Taking that out of the mix did it.
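Assuming the n-gram filter behaves like plain character n-grams, a standalone sketch (plain Python, not Solr's implementation) shows why that filter swallows the small words: with minGramSize=3, a two-letter token like "of" yields no grams at all and vanishes from the index.

```python
def ngrams(token, min_size=3, max_size=10):
    """All substrings of token whose length is between min_size and max_size."""
    return [token[i:i + n]
            for n in range(min_size, min(max_size, len(token)) + 1)
            for i in range(len(token) - n + 1)]

print(ngrams("of"))              # [] -- shorter than minGramSize, dropped entirely
print(ngrams("science", 3, 4))   # ['sci', 'cie', 'ien', 'enc', 'nce', 'scie', 'cien', 'ienc', 'ence']
```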

-Teague

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Monday, July 14, 2014 9:14 PM
To: solr-user
Subject: Re: Of, To, and Other Small Words

Have you tried the Admin UI's Analyze screen? It will show you what 
happens to the text as it progresses through the tokenizers and filters. No 
need to reindex.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: 
http://www.solr-start.com/ and @solrstart Solr popularizers community: 
https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:10 AM, Teague James teag...@insystechinc.com wrote:
 Hi Anshum,

 Thanks for replying and suggesting this, but the field type I am using (a 
 modified text_general) in my schema has the file set to 'stopwords.txt'.

 <fieldType name="text_general" class="solr.TextField"
            positionIncrementGap="100">
     <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true" words="stopwords.txt" />
         <!-- in this example, we will only use synonyms at query time
         <filter class="solr.SynonymFilterFactory"
                 synonyms="index_synonyms.txt" ignoreCase="true"
                 expand="false"/> -->
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- CHANGE: The NGramFilterFactory was added to provide partial
              word search. This can be changed to EdgeNGramFilterFactory
              side="front" to only match front-sided partial searches if
              matching any part of a word is undesirable. -->
         <filter class="solr.NGramFilterFactory" minGramSize="3"
                 maxGramSize="10" />
         <!-- CHANGE: The PorterStemFilterFactory was added to allow
              matches for 'cat' and 'cats' by searching for 'cat' -->
         <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
     <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.SynonymFilterFactory"
                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- CHANGE: The PorterStemFilterFactory was added to allow
              matches for 'cat' and 'cats' by searching for 'cat' -->
         <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
 </fieldType>

 Just to be double sure I cleared the list in stopwords_en.txt, restarted 
 Solr, re-indexed, and searched with still zero results. Any other suggestions 
 on where I might be able to control this behavior?

 -Teague


 -Original Message-
 From: Anshum Gupta [mailto:ans...@anshumgupta.net]
 Sent: Monday, July 14, 2014 4:04 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Of, To, and Other Small Words

 Hi Teague,

 The StopFilterFactory (which I think you're using) by default uses 
 lang/stopwords_en.txt (which wouldn't be empty if you check).
 What you're looking at is stopwords.txt. You could either empty that file 
 out or change the field type for your field.


 On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com 
 wrote:
 Hello all,

 I am working with Solr 4.9.0 and am searching for phrases that 
 contain words like of or to that Solr seems to be ignoring at index time.
 Here's what I tried:

 curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
 --data-binary '<add><doc><field name="id">100</field><field
 name="content">blah blah blah knowledge of science blah blah
 blah</field></doc></add>'

 Then, using a browser:

 http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

 I get zero hits. Search for knowledge or science and I'll get hits.
 knowledge of or of science and I get zero hits. I don't want to 
 use proximity if I can avoid it, as this may introduce too many 
 undesireable results. Stopwords.txt is blank, yet clearly Solr is ignoring 
 of and to
 and possibly more words that I have not discovered through testing 
 yet. Is there some other configuration file that contains these small 
 words? Is there any way to force Solr to pay attention to them and 
 not drop them from the phrase? Any advice is appreciated! Thanks!

 -Teague





 --

 Anshum Gupta
 http://www.anshumgupta.net




Re: Of, To, and Other Small Words

2014-07-14 Thread Alexandre Rafalovitch
You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilter (slightly different). There is actually a lot
of cool analyzers bundled with Solr. You can find full list on my site
at: http://www.solr-start.com/info/analyzers
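A CommonGrams setup generally pairs the two filters, one per analyzer chain; a sketch (the word file name is an assumption):

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- at index time, also emit word pairs such as "knowledge_of" -->
  <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt"
          ignoreCase="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- at query time, keep only the grams for phrase queries -->
  <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt"
          ignoreCase="true"/>
</analyzer>
```

The effect is that common words like "of" survive inside phrases as bigrams instead of being dropped, without bloating the index with every standalone occurrence.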

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:42 AM, Teague James teag...@insystechinc.com wrote:
 Alex,

 Thanks! Great suggestion. I figured out that it was the 
 EdgeNGramFilterFactory. Taking that out of the mix did it.

 -Teague

 -Original Message-
 From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
 Sent: Monday, July 14, 2014 9:14 PM
 To: solr-user
 Subject: Re: Of, To, and Other Small Words

 Have you tried the Admin UI's Analyze screen? It will show you what 
 happens to the text as it progresses through the tokenizers and filters. No 
 need to reindex.

 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: 
 http://www.solr-start.com/ and @solrstart Solr popularizers community: 
 https://www.linkedin.com/groups?gid=6713853


 On Tue, Jul 15, 2014 at 8:10 AM, Teague James teag...@insystechinc.com 
 wrote:
 Hi Anshum,

 Thanks for replying and suggesting this, but the field type I am using (a 
 modified text_general) in my schema has the file set to 'stopwords.txt'.

 <fieldType name="text_general" class="solr.TextField"
            positionIncrementGap="100">
     <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true" words="stopwords.txt" />
         <!-- in this example, we will only use synonyms at query time
         <filter class="solr.SynonymFilterFactory"
                 synonyms="index_synonyms.txt" ignoreCase="true"
                 expand="false"/> -->
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- CHANGE: The NGramFilterFactory was added to provide partial
              word search. This can be changed to EdgeNGramFilterFactory
              side="front" to only match front-sided partial searches if
              matching any part of a word is undesirable. -->
         <filter class="solr.NGramFilterFactory" minGramSize="3"
                 maxGramSize="10" />
         <!-- CHANGE: The PorterStemFilterFactory was added to allow
              matches for 'cat' and 'cats' by searching for 'cat' -->
         <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
     <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.SynonymFilterFactory"
                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- CHANGE: The PorterStemFilterFactory was added to allow
              matches for 'cat' and 'cats' by searching for 'cat' -->
         <filter class="solr.PorterStemFilterFactory"/>
     </analyzer>
 </fieldType>

 Just to be double sure I cleared the list in stopwords_en.txt, restarted 
 Solr, re-indexed, and searched with still zero results. Any other 
 suggestions on where I might be able to control this behavior?

 -Teague


 -Original Message-
 From: Anshum Gupta [mailto:ans...@anshumgupta.net]
 Sent: Monday, July 14, 2014 4:04 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Of, To, and Other Small Words

 Hi Teague,

 The StopFilterFactory (which I think you're using) by default uses 
 lang/stopwords_en.txt (which wouldn't be empty if you check).
 What you're looking at is stopwords.txt. You could either empty that file 
 out or change the field type for your field.


 On Mon, Jul 14, 2014 at 12:53 PM, Teague James teag...@insystechinc.com 
 wrote:
 Hello all,

 I am working with Solr 4.9.0 and am searching for phrases that
 contain words like of or to that Solr seems to be ignoring at index 
 time.
 Here's what I tried:

 curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
 --data-binary '<add><doc><field name="id">100</field><field
 name="content">blah blah blah knowledge of science blah blah
 blah</field></doc></add>'

 Then, using a browser:

 http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

 I get zero hits. Search for knowledge or science and I'll get hits.
 knowledge of or of science and I get zero hits. I don't want to
 use proximity if I can avoid it, as this may introduce too many
 undesireable results. Stopwords.txt is blank, yet clearly Solr is ignoring 
 of and to
 and possibly more words that I have not discovered through testing
 yet. Is there some other configuration file 

Re: External File Field eating memory

2014-07-14 Thread Apoorva Gaurav
Hey Kamal,
What config changes have you made to set up replication of the external
files, and how have you disabled core reloading?


On Wed, Jul 9, 2014 at 11:30 AM, Kamal Kishore Aggarwal 
kkroyal@gmail.com wrote:

 Hi All,

 It was found that the external file, which was getting replicated every
 10 minutes, was reloading the core as well. This was increasing the query
 time.

 Thanks
 Kamal Kishore



 On Thu, Jul 3, 2014 at 12:48 PM, Kamal Kishore Aggarwal 
 kkroyal@gmail.com wrote:

  With the above replication configuration, the eff file is getting
  replicated to core/conf/data/external_eff_views (a new "data" dir is
  being created inside conf), but it is not getting replicated to
  core/data/external_eff_views on the slave.
 
  Please help.
 
 
  On Thu, Jul 3, 2014 at 12:21 PM, Kamal Kishore Aggarwal 
  kkroyal@gmail.com wrote:
 
  Thanks for your guidance Alexandre Rafalovitch.
 
  I am looking into this seriously.
 
  Another question is that I am facing an error in the replication of the eff file.
 
  This is master replication configuration:
 
  core/conf/solrconfig.xml
 
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <str name="confFiles">../data/external_eff_views</str>
    </lst>
  </requestHandler>
 
 
  The eff file is present at core/data/external_eff_views location.
 
 
  On Thu, Jul 3, 2014 at 11:50 AM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  This might be related:
 
  https://issues.apache.org/jira/browse/SOLR-3514
 
 
  On Sat, Jun 28, 2014 at 5:34 PM, Kamal Kishore Aggarwal 
  kkroyal@gmail.com wrote:
 
   Hi Team,
  
   I have recently implemented EFF in Solr. There are about 1.5
   lakh (150,000) unsorted values in the external file. After this
   implementation, the server has become slow. The Solr query time
   has also increased.
  
   Can anybody confirm whether these issues are because of this
   implementation? Is it memory that the EFF eats up?
  
   Regards
   Kamal Kishore
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
 
 




-- 
Thanks & Regards,
Apoorva