Re: Korean Tokenizer in solr
I have upgraded Solr to 4.8.1, but after making changes in the schema file I am getting the error below:

  Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'

I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working? Please advise.

Regards,
Poornima

On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

I would suggest you read through all 12 (?) articles in this series: http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html . It will probably lay out most of the issues for you. And if you are starting out, I would really suggest using the latest Solr (4.9). A lot more people remember what the latest version has than what was in 3.6. And, as the series above will tell you, some relevant issues have been fixed in more recent Solr versions.

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay poornima...@rocketmail.com wrote:

Till now I was thinking Solr would support a Korean tokenizer; I haven't used any third-party one. The issue I am facing is that I need to integrate English, Chinese, Japanese, and Korean language search in a single site. The fields will be queried based on the language the user selects for the search. I tried using the CJK type for all three languages, as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
  <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="1" autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
    </analyzer>
  </fieldtype>

So I tried to implement an individual field type for each language, as below.

Chinese:

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>

Japanese:

  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt"/>
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Korean:

  <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
    <analyzer type="index">
      <tokenizer class="solr.KoreanTokenizerFactory"/>
      <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KoreanTokenizerFactory"/>
      <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
    </analyzer>
  </fieldType>

I am really stuck on how to implement this. Please help me.

Thanks,
Poornima

On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

I don't think Solr ships with a Korean tokenizer, does it? If you are using a third-party one, you need to give the full class name, not just solr.Korean..., and you need the library added via a lib statement in solrconfig.xml (at least in Solr 4).

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay poornima...@rocketmail.com wrote:

I have defined the fieldtype inside the fields section. When I checked the error log I found the error below:

  Caused by: java.lang.ClassNotFoundException:
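Alex's point about the lib statement would look something like this in solrconfig.xml. Note that the directory and jar-name pattern below are placeholders, not the actual location of any Korean analyzer library:

```xml
<!-- solrconfig.xml: make third-party analyzer classes visible to Solr.
     Both dir and regex here are hypothetical example values; point them
     at wherever the analyzer jar actually lives. -->
<lib dir="/opt/solr-libs/korean-analyzer" regex=".*\.jar"/>
```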
Re: Korean Tokenizer in solr
Are you sure it's not a spelling error or something else weird like that? Because Solr ships with that filter in its example schema:

  <filter class="solr.CJKBigramFilterFactory"/>

So you can compare what you are doing differently with that.

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
Re: Korean Tokenizer in solr
Yes. Below is my defined field type:

  <fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    </analyzer>
  </fieldType>

Please correct me if I am doing anything wrong here.

Regards,
Poornima
Re: Korean Tokenizer in solr
What happens if you create a new collection with the absolute minimum in it and then add the definition? Start from something like: https://github.com/arafalov/simplest-solr-config . Also, is there a longer exception earlier in the log? It may have more clues.

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
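The minimal-config approach Alex suggests could be sketched like this: a bare schema.xml with just a key field, to which the failing fieldType is then added. This is a hypothetical sketch, not the contents of the simplest-solr-config repository:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch of a schema: one required key field. Add the suspect
     fieldType to this known-good baseline to isolate the instantiation error. -->
<schema name="minimal" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField"/>
  </types>
</schema>
```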
Re: Korean Tokenizer in solr
When I try to index, the error below comes up:

  java.io.FileNotFoundException: /home/searchuser/multicore/apac_content/data/tlog/tlog.000 (No such file or directory)
Re: Solr irregularly having QTime 50000ms, stracing solr cures the problem
Thanks IJ for the link. I am not sure this can solve my problem, because I have only one machine in play anyway.

Harald.

On 12.07.2014 20:49, IJ wrote:

Guess I had the same issues as you. Mine (see http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-td4143681.html) were resolved by adding an explicit host mapping entry in /etc/hosts for inter-node Solr communication, thereby bypassing DNS lookups.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-irregularly-having-QTime-5ms-stracing-solr-cures-the-problem-tp4146047p4146858.html
Sent from the Solr - User mailing list archive at Nabble.com.
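The /etc/hosts change IJ describes can be sketched as follows. The hostnames and addresses are placeholders, not values from this thread:

```
# /etc/hosts on each Solr node: map the other nodes' hostnames directly
# to their IPs so inter-node requests skip DNS lookups entirely.
10.0.0.11  solr-node1.example.com  solr-node1
10.0.0.12  solr-node2.example.com  solr-node2
```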
Re: Reference numbers for major page faults per second, index size, query throughput
Hello Erick, thanks for the reply. Indeed the CPUs are kind of idling during the load test. They are not at 20%, but they clearly don't get far beyond 40%. Changing the number of threads in jmeter has only minor effects on the qps, but it increases the average latency as soon as the threads outnumber the CPUs --- expected behavior, I would say.

I varied the number of results returned between 20 and 10 with no remarkable changes in performance. I restricted to fl=id, and even this increased the throughput only minimally (meanwhile the index has 16 million documents; an increase from 2.x qps to 3). Jmeter reported a reduction in average transferred size from 10 kBytes to 2.5 kBytes. This is not really the issue here, and in the end we need more than the IDs in production anyway.

What really bugs me currently is that htop reports an IORR (supposed to be read(2) calls) of between 100 and 200 MByte/s during the load test. This runs contrary to my understanding of why Solr uses mmapped files: there should be no read(2) calls, and certainly not 200 MB/s :-/ And this did not drop when I restricted to fl=id. I will try to check with strace to see where it is reading from. Hints appreciated. With a bit of luck, I'll get more RAM and can compare then.

Thanks,
Harald.

On 12.07.2014 17:58, Erick Erickson wrote:

If the stats you're reporting are from during the load test, your CPU is kind of idling along at 20%, which supports your theory. Just to cover all bases: when you bump the number of threads jmeter is firing, does it make any difference? And how many rows are you returning? The latter is important because, to return documents, Solr needs to go out to disk, possibly generating your page faults (guessing here). One note about your index size: it's largely useless to measure the index on disk, if for no other reason than that the _stored_ data doesn't really count towards memory requirements for search. The *.fdt and *.fdx segment files contain the stored data, so subtract them out. Speaking of which, try returning just the id (fl=id). That should reduce the disk seeks due to assembling the docs. But 4 qps for simple term queries seems very slow at first blush.

FWIW,
Erick

On Thu, Jul 10, 2014 at 7:30 AM, Harald Kirsch harald.kir...@raytion.com wrote:

Hi everyone, currently I am taking some performance measurements on a Solr installation and I am trying to figure out whether what I see mostly fits expectations. The data is as follows:

- solr 4.8.1
- 8 million documents
- mostly office documents with real text content, stored
- index size on disk 90G
- full index memory-mapped into virtual memory
- this is on a vmware server, 4 cores, 16 GB RAM

  PID  PR  NI  VIRT   RES  SHR   S  %CPU  %MEM  TIME+      nFLT
  961  20  0   93.9g  10g  6.0g  S  19    64.5  718:39.81  757k

When I run a jmeter query test sending requests as fast as possible with a few threads, it peaks at about 4 qps with a real-world query replay of mostly 1, 2, sometimes more terms. What I see is around 150 to 200 major page faults per second, meaning that Solr is not really happy with what happens to be in memory at any instant in time. My hunch is that this hints at a too-small RAM footprint: much more RAM is needed to get the number of major page faults down. Would anyone agree or disagree with this analysis? Or is someone out there saying 200 major page faults/second are normal, there must be another problem?

Thanks,
Harald.
Re: Solr irregularly having QTime 50000ms, stracing solr cures the problem
This problem seems to completely disappear under load. I started making load tests despite fearing they would be useless. It turns out that there are no more 50000 ms delays under load.

Harald.

On 09.07.2014 09:50, Harald Kirsch wrote:

Good point. I will see if I can get the necessary access rights on this machine to run tcpdump.

Thanks for the suggestion,
Harald.

On 09.07.2014 00:32, Steve McKay wrote:

Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr starts behaving strangely in a socket-related way. Knowing exactly what's happening at the transport level is worth a month of guessing and poking.

On Jul 8, 2014, at 3:53 AM, Harald Kirsch harald.kir...@raytion.com wrote:

Hi all,

This is what happens when I run a regular wget query to log the current number of documents indexed:

  2014-07-08:07:23:28 QTime=20    numFound=5720168
  2014-07-08:07:24:28 QTime=12    numFound=5721126
  2014-07-08:07:25:28 QTime=19    numFound=5721126
  2014-07-08:07:27:18 QTime=50071 numFound=5721126
  2014-07-08:07:29:08 QTime=50058 numFound=5724494
  2014-07-08:07:30:58 QTime=50033 numFound=5730710
  2014-07-08:07:31:58 QTime=13    numFound=5730710
  2014-07-08:07:33:48 QTime=50065 numFound=5734069
  2014-07-08:07:34:48 QTime=16    numFound=5737742
  2014-07-08:07:36:38 QTime=50037 numFound=5737742
  2014-07-08:07:37:38 QTime=12    numFound=5738190
  2014-07-08:07:38:38 QTime=23    numFound=5741208
  2014-07-08:07:40:29 QTime=50034 numFound=5742067
  2014-07-08:07:41:29 QTime=12    numFound=5742067
  2014-07-08:07:42:29 QTime=17    numFound=5742067
  2014-07-08:07:43:29 QTime=20    numFound=5745497
  2014-07-08:07:44:29 QTime=13    numFound=5745981
  2014-07-08:07:45:29 QTime=23    numFound=5746420

As you can see, the QTime is just over 50 seconds at irregular intervals. This happens independent of whether I am indexing documents (at around 20 dps) or not. First I thought of a dependence on the auto-commit interval of 5 minutes, but the 50-second hits are too irregular.

Furthermore, and this is *really strange*: when hooking strace onto the solr process, the 50-second QTimes disappear completely and consistently --- a real Heisenbug. Nevertheless, strace shows that there is a socket timeout of 50 seconds defined in calls like this:

  [pid 1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1, 5) = 1 ([{fd=96, revents=POLLIN}]) 0.40

where fd=96 is the result of

  [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET, sin_port=htons(57236), sin_addr=inet_addr(ip address of local host)}, [16]) = 96 0.54

where again fd=122 is the TCP port on which solr was started. My hunch is that this is communication between the cores of solr. I tried to search the internet for such a strange connection between socket timeouts and strace, but could not find anything (the stackoverflow entry from yesterday is my own :-( This smells a bit like a race condition/deadlock kind of thing which is broken up by the timing differences introduced by stracing the process. Any hints appreciated.

For completeness, here is my setup:
- solr-4.8.1
- cloud version running
- 10 shards on 10 cores in one instance
- hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11, PATCHLEVEL 2
- hosted on a vmware, 4 CPU cores, 16 GB RAM
- single-digit million docs indexed; the exact number does not matter
- zero query load

Harald.
Of, To, and Other Small Words
Hello all,

I am working with Solr 4.9.0 and am searching for phrases that contain words like "of" or "to", which Solr seems to be ignoring at index time. Here's what I tried:

  curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'

Then, using a browser:

  http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

I get zero hits. Search for "knowledge" or "science" and I'll get hits. "knowledge of" or "of science" and I get zero hits. I don't want to use proximity if I can avoid it, as this may introduce too many undesirable results.

stopwords.txt is blank, yet clearly Solr is ignoring "of" and "to", and possibly more words that I have not discovered through testing yet. Is there some other configuration file that contains these small words? Is there any way to force Solr to pay attention to them and not drop them from the phrase?

Any advice is appreciated! Thanks!

-Teague
Re: Of, To, and Other Small Words
Hi Teague,

The StopFilterFactory (which I think you're using) by default uses lang/stopwords_en.txt (which won't be empty if you check). What you're looking at is stopwords.txt. You could either empty the lang/stopwords_en.txt file out or change the field type for your field.

--
Anshum Gupta
http://www.anshumgupta.net
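As a sketch of the second option Anshum mentions, a field type that keeps small words could simply omit the StopFilterFactory. The type and tokenizer names below are illustrative, not taken from Teague's schema:

```xml
<!-- Hypothetical field type with no StopFilterFactory, so "of" and "to"
     are indexed like any other token. Names here are example values. -->
<fieldType name="text_keep_stopwords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```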
Re: Of, To, and Other Small Words
Or, if you happen to leave off the words attribute of the stop filter (or misspell the attribute name), it will use the internal Lucene hardwired list of stop words.

-- Jack Krupansky
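A minimal sketch of Jack's point, with an illustrative fieldType name: when the words attribute is present, spelled correctly, and points at a (possibly empty) file, Solr uses that file rather than Lucene's built-in English list.

```xml
<!-- Sketch only: "text_nostop" is an illustrative name, not from the thread. -->
<fieldType name="text_nostop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- words= must be present and spelled correctly; if it is missing
         or misspelled, the hardwired default stopword list is used -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```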
Strategies for effective prefix queries?
I'm working on using Solr for autocompleting usernames. I'm running into a problem with wildcard queries (e.g. username:al*). We are tokenizing usernames, so a username like "solr-user" will be tokenized into "solr" and "user", and will match both "sol" and "use" prefixes. The problem is when we get "solr-u" as a prefix: I'm having to split that up on the client side before I construct the query username:solr* username:u*. I'm basically using a regex as a poor man's tokenizer.

Is there a better way to approach this? Is there a way to tell Solr to tokenize a string and use the parts as prefixes?

- Hayden
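One common server-side alternative to splitting the prefix in the client is to do the edge-ngram work at index time, so a plain (non-wildcard) query on the raw prefix matches. A hedged sketch, with illustrative field-type name and gram sizes (not from the thread):

```xml
<!-- Sketch: index-time edge ngrams so a prefix like "solr-u" matches
     without wildcards. Name and sizes are illustrative. -->
<fieldType name="username_ac" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- splits "solr-user" into "solr" and "user" -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this shape, the query analyzer tokenizes "solr-u" into "solr" and "u", each of which matches the index-time grams, so no client-side regex is needed.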
RE: Of, To, and Other Small Words
Hi Anshum,

Thanks for replying and suggesting this, but the field type I am using (a modified text_general) in my schema has the file set to 'stopwords.txt':

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- CHANGE: The NGramFilterFactory was added to provide partial word search.
         This can be changed to EdgeNGramFilterFactory side="front" to only match
         front-sided partial searches if matching any part of a word is undesirable. -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
    <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat'
         and 'cats' by searching for 'cat' -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat'
         and 'cats' by searching for 'cat' -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Just to be doubly sure, I cleared the list in stopwords_en.txt, restarted Solr, re-indexed, and searched, still with zero results. Any other suggestions on where I might be able to control this behavior?
-Teague
Re: Of, To, and Other Small Words
Have you tried the Admin UI's Analyze screen? It will show you what happens to the text as it progresses through the tokenizers and filters. No need to reindex.

Regards, Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
Re: Strategies for effective prefix queries?
Search against both fields (one split, one not split)? Keep original and tokenized form? I am doing something similar with class name autocompletes here: https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24

Regards, Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
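The "both fields" idea can be sketched roughly as below; the field names are illustrative (see the linked schema for a real working example), and the query comment is an assumption about how the two fields would be combined:

```xml
<!-- Sketch: keep the raw form and the tokenized form side by side.
     Field names are illustrative, not from the thread. -->
<field name="username_raw"   type="string"       indexed="true" stored="true"/>
<field name="username_split" type="text_general" indexed="true" stored="false"/>
<copyField source="username_raw" dest="username_split"/>
<!-- then query both, e.g. username_raw:solr-u* OR username_split:u* ,
     so the raw field catches the hyphenated prefix and the split
     field catches the individual tokens -->
```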
RE: Of, To, and Other Small Words
Jack,

Thanks for replying and the suggestion. I replied to another suggestion with my field type, and I do have <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />. There's nothing in stopwords.txt. I even cleaned out stopwords_en.txt just to be certain. Any other suggestions on how to control this behavior?

-Teague
RE: Of, To, and Other Small Words
Alex,

Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. Taking that out of the mix did it.

-Teague
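A plausible reading of Teague's fix above (an assumption; the thread does not spell it out): with minGramSize="3", tokens shorter than three characters, such as "of" and "to", produce no grams at index time and so never reach the index, which looks exactly like stopword removal. If partial-word search is still wanted, one hedged option is to lower the minimum rather than remove the filter entirely:

```xml
<!-- Sketch: minGramSize="2" keeps two-letter words like "of" indexable;
     this trades a larger index for broader coverage. Illustrative values. -->
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="10"/>
```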
Re: Of, To, and Other Small Words
You could try experimenting with CommonGramsFilterFactory and CommonGramsQueryFilter (slightly different). There are actually a lot of cool analyzers bundled with Solr. You can find the full list on my site at: http://www.solr-start.com/info/analyzers

Regards, Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
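A hedged sketch of the CommonGrams pairing mentioned above (the words file name is reused from the thread; the analyzer shape is illustrative): the index side forms bigrams of common words with their neighbors, and the query side emits matching bigrams for phrase queries, so phrases like "knowledge of science" stay searchable without indexing bare stopwords.

```xml
<!-- Sketch: bigrams for common words at index time, with the
     matching query-side variant at query time. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```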
Re: External File Field eating memory
Hey Kamal,

What config changes have you made to set up replication of the external files, and how have you disabled core reloading?

On Wed, Jul 9, 2014 at 11:30 AM, Kamal Kishore Aggarwal kkroyal@gmail.com wrote:
Hi All, It was found that the external file, which was getting replicated every 10 minutes, was reloading the core as well. This was increasing the query time. Thanks, Kamal Kishore

On Thu, Jul 3, 2014 at 12:48 PM, Kamal Kishore Aggarwal kkroyal@gmail.com wrote:
With the above replication configuration, the eff file is getting replicated to core/conf/data/external_eff_views (a new "data" dir is being created in the conf dir), but it is not getting replicated to core/data/external_eff_views on the slave. Please help.

On Thu, Jul 3, 2014 at 12:21 PM, Kamal Kishore Aggarwal kkroyal@gmail.com wrote:
Thanks for your guidance, Alexandre Rafalovitch. I am looking into this seriously. Another question is that I am facing an error in replication of the eff file. This is the master replication configuration (core/conf/solrconfig.xml):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">../data/external_eff_views</str>
  </lst>
</requestHandler>

The eff file is present at the core/data/external_eff_views location.

On Thu, Jul 3, 2014 at 11:50 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
This might be related: https://issues.apache.org/jira/browse/SOLR-3514

On Sat, Jun 28, 2014 at 5:34 PM, Kamal Kishore Aggarwal kkroyal@gmail.com wrote:
Hi Team, I have recently implemented EFF in Solr. There are about 1.5 lakh (unsorted) values in the external file. After this implementation, the server has become slow. The Solr query time has also increased. Can anybody confirm whether these issues are because of this implementation? Is it memory that EFF eats up? Regards, Kamal Kishore

-- Regards, Shalin Shekhar Mangar.
-- Thanks & Regards, Apoorva
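For context on the thread above, a typical ExternalFileField declaration in a 4.x schema looks roughly like this; the type name and defVal are illustrative, not taken from Kamal's config. The data file in the index data directory is named external_<fieldname>, which is consistent with the external_eff_views file mentioned above mapping to a field called eff_views:

```xml
<!-- Sketch: an ExternalFileField whose per-document values live in
     data/external_eff_views, keyed by the "id" field. -->
<fieldType name="eff_views_type" class="solr.ExternalFileField"
           keyField="id" defVal="0"/>
<field name="eff_views" type="eff_views_type" indexed="false" stored="false"/>
```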