RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
I have gotten nearly everything to work. There are two queries where I don't get back what I want:

  "avaloq frage 1" - only returns results if I set minGramSize=1 while indexing
  "yh_cug" - the query analyzer doesn't remove the underscore but the index analyzer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype:

    <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      </analyzer>
    </fieldType>
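One possible explanation for the first problem, sketched in Python rather than Solr: if the n-gram filter emits nothing for a token shorter than minGramSize (an assumption about the filter in this Solr version, worth confirming on the Admin UI Analysis page), then the term "1" never reaches the index, and with q.op=AND a query that requires it cannot match. The function name ngrams is made up for this illustration and is not a Solr API.

```python
def ngrams(token, min_size, max_size):
    """Return every substring of token whose length is in [min_size, max_size].

    Toy model of an n-gram filter (assumption: tokens shorter than
    min_size produce no output at all).
    """
    return [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_size, max_size + 1)
        if start + size <= len(token)
    ]


words = ["avaloq", "frage", "1"]

# minGramSize=3: the one-character word yields no grams.
grams_min3 = {w: ngrams(w, 3, 15) for w in words}
# minGramSize=1: the same word survives as the gram "1".
grams_min1 = {w: ngrams(w, 1, 15) for w in words}

print(grams_min3["1"])  # []
print(grams_min1["1"])  # ['1']
```

If this model holds for the real filter, it would explain why the query only matches with minGramSize=1. Keeping an un-grammed copy of the text in a second field is one common way around it, but that is outside what this sketch shows.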
RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards
I now have the following:

    <analyzer type="query">
      <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>

The GUI analysis shows me that WDF doesn't cut the underscore anymore, but the query still returns 0 results.

Output:

    <lst name="debug">
      <str name="rawquerystring">yh_cug</str>
      <str name="querystring">yh_cug</str>
      <str name="parsedquery">(+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord</str>
      <str name="parsedquery_toString">+(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0</str>
      <lst name="explain"/>
      <arr name="expandedSynonyms">
        <str>yh_cug</str>
      </arr>
      <lst name="reasonForNotExpandingSynonyms">
        <str name="name">DidntFindAnySynonyms</str>
        <str name="explanation">No synonyms found for this query. Check your synonyms file.</str>
      </lst>
      <lst name="mainQueryParser">
        <str name="QParser">ExtendedDismaxQParser</str>
        <null name="altquerystring"/>
        <arr name="boost_queries">
          <str>(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
        </arr>
        <arr name="parsed_boost_queries">
          <str>(expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0</str>
        </arr>
        <arr name="boostfuncs">
          <str>div(clicks,max(displays,1))^8</str>
        </arr>
      </lst>
      <lst name="synonymQueryParser">
        <str name="QParser">ExtendedDismaxQParser</str>
        <null name="altquerystring"/>
        <arr name="boostfuncs">
          <str>div(clicks,max(displays,1))^8</str>
        </arr>
      </lst>
      <lst name="timing">
Re: NOT SOLVED searches for single char tokens instead of from 3 uppwards
You didn't show the new index analyzer. It's tricky to ensure that index and query analysis are compatible, but the Admin UI Analysis page can help.

Generally, using pure defaults for WDF is not what you want, especially at query time. Usually there needs to be a slight asymmetry between index and query for WDF: the index side generates more terms than the query side.

-- Jack Krupansky
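The asymmetry described above can be pictured with a toy model in Python. This is only a sketch under simplifying assumptions: wdf_parts is a made-up stand-in for the word delimiter filter that splits on "-" and "_" and optionally adds the joined form, and it ignores everything else the real filter does.

```python
def wdf_parts(token, catenate):
    """Toy word-delimiter step (not Solr code): split on '-' and '_',
    and optionally add the concatenation of the parts."""
    parts = [p for p in token.replace("-", "_").split("_") if p]
    terms = set(parts)
    if catenate and len(parts) > 1:
        terms.add("".join(parts))
    return terms


# Index side generates more terms: the parts plus the joined form.
index_terms = wdf_parts("yh_cug", catenate=True)
# Query side generates fewer terms: the parts only.
query_terms = wdf_parts("yh_cug", catenate=False)

# Every query term also exists on the index side, so the query can match.
print(query_terms <= index_terms)

# A query analyzer that keeps the underscore produces a term the index
# never contained, so nothing matches.
unsplit_query_terms = {"yh_cug"}
print(unsplit_query_terms <= index_terms)
```

The point of the model is only that a match needs the query-side terms to be among the index-side terms; the Analysis page shows the real term lists for both sides.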
Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest:

    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <!-- Case insensitive stop word removal.
           enablePositionIncrements=true ensures that a 'gap' is left to
           allow for accurate phrase queries.
      -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
Hi Jack,

do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output, but Solr doesn't parse them.

    <lst name="invariants">
      <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
    </lst>
Re: SOLVED searches for single char tokens instead of from 3 uppwards
Sorry, I looked at the wrong field type.

-Original Message-
From: Andreas Owen a...@conx.ch
To: solr-user@lucene.apache.org
Date: 11/03/2014 08:45
Subject: searches for single char tokens instead of from 3 uppwards

I have a field with the following type:

    <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      </analyzer>
    </fieldType>

Shouldn't this make tokens from 3 to 15 characters in length, and not from 1? Here is a query report of 2 results:

    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">125</int>
      <lst name="params">
        <str name="debugQuery">true</str>
        <str name="fl">title,roles,organisations,id</str>
        <str name="indent">true</str>
        <str name="q">yh_cugtest</str>
        <str name="_">1394522589347</str>
        <str name="wt">xml</str>
        <str name="fq">organisations:* roles:*</str>
      </lst>
    </lst>
    <result name="response" numFound="5" start="0">
    ..
<str name="dms:2681">
1.6365329 = (MATCH) sum of:
  1.6346203 = (MATCH) max of:
    0.14759353 = (MATCH) product of:
      0.28596246 = (MATCH) sum of:
        0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of:
          0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.035319194 = queryWeight, product of:
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.0063751927 = queryNorm
            0.43282017 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.0119499 = (MATCH) weight(plain_text:ugt in 0) [DefaultSimilarity], result of:
          0.0119499 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.031227252 = queryWeight, product of:
              4.8982444 = idf(docFreq=18, maxDocs=937)
              0.0063751927 = queryNorm
            0.38267535 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              4.8982444 = idf(docFreq=18, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.019351374 = (MATCH) weight(plain_text:yhc in 0) [DefaultSimilarity], result of:
          0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.03973814 = queryWeight, product of:
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.0063751927 = queryNorm
            0.4869723 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.019351374 = (MATCH) weight(plain_text:hcu in 0) [DefaultSimilarity], result of:
          0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.03973814 = queryWeight, product of:
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.0063751927 = queryNorm
            0.4869723 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of:
          0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.035319194 = queryWeight, product of:
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.0063751927 = queryNorm
            0.43282017 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.019351374 = (MATCH) weight(plain_text:cugt in 0) [DefaultSimilarity], result of:
          0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.03973814 = queryWeight, product of:
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.0063751927 = queryNorm
            0.4869723 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              6.2332454 = idf(docFreq=4,
RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards
I got it right the first time, and here is my request handler. The field plain_text is searched correctly and has the same field type as title: text_de.

    <queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
      <lst name="synonymAnalyzers">
        <lst name="myCoolAnalyzer">
          <lst name="tokenizer">
            <str name="class">standard</str>
          </lst>
          <lst name="filter">
            <str name="class">shingle</str>
            <str name="outputUnigramsIfNoShingles">true</str>
            <str name="outputUnigrams">true</str>
            <str name="minShingleSize">2</str>
            <str name="maxShingleSize">4</str>
          </lst>
          <lst name="filter">
            <str name="class">synonym</str>
            <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
            <str name="synonyms">synonyms.txt</str>
            <str name="expand">true</str>
            <str name="ignoreCase">true</str>
          </lst>
        </lst>
      </lst>
    </queryParser>

    <requestHandler name="/select2" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
        <str name="defType">synonym_edismax</str>
        <str name="synonyms">true</str>
        <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5 </str>
        <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
        <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
        <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
        <str name="df">text</str>
        <str name="fl">*,path,score</str>
        <str name="wt">json</str>
        <str name="q.op">AND</str>
        <!-- Highlighting defaults -->
        <str name="hl">on</str>
        <str name="hl.fl">plain_text,title</str>
        <str name="hl.fragSize">200</str>
        <str name="hl.simple.pre">&lt;b&gt;</str>
        <str name="hl.simple.post">&lt;/b&gt;</str>
        <!-- <lst name="invariants"> -->
        <str name="facet">on</str>
        <str name="facet.mincount">1</str>
        <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
        <str name="f.inhaltstyp_s.facet.sort">index</str>
        <str name="facet.field">{!ex=doctype}doctype</str>
        <str name="f.doctype.facet.sort">index</str>
        <str name="facet.field">{!ex=thema_f}thema_f</str>
        <str name="f.thema_f.facet.sort">index</str>
        <str name="facet.field">{!ex=author_s}author_s</str>
        <str name="f.author_s.facet.sort">index</str>
        <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
        <str name="f.sachverstaendiger_s.facet.sort">index</str>
        <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
        <str name="f.veranstaltung_s.facet.sort">index</str>
        <str name="facet.date">{!ex=last_modified}last_modified</str>
        <str name="facet.date.gap">+1MONTH</str>
        <str name="facet.date.end">NOW/MONTH+1MONTH</str>
        <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
        <str name="facet.date.other">after</str>
      </lst>
    </requestHandler>
Re: NOT SOLVED searches for single char tokens instead of from 3 uppwards
The usual use of an ngram filter is at index time and not at query time. What exactly are you trying to achieve by using ngram filtering at query time as well as index time?

Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer: the latter removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF.

Which query parser are you using? What fields are being queried?

Please post the parsed query string from the debug output. It will show the precise generated query.

I think what you are seeing is that the ngram filter is generating tokens like h_cugtest, then the WDF is removing the underscore, and then h gets generated as a separate token.

-- Jack Krupansky
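The suspected chain of events (n-grams first, then splitting on the underscore) can be reproduced with a small Python model. It is a sketch under assumptions, not Solr code: ngrams and split_on_underscore are hypothetical helpers, and the second one only imitates the parts of WDF that matter here (word parts plus the catenated form).

```python
def ngrams(token, min_size, max_size):
    """All substrings of token with length in [min_size, max_size]."""
    return [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_size, max_size + 1)
        if start + size <= len(token)
    ]


def split_on_underscore(gram):
    """Toy stand-in for WDF: the parts between underscores, plus their
    concatenation when there is more than one part."""
    parts = [p for p in gram.split("_") if p]
    tokens = list(parts)
    if len(parts) > 1:
        tokens.append("".join(parts))
    return tokens


def analyze(token):
    """Apply the n-gram step first and the delimiter step second,
    in the same order as the field type in this thread."""
    result = set()
    for gram in ngrams(token, 3, 15):
        result.update(split_on_underscore(gram))
    return result


tokens = analyze("yh_cugtest")

# Grams such as "h_c" are at least 3 characters long, but splitting them
# afterwards leaves pieces shorter than minGramSize.
print(sorted(t for t in tokens if len(t) < 3))
```

In this model the gram h_cugtest yields h as its own token, and catenated forms such as yhc and hcu also appear, which are among the terms visible in the explain output earlier in the thread.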