RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-13 Thread Andreas Owen
I have gotten nearly everything to work. There are two queries where I don't get back what I want:

"avaloq frage 1" - only returns results if I set minGramSize=1 while indexing
"yh_cug" - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?
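A possible approach, sketched here with illustrative names (text_exact, title_exact) that are not taken from this thread: copy the value into a companion field whose analyzer keeps the input as a single token (KeywordTokenizerFactory) and add that field to qf, so the whole term can be matched without being split.

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keep the whole value (e.g. "avaloq frage 1") as one lowercased token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

Such a field could then be added to the qf list (e.g. title_exact^20) so an exact whole-term match scores alongside the tokenized fields.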

Fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>


RE: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-12 Thread Andreas Owen
I now have the following:

<analyzer type="query">
  <tokenizer class="solr.WhiteSpaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"
          format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German"/>
</analyzer>

The GUI analysis shows me that WDF doesn't cut off the underscore anymore, but the query still returns 0 results?
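The at-under-alpha.txt types mapping itself is not shown anywhere in the thread; judging by its name and by the behaviour described above, it presumably maps '@' and '_' to ALPHA so that the WordDelimiterFilter leaves them alone. An assumed sketch of such a file:

# assumed contents of a WordDelimiterFilterFactory types mapping
# (the real at-under-alpha.txt is not included in this thread)
@ => ALPHA
_ => ALPHA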

Output:

<lst name="debug">
  <str name="rawquerystring">yh_cug</str>
  <str name="querystring">yh_cug</str>
  <str name="parsedquery">(+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 |
    thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 |
    inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 |
    title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0))
    ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0)
    FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord</str>
  <str name="parsedquery_toString">+(tags:yh_cug^10.0 | links:yh_cug^5.0 |
    thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 |
    inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 |
    title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)
    ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0)
    (div(int(clicks),max(int(displays),const(1^8.0</str>
  <lst name="explain"/>
  <arr name="expandedSynonyms">
    <str>yh_cug</str>
  </arr>
  <lst name="reasonForNotExpandingSynonyms">
    <str name="name">DidntFindAnySynonyms</str>
    <str name="explanation">No synonyms found for this query.  Check your synonyms file.</str>
  </lst>
  <lst name="mainQueryParser">
    <str name="QParser">ExtendedDismaxQParser</str>
    <null name="altquerystring"/>
    <arr name="boost_queries">
      <str>(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    </arr>
    <arr name="parsed_boost_queries">
      <str>(expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0</str>
    </arr>
    <arr name="boostfuncs">
      <str>div(clicks,max(displays,1))^8</str>
    </arr>
  </lst>
  <lst name="synonymQueryParser">
    <str name="QParser">ExtendedDismaxQParser</str>
    <null name="altquerystring"/>
    <arr name="boostfuncs">
      <str>div(clicks,max(displays,1))^8</str>
    </arr>
  </lst>
  <lst name="timing">
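One way to compare the two analyzer chains outside the Admin UI Analysis screen is the field analysis handler that ships in the stock solrconfig.xml; it returns the index-time and query-time token streams for a given value. The core name and parameter values in the example request below are illustrative:

<requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy"/>

<!-- example request (illustrative):
     /solr/collection1/analysis/field?analysis.fieldtype=text_de
         &analysis.fieldvalue=yh_cug&analysis.query=yh_cug&wt=json -->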




Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-12 Thread Jack Krupansky
You didn't show the new index analyzer - it's tricky to assure that index 
and query are compatible, but the Admin UI Analysis page can help.


Generally, using pure defaults for WDF is not what you want, especially for 
query time. Usually there needs to be a slight asymmetry between index and 
query for WDF - index generates more terms than query.


-- Jack Krupansky
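The asymmetry described here usually means that the index-time WDF both generates word parts and catenates them, while the query-time WDF only generates parts. A sketch of that pattern with illustrative flag values (not a configuration taken from this thread):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

That way a query term can match either the split parts or the catenated form stored in the index, without the query side producing extra catenated terms of its own.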

Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-12 Thread Andreas Owen
Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index & query).

Here's the rest:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- in this example, we will only use synonyms at query time
  <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
  -->
  <!-- Case insensitive stop word removal.
       enablePositionIncrements=true ensures that a 'gap' is left to
       allow for accurate phrase queries.
  -->
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt"
          enablePositionIncrements="true"
          />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
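Both the commented-out synonym filter here and the synonym_edismax parser shown elsewhere in the thread read a synonyms file. As a reminder of the format (the entries below are the stock Solr examples, not this installation's synonyms.txt):

# comma-separated terms are treated as synonyms of each other when expand="true"
gb, gib, gigabyte, gigabytes
# explicit mappings rewrite the left-hand side to the right-hand side
i-pod, i pod => ipod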

RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-12 Thread Andreas Owen
Hi Jack,

do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output, but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])</str>
</lst>
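An aside on why this is tricky: parameter dereferencing is only documented for local-param values, i.e. the v=$param form, where the referenced value arrives as another request parameter or as a default; a $ reference embedded directly in the query text, as in roles:($r) above, is not expanded by the standard query parsers. A sketch of the v=$param form, using an assumed parameter name rolesfq:

<lst name="defaults">
  <!-- the filter body lives in its own parameter and is pulled in via v=$rolesfq;
       "rolesfq" is an assumed name and can be overridden per request -->
  <str name="fq">{!lucene q.op=OR v=$rolesfq}</str>
  <str name="rolesfq">*:* -roles:["" TO *]</str>
</lst>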


Re: SOLVED searches for single char tokens instead of from 3 upwards

2014-03-11 Thread Andreas Owen
Sorry, I looked at the wrong fieldtype.

-Original Message- 
 From: Andreas Owen a...@conx.ch 
 To: solr-user@lucene.apache.org 
 Date: 11/03/2014 08:45 
 Subject: searches for single char tokens instead of from 3 upwards 
 
 I have a field with the following type:
 
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
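Note that an <analyzer> element without a type attribute is applied at both index and query time, so the NGram and WordDelimiter filters above also run over the query string. A minimal illustration of splitting the chain so that grams are only produced at index time (a sketch, not the configuration the thread eventually arrives at):

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>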
 
 
Shouldn't this make tokens from 3 to 15 characters long, and not from 1? Here is a query report of 2 results:
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">125</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="fl">title,roles,organisations,id</str>
    <str name="indent">true</str>
    <str name="q">yh_cugtest</str>
    <str name="_">1394522589347</str>
    <str name="wt">xml</str>
    <str name="fq">organisations:* roles:*</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0">
   ..
<str name="dms:2681">
1.6365329 = (MATCH) sum of:
  1.6346203 = (MATCH) max of:
    0.14759353 = (MATCH) product of:
      0.28596246 = (MATCH) sum of:
        0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of:
          0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.035319194 = queryWeight, product of:
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.0063751927 = queryNorm
            0.43282017 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.0119499 = (MATCH) weight(plain_text:ugt in 0) [DefaultSimilarity], result of:
          0.0119499 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.031227252 = queryWeight, product of:
              4.8982444 = idf(docFreq=18, maxDocs=937)
              0.0063751927 = queryNorm
            0.38267535 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              4.8982444 = idf(docFreq=18, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.019351374 = (MATCH) weight(plain_text:yhc in 0) [DefaultSimilarity], result of:
          0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.03973814 = queryWeight, product of:
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.0063751927 = queryNorm
            0.4869723 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.019351374 = (MATCH) weight(plain_text:hcu in 0) [DefaultSimilarity], result of:
          0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.03973814 = queryWeight, product of:
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.0063751927 = queryNorm
            0.4869723 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of:
          0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.035319194 = queryWeight, product of:
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.0063751927 = queryNorm
            0.43282017 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              5.540098 = idf(docFreq=9, maxDocs=937)
              0.078125 = fieldNorm(doc=0)
        0.019351374 = (MATCH) weight(plain_text:cugt in 0) [DefaultSimilarity], result of:
          0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
            0.03973814 = queryWeight, product of:
              6.2332454 = idf(docFreq=4, maxDocs=937)
              0.0063751927 = queryNorm
            0.4869723 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              6.2332454 = idf(docFreq=4,
RE: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-11 Thread Andreas Owen
I got it right the first time, and here is my request handler. The field plain_text is searched correctly and has the same fieldtype as title (text_de).

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200
        title^20 h_*^14
        tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
        contentmanager^5 links^5
        last_modified^5 url^5
    </str>

    <str name="fq">{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->

    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>

    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>

    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>

  </lst>
</requestHandler>
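A small aside on the {!ex=...} local params above: a facet.field exclusion only takes effect when the corresponding filter query carries the matching {!tag=...}. The handler's own fq is not tagged, so the exclusions presumably rely on filters along these lines, whether sent by the client or appended in the config (the field value below is illustrative):

<lst name="appends">
  <!-- illustrative only: a filter tagged "doctype" so the {!ex=doctype} facet above ignores it -->
  <str name="fq">{!tag=doctype}doctype:pdf</str>
</lst>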
 


Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

2014-03-11 Thread Jack Krupansky
The usual use of an ngram filter is at index time and not at query time. 
What exactly are you trying to achieve by using ngram filtering at query 
time as well as index time?


Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer - the latter removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF.


Which query parser are you using? What fields are being queried?

Please post the parsed query string from the debug output - it will show the 
precise generated query.


I think what you are seeing is that the ngram filter is generating tokens like "h_cugtest", and then the WDF is removing the underscore, and then "h" gets generated as a separate token.

-- Jack Krupansky
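One way to act on this observation, offered only as a sketch since the thread does not settle on a fix, is to run the word delimiter filter before the ngram filter at index time, so the grams are built from already-delimited word parts and no later filter re-splits them into single characters:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- delimit first (keeping '_' via the types mapping), then build the 3..15-grams -->
  <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"
          generateWordParts="1" catenateWords="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>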
