from:"Christian Zambrano"


Got it. Sorry for not having an answer for your problem.

On 10/06/2009 04:58 PM, Ravi Kiran wrote:

You don't see any facet fields in my query because I have configured them in
solrconfig.xml as default facet fields for the dismax and standard handlers,
so that I don't have to specify all those fields individually every time I
query. All I need to do is set facet=true.

   
 
  dismax
  explicit
  0.01
  
 systemid^20.0 headline^20.0 keyword^18.0 person^18.0
organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
multimediablurb^1.5
  
  
 headline^20.5 keyword^18.5 person^18.5 organization^18.5
usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
  
  
 recip(rord(pubdatetime),1,1000,1000)^1.0
  
  
 *
  
  
 2<-1 5<-3 6<90%
  
  100
  *:*
  
  keyword
  
  0
  
  keyword
  regex  
  false
  1
  5
  5
  5
  5
  5
  5
  contenttype
  keyword
  keywordlower
  keywordformatted
  person
  personformatted
  organization
  usstate
  country
  subject
 
   


On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano wrote:

   

I am stumped then. I had a similar issue when I was using a field that was
being heavily tokenized, but I corrected the issue by using a
field (generated using copyField) that doesn't get analyzed at all.

In the query you provided earlier, I didn't see the parameter that tells Solr
which field it should produce facets for.

Something like:


http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location




On 10/06/2009 04:09 PM, Ravi Kiran wrote:

 

Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote:
 



   

And you had the analyzer for that field set-up the same way as shown on
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:



 

I did in fact check it out, and there is no weirdness on the analysis page.
See the result below.

Index Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload

Query Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:


 




   

Have you tried using the Analysis page to see what tokens are generated
for the string "New York"? It could be that one of the token filters is
adding the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:





 

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it emits the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1"...but
wh

Re: Weird Facet and KeywordTokenizerFactory Issue

I am stumped then. I had a similar issue when I was using a field that
was being heavily tokenized, but I corrected the issue by using a
field (generated using copyField) that doesn't get analyzed at all.
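
To make the difference concrete, here is a small Python sketch (not Solr code) that counts facet values two ways: over whitespace-split tokens, roughly what a heavily tokenized field would expose, and over whole values, roughly what a non-analyzed copyField target would expose. The sample values and function names are made up for illustration.

```python
from collections import Counter

# Hypothetical field values; not taken from the index discussed in this thread.
values = ["New York", "New Jersey", "New Mexico", "Texas"]

def facet_counts_tokenized(vals):
    """Facet over individual tokens, as a tokenized field would."""
    counts = Counter()
    for v in vals:
        # Count each distinct token once per document.
        counts.update(set(v.split()))
    return counts

def facet_counts_raw(vals):
    """Facet over whole, unanalyzed values."""
    return Counter(vals)

tokenized = facet_counts_tokenized(values)
raw = facet_counts_raw(values)

print(tokenized["New"])   # the shared first word shows up as its own facet
print(raw["New"])         # no such facet when values are kept whole
print(raw["New York"])
```

In the tokenized case the shared first word appears as a facet of its own even though no document has that value by itself.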


In the query you provided earlier, I didn't see the parameter that tells
Solr which field it should produce facets for.


Something like:

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location
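
If you build the request in code, a sketch along these lines keeps the parameters explicit. The base URL is the one from this thread, 'location' is only an example field name, and the helper name is made up.

```python
from urllib.parse import urlencode

# Base URL as used earlier in this thread.
BASE = "http://localhost:8080/solr-admin/topicscore/select/"

def facet_url(field, limit=-1):
    """Build a select URL that asks for facets on one explicit field."""
    params = [
        ("facet", "true"),
        ("facet.limit", str(limit)),
        ("facet.field", field),   # the field to produce facets for
    ]
    return BASE + "?" + urlencode(params)

url = facet_url("location")
print(url)
```

When facet.field is configured as a handler default in solrconfig.xml, as described in the reply above, it can be left out of the request.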



On 10/06/2009 04:09 PM, Ravi Kiran wrote:

Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote:

   

And you had the analyzer for that field set-up the same way as shown on
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

 

I did in fact check it out, and there is no weirdness on the analysis page.
See the result below.

Index Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload

Query Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:
 



   

Have you tried using the Analysis page to see what tokens are generated
for the string "New York"? It could be that one of the token filters is
adding the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:



 

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it emits the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1"...but
when I search with the standard handler (qt=standard&q=keyword:"New") I don't
find any doc which has just "New". After digging in a bit, I found that if
several keywords have a common starting word, that word is being pulled out
as another facet, like the following. Any help is greatly appreciated.

Result

47   >Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7   -->Ghost
5
5


7 -->Ghost
6
26
6

27
8
7
12

Schema.xml
-

Re: Weird Facet and KeywordTokenizerFactory Issue

And you had the analyzer for that field set-up the same way as shown on 
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did in fact check it out, and there is no weirdness on the analysis page.
See the result below.

Index Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload

Query Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1 | term text New York | term type word | source start,end 0,8 | payload


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:

   

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be that one of the token filters is adding
the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

 

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it emits the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1"...but
when I search with the standard handler (qt=standard&q=keyword:"New") I don't
find any doc which has just "New". After digging in a bit, I found that if
several keywords have a common starting word, that word is being pulled out
as another facet, like the following. Any help is greatly appreciated.

Result

47  >   Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7  -->   Ghost
5
5


7-->   Ghost
6
26
6

27
8
7
12

Schema.xml
-

Re: Weird Facet and KeywordTokenizerFactory Issue

Have you tried using the Analysis page to see what tokens are generated
for the string "New York"? It could be that one of the token filters is
adding the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it emits the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1"...but
when I search with the standard handler (qt=standard&q=keyword:"New") I don't
find any doc which has just "New". After digging in a bit, I found that if
several keywords have a common starting word, that word is being pulled out
as another facet, like the following. Any help is greatly appreciated.

Result

47 >  Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7 -->  Ghost
5
5


7   -->  Ghost
6
26
6

27
8
7
12

Schema.xml
-

Re: Question about PatternReplace filter and automatic Synonym generation


Prasanna,

Wouldn't it be better to use built-in token filters at both index and
query time that will convert 'it!' to just 'it'? I believe the
WordDelimiterFilterFactory will do that for you.


Christian

On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan wrote:






On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" wrote:


Alternatively, is there a filter available which takes in a pattern and
produces additional forms of the token depending on the pattern? The use
case I am looking at here is using such a filter to automate synonym
generation. In our application, quite a few of the synonym file entries
match a specific pattern and having such a filter would make it easier, I
believe. Please do correct me in case I am missing some unwanted side effect
with this approach.


I do not understand this. TokenFilters are used for things like stemming,
replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
additional tokens (synonyms) from a file for each token.

What exactly are you trying to do with synonyms? I guess you could do
stemming etc. with synonyms, but why do you want to do that?


I'll try to explain with an example. Given the term 'it!' in the title, it
should match both 'it' and 'it!' in the query as an exact match. Currently,
this is done by using a synonym entry (and an index-time SynonymFilter) as
follows:

it! => it, it!

Now, the above holds true for all cases where you have a title token of the
form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
manually for each case, which is not easy to manage and does not scale.

I am hoping to do the same by using an index-time filter that takes in a
pattern, like the PatternReplace filter, and adds the newly created token
instead of replacing the original one. Does this make sense? Am I missing
something that would break this approach?
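
As a rough illustration of the idea (plain Python, not the API of any existing Solr or Lucene filter), the sketch below keeps each original token and, when it matches the pattern, adds the stripped form at the same position. The function name and the (position, term) representation are assumptions made for the example.

```python
import re

# Tokens made of letters followed by a single trailing '!'.
PATTERN = re.compile(r"^([a-zA-Z]+)!$")

def add_pattern_variants(tokens):
    """Return (position, term) pairs; a variant shares its original's position."""
    out = []
    for pos, tok in enumerate(tokens, start=1):
        out.append((pos, tok))             # always keep the original token
        m = PATTERN.match(tok)
        if m:
            out.append((pos, m.group(1)))  # add the form without the '!'
    return out

result = add_pattern_variants(["do", "it!", "now"])
print(result)
```

Emitting both forms at one position is what the manual entry 'it! => it, it!' achieves today, without a synonym line per term.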



Note that a change in synonym file needs a re-index of the affected
documents. Also, the synonym map is kept in memory.


What is the overhead incurred in having an additional filter applied during
indexing? Is it strictly CPU only?

Thanks a lot for your valuable input.

Regards,

Prasanna.

Re: Need "OR" in DisMax Query


David,

If your schema includes fields with analyzers that use the
StopFilterFactory and the dismax request handler is set up to search within
those fields, then you are correct.
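
One alternative that may be worth trying, sketched here on the assumption that your handler accepts the dismax mm (minimum should match) parameter on the request, is to drop the literal OR and require only one of the terms to match. The host, port and path below are placeholders, and the helper name is made up.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute your own Solr select URL.
SELECT_URL = "http://localhost:8983/solr/select"

def dismax_any_term(terms):
    """Build a dismax request where matching a single term is enough (mm=1)."""
    params = [
        ("qt", "dismax"),
        ("q", " ".join(terms)),  # plain terms, no OR keyword
        ("mm", "1"),             # at least one optional clause must match
    ]
    return SELECT_URL + "?" + urlencode(params)

url = dismax_any_term(["red", "green"])
print(url)
```

Whether this gives the results you want also depends on the mm default already configured for the handler.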



On 10/05/2009 01:36 PM, David Giffin wrote:

Hi There,

Maybe I'm missing something, but I can't seem to get the dismax
request handler to perform an OR query. It appears that OR is removed
by the stop words. I'd like to do something like
"qt=dismax&q=red+OR+green" and get all green and all red results.

Thanks,
David

Re: wildcard searches

On 10/05/2009 01:18 PM, Avlesh Singh wrote:

First of all, I know of no way of doing wildcard phrase queries.

http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_combine_wildcard_and_phrase_search.2C_e.g._.22foo_ba.2A.22.3F

Thanks for that link

When I said no filters, I meant TokenFilters, which is what I believe you
mean by 'not analyzed'

Analysis is a Lucene way of configuring tokenizers and filters for a field
(index time and query time). I guess, both of us mean the same thing.

You are correct. I should have said ' Not Analyzed'. Thanks for the
correction.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 11:04 PM, Christian Zambrano wrote:

Avlesh, I don't understand your answer.

First of all, I know of no way of doing wildcard phrase queries.

When I said no filters, I meant TokenFilters, which is what I believe you
mean by 'not analyzed'

On 10/05/2009 12:27 PM, Avlesh Singh wrote:

No filters are applied to wildcard/fuzzy searches.

Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano wrote:

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this on either the solr or lucene
documentation but I read it on the Solr book from PACKT

On 10/05/2009 12:09 PM, Angel Ice wrote:

Hi everyone,

I have a little question regarding the search engine when a wildcard
character is used in the query.
Let's take the following example :

- I have sent the word Hésitation (with an accent on the "e") for indexing.
- The filters applied to the field that handles this word result in
"esit" being indexed: the mute H is removed (home-made filter), the accent
is removed too (IsoLatin1Filter), and the SnowballPorterFilter removes the
"ation".

When I search for "hesitation", "esitation", "ésitation" etc., all is
OK: the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to place the wildcard so that the prefix matches
the indexed term exactly (for example "esi*").

Does the search engine apply the filters to the word that prefixes the
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent

Re: A little help with indexing joined words

Would you mind explaining how omitNorms has any effect on the IDF problem
I described earlier?


I agree with your second sentence. I had to use the NGramTokenFilter to 
accommodate partial matches.


On 10/05/2009 12:11 PM, Avlesh Singh wrote:

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number of
tokens, which will artificially increase the number of tokens in the index,
which in turn will affect the IDF score.

 

Well, I don't see a reason as to why someone would need a length based
normalization on such matches. I always have done omitNorms while using
fields with this filter.

Yes, synonyms might be an answer when you have a limited number of such words
(phrases) and their possible combinations.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:32 PM, Christian Zambrano wrote:

   

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number of
tokens, which will artificially increase the number of tokens in the index,
which in turn will affect the IDF score.

A query for "borderland" should have returned results though. It is
difficult to troubleshoot why it didn't without knowing what query you used
and what kind of analysis is taking place.

Have you tried using the analysis page in the admin section to see what
tokens get generated for 'Borderlands'?

Christian


On 10/05/2009 11:01 AM, Avlesh Singh wrote:

 

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.



 

"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe   wrote:



   

Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or
product
names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using
a
text field type.

Thanks in advance
Andrew

Re: wildcard searches


Avlesh, I don't understand your answer.

First of all, I know of no way of doing wildcard phrase queries.

When I said no filters, I meant TokenFilters, which is what I believe
you mean by 'not analyzed'


On 10/05/2009 12:27 PM, Avlesh Singh wrote:

No filters are applied to wildcard/fuzzy searches.

 

Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano wrote:

   

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this on either the solr or lucene
documentation but I read it on the Solr book from PACKT


On 10/05/2009 12:09 PM, Angel Ice wrote:

 

Hi everyone,

I have a little question regarding the search engine when a wildcard
character is used in the query.
Let's take the following example :

- I have sent the word Hésitation (with an accent on the "e") for indexing.
- The filters applied to the field that handles this word result in
"esit" being indexed: the mute H is removed (home-made filter), the accent
is removed too (IsoLatin1Filter), and the SnowballPorterFilter removes the
"ation".

When I search for "hesitation", "esitation", "ésitation" etc., all is
OK: the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to place the wildcard so that the prefix matches
the indexed term exactly (for example "esi*").

Does the search engine apply the filters to the word that prefixes the
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent

Re: wildcard searches


No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this on either the solr or lucene 
documentation but I read it on the Solr book from PACKT
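
To illustrate with the Hésitation example from the question quoted below, here is a toy Python model. The three steps in analyze() are crude stand-ins for the home-made mute-H filter, the accent filter and the stemmer, not the real Solr filters, and all names are made up for the example.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(word):
    """Toy stand-in for the analysis chain described in the question."""
    term = word.lower()
    if term.startswith("h"):        # stand-in for the mute-H filter
        term = term[1:]
    term = strip_accents(term)      # stand-in for the accent filter
    if term.endswith("ation"):      # crude stand-in for the stemmer
        term = term[:-len("ation")]
    return term

def prefix_matches_verbatim(query, indexed_term):
    """Compare the wildcard prefix as typed, with no analysis applied."""
    prefix = query.rstrip("*")
    return indexed_term.startswith(prefix)

indexed = analyze("Hésitation")
print(indexed)
print(prefix_matches_verbatim("hésita*", indexed))
print(prefix_matches_verbatim("esi*", indexed))
```

Whole-word queries go through analyze() and end up as the same term that was indexed, while the wildcard prefix is compared as typed, which is the behaviour the question describes.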


On 10/05/2009 12:09 PM, Angel Ice wrote:

Hi everyone,

I have a little question regarding the search engine when a wildcard character 
is used in the query.
Let's take the following example :

- I have sent the word Hésitation (with an accent on the "e") for indexing.
- The filters applied to the field that handles this word result in
"esit" being indexed: the mute H is removed (home-made filter), the accent
is removed too (IsoLatin1Filter), and the SnowballPorterFilter removes the
"ation".

When I search for "hesitation", "esitation", "ésitation" etc., all is OK:
the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to place the wildcard so that the prefix matches
the indexed term exactly (for example "esi*").

Does the search engine apply the filters to the word that prefixes the
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent

Re: Question regarding synonym


You are correct.

I would recommend using the SynonymFilter only at index time
unless you have a very good reason to do it at query time.
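
A simplified Python model of that index-time expansion, using the comma-separated group from the message quoted below. It ignores token positions, which the real SynonymFilter tracks, and the function names are made up for the example.

```python
# One comma-separated synonym group, as in the mapping quoted below.
SYNONYM_LINE = "bayrische motoren werke, bmw"

def parse_group(line):
    """Split a synonym line into entries, each a list of tokens."""
    return [entry.split() for entry in line.split(",")]

def index_terms(text, group):
    """Return the set of terms indexed for text after expanding the group."""
    tokens = text.lower().split()
    terms = set(tokens)
    for entry in group:
        n = len(entry)
        # Does this entry occur as consecutive tokens in the text?
        if any(tokens[i:i + n] == entry for i in range(len(tokens) - n + 1)):
            for alt in group:
                terms.update(alt)   # index every word of every alternative
    return terms

group = parse_group(SYNONYM_LINE)
terms = index_terms("a used BMW sedan", group)
print(sorted(terms))
```

With the expansion done once at index time, a plain single-word query such as text:motoren has a term to match and the query side needs no synonym handling.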


On 10/05/2009 11:46 AM, darniz wrote:

Yes, that's what we decided: to expand these terms while indexing.
If we have
bayrische motoren werke => bmw

and I have a document which has bmw in it, searching for text:bayrische does
not give me results. I have to search for
text:"bayrische motoren werke"; then it actually applies the synonym and gets
me the document.

Now if I change the synonym mapping to
bayrische motoren werke, bmw
with the expand parameter set to true, and also use this file at indexing,
then at the time I index this document, along with "bmw" I also index the
words "bayrische", "motoren" and "werke".

Any text query like text:motoren or text:bayrische will give me results now.

Please correct me if my assumption is wrong.

Thanks
darniz









Christian Zambrano wrote:
   



On 10/02/2009 06:02 PM, darniz wrote:
 

Thanks.
As I said, it also works with double quotes,
like carDescription:"austin martin"

So is the conclusion that, in order to map a two-word synonym, I always have
to enclose it in double quotes so that it does not split the words?




   

Yes, but there are things you need to keep in mind.

  From the solr wiki:

Keep in mind that while the SynonymFilter will happily work with
synonyms containing multiple words (ie:
"sea biscuit, sea biscit, seabiscuit"), the recommended approach for
dealing with synonyms like this is to expand the synonym when
indexing. This is because there are two potential issues that can arise
at query time:

1. The Lucene QueryParser tokenizes on white space before giving any
   text to the Analyzer, so if a person searches for the words
   sea biscit the analyzer will be given the words "sea" and "biscit"
   separately, and will not know that they match a synonym.

2. Phrase searching (ie: "sea biscit") will cause the QueryParser to
   pass the entire string to the analyzer, but if the SynonymFilter
   is configured to expand the synonyms, then when the QueryParser
   gets the resulting list of tokens back from the Analyzer, it will
   construct a MultiPhraseQuery that will not have the desired
   effect. This is because of the limited mechanism available for the
   Analyzer to indicate that two terms occupy the same position:
   there is no way to indicate that a "phrase" occupies the same
   position as a term. For our example the resulting MultiPhraseQuery
   would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would
   not match the simple case of "seabiscuit" occurring in a document


 







Christian Zambrano wrote:

   

When you use a field qualifier (fieldName:valueToLookFor) it only applies
to the word right after the colon. If you look at the debug
information you will notice that for the second word it is using the
default field.

carDescription:austin
*text*:martin

The following should work:

carDescription:(austin martin)
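
A toy Python model of that rule, for illustration only. It is not the Lucene QueryParser; the regular expression, the function name and the handling of quotes and parentheses are simplifications made up for the example, and 'text' is the default field seen in the debug output.

```python
import re

DEFAULT_FIELD = "text"  # default field shown in the debug output

# field:(a b)  |  field:"a b"  |  field:word  |  bare word
CLAUSE = re.compile(r'(\w+):\(([^)]*)\)|(\w+):"([^"]*)"|(\w+):(\S+)|(\S+)')

def assign_fields(query):
    """Return (field, text) pairs showing which field each clause targets."""
    clauses = []
    for m in CLAUSE.finditer(query):
        if m.group(1):      # grouped: the qualifier covers every word inside
            clauses += [(m.group(1), w) for w in m.group(2).split()]
        elif m.group(3):    # quoted phrase: one clause on the named field
            clauses.append((m.group(3), m.group(4)))
        elif m.group(5):    # qualifier applies to this single word only
            clauses.append((m.group(5), m.group(6)))
        else:               # unqualified word falls back to the default field
            clauses.append((DEFAULT_FIELD, m.group(7)))
    return clauses

print(assign_fields("carDescription:austin martin"))
print(assign_fields("carDescription:(austin martin)"))
print(assign_fields('carDescription:"austin martin"'))
```

The first call shows the second word landing on the default field, which matches the debug output quoted below.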


On 10/02/2009 05:46 PM, darniz wrote:

 

This is not working when I search documents. I have a document which
contains the text aston martin.

When I search carDescription:"austin martin" I get a match, but when I
don't use double quotes,

like carDescription:austin martin
there is no match.

In the analyzer, if I enter austin martin without quotes, it matches
aston martin when it passes through the synonym filter;
maybe the analyzer treats it as a phrase "austin martin" by default. But
when I try to do a query by typing
carDescription:austin martin I get 0 documents. The following is the
debug node info with debugQuery=on:

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin
text:martin

I don't know why it breaks up the words; maybe it's the desired behaviour.
When I give carDescription:"austin martin", of course, it is able
to map to the synonym and I get the desired result.

Any opinion

darniz



Ensdorf Ken wrote:


   


 

Hi
i have a question regarding synonymfilter
i have a one way mapping defined
austin martin, astonmartin =>aston martin



   

...


 

Can anybody please explain if my observation is correct. This is a very
critical aspect for my work.


   

That is correct - the synonym filter can recognize multi-token synonyms
from consecutive tokens in a stream.

Re: A little help with indexing joined words

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number
of tokens, which will artificially increase the number of tokens in the
index, which in turn will affect the IDF score.
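
To give a feel for the numbers, here is a plain Python sketch of front-edge n-gram generation. The function and its min_gram/max_gram parameters are made up for the illustration and are not the factory's configuration.

```python
def edge_ngrams(term, min_gram=1, max_gram=25):
    """Return the prefixes of term with lengths min_gram..max_gram."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]

grams = edge_ngrams("borderlands")
print(len(grams))              # one indexed token per prefix length
print("borderland" in grams)   # the shorter form is among the prefixes
print("land" in grams)         # inner fragments are not
```

A single title word turns into as many indexed terms as it has letters, which is the growth in token count I am referring to.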


A query for "borderland" should have returned results though. It is 
difficult to troubleshoot why it didn't without knowing what query you 
used, and what kind of analysis is taking place.


Have you tried using the analysis page in the admin section to see what
tokens get generated for 'Borderlands'?


Christian

On 10/05/2009 11:01 AM, Avlesh Singh wrote:

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

 

"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe  wrote:

   

Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or
product
names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using a
text field type.

Thanks in advance
Andrew

Re: Always spellcheck (suggest)


Shalin,


Thanks for the clarification. That explains a lot. I should have looked 
at the lucene documentation.



On 10/05/2009 05:28 AM, Shalin Shekhar Mangar wrote:

On Mon, Oct 5, 2009 at 10:24 AM, Christian Zambrano wrote:

   

I am really surprised that a query for "behaviour" returns "behavior" as a
suggestion only when the parameter "spellcheck.onlyMorePopular=true" is
present. I re-read the documentation and I see nothing that will imply that
the parameter onlyMorePopular will do anything else but filter the
suggestions solr will return.

Maybe somebody else can shed some light on this.


 

Yeah, that is true. All this is actually done in the Lucene SpellChecker.
Solr's component is a wrapper over it with some extra features. I've added a
clarification to the wiki page.

Re: Always spellcheck (suggest)

I am really surprised that a query for "behaviour" returns "behavior" as 
a suggestion only when the parameter "spellcheck.onlyMorePopular=true" 
is present. I re-read the documentation and I see nothing that will 
imply that the parameter onlyMorePopular will do anything else but 
filter the suggestions solr will return.


Maybe somebody else can shed some light on this.

On 10/04/2009 09:51 PM, Greg Pendlebury wrote:

Thanks. I'll have to look into modifications then (was hoping to avoid that).

For clarity though I believe this point is slightly off:

   

"Adding the parameter onlyMorePopular limits the suggestions that solr can give 
you(to ones that return more hits than the existing query), nothing more."
   

The flag is definitely returning suggestions, even for 'correct' terms, they 
just have to be more popular 'correct' terms.

Eg. 'behaviour' suggests 'behavior' because it has four times as many hits, but 
they are both 'correct' and the suggestion does not occur without the 
'onlyMorePopular' flag set. 'behavior' will not suggest 'behaviour' however 
because it is less popular.
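
A toy Python model of the behaviour described above, as I understand it. It is not the Lucene SpellChecker: the similarity measure (difflib), the cutoff and the frequencies are assumptions, with the 4:1 ratio taken from the example.

```python
import difflib

# Hypothetical term frequencies; 'behavior' has four times the hits.
FREQ = {"behavior": 400, "behaviour": 100, "colour": 50}

def suggest(term, only_more_popular=False):
    """Return suggestions for term under the behaviour described above."""
    if term in FREQ and not only_more_popular:
        return []   # a 'correct' term gets no suggestion without the flag
    candidates = [
        w for w in difflib.get_close_matches(term, list(FREQ), n=5, cutoff=0.8)
        if w != term
    ]
    if only_more_popular:
        # Keep only candidates that are more frequent than the query term.
        candidates = [w for w in candidates if FREQ[w] > FREQ.get(term, 0)]
    return candidates

print(suggest("behaviour"))
print(suggest("behaviour", only_more_popular=True))
print(suggest("behavior", only_more_popular=True))
```

This reproduces the one-way suggestion: 'behaviour' leads to 'behavior' with the flag set, but not the other way round.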

Greg

-Original Message-
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

Greg,

I apologize if I misunderstood your original post. I don't think there
is a way you can force Solr to return suggestions when all of the words
are "correctly" spelled. Adding the parameter onlyMorePopular limits the
suggestions that Solr can give you (to ones that return more hits than
the existing query), nothing more.

In short, I believe the answer is No.

On 10/04/2009 09:19 PM, Greg Pendlebury wrote:
   

Thanks for the response Christian. I'll modify my original point (1) then. Is 
'onlyMorePopular' the only way to return suggestions when all of the search 
terms are present in the dictionary (i.e. correct)? Is there any way to force 
behaviour (1) without behaviour (2) (filtering on frequency)?

Ta,
Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

I believe your understanding is incorrect. The first behavior you
described is produced by adding the parameter "spellcheck=true".
Suggestions will be returned regardless of whether there are results.
The only time I believe spelling suggestions might not be included is
when all of the words are spelled "correctly".

On 10/04/2009 07:55 PM, Greg Pendlebury wrote:

 

Hi All,

If I understand correctly the flag 'onlyMorePopular' encapsulates two 
independent behaviours. 1) It runs spell checking across queries that returned 
hits. Without the flag spell checking is not run when results are found. 2) It 
limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another 
flag I'm not seeing in the doco? The usage context is spelling suggestions for 
international usage. Eg. The user searches 'behaviour', we want it to suggest 
US spelling 'behavior' and vice versa. At the moment, the suggestion only works 
one way.

Ta,
Greg


This email (including any attached files) is confidential and is for the
intended recipient(s) only.  If you received this email by mistake,
please, as a courtesy, tell the sender, then delete this email.

The views and opinions are the originator's and do not necessarily
reflect those of the University of Southern Queensland.  Although all
reasonable precautions were taken to ensure that this email contained no
viruses at the time it was sent we accept no liability for any losses
arising from its receipt.

The University of Southern Queensland is a registered provider of
education with the Australian Government (CRICOS Institution Code No's.
QLD 00244B / NSW 02225M)





   


Re: Always spellcheck (suggest)


Greg,

I apologize if I misunderstood your original post. I don't think there 
is a way you can force solr to return suggestions when all of the words 
are "correctly" spelled. Adding the parameter onlyMorePopular limits the 
suggestions that solr can give you (to ones that return more hits than 
the existing query), nothing more.


In short, I believe the answer is No.

On 10/04/2009 09:19 PM, Greg Pendlebury wrote:

Thanks for the response Christian. I'll modify my original point (1) then. Is 
'onlyMorePopular' the only way to return suggestions when all of the search 
terms are present in the dictionary (i.e. correct)? Is there any way to force 
behaviour (1) without behaviour (2) (filtering on frequency)?

Ta,
Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

I believe your understanding is incorrect. The first behavior you
described is produced by adding the parameter "spellcheck=true".
Suggestions will be returned regardless of whether there are results.
The only time I believe spelling suggestions might not be included is
when all of the words are spelled "correctly".

On 10/04/2009 07:55 PM, Greg Pendlebury wrote:
   

Hi All,

If I understand correctly the flag 'onlyMorePopular' encapsulates two 
independent behaviours. 1) It runs spell checking across queries that returned 
hits. Without the flag spell checking is not run when results are found. 2) It 
limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another 
flag I'm not seeing in the doco? The usage context is spelling suggestions for 
international usage. Eg. The user searches 'behaviour', we want it to suggest 
US spelling 'behavior' and vice versa. At the moment, the suggestion only works 
one way.

Ta,
Greg



Re: Question regarding synonym




On 10/02/2009 06:02 PM, darniz wrote:

Thanks.
As I said, it also works when I use double quotes,
like carDescription:"austin martin"

So is the conclusion that, in order to map a two-word synonym, I always have
to enclose it in double quotes so that it does not split the words?



   

Yes, but there are things you need to keep in mind.

From the solr wiki:

Keep in mind that while the SynonymFilter will happily work with
synonyms containing multiple words (ie:
"sea biscuit, sea biscit, seabiscuit"), the recommended approach for
dealing with synonyms like this is to expand the synonym when
indexing. This is because there are two potential issues that can arise
at query time:

  1. The Lucene QueryParser tokenizes on white space before giving any
     text to the Analyzer, so if a person searches for the words
     sea biscit the analyzer will be given the words "sea" and "biscit"
     separately, and will not know that they match a synonym.

  2. Phrase searching (ie: "sea biscit") will cause the QueryParser to
     pass the entire string to the analyzer, but if the SynonymFilter
     is configured to expand the synonyms, then when the QueryParser
     gets the resulting list of tokens back from the Analyzer, it will
     construct a MultiPhraseQuery that will not have the desired
     effect. This is because of the limited mechanism available for the
     Analyzer to indicate that two terms occupy the same position:
     there is no way to indicate that a "phrase" occupies the same
     position as a term. For our example the resulting MultiPhraseQuery
     would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would
     not match the simple case of "seabiscuit" occurring in a document.
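
The first issue can be illustrated with a small sketch. This is a toy model, not Solr or Lucene code: the RULES table and the function names are hypothetical, and the rules collapse the two-word forms to one token rather than expanding them, purely to keep the example short. It does not model token positions, so it says nothing about the second issue.

```python
# Hypothetical synonym rules: a tuple of consecutive tokens -> replacement.
RULES = {
    ("sea", "biscit"): ["seabiscuit"],
    ("sea", "biscuit"): ["seabiscuit"],
}
# Rule lengths, longest first, so the longest match wins.
RULE_LENGTHS = sorted({len(key) for key in RULES}, reverse=True)


def analyze(text):
    """Toy analyzer: lowercase, split, then match multi-token synonyms."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        for n in RULE_LENGTHS:
            key = tuple(tokens[i:i + n])
            if key in RULES:
                out.extend(RULES[key])
                i += n
                break
        else:
            # No rule starts at this token; pass it through unchanged.
            out.append(tokens[i])
            i += 1
    return out


def parse_unquoted(query):
    """Split on white space first, so the analyzer sees one word at a time."""
    result = []
    for word in query.split():
        result.extend(analyze(word))
    return result


def parse_phrase(query):
    """A quoted phrase reaches the analyzer as one string."""
    return analyze(query)


print(parse_unquoted("sea biscit"))  # the two-word rule never gets a chance
print(parse_phrase("sea biscit"))    # the rule sees both tokens together
```

Because the unquoted query is split before analysis, the two-token rule is never offered both words at once, which is why quoting (or expanding at index time) changes the result.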










Christian Zambrano wrote:
   

When you use a field qualifier (fieldName:valueToLookFor) it only applies
to the word right after the colon. If you look at the debug
information you will notice that for the second word it is using the
default field.

carDescription:austin *text*:martin

The following should work:

carDescription:(austin martin)
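
As a rough illustration of the rule above, here is a toy sketch of how a field qualifier attaches to the clause that follows it. It is not the Lucene QueryParser: the regular expression and the function name are made up for this example, and the default field name "text" is taken from the debug output quoted in this thread.

```python
import re

# Optional "field:" prefix, then a parenthesised group, a quoted phrase,
# or a single bare word.
CLAUSE = re.compile(r'(?:(\w+):)?(\([^)]*\)|"[^"]*"|\S+)')


def assign_fields(query, default_field="text"):
    """Return (field, value) pairs showing which field each clause uses."""
    clauses = []
    for field, value in CLAUSE.findall(query):
        field = field or default_field
        if value.startswith("("):
            # A group applies the field to every word inside the parentheses.
            clauses.extend((field, word) for word in value[1:-1].split())
        elif value.startswith('"'):
            # A quoted phrase stays together as one clause.
            clauses.append((field, value[1:-1]))
        else:
            clauses.append((field, value))
    return clauses


print(assign_fields("carDescription:austin martin"))
print(assign_fields("carDescription:(austin martin)"))
print(assign_fields('carDescription:"austin martin"'))
```

Only the first form sends "martin" to the default field; the grouped and quoted forms keep both words on carDescription.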


On 10/02/2009 05:46 PM, darniz wrote:
 

This is not working. I have a document which contains the text
aston martin.

When I search carDescription:"austin martin" I get a match, but when I
don't use double quotes,

like carDescription:austin martin
there is no match.

In the analyser, if I give austin martin without quotes, it matches
aston martin when it passes through the synonym filter. Maybe by default
the analyser treats it as a phrase "austin martin", but when I query by
typing carDescription:austin martin I get 0 documents. The following is
the debug node info with debugQuery=on:

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin text:martin

I don't know why it breaks up the words; maybe it's the desired behaviour.
When I give carDescription:"austin martin", of course it is able to map
to the synonym and I get the desired result.

Any opinions?

darniz



Ensdorf Ken wrote:

   


 

Hi,
I have a question regarding SynonymFilter.
I have a one-way mapping defined:
austin martin, astonmartin =>   aston martin


   

...

 

Can anybody please explain if my observation is correct. This is a very
critical aspect for my work.

   

That is correct - the synonym filter can recognize multi-token synonyms
from consecutive tokens in a stream.

Re: Always spellcheck (suggest)