Re: Problems with WordDelimiterFilterFactory

2009-10-08 Thread Christian Zambrano

Bern,

The only way that could be happening is if you are not using the field 
type you described in your original e-mail. The WordDelimiterFilterFactory 
token filter should take care of the hyphen.


On 10/08/2009 05:30 PM, Bernadette Houghton wrote:

Thanks for this Patrick. If I remove one of the hyphens, solr doesn't throw up 
the error, but still doesn't find the right record. I see from marklo's 
analysis page that solr is still parsing it with a hyphen. Changing this part 
of our schema.xml -

 

To

 

i.e. replacing non-alpha chars with a space, looks like it may handle that 
aspect.

Regards
Bern

-Original Message-
From: Patrick Jungermann [mailto:patrick.jungerm...@googlemail.com]
Sent: Friday, 9 October 2009 9:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Problems with WordDelimiterFilterFactory

Hi Bern,

the problem is the character sequence "--". A query is not allowed to
contain two or more consecutive minus characters. Remove one minus
character and the query will be parsed without problems.

Because of this parsing problem, I'd recommend cleaning up the query
before submitting it to the Solr server, replacing each sequence of minus
characters with a single one.
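
A minimal sketch of the cleanup described above, assuming the client application can pre-process the query string in Python before it is sent to Solr. The function name is hypothetical. Note that it only avoids the parse error; it does not change how the remaining single hyphen is analyzed.

```python
import re


def collapse_minus_runs(query: str) -> str:
    """Replace each run of two or more '-' characters with a single '-'."""
    return re.sub(r"-{2,}", "-", query)


# The query string from the error log in this thread.
print(collapse_minus_runs("(Asia -- Civilization AND status_i:(2))"))
```

A single hyphen is left untouched, so already-valid queries pass through unchanged.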


Regards, Patrick



Bernadette Houghton schrieb:
   

Sorry, the last line was truncated -

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse '(Asia -- Civilization AND status_i:(2)) ': Encountered "-" at line 1, 
column 7. Was expecting one of: "(" ... "*" ...  ...  ...  ...  ... 
"[" ... "{" ...  ...

-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au]
Sent: Friday, 9 October 2009 8:22 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Problems with WordDelimiterFilterFactory

Here's the query and the error -

Oct 09 08:20:17  [debug] [196] Solr query string:(Asia -- Civilization AND 
status_i:(2))
Oct 09 08:20:17  [debug] [196] Solr sort by:  score desc
Oct 09 08:20:17  [error] Error on searching: "400" Status: 
org.apache.lucene.queryParser.ParseException: Cannot parse '   (Asia -- Civilization AND 
status_i:(2)) ': Encount

Bern

-Original Message-
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Thursday, 8 October 2009 12:48 PM
To: solr-user@lucene.apache.org
Cc: solr-user@lucene.apache.org
Subject: Re: Problems with WordDelimiterFilterFactory

Bern,

I am interested in the Solr query. In other words, the query that your
system sends to Solr.

Thanks,


Christian

On Oct 7, 2009, at 5:56 PM, Bernadette Houghton wrote:

 

Hi Christian, try this one - http://www.deakin.edu.au/dro/view/DU:3601

Either scroll down and click one of the "television broadcasting --
asia" links, or type it in the Quick Search box.


TIA

bern

-Original Message-
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Thursday, 8 October 2009 9:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Problems with WordDelimiterFilterFactory

Could you please provide the exact URL of a query where you are
experiencing this problem?
eg(Not URL encoded): q=fieldName:"hot and cold: temperatures"

On 10/07/2009 05:32 PM, Bernadette Houghton wrote:
   

We are having some issues with our solr parent application not
retrieving records as expected.

For example, if the input query includes a colon (e.g. hot and
cold: temperatures), the relevant record (which contains a colon in
the same place) does not get retrieved; if the input query does not
include the colon, all is fine.  Ditto if the user searches for a
query containing hyphens, e.g. "asia - civilization", although with
the qualifier that something like "asia-civilization" (no spaces
either side of the hyphen) works fine, whereas "asia -
civilization" (spaces either side of hyphen) doesn't work.

Our schema.xml contains the following -

 
   
 
 
 
 
 
 
 
 
   
   
 
 
 
 
 
 
 
 
   
 

Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
MSN: bern_hough...@hotmail.com
Email: bernadette.hough...@deakin.edu.au
Website: http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B (Vic)

Important Notice: The contents of this email are intended solely
for the named addressee and are confidential; any unauthorised use,
reproduction or storage of the contents is expressly prohibited. If
you have received this email in error, please delete it and any
attachments immediately and advise the sender by return email or
telephone.
Deakin University does not warrant that this email and any
attachments are error or virus free



 
   


Re: Problems with WordDelimiterFilterFactory

2009-10-07 Thread Christian Zambrano

Bern,

I am interested in the Solr query. In other words, the query that your
system sends to Solr.


Thanks,


Christian

On Oct 7, 2009, at 5:56 PM, Bernadette Houghton wrote:



Hi Christian, try this one - http://www.deakin.edu.au/dro/view/DU:3601

Either scroll down and click one of the "television broadcasting --  
asia" links, or type it in the Quick Search box.



TIA

bern

-Original Message-
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Thursday, 8 October 2009 9:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Problems with WordDelimiterFilterFactory

Could you please provide the exact URL of a query where you are
experiencing this problem?
eg(Not URL encoded): q=fieldName:"hot and cold: temperatures"

On 10/07/2009 05:32 PM, Bernadette Houghton wrote:
We are having some issues with our solr parent application not  
retrieving records as expected.


For example, if the input query includes a colon (e.g. hot and
cold: temperatures), the relevant record (which contains a colon in
the same place) does not get retrieved; if the input query does not
include the colon, all is fine.  Ditto if the user searches for a
query containing hyphens, e.g. "asia - civilization", although with
the qualifier that something like "asia-civilization" (no spaces
either side of the hyphen) works fine, whereas "asia -
civilization" (spaces either side of hyphen) doesn't work.


Our schema.xml contains the following -

positionIncrementGap="100">

  


class="solr.ISOLatin1AccentFilterFactory"/>
words="stopwords.txt"/>
generateWordParts="1" generateNumberParts="1" catenateWords="1"  
catenateNumbers="1" catenateAll="0"/>


protected="protwords.txt"/>


  
  

class="solr.ISOLatin1AccentFilterFactory"/>
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
words="stopwords.txt"/>
generateWordParts="1" generateNumberParts="1" catenateWords="0"  
catenateNumbers="0" catenateAll="0"/>


protected="protwords.txt"/>


  








Re: Problems with WordDelimiterFilterFactory

2009-10-07 Thread Christian Zambrano
Could you please provide the exact URL of a query where you are 
experiencing this problem?

eg(Not URL encoded): q=fieldName:"hot and cold: temperatures"

On 10/07/2009 05:32 PM, Bernadette Houghton wrote:

We are having some issues with our solr parent application not retrieving 
records as expected.

For example, if the input query includes a colon (e.g. hot and cold: temperatures), the relevant record 
(which contains a colon in the same place) does not get retrieved; if the input query does not include 
the colon, all is fine.  Ditto if the user searches for a query containing hyphens, e.g. "asia - 
civilization", although with the qualifier that something like "asia-civilization" (no spaces 
either side of the hyphen) works fine, whereas "asia - civilization" (spaces either side of 
hyphen) doesn't work.

Our schema.xml contains the following -

 
   
 
 
 
 
 
 
 
 
   
   
 
 
 
 
 
 
 
 
   
 



   


Re: Facet query pb

2009-10-07 Thread Christian Zambrano

Clico,

Because you are doing a wildcard query, the token 'AMERICA' will not be 
analyzed at all. This means that 'AMERICA*' will NOT match 'america'.
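
A hedged sketch of how a client might build the prefix clause by hand, since the wildcard term bypasses analysis. It assumes the standard Lucene query syntax, where a backslash escapes a space inside a term, and it assumes you know whether the indexed terms are lowercased (the `lowercase` flag is a hypothetical switch, not a Solr parameter).

```python
def wildcard_filter(field: str, prefix: str, lowercase: bool = False) -> str:
    """Build a prefix (wildcard) clause, escaping backslashes and spaces."""
    if lowercase:
        # Only useful when the index-time analyzer lowercases its tokens.
        prefix = prefix.lower()
    escaped = prefix.replace("\\", "\\\\").replace(" ", "\\ ")
    return f"{field}:{escaped}*"


print(wildcard_filter("location_field", "NORTH AMERICA"))
print(wildcard_filter("location_field", "NORTH AMERICA", lowercase=True))
```

The resulting clause still has to be URL-encoded before it is placed in the fq parameter.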


On 10/07/2009 12:30 PM, Avlesh Singh wrote:

I have no idea what "pb" means but this is what you probably want -
fq=(location_field:(NORTH AMERICA*))

Cheers
Avlesh

On Wed, Oct 7, 2009 at 10:40 PM, clico  wrote:

   

Hello
I have a problem trying to retrieve a tree using facets.

I have a field, location_field.
Each doc in my index has a location_field.

The location field can be
continent/country/city


I have 2 queries:

http://server/solr//select?fq=(location_field:NORTH*)
 : OK, retrieves docs

http://server/solr//select?fq=(location_field:NORTHAMERICA*)
 : not OK


I think that with NORTH AMERICA I have a problem with the space character.

Could you help me?



--
View this message in context:
http://www.nabble.com/Facet-query-pb-tp25790667p25790667.html
Sent from the Solr - User mailing list archive at Nabble.com.


 
   


Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano

Got it. Sorry for not having an answer for your problem.

On 10/06/2009 04:58 PM, Ravi Kiran wrote:

You don't see any facet fields in my query because I have configured them in
solrconfig.xml, so that the dismax and standard handlers return specific
fields as facets by default. That way I don't have to specify all those fields
individually every time I query; all I need to do is set facet=true.

   
 
  dismax
  explicit
  0.01
  
 systemid^20.0 headline^20.0 keyword^18.0 person^18.0
organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
multimediablurb^1.5
  
  
 headline^20.5 keyword^18.5 person^18.5 organization^18.5
usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
  
  
 recip(rord(pubdatetime),1,1000,1000)^1.0
  
  
 *
  
  
 2<-1 5<-3 6<90%
  
  100
  *:*
  
  keyword
  
  0
  
  keyword
  regex  
  false
  1
  5
  5
  5
  5
  5
  5
  contenttype
  keyword
  keywordlower
  keywordformatted
  person
  personformatted
  organization
  usstate
  country
  subject
 
   


On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano wrote:

   

I am stumped then. I had a similar issue when I was using a field that was
being heavily tokenized, but I corrected the issue by using a
field (generated using copyField) that doesn't get analyzed at all.

In the query you provided before, I didn't see the parameter that tells Solr
which field it should produce facets for.

Something like:


http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location




On 10/06/2009 04:09 PM, Ravi Kiran wrote:

 

Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote:
 



   

And you had the analyzer for that field set up the same way as shown in
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:



 

I did in fact check it out, and there is no weirdness in the analysis page;
see the result below.

Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:


 




   

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be that one of the token filters is adding
the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:





 

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it treats the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query "



http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
"...but
wh

Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano
I am stumped then. I had a similar issue when I was using a field that 
was being heavily tokenized, but I corrected the issue by using a 
field (generated using copyField) that doesn't get analyzed at all.


In the query you provided before, I didn't see the parameter that tells 
Solr which field it should produce facets for.


Something like:

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location
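
For reference, a small sketch of assembling that request from a client, assuming Python's standard library is available. The base URL and the field name are taken from this thread; the helper name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/solr-admin/topicscore/select/"


def facet_url(facet_field: str) -> str:
    """Return a select URL that asks Solr to facet on one explicit field."""
    params = [
        ("facet", "true"),
        ("facet.limit", "-1"),          # -1 requests all facet values
        ("facet.field", facet_field),   # the field to produce facets for
    ]
    return BASE_URL + "?" + urlencode(params)


print(facet_url("location"))
```

Naming the field in the request makes it easy to compare the facet output against whatever defaults are configured in solrconfig.xml.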



On 10/06/2009 04:09 PM, Ravi Kiran wrote:

Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote:

   

And you had the analyzer for that field set up the same way as shown in
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

 

I did in fact check it out, and there is no weirdness in the analysis page;
see the result below.

Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:
 



   

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be that one of the token filters is adding
the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:



 

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it treats the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1",
but when I search with the standard handler (qt=standard&q=keyword:"New") I
don't find any doc which has just "New". After digging in a bit I found that
if several keywords have a common starting word, it is being pulled out as
another facet like the following. Any help is greatly appreciated.

Result

47 --> Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7   -->Ghost
5
5


7 -->Ghost
6
26
6

27
8
7
12

Schema.xml
-

 
   
 
 
 

 
 
   
   
 
 
 
 
 
   
 

 
 
 
 





   


 


   
 
   


Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano
And you had the analyzer for that field set up the same way as shown in 
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did in fact check it out, and there is no weirdness in the analysis page;
see the result below.

Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text New York, term type word, source start,end 0,8, payload (empty)


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:

   

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be that one of the token filters is adding
the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

 

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it treats the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1",
but when I search with the standard handler (qt=standard&q=keyword:"New") I
don't find any doc which has just "New". After digging in a bit I found that
if several keywords have a common starting word, it is being pulled out as
another facet like the following. Any help is greatly appreciated.

Result

47 --> Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7  -->   Ghost
5
5


7-->   Ghost
6
26
6

27
8
7
12

Schema.xml
-

 
   
 
 
 

 
 
   
   
 
 
 
 
 
   
 

 
 
 
 



   
 
   


Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano
Have you tried using the Analysis page to see what tokens are generated 
for the string "New York"? It could be that one of the token filters is 
adding the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,
   I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
that it treats the entire input as a single token, am I right? For example,
"New York" will be indexed as 'New York' and will not be split, right?
However, I see them split up in facets as follows when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1",
but when I search with the standard handler (qt=standard&q=keyword:"New") I
don't find any doc which has just "New". After digging in a bit I found that
if several keywords have a common starting word, it is being pulled out as
another facet like the following. Any help is greatly appreciated.

Result

47 --> Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7 -->  Ghost
5
5


7   -->  Ghost
6
26
6

27
8
7
12

Schema.xml
-

 
   
 
 
 

 
 
   
   
 
 
 
 
 
   
 

 
 
 
 

   


Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Christian Zambrano

Prasanna,

Wouldn't it be better to use built-in token filters at both index and  
query time that will convert 'it!' to just 'it'? I believe the  
WordDelimiterFilterFactory will do that for you.


Christian

On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan wrote:






On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" wrote:


Alternatively, is there a filter available which takes in a pattern and
produces additional forms of the token depending on the pattern? The use
case I am looking at here is using such a filter to automate synonym
generation. In our application, quite a few of the synonym file entries
match a specific pattern and having such a filter would make it easier, I
believe. Please do correct me in case I am missing some unwanted side-effect
with this approach.


I do not understand this. TokenFilters are used for things like stemming,
replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
additional tokens (synonyms) from a file for each token.

What exactly are you trying to do with synonyms? I guess you could do
stemming etc with synonyms but why do you want to do that?


I'll try to explain with an example. Given the term 'it!' in the title, it
should match both 'it' and 'it!' in the query as an exact match. Currently,
this is done by using a synonym entry (and index-time SynonymFilter) as
follows:

it! => it, it!

Now, the above holds true for all cases where you have a title token of the
form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
manually for each case, which is not easy to manage and does not scale.

I am hoping to do the same by using an index-time filter that takes in a
pattern like the PatternReplace filter and adds the newly created token
instead of replacing the original one. Does this make sense? Am I missing
something that would break this approach?
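
As a rough illustration of automating those entries outside Solr, here is a sketch that generates synonym lines of the form shown above from a list of title tokens. It is not a Solr token filter; the function name is hypothetical, and it assumes the pattern means one or more ASCII letters followed by '!'.

```python
import re

# Assumption: tokens of interest are ASCII letters followed by one '!'.
BANG_TOKEN = re.compile(r"^[A-Za-z]+!$")


def synonym_lines(tokens):
    """Return one 'token => bare, token' line per distinct matching token."""
    lines = []
    seen = set()
    for token in tokens:
        if BANG_TOKEN.match(token) and token not in seen:
            seen.add(token)
            lines.append(f"{token} => {token[:-1]}, {token}")
    return lines


for line in synonym_lines(["it!", "Yahoo!", "plain", "it!"]):
    print(line)
```

The generated lines could be written to the synonyms file before indexing; the affected documents would still need a re-index whenever that file changes.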



Note that a change in synonym file needs a re-index of the affected
documents. Also, the synonym map is kept in memory.


What is the overhead incurred in having an additional filter applied during
indexing? Is it strictly CPU only?

Thanks a lot for your valuable input.

Regards,

Prasanna.



Re: Need "OR" in DisMax Query

2009-10-05 Thread Christian Zambrano

David,

If your schema includes fields with analyzers that use the 
StopFilterFactory and the dismax QueryHandler is set-up to search within 
those fields, then you are correct.
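
To make the effect concrete, here is a toy simulation of stop-word removal (plain Python, not Solr code). It assumes the stop-word list contains "or" and that matching ignores case, which is what would turn the OR in the query into a dropped token; the word list below is made up for the example.

```python
def remove_stop_words(query, stop_words, ignore_case=True):
    """Split on whitespace and drop tokens found in stop_words."""
    kept = []
    for token in query.split():
        key = token.lower() if ignore_case else token
        if key not in stop_words:
            kept.append(token)
    return kept


# Assumed contents of a stop-word list; entries are lowercase.
STOP_WORDS = {"or", "and", "the"}

print(remove_stop_words("red OR green", STOP_WORDS))
```

With case-sensitive matching and a lowercase list, the uppercase OR would survive as an ordinary token instead.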



On 10/05/2009 01:36 PM, David Giffin wrote:

Hi There,

Maybe I'm missing something, but I can't seem to get the dismax
request handler to perform an OR query. It appears that OR is removed
by the stop words. I'd like to do something like
"qt=dismax&q=red+OR+green" and get all green and all red results.

Thanks,
David
   


Re: wildcard searches

2009-10-05 Thread Christian Zambrano



On 10/05/2009 01:18 PM, Avlesh Singh wrote:

First of all, I know of no way of doing wildcard phrase queries.

 

http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_combine_wildcard_and_phrase_search.2C_e.g._.22foo_ba.2A.22.3F
   

Thanks for that link

When I said no filters, I meant TokenFilters, which is what I believe you
mean by 'not analyzed'

 

Analysis is a Lucene way of configuring tokenizers and filters for a field
(index time and query time). I guess, both of us mean the same thing.
   
You are correct. I should have said ' Not Analyzed'. Thanks for the 
correction.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 11:04 PM, Christian Zambrano wrote:

   

Avlesh, I don't understand your answer.

First of all, I know of no way of doing wildcard phrase queries.

When I said no filters, I meant TokenFilters, which is what I believe you
mean by 'not analyzed'


On 10/05/2009 12:27 PM, Avlesh Singh wrote:

 

No filters are applied to wildcard/fuzzy searches.
   



 

Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano wrote:
 



   

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this in either the Solr or Lucene
documentation, but I read it in the Solr book from Packt.


On 10/05/2009 12:09 PM, Angel Ice wrote:



 

Hi everyone,

I have a little question regarding the search engine when a wildcard
character is used in the query.
Let's take the following example :

- I have indexed the word Hésitation (with an accent on the "e").
- The filters applied to the field that handles this word result in "esit"
being indexed: the mute H is removed (home-made filter), the accent too
(IsoLatin1Filter), and the SnowballPorterFilter removes the "ation".

When I search for "hesitation", "esitation", "ésitation" etc., all is
OK, the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to place the wildcard in a way that matches the
indexed term exactly (for example "esi*").

Does the search engine apply the filters to the word that prefixes the
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent



   


 


   
 
   


Re: A little help with indexing joined words

2009-10-05 Thread Christian Zambrano
Would you mind explaining how omitNorms has any effect on the IDF problem 
I described earlier?


I agree with your second sentence. I had to use the NGramTokenFilter to 
accommodate partial matches.
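
A small sketch of what edge n-gramming does to a single token, to show both why a partial query such as "borderland" can match and why the term count grows. This is plain Python, not the Solr filter; the minimum and maximum gram sizes are assumptions chosen for the example.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


grams = edge_ngrams("borderlands", 3, 15)
print(grams)
print(len(grams), "terms indexed for one source token")
```

Every source token expands into several indexed terms, which is the index growth discussed in this thread.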


On 10/05/2009 12:11 PM, Avlesh Singh wrote:

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number of
tokens, which will artificially increase the number of tokens in the index,
which in turn will affect the IDF score.

 

Well, I don't see a reason why someone would need length-based
normalization on such matches. I have always set omitNorms on
fields that use this filter.

Yes, synonyms might be an answer when you have a limited number of such words
(phrases) and their possible combinations.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:32 PM, Christian Zambrano wrote:

   

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number of
tokens, which will artificially increase the number of tokens in the index,
which in turn will affect the IDF score.

A query for "borderland" should have returned results though. It is
difficult to troubleshoot why it didn't without knowing what query you used,
and what kind of analysis is taking place.

Have you tried using the analysis page in the admin section to see what
tokens get generated for 'Borderlands'?

Christian


On 10/05/2009 11:01 AM, Avlesh Singh wrote:

 

We have indexed a product database and have come across some search terms
   

where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.



 

"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe   wrote:



   

Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or
product names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using
a text field type.

Thanks in advance
Andrew



 


   
 
   


Re: wildcard searches

2009-10-05 Thread Christian Zambrano

Avlesh, I don't understand your answer.

First of all, I know of no way of doing wildcard phrase queries.

When I said no filters, I meant TokenFilters, which is what I believe 
you mean by 'not analyzed'.


On 10/05/2009 12:27 PM, Avlesh Singh wrote:

No filters are applied to wildcard/fuzzy searches.

 

Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano wrote:

   

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this in either the Solr or Lucene
documentation, but I read it in the Solr book from Packt.


On 10/05/2009 12:09 PM, Angel Ice wrote:

 

Hi everyone,

I have a little question regarding the search engine when a wildcard
character is used in the query.
Let's take the following example :

- I have indexed the word Hésitation (with an accent on the "e").
- The filters applied to the field that handles this word result in "esit"
being indexed: the mute H is removed (home-made filter), the accent is
removed (IsoLatin1Filter), and the SnowballPorterFilter removes the "ation".

When I search for "hesitation", "esitation", "ésitation", etc., all is
OK and the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to write the wildcard query so that its prefix
matches the indexed term exactly (for example "esi*").

Does the search engine apply the filters to the prefix that precedes the
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent

   
 
   


Re: wildcard searches

2009-10-05 Thread Christian Zambrano

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this in either the Solr or Lucene 
documentation, but I read it in the Solr book from Packt.
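
To make the effect concrete, here is a rough sketch in Python, not Solr code. The analyze() function is only a toy stand-in for the index-time chain described in the question (mute-H removal, accent stripping, a crude replacement for the Snowball stemmer); all names in it are made up for illustration. It assumes the wildcard prefix is compared verbatim against the indexed term.

```python
import unicodedata


def strip_accents(text):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Toy stand-in for the index-time filter chain from the question."""
    token = strip_accents(text.lower())
    if token.startswith("h"):        # stand-in for the home-made mute-H filter
        token = token[1:]
    if token.endswith("ation"):      # crude stand-in for the Snowball stemmer
        token = token[: -len("ation")]
    return token


indexed_term = analyze("Hésitation")


def term_query_matches(user_input):
    # Ordinary term queries go through the same analysis as the indexed text.
    return analyze(user_input) == indexed_term


def wildcard_query_matches(raw_prefix):
    # Assumption: the prefix is NOT analyzed, so it is compared as typed.
    return indexed_term.startswith(raw_prefix)


print(indexed_term)
print(term_query_matches("hesitation"), wildcard_query_matches("hésita"))
print(wildcard_query_matches("esi"))
```

Under these assumptions the prefix only matches when it is typed the way the term was indexed, which is what the question reports for "esi*".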


On 10/05/2009 12:09 PM, Angel Ice wrote:

Hi everyone,

I have a little question regarding the search engine when a wildcard character 
is used in the query.
Let's take the following example :

- I have indexed the word Hésitation (with an accent on the "e").
- The filters applied to the field that handles this word result in "esit" being indexed: 
the mute H is removed (home-made filter), the accent is removed (IsoLatin1Filter), 
and the SnowballPorterFilter removes the "ation".

When I search for "hesitation", "esitation", "ésitation", etc., all is OK and the 
document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not returned. In fact, I 
have to write the wildcard query so that its prefix matches the indexed term exactly (for example "esi*").

Does the search engine apply the filters to the prefix that precedes the wildcard? 
Or does it use this prefix verbatim?

Thanks for your help.

Laurent




   


Re: Question regarding synonym

2009-10-05 Thread Christian Zambrano

You are correct.

I would recommend using the synonym TokenFilter only at index time 
unless you have a very good reason to do it at query time.
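
As a rough illustration of why index-time expansion makes the single-word queries work, here is a small Python sketch. It is not how SynonymFilterFactory is implemented: it ignores token positions and simply adds every word of a matching synonym group to the set of terms indexed for a document. The function name and the sample document text are made up.

```python
# One group per line of the synonyms file, as in "bayrische motoren werke, bmw".
SYNONYM_GROUPS = [["bayrische motoren werke", "bmw"]]


def expand_at_index_time(text):
    """Return the set of terms indexed for `text` when synonyms are expanded."""
    tokens = text.lower().split()
    padded = " " + " ".join(tokens) + " "
    expanded = set(tokens)
    for group in SYNONYM_GROUPS:
        # If any phrase of the group occurs in the text, index all of them.
        if any(" " + phrase + " " in padded for phrase in group):
            for phrase in group:
                expanded.update(phrase.split())
    return expanded


doc_terms = expand_at_index_time("2009 BMW 328i sedan")
print(sorted(doc_terms))
```

With the expanded term set, a query for text:motoren or text:bayrische finds the document without any query-time synonym handling.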


On 10/05/2009 11:46 AM, darniz wrote:

Yes, that's what we decided: to expand these terms while indexing.
If we have
bayrische motoren werke =>  bmw

and I have a document which has bmw in it, searching for text:bayrische does
not give me results. I have to search for
text:"bayrische motoren werke"; only then does it apply the synonym and get
me the document.

Now suppose I change the synonym mapping to
bayrische motoren werke , bmw with the expand parameter set to true, and also use
this file at indexing.

Then at the time I index this document, along with "bmw" I also index the
following words: "bayrische" "motoren" "werke"

Any text query like text:motoren or text:bayrische will give me results now.

Please correct me if my assumption is wrong.

Thanks
darniz









Christian Zambrano wrote:
   



On 10/02/2009 06:02 PM, darniz wrote:
 

Thanks
As i said it even works by giving double quotes too.
like carDescription:"austin martin"

So is the conclusion that in order to map a two-word synonym I have to
always enclose it in double quotes, so that it does not split the words?




   

Yes, but there are things you need to keep in mind.

  From the solr wiki:

Keep in mind that while the SynonymFilter will happily work with
*synonyms* containing multiple words (ie:
"sea biscuit, sea biscit, seabiscuit") The recommended approach for
dealing with *synonyms* like this, is to expand the synonym when
indexing. This is because there are two potential issues that can arise
at query time:

1.

   The Lucene QueryParser tokenizes on white space before giving any
   text to the Analyzer, so if a person searches for the words
   sea biscit the analyzer will be given the words "sea" and "biscit"
   separately, and will not know that they match a synonym.

2.

   Phrase searching (ie: "sea biscit") will cause the QueryParser to
   pass the entire string to the analyzer, but if the SynonymFilter
   is configured to expand the *synonyms*, then when the QueryParser
   gets the resulting list of tokens back from the Analyzer, it will
   construct a MultiPhraseQuery that will not have the desired
   effect. This is because of the limited mechanism available for the
   Analyzer to indicate that two terms occupy the same position:
   there is no way to indicate that a "phrase" occupies the same
   position as a term. For our example the resulting MultiPhraseQuery
   would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would
   not match the simple case of "seabiscuit" occurring in a document.


 







Christian Zambrano wrote:

   

When you use a field qualifier (fieldName:valueToLookFor) it only applies
to the word right after the colon. If you look at the debug
information you will notice that for the second word it is using the
default field.

carDescription:austin
*text*:martin

The following should work:

carDescription:(austin martin)


On 10/02/2009 05:46 PM, darniz wrote:

 

This is not working when i search documents i have a document which
contains
text aston martin

when i search carDescription:"austin martin" i get a match but when i
dont
give double quotes

like carDescription:austin martin
there is no match

in the analyser if i give austin martin with out quotes, when it passes
through synonym filter it matches aston martin ,
may be by default analyser treats it as a phrase "austin martin" but
when
i
try to do a query by typing
carDescription:austin martin i get 0 documents. the following is the
debug
node info with debugQuery=on

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin
text:martin

dont know why it breaks the word, may be its a desired behaviour
when i give carDescription:"austin martin" of course in this its able
to
map
to synonym and i get the desired result

Any opinion

darniz



Ensdorf Ken wrote:


   


 

Hi
i have a question regarding synonymfilter
i have a one way mapping defined
austin martin, astonmartin =>aston martin



   

...


 

Can anybody please explain if my observation is correct. This is a
very
critical aspect for my work.


   

That is correct - the synonym filter can recognize multi-token
synonyms
from consecutive tokens in a stream.





 


   


 


   


 
   


Re: A little help with indexing joined words

2009-10-05 Thread Christian Zambrano
Using synonyms might be a better solution because 
EdgeNGramTokenizerFactory has the potential of creating a large number 
of tokens, which will artificially increase the number of terms in the 
index, which in turn will affect the IDF score.
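
To give a feel for the growth, here is a small Python sketch that generates leading-edge n-grams the way an edge n-gram tokenizer conceptually does. It is not the Solr implementation; the function and its min_gram/max_gram parameters are illustrative only, and the titles are the two product names from the question.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Return the leading-edge n-grams of a token, shortest first."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


titles = ["borderlands", "dragonfly"]

# Terms in the index without and with edge n-grams (minimum length 3).
plain_terms = set(titles)
ngram_terms = {gram for title in titles for gram in edge_ngrams(title, min_gram=3)}

print(len(plain_terms), len(ngram_terms))
print("borderland" in ngram_terms)
```

Two terms become sixteen in this toy case. "borderland" does become searchable, but every extra term also changes the term statistics that scoring relies on.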


A query for "borderland" should have returned results though. It is 
difficult to troubleshoot why it didn't without knowing what query you 
used, and what kind of analysis is taking place.


Have you tried using the analysis page in the admin section to see what 
tokens get generated for 'Borderlands'?


Christian

On 10/05/2009 11:01 AM, Avlesh Singh wrote:

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

 

"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe  wrote:

   

Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or
product names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using a
text field type.

Thanks in advance
Andrew

 
   


Re: Always spellcheck (suggest)

2009-10-05 Thread Christian Zambrano

Shalin,


Thanks for the clarification. That explains a lot. I should have looked 
at the lucene documentation.



On 10/05/2009 05:28 AM, Shalin Shekhar Mangar wrote:

On Mon, Oct 5, 2009 at 10:24 AM, Christian Zambrano wrote:

   

I am really surprised that a query for "behaviour" returns "behavior" as a
suggestion only when the parameter "spellcheck.onlyMorePopular=true" is
present. I re-read the documentation and I see nothing that will imply that
the parameter onlyMorePopular will do anything else but filter the
suggestions solr will return.

Maybe somebody else can shed some light on this.


 

Yeah, that is true. All this is actually done in the Lucene SpellChecker.
Solr's component is a wrapper over it with some extra features. I've added a
clarification to the wiki page.

   


Re: Always spellcheck (suggest)

2009-10-04 Thread Christian Zambrano
I am really surprised that a query for "behaviour" returns "behavior" as 
a suggestion only when the parameter "spellcheck.onlyMorePopular=true" 
is present. I re-read the documentation and I see nothing that will 
imply that the parameter onlyMorePopular will do anything else but 
filter the suggestions solr will return.


Maybe somebody else can shed some light on this.

On 10/04/2009 09:51 PM, Greg Pendlebury wrote:

Thanks. I'll have to look into modifications then (was hoping to avoid that).

For clarity though I believe this point is slightly off:

   

"Adding the parameter onlyMorePopular limits the suggestions that solr can give 
you(to ones that return more hits than the existing query), nothing more."
   

The flag is definitely returning suggestions, even for 'correct' terms, they 
just have to be more popular 'correct' terms.

Eg. 'behaviour' suggests 'behavior' because it has four times as many hits, but 
they are both 'correct' and the suggestion does not occur without the 
'onlyMorePopular' flag set. 'behavior' will not suggest 'behaviour' however 
because it is less popular.

Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

Greg,

I apologize if I misunderstood your original post. I don't think there
is a way you can force Solr to return suggestions when all of the words
are "correctly" spelled. Adding the parameter onlyMorePopular limits the
suggestions that Solr can give you (to ones that return more hits than
the existing query), nothing more.

In short, I believe the answer is No.

On 10/04/2009 09:19 PM, Greg Pendlebury wrote:
   

Thanks for the response Christian. I'll modify my original point (1) then. Is 
'onlyMorePopular' the only way to return suggestions when all of the search 
terms are present in the dictionary (ie. correct)? Is there any way to force 
behaviour (1) without behaviour (2) (filtering on frequency).

Ta,
Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

I believe your understanding is incorrect. The first behavior you
described is produced by adding the parameter "spellcheck=true".
Suggestions will be returned regardless of whether there are results.
The only time I believe spelling suggestions might not be included is
when all of the words are spelled "correctly".

On 10/04/2009 07:55 PM, Greg Pendlebury wrote:

 

Hi All,

If I understand correctly the flag 'onlyMorePopular' encapsulates two 
independent behaviours. 1) It runs spell checking across queries that returned 
hits. Without the flag spell checking is not run when results are found. 2) It 
limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another 
flag I'm not seeing in the doco? The usage context is spelling suggestions for 
international usage. Eg. The user searches 'behaviour', we want it to suggest 
US spelling 'behavior' and vice versa. At the moment, the suggestion only works 
one way.

Ta,
Greg


This email (including any attached files) is confidential and is for the
intended recipient(s) only.  If you received this email by mistake,
please, as a courtesy, tell the sender, then delete this email.

The views and opinions are the originator's and do not necessarily
reflect those of the University of Southern Queensland.  Although all
reasonable precautions were taken to ensure that this email contained no
viruses at the time it was sent we accept no liability for any losses
arising from its receipt.

The University of Southern Queensland is a registered provider of
education with the Australian Government (CRICOS Institution Code No's.
QLD 00244B / NSW 02225M)





   




 


Re: Always spellcheck (suggest)

2009-10-04 Thread Christian Zambrano

Greg,

I apologize if I misunderstood your original post. I don't think there 
is a way you can force Solr to return suggestions when all of the words 
are "correctly" spelled. Adding the parameter onlyMorePopular limits the 
suggestions that Solr can give you (to ones that return more hits than 
the existing query), nothing more.


In short, I believe the answer is No.

On 10/04/2009 09:19 PM, Greg Pendlebury wrote:

Thanks for the response Christian. I'll modify my original point (1) then. Is 
'onlyMorePopular' the only way to return suggestions when all of the search 
terms are present in the dictionary (ie. correct)? Is there any way to force 
behaviour (1) without behaviour (2) (filtering on frequency).

Ta,
Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

I believe your understanding is incorrect. The first behavior you
described is produced by adding the parameter "spellcheck=true".
Suggestions will be returned regardless of whether there are results.
The only time I believe spelling suggestions might not be included is
when all of the words are spelled "correctly".

On 10/04/2009 07:55 PM, Greg Pendlebury wrote:
   

Hi All,

If I understand correctly the flag 'onlyMorePopular' encapsulates two 
independent behaviours. 1) It runs spell checking across queries that returned 
hits. Without the flag spell checking is not run when results are found. 2) It 
limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another 
flag I'm not seeing in the doco? The usage context is spelling suggestions for 
international usage. Eg. The user searches 'behaviour', we want it to suggest 
US spelling 'behavior' and vice versa. At the moment, the suggestion only works 
one way.

Ta,
Greg






 



   


Re: Question regarding synonym

2009-10-04 Thread Christian Zambrano



On 10/02/2009 06:02 PM, darniz wrote:

Thanks
As i said it even works by giving double quotes too.
like carDescription:"austin martin"

So is the conclusion that in order to map a two-word synonym I have to
always enclose it in double quotes, so that it does not split the words?



   

Yes, but there are things you need to keep in mind.

From the solr wiki:

Keep in mind that while the SynonymFilter will happily work with 
*synonyms* containing multiple words (ie: 
"sea biscuit, sea biscit, seabiscuit") The recommended approach for 
dealing with *synonyms* like this, is to expand the synonym when 
indexing. This is because there are two potential issues that can arise 
at query time:


  1.

 The Lucene QueryParser tokenizes on white space before giving any
 text to the Analyzer, so if a person searches for the words
 sea biscit the analyzer will be given the words "sea" and "biscit"
 separately, and will not know that they match a synonym.

  2.

 Phrase searching (ie: "sea biscit") will cause the QueryParser to
 pass the entire string to the analyzer, but if the SynonymFilter
 is configured to expand the *synonyms*, then when the QueryParser
 gets the resulting list of tokens back from the Analyzer, it will
 construct a MultiPhraseQuery that will not have the desired
 effect. This is because of the limited mechanism available for the
 Analyzer to indicate that two terms occupy the same position:
 there is no way to indicate that a "phrase" occupies the same
 position as a term. For our example the resulting MultiPhraseQuery
 would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would
 not match the simple case of "seabiscuit" occurring in a document.
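
The second issue can be sketched in a few lines of Python. This is only a toy model of position-based matching, not Lucene's MultiPhraseQuery: a query is a list of positions, each holding the alternative terms allowed at that position, and a document matches if some run of consecutive tokens satisfies every position in order. The function name is made up.

```python
def multi_phrase_match(doc_tokens, positions):
    """Toy model: each entry of `positions` is a set of allowed alternatives."""
    width = len(positions)
    for start in range(len(doc_tokens) - width + 1):
        window = doc_tokens[start:start + width]
        if all(token in allowed for token, allowed in zip(window, positions)):
            return True
    return False


# The query from the wiki text: (sea | sea | seabiscuit) (biscuit | biscit)
query = [{"sea", "seabiscuit"}, {"biscuit", "biscit"}]

print(multi_phrase_match(["sea", "biscuit"], query))
print(multi_phrase_match(["seabiscuit"], query))
```

Because the single token "seabiscuit" can only fill the first position, a document containing just that word has nothing to put in the second position and is not matched.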










Christian Zambrano wrote:
   

When you use a field qualifier (fieldName:valueToLookFor) it only applies
to the word right after the colon. If you look at the debug
information you will notice that for the second word it is using the
default field.

carDescription:austin *text*:martin

The following should work:

carDescription:(austin martin)


On 10/02/2009 05:46 PM, darniz wrote:
 

This is not working when i search documents i have a document which
contains
text aston martin

when i search carDescription:"austin martin" i get a match but when i
dont
give double quotes

like carDescription:austin martin
there is no match

in the analyser if i give austin martin with out quotes, when it passes
through synonym filter it matches aston martin ,
may be by default analyser treats it as a phrase "austin martin" but when
i
try to do a query by typing
carDescription:austin martin i get 0 documents. the following is the
debug
node info with debugQuery=on

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin text:martin

dont know why it breaks the word, may be its a desired behaviour
when i give carDescription:"austin martin" of course in this its able to
map
to synonym and i get the desired result

Any opinion

darniz



Ensdorf Ken wrote:

   


 

Hi
i have a question regarding synonymfilter
i have a one way mapping defined
austin martin, astonmartin =>   aston martin


   

...

 

Can anybody please explain if my observation is correct. This is a very
critical aspect for my work.

   

That is correct - the synonym filter can recognize multi-token synonyms
from consecutive tokens in a stream.




 


   


 
   


Re: Always spellcheck (suggest)

2009-10-04 Thread Christian Zambrano
I believe your understanding is incorrect. The first behavior you 
described is produced by adding the parameter "spellcheck=true". 
Suggestions will be returned regardless of whether there are results. 
The only time I believe spelling suggestions might not be included is 
when all of the words are spelled "correctly".
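
For reference, here is a minimal Python sketch of how the two request parameters discussed in this thread could be put on a query URL. The host, port, and handler path are placeholders, and it assumes the spellcheck component is configured on that handler; only the parameter names spellcheck and spellcheck.onlyMorePopular come from the thread.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port, and handler to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def spellcheck_url(query, only_more_popular=False):
    """Build a select URL that asks for spelling suggestions."""
    params = [("q", query), ("spellcheck", "true")]
    if only_more_popular:
        # Restricts suggestions to terms more frequent than the query term.
        params.append(("spellcheck.onlyMorePopular", "true"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(spellcheck_url("behaviour"))
print(spellcheck_url("behaviour", only_more_popular=True))
```

Comparing the responses to the two URLs for the same query is a quick way to see what onlyMorePopular changes in a given index.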


On 10/04/2009 07:55 PM, Greg Pendlebury wrote:

Hi All,

If I understand correctly the flag 'onlyMorePopular' encapsulates two 
independent behaviours. 1) It runs spell checking across queries that returned 
hits. Without the flag spell checking is not run when results are found. 2) It 
limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another 
flag I'm not seeing in the doco? The usage context is spelling suggestions for 
international usage. Eg. The user searches 'behaviour', we want it to suggest 
US spelling 'behavior' and vice versa. At the moment, the suggestion only works 
one way.

Ta,
Greg





   


Re: Question regarding synonym

2009-10-02 Thread Christian Zambrano
When you use a field qualifier (fieldName:valueToLookFor) it only applies 
to the word right after the colon. If you look at the debug 
information you will notice that for the second word it is using the 
default field.


carDescription:austin *text*:martin

The following should work:

carDescription:(austin martin)
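
The rule can be mimicked with a small Python sketch. This is not the Lucene query parser, just a toy that handles bare words, quoted phrases, and parenthesised groups, and it assumes the default field is called "text" as in the debug output quoted below. The function name is made up.

```python
import re

# A clause is an optional "field:" prefix followed by a (group),
# a "quoted phrase", or a bare word.
CLAUSE = re.compile(r'(?:(\w+):)?(\([^)]*\)|"[^"]*"|\S+)')


def assign_fields(query, default_field="text"):
    """Return (field, term) pairs showing which field each term is searched in."""
    clauses = []
    for field, value in CLAUSE.findall(query):
        field = field or default_field
        if value.startswith("("):
            # The qualifier applies to every word inside the parentheses.
            clauses.extend((field, word) for word in value[1:-1].split())
        else:
            clauses.append((field, value))
    return clauses


print(assign_fields("carDescription:austin martin"))
print(assign_fields("carDescription:(austin martin)"))
print(assign_fields('carDescription:"austin martin"'))
```

The first call shows "martin" falling back to the default field, while the grouped and quoted forms keep both words in carDescription.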


On 10/02/2009 05:46 PM, darniz wrote:

This is not working when i search documents i have a document which contains
text aston martin

when i search carDescription:"austin martin" i get a match but when i dont
give double quotes

like carDescription:austin martin
there is no match

in the analyser if i give austin martin with out quotes, when it passes
through synonym filter it matches aston martin ,
may be by default analyser treats it as a phrase "austin martin" but when i
try to do a query by typing
carDescription:austin martin i get 0 documents. the following is the debug
node info with debugQuery=on

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin text:martin

dont know why it breaks the word, may be its a desired behaviour
when i give carDescription:"austin martin" of course in this its able to map
to synonym and i get the desired result

Any opinion

darniz



Ensdorf Ken wrote:
   
 

Hi
i have a question regarding synonymfilter
i have a one way mapping defined
austin martin, astonmartin =>  aston martin

   

...
 

Can anybody please explain if my observation is correct. This is a very
critical aspect for my work.
   

That is correct - the synonym filter can recognize multi-token synonyms
from consecutive tokens in a stream.



 
   


Re: Problem with Wildcard...

2009-10-02 Thread Christian Zambrano
Another thing to remember about wildcard and fuzzy searches is that none 
of the token filters will be applied.


If you are using the LowerCaseFilterFactory at index time, then 
"RI-MC500034-1" gets converted to "ri-mc500034-1", which is never going to 
match "RI-MC5000*".
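
A common workaround is to lowercase the wildcard term in the client before sending it. The Python sketch below only illustrates the idea with shell-style pattern matching; it assumes the whole value was indexed as one lowercased token, which may not hold if other filters in the field type split it further.

```python
from fnmatch import fnmatchcase

# Assumption: the value was indexed as a single lowercased token.
indexed_terms = ["ri-mc500034-1"]


def wildcard_hits(pattern):
    """Case-sensitive wildcard match against the indexed terms."""
    return [term for term in indexed_terms if fnmatchcase(term, pattern)]


user_input = "RI-MC5000*"
print(wildcard_hits(user_input))           # pattern used as typed
print(wildcard_hits(user_input.lower()))   # pattern lowercased by the client
```

Only the lowercased pattern lines up with what is in the index.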


Also, I would probably use the analysis page of your Solr admin site to 
see what tokens are produced from "RI-MC500034-1" and "500034" based on 
your schema.


On 10/01/2009 02:42 AM, Shalin Shekhar Mangar wrote:

On Tue, Sep 29, 2009 at 6:42 PM, Jörg Agatz wrote:

   

Hi Users...

I have a problem.

I have a lot of fields (type=text). To search across all of them, I copy all
fields into the default text field and use that for the default search.

Now I want to search...

This is in a field:

"RI-MC500034-1"
When I search for "RI-MC500034-1" I find it...
If I search for "RI-MC5000*" I don't.

When I search for "500034" I find it...
If I search for "5000*" I don't.

What can I do to use wildcards?

 

I guess one thing you need to do is to add preserveOriginal="true" in the
WordDelimiterFilterFactory section in your field type. That would help match
things like "RI-MC5000*". Make sure you re-index all documents after this
change.

As for the others, add debugQuery=on as a request parameter and see how the
query is being parsed. If you have a doubt, paste it on the list and we can
help you.

   


Re: What Tokenizerfactory/TokenFilterFactory can/should I use so a search for "wal mart" matches "walmart"(quotes not included in search or index)?

2009-09-11 Thread Christian Zambrano

Ahmet,

Thanks a lot. Your suggestion was really helpful. I had tried using synonyms 
before and for some reason it didn't work, but this time around it did.


On 09/11/2009 02:55 AM, AHMET ARSLAN wrote:

There are a lot of company names whose correct spelling
people are uncertain about. A few
examples are:
1. best buy, bestbuy
2. walmart, wal mart, wal-mart
3. Holiday Inn, HolidayInn

What Tokenizer Factory and/or TokenFilterFactory should I
use so that somebody typing "wal mart"(quotes not included)
will find "wal mart" and "walmart"(again, quotes not
included)
 

I faced a similar requirement before. I solved it by hardcoding those names to 
synonyms_index.txt and using SynonymFilterFactory at index time.

synonyms_index.txt will contain:

best buy, bestbuy
walmart, wal mart
Holiday Inn, HolidayInn


   
   
   
   

   
   


Since the Solr wiki [1] advises using index-time synonyms when dealing with 
multi-word synonyms, I am using index-time synonym expansion only.

[1] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

When working with StandardAnalyzer, wal-mart is broken into two tokens: wal and 
mart. So you don't need to write the hyphenated forms of the words in synonyms_index.txt.


If all of your examples were similar to HolidayInn, you could use solr.WordDelimiterFilterFactory 
(without writing all these company names to a file), but you can't handle "wal mart" and 
"walmart" with it.

Hope this helps.



   


What Tokenizerfactory/TokenFilterFactory can/should I use so a search for "wal mart" matches "walmart"(quotes not included in search or index)?

2009-09-10 Thread Christian Zambrano
There are a lot of company names whose correct spelling people are 
uncertain about. A few examples are:

1. best buy, bestbuy
2. walmart, wal mart, wal-mart
3. Holiday Inn, HolidayInn

What Tokenizer Factory and/or TokenFilterFactory should I use so that 
somebody typing "wal mart"(quotes not included) will find "wal mart" and 
"walmart"(again, quotes not included)


Thanks,

Christian