[search > edismax] compound words different result issue

2019-02-11 Thread 유정인
Hi,

I use 'edismax', and our main language makes heavy use of compound words.

This causes a problem. For example, assume that 'ab' is analyzed into the
tokens 'a' and 'b'. Searching for 'ab' and searching for 'a b' then return
different results.

I want a search for 'ab' to return the same results as a search for 'a b'.

Is there a way to do this?

Re: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

2015-04-07 Thread Mike L.

Typo: *even when the user delimits with a space (e.g. "base ball" should find 
"baseball").

Thanks,



  

DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

2015-04-07 Thread Mike L.

Solr User Group -

   I have a case where I need to be able to search against compound words, even 
when the user delimits with a space (e.g. baseball = base ball). I think 
I've solved this by creating a compound-words dictionary file containing the 
subwords that I want DictionaryCompoundWordTokenFilterFactory to split on:

  base
  ball

I also added the following rule to the synonym file (to allow baseball to also 
get a hit):

  baseball => base ball

The filter is configured as follows:

  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
          dictionary="compound-words.txt" minWordSize="5" minSubwordSize="2"
          maxSubwordSize="15" onlyLongestMatch="true"/>

Two questions: if I could figure out in advance all the compound words I would 
want to split, would it be better (more reliable results) to maintain 
this compound-words file, or would it be better to throw one of those 
OpenOffice dictionaries at the filter?

Also, are there any better suggestions for dealing with this problem than the 
one I described, which uses both the dictionary filter and the synonym rule?

Thanks in advance!
Mike
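
For readers who want to see what dictionary-based splitting does with the settings above, here is a minimal Python sketch. It is a simplified illustration of the idea only, not Lucene's implementation; the function name `decompose` is hypothetical, and its keyword arguments simply mirror the filter attributes quoted in the message.

```python
def decompose(token, dictionary, min_word_size=5, min_subword_size=2,
              max_subword_size=15, only_longest_match=True):
    """Return the original token followed by the dictionary subwords found in it."""
    tokens = [token]
    # Tokens shorter than min_word_size are passed through untouched.
    if len(token) < min_word_size:
        return tokens
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        matches = []
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(lowered):
                break
            candidate = lowered[start:start + length]
            if candidate in dictionary:
                matches.append(candidate)
        if matches:
            # matches is ordered by length, so the last entry is the longest.
            tokens.extend([matches[-1]] if only_longest_match else matches)
    return tokens


print(decompose("baseball", {"base", "ball"}))
print(decompose("ball", {"base", "ball"}))
```

With a dictionary containing only "base" and "ball", the first call keeps "baseball" and adds both subwords, while "ball" alone is left as-is because it is shorter than minWordSize.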



Re: Having trouble with German compound words in Solr 4.7

2014-04-24 Thread Siegfried Goeschl

Hi Alistair,

it seems that there are many ways to skin the cat, so I will describe the 
approach I used with Solr 3.6 :-)


* I use a patched DictionaryCompoundWordTokenFilterFactory in the 
index phase, so the German compound noun "Leinenhose" (linen 
trousers) is indexed in addition to "Leinen" and "Hose". Afterwards 
the three tokens go through stemming.


* One hint which might be useful: I only split words which I consider 
proper German compound nouns. E.g. if your indexed text contains the 
token "schwarzkleid" I would NOT split it, since it is NOT a proper noun. 
The proper noun would be "Schwarzkleid" (please note that even 
"Schwarzkleid" is not a proper German noun anyway :-)


* I use a custom dictionary for splitting, consisting of 7,000 entries, 
which contains a lot of customer-specific entries.


I do not tinker with DictionaryCompoundWordTokenFilterFactory in the 
query phase of the field, so the following queries all work with the 
indexed word "Leinenhose":


* leinenhosen
* leinenhose
* leinen hose
* leinen hosen

Cheers,

Siegfried Goeschl








Re: Having trouble with German compound words in Solr 4.7

2014-04-22 Thread Alistair
I've managed to solve this (in a quite hacky sort of way) by using filter
queries and the edismax queryparser. 

I added the following parameters to my solrconfig.xml:

<str name="defType">edismax</str>
<str name="mm">75%</str>

Then, when searching for multiple keywords (for example: schwarzkleid wenz,
where wenz is a German brand name), I use the first keyword as the query and
add anything after that as a filter query. So my final query looks
something like this:

fl=id&sort=popular+desc&indent=on&q=keywords:'schwarzkleide'+&wt=json&fq={!edismax}+keywords:'wenz'&fq=deleted:0

My compound splitter filter splits schwarzkleide correctly, and the query is
parsed by edismax with mm=75%. The filter queries are then added; the one on
keywords is also parsed by edismax. The returned result is all the black
dresses from 'Wenz'.
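
A request like this can be assembled programmatically so that each parameter is separated and escaped properly. The following is only a rough Python sketch of the "first keyword as q, the rest as fq" approach described above; `build_query` is a hypothetical helper, the field name and parameter values are taken from the message, and nothing here is part of the Solr PHP library.

```python
from urllib.parse import urlencode, parse_qs


def build_query(first_keyword, other_keywords, field="keywords"):
    """Build a query string: first keyword goes to q, the rest become filter queries."""
    params = [
        ("q", f"{field}:{first_keyword}"),
        ("defType", "edismax"),
        ("mm", "75%"),
        ("fl", "id"),
        ("sort", "popular desc"),
        ("wt", "json"),
        ("fq", "deleted:0"),
    ]
    # One additional fq per remaining keyword, each parsed by edismax.
    params += [("fq", f"{{!edismax}}{field}:{kw}") for kw in other_keywords]
    return urlencode(params)


query_string = build_query("schwarzkleide", ["wenz"])
print(query_string)
print(parse_qs(query_string)["fq"])
```

Using a list of pairs rather than a dict keeps the repeated fq parameter intact.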

If anybody has a better solution to what I've posted I would be more than
happy to read up on it as I'm quite new to Solr and I think my way is a bit
convoluted to be honest.

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4132478.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-21 Thread Alistair
Hi Siegfried,

the debug output shows that the separated keywords get OR'd together, so a
match on either keyword appears in the results. So if I am searching for

*keywords:schwarzkleid*, this gets transformed to *keywords:schwarz
keywords:kleid*, which is equivalent to *keywords:schwarz OR keywords:kleid*.
I need this query to default to *keywords:schwarz AND keywords:kleid*
so that only items that match both keywords appear in my results (in this case
black dresses).
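
The difference between the two operators can be shown with a toy example. This Python sketch uses a made-up inverted index (the document ids are hypothetical) and is not how Solr evaluates queries internally; it only shows why OR returns the larger result set.

```python
# Hypothetical toy inverted index: term -> set of matching document ids.
index = {"schwarz": {1, 2, 3}, "kleid": {2, 3, 4}}


def search(terms, operator="OR"):
    """Combine the posting sets of all terms with the given boolean operator."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


print(search(["schwarz", "kleid"], "OR"))   # every black item and every dress
print(search(["schwarz", "kleid"], "AND"))  # only black dresses
```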

I am pretty confused as to why replacing the default boolean operator is
this difficult :(

Any other suggestions?

Ali



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4132338.html
Sent from the Solr - User mailing list archive at Nabble.com.


Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair
Hello all,

I'm a fairly new Solr user and I need my search function to handle compound
words in German. I've searched through the archives and found that Solr
already has a filter factory made for such words, called
DictionaryCompoundWordTokenFilterFactory. I've already built a list of words
that I want split, and the filter seems to be working correctly in most
cases. The majority of our searches are for clothing items, so let's say
/schwarzkleid/ (black dress) becomes /schwarz/ /kleid/, which is what
I want to happen. However, it seems like the keyword search is done using an
*OR* operator, so I'm seeing items that are either black or dresses, but
I just want to see items that are both. I've also read that changing the
default operator in schema.xml or adding q.op as *AND* in solrconfig.xml
will rectify this issue, but nothing has changed in my query results. It
still uses the *OR* operator.

I've tried using Extended DisMax in my queries, but I am using the Solr PHP
library and I don't think it supports adding DisMax filters to the queries
themselves (if I'm wrong, please correct me). By the way, I am using Zend
Framework 2.0 in the backend and am communicating with Solr through the Solr
PHP library: Solr PHP (http://www.php.net/manual/tr/book.solr.php).

Any suggestions on how to change the operator after my compound word queries
have been split?

Thanks!

Ali



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Jack Krupansky
Make sure your field type has the autoGeneratePhraseQueries="true" attribute 
(default is false). q.op only applies to explicit terms, not to terms which 
decompose into multiple terms. Confusing? Yes!


-- Jack Krupansky




Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair
Hey Jack,

thanks for the reply. I added autoGeneratePhraseQueries=true to the
fieldType and now it's giving me even more results! I'm not sure if the
debug of my query will be helpful but I'll paste it just in case someone
might have an idea. This produces 113524 results, whereas if I manually
enter the query as keyword:schwarz AND keyword:kleid I only get 20283
results (which is the correct one). 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Siegfried Goeschl
Hi Alistair,

quick email before catching my plane. I worked with similar requirements in the 
past, and tuning Solr can be tricky:

* Are you hitting the same Solr query handler (application versus manual 
checking)?
* Turn on debugging for your application's Solr queries so you can see which 
query is actually executed.
* One thing I always do for prototyping is set up the Solritas GUI using 
the same query handler as the application server.

Cheers,

Siegfried Goeschl





Re: Compound words

2013-10-29 Thread Parvesh Garg
Hi Erick,

I tried with expand=true and got exactly the same tokens, i.e. "seabiscuit",
"sea", "bird" at positions 1, 2 and 3 respectively. As per the Solr documentation at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory,
explicit mappings ignore the expand parameter in the schema.

So, the problem of creating compound words at query time remains.


Parvesh Garg
http://www.zettata.com







Compound words

2013-10-28 Thread Parvesh Garg
Hi,

I'm an infant in the Solr/Lucene family, just a couple of months old.

We are trying to find a way to combine words into a single compound word at
index and query time. E.g. if the document has "sea bird" in it, it should
be indexed as "seabird", and any query containing "sea bird" should also look
for "seabird", not only in the qf fields but also in the pf, pf2 and pf3
fields. We are using the edismax query parser.

Our problem is not at index time, where we have achieved this by writing our
own token filter, but at query time. Our token filter takes a dictionary of
prefix,suffix pairs from a file and emits regular and compound
tokens as it encounters them.

We configured our own filter at query time but found that at query time
individual clauses like field:sea, field:bird etc. are created first and
then sent to the analyzer. First of all, can someone please confirm whether
this part of my understanding is correct? As a result, we are forced to emit
sea and bird as individual tokens, because we never see them in sequence.

Is it possible to achieve this by means other than pre-processing the query
before sending it to Solr? Can a CharFilter be used instead? Are CharFilters
applied before the query clauses are created?
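
For comparison, the pre-processing fallback mentioned above could look roughly like the following Python sketch. It assumes the same kind of prefix,suffix dictionary as the custom filter, plain whitespace splitting, and a hypothetical helper name `join_compounds`; it runs on the client before the query is sent and is not a Solr component.

```python
def join_compounds(query, pairs):
    """Join adjacent words into one compound when (prefix, suffix) is in pairs."""
    words = query.split()
    out = []
    i = 0
    while i < len(words):
        has_next = i + 1 < len(words)
        if has_next and (words[i].lower(), words[i + 1].lower()) in pairs:
            out.append(words[i] + words[i + 1])
            i += 2  # both words were consumed by the compound
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


print(join_compounds("sea biscuit sea bird", {("sea", "biscuit")}))
print(join_compounds("sea bird", {("sea", "bird")}))
```

Because the joining happens on the raw query string, the parser then sees the compound as a single word in qf as well as in the phrase fields.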

I can keep providing more details as necessary. This mail has already
crossed TL;DR limits for many :)

Parvesh Garg
http://www.zettata.com


Re: Compound words

2013-10-28 Thread Parvesh Garg
One more thing, Is there a way to remove my accidentally sent phone number
in the signature from the previous mail? aarrrggghhh


Re: Compound words

2013-10-28 Thread Erick Erickson
Why did you reject using synonyms? You can have multi-word
synonyms just fine at index time, and at query time, since the
multiple words are already substituted in the index you don't
need to do the same substitution, just query the raw strings.

I freely acknowledge you may have very good reasons for doing
this yourself, I'm just making sure you know what's already
there.

See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Look particularly at the explanations for sea biscuit in that section.

Best,
Erick






Re: Compound words

2013-10-28 Thread Parvesh Garg
Hi Erick,

Thanks for the suggestion. Like I said, I'm an infant.

We tried synonyms both ways, "sea biscuit => seabiscuit" and "seabiscuit =>
sea biscuit", and didn't understand exactly how it worked. But I just checked
the analysis tool, and it seems to work perfectly fine at index time. Now
I can happily discard my own filter and 4 days of work. I'm happy I got to
know a few ways on how/when not to write a Solr filter :)

I tried the string "sea biscuit sea bird" with expand=false and the tokens
I got were "seabiscuit", "sea", "bird" at positions 1, 2 and 3 respectively.
But at query time, when I enter the same string "sea biscuit sea bird", using
edismax with qf, pf2 and pf3, the parsedQuery looks like this:

+((text:sea) (text:biscuit) (text:sea) (text:bird)) ((text:"biscuit sea")
(text:"sea bird")) ((text:"seabiscuit sea") (text:"biscuit sea
bird"))

What I wanted instead was this:

+((text:seabiscuit) (text:sea) (text:bird)) ((text:"seabiscuit sea")
(text:"sea bird")) (text:"seabiscuit sea bird")
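
For context, the pf2 and pf3 clauses are built from word shingles of the query string, i.e. every run of two or three adjacent words. A minimal Python sketch of that step, assuming plain whitespace splitting and no analysis (the helper name `word_shingles` is hypothetical, and this is not Solr's code):

```python
def word_shingles(query, size):
    """Return every run of `size` adjacent words in the query, in order."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(word_shingles("sea biscuit sea bird", 2))
print(word_shingles("sea biscuit sea bird", 3))
```

Each shingle is then analyzed separately as a phrase, so a multi-word synonym only applies where the whole phrase falls inside one shingle.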

Looks like there isn't any other way than to pre-process query myself and
create the compound word. What do you mean by just query the raw string?
Am I still missing something?

Parvesh Garg
http://www.zettata.com
(This time I did remove my phone number :) )

 



Re: Compound words

2013-10-28 Thread Erick Erickson
Consider setting expand=true at index time. That
puts all the tokens in your index, and then you
may not need to have any synonym
processing at query time since all the variants will
already be in the index.

As it is, you've replaced the words in the original with
synonyms, essentially collapsed them down to a single
word and then you have to do something at query time
to get matches. If all the variants are in the index, you
shouldn't have to. That's what I meant by raw.
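
The effect of the expand setting can be sketched in Python. This is a simplified model of how one line of a synonyms file is interpreted, following the SynonymFilterFactory wiki description; `parse_synonym_line` is a hypothetical helper, not a Solr API, and it ignores tokenization and positions.

```python
def parse_synonym_line(line, expand=True):
    """Map each input phrase on the line to the phrases that end up in the index."""
    if "=>" in line:
        # Explicit mapping: the right-hand side is used as-is; expand is ignored.
        left, right = line.split("=>", 1)
        inputs = [p.strip() for p in left.split(",")]
        outputs = [p.strip() for p in right.split(",")]
    else:
        # Equivalence list: expand=True keeps all variants, False keeps the first.
        inputs = [p.strip() for p in line.split(",")]
        outputs = inputs if expand else inputs[:1]
    return {phrase: list(outputs) for phrase in inputs}


print(parse_synonym_line("sea biscuit, seabiscuit", expand=True))
print(parse_synonym_line("sea biscuit, seabiscuit", expand=False))
print(parse_synonym_line("sea biscuit => seabiscuit", expand=True))
```

Note that an explicit "=>" rule produces the same mapping regardless of expand, so only comma-separated equivalence lists benefit from expand=true.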

Best,
Erick





Re: Compound words

2013-10-28 Thread Roman Chyla
Hi Parvesh,
I think you should check the following jira
https://issues.apache.org/jira/browse/SOLR-5379. You will find there links
to other possible solutions/problems:-)
Roman



Re: Compound words

2013-10-28 Thread Parvesh Garg
Hi Roman, thanks for the link, will go through it.

Erick, will try with expand=true once and check out the results. Will
update this thread with the findings. I remember we rejected expand=true
because of some weird spaghetti problem. Will check it out again.

Thanks,

Parvesh Garg
http://www.zettata.com





Re: Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words

2008-07-23 Thread Chris Hostetter

FYI: In general we try to make sure that, whenever possible, we have a 
Factory for any TokenFilter or Tokenizer that ships with Lucene-Core or the 
Lucene Analysis contrib. We have a stub-analysis-factory-maker.pl 
script that automates this in most cases and requires a small amount of 
coding for others. But in some cases there is no easy way to create a 
generic factory for a TokenFilter. HyphenationCompoundWordTokenFilter is 
an example of this, because it requires a HyphenationTree to construct it, 
and HyphenationTree is a fairly complicated class that didn't lend itself 
to an easy XML configuration for construction.

But if you have a specific HyphenationTree instance you want to use, you 
can hardcode that into a custom TokenFilterFactory.

*BUT* before you do that, consider whether or not the 
DictionaryCompoundWordTokenFilter will meet your needs -- there is already 
a Solr Factory checked in for that.

: See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
: 
: Essentially, you need to create a TokenFilterFactory that wraps it.  Please
: feel free to donate it, too, if you can.
: 
: -Grant
: 
: On Jul 23, 2008, at 2:42 PM, Barry Harding wrote:
: 
:  Hi, can anybody point me in the right direction on how to go about adding
:  the
:  
:  org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
:  
:  token filter to the Solr schema.xml?
:  
:  I need to be able to break German compound words, and from what I have
:  read this token filter would seem to be what I need to use. My question
:  is how do I configure Solr to use this filter for text field types.
:  
:  Is it possible to just call it directly from the config file, or do I
:  need to wrap it in a custom class in some way?
:  
:  Thanks
:  
:  Barry H
: 
: --
: Grant Ingersoll
: http://www.lucidimagination.com
: 
: Lucene Helpful Hints:
: http://wiki.apache.org/lucene-java/BasicsOfPerformance
: http://wiki.apache.org/lucene-java/LuceneFAQ



-Hoss