subject:"\"Compound words\""

[search > edismax] compound words different result issue

2019-02-11 Thread 유정인

Hi 

I use 'edismax'. 

Our main language uses compound words.

There is an issue here. 

For example, assume that 'ab' is analyzed into the tokens 'a' and 'b'. 

The results are different when searching with 'ab' and 'a b'. 

I want to get the same result as searching 'a b' when searching 'ab'.

Is there a way?
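
One client-side option is to split known compounds before the query is sent, so that 'ab' and 'a b' reach Solr as the same string. The following is only a minimal sketch: the `COMPOUNDS` table and the `normalize_query` helper are hypothetical names, and the table would have to mirror whatever the index-time analyzer actually does.

```python
# Hypothetical decompounding table: compound -> its parts.
# In a real setup this would mirror the index-time analysis.
COMPOUNDS = {"ab": ["a", "b"]}


def normalize_query(query: str) -> str:
    """Replace every known compound in the query with its space-separated parts."""
    parts = []
    for word in query.split():
        parts.extend(COMPOUNDS.get(word.lower(), [word]))
    return " ".join(parts)


# Both spellings now produce the same string to send as q.
print(normalize_query("ab"))   # a b
print(normalize_query("a b"))  # a b
```

Whether the two queries then score identically still depends on the parser settings (mm, pf, and so on), which this sketch does not touch.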

Re: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

2015-04-07 Thread Mike L.

Typo:   *even when the user delimits with a space. (e.g. base ball should find 
baseball). 

Thanks,

DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

2015-04-07 Thread Mike L.


Solr User Group -

   I have a case where I need to be able to search against compound words, even 
when the user delimits with a space. (e.g. baseball => base ball).  I think 
I've solved this by creating a compound-words dictionary file containing the 
split words that I would want DictionaryCompoundWordTokenFilterFactory to split.
    base
    ball
I also added the following rule to the synonym file: baseball => base ball 
(to allow baseball to also get a hit).

Two questions - If I could figure out in advance all the compound words I would 
want to split, would it be better (more reliable results) for me to maintain 
this compound-words file, or would it be better to throw one of those 
OpenOffice dictionaries at the filter?
Also - Any better suggestions for dealing with this problem vs the one I 
described using both the dictionary filter and the synonym rule?
Thanks in advance!
Mike
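
To make the dictionary approach concrete, here is a rough Python sketch of what a dictionary-based decompounder does with the two-entry file above. It is a simplification, not the Lucene implementation: the `decompound` function and its `min_subword` parameter are illustrative names, and the real filter has further options such as minimum word size and longest-match handling.

```python
def decompound(token, dictionary, min_subword=2):
    """Return the token followed by every dictionary word found inside it."""
    lowered = token.lower()
    found = []
    for start in range(len(lowered)):
        for end in range(start + min_subword, len(lowered) + 1):
            piece = lowered[start:end]
            # Keep proper subwords only; the original token is emitted anyway.
            if piece in dictionary and piece != lowered:
                found.append(piece)
    return [token] + found


dictionary = {"base", "ball"}
print(decompound("baseball", dictionary))  # ['baseball', 'base', 'ball']
```

The sketch also shows the trade-off behind the first question: a small hand-maintained dictionary only splits what is listed in it, while a large general dictionary will also match short words that happen to occur inside unrelated tokens.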

Re: Having trouble with German compound words in Solr 4.7

2014-04-24 Thread Siegfried Goeschl


Hi Alistair,

it seems that there are many ways to skin the cat so I describe the 
approach I used with SOLR 3.6 :-)


* Using a patched DictionaryCompoundWordTokenFilterFactory in the 
"index" phase - so the German compound noun "Leinenhose" (linen 
trousers) would be indexed in addition to "Leinen" & "Hose". Afterwards 
the three tokens go through stemming.


* One hint which might be useful - I only split words which I consider 
proper German compound nouns. E.g. if your indexed text contains the 
token "schwarzkleid" I would NOT split it since it is NOT a proper noun 
- the proper noun would be "Schwarzkleid" - please note that even 
"Schwarzkleid" is not a proper German noun anyway :-)


* I use a custom dictionary for splitting consisting of 7,000 entries 
which contains a lot of customer-specific entries


I do not tinker with DictionaryCompoundWordTokenFilterFactory in the 
"query" phase of the field so the following queries would work with the 
indexed word "Leinenhose"


* "leinenhosen"
* "leinenhose"
* "leinen hose"
* "leinen hosen"

Cheers,

Siegfried Goeschl
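
The index-only decompounding described above can be modelled in a few lines. This is a toy sketch under stated assumptions: `toy_stem` is a crude stand-in for a real German stemmer, `DICTIONARY` is a made-up two-entry excerpt, and a query is taken to match only when all of its terms are in the index.

```python
DICTIONARY = {"leinen", "hose"}  # hypothetical excerpt of a custom dictionary


def toy_stem(token):
    """Strip one common German ending; a stand-in for a real stemmer."""
    for suffix in ("en", "n", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Index phase: keep each token, add dictionary subwords, then stem."""
    terms = set()
    for token in text.lower().split():
        terms.add(toy_stem(token))
        for word in DICTIONARY:
            if word in token and word != token:
                terms.add(toy_stem(word))
    return terms


def query_terms(query):
    """Query phase: no decompounding, only lowercasing and stemming."""
    return {toy_stem(token) for token in query.lower().split()}


indexed = index_terms("Leinenhose")
for q in ("leinenhosen", "leinenhose", "leinen hose", "leinen hosen"):
    print(q, query_terms(q) <= indexed)  # True for all four queries
```

Because the compound and its parts are all in the index, the query side needs no splitting at all, which is the point of leaving the filter out of the "query" phase.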




Re: Having trouble with German compound words in Solr 4.7

2014-04-22 Thread Alistair

I've managed to solve this (in a quite hacky sort of way) by using filter
queries and the edismax queryparser. 

I added in my solrconfig.xml the following parameters:

defType=edismax
mm=75%

Then when searching for multiple keywords (for example: schwarzkleid wenz,
where wenz is a german brand name), I use the first keyword as a query and
anything after that I add as a filterquery. So my final query looks
something like this:

   
fl=id&sort=popular+desc&indent=on&q=keywords:'schwarzkleide'+&wt=json&fq={!edismax}+keywords:'wenz'&fq=deleted:0

My compound splitter filter splits schwarzkleide correctly and it is parsed
as edismax with mm=75%, then the filterqueries are added, for keywords they
are also parsed as edismax. The returned result is all the black dresses
from 'Wenz'. 

If anybody has a better solution to what I've posted I would be more than
happy to read up on it as I'm quite new to Solr and I think my way is a bit
convoluted to be honest.

Thanks.
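
For readers who want to reproduce the request described above, this is a minimal sketch of how the parameters could be assembled in Python. `build_params` is a hypothetical helper, the field name `keywords` and the `deleted:0` filter are taken from the message, and the edismax and mm settings are assumed to stay in solrconfig.xml rather than being sent per request.

```python
from urllib.parse import urlencode


def build_params(user_input, field="keywords"):
    """First keyword becomes q; each remaining keyword becomes its own fq."""
    words = user_input.split()
    if not words:
        raise ValueError("empty query")
    first, rest = words[0], words[1:]
    params = [
        ("q", f"{field}:{first}"),
        ("fl", "id"),
        ("sort", "popular desc"),
        ("wt", "json"),
    ]
    # Every further keyword is a filter query parsed with edismax.
    params += [("fq", f"{{!edismax}}{field}:{word}") for word in rest]
    params.append(("fq", "deleted:0"))
    return params


print(urlencode(build_params("schwarzkleid wenz")))
```

Note that filter queries do not contribute to scoring, so with this layout only the first keyword influences relevance ranking.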



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4132478.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Having trouble with German compound words in Solr 4.7

2014-04-21 Thread Alistair

Hi Siegfried,

the debug shows that the separated keywords get OR'd together, so a match on
either keyword appears in the results. So if I am searching for:

*keywords:schwarzkleid*, this gets transformed to *keywords:schwarz
keywords:kleid*, which is equivalent to *keywords:schwarz OR keywords:kleid*.
I need this query to default to *keywords:schwarz AND keywords:kleid*
so only items that match both keywords appear in my results (in this case
black dresses).

I am pretty confused as to why replacing the default boolean operator is
this difficult :(

Any other suggestions?

Ali
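
The difference between the two behaviours is simply union versus intersection of the posting lists. A toy illustration, with a made-up index and a hypothetical `search` helper:

```python
# Toy inverted index: term -> set of document ids (all data made up).
INDEX = {
    "schwarz": {1, 2, 3},  # black items
    "kleid": {3, 4},       # dresses
}


def search(terms, operator="OR"):
    """Combine posting lists with OR (union) or AND (intersection)."""
    postings = [INDEX.get(t, set()) for t in terms]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


print(sorted(search(["schwarz", "kleid"], "OR")))   # [1, 2, 3, 4]
print(sorted(search(["schwarz", "kleid"], "AND")))  # [3]
```

This is why the OR form returns far more hits than the hand-written AND query: every black item and every dress qualifies, not only black dresses.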




Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Siegfried Goeschl

Hi Alistair,

quick email before getting my plane - I worked with similar requirements in the 
past and tuning SOLR can be tricky

* are you hitting the same SOLR query handler (application versus manual 
checking)?
* turn on debugging for your application SOLR queries so you see what query is 
actually executed
* one thing I always do for prototyping is setting up the Solritas GUI using 
the same query handler as the application server

Cheers,

Siegfried Goeschl



Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair

Hey Jack,

thanks for the reply. I added autoGeneratePhraseQueries="true" to the
fieldType and now it's giving me even more results! I'm not sure if the
debug of my query will be helpful but I'll paste it just in case someone
might have an idea. This produces 113524 results, whereas if I manually
enter the query as keyword:schwarz AND keyword:kleid I only get 20283
results (which is the correct one). 






Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Jack Krupansky

Make sure your field type has the autoGeneratePhraseQueries="true" attribute 
(default is false). q.op only applies to explicit terms, not to terms which 
decompose into multiple terms. Confusing? Yes!


-- Jack Krupansky


Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair

Hello all,

I'm a fairly new Solr user and I need my search function to handle compound
words in German. I've searched through the archives and found that Solr
already has a Filter Factory made for such words called
DictionaryCompoundWordTokenFilterFactory. I've already built a list of words
that I want split, and it seems like the filter is working correctly in most
cases, the majority of our searches are clothing items so let's say
"/schwarzkleid/" (black dress) becomes "/schwarz/" "/kleid/", which is what
I want to happen. However, it seems like the keyword search is done using an
*OR* operator. So I'm seeing items that are either black or are dresses but
I just want to see items that are both. I've also read that changing the
default operator in schema.xml or adding q.op as *AND* in the solrconfig.xml
will rectify this issue, but nothing has changed in my query results. It
still uses the *OR* operator.
I've tried using Extended dismax in my queries but I am using the Solr PHP
library and I don't think it supports adding Dismax filters to the queries
themselves (if I'm wrong, please correct me). By the way, I am using Zend
Framework 2.0 in the backend and am communicating with Solr through the Solr
PHP library: Solr PHP <http://www.php.net/manual/tr/book.solr.php> .

Any suggestions on how to change the operator after my compound word queries
have been split?

Thanks!

Ali


Re: Compound words

2013-10-29 Thread Parvesh Garg

Hi Erick,

I tried with expand=true and got exactly the same tokens, i.e. seabiscuit,
sea, bird at positions 1, 2 and 3 respectively. As per the Solr documentation at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory,
explicit mappings ignore the expand parameter in the schema.

So, the problem of creating compound words at query time remains.


Parvesh Garg
http://www.zettata.com



Re: Compound words

2013-10-28 Thread Parvesh Garg

Hi Roman, thanks for the link, will go through it.

Erick, will try with expand=true once and check out the results. Will
update this thread with the findings. I remember we rejected expand=true
because of some weird spaghetti problem. Will check it out again.

Thanks,

Parvesh Garg
http://www.zettata.com



Re: Compound words

2013-10-28 Thread Roman Chyla

Hi Parvesh,
I think you should check the following jira
https://issues.apache.org/jira/browse/SOLR-5379. You will find there links
to other possible solutions/problems:-)
Roman

Re: Compound words

2013-10-28 Thread Erick Erickson

Consider setting expand=true at index time. That
puts all the tokens in your index, and then you
may not need to have any synonym
processing at query time since all the variants will
already be in the index.

As it is, you've replaced the words in the original with
synonyms, essentially collapsed them down to a single
word and then you have to do something at query time
to get matches. If all the variants are in the index, you
shouldn't have to. That's what I meant by "raw".

Best,
Erick
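
A toy model of the suggestion, to show what "all the variants in the index" means. It is a sketch under assumptions: `GROUPS` stands for one comma-separated equivalence line in a synonyms file (`sea biscuit, seabiscuit`), not an explicit `=>` mapping; `index_terms` is a hypothetical helper; and token positions, which the real filter tracks, are ignored.

```python
# One equivalence group, as it might appear on a single synonyms-file line:
#   sea biscuit, seabiscuit
GROUPS = [[("sea", "biscuit"), ("seabiscuit",)]]


def index_terms(tokens, expand=True):
    """Return the set of terms indexed for a token list (positions ignored)."""
    terms, i = set(), 0
    while i < len(tokens):
        for group in GROUPS:
            match = next((v for v in group if tuple(tokens[i:i + len(v)]) == v), None)
            if match:
                # expand=True keeps every variant; expand=False keeps only the first.
                kept = group if expand else group[:1]
                terms.update(word for variant in kept for word in variant)
                i += len(match)
                break
        else:
            # No synonym group matched at this position: index the token as is.
            terms.add(tokens[i])
            i += 1
    return terms


print(sorted(index_terms("sea biscuit sea bird".split())))
# ['bird', 'biscuit', 'sea', 'seabiscuit']
```

With every variant indexed, a plain query for either "seabiscuit" or "sea biscuit" finds the document without any synonym handling on the query side.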



Re: Compound words

2013-10-28 Thread Parvesh Garg

Hi Erick,

Thanks for the suggestion. Like I said, I'm an infant.

We tried synonyms both ways. sea biscuit => seabiscuit and seabiscuit =>
sea biscuit and didn't understand exactly how it worked. But I just checked
the analysis tool, and it seems to work perfectly fine at index time. Now,
I can happily discard my own filter and 4 days of work. I'm happy I got to
know a few ways on how/when not to write a solr filter :)

I tried the string "sea biscuit sea bird" with expand=false and the tokens
I got were seabiscuit, sea, bird at positions 1, 2 and 3 respectively. But at
query time, when I enter the same term "sea biscuit sea bird", using
edismax and qf, pf2, and pf3, the parsedQuery looks like this:

+((text:sea) (text:biscuit) (text:sea) (text:bird)) ((text:\"biscuit sea\")
(text:\"sea bird\")) ((text:\"seabiscuit sea\") (text:\"biscuit sea
bird\"))"

What I wanted instead was this

"+((text:seabiscuit) (text:sea) (text:bird)) ((text:\"seabiscuit sea\")
(text:\"sea bird\")) (text:\"seabiscuit sea bird\")"

Looks like there isn't any other way than to pre-process query myself and
create the compound word. What do you mean by "just query the raw string"?
Am I still missing something?

Parvesh Garg
http://www.zettata.com
(This time I did remove my phone number :) )
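
The pre-processing option mentioned above could look roughly like this. It is only a sketch: `PAIRS` is a hypothetical excerpt of a "prefix,suffix" dictionary and `join_compounds` is an illustrative name, applied to the raw user input before it is sent as q.

```python
# Hypothetical "prefix,suffix" dictionary entries.
PAIRS = {("sea", "biscuit")}


def join_compounds(query):
    """Merge adjacent words that form a known compound before querying Solr."""
    words, out, i = query.split(), [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i].lower(), words[i + 1].lower()) in PAIRS:
            out.append(words[i] + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


print(join_compounds("sea biscuit sea bird"))  # seabiscuit sea bird
```

Since the parser then receives "seabiscuit" as a single word, the qf clauses and the pf2/pf3 phrases are built from seabiscuit, sea and bird, which is the shape of the wanted query.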


Re: Compound words

2013-10-28 Thread Erick Erickson

Why did you reject using synonyms? You can have multi-word
synonyms just fine at index time, and at query time, since the
multiple words are already substituted in the index you don't
need to do the same substitution, just query the raw strings.

I freely acknowledge you may have very good reasons for doing
this yourself, I'm just making sure you know what's already
there.

See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Look particularly at the explanations for "sea biscuit" in that section.

Best,
Erick


Re: Compound words

2013-10-28 Thread Parvesh Garg

One more thing, Is there a way to remove my "accidentally sent phone number
in the signature" from the previous mail? aarrrggghhh

Compound words

2013-10-28 Thread Parvesh Garg

Hi,

I'm an infant in Solr/Lucene family, just a couple of months old.

We are trying to find a way to combine words into a single compound word at
index and query time. E.g. if the document has "sea bird" in it, it should
be indexed as seabird and any query having sea bird in it should also look
for seabird not only in qf but also in pf, pf2, pf3 fields. Well, we are
using edismax query parser.

Our problem is not at index time, we have achieved it by writing our own
token filter, but at query time. Our token filter takes a dictionary in the
form of "prefix,suffix" in the file and keeps emitting regular and compound
tokens as it encounters them.

We configured our own filter at query time but figured that at query time
individual clauses like field:sea , field:bird etc are created first and
then sent to the analyzer. First of all, can someone please confirm if this
part of my understanding is correct? So, we are forced to emit sea and bird
as individual tokens because we are not getting them in sequence at all.

Is it possible to achieve this by other means than pre-processing query
before sending it to solr? Can a CharFilter be used instead, are they
applied before creating query clauses?

I can keep providing more details as necessary. This mail has already
crossed TL;DR limits for many :)

Parvesh Garg
http://www.zettata.com
+91 963 222 5540

Re: Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words

2008-07-23 Thread Chris Hostetter


FYI: In general we try to make sure that, whenever possible, we have a 
Factory for any TokenFilter or Tokenizer that ships with Lucene-Core or the 
Lucene Analysis contrib ... we have a stub-analysis-factory-maker.pl 
script that automates this in most cases, and requires a small amount of 
coding for others -- but in some cases there is no easy way to create a 
"generic" factory for a TokenFilter. HyphenationCompoundWordTokenFilter is 
an example of this because it requires a HyphenationTree to construct it, 
and HyphenationTree is a fairly complicated class that didn't lend itself 
to an easy XML configuration for construction.

But if you have a specific HyphenationTree instance you want to use, you 
can hardcode that into a custom TokenFilterFactory.

*BUT* before you do that, consider whether or not the 
DictionaryCompoundWordTokenFilter will meet your needs -- there is already 
a Solr Factory checked in for that.
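
[Editor's illustration] As a rough idea of what dictionary-based
decomposition does, here is a simplified sketch: keep the original token and
also emit every dictionary word found inside it. This is not Lucene's
implementation; the function name decompose, the parameter names and their
default values are illustrative assumptions.

```python
def decompose(token: str, dictionary: set[str], min_word_size: int = 5,
              min_subword_size: int = 2, max_subword_size: int = 15) -> list[str]:
    """Return the original token followed by every dictionary word found in it."""
    result = [token]
    if len(token) < min_word_size:
        return result  # too short to be treated as a compound
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        longest = min(max_subword_size, len(lowered) - start)
        for length in range(min_subword_size, longest + 1):
            part = lowered[start:start + length]
            if part in dictionary:
                result.append(part)
    return result


# Hypothetical three-word German dictionary.
print(decompose("Donaudampfschiff", {"donau", "dampf", "schiff"}))
# → ['Donaudampfschiff', 'donau', 'dampf', 'schiff']
```

The quality of the result depends entirely on the word list, which is the
trade-off against a hyphenation-based approach.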

: See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
: 
: Essentially, you need to create a TokenFilterFactory that wraps it.  Please
: feel free to donate it, too, if you can.
: 
: -Grant
: 
: On Jul 23, 2008, at 2:42 PM, Barry Harding wrote:
: 
: > Hi can anybody point me in the right direction in how I go about adding
: > the
: > 
: > org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
: > 
: > Token filter to the solr schema.xml.
: > 
: > 
: > 
: > 
: > 
: > I need to be able to break German compound words, and from what I have
: > read this Token filter would seem to be what I need to use, my question
: > is how do I configure SOLR to use this filter text field types.
: > 
: > 
: > 
: > Is it possible to just call it directly from the config file or do I
: > need to wrap it in a custom class in some way
: > 
: > 
: > 
: > Thanks
: > 
: > 
: > 
: > Barry H
: > 
: > 
: > 
: > Misco is a division of Systemax Europe Ltd.  Registered in Scotland Number
: > 114143.  Registered Office: Caledonian Exchange, 19a Canning Street,
: > Edinburgh EH3 8EG.  Telephone +44 (0)1933 686000.
: 
: --
: Grant Ingersoll
: http://www.lucidimagination.com
: 
: Lucene Helpful Hints:
: http://wiki.apache.org/lucene-java/BasicsOfPerformance
: http://wiki.apache.org/lucene-java/LuceneFAQ
: 



-Hoss

Re: Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words

2008-07-23 Thread Grant Ingersoll


See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Essentially, you need to create a TokenFilterFactory that wraps it.   
Please feel free to donate it, too, if you can.


-Grant

On Jul 23, 2008, at 2:42 PM, Barry Harding wrote:

> Hi, can anybody point me in the right direction on how to go about adding
> the
>
> org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
>
> token filter to the Solr schema.xml?
>
> I need to be able to break German compound words, and from what I have
> read this token filter would seem to be what I need to use. My question
> is how do I configure Solr to use this filter in text field types.
>
> Is it possible to just call it directly from the config file, or do I
> need to wrap it in a custom class in some way?
>
> Thanks
>
> Barry H
>
> Misco is a division of Systemax Europe Ltd.  Registered in Scotland
> Number 114143.  Registered Office: Caledonian Exchange, 19a Canning
> Street, Edinburgh EH3 8EG.  Telephone +44 (0)1933 686000.


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words

2008-07-23 Thread Barry Harding

Hi, can anybody point me in the right direction on how to go about adding
the

org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter

token filter to the Solr schema.xml?

I need to be able to break German compound words, and from what I have
read this token filter would seem to be what I need to use. My question
is how do I configure Solr to use this filter in text field types.

Is it possible to just call it directly from the config file, or do I
need to wrap it in a custom class in some way?

 

Thanks

 

Barry H



Misco is a division of Systemax Europe Ltd.  Registered in Scotland Number 
114143.  Registered Office: Caledonian Exchange, 19a Canning Street, Edinburgh 
EH3 8EG.  Telephone +44 (0)1933 686000.

