[search > edismax] compound words different result issue
Hi, I use the edismax query parser. Our main language uses compound words, and this causes a problem. For example, assume that 'ab' is analyzed into 'a' and 'b'. Searching for 'ab' and searching for 'a b' return different results. I want a search for 'ab' to return the same results as a search for 'a b'. Is there a way to do this?
Re: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File
Typo: *even when the user delimits with a space. (e.g. base ball should find baseball). Thanks, From: Mike L. To: "solr-user@lucene.apache.org" Sent: Tuesday, April 7, 2015 9:05 AM Subject: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File
DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File
Solr User Group - I have a case where I need to be able to search against compound words, even when the user delimits with a space (e.g. baseball => base ball). I think I've solved this by creating a compound-words dictionary file containing the split words that I want DictionaryCompoundWordTokenFilterFactory to split out: base \n ball
I also added the following rule to the synonym file: baseball => base ball (to allow baseball to also get a hit).
Two questions:
* If I could figure out in advance all the compound words I want to split, would it be better (more reliable results) to maintain this compound-words file myself, or to throw one of those OpenOffice dictionaries at the filter?
* Are there any better ways of dealing with this problem than the one I described, which uses both the dictionary filter and the synonym rule?
Thanks in advance! Mike
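To make the dictionary idea above concrete, here is a minimal Python sketch of dictionary-based decompounding. It is a simplified illustration and not Lucene's actual implementation: the function name `decompound` and the `min_subword_size` threshold are assumptions made for this example. It keeps the original token and adds every dictionary word found inside it, which is why a small hand-maintained file (base, ball) only splits the words you listed, while a large general dictionary may also produce splits you did not intend.

```python
def decompound(token, dictionary, min_subword_size=2):
    """Return the original token followed by every dictionary word found inside it."""
    token_lower = token.lower()
    parts = [token_lower]  # keep the original so "baseball" still matches itself
    for start in range(len(token_lower)):
        for end in range(start + min_subword_size, len(token_lower) + 1):
            candidate = token_lower[start:end]
            # Skip the whole token itself; only proper sub-words are added.
            if candidate in dictionary and candidate != token_lower:
                parts.append(candidate)
    return parts


dictionary = {"base", "ball"}  # hypothetical contents of the compound-words file
print(decompound("baseball", dictionary))  # ['baseball', 'base', 'ball']
print(decompound("glove", dictionary))     # ['glove']
```

Because the sub-words are emitted alongside the original token, a query for either "baseball" or "base ball" can find a document containing "baseball", provided the query side combines the words the way you expect.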
Re: Having trouble with German compound words in Solr 4.7
Hi Alistair, it seems that there are many ways to skin the cat, so I will describe the approach I used with SOLR 3.6 :-)
* I use a patched DictionaryCompoundWordTokenFilterFactory in the "index" phase, so the German compound noun "Leinenhose" (linen trousers) is indexed in addition to "Leinen" & "Hose". Afterwards the three tokens go through stemming.
* One hint which might be useful: I only split words which I consider proper German compound nouns. E.g. if your indexed text contains the token "schwarzkleid" I would NOT split it, since it is NOT a proper noun. The proper noun would be "Schwarzkleid" - please note that even "Schwarzkleid" is not a proper German noun anyway :-)
* I use a custom dictionary for splitting, consisting of 7,000 entries, which contains a lot of customer-specific entries.
I do not tinker with DictionaryCompoundWordTokenFilterFactory in the "query" phase of the field, so the following queries would work with the indexed word "Leinenhose":
* "leinenhosen"
* "leinenhose"
* "leinen hose"
* "leinen hosen"
Cheers, Siegfried Goeschl
(In reply to Alistair, 22.04.14 12:13.)
Re: Having trouble with German compound words in Solr 4.7
I've managed to solve this (in a quite hacky sort of way) by using filter queries and the edismax query parser. In my solrconfig.xml I set edismax as the query parser and mm to 75%. Then, when searching for multiple keywords (for example: schwarzkleid wenz, where wenz is a German brand name), I use the first keyword as the query and add anything after that as a filter query. So my final query looks something like this: fl=id&sort=popular+desc&indent=on&q=keywords:'schwarzkleide'+&wt=json&fq={!edismax}+keywords:'wenz'&fq=deleted:0 My compound splitter filter splits schwarzkleide correctly and the query is parsed as edismax with mm=75%. Then the filter queries are added; the ones for keywords are also parsed as edismax. The returned result is all the black dresses from 'Wenz'. If anybody has a better solution than what I've posted, I would be more than happy to read up on it, as I'm quite new to Solr and I think my way is a bit convoluted, to be honest. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4132478.html Sent from the Solr - User mailing list archive at Nabble.com.
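For readers who want to reproduce the query-plus-filter-query split described above, here is a small Python sketch that assembles such a request string. It is only an illustration under stated assumptions: the helper name `build_query` is hypothetical, the field names (`keywords`, `deleted`, `popular`) are taken from the query in the message, and the quoting of the keyword values is simplified.

```python
from urllib.parse import urlencode


def build_query(first_keyword, other_keywords):
    """Put the first keyword in q and every further keyword in its own fq."""
    params = [
        ("q", f"keywords:{first_keyword}"),
        ("fl", "id"),
        ("sort", "popular desc"),
        ("wt", "json"),
    ]
    for keyword in other_keywords:
        # {!edismax} asks Solr to parse this filter query with edismax as well.
        params.append(("fq", f"{{!edismax}}keywords:{keyword}"))
    params.append(("fq", "deleted:0"))
    return urlencode(params)  # repeated keys such as fq are preserved in order


query_string = build_query("schwarzkleide", ["wenz"])
print(query_string)
```

Passing the parameters as a list of pairs rather than a dict matters here, because `fq` appears more than once.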
Re: Having trouble with German compound words in Solr 4.7
Hi Siegfried, the debug shows that the separated keywords get OR'd together, so a match on either keyword appears in the results. If I search for keywords:schwarzkleid, it gets transformed to keywords:schwarz keywords:kleid, which is equivalent to keywords:schwarz OR keywords:kleid. I need this query to default to keywords:schwarz AND keywords:kleid, so that only items matching both keywords appear in my results (in this case, black dresses). I am pretty confused as to why replacing the default boolean operator is this difficult :( Any other suggestions? Ali -- View this message in context: http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4132338.html
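The difference between the two behaviours can be shown with a toy inverted index in Python. This is not how Solr executes queries internally; the document ids and the `search` helper are made up for illustration, and the point is only that OR returns the union of the posting lists while AND returns their intersection.

```python
# Toy inverted index: token -> set of document ids (hypothetical data).
index = {
    "schwarz": {1, 2, 3},  # black items
    "kleid": {2, 3, 4},    # dresses
}


def search(tokens, operator="OR"):
    """Combine the posting lists of all tokens with OR (union) or AND (intersection)."""
    postings = [index.get(token, set()) for token in tokens]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


tokens = ["schwarz", "kleid"]  # what the compound filter produces for "schwarzkleid"
print(sorted(search(tokens, "OR")))   # [1, 2, 3, 4]
print(sorted(search(tokens, "AND")))  # [2, 3]
```

The OR result contains every black item and every dress, which mirrors the inflated result counts reported in this thread; only the AND result is limited to black dresses.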
Re: Having trouble with German compound words in Solr 4.7
Hi Alistair, quick email before catching my plane - I worked with similar requirements in the past, and tuning SOLR can be tricky.
* Are you hitting the same SOLR query handler (application versus manual checking)?
* Turn on debugging for your application's SOLR queries so you can see which query is actually executed.
* One thing I always do for prototyping is to set up the Solritas GUI using the same query handler as the application server.
Cheers, Siegfried Goeschl
(In reply to Alistair, 18 Apr 2014, at 06:06.)
Re: Having trouble with German compound words in Solr 4.7
Hey Jack, thanks for the reply. I added autoGeneratePhraseQueries="true" to the fieldType and now it's giving me even more results! I'm not sure if the debug of my query will be helpful but I'll paste it just in case someone might have an idea. This produces 113524 results, whereas if I manually enter the query as keyword:schwarz AND keyword:kleid I only get 20283 results (which is the correct one). -- View this message in context: http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
Re: Having trouble with German compound words in Solr 4.7
Make sure your field type has the autoGeneratePhraseQueries="true" attribute (default is false). q.op only applies to explicit terms, not to terms which decompose into multiple terms. Confusing? Yes! -- Jack Krupansky -Original Message- From: Alistair Sent: Friday, April 18, 2014 6:11 AM To: solr-user@lucene.apache.org Subject: Having trouble with German compound words in Solr 4.7
Having trouble with German compound words in Solr 4.7
Hello all, I'm a fairly new Solr user and I need my search function to handle compound words in German. I've searched through the archives and found that Solr already has a filter factory made for such words, called DictionaryCompoundWordTokenFilterFactory. I've already built a list of words that I want split, and the filter seems to work correctly in most cases. The majority of our searches are for clothing items, so let's say "schwarzkleid" (black dress) becomes "schwarz" "kleid", which is what I want to happen. However, it seems that the keyword search is done using an OR operator, so I'm seeing items that are either black or dresses, but I just want to see items that are both. I've also read that changing the default operator in schema.xml or adding q.op as AND in solrconfig.xml will rectify this issue, but nothing has changed in my query results. It still uses the OR operator. I've tried using Extended DisMax in my queries, but I am using the Solr PHP library and I don't think it supports adding DisMax filters to the queries themselves (if I'm wrong, please correct me). By the way, I am using Zend Framework 2.0 in the backend and am communicating with Solr through the Solr PHP library: Solr PHP <http://www.php.net/manual/tr/book.solr.php>. Any suggestions on how to change the operator after my compound word queries have been split? Thanks! Ali -- View this message in context: http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html
Re: Compound words
Hi Erick, I tried with expand=true and got exactly the same tokens, i.e. seabiscuit, sea, bird at positions 1, 2 and 3 respectively. As per the Solr documentation at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory, explicit mappings ignore the expand parameter in the schema. So the problem of creating compound words at query time remains. Parvesh Garg http://www.zettata.com
(In reply to Parvesh Garg, Tue, Oct 29, 2013 at 2:11 AM.)
Re: Compound words
Hi Roman, thanks for the link, will go through it. Erick, will try with expand=true once and check out the results. Will update this thread with the findings. I remember we rejected expand=true because of some weird spaghetti problem. Will check it out again. Thanks, Parvesh Garg http://www.zettata.com
(In reply to Roman Chyla, Mon, Oct 28, 2013 at 9:01 PM.)
Re: Compound words
Hi Parvesh, I think you should check the following JIRA issue: https://issues.apache.org/jira/browse/SOLR-5379 There you will find links to other possible solutions/problems :-) Roman
(In reply to Erick Erickson, 28 Oct 2013 09:06.)
Re: Compound words
Consider setting expand=true at index time. That puts all the tokens in your index, and then you may not need to have any synonym processing at query time since all the variants will already be in the index. As it is, you've replaced the words in the original with synonyms, essentially collapsed them down to a single word and then you have to do something at query time to get matches. If all the variants are in the index, you shouldn't have to. That's what I meant by "raw". Best, Erick
(In reply to Parvesh Garg, Mon, Oct 28, 2013 at 8:02 AM.)
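To illustrate what expand changes, here is a small Python sketch of how a single synonyms-file line could be interpreted. It follows the behaviour described in the Solr wiki section linked in this thread (comma-separated equivalents honour expand, explicit "=>" mappings ignore it), but it is an illustration only, not Solr's parser; the function name `parse_synonym_rule` is an assumption of this example.

```python
def parse_synonym_rule(rule, expand=True):
    """Turn one synonyms-file line into a {input phrase: [output phrases]} mapping."""
    if "=>" in rule:
        # Explicit mapping: left side is replaced by right side, expand is ignored.
        left, right = rule.split("=>")
        inputs = [phrase.strip() for phrase in left.split(",")]
        outputs = [phrase.strip() for phrase in right.split(",")]
    else:
        # Equivalent synonyms: expand=True keeps every variant,
        # expand=False collapses all variants to the first one.
        inputs = [phrase.strip() for phrase in rule.split(",")]
        outputs = inputs if expand else inputs[:1]
    return {phrase: list(outputs) for phrase in inputs}


print(parse_synonym_rule("sea biscuit, seabiscuit", expand=True))
print(parse_synonym_rule("sea biscuit, seabiscuit", expand=False))
print(parse_synonym_rule("sea biscuit => seabiscuit", expand=True))
```

With the comma-separated form and expand=True, both "sea biscuit" and "seabiscuit" end up in the index, which is the situation in which no query-time synonym processing is needed. With the explicit "=>" form the output is the same whatever expand is set to.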
Re: Compound words
Hi Erick, Thanks for the suggestion. Like I said, I'm an infant. We tried synonyms both ways, sea biscuit => seabiscuit and seabiscuit => sea biscuit, and didn't understand exactly how it worked. But I just checked the analysis tool, and it seems to work perfectly fine at index time. Now I can happily discard my own filter and 4 days of work. I'm happy I got to know a few ways on how/when not to write a Solr filter :)
I tried the string "sea biscuit sea bird" with expand=false and the tokens I got were seabiscuit, sea, bird at positions 1, 2 and 3 respectively. But at query time, when I enter the same term "sea biscuit sea bird", using edismax and qf, pf2, and pf3, the parsedQuery looks like this:
+((text:sea) (text:biscuit) (text:sea) (text:bird)) ((text:\"biscuit sea\") (text:\"sea bird\")) ((text:\"seabiscuit sea\") (text:\"biscuit sea bird\"))"
What I wanted instead was this:
"+((text:seabiscuit) (text:sea) (text:bird)) ((text:\"seabiscuit sea\") (text:\"sea bird\")) (text:\"seabiscuit sea bird\")"
It looks like there isn't any other way than to pre-process the query myself and create the compound word. What do you mean by "just query the raw string"? Am I still missing something?
Parvesh Garg http://www.zettata.com (This time I did remove my phone number :) )
(In reply to Erick Erickson, Mon, Oct 28, 2013 at 4:14 PM.)
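The phrase clauses in the desired parsedQuery follow a simple pattern: pf2 contributes phrases built from every pair of adjacent query words, and pf3 from every run of three. The following Python sketch reproduces just that pattern. It is an illustration, not edismax code, and the helper name `word_ngrams` is an assumption of this example.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a phrase string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["seabiscuit", "sea", "bird"]  # the token stream wanted at query time
print(word_ngrams(tokens, 2))  # ['seabiscuit sea', 'sea bird']
print(word_ngrams(tokens, 3))  # ['seabiscuit sea bird']
```

These phrases match the pf2 and pf3 clauses in the query the poster wanted, which suggests that the phrase clauses come out right only if the words have already been combined into "seabiscuit" before the query parser splits the input into words.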
Re: Compound words
Why did you reject using synonyms? You can have multi-word synonyms just fine at index time, and at query time, since the multiple words are already substituted in the index you don't need to do the same substitution, just query the raw strings. I freely acknowledge you may have very good reasons for doing this yourself, I'm just making sure you know what's already there. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory Look particularly at the explanations for "sea biscuit" in that section. Best, Erick
(In reply to Parvesh Garg, Mon, Oct 28, 2013 at 3:47 AM.)
Re: Compound words
One more thing, Is there a way to remove my "accidentally sent phone number in the signature" from the previous mail? aarrrggghhh
Compound words
Hi, I'm an infant in the Solr/Lucene family, just a couple of months old. We are trying to find a way to combine words into a single compound word at index and query time. E.g. if the document has "sea bird" in it, it should be indexed as seabird, and any query containing sea bird should also look for seabird, not only in qf but also in the pf, pf2 and pf3 fields. We are using the edismax query parser.
Our problem is not at index time, where we have achieved this by writing our own token filter, but at query time. Our token filter takes a dictionary file with entries in the form "prefix,suffix" and keeps emitting regular and compound tokens as it encounters them. We configured our own filter at query time, but found that at query time individual clauses like field:sea, field:bird etc. are created first and only then sent to the analyzer. First of all, can someone please confirm whether this part of my understanding is correct? As a result, we are forced to emit sea and bird as individual tokens, because we never get them in sequence.
Is it possible to achieve this by means other than pre-processing the query before sending it to Solr? Can a CharFilter be used instead? Are CharFilters applied before the query clauses are created?
I can keep providing more details as necessary. This mail has already crossed TL;DR limits for many :)
Parvesh Garg http://www.zettata.com
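The pre-processing option mentioned in this message can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `join_compounds` is hypothetical, the dictionary is modelled as a set of (prefix, suffix) pairs in the spirit of the "prefix,suffix" file described above, and unlike the poster's filter it replaces the two words with the compound instead of emitting both forms.

```python
def join_compounds(query, pairs):
    """Merge adjacent words into one compound when (prefix, suffix) is in the dictionary."""
    words = query.split()
    result = []
    i = 0
    while i < len(words):
        next_pair = (words[i].lower(), words[i + 1].lower()) if i + 1 < len(words) else None
        if next_pair in pairs:
            result.append(words[i] + words[i + 1])  # e.g. "sea" + "bird" -> "seabird"
            i += 2  # both words are consumed
        else:
            result.append(words[i])
            i += 1
    return " ".join(result)


pairs = {("sea", "biscuit")}  # hypothetical dictionary contents
print(join_compounds("sea biscuit sea bird", pairs))  # seabiscuit sea bird
```

Running this on the client before the query reaches Solr means the query parser sees "seabiscuit" as a single word, so qf, pf2 and pf3 clauses are all built from the combined token.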
Re: Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words
FYI: In general we try to make sure that, whenever possible, we have a Factory for any TokenFilter or Tokenizer that ships with Lucene-Core or the Lucene Analysis contrib ... we have a stub-analysis-factory-maker.pl script that automates this in most cases and requires a small amount of coding for others. But in some cases there is no easy way to create a "generic" factory for a TokenFilter. HyphenationCompoundWordTokenFilter is an example of this, because it requires a HyphenationTree to construct it, and HyphenationTree is a fairly complicated class that didn't lend itself to an easy XML configuration for construction. But if you have a specific HyphenationTree instance you want to use, you can hardcode that into a custom TokenFilterFactory. *BUT* before you do that, consider whether or not the DictionaryCompoundWordTokenFilter will meet your needs: there is already a Solr Factory checked in for that. -Hoss
Re: Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words
See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Essentially, you need to create a TokenFilterFactory that wraps it. Please feel free to donate it, too, if you can. -Grant
(In reply to Barry Harding, Jul 23, 2008, at 2:42 PM.)
-- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Adding the Lucene org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter to solr for german compound words
Hi, can anybody point me in the right direction on how to add the org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter token filter to the Solr schema.xml? I need to be able to break up German compound words, and from what I have read this token filter would seem to be what I need. My question is how to configure Solr to use this filter in text field types. Is it possible to just call it directly from the config file, or do I need to wrap it in a custom class in some way? Thanks, Barry H Misco is a division of Systemax Europe Ltd. Registered in Scotland Number 114143. Registered Office: Caledonian Exchange, 19a Canning Street, Edinburgh EH3 8EG. Telephone +44 (0)1933 686000.