Re: Compound word search (maybe DisMaxQueryParser problem)

Oh my ... thinking even more about it, I have to admit you're right :) But that leaves me somewhat clueless again. So I'll just try and share my thoughts on this. Maybe someone will read this and can point me to a possible solution ... or tell me where I'm wrong.

Say we have a schema with fields f1, f2 and f3, and the user queries for "a b c" (without the quotes). What I would expect as the resulting query (leaving out details like tie, boosting etc.) is:

  ((f1:a OR f2:a OR f3:a) AND (f1:b OR f2:b OR f3:b) AND (f1:c OR f2:c OR f3:c))
  OR ((f1:ab OR f2:ab OR f3:ab) AND (f1:c OR f2:c OR f3:c))
  OR ((f1:a OR f2:a OR f3:a) AND (f1:bc OR f2:bc OR f3:bc))

(possibly also f1:abc OR f2:abc ... and/or f1:a b c OR f2:a b c etc.)

So every possible way of writing the compound words is covered.

But then there is the problem that some fields require exact matching (something like EAN, manufacturer code or product serial number). Unfortunately these can contain whitespace etc., so "a b c" can also be a valid manufacturer code which should match as a whole. So I modeled the fields in the schema accordingly: making the exact-match fields string fields and adding ShingleFilter and WordDelimiterFilter to the content fields. And I thought each field's analyzer stack would take care of how to process the user input.

But when I pass the user query as a phrase to the DisMax Handler (so that every field gets to see the whole user query and can tokenize and shingle it) I get a query like this:

  (f1:a b c)
  OR (f2:a OR f2:ab OR f2:b OR f2:bc OR f2:c)
  OR (f3:a OR f3:ab OR f3:b OR f3:bc OR f3:c)

which is not what I need, as it would also find, for example, documents that only contain a or b etc.

When using phrase fields this query is just added to the normal query, and therefore the query fails to find the compound words. Using the FieldQuery Analyzer does not yield the desired results either, as the parsed queries look just like the phrase queries from the DisMax parser.

I tried dozens of variations and I'm still pretty sure that there must be a way to get this working. It doesn't look that hard. But for now I will let this rest for the weekend :)

Have a nice weekend all and thanks in advance for any comments or replies.

Tobi
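A rough sketch of the query expansion described in this message, for illustration only: it enumerates every way of joining adjacent words of the user input into compounds and spreads each resulting term over all fields. The function names `segmentations` and `build_query` are made up for this example and are not part of Solr or Lucene; boosts, tie and exact-match fields are ignored.

```python
def segmentations(tokens):
    """Yield every way to join adjacent tokens into compounds, keeping word order."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = "".join(tokens[:i])          # first i words glued together
        for rest in segmentations(tokens[i:]):
            yield [head] + rest


def build_query(user_input, fields):
    """Build the boolean query string: OR over segmentations, AND over terms, OR over fields."""
    alternatives = []
    for seg in segmentations(user_input.split()):
        clauses = [
            "(" + " OR ".join(f"{field}:{term}" for field in fields) + ")"
            for term in seg
        ]
        alternatives.append("(" + " AND ".join(clauses) + ")")
    return " OR ".join(alternatives)


if __name__ == "__main__":
    print(build_query("a b c", ["f1", "f2", "f3"]))
```

Note that the number of alternatives doubles with every additional word (2^(n-1) segmentations for n words), so a real implementation would probably want to cap the compound length.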
Re: Compound word search (maybe DisMaxQueryParser problem)

: Many thanks for your explanation. That really helped me a lot in understanding
: DisMax - and finally I realized that DisMax is not at all what I need.
: Actually I do not want results where "blue" is in one field and "tooth" in
: another (imagine you search for a notebook with blue tooth and get some blue
: products that accidentally have tooth in some field).

except that if you use the "pf" param as well, a search for...

  blue tooth

...can score products where "blue tooth" appears in one field higher than products where "blue" appears in one field and "tooth" appears in another field.

The approach you are describing might give you better precision (ie: fewer total results) but it will have a loss in recall. a query like this...

  blue tooth notebook

...probably won't be able to find documents matching the terms "product_type:notebook features:blue features:tooth" ... but dismax can.

-Hoss
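To make the "pf" suggestion concrete, here is a small sketch that only builds the request URL for a dismax query with both qf and pf set. The host, the field list and the `^5` boosts are assumptions chosen for illustration; `defType`, `q`, `qf`, `pf`, `mm` and `debugQuery` are the usual Solr request parameters, but how they are wired up depends on the Solr version and on solrconfig.xml.

```python
from urllib.parse import urlencode


def dismax_url(base_url, user_query, query_fields, phrase_fields, mm="100%"):
    """Return a select URL that asks Solr for a dismax query with phrase boosting."""
    params = {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(query_fields),    # fields each word is searched in
        "pf": " ".join(phrase_fields),   # fields where the whole input is boosted as a phrase
        "mm": mm,                        # how many words must match
        "debugQuery": "true",            # show the parsed query in the response
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host and boosts; field names taken from the example in the message.
    print(dismax_url(
        "http://localhost:8983/solr/select",
        "blue tooth notebook",
        ["product_type", "features", "name"],
        ["features^5", "name^5"],
    ))
```

With debugQuery enabled, the parsed query in the response shows whether the phrase clause from pf was actually added.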
Re: Compound word search (maybe DisMaxQueryParser problem)

Many thanks for your explanation. That really helped me a lot in understanding DisMax - and finally I realized that DisMax is not at all what I need. Actually I do not want results where "blue" is in one field and "tooth" in another (imagine you search for a notebook with blue tooth and get some blue products that accidentally have tooth in some field).

My feeling already was that I have to come up with my own solution mixing parts of DisMax (distribute the query among the fields) and FieldQParserPlugin. So now I will try that out.

Many thanks
Tobi
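One possible way to "distribute the query among the fields" without writing a new parser, sketched here under assumptions: it builds a query string that hands the same raw user input to the field query parser once per field, using Solr's nested-query syntax (`_query_:"..."`) and parameter dereferencing (`v=$qq`). The helper name `per_field_query` and the parameter name `qq` are invented for this example. It has not been verified against a particular Solr release, and a plain OR over fields does not give the per-word max-over-fields scoring that dismax provides.

```python
def per_field_query(fields, param_name="qq"):
    """Build a query that sends the whole user input to each field's own analyzer."""
    clauses = [
        f'_query_:"{{!field f={field} v=${param_name}}}"'
        for field in fields
    ]
    return " OR ".join(clauses)


if __name__ == "__main__":
    # The raw user input would be sent separately as the request parameter qq.
    params = {
        "q": per_field_query(["category", "name"]),
        "qq": "blue tooth",
    }
    for key, value in params.items():
        print(f"{key}={value}")
```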
Re: Compound word search (maybe DisMaxQueryParser problem)

: My original assumption for the DisMax Handler was, that it will just take the
: original query string and pass it to every field in its fieldlist using the
: fields configured analyzer stack. Maybe in the end add some stuff for the
: special options and so ... and then send the query to lucene. Can you explain
: why this approach was not chosen?

because then it wouldn't be the DisMaxRequestHandler.

seriously: the point of dismax is to build up a DisjunctionMaxQuery for each "chunk" in the query string and populate those DisjunctionMaxQueries with the Queries produced by analyzing that "chunk" against each field in the qf -- then all of the DisjunctionMaxQueries are grouped into a BooleanQuery with a minNrShouldMatch.

if you look at the query toString from debugQuery (using a non trivial qf param and a q string containing more than one "chunk") you can see what i mean. your example shows it pretty well actually...

: > ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)

the point is to build those DisjunctionMaxQueries -- so that each "chunk" only contributes significantly based on the highest scoring field that chunk appears in ... in your example someone typing "blue tooth" can get a match when a doc matches blue in one field and tooth in another -- that wouldn't be possible with the approach you describe.

the Query structure also means that a doc where "tooth" appears in both the category and name fields but "blue" doesn't appear at all won't score as high as a doc that matches "blue" in category and "tooth" in name (although you have to look at the score explanations to really see what i mean by that)

There are certainly a lot of improvements that could be made to dismax ... more customization in terms of how the query string is parsed before building up the DisjunctionMaxQueries and calling the individual field analyzers would certainly be one way it could improve ... but so far no one has attempted anything like that.

-Hoss
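The scoring behaviour described here can be illustrated with a toy model. A DisjunctionMaxQuery scores a document as the best-matching field's score plus the tie value times the scores of the other matching fields, and the surrounding BooleanQuery adds up the per-chunk results. The sketch below assumes every field match is worth exactly 1.0 and ignores idf, norms, coord and minNrShouldMatch, so the numbers are purely illustrative. The function names are made up.

```python
def dismax_score(field_scores, tie=0.1):
    """Score of one DisjunctionMaxQuery: best field plus tie times the remaining fields."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def doc_score(matches, chunks, tie=0.1):
    """Sum the per-chunk dismax scores; matches maps chunk -> {field: score}."""
    return sum(
        dismax_score(list(matches.get(chunk, {}).values()), tie)
        for chunk in chunks
    )


if __name__ == "__main__":
    chunks = ["blue", "tooth"]
    # Doc A: "blue" in category, "tooth" in name.
    doc_a = {"blue": {"category": 1.0}, "tooth": {"name": 1.0}}
    # Doc B: "tooth" in both fields, "blue" nowhere.
    doc_b = {"tooth": {"category": 1.0, "name": 1.0}}
    print("doc A:", doc_score(doc_a, chunks))
    print("doc B:", doc_score(doc_b, chunks))
```

In this model doc A scores higher than doc B, because the second "tooth" match in doc B only counts with the tie factor.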
Re: Compound word search (maybe DisMaxQueryParser problem)

First of all: sorry Chris, Walter ... I did not mean to put pressure on anyone. It's just that if you're stuck with something, you have that little needle stinging, saying: maybe you're just too damn stupid for this ... :) So, thanks a lot for your answers.

As for index time expansion using synonyms: I think this is not an option for me since it would mean that I have to a) find all such words that might cause problems and b) find every variant that might possibly be used by customers. And then in the end I have to keep all my synonym files up-to-date. But the main design goal for my search implementation is little to no maintenance.

My original assumption for the DisMax Handler was that it would just take the original query string and pass it to every field in its field list using each field's configured analyzer stack. Maybe in the end add some stuff for the special options and so on ... and then send the query to Lucene. Can you explain why this approach was not chosen?

Thanks
Tobi
Re: Compound word search (maybe DisMaxQueryParser problem)

: Hmmm was my mail so weird or my question so stupid ... or is there simply
: no one with an answer? Not even a hint? :(

patience my friend, i've got a backlog of ~500 Lucene related messages in my INBOX, and i was just reading your original email when this reply came in.

In general this is a fairly hard problem ... the easiest solution i know of that works in most cases is to do index time expansion using the SynonymFilter, so regardless of whether a document contains "usbcable", "usb-cable" or "usb cable" all three variants get indexed, and then the user can search for any of them. the downside is that it can throw off your tf/idf stats for some terms (if they appear by themselves, and as part of a compound) and it can result in false positives for esoteric phrase searches (but that tends to be more of a theoretical problem than an actual one).

: > But this never happens since with the DisMax Searcher the parser produces a
: > query like this:
: >
: > ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
...
: > to deal with this compound word problem? Is there another query parser that
: > already does the trick?

take a look at the FieldQParserPlugin ... it passes the raw query string to the analyzer of a specified field -- this would let your TokenFilters see the "stream" of tokens (which isn't possible with the conventional QueryParser tokenization rules) but it doesn't have any of the "field/query matrix cross product" goodness of dismax -- you'd only be able to query the one field.

(Hmmm i wonder if DisMaxQParser 2.0 could have an option to let you specify a FieldType whose analyzer was used to tokenize the query string instead of using the Lucene QueryParser JavaCC tokenization, and *then* the tokens resulting from that initial analyzer could be passed to the analyzers of the various qf fields ... hmmm, that might be just crazy enough to be too crazy to work)

-Hoss
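As a small illustration of the index-time expansion idea: a synonyms file for Solr's synonym filter lists equivalent variants separated by commas on one line. The helper below (a made-up name, not part of Solr) generates such a line from the parts of a compound. Whether the hyphenated and the spaced variants really end up as the intended tokens depends on the tokenizer and the other filters in the field's analyzer chain, so treat this as a sketch.

```python
def synonym_line(parts):
    """Return one synonyms-file line listing the joined, hyphenated and spaced variants."""
    joined = "".join(parts)       # usbcable
    hyphenated = "-".join(parts)  # usb-cable
    spaced = " ".join(parts)      # usb cable
    return ", ".join([joined, hyphenated, spaced])


if __name__ == "__main__":
    for compound in (["usb", "cable"], ["blue", "tooth"]):
        print(synonym_line(compound))
```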
Re: Compound word search (maybe DisMaxQueryParser problem)

Sorry, I missed this. We have the same problem.

None of our customers use query syntax, so I have considered making a full-text query parser: use the analyzer chain, then convert the result into a big OR query, then pass it to the rest of Dismax. Shingles and synonyms should work at query time with that approach.

This question should probably go to a Lucene list, too.

wunder

On 3/11/09 2:54 AM, "Tobias Dittrich" wrote:

> Hmmm was my mail so weird or my question so stupid ... or is
> there simply no one with an answer? Not even a hint? :(
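A toy version of the "analyze the whole input, then build a big OR query" idea, assuming a stand-in analyzer: `analyze` below merely lowercases, splits on whitespace and adds concatenated two-word shingles. It is not Lucene's analyzer chain, and both function names are invented for this sketch. In a real parser the resulting terms would then be handed on to the dismax machinery instead of being joined into a string.

```python
def analyze(text, max_shingle=2):
    """Stand-in analyzer: single words first, then adjacent words glued together."""
    words = text.lower().split()
    terms = []
    for size in range(1, max_shingle + 1):
        for i in range(len(words) - size + 1):
            terms.append("".join(words[i:i + size]))
    return terms


def or_query(text, field):
    """Turn every analyzed term into one clause of a big OR query on a single field."""
    return " OR ".join(f"{field}:{term}" for term in analyze(text))


if __name__ == "__main__":
    print(or_query("blue tooth", "name"))
```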
Re: Compound word search (maybe DisMaxQueryParser problem)

Hmmm was my mail so weird or my question so stupid ... or is there simply no one with an answer? Not even a hint? :(

--
Tobias Dittrich
- Head of Internet Development -
WAVE Computersysteme GmbH
Philipp-Reis-Str. 9
35440 Linden
Managing Director: Carsten Kellmann
Commercial register: Gießen HRB 1823
Phone: +49 (0) 6403 / 9050 6001
Fax: +49 (0) 6403 / 9050 5089
mailto:dittr...@wave-computer.de
http://www.wave-computer.de
Compound word search (maybe DisMaxQueryParser problem)

Hi all,

I know there are a lot of topics about compound word search already but I haven't found anything for my specific problem yet. So if this is already answered (which would be nice :)) then any hints or search phrases for the mail archive would be appreciated.

Basically I want users to be able to search my index for compound words that are not really compounds but merely terms that can be written in several ways.

For example I have the categories "usb" and "cable" in my index and I want the user to be able to search for "usbcable" or "usb-cable" etc. Also there is "bluetooth" in the index and I want the search for "blue tooth" to return the corresponding documents.

My approach is to use ShingleFilterFactory followed by WordDelimiterFilterFactory to index all possible combinations of words and get rid of intra-word delimiters. This nicely covers the first part of my requirements, since the terms "usb" and "cable" get concatenated somewhere along the process and "usbcable" is in the index.

Now I also want to use this on the query side, so the user input "blue tooth" (not as a phrase) would become "bluetooth" for this field and produce a hit. But this never happens, since with the DisMax Searcher the parser produces a query like this:

  ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)

And the filters and analyzers for this field never get to see the whole user query and cannot perform their shingle and delimiter tasks :(

So my question now is: how can I get this working? Is there a preferable way to deal with this compound word problem? Is there another query parser that already does the trick?

Or would it make sense to write my own query parser that passes the user query "as is" to the several fields?

Any hints on this are welcome.

Thanks in advance
Tobias
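To illustrate why this works at index time, here is a toy model of the shingle-then-strip-delimiters chain. It is a simplification written for this example and does not reproduce the behaviour or the options of Lucene's ShingleFilter and WordDelimiterFilter. All function names are made up.

```python
import re


def shingles(tokens, max_size=2, sep=" "):
    """Return each token plus every run of up to max_size adjacent tokens."""
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                out.append(sep.join(tokens[i:i + size]))
    return out


def catenate(term):
    """Drop everything that is not a letter or digit, gluing the parts together."""
    return re.sub(r"[^0-9A-Za-z]+", "", term)


def index_terms(text):
    """Terms that would end up in the index for the given field value (toy model)."""
    tokens = text.lower().split()
    terms = []
    for shingle in shingles(tokens):
        term = catenate(shingle)
        if term and term not in terms:
            terms.append(term)
    return terms


if __name__ == "__main__":
    print(index_terms("usb cable"))
    print(index_terms("USB-Cable"))
```

Both spellings yield "usbcable" here, but only because the analyzer sees the whole field value. On the query side, the dismax parser has already split "blue tooth" into two chunks before any analyzer runs, so no shingle can be formed.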