Re: Index search questions; special cases
: Chris, thanks for the tips (or should I say, detailed explanation!). I
: actually got it working! It was a pain at first (never did any java, and

good to know .. glad it worked out for you.

: If Solr is interested in the filter, just tell me (and how I should go
: about contributing it).

The full list of instructions on how to submit a patch can be found on the wiki...

http://wiki.apache.org/solr/HowToContribute

...ideally a patch should include unit tests demonstrating the new feature, but if you don't have any of those (and don't feel like writing them) a patch can still be useful to other people (who might be interested in writing unit tests to encourage getting the changes added).

If you do open a Jira issue and attach your code, please note this thread and the URL of the original class in Nutch, so people who may stumble across it in Jira know where the original version is.

-Hoss
Re: Index search questions; special cases
> CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested private static class Filter extends TokenFilter which doesn't really have any external dependencies. If you extract that class into some more specifically named CommonGramsFilter, all you need after that to use it in Solr is a simple little FilterFactory so you can reference it in your schema.xml ... you can use the StopFilterFactory as a template since you'll need exactly the same initialization (get the name of a word list file from the init params, parse it, and build a word set out of it)...
>
> http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup
>
> ...all you really need to change is that the create method should return a new CommonGramsFilter instead of a StopFilter.
>
> Incidentally: most of the code in CommonGrams.Filter seems to be dealing with the buffering of tokens ... it may be easier to reimplement the logic with Solr's BufferedTokenStream as a base class.

Chris, thanks for the tips (or should I say, detailed explanation!). I actually got it working! It was a pain at first (never did any java, and all this ant, junit, war, jar, java, .classes are confusing!). I had some compile errors that I cleaned up. Playing around with the filter in the admin panel analyser yields the expected results; I can't thank you enough for your help. I now use:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords-complete.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords-complete.txt" ignoreCase="true"/>

And it works perfectly. If Solr is interested in the filter, just tell me (and how I should go about contributing it).

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212
Re: Index search questions; special cases
Erik Hatcher wrote:
> Yeah, the Nutch code is highly intertwined with its unique configuration infrastructure and makes it hard to pull pieces of it out like this.

This is a critique that has been heard a lot (mainly because it's true :)

It would be really cool if the different Lucene camps could build these nice utilities to be usable between projects. Not exactly sure how this could be accomplished, but anyway, something to consider.

--
Sami Siren
Re: Index search questions; special cases
: Yeah, the Nutch code is highly intertwined with its unique configuration
: infrastructure and makes it hard to pull pieces of it out like this.

that CommonGrams inner Filter class seemed like it could be extracted easily enough.

: This is a critique that has been heard a lot (mainly because it's true :)
: It would be really cool if different camps of lucene could build these
: nice utilities to be usable between projects. Not exactly sure how this
: could be accomplished but anyway something to consider.

[EMAIL PROTECTED] is probably the best place to raise this discussion if you're interested in pursuing it ... i think the best way to deal with it may just be on a case by case basis ... if you find cool code in sub-project XYZ, start by working with XYZ-dev to refactor it into an extractable chunk, then work with java-dev to promote it up into the Lucene Java code base, and then circle back to XYZ-dev to deprecate the copy in the XYZ code repository and replace it with a dependency on the newly promoted version.

-Hoss
Re: Index search questions; special cases
: : Nutch has phrase pre-filtering which helps with this. It indexes the
: : phrase fragments as separate terms and uses that set of matches to
: : filter the set of matching documents.

: That reminds me ... i seem to remember someone saying once that Nutch also
: builds word based n-grams out of its stop words, so searches on "the"
: or "on" won't match anything because those words are never indexed as
: single tokens, but if a document contains "the dog in the house" it would
: match a search on "in the" because the Analyzer would treat that as a
: single token "in_the".

: This looks like exactly what I'm looking for. Is it related to the above
: 'nutch pre-filtering'? This way if I stopword single letters and
: numbers, it would still index 'hepatitis_a' as a single token, and match
: a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
: has hepatitis'? I guess i'd have to apply the filter to the query too,
: so it turns the query into hepatitis_a?

right ... i think we were both talking about the same feature, which Otis says is in Nutch's CommonGrams class...

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/CommonGrams.java?view=markup

: Any chance at all this kind of filter gets implemented into solr? If
: not, indications on how to do it myself would be appreciated - I can't

CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested private static class Filter extends TokenFilter which doesn't really have any external dependencies. If you extract that class into some more specifically named CommonGramsFilter, all you need after that to use it in Solr is a simple little FilterFactory so you can reference it in your schema.xml ... you can use the StopFilterFactory as a template since you'll need exactly the same initialization (get the name of a word list file from the init params, parse it, and build a word set out of it)...

http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the create method should return a new CommonGramsFilter instead of a StopFilter.

Incidentally: most of the code in CommonGrams.Filter seems to be dealing with the buffering of tokens ... it may be easier to reimplement the logic with Solr's BufferedTokenStream as a base class.

-Hoss
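For readers who want to see the idea before digging into the Java, here is a minimal sketch of the "common grams" behaviour described above, written in plain Python rather than against the Lucene API. It is not the Nutch CommonGrams code: the function names (common_grams, stop_filter, analyze) are made up for illustration, token positions are ignored, and the "_" separator is an assumption taken from the in_the example in this thread. It chains a common-grams step and a stop filter over the same word list, which is the order Michael reports using in his schema.

```python
def common_grams(tokens, common_words, sep="_"):
    """Emit every token, plus a joined bigram wherever either neighbour is a common word."""
    common = {w.lower() for w in common_words}
    tokens = [t.lower() for t in tokens]
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + sep + nxt)
    return out


def stop_filter(tokens, stopwords):
    """Drop the single-word stopwords; joined bigrams such as 'in_the' survive."""
    stop = {w.lower() for w in stopwords}
    return [t for t in tokens if t not in stop]


def analyze(text, words):
    """Hypothetical analyzer chain: whitespace split, common grams, then stop filter."""
    return stop_filter(common_grams(text.split(), words), words)


print(analyze("the dog in the house", {"the", "in"}))
print(analyze("in the", {"the", "in"}))
print(analyze("Hepatitis A", {"a"}))
```

Run over both the document and the query, a search on "in the" reduces to the single token in_the, while a search on "the" alone reduces to nothing, which is the behaviour described in the thread.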
Re: Index search questions; special cases
On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote:
> CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested private static class Filter extends TokenFilter which doesn't really have any external dependencies. If you extract that class into some more specifically named CommonGramsFilter,...

Yeah, the Nutch code is highly intertwined with its unique configuration infrastructure and makes it hard to pull pieces of it out like this.

Erik
Re: Index search questions; special cases
On 11/12/06 8:52 PM, Michael Imbeault [EMAIL PROTECTED] wrote:
> Sadly I can't rely on users' smartness for this :) I have concerns that for stuff like Hepatitis A, it will match just about every document containing hepatitis and the very common 'a' word, anywhere in the document. I can't stopword single letters, cause then there would be no way to find documents about 'hepatitis c' and not about 'hepatitis b' for example. I will test my solution and report; if you have any other ideas, just tell me.

Nutch has phrase pre-filtering which helps with this. It indexes the phrase fragments as separate terms and uses that set of matches to filter the set of matching documents.

Another approach is to implement protected phrases, similar to the protected words in stemming. These would be protected from stopword processing.

A list of exception words and phrases is a pretty common trick in other engines. Otherwise, you go nuts trying to get your analyzer to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi did this.

wunder
--
Walter Underwood
Search Guru, Netflix
Re: Index search questions; special cases
On 11/13/06, Walter Underwood [EMAIL PROTECTED] wrote:
> Another approach is to implement protected phrases, similar to the protected words in stemming. These would be protected from stopword processing.

One could use the synonym filter (which can handle multi-token synonyms) to get this effect.

WordDelimiterFilter => SynonymFilter => StopwordFilter => Stemmer

The SynonymFilter could have the following config:
hepatitis a, hepatitis_a

Do expand=true on the indexing analyzer, and expand=false on the query analyzer.

Then, a doc with "hepatitis a" will end up indexing "hepatitis" and "hepatitis_a".
And at query time all the following searches will find the doc:
text:hepatitis
text:"hepatitis a"
text:hepatitis-a

> A list of exception words and phrases is a pretty common trick in other engines. Otherwise, you go nuts trying to get your analyzer to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi did this.

That's not a bad idea... most of the code from the multi-token SynonymFilter could be reused to efficiently recognize multi-token matches.

-Yonik
Re: Index search questions; special cases
: Sadly I can't rely on users' smartness for this :) I have concerns that
: for stuff like Hepatitis A, it will match just about every document
: containing hepatitis and the very common 'a' word, anywhere in the
: document. I can't stopword single letters, cause then there would be no
: way to find documents about 'hepatitis c' and not about 'hepatitis b'

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

That reminds me ... i seem to remember someone saying once that Nutch also builds word based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a document contains "the dog in the house" it would match a search on "in the" because the Analyzer would treat that as a single token "in_the".

something like that might work as well.

-Hoss
Re: Index search questions; special cases
On 11/12/06, Michael Imbeault [EMAIL PROTECTED] wrote:
> - Somewhat related: Let's say I index "Polymyxin B". If I stopword single letters, would a phrase search ("Polymyxin B") still find the right documents (I don't think so, but still)? If not, I'll have to index single letters; how do I prevent the same problem as in the first question (i.e., a search on Polymyxin B yielding documents with Polymyxin and B, but not close to one another).

The general problem seems to be that you can't tell what should be in a phrase search and what shouldn't.

You could try throwing everything in a sloppy phrase query, so at least scores will go up when terms are closer together (in general).

You could also try an exact phrase query, and if you don't get enough results, follow it up with another strategy (like what you have below).

> My thought is to parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers. If a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?

That might work. Whatever general strategy you end up trying, you can probably boost relevancy with some domain specific knowledge injected with something like the SynonymFilter.

-Yonik
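The "exact phrase first, then fall back" strategy can be sketched roughly as follows. This is plain Python over whitespace tokens, not Lucene's PhraseQuery: the helper names (positions, near, search) and the min_hits threshold are invented for illustration, and "slop" here simply means the number of tokens allowed between the two terms, which is a simplification of how Lucene computes sloppy phrase matches.

```python
def positions(tokens, term):
    """All positions at which `term` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == term]


def near(tokens, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` tokens in between."""
    return any(0 < q - p <= slop + 1
               for p in positions(tokens, first)
               for q in positions(tokens, second))


def search(docs, first, second, min_hits=1):
    """Try the exact phrase; if too few docs match, fall back to a plain AND."""
    first, second = first.lower(), second.lower()
    tokenized = [(d, d.lower().split()) for d in docs]
    exact = [d for d, toks in tokenized if near(toks, first, second)]
    if len(exact) >= min_hits:
        return exact
    return [d for d, toks in tokenized if first in toks and second in toks]


docs = ["Polymyxin B was effective",
        "Polymyxin resistance in group B streptococcus"]
print(search(docs, "Polymyxin", "B"))
```

With both documents present only the first is returned, because the exact phrase produced enough hits; the second document would only surface through the AND fallback.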
Re: Index search questions; special cases
On 11/13/06, Yonik Seeley [EMAIL PROTECTED] wrote:
> The SynonymFilter could have the following config:
> hepatitis a, hepatitis_a

Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a

so that when expand=false for querying, "hepatitis a" is mapped to hepatitis_a

-Yonik
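To make the expand=true / expand=false distinction concrete, here is a small Python sketch of the mapping just described. It is an illustration only, not Solr's SynonymFilter: parse_group and apply_synonyms are hypothetical names, token positions are ignored, and it assumes that expand=false reduces every variant in a line to the first entry on that line, which is why the order of the entries matters.

```python
def parse_group(line):
    """Turn 'hepatitis_a, hepatitis a' into [('hepatitis_a',), ('hepatitis', 'a')]."""
    return [tuple(v.split()) for v in line.split(",") if v.strip()]


def apply_synonyms(tokens, groups, expand):
    """Greedy longest-match replacement of (possibly multi-token) synonyms."""
    table = {variant: group for group in groups for variant in group}
    longest = max((len(v) for v in table), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            group = table.get(tuple(tokens[i:i + n]))
            if group:
                emitted = []
                # expand=True emits every variant; expand=False only the first one.
                for variant in (group if expand else group[:1]):
                    for t in variant:
                        if t not in emitted:
                            emitted.append(t)
                out.extend(emitted)
                i += n
                break
        else:
            out.append(tokens[i])  # no synonym starts here; keep the token
            i += 1
    return out


groups = [parse_group("hepatitis_a, hepatitis a")]
print(apply_synonyms(["hepatitis", "a"], groups, expand=True))   # index side
print(apply_synonyms(["hepatitis", "a"], groups, expand=False))  # query side
```

On the index side the stop filter that follows would then drop the lone "a", leaving hepatitis and hepatitis_a in the index, while the query "hepatitis a" collapses to hepatitis_a.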
Re: Index search questions; special cases
On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
> That reminds me ... i seem to remember someone saying once that Nutch also builds word based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a document contains "the dog in the house" it would match a search on "in the" because the Analyzer would treat that as a single token "in_the".

Yup, we covered this in LIA:

http://lucenebook.com/search?query=nutch+stop+words
Re: Index search questions; special cases
Indeed. CommonGrams.java in Nutch is the place to look.

Otis

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 13, 2006 2:08:51 PM
Subject: Re: Index search questions; special cases

On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
> That reminds me ... i seem to remember someone saying once that Nutch also builds word based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a document contains "the dog in the house" it would match a search on "in the" because the Analyzer would treat that as a single token "in_the".

Yup, we covered this in LIA:

http://lucenebook.com/search?query=nutch+stop+words
Re: Index search questions; special cases
Hello everyone,

Thanks for all your answers; synonym-based approaches won't work because the medical / research field is evolving way too fast; the list would be huge and would become unmaintainable very quickly. Anyway, I can't rely on score because I'm sorting by date, so I need to completely eliminate the problem of 'hiv' appearing in one part of the doc and '1' in another (if I want docs that fit HIV-1, or Polymyxin B, or hepatitis A, I don't want docs that fit 'A patient was cured of hepatitis C' when I search for 'hepatitis a').

> : Nutch has phrase pre-filtering which helps with this. It indexes the
> : phrase fragments as separate terms and uses that set of matches to
> : filter the set of matching documents.

Is this a filter that I could implement easily into Solr? I never did java, but it can't be that complicated I guess. Any help would be appreciated.

> That reminds me ... i seem to remember someone saying once that Nutch also builds word based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a document contains "the dog in the house" it would match a search on "in the" because the Analyzer would treat that as a single token "in_the".

This looks like exactly what I'm looking for. Is it related to the above 'nutch pre-filtering'? This way if I stopword single letters and numbers, it would still index 'hepatitis_a' as a single token, and match a search on 'hepatitis a' (non-phrase search) without hitting 'a patient has hepatitis'? I guess I'd have to apply the filter to the query too, so it turns the query into hepatitis_a?

Basically, it's another way to do what I proposed as a solution: rewrite the query to include phrase queries when you find a stopword, if you index them anyway. Still, this solution looks better, as the index would probably be smaller than if I didn't stopword single letters at all?

For reference, what I proposed was:

> My thought is to parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers. If a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?

Any chance at all this kind of filter gets implemented into solr? If not, indications on how to do it myself would be appreciated - I can't say I have a clue right now (never did java, the only lucene programming I did was via a php bridge).

Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212
Index search questions; special cases
Hello again,

- Let's say I index HIV-1 with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which after parsing by the above filter would yield HIV1 or HIV 1) also find documents which have HIV and the number 1 somewhere in the document, but not directly after HIV? If so, how should I fix this? I could boost score by proximity, but I'm doing a sort on date anyway, so I guess it would be pointless to do so.

- Somewhat related: Let's say I index "Polymyxin B". If I stopword single letters, would a phrase search ("Polymyxin B") still find the right documents (I don't think so, but still)? If not, I'll have to index single letters; how do I prevent the same problem as in the first question (i.e., a search on Polymyxin B yielding documents with Polymyxin and B, but not close to one another).

My thought is to parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers. If a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212
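The query rewrite proposed in this message could look roughly like the Python sketch below. It is only one possible reading of the proposal: the rewrite function, the SHORT pattern (a single letter or a run of digits) and the decision to pair each short term with its left and right neighbour in turn are all assumptions made for illustration, and a query containing several short terms is handled naively.

```python
import re

# Hypothetical definition of a "short" term: one letter, or a number.
SHORT = re.compile(r"^(?:[A-Za-z]|\d+)$")


def rewrite(query):
    """Pair each short term with a neighbour as a phrase and AND the remaining terms."""
    terms = query.split()
    clauses = []
    for i, term in enumerate(terms):
        if not SHORT.match(term):
            continue
        for j in (i - 1, i + 1):  # left neighbour, then right neighbour
            if 0 <= j < len(terms):
                lo, hi = min(i, j), max(i, j)
                phrase = '"%s %s"' % (terms[lo], terms[hi])
                rest = terms[:lo] + terms[hi + 1:]
                clause = "(" + " AND ".join([phrase] + rest) + ")"
                if clause not in clauses:
                    clauses.append(clause)
    if not clauses:
        return " AND ".join(terms)
    return " OR ".join(clauses)


print(rewrite("HIV 1 hepatitis"))
print(rewrite("Polymyxin B"))
```

For the example in the message this produces ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND HIV); a query with no short terms is left as a plain AND of its terms.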
Re: Index search questions; special cases
: - Let's say I index HIV-1 with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number 1 somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I could boost
: score by proximity, but I'm doing a sort on date anyway, so I guess it
: would be pointless to do so.

A couple of things make your question really hard to answer ... first off, you can specify different analyzer chains for index time and query time -- when dealing with the WordDelim filter (or the synonym filter) this is frequently necessary -- so the answers to your questions really depend on whether you use WordDelim at both index time and query time (or if you do use it in both cases, but configure it differently).

Have you by any chance played with the Analysis page on your Solr index?

http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on

...it makes it really easy to see exactly how your various fields will get parsed at index time and query time. I would also suggest you use the debugQuery=on option when doing some searches -- even if there aren't any documents in your index, that will help you see how your query is getting parsed and what Query structure QueryParser is building based on the tokens it gets from each of the Analyzers.

: - Somewhat related : Let's say I index Polymyxin B. If I stopword
: single letters, would a phrase search (Polymyxin B) still find the
: right documents (I don't think so, but still)? If not, I'll have to

depends on what the right documents are .. if you strip stopwords out both at index time and at query time then it will ultimately match exactly the same thing as a query on Polymyxin, which i guess must be the right documents since no documents will contain the letter B so what else could be right? :)

: index single letters; how do I prevent the same problem as in the first
: question (i.e., a search on Polymyxin B yielding documents with
: Polymyxin and B, but not close to one another).
:
: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If an user
: search for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you might just want to trust that your users will be smart, and if they find that 'HIV 1 hepatitis' is matching docs where 1 doesn't appear near HIV or hepatitis then they will start entering '"HIV 1" hepatitis' (or 'HIV "1 hepatitis"' if that's what they meant.)

-Hoss
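The point about stripping stopwords on both sides can be checked with a tiny Python sketch. The analyzer below is hypothetical (whitespace split, lowercase, drop stopwords) and assumes, as in the question, that every single letter is on the stopword list.

```python
import string

# Assumption for illustration: all single letters are treated as stopwords.
SINGLE_LETTERS = set(string.ascii_lowercase)


def analyze(text, stopwords=SINGLE_LETTERS):
    """Whitespace split, lowercase, then drop stopwords (used at index and query time)."""
    return [t for t in text.lower().split() if t not in stopwords]


print(analyze("Polymyxin B"))   # same tokens as analyze("Polymyxin")
print(analyze("hepatitis c"))   # same tokens as analyze("hepatitis b")
```

Once the letter is gone from both the index and the query, "hepatitis c" and "hepatitis b" become the same search, which is the concern raised in the follow-up.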
Re: Index search questions; special cases
Chris Hostetter wrote:
> A couple of things make your question really hard to answer ... first off, you can specify different analyzer chains for index time and query time -- when dealing with the WordDelim filter (or the synonym filter) this is frequently necessary -- so the answers to your questions really depend on whether you use WordDelim at both index time and query time (or if you do use it in both cases, but configure it differently).

For clarification, I'm using the filter both at index and query time.

> Have you by any chance played with the Analysis page on your Solr index?
>
> http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on
>
> ...it makes it really easy to see exactly how your various fields will get parsed at index time and query time. I would also suggest you use the debugQuery=on option when doing some searches -- even if there aren't any documents in your index, that will help you see how your query is getting parsed and what Query structure QueryParser is building based on the tokens it gets from each of the Analyzers.

Will try that. I played with it in the past, but not for this particular problem. Good idea :)

> : My thought is to parse the user query and rephrase it to do phrase
> : searches on nearby terms containing single letters / numbers. If an user
> : search for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
> : ("1 hepatitis" AND hiv). Is it a sensible solution?
>
> that's kind of a strange behavior for a search application to have ... you might just want to trust that your users will be smart, and if they find that 'HIV 1 hepatitis' is matching docs where 1 doesn't appear near HIV or hepatitis then they will start entering '"HIV 1" hepatitis' (or 'HIV "1 hepatitis"' if that's what they meant.)

Sadly I can't rely on users' smartness for this :) I have concerns that for stuff like Hepatitis A, it will match just about every document containing hepatitis and the very common 'a' word, anywhere in the document. I can't stopword single letters, cause then there would be no way to find documents about 'hepatitis c' and not about 'hepatitis b' for example. I will test my solution and report; if you have any other ideas, just tell me.

And thanks for the help! :)