Re: Wildcards / Binary searches
Chris Hostetter wrote:

: : It could be a useful request handler ? Giving a field, with a
:
: perhaps, but as i said -- i think it requires more than just a special request handler, you want a special index as well.
:
: FYI: there is an ongoing thread on this general topic on the java-user list, i didn't have the time/energy to follow it but the concepts discussed there might prove interesting for you (most of the people involved have spent a lot more time on problems like this than i have)...
:
: http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html

Interesting, here is my idea: WildcardTermEnum (NOT query)

http://www.nabble.com/Re%3A-How-to-implement-AJAX-search%7ELucene-Search-part--p11027221.html

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique
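To illustrate the "enumerate terms, not a query" idea: a minimal sketch in plain Python, assuming the index's terms are available as a sorted list. This is not how WildcardTermEnum is implemented; `terms_with_prefix` and its parameters are hypothetical names used only for illustration.

```python
import bisect


def terms_with_prefix(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms starting with `prefix`.

    `sorted_terms` must already be sorted; the lookup jumps to the first
    candidate with a binary search and then walks forward, much like
    positioning a term enumeration and reading from there.
    """
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for i in range(start, len(sorted_terms)):
        term = sorted_terms[i]
        # Stop at the first term that no longer shares the prefix.
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches


terms = sorted(["radio", "radiohead", "radiology", "radish", "rage", "queen"])
suggestions = terms_with_prefix(terms, "radio")
```

No documents are scored here; the cost depends on the number of matching terms, not on the number of documents containing them.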
Re: Wildcards / Binary searches
: Do you mean something like below ?
: <field name="autocomplete">w wo wor word</field>

yeah, but there are some Tokenizers that make this trivial (EdgeNGramTokenizer i think is the name)

: project, definitely not a good practice for portability of indexes. A
: duplicate field with an analyser to produce a sortable ASCII version
: would be better.

exactly ... I think conceptually the methodology for solving the problem is very similar to the way the SpellChecker contrib works: use a very custom index designed for the application (not just look at the terms in the main corpus) and custom logic for using that index.

-Hoss
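For reference, the "w wo wor word" field content quoted above can be produced along these lines. This is a plain Python sketch of the edge n-gram idea, not the Lucene EdgeNGramTokenizer itself; the function names and the `max_len` default are assumptions.

```python
def edge_ngrams(word, min_len=1, max_len=10):
    """Return the leading substrings of `word`, from min_len up to max_len characters."""
    upper = min(len(word), max_len)
    return [word[:n] for n in range(min_len, upper + 1)]


def autocomplete_tokens(text, max_len=10):
    """Lowercase the text, split on whitespace and emit every prefix of every word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word, max_len=max_len))
    return tokens


field_value = " ".join(edge_ngrams("word"))
```

With prefixes indexed as separate tokens, a type-ahead lookup becomes an ordinary term match on what the user has typed so far, instead of a prefix query expanded at search time.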
Re: Wildcards / Binary searches
Sorry to jump on a side note of the thread, but the topic is close to one of my needs of the moment.

: Side Note: It's my opinion that "type ahead" or "auto complete" style functionality is best addressed by customized logic (most likely using specially built fields containing all of the prefixes of the key words up to N characters as separate tokens).

Do you mean something like below?

<field name="autocomplete">w wo wor word</field>

: simple uses of PrefixQueries are only going to get you so far, particularly under heavy load or in an index with a large number of unique terms.

For a bibliographic app with Lucene, I implemented a suggest on different fields (especially subject terms, like topic or place), to populate a form with already used values. I used the Lucene IndexReader to get, very quickly, a list of terms in sorted order, without duplicate values.

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)

There is a serious drawback to this approach: the enumeration is ordered by Term.compareTo(), so the sort order is natively ASCII, with uppercase before lowercase. I had to patch Lucene's Term.compareTo() for this project, definitely not a good practice for portability of indexes. A duplicate field with an analyser to produce a sortable ASCII version would be better.

Opinions of the list on this topic would be welcome.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique
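A small sketch of the "sortable ASCII version" idea, in plain Python rather than as a Lucene analyser. The folding rules (strip accents, lowercase) and the name `sort_key` are assumptions; a real analyser might fold more or less.

```python
import unicodedata


def sort_key(term):
    """Fold a term to a lowercase, accent-free form suitable for sorting."""
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop combining marks so that "É" sorts like "E".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


terms = ["Zola", "abbaye", "École", "Paris", "eau"]

# Raw code point order: uppercase first, accented capitals last.
raw_order = sorted(terms)

# Order by the folded key, as a duplicate "sortable" field would give.
folded_order = sorted(terms, key=sort_key)
```

Storing the folded form in a separate field keeps the displayed value intact and avoids patching Term.compareTo().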
RE: Wildcards / Binary searches
I have a similar question about dismax, here is what Chris said:

: the dismax handler uses a much more simplified query syntax than the standard request handler. Only +, -, and " are special characters, so wildcards are not supported.

HTH

-----Original Message-----
From: galo [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 06, 2007 8:41 AM
To: solr-user@lucene.apache.org
Subject: Wildcards / Binary searches

Hi,

Three questions:

1. I want to use solr for some sort of live search, querying with incomplete terms + wildcard and getting any similar results. Radioh* would return anything containing that string. The DisMax request handler doesn't accept wildcards in the q param, so I'm trying the simple one and still have problems, as all my results are coming back with score = 1 and I need them sorted by relevance. Is there a way of doing this? Why doesn't * work in dismax (nor ~ by the way)?

2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int values for each document. Everything works ok as integers, but I'd like to have some sort of fuzzy search based on the bit representation of the numbers. Essentially, this number:

1001001010100

would be compared to these two:

1011001010100
1001001010111

And the first would get a bigger score than the second, as it has only 1 flipped bit while the second has 2.

Is it possible to implement this in solr?

Cheers,
galo
Re: Wildcards / Binary searches
At 4:40 PM +0100 6/6/07, galo wrote:
: 1. I want to use solr for some sort of live search, querying with incomplete terms + wildcard and getting any similar results. Radioh* would return anything containing that string. The DisMax request handler doesn't accept wildcards in the q param, so I'm trying the simple one and still have problems, as all my results are coming back with score = 1 and I need them sorted by relevance. Is there a way of doing this? Why doesn't * work in dismax (nor ~ by the way)?

DisMax was written with the intent of supporting a simple search box in which one could type or paste some text, e.g. a title like

  Santa Clause: Is he Real (and if so, what is real)?

and get meaningful results. To do that it pre-processes the query string by removing unbalanced quotation marks and escaping characters that would otherwise be treated by the query parser as operators:

  \ ! ( ) : ^ [ ] { } ~ * ?

I have a local version of DisMax which parameterizes the escaping so certain operators can be allowed through, which I'd be happy to contribute to you or the codebase, but I expect SimpleRH may be a better tool for your application than DisMaxRH, as long as you get it to score as you wish.

Both Standard and DisMax request handlers use SolrQueryParser, an extension of the Lucene query parser which introduces a small number of changes, one of which is that prefix queries, e.g. Radioh*, are evaluated with ConstantScorePrefixQuery rather than the standard PrefixQuery.

In issue SOLR-218 developers have been discussing per-field control of query parser options (some of it Solr's, some of it Lucene's). When that is implemented there should additionally be a property useConstantScorePrefixQuery analogous to the unfortunately-named QueryParser useOldRangeQuery, but handled by SolrQueryParser (until CSPQs are implemented as an option in Lucene QP).

Until that time, well, Chris H. posted a clever and rather timely workaround on the solr-dev list:

: one workaround people may want to consider ... is to force the use of a WildcardQuery in what would otherwise be interpreted as a PrefixQuery by putting a "?" before the "*", i.e. auto?* instead of auto* (yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?

- J.J.
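The pre-processing described above can be sketched roughly as follows. This is an illustration in Python, not the DisMax source code: `escape_query` and its `allowed` parameter are hypothetical, and dropping every quotation mark when their count is odd is an assumption about how "unbalanced" is handled.

```python
# Characters the message lists as escaped before the query reaches the parser.
OPERATORS = set('\\!():^[]{}~*?')


def escape_query(text, allowed=""):
    """Escape query-parser operators, letting through those named in `allowed`."""
    # Assumption: an odd number of quotes means they are unbalanced, so drop them all.
    if text.count('"') % 2 == 1:
        text = text.replace('"', "")
    escaped = []
    for ch in text:
        if ch in OPERATORS and ch not in allowed:
            escaped.append("\\" + ch)
        else:
            escaped.append(ch)
    return "".join(escaped)


title = escape_query("Santa Clause: Is he Real (and if so, what is real)?")
strict = escape_query("Radioh*")
relaxed = escape_query("Radioh*", allowed="*")
```

With the default settings the `*` in `Radioh*` is escaped and treated as a literal character, which matches the behaviour galo observed; passing `allowed="*"` mimics the parameterized escaping J.J. mentions.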
Re: Wildcards / Binary searches
Yeah, I thought of that solution, but this is a 20G index with each document having around 300 of those numbers, so I was a bit worried about the performance. I'll try anyway, thanks!

On 06/06/07, Yonik Seeley [EMAIL PROTECTED] wrote:

: On 6/6/07, galo [EMAIL PROTECTED] wrote:
: : 3. I'm trying to implement another index where I store a number of int values for each document. Everything works ok as integers, but I'd like to have some sort of fuzzy search based on the bit representation of the numbers. Essentially, this number:
: :
: : 1001001010100
: :
: : would be compared to these two:
: :
: : 1011001010100
: : 1001001010111
: :
: : And the first would get a bigger score than the second, as it has only 1 flipped bit while the second has 2.
:
: You could store the numbers as a string field with the binary representation, then try a fuzzy search.
:
: myfield:1001001010100~
:
: -Yonik
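For comparison, the ranking asked for in question 3 is a Hamming distance over the bit strings. Below is a minimal Python sketch of that scoring, computed outside Solr; it is not something the fuzzy query is confirmed to reproduce exactly, and `bit_similarity` is a made-up scoring function.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    # XOR leaves a 1 wherever the bits differ; count those ones.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def bit_similarity(a, b):
    """Score in [0, 1]: 1.0 for identical strings, lower for each flipped bit."""
    return 1.0 - hamming_distance(a, b) / len(a)


query = "1001001010100"
candidates = ["1001001010111", "1011001010100"]

# Highest similarity (fewest flipped bits) first.
ranked = sorted(candidates, key=lambda c: bit_similarity(query, c), reverse=True)
```

With the values from the thread, the candidate with one flipped bit ranks ahead of the one with two.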
Re: Wildcards / Binary searches
Ok, further to my email below, I've been testing with q=radioh?*

Basically the problem is, searching artists, even with Radiohead having a big boost, it's returning stuff with less boost before it, like Radiohead+Ani Di Franco or Radiohead+Michael Stipe.

The debug output is below, but basically, for Radiohead and one of the others we get this:

  radiohead+ani - 655391.5 * 0.046359334
  radiohead - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers, the cause seems to be in this line:

  8.781371 = idf(docFreq=4096)

While Radiohead+Ani is getting

  16.000769 = idf(docFreq=2)

If I can alter this I think I'm sorted. What are idf and docFreq?

<str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
  30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
    0.046359334 = queryWeight(text:radiohead+ani), product of:
      16.000769 = idf(docFreq=2)
      0.0028973192 = queryNorm
    655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
      1.0 = tf(termFreq(text:radiohead+ani)=1)
      16.000769 = idf(docFreq=2)
      40960.0 = fieldNorm(field=text, doc=159496)
</str>
<str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
  29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
    0.025442434 = queryWeight(text:radiohead), product of:
      8.781371 = idf(docFreq=4096)
      0.0028973192 = queryNorm
    1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
      1.0 = tf(termFreq(text:radiohead)=1)
      8.781371 = idf(docFreq=4096)
      131072.0 = fieldNorm(field=text, doc=9799640)
</str>

Thanks a lot,
galo

galo wrote:

: I was doing a different trick, basically searching q=radioh*+radioh~, and the results are slightly better than ?*, but not great. By the way, the case sensitivity of wildcards has an effect here, of course.
:
: I'd like to have a look at that DisMax you have if you can post it, at least to compare results. The way I get to do scoring, as I say, is far from perfect.
: By the way, I'm seeing the highlighting disappear when using these wildcards, is that normal?
:
: Thanks for your help,
: galo
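On the idf/docFreq question raised with the debug output: docFreq is the number of documents containing the term, and idf (inverse document frequency) is larger the rarer the term is. The sketch below redoes the arithmetic of the two explanations in Python. The idf formula shown is the one used by Lucene's default similarity, and assuming this index uses it is an assumption; `num_docs` is a placeholder value, not taken from the thread.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as in Lucene's default similarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def explain_score(idf, query_norm, tf, field_norm):
    """Recompute queryWeight, fieldWeight and their product from the explain values."""
    query_weight = idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight, field_weight, query_weight * field_weight


QUERY_NORM = 0.0028973192

# Values copied from the two explanations in the debug output.
ani = explain_score(idf=16.000769, query_norm=QUERY_NORM, tf=1.0, field_norm=40960.0)
radiohead = explain_score(idf=8.781371, query_norm=QUERY_NORM, tf=1.0, field_norm=131072.0)

# The idf gap between docFreq=2 and docFreq=4096 does not depend on num_docs.
num_docs = 10_000_000  # placeholder, not from the thread
idf_gap = classic_idf(2, num_docs) - classic_idf(4096, num_docs)
```

Because idf enters both queryWeight and fieldWeight, it is effectively squared in the final score. That is why the rare term radiohead+ani (docFreq=2) ends up ahead of radiohead (docFreq=4096) even though the latter document has the larger fieldNorm.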
Re: Wildcards / Binary searches
: I have a local version of DisMax which parameterizes the escaping so
: certain operators can be allowed through, which I'd be happy to
: contribute to you or the codebase, but I expect SimpleRH may be a better

That sounds like it would be a really useful patch if you'd be interested in posting it to Jira.

-Hoss
Re: Wildcards / Binary searches
Hi, Hoss.

I have a number of things I'd like to post... but the generally-useful stuff is unfortunately a bit interwoven with the special-case stuff, and I need to get out of breathing-down-my-back deadline mode to find the time to separate them, clean up and comment, make test cases, etc. Hopefully next week I can post at least a modest contribution including this.

- J.J.

At 11:31 AM -0700 6/6/07, Chris Hostetter wrote:

: : I have a local version of DisMax which parameterizes the escaping so
: : certain operators can be allowed through, which I'd be happy to
: : contribute to you or the codebase, but I expect SimpleRH may be a better
:
: That sounds like it would be a really useful patch if you'd be interested in posting it to Jira.
:
: -Hoss