Re: Wildcards / Binary searches

2007-06-10 Thread Frédéric Glorieux

Chris Hostetter wrote:

: It could be a useful request handler ? Giving a field, with a

perhaps, but as I said -- I think it requires more than just a special
request handler; you want a special index as well.

FYI: there is an ongoing thread on this general topic on the java-user
list. I didn't have the time/energy to follow it, but the concepts
discussed there might prove interesting for you (most of the people
involved have spent a lot more time on problems like this than I have)...

http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html


Interesting. Here is my idea: WildcardTermEnum (NOT a query)

http://www.nabble.com/Re%3A-How-to-implement-AJAX-search%7ELucene-Search-part--p11027221.html


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Wildcards / Binary searches

2007-06-08 Thread Chris Hostetter

: Do you mean something like below ?
: <field name="autocomplete">w wo wor word</field>

yeah, but there are some Tokenizers that make this trivial
(EdgeNGramTokenizer i think is the name)
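
To make the "all of the prefixes as separate tokens" idea concrete, here is a minimal Python sketch of what an edge n-gram step would emit for one word. It is not the EdgeNGramTokenizer itself; the function name `edge_ngrams` and its `min_len`/`max_len` parameters are made up for illustration.

```python
def edge_ngrams(word, min_len=1, max_len=10):
    """Return the leading prefixes of `word`, shortest first.

    Hypothetical helper: it only mimics the kind of tokens an
    edge n-gram tokenizer would put into an autocomplete field.
    """
    upper = min(len(word), max_len)
    return [word[:n] for n in range(min_len, upper + 1)]


# The tokens that would be indexed for the word "word".
print(" ".join(edge_ngrams("word")))
```

At query time the partial input is then looked up as an ordinary term against that field, so no wildcard expansion is needed.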


: project, definitely not a good practice for portability of indexes. A
: duplicate field with an analyser to produce a sortable ASCII version
: would be better.

exactly ... I think conceptually the methodology for solving the problem
is very similar to the way the SpellChecker contrib works: use a very
custom index designed for the application (not just look at the terms in
the main corpus) and custom logic for using that index.



-Hoss



Re: Wildcards / Binary searches

2007-06-07 Thread Frédéric Glorieux




Sorry to jump in on a side note of the thread, but the topic matches
one of my current needs.



Side Note: It's my opinion that 'type ahead' or 'auto complete' style
functionality is best addressed by customized logic (most likely using
specially built fields containing all of the prefixes of the key words up
to N characters as separate tokens).


Do you mean something like below ?
<field name="autocomplete">w wo wor word</field>


simple uses of PrefixQueries are
only going to get you so far, particularly under heavy load or in an index
with a large number of unique terms.


For a bibliographic app built on Lucene, I implemented suggestions on
different fields (especially subject terms, like topic or place) to
populate a form with values already in use. I used the Lucene IndexReader
to get, very quickly, a list of terms in sort order without duplicate values.


http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)
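
The principle behind this kind of term enumeration can be sketched in a few lines of Python: seek to the first term at or after the typed prefix in a sorted, duplicate-free term list, then walk forward while the prefix still matches. This is only a stand-in for the idea, not the Lucene API; `suggest` and its `limit` parameter are hypothetical names.

```python
from bisect import bisect_left


def suggest(sorted_terms, prefix, limit=10):
    """Seek to the first term >= prefix, then collect terms sharing the prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Terms are sorted, so the first non-matching term ends the run.
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches


# A set removes duplicates, sorted() gives the enumeration order.
terms = sorted({"paris", "parma", "pau", "prague", "paris"})
print(suggest(terms, "pa"))
```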

This approach has a serious drawback: the enumeration is ordered by
Term.compareTo(), so the sort order is natively ASCII and uppercase comes
before lowercase. I had to patch Lucene's Term.compareTo() for this
project, which is definitely not good practice for index portability. A
duplicate field with an analyzer that produces a sortable ASCII version
would be better.
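
As a rough illustration of what such a "sortable ASCII version" could look like, the Python sketch below folds accents and case into a sort key. The function name `sort_key` is invented for this example, and the folding rule (NFKD decomposition, drop non-ASCII, lowercase) is an assumption, not what any particular Lucene analyzer does.

```python
import unicodedata


def sort_key(term):
    """Fold a term to lowercase ASCII so that sorting ignores case and accents."""
    decomposed = unicodedata.normalize("NFKD", term)
    # Dropping non-ASCII bytes removes the combining accents left by NFKD.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


terms = ["Zola", "abbaye", "École", "Paris", "eau"]
print(sorted(terms))                # raw code-point order, uppercase first
print(sorted(terms, key=sort_key))  # order of the folded duplicate field
```

Indexing the folded value in a second field keeps the original term for display while the enumeration order comes from the folded one.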


Opinions of the list on this topic would be welcome.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


RE: Wildcards / Binary searches

2007-06-06 Thread Xuesong Luo
I have a similar question about dismax, here is what Chris said:

the dismax handler uses a much more simplified query syntax than the
standard request handler.  Only +, -, and " are special characters so
wildcards are not supported.


HTH

-----Original Message-----
From: galo [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 06, 2007 8:41 AM
To: solr-user@lucene.apache.org
Subject: Wildcards / Binary searches

Hi,

Three questions:

1. I want to use solr for some sort of live search, querying with
incomplete terms + wildcard and getting any similar results. Radioh*
would return anything containing that string. The DisMax req. handler
doesn't accept wildcards in the q param so i'm trying the simple one and
still have problems as all my results are coming back with score = 1 and
I need them sorted by relevance.. Is there a way of doing this? Why
doesn't * work in dismax (nor ~ by the way)??

2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int 
values for each document. Everything works ok as integers but i'd like 
to have some sort of fuzzy searches based on the bit representation of 
the numbers. Essentially, this number:

1001001010100

would be compared to these two

1011001010100
1001001010111

And the first would get a bigger score than the second, as it has only 1
flipped bit while the second has 2.

Is it possible to implement this in solr?


Cheers,
galo




Re: Wildcards / Binary searches

2007-06-06 Thread J.J. Larrea
At 4:40 PM +0100 6/6/07, galo wrote:
1. I want to use solr for some sort of live search, querying with incomplete 
terms + wildcard and getting any similar results. Radioh* would return 
anything containing that string. The DisMax req. handler doesn't accept 
wildcards in the q param so i'm trying the simple one and still have problems 
as all my results are coming back with score = 1 and I need them sorted by 
relevance.. Is there a way of doing this? Why doesn't * work in dismax (nor ~ 
by the way)??

DisMax was written with the intent of supporting a simple search box in which 
one could type or paste some text, e.g. a title like

Santa Clause: Is he Real (and if so, what is real)?

and get meaningful results.  To do that it pre-processes the query string by 
removing unbalanced quotation marks and escaping characters that would 
otherwise be treated by the query parser as operators:

\ ! ( ) : ^ [ ] { } ~ * ?
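
For illustration only, the pre-processing described above can be approximated in Python as follows. This is not the Solr code: the function name is hypothetical, and dropping every double quote when the count is odd is an assumption about how "unbalanced" quotes get removed.

```python
# The operator characters listed above.
OPERATORS = set('\\!():^[]{}~*?')


def escape_for_simple_box(user_input):
    """Neutralize query operators so free text can be parsed safely."""
    # Assumption: an odd number of double quotes means they are unbalanced,
    # and the simplest repair is to drop all of them.
    if user_input.count('"') % 2 == 1:
        user_input = user_input.replace('"', '')
    # Prefix each operator character with a backslash.
    return ''.join('\\' + ch if ch in OPERATORS else ch for ch in user_input)


print(escape_for_simple_box('Santa Clause: Is he Real (and if so, what is real)?'))
print(escape_for_simple_box('Radioh*'))
```

The second call shows why `Radioh*` finds nothing useful through DisMax: the `*` reaches the parser as a literal character.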

I have a local version of DisMax which parameterizes the escaping so certain 
operators can be allowed through, which I'd be happy to contribute to you or 
the codebase, but I expect SimpleRH may be a better tool for your application 
than DisMaxRH, as long as you get it to score as you wish.

Both Standard and DisMax request handlers use SolrQueryParser, an extension of 
the Lucene query parser which introduces a small number of changes, one of 
which is that prefix queries e.g. Radioh* are evaluated with 
ConstantScorePrefixQuery rather than the standard PrefixQuery.

In issue SOLR-218 developers have been discussing per-field control of query 
parser options (some of it Solr's, some of it Lucene's).  When that is 
implemented there should additionally be a property useConstantScorePrefixQuery 
analogous to the unfortunately-named QueryParser useOldRangeQuery, but handled 
by SolrQueryParser (until CSPQs are implemented as an option in Lucene QP).

Until that time, well, Chris H. posted a clever and rather timely workaround on 
the solr-dev list:

one workaround people may want to consider ... is to force the use of a
WildcardQuery in what would otherwise be interpreted as a PrefixQuery by
putting a ? before the *

i.e.: auto?* instead of auto*

(yes, this does require that at least one character follow the prefix)
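
The parenthetical is easy to check with any glob-style matcher, since `?` stands for exactly one character and `*` for zero or more. The sketch below uses Python's `fnmatch` purely as a stand-in for the wildcard semantics; it says nothing about how Lucene scores or rewrites the query.

```python
from fnmatch import fnmatchcase

# Compare the plain prefix pattern with the ?* workaround pattern.
for term in ["auto", "autos", "automatic"]:
    print(term, fnmatchcase(term, "auto*"), fnmatchcase(term, "auto?*"))
```

The bare term `auto` matches `auto*` but not `auto?*`, which is the cost of the workaround.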

Perhaps that would help in your case?

- J.J.



Re: Wildcards / Binary searches

2007-06-06 Thread galo

Yeah, I thought of that solution, but this is a 20G index with each
document having around 300 of those numbers, so I was a bit worried about
the performance.. I'll try anyway, thanks!

On 06/06/07, Yonik Seeley [EMAIL PROTECTED] wrote:

On 6/6/07, galo [EMAIL PROTECTED] wrote:
  3. I'm trying to implement another index where I store a number of int
  values for each document. Everything works ok as integers but i'd like
  to have some sort of fuzzy searches based on the bit representation of
  the numbers. Essentially, this number:

  1001001010100

  would be compared to these two

  1011001010100
  1001001010111

  And the first would get a bigger score than the second, as it has only 1
  flipped bit while the second has 2.

You could store the numbers as a string field with the binary
representation, then try a fuzzy search.

  myfield:1001001010100~
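
For reference, the "flipped bits" measure from the question is the Hamming distance, which can be computed directly with XOR. The Python sketch below uses the three numbers from the question; the function names and the similarity formula (one minus the fraction of flipped bits) are illustrative assumptions. Note that a fuzzy query works on edit distance over the string, which also allows insertions and deletions, so its ranking may not always agree with this measure.

```python
def flipped_bits(a, b):
    """Hamming distance: number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")


def bit_similarity(a, b, width):
    """Hypothetical score in [0, 1]: 1.0 means identical over `width` bits."""
    return 1.0 - flipped_bits(a, b) / width


query = int("1001001010100", 2)
first = int("1011001010100", 2)
second = int("1001001010111", 2)

print(flipped_bits(query, first), flipped_bits(query, second))
print(bit_similarity(query, first, 13), bit_similarity(query, second, 13))
```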

-Yonik






Re: Wildcards / Binary searches

2007-06-06 Thread galo

OK, further to my email below, I've been testing with q=radioh?*

Basically, the problem is that when searching artists, even though
Radiohead has a big boost, entries with a smaller boost such as
Radiohead+Ani Di Franco or Radiohead+Michael Stipe are returned before it.


The debug output is below, but basically, for Radiohead and one of the 
others we get this:


radiohead+ani - 655391.5  * 0.046359334
radiohead - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers,
the cause seems to be in this line:


8.781371 = idf(docFreq=4096)

While Radiohead+Ani is getting

16.000769 = idf(docFreq=2)

If I can alter this, I think it's sorted.. What are idf and docFreq?


<str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
  30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
    0.046359334 = queryWeight(text:radiohead+ani), product of:
      16.000769 = idf(docFreq=2)
      0.0028973192 = queryNorm
    655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
      1.0 = tf(termFreq(text:radiohead+ani)=1)
      16.000769 = idf(docFreq=2)
      40960.0 = fieldNorm(field=text, doc=159496)
</str>
<str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
  29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
    0.025442434 = queryWeight(text:radiohead), product of:
      8.781371 = idf(docFreq=4096)
      0.0028973192 = queryNorm
    1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
      1.0 = tf(termFreq(text:radiohead)=1)
      8.781371 = idf(docFreq=4096)
      131072.0 = fieldNorm(field=text, doc=9799640)
</str>
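
The arithmetic in this debug output can be replayed with a short Python sketch. docFreq is the number of documents containing the term, and idf (inverse document frequency) grows as docFreq shrinks. The `idf` function below uses the formula I believe Lucene's default similarity applied at the time, 1 + ln(numDocs / (docFreq + 1)); treat that, the function names, and the index size in the last line as assumptions. The score function only multiplies the factors exactly as the explain tree lists them.

```python
import math


def idf(doc_freq, num_docs):
    """Assumed default formula: rarer terms (low doc_freq) get a higher idf."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def explain_score(idf_value, query_norm, tf, field_norm):
    """Multiply the factors the way the explain tree nests them."""
    query_weight = idf_value * query_norm       # queryWeight line
    field_weight = tf * idf_value * field_norm  # fieldWeight line
    return query_weight * field_weight          # idf is counted twice


QUERY_NORM = 0.0028973192
ani = explain_score(16.000769, QUERY_NORM, 1.0, 40960.0)
solo = explain_score(8.781371, QUERY_NORM, 1.0, 131072.0)
print(round(ani, 1), round(solo, 1))

# With an assumed index size, a term in 2 docs outranks one in 4096 docs.
print(idf(2, 10_000_000) > idf(4096, 10_000_000))
```

Because idf enters the product twice, the term found in only 2 documents overtakes the one found in 4096 even though the latter has the larger fieldNorm (where the boost ends up).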

Thanks a lot,

galo


galo wrote:
I was doing a different trick, basically searching q=radioh*+radioh~,
and the results are slightly better than with ?*, but not great. By the way,
the case sensitivity of wildcards matters here, of course.


I'd like to have a look at that DisMax version you have if you can post it, at
least to compare results. The way I do scoring now is, as I said, far
from perfect.


By the way, I'm seeing that highlighting disappears when using these
wildcards. Is that normal?


Thanks for your help,

galo



Re: Wildcards / Binary searches

2007-06-06 Thread Chris Hostetter

: I have a local version of DisMax which parameterizes the escaping so
: certain operators can be allowed through, which I'd be happy to
: contribute to you or the codebase, but I expect SimpleRH may be a better

That sounds like it would be a really useful patch if you'd be interested
in posting it to Jira.



-Hoss



Re: Wildcards / Binary searches

2007-06-06 Thread J.J. Larrea
Hi, Hoss.

I have a number of things I'd like to post... but the generally-useful stuff is 
unfortunately a bit interwoven with the special-case stuff, and I need to get 
out of breathing-down-my-back deadline mode to find the time to separate them, 
clean up and comment, make test cases, etc.  Hopefully next week I can post at 
least a modest contribution including this.

- J.J.

At 11:31 AM -0700 6/6/07, Chris Hostetter wrote:
: I have a local version of DisMax which parameterizes the escaping so
: certain operators can be allowed through, which I'd be happy to
: contribute to you or the codebase, but I expect SimpleRH may be a better

That sounds like it would be a really useful patch if you'd be interested
in posting it to Jira.



-Hoss