Wildcards / Binary searches

2007-06-06 Thread galo

Hi,

Three questions:

1. I want to use Solr for some sort of live search, querying with 
incomplete terms plus a wildcard and getting any similar results. Radioh* 
would return anything containing that string. The DisMax request handler 
doesn't accept wildcards in the q param, so I'm trying the simple one and 
still have problems: all my results come back with score = 1 and 
I need them sorted by relevance. Is there a way of doing this? Why 
doesn't * work in DisMax (nor ~, by the way)?


2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int 
values for each document. Everything works OK as integers, but I'd like 
to have some sort of fuzzy search based on the bit representation of 
the numbers. Essentially, this number:


1001001010100

would be compared to these two

1011001010100
1001001010111

And the first would get a bigger score than the second, as it has only 1 
flipped bit while the second has 2.


Is it possible to implement this in solr?


Cheers,
galo



Re: Wildcards / Binary searches

2007-06-06 Thread galo

Yeah, I thought of that solution, but this is a 20G index with each
document having around 300 of those numbers, so I was a bit worried about
the performance. I'll try anyway, thanks!

On 06/06/07, Yonik Seeley wrote:

On 6/6/07, galo wrote:
  3. I'm trying to implement another index where I store a number of int
  values for each document. Everything works ok as integers but i'd like
  to have some sort of fuzzy searches based on the bit representation of
  the numbers. Essentially, this number:

  1001001010100

  would be compared to these two

  1011001010100
  1001001010111

  And the first would get a bigger score than the second, as it has only 1
  flipped bit while the second has 2.

You could store the numbers as a string field with the binary representation,
then try a fuzzy search.

  myfield:1001001010100~

-Yonik
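
A rough sketch of the idea in Python (not Solr code; the 13-bit width and the function names are assumptions made for illustration, and only non-negative values are considered). It pads each integer to a fixed-width binary string, as the string field would need, and counts the flipped bits that the ranking is meant to reflect. Note that the edit distance behind a fuzzy query also allows insertions and deletions, so it can be smaller than the number of flipped bits; the fuzzy ranking would only approximate a bit-flip count.

```python
WIDTH = 13  # assumed fixed width; every value must be padded to the same length

def to_bits(n, width=WIDTH):
    """Zero-padded binary string, suitable for indexing as a string field."""
    return format(n, "0{}b".format(width))

def flipped_bits(a, b):
    """Number of bit positions in which a and b differ (Hamming distance)."""
    return bin(a ^ b).count("1")

query = 0b1001001010100
candidates = [0b1001001010111, 0b1011001010100]

# Fewer flipped bits should rank first.
ranked = sorted(candidates, key=lambda c: flipped_bits(query, c))
for c in ranked:
    print(to_bits(c), flipped_bits(query, c))
```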






Re: Wildcards / Binary searches

2007-06-06 Thread galo

Ok, further to my email below, I've been testing with q=radioh?*

Basically the problem is that when searching artists, even though Radiohead has a 
big boost, documents with less boost such as Radiohead+Ani Di Franco or 
Radiohead+Michael Stipe are returned before it.


The debug output is below, but basically, for Radiohead and one of the 
others we get this:


radiohead+ani - 655391.5  * 0.046359334
radiohead - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers, 
the cause seems to be in this line:


8.781371 = idf(docFreq=4096)

While Radiohead+Ani is getting

16.000769 = idf(docFreq=2)

If I can alter this I think I'm sorted. What are idf and docFreq?


<str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
  30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
    0.046359334 = queryWeight(text:radiohead+ani), product of:
      16.000769 = idf(docFreq=2)
      0.0028973192 = queryNorm
    655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
      1.0 = tf(termFreq(text:radiohead+ani)=1)
      16.000769 = idf(docFreq=2)
      40960.0 = fieldNorm(field=text, doc=159496)
</str>
<str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
  29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
    0.025442434 = queryWeight(text:radiohead), product of:
      8.781371 = idf(docFreq=4096)
      0.0028973192 = queryNorm
    1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
      1.0 = tf(termFreq(text:radiohead)=1)
      8.781371 = idf(docFreq=4096)
      131072.0 = fieldNorm(field=text, doc=9799640)
</str>
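
For reference, docFreq is the number of documents that contain the term, and idf (inverse document frequency) grows as docFreq shrinks, so a term found in only 2 documents weighs far more than one found in 4096. The Python sketch below assumes Lucene's classic default similarity, where idf = 1 + ln(numDocs / (docFreq + 1)), and recomputes the two scores from the explain output above. The index size of 10,000,000 is a made-up value used only to show that the gap between the two idf values does not depend on it.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene idf: rarer terms (low docFreq) get a larger value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_score(idf_value, query_norm, tf, field_norm):
    """Single-term score, following the breakdown in the explain output."""
    query_weight = idf_value * query_norm        # queryWeight
    field_weight = tf * idf_value * field_norm   # fieldWeight
    return query_weight * field_weight           # idf enters twice

# Values copied from the debug output above.
QUERY_NORM = 0.0028973192
score_ani = term_score(16.000769, QUERY_NORM, 1.0, 40960.0)
score_radiohead = term_score(8.781371, QUERY_NORM, 1.0, 131072.0)
print(score_ani, score_radiohead)

# Hypothetical index size; the difference is ln(4097 / 3) whatever the size.
idf_gap = idf(2, 10_000_000) - idf(4096, 10_000_000)
print(idf_gap)
```

Because idf enters the product twice, the rare term radiohead+ani outscores radiohead even though its fieldNorm (which carries the index-time boost) is smaller.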

Thanks a lot,

galo


galo wrote:
I was doing a different trick, basically searching q=radioh*+radioh~, 
and the results are slightly better than with ?*, but not great. By the way, 
wildcard queries are case sensitive, which of course affects the results here.


I'd like to have a look at that DisMax you have if you can post it, at 
least to compare results. The way I do scoring is, as I say, far 
from perfect.


By the way, I'm seeing that highlighting disappears when using these 
wildcards. Is that normal?


Thanks for your help,

galo


At 4:40 PM +0100 6/6/07, galo wrote:
 1. I want to use solr for some sort of live search, querying with 
incomplete terms + wildcard and getting any similar results. Radioh* 
would return anything containing that string. The DisMax request handler 
doesn't accept wildcards in the q param so i'm trying the simple one 
and still have problems as all my results are coming back with score = 
1 and I need them sorted by relevance.. Is there a way of doing this? 
Why doesn't * work in dismax (nor ~ by the way)??


DisMax was written with the intent of supporting a simple search box 
in which one could type or paste some text, e.g. a title like


Santa Clause: Is he Real (and if so, what is real)?

and get meaningful results.  To do that it pre-processes the query 
string by removing unbalanced quotation marks and escaping characters 
that would otherwise be treated by the query parser as operators:


\ ! ( ) : ^ [ ] { } ~ * ?

I have a local version of DisMax which parameterizes the escaping so 
certain operators can be allowed through, which I'd be happy to 
contribute to you or the codebase, but I expect SimpleRH may be a 
better tool for your application than DisMaxRH, as long as you get it 
to score as you wish.
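
The pre-processing described above can be sketched in Python as follows. This is an illustration only, not the DisMax source: the function names, the policy of dropping every quote when the count is odd, and the allowed parameter (standing in for the parameterized escaping mentioned above) are all assumptions.

```python
# The operator characters listed above.
OPERATOR_CHARS = set('\\!():^[]{}~*?')

def strip_unbalanced_quotes(text):
    """Drop all double quotes when their count is odd (an assumed policy)."""
    return text.replace('"', '') if text.count('"') % 2 else text

def partial_escape(text, allowed=''):
    """Backslash-escape operator characters, except those in `allowed`."""
    keep = set(allowed)
    out = []
    for ch in text:
        if ch in OPERATOR_CHARS and ch not in keep:
            out.append('\\')
        out.append(ch)
    return ''.join(out)

title = 'Santa Clause: Is he Real (and if so, what is real)?'
print(partial_escape(strip_unbalanced_quotes(title)))
print(partial_escape('Radioh*'))                 # wildcard is neutralised
print(partial_escape('Radioh*', allowed='*~'))   # wildcard passes through
```

With the default settings the * in Radioh* is escaped and therefore searched as a literal character, which is why wildcards appear not to work in DisMax.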


Both Standard and DisMax request handlers use SolrQueryParser, an 
extension of the Lucene query parser which introduces a small number 
of changes, one of which is that prefix queries e.g. Radioh* are 
evaluated with ConstantScorePrefixQuery rather than the standard 
PrefixQuery.


In issue SOLR-218 developers have been discussing per-field control of 
query parser options (some of it Solr's, some of it Lucene's).  When 
that is implemented there should additionally be a property 
useConstantScorePrefixQuery analogous to the unfortunately-named 
QueryParser useOldRangeQuery, but handled by SolrQueryParser (until 
CSPQs are implemented as an option in Lucene QP).


Until that time, well, Chris H. posted a clever and rather timely 
workaround on the solr-dev list:


 one workaround people may want to consider ... is to force the use 
of a WildcardQuery in what would otherwise be interpreted as a 
PrefixQuery by putting a ? before the *

 
 ie: auto?* instead of auto*
 
 (yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?
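
A small Python sketch of how a client might apply that workaround before sending the query. The helper names are hypothetical, and the lowercasing step assumes that the field is lowercased at index time and that wildcard terms are matched case-sensitively, as noted earlier in the thread.

```python
def force_wildcard(term):
    """Turn a prefix query such as 'auto*' into the wildcard form 'auto?*'."""
    if len(term) > 1 and term.endswith('*') and not term.endswith('?*'):
        return term[:-1] + '?*'
    return term

def live_search_query(user_input):
    """Build a wildcard query from a partial term typed by the user."""
    # Assumption: indexed terms are lowercase, so the partial term must be too.
    return force_wildcard(user_input.strip().lower() + '*')

print(force_wildcard('auto*'))
print(live_search_query('Radioh'))
```

As the quoted note says, the ? requires at least one more character after the prefix, so a document containing exactly the typed prefix would not match.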

- J.J.










wrong path in snappuller

2007-04-24 Thread galo
I have downloaded all the scripts from the current version in the trunk 
and I'm finding the same issues as 
https://issues.apache.org/jira/browse/SOLR-188 in the snappuller, I 
haven't looked in other scripts yet.


rsync -Wa${verbose}${compress} --delete ${sizeonly} \
${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip


That command fails in non-default installations because of the hard-coded /solr/.

Is this known or should I log it in JIRA?

thanks,

galo




Re: New docs need server restart after synchronization

2007-04-17 Thread galo
I was expecting to see the commit error, which is what told me 
earlier that the configuration was not correct, but the commit was made and 
logged correctly... on the master. I had copied scripts.conf from the 
master to all the slaves with solr_hostname=master and didn't replace it 
with slave1, slave2 etc., so the commit was successful, but not on the 
slave. Dumb. Dumb.
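
A quick sanity check along these lines could have caught the mistake. The Python sketch below is hypothetical: it assumes scripts.conf consists of plain key=value lines and only looks at solr_hostname, the setting named above.

```python
def parse_conf(text):
    """Parse simple key=value lines, ignoring blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#') and '=' in line:
            key, value = line.split('=', 1)
            conf[key.strip()] = value.strip()
    return conf

def commit_targets_this_host(conf, local_names):
    """True when solr_hostname names the machine the scripts run on."""
    return conf.get('solr_hostname') in local_names

# The file as copied from the master, checked on a slave called slave1.
sample = "# copied from the master\nsolr_hostname=master\n"
local = {'localhost', 'slave1'}  # on a real host, add socket.gethostname()
print(commit_targets_this_host(parse_conf(sample), local))
```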


While we are at it, what's the best setup for multiple indexes on 
the same server? At the moment I have a Tomcat server running on each machine with a 
separate webapp for each index, for a few indexes (5 for now, but there will be many 
more).


Something like
 master: http://master:8080/index1, http://master:8080/index2
 slave1: http://slave1:8080/index1, http://slave1:8080/index2
 slave2: http://slave2:8080/index1, http://slave2:8080/index2
 slave3: http://slave3:8080/index1, http://slave3:8080/index2
 ...

On each server I have a main solr directory and then one for each index

solr
solr-index1
solr-index2

The main directory (solr) holds all the shared folders, so inside 
solr-index1 and solr-index2, bin, etc, lib, ext, webapp etc. are links to 
solr/bin, solr/etc, solr/lib, etc. Obviously, conf, data and logs are 
not shared. This is reasonably simple to set up and install on other 
servers. My only real complaint is that I'd like to keep a single conf 
folder shared across master and slaves for each index (schema.xml, 
solrconfig.xml etc. should be the same, and scripts.conf can be shared as 
long as you use solr_hostname=localhost and the same port, which is my 
case). Not a tragedy anyway.
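
The layout described above could be created with a script like the following Python sketch. It works in a temporary directory, the function name is made up, and the lists of shared and private folders are taken from the description; adjust them to the real installation.

```python
import os
import tempfile

SHARED = ['bin', 'etc', 'lib', 'ext', 'webapp']   # linked to the main solr dir
PRIVATE = ['conf', 'data', 'logs']                # one real copy per index

def create_index_dir(root, index_name):
    """Create root/solr-<index_name> with links into root/solr."""
    main = os.path.join(root, 'solr')
    index_dir = os.path.join(root, 'solr-' + index_name)
    os.makedirs(index_dir, exist_ok=True)
    for name in SHARED:
        os.makedirs(os.path.join(main, name), exist_ok=True)
        link = os.path.join(index_dir, name)
        if not os.path.lexists(link):
            os.symlink(os.path.join(main, name), link)
    for name in PRIVATE:
        os.makedirs(os.path.join(index_dir, name), exist_ok=True)
    return index_dir

root = tempfile.mkdtemp()   # stand-in for the real installation directory
for idx in ('index1', 'index2'):
    print(create_index_dir(root, idx))
```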


How are you dealing with these situations, are there better ways than this?

Cheers,
galo


Chris Hostetter wrote:

: problems for a few weeks, snappuller and snapinstaller run every hour
: normally, install the new index without errors etc.
:
: In the last couple of days it seems like I need to restart tomcat for
: the new documents to appear in the slave indexes. New docs are updated,
: commited and appear if I do a search on the master, but if after the
: synchronization I search in the slaves they don't appear. If I restart
: tomcat they do.

snappuller notifies the slaves that they should reopen the index by
executing the bin/commit script ... if that fails it should log a message
... but maybe bin/commit is logging an error in its log and exiting with
a successful return code?

do you see the commit log messages from Solr?  do you see the /update
requests in the slaves tomcat access logs?





-Hoss





--
Galo Navarro, Developer

[EMAIL PROTECTED]
t. +44 (0)20 7780 7080

Last.fm | http://www.last.fm
Karen House 1-11 Baches Street
London N1 6DL

http://www.last.fm/user/galeote


New docs need server restart after synchronization

2007-04-16 Thread galo

Hi there,

I've been running an index on 3 nodes (master + 2 slaves) without 
problems for a few weeks, snappuller and snapinstaller run every hour 
normally, install the new index without errors etc.


In the last couple of days it seems like I need to restart Tomcat for 
the new documents to appear in the slave indexes. New docs are updated, 
committed and appear if I do a search on the master, but if after the 
synchronization I search in the slaves they don't appear. If I restart 
Tomcat they do.


I remember this happening while I was setting up the index and seeing an 
error somewhere (something like it not being able to open a new searcher, 
if I remember well), but I'm pretty sure the configuration is OK. It's 
been working normally for weeks and I haven't made any changes since then.


Any ideas where I should look? I can't see any error messages in the 
Tomcat logs or the scripts' logs!


thanks,
galo


Re: New docs need server restart after synchronization

2007-04-16 Thread galo

Yep, all normal..

galo

Bill Au wrote:

Did you check for error messages in the snappuller and snapinstaller
log files under solr/logs?  Distribution related errors will not show
up in the tomcat logs.

Bill









problems finding negative values

2007-04-04 Thread galo

Hi,

I have an index consisting of the following fields:

<field name="id" type="long" indexed="true" stored="true"/>
<field name="length" type="integer" indexed="true" stored="true"/>
<field name="key" type="integer" indexed="true" stored="true" multiValued="true"/>


Each doc has a few key values, some of which are negative.

Ok, I know there's a document that has both 826606443 and -1861807411

If I search with

http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=-1861807411&fl=id,length,key

I get no results, but if I do

http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=826606443&fl=id,length,key

I get the document as expected.

Obviously the key field is configured as a search field, indexed, etc., 
but somehow Solr doesn't like negatives. I'm assuming this might have 
something to do with analysers but can't tell how to fix it. Any ideas?
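
One likely explanation, offered as an assumption since the thread does not confirm it: in the Lucene query syntax a leading - is the prohibit operator, so q=-1861807411 asks for documents that do not contain 1861807411 rather than for the negative number. Escaping the minus sign with a backslash makes the parser read it as part of the term. The Python sketch below builds such a URL; the helper names are made up, and the host and parameters are taken from the example above.

```python
from urllib.parse import urlencode

def escape_leading_minus(value):
    """Escape a leading '-' so the query parser treats it as part of the term."""
    text = str(value)
    return '\\' + text if text.startswith('-') else text

def select_url(base, field, value):
    """Build a select URL that searches `field` for `value`."""
    params = {
        'q': '{}:{}'.format(field, escape_leading_minus(value)),
        'fl': 'id,length,key',
        'start': 0,
        'rows': 50,
        'indent': 'on',
    }
    return base + '?' + urlencode(params)

print(select_url('http://localhost:8080/solr/select/', 'key', -1861807411))
```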


Thanks

galo


failing post-optimize command execution

2007-03-28 Thread galo

Hi,

I've configured my solrconfig.xml to run snapshooter after an 
optimize is made, but I keep getting the following exception in the 
Tomcat logs:


SEVERE: java.io.IOException: Cannot run program snapshooter (in 
directory /home/solr/solr/bin): java.io.IOException: error=2, No such 
file or directory


I'm certain the path and filename are correct. Has anybody else had 
problems with this?
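
One plausible cause, stated as an assumption because the thread gives no answer: when a program is launched by its bare name, the operating system looks it up through PATH, and the working directory is normally not searched, so snapshooter is not found even though it sits in /home/solr/solr/bin. Giving the full path to the executable sidesteps the lookup. The Python sketch below shows the path construction only; the function name is hypothetical.

```python
import posixpath

def qualify_exe(exe, workdir):
    """Return a path to `exe` that does not depend on PATH lookup.

    Absolute paths are returned unchanged; anything else is taken to be
    relative to the working directory.
    """
    return exe if posixpath.isabs(exe) else posixpath.join(workdir, exe)

# Directory taken from the error message above.
print(qualify_exe('snapshooter', '/home/solr/solr/bin'))
```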


Cheers,

galo


Re: Solr on Tomcat 6.0.10?

2007-03-08 Thread galo

I'm using 6.0.9 and no issues (fingers crossed)

Walter Underwood wrote:

Is anyone running Solr on Tomcat 6.0.10? Any issues?
I searched the archives and didn't see anything.

wunder
  






Multiple instances, wiki out of date?

2007-02-26 Thread galo

Hi there,

I've been following the instruction from 
http://wiki.apache.org/solr/SolrJetty?highlight=%28Multiple%29%7C%28Solr%29%7C%28Webapps%29solr 

to get a few indexes running under the same instance of jetty 6.1.2. If 
I use the webapp descriptors as specified in the wiki (with correct 
paths, I'm just pasting the example here)..


<Call name="addWebApplication">
  <Arg>/solr1/*</Arg>
  <Arg>/your/path/to/the/solr.war</Arg>
  <Set name="extractWAR">true</Set>
  <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>

  <Call name="addEnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="String">/your/path/to/your/solr/home/dir</Arg>
  </Call>
</Call>

<Call name="addWebApplication">
  <Arg>/solr2/*</Arg>
  <Arg>/your/path/to/the/solr.war</Arg>
  <Set name="extractWAR">true</Set>
  <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>

  <Call name="addEnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="String">/your/path/to/your/alternate/solr/home/dir</Arg>
  </Call>
</Call>

Jetty complains that:

2007-02-26 18:36:04.874::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2007-02-26 18:36:05.066::WARN:  Config error at <Call name="addWebApplication"><Arg>/solr1/*</Arg><Arg>/your/path/to/the/solr.war</Arg><Set name="extractWAR">true</Set><Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set><Call name="addEnvEntry"><Arg>solr/home</Arg><Arg type="String">/your/path/to/your/solr/home/dir</Arg></Call></Call>
2007-02-26 18:36:05.066::WARN:  EXCEPTION
java.lang.IllegalStateException: No Method: <Call name="addWebApplication"><Arg>/solr1/*</Arg><Arg>/your/path/to/the/solr.war</Arg><Set name="extractWAR">true</Set><Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set><Call name="addEnvEntry"><Arg>solr/home</Arg><Arg type="String">/your/path/to/your/solr/home/dir</Arg></Call></Call> on class org.mortbay.jetty.Server
   at org.mortbay.xml.XmlConfiguration.call(XmlConfiguration.java:548)
   at org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:241)
   at org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:203)
   at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:919)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:585)
   at org.mortbay.start.Main.invokeMain(Main.java:183)
   at org.mortbay.start.Main.start(Main.java:497)
   at org.mortbay.start.Main.main(Main.java:115)
2007-02-26 18:36:05.068::INFO:  Shutdown hook executing
2007-02-26 18:36:05.068::INFO:  Shutdown hook complete

I've been looking at the Jetty API and it looks like those methods are 
deprecated in the latest versions of Jetty. Anyway, I can get several 
instances to run together using the descriptor shown below and several 
war files


<Call name="addLifeCycle">
  <Arg>
    <New class="org.mortbay.jetty.deployer.WebAppDeployer">
      <Set name="contexts"><Ref id="Contexts"/></Set>
      <Set name="webAppDir"><SystemProperty name="jetty.home" default="."/>/webapps-plus</Set>
      <Set name="parentLoaderPriority">false</Set>
      <Set name="extract">true</Set>
      <Set name="allowDuplicates">false</Set>
      <Set name="defaultsDescriptor"><SystemProperty name="jetty.home" default="."/>/etc/webdefault.xml</Set>
    </New>
  </Arg>
</Call>

This is good enough for me, but the problem then is that all of them point to the 
same data/index folder and so share the same index, whereas I need them to use 
different indexes. The question is: how can you configure solr.home 
differently for each of the Solr instances deployed in the webapps-plus 
folder?


It would be equally valid if there is a way of fixing the XML in the 
wiki so individual war files can be specified, passing a different 
solr.home to each.


thanks,

galo.