Wildcards / Binary searches
Hi,

Three questions:

1. I want to use Solr for a kind of live search: querying with incomplete terms plus a wildcard and getting any similar results, so that Radioh* would return anything containing that string. The DisMax request handler doesn't accept wildcards in the q param, so I'm trying the simple one, but I still have problems: all my results come back with score = 1 and I need them sorted by relevance. Is there a way of doing this? Why doesn't * work in DisMax (nor ~, by the way)?

2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int values for each document. Everything works OK as integers, but I'd like to have some sort of fuzzy search based on the bit representation of the numbers. Essentially, this number:

1001001010100

would be compared to these two:

1011001010100
1001001010111

and the first would get a bigger score than the second, as it has only 1 flipped bit while the second has 2.

Is it possible to implement this in Solr?

Cheers,
galo
Re: Wildcards / Binary searches
Yeah, I thought of that solution, but this is a 20G index with each document having around 300 of those numbers, so I was a bit worried about the performance. I'll try anyway, thanks!

On 06/06/07, Yonik Seeley [EMAIL PROTECTED] wrote:

On 6/6/07, galo [EMAIL PROTECTED] wrote:

3. I'm trying to implement another index where I store a number of int values for each document. Everything works OK as integers, but I'd like to have some sort of fuzzy search based on the bit representation of the numbers. Essentially, this number:

1001001010100

would be compared to these two:

1011001010100
1001001010111

and the first would get a bigger score than the second, as it has only 1 flipped bit while the second has 2.

You could store the numbers as a string field with the binary representation, then try a fuzzy search.

myfield:1001001010100~

-Yonik
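
For reference, the ranking asked for in question 3 is a Hamming distance: the number of bit positions in which two integers differ. The sketch below is only an illustration of that measure outside Solr, not something Solr provides. The function names and the fixed width of 13 bits are assumptions made for the example. A fuzzy query on the binary string ranks by edit distance instead, which for equal-length strings is never larger than the number of flipped bits but is not guaranteed to equal it.

```python
def flipped_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ (Hamming distance)."""
    return bin(a ^ b).count("1")


def bit_similarity(a: int, b: int, width: int) -> float:
    """Map the distance to a score in [0, 1]; 1.0 means identical bit patterns.

    `width` is the number of bits being compared (an assumption of this sketch).
    """
    return 1.0 - flipped_bits(a, b) / width


# The three numbers from the question, parsed from their binary form.
query = int("1001001010100", 2)
first = int("1011001010100", 2)
second = int("1001001010111", 2)

for name, candidate in (("first", first), ("second", second)):
    print(name, flipped_bits(query, candidate), bit_similarity(query, candidate, 13))
```

With around 300 such numbers per document, a measure like this would have to be computed per candidate value, which is the performance concern raised in the reply above.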
Re: Wildcards / Binary searches
OK, further to my email below, I've been testing with q=radioh?*

Basically the problem is this: searching artists, even with Radiohead having a big boost, it returns entries with less boost first, like Radiohead+Ani Di Franco or Radiohead+Michael Stipe.

The debug output is below, but basically, for Radiohead and one of the others we get this:

radiohead+ani - 655391.5 * 0.046359334
radiohead - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers, the cause seems to be in this line:

8.781371 = idf(docFreq=4096)

while Radiohead+Ani is getting

16.000769 = idf(docFreq=2)

If I can alter this I think I'm sorted. What are idf and docFreq?

<str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
  30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
    0.046359334 = queryWeight(text:radiohead+ani), product of:
      16.000769 = idf(docFreq=2)
      0.0028973192 = queryNorm
    655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
      1.0 = tf(termFreq(text:radiohead+ani)=1)
      16.000769 = idf(docFreq=2)
      40960.0 = fieldNorm(field=text, doc=159496)
</str>
<str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
  29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
    0.025442434 = queryWeight(text:radiohead), product of:
      8.781371 = idf(docFreq=4096)
      0.0028973192 = queryNorm
    1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
      1.0 = tf(termFreq(text:radiohead)=1)
      8.781371 = idf(docFreq=4096)
      131072.0 = fieldNorm(field=text, doc=9799640)
</str>

Thanks a lot,
galo

galo wrote:

I was doing a different trick, basically searching q=radioh*+radioh~, and the results are slightly better than with ?*, but not great. By the way, the case sensitivity of wildcards matters here, of course.

I'd like to have a look at that DisMax you have if you can post it, at least to compare results. The way I get scoring to work, as I say, is far from perfect.
By the way, I'm seeing that the highlighting disappears when using these wildcards. Is that normal?

Thanks for your help,
galo

At 4:40 PM +0100 6/6/07, galo wrote:

1. I want to use Solr for a kind of live search: querying with incomplete terms plus a wildcard and getting any similar results, so that Radioh* would return anything containing that string. The DisMax request handler doesn't accept wildcards in the q param, so I'm trying the simple one, but I still have problems: all my results come back with score = 1 and I need them sorted by relevance. Is there a way of doing this? Why doesn't * work in DisMax (nor ~, by the way)?

DisMax was written with the intent of supporting a simple search box in which one could type or paste some text, e.g. a title like

Santa Clause: Is he Real (and if so, what is real)?

and get meaningful results. To do that it pre-processes the query string by removing unbalanced quotation marks and escaping characters that would otherwise be treated by the query parser as operators:

\ ! ( ) : ^ [ ] { } ~ * ?

I have a local version of DisMax which parameterizes the escaping so certain operators can be allowed through, which I'd be happy to contribute to you or the codebase, but I expect SimpleRH may be a better tool for your application than DisMaxRH, as long as you get it to score as you wish.

Both Standard and DisMax request handlers use SolrQueryParser, an extension of the Lucene query parser which introduces a small number of changes, one of which is that prefix queries, e.g. Radioh*, are evaluated with ConstantScorePrefixQuery rather than the standard PrefixQuery.

In issue SOLR-218 developers have been discussing per-field control of query parser options (some of it Solr's, some of it Lucene's). When that is implemented there should additionally be a property useConstantScorePrefixQuery, analogous to the unfortunately-named QueryParser useOldRangeQuery, but handled by SolrQueryParser (until CSPQs are implemented as an option in Lucene QP).
Until that time, well, Chris H. posted a clever and rather timely workaround on the solr-dev list:

one workaround people may want to consider ... is to force the use of a WildCardQuery in what would otherwise be interpreted as a PrefixQuery by putting a ? before the *

ie: auto?* instead of auto*

(yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?

- J.J.
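
On the question of what idf and docFreq are: docFreq is the number of documents containing the term, and idf (inverse document frequency) is a weight that grows as docFreq shrinks, so rare terms count for more. The sketch below recomputes the two scores from the factors printed in the debug output above. The `term_score` arithmetic follows the explain tree directly. The `idf` function is an assumption: it is the formula commonly documented for Lucene's default similarity, and the index size it needs (`num_docs`) is not given anywhere in this thread.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Assumed default-similarity formula: rarer terms get a larger idf."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def term_score(idf_value: float, query_norm: float, tf: float, field_norm: float) -> float:
    """Reproduce one explain tree: weight = queryWeight * fieldWeight."""
    query_weight = idf_value * query_norm        # queryWeight line in the debug output
    field_weight = tf * idf_value * field_norm   # fieldWeight line in the debug output
    return query_weight * field_weight


# Factors copied from the two explain entries in the message above.
ani = term_score(16.000769, 0.0028973192, 1.0, 40960.0)
radiohead = term_score(8.781371, 0.0028973192, 1.0, 131072.0)
print(f"radiohead+ani: {ani:.1f}")
print(f"radiohead:     {radiohead:.1f}")
```

Note that idf enters the product twice, once through queryWeight and once through fieldWeight. That is why a term matching only 2 documents can outrank one matching 4096 even though the latter has the larger fieldNorm (where index-time boosts end up).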
wrong path in snappuller
I have downloaded all the scripts from the current version in trunk and I'm finding the same issue as https://issues.apache.org/jira/browse/SOLR-188 in snappuller. I haven't looked at the other scripts yet.

rsync -Wa${verbose}${compress} --delete ${sizeonly} \
${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip

That command fails in non-default installations because of that /solr/ in the path.

Is this known, or should I log it in JIRA?

thanks,
galo
Re: New docs need server restart after synchronization
I was expecting to see the commit error, which is the one that told me before that the configuration was not correct, but the commit was made and logged correctly... on the master. I had copied scripts.conf from the master to all the slaves with solr_hostname=master and didn't replace it with slave1, slave2 etc., so the commit was successful, but not on the slave. Dumb. Dumb.

While we're at it: what's the best setup for multiple indexes on the same server? At the moment I have a Tomcat server running on each machine, with a separate webapp for each index (5 for now, there will be many more). Something like:

master: http://master:8080/index1, http://master:8080/index2
slave1: http://slave1:8080/index1, http://slave1:8080/index2
slave2: http://slave2:8080/index1, http://slave2:8080/index2
slave3: http://slave3:8080/index1, http://slave3:8080/index2
...

On each server I have a main solr directory and then one for each index:

solr
solr-index1
solr-index2

The main directory (solr) holds all the shared folders, so inside solr-index1 and solr-index2, bin, etc, lib, ext, webapp and so on are links to solr/bin, solr/etc, solr/lib, and so on. Obviously, conf, data and logs are not shared.

This is reasonably simple to set up and install on other servers. My only real complaint is that I'd like to keep a single conf folder shared across master and slaves for each index (schema.xml, solrconfig.xml etc. should be the same, and scripts.conf can be shared as long as you use solr_hostname=localhost and the same port, which is my case). Not a tragedy anyway.

How are you dealing with these situations? Are there better ways than this?

Cheers,
galo

Chris Hostetter wrote:

: problems for a few weeks, snappuller and snapinstaller run every hour
: normally, install the new index without errors etc.
:
: In the last couple of days it seems like I need to restart tomcat for
: the new documents to appear in the slave indexes.
: New docs are updated,
: commited and appear if I do a search on the master, but if after the
: synchronization I search in the slaves they don't appear. If I restart
: tomcat they do.

snappuller notifies the slaves that they should reopen the index by executing the bin/commit script ... if that fails it should log a message ... but maybe bin/commit is logging an error in its log and exiting with a successful return code?

Do you see the commit log messages from Solr? Do you see the /update requests in the slaves' tomcat access logs?

-Hoss

--
Galo Navarro, Developer
[EMAIL PROTECTED]
t. +44 (0)20 7780 7080
Last.fm | http://www.last.fm
Karen House 1-11 Baches Street London N1 6DL
http://www.last.fm/user/galeote
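
The per-index directory layout described in this message (shared bin, etc, lib, ext and webapp linked from a main solr directory, with conf, data and logs private to each index) could be scripted roughly as follows. This is a sketch under assumptions: the function name and the use of relative symlinks are illustrative choices, not something taken from the Solr distribution, and it assumes a filesystem that supports symlinks.

```python
from pathlib import Path

# Folder names taken from the message above.
SHARED = ("bin", "etc", "lib", "ext", "webapp")   # linked to the main solr directory
PRIVATE = ("conf", "data", "logs")                # real directories, one set per index


def make_index_home(root: Path, index_name: str) -> Path:
    """Create root/solr-<index_name> with links to the shared folders in root/solr."""
    shared_home = root / "solr"
    index_home = root / f"solr-{index_name}"
    index_home.mkdir(parents=True, exist_ok=True)

    for name in SHARED:
        (shared_home / name).mkdir(parents=True, exist_ok=True)
        link = index_home / name
        if not (link.is_symlink() or link.exists()):
            # A relative target keeps the tree relocatable as a whole.
            link.symlink_to(Path("..") / "solr" / name, target_is_directory=True)

    for name in PRIVATE:
        (index_home / name).mkdir(exist_ok=True)

    return index_home
```

Calling `make_index_home(Path("/home/solr"), "index1")` would produce /home/solr/solr-index1 next to /home/solr/solr; the base path here is only an example.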
New docs need server restart after synchronization
Hi there,

I've been running an index on 3 nodes (master + 2 slaves) without problems for a few weeks. snappuller and snapinstaller run every hour as normal, install the new index without errors, etc.

In the last couple of days it seems like I need to restart Tomcat for the new documents to appear in the slave indexes. New docs are updated, committed and appear if I do a search on the master, but if I search in the slaves after the synchronization they don't appear. If I restart Tomcat they do.

I remember this happening while I was setting up the index and seeing an error somewhere (something like it not being able to open a new searcher, if I remember well), but I'm pretty sure the configuration is OK: it's been working normally for weeks and I haven't made any changes since then. Any ideas where I should look? I can't see any error messages in the Tomcat logs or the scripts' logs!

thanks,
galo
Re: New docs need server restart after synchronization
Yep, all normal..

galo

Bill Au wrote:

Did you check for error messages in the snappuller and snapinstaller log files under solr/logs? Distribution related errors will not show up in the tomcat logs.

Bill

On 4/16/07, galo [EMAIL PROTECTED] wrote:

Hi there,

I've been running an index on 3 nodes (master + 2 slaves) without problems for a few weeks. snappuller and snapinstaller run every hour as normal, install the new index without errors, etc.

In the last couple of days it seems like I need to restart Tomcat for the new documents to appear in the slave indexes. New docs are updated, committed and appear if I do a search on the master, but if I search in the slaves after the synchronization they don't appear. If I restart Tomcat they do.

I remember this happening while I was setting up the index and seeing an error somewhere (something like it not being able to open a new searcher, if I remember well), but I'm pretty sure the configuration is OK: it's been working normally for weeks and I haven't made any changes since then. Any ideas where I should look? I can't see any error messages in the Tomcat logs or the scripts' logs!

thanks,
galo
problems finding negative values
Hi,

I have an index consisting of the following fields:

<field name="id" type="long" indexed="true" stored="true"/>
<field name="length" type="integer" indexed="true" stored="true"/>
<field name="key" type="integer" indexed="true" stored="true" multiValued="true"/>

Each doc has a few key values, some of which are negative. I know there's a document that has both 826606443 and -1861807411.

If I search with

http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=-1861807411&fl=id,length,key

I get no results, but if I do

http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=826606443&fl=id,length,key

I get the document as expected.

Obviously the key field is configured as a search field, indexed, etc., but somehow Solr doesn't like negatives. I'm assuming this might have something to do with analysers but can't tell how to fix it. Any ideas?

Thanks,
galo
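
One likely explanation, offered here as an assumption rather than a confirmed diagnosis: in the Lucene query syntax a leading - is the "prohibit" operator, so q=-1861807411 is parsed as "exclude documents matching 1861807411" rather than as a search for a negative number. Escaping the minus sign with a backslash (or quoting the value) makes the parser treat it as part of the term. The sketch below builds such a URL; the helper names are hypothetical, and the field-qualified form key:... is used only for illustration.

```python
from urllib.parse import urlencode


def escape_leading_operator(term: str) -> str:
    """Backslash-escape a leading '-' or '+' so it is read as part of the term."""
    return "\\" + term if term[:1] in ("-", "+") else term


def build_select_url(base_url: str, field: str, value: int) -> str:
    """Build a select URL whose q parameter searches `field` for `value`."""
    query = f"{field}:{escape_leading_operator(str(value))}"
    params = {
        "version": "2.1",
        "start": 0,
        "rows": 50,
        "indent": "on",
        "q": query,
        "fl": "id,length,key",
    }
    # urlencode percent-encodes the backslash and colon for the URL.
    return f"{base_url}?{urlencode(params)}"


print(build_select_url("http://localhost:8080/solr/select/", "key", -1861807411))
```

If the escaped query still returns nothing, the analyser configured for the field type would be the next thing to check.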
failing post-optimize command execution
Hi,

I've configured my solrconfig.xml to execute a snapshoot after an optimize is made, but I keep getting the following exception in the Tomcat logs:

SEVERE: java.io.IOException: Cannot run program snapshooter (in directory /home/solr/solr/bin): java.io.IOException: error=2, No such file or directory

I'm certain the path and filename are correct. Has anybody else had problems with this?

Cheers,
galo
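
An error=2 ("No such file or directory") when launching a script can have several causes besides a wrong path. The checker below is a hypothetical diagnostic sketch, not part of Solr, listing the usual suspects on a POSIX system: a missing working directory, a missing or non-executable file, a missing interpreter on the shebang line, and a bare program name that may be looked up on PATH rather than in the working directory. Whether the last point applies to how Solr launches the program is an assumption; trying ./snapshooter or an absolute path is a cheap way to find out.

```python
import os
from pathlib import Path


def diagnose_exe(directory: str, exe: str) -> list:
    """Return human-readable reasons why running `exe` from `directory` might fail."""
    workdir = Path(directory)
    if not workdir.is_dir():
        return [f"working directory {directory} does not exist"]

    path = workdir / exe
    if not path.is_file():
        return [f"{path} does not exist or is not a regular file"]

    problems = []
    if not os.access(path, os.X_OK):
        problems.append(f"{path} is not executable for this user")

    # A script whose interpreter is missing also fails with ENOENT.
    with open(path, "rb") as handle:
        first_line = handle.readline().strip()
    if first_line.startswith(b"#!"):
        parts = first_line[2:].split()
        if parts and not Path(parts[0].decode(errors="replace")).exists():
            problems.append(f"interpreter {parts[0].decode(errors='replace')} not found")

    if os.sep not in exe:
        problems.append(
            f"'{exe}' is a bare name and may be resolved via PATH, "
            f"not the working directory; try ./{exe} or an absolute path"
        )
    return problems


for problem in diagnose_exe("/home/solr/solr/bin", "snapshooter"):
    print(problem)
```

The check should be run as the same user that runs Tomcat, since permissions and PATH can differ between accounts.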
Re: Solr on Tomcat 6.0.10?
I'm using 6.0.9 and no issues (fingers crossed)

Walter Underwood wrote:

Is anyone running Solr on Tomcat 6.0.10? Any issues? I searched the archives and didn't see anything.

wunder
Multiple instances, wiki out of date?
Hi there,

I've been following the instructions from http://wiki.apache.org/solr/SolrJetty?highlight=%28Multiple%29%7C%28Solr%29%7C%28Webapps%29solr to get a few indexes running under the same instance of Jetty 6.1.2.

If I use the webapp descriptors as specified in the wiki (with correct paths, I'm just pasting the example here):

<Call name="addWebApplication">
  <Arg>/solr1/*</Arg>
  <Arg>/your/path/to/the/solr.war</Arg>
  <Set name="extractWAR">true</Set>
  <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
  <Call name="addEnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="String">/your/path/to/your/solr/home/dir</Arg>
  </Call>
</Call>

<Call name="addWebApplication">
  <Arg>/solr2/*</Arg>
  <Arg>/your/path/to/the/solr.war</Arg>
  <Set name="extractWAR">true</Set>
  <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
  <Call name="addEnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="String">/your/path/to/your/alternate/solr/home/dir</Arg>
  </Call>
</Call>

Jetty complains that:

2007-02-26 18:36:04.874::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2007-02-26 18:36:05.066::WARN: Config error at <Call name="addWebApplication"><Arg>/solr1/*</Arg><Arg>/your/path/to/the/solr.war</Arg><Set name="extractWAR">true</Set><Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set><Call name="addEnvEntry"><Arg>solr/home</Arg><Arg type="String">/your/path/to/your/solr/home/dir</Arg></Call></Call>
2007-02-26 18:36:05.066::WARN: EXCEPTION
java.lang.IllegalStateException: No Method: <Call name="addWebApplication"><Arg>/solr1/*</Arg><Arg>/your/path/to/the/solr.war</Arg><Set name="extractWAR">true</Set><Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set><Call name="addEnvEntry"><Arg>solr/home</Arg><Arg type="String">/your/path/to/your/solr/home/dir</Arg></Call></Call> on class org.mortbay.jetty.Server
    at org.mortbay.xml.XmlConfiguration.call(XmlConfiguration.java:548)
    at org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:241)
    at org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:203)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:919)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
2007-02-26 18:36:05.068::INFO: Shutdown hook executing
2007-02-26 18:36:05.068::INFO: Shutdown hook complete

I've been looking at the Jetty API and it looks like those methods are deprecated in the latest versions of Jetty. Anyway, I can get several instances to run together using the descriptor shown below and several war files:

<Call name="addLifeCycle">
  <Arg>
    <New class="org.mortbay.jetty.deployer.WebAppDeployer">
      <Set name="contexts"><Ref id="Contexts"/></Set>
      <Set name="webAppDir"><SystemProperty name="jetty.home" default="."/>/webapps-plus</Set>
      <Set name="parentLoaderPriority">false</Set>
      <Set name="extract">true</Set>
      <Set name="allowDuplicates">false</Set>
      <Set name="defaultsDescriptor"><SystemProperty name="jetty.home" default="."/>/etc/webdefault.xml</Set>
    </New>
  </Arg>
</Call>

This is good enough for me, but the problem then is that all of them point to the same data/index folder, sharing the same index, and I need them to use different indexes.

The question is: how can you configure solr.home differently for each of the Solr instances deployed in the webapps-plus folder? It would be equally valid if there is a way of fixing the XML in the wiki so that individual war files can be specified, passing a different solr.home to each.

thanks,
galo.