Re: Wildcards / Binary searches
Ok, further to my email below, I've been testing with q=radioh?*. Basically the problem is that when searching artists, even with Radiohead having a big boost, it's returning stuff with less boost first, like "Radiohead+Ani Di Franco" or "Radiohead+Michael Stipe". The debug output is below, but basically, for Radiohead and one of the others we get this:

  radiohead+ani - 655391.5 * 0.046359334
  radiohead     - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers, the cause seems to be in this line:

  8.781371 = idf(docFreq=4096)

while Radiohead+Ani is getting:

  16.000769 = idf(docFreq=2)

If I can alter this I think it's sorted.. what are idf and docFreq?

  30383.514 = (MATCH) sum of:
    30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
      0.046359334 = queryWeight(text:radiohead+ani), product of:
        16.000769 = idf(docFreq=2)
        0.0028973192 = queryNorm
      655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
        1.0 = tf(termFreq(text:radiohead+ani)=1)
        16.000769 = idf(docFreq=2)
        40960.0 = fieldNorm(field=text, doc=159496)

  29284.035 = (MATCH) sum of:
    29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
      0.025442434 = queryWeight(text:radiohead), product of:
        8.781371 = idf(docFreq=4096)
        0.0028973192 = queryNorm
      1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
        1.0 = tf(termFreq(text:radiohead)=1)
        8.781371 = idf(docFreq=4096)
        131072.0 = fieldNorm(field=text, doc=9799640)

Thanks a lot,
galo

galo wrote:
> I was doing a different trick, basically searching q=radioh*+radioh~, and the results are
> slightly better than ?*, but not great. By the way, the case sensitivity of wildcards matters
> here, of course. I'd like to have a look at that DisMax you have if you can post it, at least
> to compare results. The way I get scoring done, as I say, is far from perfect. By the way,
> I'm seeing that the highlighting disappears when using these wildcards; is that normal?
Thanks for your help,
galo

At 4:40 PM +0100 6/6/07, galo wrote:
>1. I want to use solr for some sort of live search, querying with incomplete terms + wildcard
>and getting any similar results. Radioh* would return anything containing that string. The
>DisMax req. handler doesn't accept wildcards in the q param, so I'm trying the simple one and
>still have problems, as all my results are coming back with score = 1 and I need them sorted
>by relevance.. Is there a way of doing this? Why doesn't * work in dismax (nor ~, by the way)?

DisMax was written with the intent of supporting a simple search box into which one could type or paste some text, e.g. a title like

  Santa Clause: Is he Real (and if so, what is "real")?

and get meaningful results. To do that it pre-processes the query string by removing unbalanced quotation marks and escaping characters that would otherwise be treated by the query parser as operators: \ ! ( ) : ^ [ ] { } ~ * ?

I have a local version of DisMax which parameterizes the escaping so certain operators can be allowed through, which I'd be happy to contribute to you or the codebase, but I expect SimpleRH may be a better tool for your application than DisMaxRH, as long as you get it to score as you wish.

Both the Standard and DisMax request handlers use SolrQueryParser, an extension of the Lucene query parser which introduces a small number of changes, one of which is that prefix queries, e.g. Radioh*, are evaluated with ConstantScorePrefixQuery rather than the standard PrefixQuery.

In issue SOLR-218 developers have been discussing per-field control of query parser options (some of them Solr's, some Lucene's). When that is implemented there should additionally be a property useConstantScorePrefixQuery, analogous to the unfortunately-named QueryParser useOldRangeQuery, but handled by SolrQueryParser (until CSPQs are implemented as an option in the Lucene QP). Until that time, well, Chris H.
posted a clever and rather timely workaround on the solr-dev list:

>one workaround people may want to consider ... is to force the use of a WildcardQuery in what
>would otherwise be interpreted as a PrefixQuery by putting a "?" before the "*"
>
>ie: auto?* instead of auto*
>
>(yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?

- J.J.
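On the open question in this thread ("what are idf and docFreq?"): docFreq is the number of documents containing a term, and idf (inverse document frequency) is the weight Lucene derives from it, so rarer terms score higher. A small sketch of Lucene's DefaultSimilarity idf formula that reproduces the numbers in the explain output above; the corpus size N is an assumption, chosen only because it matches those numbers:

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    # Lucene DefaultSimilarity: idf = 1 + ln(numDocs / (docFreq + 1)).
    # Rare terms get a large idf, common terms a small one.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Assumed corpus size (~9.8M docs), picked to reproduce the explain output.
N = 9_800_000

print(round(idf(4096, N), 2))  # ~8.78  -> "radiohead",     docFreq=4096
print(round(idf(2, N), 2))     # ~16.0  -> "radiohead+ani", docFreq=2
```

Note that idf appears in both queryWeight and fieldWeight in the explain output, so the rare compound term's advantage is effectively squared into the score, which is what lets "Radiohead+Ani Di Franco" outscore plain "Radiohead" despite the boost.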
Re: Wildcards / Binary searches
Yeah, I thought of that solution, but this is a 20G index with each document having around 300 of those numbers, so I was a bit worried about the performance.. I'll try anyway, thanks!

On 06/06/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 6/6/07, galo <[EMAIL PROTECTED]> wrote:
> 3. I'm trying to implement another index where I store a number of int
> values for each document. Everything works ok as integers but i'd like
> to have some sort of fuzzy searches based on the bit representation of
> the numbers. Essentially, this number:
>
> 1001001010100
>
> would be compared to these two
>
> 1011001010100
> 1001001010111
>
> And the first would get a bigger score than the second, as it has only 1
> flipped bit while the second has 2.

You could store the numbers as a string field with the binary representation, then try a fuzzy search:

  myfield:1001001010100~

-Yonik
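For reference, the "flipped bits" measure described in the question is the Hamming distance, which Yonik's fuzzy-search trick roughly tracks (for equal-length bit strings, the edit distance a fuzzy query uses comes down to substitutions). A minimal sketch of the distance itself:

```python
def hamming(a: str, b: str) -> int:
    # Number of flipped bits between two equal-length bit strings.
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

q = "1001001010100"
print(hamming(q, "1011001010100"))  # 1 flipped bit  -> should score higher
print(hamming(q, "1001001010111"))  # 2 flipped bits -> should score lower
```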
Wildcards / Binary searches
Hi,

Three questions:

1. I want to use solr for some sort of live search, querying with incomplete terms + wildcard and getting any similar results. Radioh* would return anything containing that string. The DisMax req. handler doesn't accept wildcards in the q param, so I'm trying the simple one and still have problems, as all my results are coming back with score = 1 and I need them sorted by relevance.. Is there a way of doing this? Why doesn't * work in dismax (nor ~, by the way)?

2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int values for each document. Everything works ok as integers, but I'd like to have some sort of fuzzy searches based on the bit representation of the numbers. Essentially, this number:

  1001001010100

would be compared to these two:

  1011001010100
  1001001010111

And the first would get a bigger score than the second, as it has only 1 flipped bit while the second has 2. Is it possible to implement this in solr?

Cheers,
galo
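The workaround suggested elsewhere in this thread for question 1 is to query prefix?* instead of prefix*, which forces a scored WildcardQuery rather than Solr's constant-scoring prefix query. A minimal sketch of building such a request URL; the host, port and helper name are illustrative:

```python
from urllib.parse import urlencode

def live_search_url(base: str, prefix: str) -> str:
    # "prefix?*" forces a WildcardQuery (which is scored) instead of a
    # constant-score prefix query; it requires at least one character
    # after the prefix, so an empty search box can't use it.
    return base + "/select?" + urlencode({"q": prefix + "?*"})

print(live_search_url("http://localhost:8080/solr", "radioh"))
```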
Re: wrong path in snappuller
Ok, I will create an issue. I got round it by changing this:

  rsync -Wa${verbose}${compress} --delete ${sizeonly} \
    ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip

for:

  rsync -Wa${verbose}${compress} --delete ${sizeonly} \
    ${stats} ${master_host}:${master_data_dir}/${name}/ ${data_dir}/${name}-wip

I had to remove the rsync:// as it was causing some problems finding the path and I didn't have much time to investigate. It works with absolute or relative paths set in the slave's master data folder param. Why does it need to start an rsyncd on the master on a different port for each app; is it not enough to call rsync on master:path?

Thanks for answering,
Galo

Chris Hostetter wrote:
: and I'm finding the same issues as
: https://issues.apache.org/jira/browse/SOLR-188 in the snappuller, I
: haven't looked in other scripts yet.
:
: rsync -Wa${verbose}${compress} --delete ${sizeonly} \
:   ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip

that would be a separate issue from SOLR-188 ... 188 has to do with non-standard URLs, this seems to be an issue with snappuller assuming a specific rsync path (which, if I understand correctly, is relative to the working directory of rsyncd?)

: Is this known or should I log it in JIRA?

please open a new Jira issue ... i'm guessing a new optional param will be needed for the master's solr_home relative to the rsync server.

-Hoss
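The two invocations differ in transport, which answers the rsyncd question: an rsync:// URL goes through the rsync daemon and is resolved against a module the master's rsyncd exports (hence the per-app daemon port and the hardcoded /solr/ module path), while a plain host:path goes over the remote shell and resolves ordinary filesystem paths. A sketch contrasting the two command lines; the function names and sample values are illustrative:

```python
def daemon_cmd(master_host: str, rsyncd_port: int, name: str, data_dir: str) -> list:
    # Stock snappuller: pull from the "solr" module exported by the
    # master's rsyncd; the path is relative to the module root.
    src = f"rsync://{master_host}:{rsyncd_port}/solr/{name}/"
    return ["rsync", "-Wa", "--delete", src, f"{data_dir}/{name}-wip"]

def ssh_cmd(master_host: str, master_data_dir: str, name: str, data_dir: str) -> list:
    # galo's change: plain host:path over the remote shell; no rsyncd
    # module needed, and absolute or relative paths both work.
    src = f"{master_host}:{master_data_dir}/{name}/"
    return ["rsync", "-Wa", "--delete", src, f"{data_dir}/{name}-wip"]

print(" ".join(daemon_cmd("master", 18983, "snapshot.20070607", "/solr/data")))
print(" ".join(ssh_cmd("master", "/solr/data", "snapshot.20070607", "/solr/data")))
```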
wrong path in snappuller
I have downloaded all the scripts from the current version in the trunk and I'm finding the same issues as https://issues.apache.org/jira/browse/SOLR-188 in the snappuller; I haven't looked in other scripts yet.

  rsync -Wa${verbose}${compress} --delete ${sizeonly} \
    ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip

That command fails in non-default installations due to that /solr/. Is this known or should I log it in JIRA?

thanks,
galo
Re: New docs need server restart after synchronization
I was expecting to see the commit error, which is the one that told me before that the configuration was not correct, but the commit was made and logged correctly... on the master. I had copied scripts.conf from the master to all the slaves with solr_hostname=master and didn't replace it with slave1, slave2, etc., so the commit was successful, but not on the slave. Dumb. Dumb.

Now that we are at it.. what's the best setup for multiple indexes on the same server? Atm a tomcat server runs on each machine with a separate webapp for each index, for a few indexes (5 by now, will be many more). Something like:

  master: http://master:8080/index1, http://master:8080/index2
  slave1: http://slave1:8080/index1, http://slave1:8080/index2
  slave2: http://slave2:8080/index1, http://slave2:8080/index2
  slave3: http://slave3:8080/index1, http://slave3:8080/index2
  ...

On each server I have a main solr directory and then one for each index:

  solr
  solr-index1
  solr-index2

The main directory (solr) holds all the shared folders, so inside solr-index1 and solr-index2, bin, etc, lib, ext, webapp etc. are links to solr/bin, solr/etc, solr/lib, etc. Obviously, conf, data and logs are not shared. This is reasonably simple to set up and install on other servers, the only real complaint being that I'd like to keep a single conf folder shared across master and slaves for each index (schema.xml, solrconfig.xml etc. should be the same, and scripts.conf can be shared as long as you use solr_hostname=localhost and the same port, which is my case). Not a tragedy anyway. How are you dealing with these situations; are there better ways than this?

Cheers,
galo

Chris Hostetter wrote:
: problems for a few weeks, snappuller and snapinstaller run every hour
: normally, install the new index without errors etc.
:
: In the last couple of days it seems like I need to restart tomcat for
: the new documents to appear in the slave indexes.
: New docs are updated,
: committed and appear if I do a search on the master, but if after the
: synchronization I search in the slaves they don't appear. If I restart
: tomcat they do.

snappuller "notifies" the slaves that they should reopen the index by executing the bin/commit script ... if that fails it should log a message ... but maybe bin/commit is logging an error in its log and exiting with a successful return code? do you see the "commit" log messages from Solr? do you see the /update requests in the slaves' tomcat access logs?

-Hoss

--
Galo Navarro, Developer
[EMAIL PROTECTED]
t. +44 (0)20 7780 7080
Last.fm | http://www.last.fm
Karen House 1-11 Baches Street
London N1 6DL
http://www.last.fm/user/galeote
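The shared-folders layout described above (bin/etc/lib linked back to a common solr directory, conf/data/logs private per index) can be scripted. A sketch under those assumptions; the directory names come from the email, the helper itself is illustrative:

```python
import os
import tempfile

def make_index_home(base: str, shared: str, name: str) -> str:
    # One directory per index: bin/etc/lib are symlinks back to the
    # shared solr directory, while conf/data/logs stay private.
    home = os.path.join(base, f"solr-{name}")
    os.makedirs(home)
    for d in ("bin", "etc", "lib"):
        os.symlink(os.path.join(shared, d), os.path.join(home, d))
    for d in ("conf", "data", "logs"):
        os.makedirs(os.path.join(home, d))
    return home

# Demo in a throwaway directory.
base = tempfile.mkdtemp()
shared = os.path.join(base, "solr")
for d in ("bin", "etc", "lib"):
    os.makedirs(os.path.join(shared, d))

home = make_index_home(base, shared, "index1")
print(os.path.islink(os.path.join(home, "bin")))  # True
```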
Re: New docs need server restart after synchronization
Yep, all normal..

galo

Bill Au wrote:
Did you check for error messages in the snappuller and snapinstaller log files under solr/logs? Distribution-related errors will not show up in the tomcat logs.

Bill

On 4/16/07, galo <[EMAIL PROTECTED]> wrote:
Hi there,

I've been running an index on 3 nodes (master + 2 slaves) without problems for a few weeks; snappuller and snapinstaller run every hour normally, install the new index without errors, etc.

In the last couple of days it seems like I need to restart tomcat for the new documents to appear in the slave indexes. New docs are updated, committed and appear if I do a search on the master, but if after the synchronization I search in the slaves they don't appear. If I restart tomcat they do.

I remember this happening while I was setting up the index and seeing an error somewhere (something like it not being able to open a new searcher, if I remember well), but I'm pretty sure the configuration is ok; it's been working for weeks normally and I haven't done any changes since then.. any ideas where I should look? I can't see any error messages in the tomcat logs or the scripts'!

thanks,
galo
New docs need server restart after synchronization
Hi there,

I've been running an index on 3 nodes (master + 2 slaves) without problems for a few weeks; snappuller and snapinstaller run every hour normally, install the new index without errors, etc.

In the last couple of days it seems like I need to restart tomcat for the new documents to appear in the slave indexes. New docs are updated, committed and appear if I do a search on the master, but if after the synchronization I search in the slaves they don't appear. If I restart tomcat they do.

I remember this happening while I was setting up the index and seeing an error somewhere (something like it not being able to open a new searcher, if I remember well), but I'm pretty sure the configuration is ok; it's been working for weeks normally and I haven't done any changes since then.. any ideas where I should look? I can't see any error messages in the tomcat logs or the scripts'!

thanks,
galo
Re: Question: index performance
Hi there,

I'm building an index to which I'm sending a few hundred thousand entries. I pull them off the database in batches of 25k and send them to solr, 100 documents at a time. I was doing a commit after each of those, but after what Yonik says I will remove it and commit only after each batch of 25k.

Q1: I've got autocommit set to 1000 in solrconfig.xml now.. should I disable it in this scenario?

Q2: To decide which of those 25k are going to be indexed, we need to do a query for each (this is the main reason to optimize before a new DB batch is indexed). Each of these 25k queries takes around 30ms, which is good enough for us, but I've observed that every ~30 queries the time of one search goes up to 150ms or even 1200ms. Then it does another ~30, etc. I guess there is something happening inside the server regularly that causes it. Any clues what it can be and how I can minimize that time?

Q3: The 25k searches are done without any cumulative effect on performance (avg/search is ~30ms from start to end). But if immediately afterwards I start posting documents to the index, tomcat maxes out the CPU. If I restart tomcat and then post the 25k documents without doing those searches, they're very quick. Is there any reason why the searches would affect tomcat that justifies this? Just to clarify, searches are NOT done at the same time as indexing. My tomcat is running with -server -Xmx512m -Xms512m.

Cheers,
galo

Yonik Seeley wrote:
On 4/13/07, James liu <[EMAIL PROTECTED]> wrote:
> i find it will be OutOfMemory when i get more than 10k records. so now i
> index 10k records (5k / record)

In one request? There's really no reason to put more than hundreds of documents in a single add request. If you are indexing using multiple requests and always run into problems at 10k records, you are probably hitting memory issues with Lucene merging. If that's the case, try lowering the mergeFactor so fewer segments will be merged at the same time.

Some other things to be careful of:
- don't call commit after you add every batch of documents
- don't set maxBufferedDocs too high if you don't have the memory

-Yonik
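The batching advice in this thread (adds of ~100 documents, a single commit per 25k DB batch rather than a commit per add) can be sketched as follows; post_batch and commit are hypothetical stand-ins for whatever client sends the <add> and <commit> requests:

```python
def index_all(rows, post_batch, commit, batch_size=100):
    # Send documents in small add requests; commit once at the end
    # instead of after every add, since a commit reopens searchers
    # and invalidates caches, which is expensive.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            post_batch(batch)
            batch = []
    if batch:
        post_batch(batch)  # flush the final partial batch
    commit()

# Demo: record the size of every add request, then the commit.
calls = []
index_all(range(250), lambda b: calls.append(len(b)), lambda: calls.append("commit"))
print(calls)  # [100, 100, 50, 'commit']
```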
Re: problems finding negative values
Ah! Thanks. Wrapping the term in quotes solves the issue, but I've tried escaping with \- as Yonik suggested and it doesn't. I guess there's no performance difference between the two, so I can live with quotes, but anyway, for curiosity's sake, should \ work?

thanks,
galo

Jeff Rodenburg wrote:
This one caught us as well. Refer to http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping%20Special%20Characters for understanding which characters need to be escaped in your queries.

On 4/4/07, galo <[EMAIL PROTECTED]> wrote:
Hi,

I have an index consisting of the following fields. Each doc has a few key values, some of which are negative. Ok, I know there's a document that has both 826606443 and -1861807411. If I search with

  http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=-1861807411&fl=id,length,key

I get no results, but if I do

  http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=826606443&fl=id,length,key

I get the document as expected. Obviously the key field is configured as a search field, indexed, etc., but somehow solr doesn't like negatives. I'm assuming this might have something to do with analysers but can't tell how to fix it.. any ideas?

Thanks,
galo
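The underlying reason q=-1861807411 matches nothing is that the query parser treats a leading "-" as the prohibited-clause (NOT) operator, so the query has no required term at all; quoting the value makes it an ordinary term. A sketch of building the quoted query, with the parameter names taken from the URLs in the thread and the helper name illustrative:

```python
from urllib.parse import urlencode

def key_query(value) -> str:
    # Wrap the value in quotes so a leading "-" is not parsed as
    # the prohibited-clause operator.
    return "/solr/select/?" + urlencode({"q": f'"{value}"', "fl": "id,length,key"})

print(key_query(-1861807411))
```

As for why \- didn't work: even when the parser accepts the escape, the field's analyzer still sees the token afterwards, and a tokenizer that splits on or strips "-" will discard it, which may well be what happened here.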
problems finding negative values
Hi,

I have an index consisting of the following fields:

  multiValued="true" />

Each doc has a few key values, some of which are negative. Ok, I know there's a document that has both 826606443 and -1861807411. If I search with

  http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=-1861807411&fl=id,length,key

I get no results, but if I do

  http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=826606443&fl=id,length,key

I get the document as expected. Obviously the key field is configured as a search field, indexed, etc., but somehow solr doesn't like negatives. I'm assuming this might have something to do with analysers but can't tell how to fix it.. any ideas?

Thanks,
galo
failing post-optimize command execution
Hi,

I've configured my solrconfig.xml to execute a snapshoot after an optimize is made, but I keep getting the following exception in the tomcat logs:

  SEVERE: java.io.IOException: Cannot run program "snapshooter" (in directory
  "/home/solr/solr/bin"): java.io.IOException: error=2, No such file or directory

I'm certain the path and filename are correct.. has anybody else had problems with this?

Cheers,
galo
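A common cause of this exact IOException with Solr's RunExecutableListener is that Java's exec() does not search the configured dir for the program, so the exe needs a "./" prefix (or an absolute path) even when dir is correct. A sketch of the listener config under that assumption, using the directory from the error message:

```xml
<listener event="postOptimize" class="solr.RunExecutableListener">
  <!-- "./" is needed because exec() does not look the program up in "dir" -->
  <str name="exe">./snapshooter</str>
  <str name="dir">/home/solr/solr/bin</str>
  <bool name="wait">true</bool>
</listener>
```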
Re: Solr on Tomcat 6.0.10?
I'm using 6.0.9 and no issues (fingers crossed).

Walter Underwood wrote:
Is anyone running Solr on Tomcat 6.0.10? Any issues? I searched the archives and didn't see anything.

wunder
Re: Time after snapshot is "visible" on the slave
Yep, the snapinstaller was failing and it was the same problem as Jeff posted this morning about bin/optimize, but this time with bin/commit not using ${webapp_name}. I fixed that and it worked normally. I've submitted a bug to JIRA, as I think Jeff didn't submit it yet. Mm, now I see your other email.. oh well..

Thanks for your help,

Graham Stead wrote:
Hi Galo,

The snapinstaller actually performs a commit as its last step, so if that didn't work, it's not surprising that running commit separately didn't work, either. I would suggest running the snapinstaller and/or commit scripts with the -V option. This will produce verbose debugging information and allow you to see where they encounter problems.

Hope this helps,
-Graham
Time after snapshot is "visible" on the slave
Hi,

I've been testing index replication, and after snappulling and installing the latest version of the master index, if I run a query on the slave I don't get any results back (tried a commit in despair, which didn't work either). If I restart the web server (tomcat) then it works. Am I missing any steps, or am I just being too impatient sending queries?

Cheers,
galo
Multiple instances, wiki out of date?
Hi there,

I've been following the instructions from http://wiki.apache.org/solr/SolrJetty?highlight=%28Multiple%29%7C%28Solr%29%7C%28Webapps%29solr to get a few indexes running under the same instance of jetty 6.1.2. If I use the webapp descriptors as specified in the wiki (with correct paths, I'm just pasting the example here):

  <Call name="addWebApplication">
    <Arg>/solr1/*</Arg>
    <Arg>/your/path/to/the/solr.war</Arg>
    <Set name="extractWAR">true</Set>
    <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
    <Call name="addEnvEntry">
      <Arg>solr/home</Arg>
      <Arg type="String">/your/path/to/your/solr/home/dir</Arg>
    </Call>
  </Call>

  <Call name="addWebApplication">
    <Arg>/solr2/*</Arg>
    <Arg>/your/path/to/the/solr.war</Arg>
    <Set name="extractWAR">true</Set>
    <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
    <Call name="addEnvEntry">
      <Arg>solr/home</Arg>
      <Arg type="String">/your/path/to/your/alternate/solr/home/dir</Arg>
    </Call>
  </Call>

Jetty complains that:

  2007-02-26 18:36:04.874::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
  2007-02-26 18:36:05.066::WARN: Config error at <Call name="addWebApplication">...</Call>
  2007-02-26 18:36:05.066::WARN: EXCEPTION
  java.lang.IllegalStateException: No Method: <Call name="addWebApplication">...</Call> on class org.mortbay.jetty.Server
      at org.mortbay.xml.XmlConfiguration.call(XmlConfiguration.java:548)
      at org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:241)
      at org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:203)
      at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:919)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:585)
      at org.mortbay.start.Main.invokeMain(Main.java:183)
      at org.mortbay.start.Main.start(Main.java:497)
      at org.mortbay.start.Main.main(Main.java:115)
  2007-02-26 18:36:05.068::INFO: Shutdown hook executing
  2007-02-26 18:36:05.068::INFO: Shutdown hook complete

I've been looking at the Jetty API, and it looks like those methods are deprecated in the latest versions of Jetty. Anyway, I can get several instances to run together using the descriptor shown below and several war files:

  <Call name="addLifeCycle">
    <Arg>
      <New class="org.mortbay.jetty.deployer.WebAppDeployer">
        <Set name="contexts"><Ref id="Contexts"/></Set>
        <Set name="webAppDir"><SystemProperty name="jetty.home" default="."/>/webapps-plus</Set>
        <Set name="parentLoaderPriority">false</Set>
        <Set name="extract">true</Set>
        <Set name="allowDuplicates">false</Set>
        <Set name="defaultsDescriptor"><SystemProperty name="jetty.home" default="."/>/etc/webdefault.xml</Set>
      </New>
    </Arg>
  </Call>

This is good enough for me, but the problem then is that they all point to the same data/index folder, sharing the same index, and I need them to use different indexes. The question is, how can you configure solr.home differently for each of the solr instances deployed in the webapps-plus folder? It would be equally valid if there is a way of fixing the xml in the wiki so individual war files can be specified, passing a different solr.home to each..

thanks,
galo.