Re: Wildcards / Binary searches

2007-06-06 Thread galo

OK, further to my email below, I've been testing with q=radioh?*

Basically the problem is that when searching artists, even though Radiohead 
has a big boost, documents with a smaller boost, like 
"Radiohead+Ani Di Franco" or "Radiohead+Michael Stipe", are returned before it.


The debug output is below, but basically, for Radiohead and one of the 
others we get this:


radiohead+ani - 655391.5  * 0.046359334
radiohead - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers, 
the cause seems to be in this line:


8.781371 = idf(docFreq=4096)

While Radiohead+Ani is getting

16.000769 = idf(docFreq=2)

If I can alter this I think it's sorted. What are idf and docFreq?


  
30383.514 = (MATCH) sum of:
  30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
    0.046359334 = queryWeight(text:radiohead+ani), product of:
      16.000769 = idf(docFreq=2)
      0.0028973192 = queryNorm
    655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
      1.0 = tf(termFreq(text:radiohead+ani)=1)
      16.000769 = idf(docFreq=2)
      40960.0 = fieldNorm(field=text, doc=159496)

29284.035 = (MATCH) sum of:
  29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
    0.025442434 = queryWeight(text:radiohead), product of:
      8.781371 = idf(docFreq=4096)
      0.0028973192 = queryNorm
    1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
      1.0 = tf(termFreq(text:radiohead)=1)
      8.781371 = idf(docFreq=4096)
      131072.0 = fieldNorm(field=text, doc=9799640)
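
For reference, the arithmetic in the two explanations can be reproduced in a few lines of Python. This is only a sketch: the idf formula, 1 + ln(numDocs / (docFreq + 1)), is the one I understand Lucene's default similarity to use, and NUM_DOCS is an assumed index size picked so that the formula matches the idf values shown, not a number read from the index.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed formula (classic Lucene default similarity):
    # the fewer documents contain a term, the larger its idf.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def explain_score(idf_value, query_norm, tf, field_norm):
    # Same structure as the debug output: queryWeight * fieldWeight.
    query_weight = idf_value * query_norm
    field_weight = tf * idf_value * field_norm
    return query_weight * field_weight

QUERY_NORM = 0.0028973192  # copied from the debug output

# Numbers copied from the two explanations.
score_ani = explain_score(16.000769, QUERY_NORM, 1.0, 40960.0)
score_radiohead = explain_score(8.781371, QUERY_NORM, 1.0, 131072.0)
print(round(score_ani, 1), round(score_radiohead, 1))

# ASSUMPTION: index size chosen so the formula matches the idf values shown.
NUM_DOCS = 9_814_600
print(idf(2, NUM_DOCS), idf(4096, NUM_DOCS))
```

Note that idf enters the product twice, so the rarer term gains a factor of about (16.000769 / 8.781371)^2, roughly 3.32, which is slightly more than the 3.2x advantage Radiohead gets from its larger fieldNorm (131072 / 40960).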


Thanks a lot,

galo


galo wrote:
I was doing a different trick, basically searching q=radioh*+radioh~, 
and the results are slightly better than with ?*, but not great. By the way, 
the case sensitivity of wildcards has an effect here, of course.


I'd like to have a look at that DisMax you have, if you can post it, at 
least to compare results. As I say, the way I'm doing scoring at the 
moment is far from perfect.


By the way, I'm seeing that the highlighting disappears when using these 
wildcards. Is that normal?


Thanks for your help,

galo


At 4:40 PM +0100 6/6/07, galo wrote:
> 1. I want to use solr for some sort of live search, querying with
> incomplete terms + wildcard and getting any similar results. Radioh*
> would return anything containing that string. The DisMax request handler
> doesn't accept wildcards in the q param so I'm trying the simple one
> and still have problems as all my results are coming back with score =
> 1 and I need them sorted by relevance. Is there a way of doing this?
> Why doesn't * work in dismax (nor ~ by the way)?


DisMax was written with the intent of supporting a simple search box 
in which one could type or paste some text, e.g. a title like


Santa Clause: Is he Real (and if so, what is "real")?

and get meaningful results.  To do that it pre-processes the query 
string by removing unbalanced quotation marks and escaping characters 
that would otherwise be treated by the query parser as operators:


\ ! ( ) : ^ [ ] { } ~ * ?
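
To make that concrete, here is a rough Python sketch of this kind of pre-processing. It is not the DisMax code: the function name is made up, and treating "unbalanced" as "an odd number of double quotes, so drop them all" is an assumption for illustration.

```python
# The operator characters listed above.
OPERATOR_CHARS = set('\\!():^[]{}~*?')

def preprocess_user_query(text):
    # ASSUMPTION: quotes count as unbalanced when their number is odd,
    # in which case every double quote is removed.
    if text.count('"') % 2 == 1:
        text = text.replace('"', '')
    # Prefix each operator character with a backslash so the query
    # parser sees it as literal text.
    return ''.join('\\' + ch if ch in OPERATOR_CHARS else ch for ch in text)

print(preprocess_user_query('Radioh*'))
print(preprocess_user_query('Santa Clause: Is he Real (and if so, what is "real")?'))
```

With this treatment the trailing * in Radioh* reaches the parser as a literal character, which is why wildcard and fuzzy operators have no effect.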

I have a local version of DisMax which parameterizes the escaping so 
certain operators can be allowed through, which I'd be happy to 
contribute to you or the codebase, but I expect SimpleRH may be a 
better tool for your application than DisMaxRH, as long as you get it 
to score as you wish.


Both Standard and DisMax request handlers use SolrQueryParser, an 
extension of the Lucene query parser which introduces a small number 
of changes, one of which is that prefix queries e.g. Radioh* are 
evaluated with ConstantScorePrefixQuery rather than the standard 
PrefixQuery.


In issue SOLR-218 developers have been discussing per-field control of 
query parser options (some of it Solr's, some of it Lucene's).  When 
that is implemented there should additionally be a property 
useConstantScorePrefixQuery analogous to the unfortunately-named 
QueryParser useOldRangeQuery, but handled by SolrQueryParser (until 
CSPQs are implemented as an option in Lucene QP).


Until that time, well, Chris H. posted a clever and rather timely 
workaround on the solr-dev list:


> one workaround people may want to consider ... is to force the use
> of a WildcardQuery in what would otherwise be interpreted as a
> PrefixQuery by putting a "?" before the "*"
>
> ie: auto?* instead of auto*
>
> (yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?
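
If the query string is built by client code, the rewrite could look something like this small Python sketch. The function name is hypothetical, and it only touches the simple case of a single trailing *.

```python
def force_wildcard_query(term):
    # Rewrite a plain prefix query such as "auto*" into "auto?*".
    # Terms that already contain other wildcards are left untouched.
    if term.endswith('*') and term.count('*') == 1 and '?' not in term:
        return term[:-1] + '?*'
    return term

print(force_wildcard_query('radioh*'))
```

Keep in mind the caveat in the quote: radioh?* needs at least one character after the prefix, so a document whose term is exactly "radioh" would no longer match.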

- J.J.










Re: Wildcards / Binary searches

2007-06-06 Thread galo

Yeah, I thought of that solution, but this is a 20G index with each
document having around 300 of those numbers, so I was a bit worried about
the performance. I'll try anyway, thanks!

On 06/06/07, Yonik Seeley wrote:

On 6/6/07, galo wrote:
> 3. I'm trying to implement another index where I store a number of int
> values for each document. Everything works ok as integers but i'd like
> to have some sort of fuzzy searches based on the bit representation of
> the numbers. Essentially, this number:
>
> 1001001010100
>
> would be compared to these two
>
> 1011001010100
> 1001001010111
>
> And the first would get a bigger score than the second, as it has only 1
> flipped bit while the second has 2.

You could store the numbers as a string field with the binary
representation, then try a fuzzy search.

  myfield:1001001010100~

-Yonik
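
As a point of comparison, the "number of flipped bits" measure from the question is the Hamming distance, which is easy to compute outside Solr. The Python sketch below only illustrates the ranking being asked for; it is not how the fuzzy query scores, and the function name is made up.

```python
def hamming_distance(a, b):
    # Count the positions at which two equal-length bit strings differ.
    if len(a) != len(b):
        raise ValueError('bit strings must have the same length')
    return sum(x != y for x, y in zip(a, b))

query = '1001001010100'
candidates = ['1001001010111', '1011001010100']

# Fewest flipped bits first.
ranked = sorted(candidates, key=lambda c: hamming_distance(query, c))
for c in ranked:
    print(c, hamming_distance(query, c))
```

A fuzzy query works on edit distance, which also allows insertions and deletions, so the two measures can disagree: 0101 and 1010 differ in all four bits but are only two edits apart (drop the leading 0, append a 0).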






Wildcards / Binary searches

2007-06-06 Thread galo

Hi,

Three questions:

1. I want to use solr for some sort of live search, querying with 
incomplete terms + wildcard and getting any similar results. Radioh* 
would return anything containing that string. The DisMax request handler 
doesn't accept wildcards in the q param so I'm trying the simple one and 
still have problems as all my results are coming back with score = 1 and 
I need them sorted by relevance. Is there a way of doing this? Why 
doesn't * work in dismax (nor ~ by the way)?


2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int 
values for each document. Everything works OK as integers but I'd like 
to have some sort of fuzzy search based on the bit representation of 
the numbers. Essentially, this number:


1001001010100

would be compared to these two

1011001010100
1001001010111

And the first would get a bigger score than the second, as it has only 1 
flipped bit while the second has 2.


Is it possible to implement this in solr?


Cheers,
galo



Re: wrong path in snappuller

2007-04-25 Thread galo

OK, I will create an issue.

I got round it by changing this

> : rsync -Wa${verbose}${compress} --delete ${sizeonly} \
> : ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/
> : ${data_dir}/${name}-wip

for

> : rsync -Wa${verbose}${compress} --delete ${sizeonly} \
> : ${stats} ${master_host}:${master_data_dir}/${name}/
> : ${data_dir}/${name}-wip

I had to remove the rsync:// as it was causing some problems finding the 
path and I didn't have much time to investigate. It works with absolute 
or relative paths set in the slave's master data folder param.


Why does it need to start an rsyncd on the master on a different port 
for each app? Is it not enough to call rsync on master:path?


Thanks for answering,

Galo


Chris Hostetter wrote:

: and I'm finding the same issues as
: https://issues.apache.org/jira/browse/SOLR-188 in the snappuller, I
: haven't looked in other scripts yet.
:
: rsync -Wa${verbose}${compress} --delete ${sizeonly} \
: ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/
: ${data_dir}/${name}-wip

that would be a separate issue from SOLR-188 ... 188 has to do with non
standard URLs, this seems to be an issue with snappuller assuming a
specific rsync path (which if i understand correctly, is relative to the
working directory of rsyncd?)

: Is this known or should I log it in JIRA?

please open a new Jira issue ... i'm guessing a new optional param will be
needed for the master's solr_home relative to the rsync server.


-Hoss







wrong path in snappuller

2007-04-24 Thread galo
I have downloaded all the scripts from the current version in the trunk 
and I'm finding the same issues as 
https://issues.apache.org/jira/browse/SOLR-188 in the snappuller, I 
haven't looked in other scripts yet.


rsync -Wa${verbose}${compress} --delete ${sizeonly} \
${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ 
${data_dir}/${name}-wip


That command fails in non-default installations because of the hard-coded /solr/ in the path.

Is this known or should I log it in JIRA?

thanks,

galo




Re: New docs need server restart after synchronization

2007-04-17 Thread galo
I was expecting to see the commit error, which is what told me before 
that the configuration was not correct, but the commit was made and 
logged correctly... on the master. I had copied scripts.conf from the 
master to all the slaves with solr_hostname=master and didn't replace 
it with slave1, slave2, etc., so the commit was successful, but not on 
the slave. Dumb. Dumb.
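
A quick sanity check along these lines would have caught it. This is just a Python sketch: it assumes scripts.conf is a plain list of key=value lines, the helper names are made up, and the sample contents are illustrative rather than my real config.

```python
def parse_scripts_conf(text):
    # Read simple key=value lines, skipping blanks and # comments.
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, value = line.split('=', 1)
        settings[key.strip()] = value.strip()
    return settings

def hostname_mismatch(conf_text, expected_host):
    # True when solr_hostname points at some other machine.
    host = parse_scripts_conf(conf_text).get('solr_hostname')
    return host not in (expected_host, 'localhost')

# Illustrative contents: a file copied unchanged from the master.
copied_conf = 'solr_hostname=master\nsolr_port=8080\n'
print(hostname_mismatch(copied_conf, 'slave1'))
```

Running it on each slave with that slave's own hostname flags the copied file, because the commit would be sent to the master instead of the local instance.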


While we're at it, what's the best setup for multiple indexes on 
the same server? At the moment I have a Tomcat server running on each 
machine, with a separate webapp for each index (5 for now, there will be 
many more).


Something like
 master: http://master:8080/index1, http://master:8080/index2
 slave1: http://slave1:8080/index1, http://slave1:8080/index2
 slave2: http://slave2:8080/index1, http://slave2:8080/index2
 slave3: http://slave3:8080/index1, http://slave3:8080/index2
 ...

On each server I have a main solr directory and then one for each index

solr
solr-index1
solr-index2

The main directory (solr) holds all the shared folders, so inside 
solr-index1 and solr-index2, bin, etc, lib, ext, webapp etc are links to 
solr/bin, solr/etc, solr/lib, etc. Obviously, conf, data and logs are 
not shared. This is reasonably simple to set up and install on other 
servers. The only real complaint is that I'd like to keep a single conf 
folder shared across master and slaves for each index (schema.xml, 
solrconfig.xml etc. should be the same, and scripts.conf can be shared as 
long as you use solr_hostname=localhost and the same port, which is my 
case). Not a tragedy anyway.
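
In case it helps, the layout can be scripted roughly like this. It is a Python sketch of the idea only: the function name and the use of relative symlinks are my choices for the example, not part of any Solr tooling.

```python
import os

SHARED = ['bin', 'etc', 'lib', 'ext', 'webapp']  # links into the main solr directory
PRIVATE = ['conf', 'data', 'logs']               # real directories, one set per index

def create_index_home(root, index_name, main='solr'):
    # Create root/solr-<index_name> with links for the shared folders
    # and real directories for the per-index ones.
    home = os.path.join(root, '%s-%s' % (main, index_name))
    os.makedirs(home, exist_ok=True)
    for name in SHARED:
        link = os.path.join(home, name)
        if not os.path.lexists(link):
            # Relative target, so the whole tree can be moved or copied.
            os.symlink(os.path.join('..', main, name), link)
    for name in PRIVATE:
        os.makedirs(os.path.join(home, name), exist_ok=True)
    return home
```

Calling create_index_home('/some/root', 'index1') and then again with 'index2' gives the two per-index directories described above, both sharing the folders under /some/root/solr.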


How are you dealing with these situations, are there better ways than this?

Cheers,
galo


Chris Hostetter wrote:

: problems for a few weeks, snappuller and snapinstaller run every hour
: normally, install the new index without errors etc.
:
: In the last couple of days it seems like I need to restart tomcat for
: the new documents to appear in the slave indexes. New docs are updated,
: commited and appear if I do a search on the master, but if after the
: synchronization I search in the slaves they don't appear. If I restart
: tomcat they do.

snappuller "notifies" the slaves that they should reopen the index by
executing the bin/commit script ... if that fails it should log a message
... but maybe bin/commit is logging an error in its log and exiting with
a successful return code?

do you see the "commit" log messages from Solr?  do you see the /update
requests in the slaves' tomcat access logs?





-Hoss





--
Galo Navarro, Developer

[EMAIL PROTECTED]
t. +44 (0)20 7780 7080

Last.fm | http://www.last.fm
Karen House 1-11 Baches Street
London N1 6DL

http://www.last.fm/user/galeote


Re: New docs need server restart after synchronization

2007-04-16 Thread galo

Yep, all normal..

galo

Bill Au wrote:

Did you check for error messages in the snappuller and snapinstaller
log files under solr/logs?  Distribution related errors will not show
up in the tomcat logs.

Bill









New docs need server restart after synchronization

2007-04-16 Thread galo

Hi there,

I've been running an index on 3 nodes (master + 2 slaves) without 
problems for a few weeks, snappuller and snapinstaller run every hour 
normally, install the new index without errors etc.


In the last couple of days it seems like I need to restart Tomcat for 
the new documents to appear in the slave indexes. New docs are updated, 
committed and appear if I do a search on the master, but if I search the 
slaves after the synchronization they don't appear. If I restart 
Tomcat they do.


I remember this happening while I was setting up the index and seeing an 
error somewhere (something like it not being able to open a new searcher, 
if I remember well), but I'm pretty sure the configuration is OK. It's 
been working normally for weeks and I haven't made any changes since then.


Any ideas where I should look? I can't see any error messages in the 
Tomcat logs or in the scripts' logs!


thanks,
galo


Re: Question: index performance

2007-04-13 Thread galo

Hi there,

I'm building an index to which I'm sending a few hundred thousand 
entries. I pull them off the database in batches of 25k and send them to 
solr, 100 documents at a time. I was doing a commit after each of those 
but after what Yonik says I will remove it and commit only after each 
batch of 25k.
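
In outline, the loop I'm moving to looks like the Python sketch below. The helper names are made up, and post stands for whatever function sends one XML message to the /update handler, so that part is an assumption left to the caller.

```python
from itertools import islice
from xml.sax.saxutils import escape, quoteattr

def chunked(items, size):
    # Yield successive lists of at most `size` items.
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def add_message(docs):
    # Build one <add> message containing several <doc> elements.
    parts = ['<add>']
    for doc in docs:
        parts.append('<doc>')
        for name, value in doc.items():
            parts.append('<field name=%s>%s</field>'
                         % (quoteattr(name), escape(str(value))))
        parts.append('</doc>')
    parts.append('</add>')
    return ''.join(parts)

def index_batch(rows, post, docs_per_add=100):
    # Send the rows in small add requests, then commit once at the end
    # of the database batch instead of after every request.
    for chunk in chunked(rows, docs_per_add):
        post(add_message(chunk))
    post('<commit/>')
```

For a 25k batch this means 250 add requests followed by a single commit.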


Q1: I've got autocommit set to 1000 now in solrconfig.xml. Should I 
disable it in this scenario?


Q2: To decide which of those 25k are going to be indexed, we need to do 
a query for each (this is the main reason to optimize before a new DB 
batch is indexed). Each of these 25k queries takes around 30ms, which is 
good enough for us, but I've observed that every ~30 queries the time of 
one search goes up to 150ms or even 1200ms. Then it does another ~30, etc. 
I guess there is something happening inside the server regularly that 
causes it. Any clues what it can be and how I can minimize that time?


Q3: The 25k searches are done without any cumulative effect on 
performance (avg/search is ~30ms from start to end). But if I start 
posting documents to the index immediately afterwards, Tomcat's CPU usage 
peaks. If I restart Tomcat and then post the 25k documents without doing 
those searches first, they're very quick. Is there any reason why the 
searches would affect Tomcat in a way that explains this? Just to clarify, 
searches are NOT done at the same time as indexing.


My tomcat is running with -server -Xmx512m -Xms512m

Cheers,

galo

Yonik Seeley wrote:

On 4/13/07, James liu wrote:

i find it will be OutOfMemory when i get more than 10k records.

so now i index 10k records (5k / record)


In one request?  There's really no reason to put more than hundreds of
documents in a single add request.

If you are indexing using multiple requests, and always run into
problems at 10k records, you are probably hitting memory issues with
Lucene merging.  If that's the case, try lowering the mergeFactor so
fewer segments will be merged at the same time.

Some other things to be careful of:
- don't call commit after you add every batch of documents
- don't set maxBufferedDocs too high if you don't have the memory

-Yonik



Re: problems finding negative values

2007-04-05 Thread galo

Ah! thanks.

Wrapping the term in quotes solves the issue, but I've tried escaping 
with \- as Yonik suggested and that doesn't work. I guess there's no 
performance difference between the two so I can live with quotes, but 
for curiosity's sake, should \ work?
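
For anyone hitting the same thing, this is roughly how I now build the request, shown as a Python sketch. The function name is made up; the parameters are the ones from my original URLs. A leading - is the query parser's "prohibit" operator, which is why the bare value finds nothing, and quoting makes it part of the term.

```python
from urllib.parse import urlencode

def select_url(base, term):
    # Wrap the term in double quotes so a leading '-' is treated as
    # part of the value rather than as the prohibit operator.
    params = {
        'q': '"%s"' % term,
        'fl': 'id,length,key',
        'start': 0,
        'rows': 50,
    }
    # urlencode takes care of escaping the quotes for the URL.
    return base + '?' + urlencode(params)

print(select_url('http://localhost:8080/solr/select/', '-1861807411'))
```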


thanks,
galo

Jeff Rodenburg wrote:

This one caught us as well.

Refer to
http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping%20Special%20Characters
to understand which characters need to be escaped in your queries.











problems finding negative values

2007-04-04 Thread galo

Hi,

I have an index consisting of the following fields:



multiValued="true" />


Each doc has a few key values, some of which are negative.

Ok, I know there's a document that has both 826606443 and -1861807411

If I search with

http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=-1861807411&fl=id,length,key

I get no results, but if I do

http://localhost:8080/solr/select/?stylesheet=&version=2.1&start=0&rows=50&indent=on&q=826606443&fl=id,length,key

I get the document as expected.

Obviously the key field is configured as a search field, indexed, etc. 
but somehow solr doesn't like negatives. I'm assuming this might have 
something to do with analysers but can't tell how to fix it.. any ideas??


Thanks

galo


failing post-optimize command execution

2007-03-28 Thread galo

Hi,

I've configured my solrconfig.xml to run snapshooter after an 
optimize, but I keep getting the following exception in the 
Tomcat logs:


SEVERE: java.io.IOException: Cannot run program "snapshooter" (in 
directory "/home/solr/solr/bin"): java.io.IOException: error=2, No such 
file or directory


I'm certain the path and filename are correct. Has anybody else had 
problems with this?


Cheers,

galo


Re: Solr on Tomcat 6.0.10?

2007-03-08 Thread galo

I'm using 6.0.9 and no issues (fingers crossed)

Walter Underwood wrote:

Is anyone running Solr on Tomcat 6.0.10? Any issues?
I searched the archives and didn't see anything.

wunder
  






Re: Time after snapshot is "visible" on the slave

2007-03-06 Thread galo
Yep, the snapinstaller was failing and it was the same problem that Jeff 
posted this morning about bin/optimize, but this time with bin/commit 
not using ${webapp_name}.


I fixed that and it worked normally.  I've submitted a bug to JIRA as I 
think Jeff hadn't submitted it yet.


Mm now I see your other email.. oh well..

Thanks for your help,

Graham Stead wrote:

Hi Galo,

The snapinstaller actually performs a commit as its last step, so if that
didn't work, it's not surprising that running commit separately didn't work,
either.

I would suggest running the snapinstaller and/or commit scripts with the -V
option. This will produce verbose debugging information and allow you to see
where they encounter problems.

Hope this helps,
-Graham



  






Time after snapshot is "visible" on the slave

2007-03-06 Thread galo

Hi,

I've been testing index replication and after snappulling and installing 
the latest version of the master index, if I run a query on the slave I 
don't get any results back (I tried a commit in desperation, which didn't 
work either). If I restart the web server (Tomcat) then it works.


Am I missing any steps, or just being too impatient sending queries?

Cheers




Multiple instances, wiki out of date?

2007-02-26 Thread galo

Hi there,

I've been following the instructions from 
http://wiki.apache.org/solr/SolrJetty?highlight=%28Multiple%29%7C%28Solr%29%7C%28Webapps%29solr 

to get a few indexes running under the same instance of jetty 6.1.2. If 
I use the webapp descriptors as specified in the wiki (with correct 
paths, I'm just pasting the example here)..



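<Call name="addWebApplication">
  <Arg>/solr1/*</Arg>
  <Arg>/your/path/to/the/solr.war</Arg>
  <Set name="extractWAR">true</Set>
  <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
  <Call name="addEnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="String">/your/path/to/your/solr/home/dir</Arg>
  </Call>
</Call>

<Call name="addWebApplication">
  <Arg>/solr2/*</Arg>
  <Arg>/your/path/to/the/solr.war</Arg>
  <Set name="extractWAR">true</Set>
  <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
  <Call name="addEnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="String">/your/path/to/your/alternate/solr/home/dir</Arg>
  </Call>
</Call>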


Jetty complains that:

2007-02-26 18:36:04.874::INFO:  Logging to STDERR via 
org.mortbay.log.StdErrLog
2007-02-26 18:36:05.066::WARN:  Config error at name="addWebApplication">/solr1/*/your/path/to/the/solr.warname="extractWAR">truename="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xmlname="addEnvEntry">solr/hometype="String">/your/path/to/your/solr/home/dir

2007-02-26 18:36:05.066::WARN:  EXCEPTION
java.lang.IllegalStateException: No Method: name="addWebApplication">/solr1/*/your/path/to/the/solr.warname="extractWAR">truename="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xmlname="addEnvEntry">solr/hometype="String">/your/path/to/your/solr/home/dir on 
class org.mortbay.jetty.Server

   at org.mortbay.xml.XmlConfiguration.call(XmlConfiguration.java:548)
   at 
org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:241)
   at 
org.mortbay.xml.XmlConfiguration.configure(XmlConfiguration.java:203)

   at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:919)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:585)
   at org.mortbay.start.Main.invokeMain(Main.java:183)
   at org.mortbay.start.Main.start(Main.java:497)
   at org.mortbay.start.Main.main(Main.java:115)
2007-02-26 18:36:05.068::INFO:  Shutdown hook executing
2007-02-26 18:36:05.068::INFO:  Shutdown hook complete

I've been looking at the Jetty API and it looks like those methods are 
deprecated in the latest versions of Jetty. Anyway, I can get several 
instances to run together using the descriptor shown below and several 
war files



 
   
 
 default="."/>/webapps-plus

 false
 true
 false
 default="."/>/etc/webdefault.xml

   
 


This is good enough for me but the problem then is that all point to the 
same data/index folder sharing the same index and I need them to use 
different indexes. The question is, how can you configure solr.home 
differently for each of the solr instances deployed in the webapps-plus 
folder?


It would be equally valid if there is a way of fixing the xml in the 
wiki so individual war files can be specified passing a different 
solr.home to each..


thanks,

galo.