Solr 1.4 Replication index directories

2010-01-28 Thread mark angelillo


We're using the new replication and it's working pretty well. There's  
one detail I'd like to get some more information about.

As the replication works, it creates versions of the index in the data  
directory. Originally we had index/, but now there are dated versions  
such as index.20100127044500/, which are the replicated versions.

Each copy is sized in the vicinity of 65G. With our current hard drive  
it's fine to have two around, but 3 gets a little dicey. Sometimes  
we're finding that the replication doesn't always clean up after  
itself. I would like to understand this better, or to not have this  
happen. It could be a configuration issue.

Some more specific questions:

- Is it safe to remove the index/ directory (that doesn't have the  
date on it)? I think I tried this once and the whole thing broke,  
however maybe something else was wrong at the time.

- Is there a way to know which one is the current one? (I'm looking at  
the file, and it seems to be correct, but sometimes  
there's a newer version in the directory, which later is removed)

- Could it be that the index does not finish replicating in the poll  
interval I give it? What happens if, say there's a poll interval X and  
replicating the index happens to take longer than X sometimes. (Our  
current poll interval is 45 minutes, and every time I'm watching it it  
completes in time.)

Thanks in advance

Re: Solr 1.4 Replication index directories

2010-01-28 Thread mark angelillo

Thanks, Otis. Responses inline.


We're using the new replication and it's working pretty well. There's
one detail I'd like to get some more information about.

As the replication works, it creates versions of the index in the data
directory. Originally we had index/, but now there are dated versions
such as index.20100127044500/, which are the replicated versions.

Each copy is sized in the vicinity of 65G. With our current hard drive
it's fine to have two around, but 3 gets a little dicey. Sometimes
we're finding that the replication doesn't always clean up after
itself. I would like to understand this better, or to not have this
happen. It could be a configuration issue.

Some more specific questions:

- Is it safe to remove the index/ directory (that doesn't have the
date on it)? I think I tried this once and the whole thing broke,
however maybe something else was wrong at the time.

No, that's the real, live index, you don't want to remove that one.

Yeah... I tried it once and remember things breaking.

However, nothing in this directory has been modified for over a week
(since the last replication initialization), and I'm still sitting on
130GB of data for what is only 65GB on the master.

- Is there a way to know which one is the current one? (I'm looking at
the file, and it seems to be correct, but sometimes there's a newer
version in the directory, which later is removed)

I think the index one is always current, no?  If not, I imagine  
the admin replication page will tell you, or even the Statistics page.

reader :   
readerDir : 

reader :  
readerDir : 

- Could it be that the index does not finish replicating in the poll
interval I give it? What happens if, say there's a poll interval X and
replicating the index happens to take longer than X sometimes. (Our
current poll interval is 45 minutes, and every time I'm watching it it
completes in time.)

I think only 1 replication will/should be happening at a time.

Whew, that's comforting.
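
A minimal Python sketch for checking which directory a slave is serving before cleaning anything up by hand. It assumes the data directory may hold an index.properties file whose `index` key names the active directory (which is how Solr 1.4 replication is understood to record it), and that plain index/ is live when that file is absent. The function names are hypothetical, and the parsing handles only simple `key=value` lines, so verify the result against the admin pages before deleting anything.

```python
import os


def current_index_dir(data_dir):
    """Return the name of the index directory assumed to be live."""
    props_path = os.path.join(data_dir, "index.properties")
    if os.path.isfile(props_path):
        with open(props_path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                # Skip blank lines and properties-file comments.
                if not line or line.startswith("#"):
                    continue
                key, _, value = line.partition("=")
                if key.strip() == "index" and value.strip():
                    return value.strip()
    # Assumption: no properties file means the plain index/ is live.
    return "index"


def stale_index_dirs(data_dir):
    """List index* directories other than the one assumed to be live."""
    live = current_index_dir(data_dir)
    return sorted(
        name
        for name in os.listdir(data_dir)
        if (name == "index" or name.startswith("index."))
        and os.path.isdir(os.path.join(data_dir, name))
        and name != live
    )
```

The list returned by `stale_index_dirs` is a set of candidates to inspect, not a list that is safe to delete automatically: a directory that looks stale may belong to a replication that is still in progress.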

Re: Forced Top Document

2007-10-25 Thread mark angelillo
Thanks for your thoughts, Chris. I agree with you about the user's  
experience. Snooth doesn't serve any ads/sponsored results -- the  
goal here is to make sure that the most recent document the user has  
acted on shows up top in searches for recent activity. My aim is to  
forcibly preserve the sort order until the document can be reindexed/ 

Since the dynamic field is too memory intensive, I'll try boosting on  
the date field -- and boosting more on the date field for the  
document that needs to be up top. If that doesn't end up working I'll  
just perform two queries and be done with it.
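
The two-query fallback can be sketched in Python as follows. This is only an illustration: it assumes each result is a dict carrying a unique `id` field, and `pin_to_top` is a hypothetical helper name. One query fetches the document that must stay on top, the other is the normal sorted search, and the two lists are merged on the client.

```python
def pin_to_top(pinned_docs, search_docs, id_field="id"):
    """Return pinned_docs first, then search_docs minus the pinned ids."""
    pinned_ids = {doc[id_field] for doc in pinned_docs}
    # Drop pinned documents from the regular results to avoid duplicates.
    rest = [doc for doc in search_docs if doc[id_field] not in pinned_ids]
    return list(pinned_docs) + rest
```

Pagination needs some care with this approach, since the first page shows one fewer regular result for every pinned document.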


On Oct 25, 2007, at 3:11 AM, Chris Hostetter wrote:

: The typical use case, though, is for the featured document to be on top only
: for certain queries.  Like in an intranet where someone queries 401K or
: retirement or similar, you want to feature a document about benefits that
: would otherwise rank really low for that query.  I have not been able to make
: sorting strategies work very well.

this type of question typically falls into two use cases:
  1) targeted ads
  2) sponsored results

in the targeted ads case, the special matches aren't part of the
flow of results, and don't fit into pagination -- they always appear at
the top, or to the right, on every page, no matter what the sort.  this
kind of usage doesn't really need any special logic, it can be solved as
easily by a second Solr hit as it can by custom request handler logic.

in the sponsored results use case, the special matches should appear
in the normal flow of results as the #1 (2, 3, etc) matches, so that they
don't appear on page #2 ... but that also means that it's extremely
disconcerting for users if those matches are still at the top when the
users re-sort.  if a user is looking at product listings, sorted by
relevancy and the top 3 results all say they are sponsored, that's fine
... but if the user sorts by price and those 3 results are still at the
top of the list, even though they clearly aren't the cheapest, that's just
going to piss the user off.

in my professional opinion: don't fuck with your users.  default to
whatever order you want, but if the user specifically requests to sort the
results by some option, do it.

assuming you follow my professional opinion, then boosting docs to have
an artificially high score will work fine.

if you absolutely *MUST* have certain docs sorting before others,
regardless of which sort option the user picks, then it is still possible
to do ... i'm hesitant to even say how, but if people insist:
always sort by score first, then by whatever field the user wants to sort
by ... but when the user wants to sort on a specific field, move the user's
main query input into an fq (so it doesn't influence the score) ... and
use an extremely low boost matchalldocs query along with your special doc
matching query as the main (scoring) query param.  the key being that
even though your primary sort is on score, every doc except your special
matches has an identical score.

(this may not be possible with dismax because it's not trivial to move
the query into an fq, it might work if you can use 0 as the boost on
fields in the qf so it still dictates the matches but doesn't influence
the score enough to throw off the sort)
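
The score-first trick described above can be sketched as a set of request parameters. This is a hedged Python illustration, not Chris's own code. It assumes a unique key field named `id`, a default query operator of OR, and a separate `sort` parameter (the examples elsewhere in this thread put the sort after a semicolon in `q` instead). The helper name and the boost values are hypothetical.

```python
from urllib.parse import urlencode


def pinned_sort_params(user_query, pinned_ids, sort_field, order="asc",
                       id_field="id"):
    """Build params that keep pinned docs on top of a field sort."""
    pinned = " OR ".join(f"{id_field}:{doc_id}" for doc_id in pinned_ids)
    # The low-boost match-all clause gives every ordinary document the
    # same tiny score; only the pinned ids score higher.
    main_query = f"*:*^0.0001 ({pinned})^1000"
    return {
        "q": main_query,
        # The user's real query only filters; it does not affect score.
        "fq": user_query,
        # Score first (pinned docs), then the sort the user asked for.
        "sort": f"score desc, {sort_field} {order}",
    }


if __name__ == "__main__":
    print(urlencode(pinned_sort_params("merlot", [700390], "price")))
```

Because the user's query lives in `fq`, relevancy ranking is lost entirely, which is acceptable here only because the user has asked for a field sort.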


mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
snooth -- 1.8 million ratings and counting...

Re: Forced Top Document

2007-10-24 Thread mark angelillo

That's the ticket exactly, Kyle.

What I have is the ID of my document, so I indexed a dynamic field  
with name id_*. Then I just set that field for each document with the  
proper ID.

So for example, to pop one document to the top of the index, I just run:

q=field: value; id_700390+desc, date+desc

Works like a charm, even with multiple documents.

q=field: value; id_700390+desc, id_604030+desc, date+desc
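
A small Python sketch of how the sort clause in these queries could be assembled. It assumes the `id_*` dynamic-field naming described above and a `date` field as the fallback sort; `pinned_sort_clause` is a hypothetical helper name.

```python
def pinned_sort_clause(pinned_ids, fallback="date desc"):
    """Build 'id_<n> desc, ..., <fallback>' for the given document ids."""
    parts = [f"id_{doc_id} desc" for doc_id in pinned_ids]
    parts.append(fallback)
    return ", ".join(parts)
```

In the request URLs above, the spaces in the clause appear as `+`. Every pinned id adds one more sort field to the request.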


On Oct 24, 2007, at 4:15 PM, Kyle Banerjee wrote:

The typical use case, though, is for the featured document to be on top only
for certain queries.  Like in an intranet where someone queries 401K or
retirement or similar, you want to feature a document about benefits that
would otherwise rank really low for that query.  I have not been able to make
sorting strategies work very well.

Depending on how many of these certain queries you have, it seems like
you could still use some variation of the strategy based on a bogus
tag sort. If you place a dynamic field for each query term (e.g.
foo_s, bar_s, etc) relevant to a document and then detect when one of
the special query terms is used, you can still sort on the
appropriate dynamic field before applying the rest of the sort.



Forced Top Document

2007-10-23 Thread mark angelillo

Hi all,

Is there a way to get a specific document to appear on top of search  
results even if a sorting parameter would push it further down?

Thanks in advance,

mark angelillo

Re: Filter by Group

2007-09-20 Thread mark angelillo

Thanks, Pieter. I'll go for that then.


On Sep 19, 2007, at 10:15 PM, Pieter Berkel wrote:

Sounds like you're on the right track. If your groups overlap (i.e. a
document can be in group A and B), then you should ensure your
field is multivalued.

If you are searching for foo in documents contained in group A, then it
might be more efficient to use a filter query (fq) like:


See the wiki page on common query parameters for more info: 
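
A minimal Python sketch of the filter-query approach. It is not the example from the original message. It assumes a multivalued field named `groups` (the name used in the question) and the default Lucene query syntax; `group_filter` and `search_params` are hypothetical helper names.

```python
def group_filter(groups, field="groups"):
    """Build an fq value matching documents in any of the given groups."""
    if not groups:
        raise ValueError("at least one group is required")
    if len(groups) == 1:
        return f"{field}:{groups[0]}"
    # field:(A OR C OR D) matches a document that is in any listed group.
    return f"{field}:(" + " OR ".join(groups) + ")"


def search_params(query, groups):
    """Keep the user's query in q and the group restriction in fq."""
    return {"q": query, "fq": group_filter(groups)}
```

For the two cases in the question, `search_params("foo", ["A"])` restricts the search to group A, and `search_params("foo", ["A", "C", "D"])` restricts it to documents in A, C or D. Keeping the restriction in `fq` means it does not affect scoring.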


On 20/09/2007, mark angelillo [EMAIL PROTECTED] wrote:

Hey all,

Let's say I have an index of one hundred documents, and these
documents are grouped into 4 groups A, B, C, and D. The groups do in
fact overlap. What would people recommend as the best way to apply a
search query and return only the documents that are in group A? Also,
how about if we run the same search query but return only those
documents in groups A, C and D?

I imagine that I could do this by indexing a text field populated
with the group names and adding something like groups:A to the
query but I'm wondering if there's a better solution.

Thanks in advance,

mark angelillo


Filter by Group

2007-09-19 Thread mark angelillo

Hey all,

Let's say I have an index of one hundred documents, and these  
documents are grouped into 4 groups A, B, C, and D. The groups do in  
fact overlap. What would people recommend as the best way to apply a  
search query and return only the documents that are in group A? Also,  
how about if we run the same search query but return only those  
documents in groups A, C and D?

I imagine that I could do this by indexing a text field populated  
with the group names and adding something like groups:A to the  
query but I'm wondering if there's a better solution.

Thanks in advance,

mark angelillo

Re: Error loading custom similarity class

2007-04-09 Thread mark angelillo

Thanks, Yonik. I was definitely missing that.

On Apr 9, 2007, at 2:08 PM, Yonik Seeley wrote:

On 4/9/07, mark angelillo [EMAIL PROTECTED] wrote:

package org.apache.lucene.misc;


2) Create the .jar file. (Maybe I'm doing this wrong?)

  javac -classpath lucene-core-nightly.jar
  jar -cvf CustomSimilarity.jar CustomSimilarity.class

This may be the problem.  The path in the jar file needs to reflect  
the package.
So the CustomSimilarity.class file needs to be in the org/apache/ 
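
One way to check a jar for this problem, sketched in Python. A jar is a zip archive, so the standard `zipfile` module can list its entries. The helper names are hypothetical, and the fully qualified class name is assumed from the `package` line quoted above.

```python
import zipfile


def expected_jar_entry(class_name):
    """Map a fully qualified class name to its path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jar_contains_class(jar_path, class_name):
    """Return True if the jar holds the class at its package path."""
    with zipfile.ZipFile(jar_path) as jar:
        return expected_jar_entry(class_name) in jar.namelist()
```

For example, `jar_contains_class("CustomSimilarity.jar", "org.apache.lucene.misc.CustomSimilarity")` returns False when the class file was added at the top level of the jar, as the `jar -cvf` command quoted above does.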



Re: dynamicField Scaling

2007-03-07 Thread mark angelillo

On Mar 7, 2007, at 2:17 PM, Mike Klaas wrote:

On 3/7/07, mark angelillo [EMAIL PROTECTED] wrote:


I've got a Solr index running and I want to use a dynamicField to
store n different sorting fields. The field that is used to actually
sort the results will be determined by the application that is
querying the index.

I'm wondering if anyone has done something similar to this, or if
anyone has an idea of how Solr will perform as the number n of
sorting fields grows larger. Is there a way to make sure this doesn't
start to slow the index down? Is there any information out there
about the number of dynamicFields that can be declared in this way
before the entire index suffers? Is there such a limit?

It's not really about the number of dynamic fields.  The key variable
is the number of sort fields.  To sort efficiently, solr needs to
maintain a cache of field values.  This consumes memory per-field on
the order of

D x S + U

where D is the document count, S is the size of the data type (e.g.
4 bytes for ints, 8 bytes for doubles, 4/8 bytes for anything else
[pointers]), and U is the cumulative size of the unique field values
(if sorting on a non-primitive type, like Strings).

If you have sufficient memory to store this data for each field you
are sorting on, you shouldn't have any problems.
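
The estimate above can be turned into a quick back-of-the-envelope calculation. This Python sketch only restates the D x S + U formula; the function names are hypothetical, and the result is a rough order-of-magnitude figure, not a measurement of actual heap use.

```python
def sort_cache_bytes(doc_count, value_size, unique_values_bytes=0):
    """Estimate memory for one sort field as D x S + U."""
    return doc_count * value_size + unique_values_bytes


def total_sort_cache_bytes(doc_count, fields):
    """Sum the estimate over (value_size, unique_values_bytes) pairs."""
    return sum(
        sort_cache_bytes(doc_count, size, unique)
        for size, unique in fields
    )
```

For instance, sorting one million documents on an int field comes to about 4 MB for that field by this formula, and the total grows linearly with the number of fields actually sorted on.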


Okay, makes sense.
