Re: Broken attachment link on Wiki

2011-07-11 Thread Simon Wistow
Bump?


On Mon, Jun 27, 2011 at 06:17:42PM +0100, me said:
 On the SolrJetty page 
 
 http://wiki.apache.org/solr/SolrJetty
 
 there's a link to a tar ball
 
 http://wiki.apache.org/solr/SolrJetty?action=AttachFile&do=view&target=DEMO_multiple_webapps_jetty_6.1.3.tgz
 
 which fails with the error
 
 You are not allowed to do AttachFile on this page.
 
 Can someone fix it somehow? Or put the file elsewhere?
 
 


Broken attachment link on Wiki

2011-06-27 Thread Simon Wistow
On the SolrJetty page 

http://wiki.apache.org/solr/SolrJetty

there's a link to a tar ball

http://wiki.apache.org/solr/SolrJetty?action=AttachFile&do=view&target=DEMO_multiple_webapps_jetty_6.1.3.tgz

which fails with the error

You are not allowed to do AttachFile on this page.

Can someone fix it somehow? Or put the file elsewhere?




Multiple Solrs on the same box

2011-06-20 Thread Simon Wistow
First, a couple of assumptions.

We have boxes with a large amount (~70Gb) of memory on which we're running 
Solr. We've currently set -Xmx to 25Gb with the GC settings

-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC 
-XX:+CMSIncrementalMode 
-XX:+CMSIncrementalPacing

We're reluctant to up the -Xmx because when a stop-the-world GC does 
eventually happen it'll be pretty devastating. But we also have a bunch 
of spare memory lying around.

So we're wondering if running multiple Solrs is the right thing to do - 
that way we'll be using all our memory without very long GC pauses. 

Of course, if that assumption is wrong then the rest of this mail is 
irrelevant. 

We're currently using Tomcat but we're pondering moving to Jetty. Whilst 
I've managed to get multiple Solr apps running on different ports under 
the same Jetty instance, I can't seem to get them configured via JNDI.
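
For what it's worth, the kind of per-webapp context file I've been 
experimenting with looks roughly like this (Jetty 6 syntax, and it 
assumes jetty-plus is enabled; the context path and home directory are 
made up, so treat it as a sketch rather than a known-good config):

  <Configure class="org.mortbay.jetty.webapp.WebAppContext">
    <!-- one of these files per Solr instance, each with its own path and home -->
    <Set name="contextPath">/solr1</Set>
    <Set name="war"><SystemProperty name="jetty.home"/>/webapps/solr.war</Set>
    <!-- bind solr/home via JNDI for this webapp only -->
    <New class="org.mortbay.jetty.plus.naming.EnvEntry">
      <Arg>solr/home</Arg>
      <Arg type="java.lang.String">/data/solr1</Arg>
      <Arg type="boolean">true</Arg>
    </New>
  </Configure>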

It looks like someone put a tar ball with details of how to do that on 
the Wiki 

http://wiki.apache.org/solr/SolrJetty#JNDI_Caveats_Noted_By_Users

but the permissions have been set so that you can't actually download 
it.

So - three questions really:

Am I barking up the wrong tree or are multiple instances a good idea?

Is Jetty worth it or should I just stick to Tomcat?

Can someone set the permissions on the wiki so I can download that file?
;)


cheers,

Simon


Expunging deletes from a very large index

2011-06-06 Thread Simon Wistow
Due to some emergency maintenance I needed to run delete on a large 
number of documents in a 200Gb index.

The problem is that it's taking an inordinately long amount of time (2+ 
hours so far and counting) and is steadily eating up disk space - 
presumably up to 2x index size which is getting awfully close to the 
wire on this machine.

Is that inevitable? Is there any way to speed up the process or use less 
space? Maybe do an optimize with a different number of maxSegments?
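
Concretely, the two variants I have in mind look something like this 
(hypothetical host/port, and I haven't verified that both attributes 
behave this way on 1.4.1):

  # merge away deleted docs without a full optimize
  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' \
       --data-binary '<commit expungeDeletes="true"/>'
  # partial optimize down to a target segment count rather than 1
  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' \
       --data-binary '<optimize maxSegments="16"/>'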

I suspect not but I thought it was worth asking.






Negative OR in fq field not working as expected

2011-04-25 Thread Simon Wistow
I have a field 'type' that has several values. If it's type 'foo' then 
it also has a field 'restriction_id'.

What I want is a filter query which says: either it's not a 'foo', or if 
it is then it has the restriction '1'.

I expect two matches - one of type 'bar' and one of type 'foo' 

Neither

 fq=(-type:foo OR restriction_id:1)
 fq={!dismax q.op=OR}-type:foo restriction_id:1

produces any results.

 fq=restriction_id:1

gets the 'foo' typed result.

 fq=type:bar 

get the 'bar' typed result.

Either of these

  fq=type:[* TO *] OR (type:foo AND restriction_id:1)
  fq=type:(bar OR quux OR fleeg) OR restriction_id:1

do work but are very, very slow to the point of unusability (our indexes 
are pretty large).

Searching round it seems like other people have experienced similar 
issues and the answer has been "Lucene just doesn't work like that":

"When dealing with Lucene people are strongly encouraged to think in 
terms of MUST, MUST_NOT and SHOULD (which are represented in the query 
parser as the prefixes +, - and the default) instead of in terms of 
AND, OR, and NOT ... Lucene's Boolean Queries (and thus Lucene's 
QueryParser) is not a strict Boolean Logic system, so it's best not to 
try and think of it like one."

  http://wiki.apache.org/lucene-java/BooleanQuerySyntax

Am I just out of luck? Might edismax help here?

Simon








Re: Negative OR in fq field not working as expected

2011-04-25 Thread Simon Wistow
On Mon, Apr 25, 2011 at 04:34:05PM -0400, Jonathan Rochkind said:
 This is what I do instead, to rewrite the query to mean the same thing but 
 not give the lucene query parser trouble:
 
 fq=( (*:* AND -type:foo) OR restriction_id:1)
 
 *:* means everything, so (*:* AND -type:foo) means the same thing as 
 just -type:foo, but can get around the lucene query parsers troubles.
 
 So that might work for you.

Thanks for confirming my suspicions.

Unfortunately I've tried that as well and, whilst it works 
it's also unbelievably slow (~30s query time).

Would writing my own Query Parser help here?

Simon






Re: Negative OR in fq field not working as expected

2011-04-25 Thread Simon Wistow
On Mon, Apr 25, 2011 at 05:02:12PM -0400, Yonik Seeley said:
 It really shouldn't be that slow... how many documents are in your
 index, and how many match -type:foo?

Total number of docs is 161,000,000

 type:foo  39,000,000
-type:foo 122,200,000 
 type:bar 90,000,000

We're aware it's large and we're in the process of splitting the index 
up but I was just hoping that there was a workaround I could use in 
order to reclaim some performance.






Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-04-05 Thread Simon Wistow
On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said:
 Just curious, was there any resolution to this?

Not really.

We tuned the GC pretty aggressively - we use these options

-server 
-Xmx20G -Xms20G -Xss10M
-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC 
-XX:+CMSIncrementalMode 
-XX:+CMSIncrementalPacing
-XX:SoftRefLRUPolicyMSPerMB=10

and we've played a little with -XX:+UseCompressedOops and -XX:+AggressiveOpts.

We also backported the MMapDirectory factory to 1.4.1 and that helped a 
lot.
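
For anyone wanting to do the same, the hook is the directoryFactory 
element in solrconfig.xml - the class name below is just what we 
registered our backported factory as, so treat it as illustrative:

  <!-- use memory-mapped index files instead of the default FSDirectory -->
  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>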

We do still get spikes of long (5s-20s) queries a few times an hour 
which don't appear to be caused by any kind of Query of Death. 
Occasionally (once every few days) one of the slaves will experience a 
period of sustained slowness but recovers by itself in less than a 
minute.

According to our GC logs we haven't had a full GC for a long time. 

Currently the state of play is that we commit on our master every 5000ms 
and the slaves replicate every 2 minutes. Our response times for 
searches on the slaves are about 180-270ms but if we turn off 
replication then we get 60-90ms. So something is clearly up with that.

Having talked to the good people at Lucid we're going to try playing 
around with commit intervals, upping our mergeFactor from 10 to 25 and 
maybe using the BalancedSegmentMergePolicy. 
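
In solrconfig.xml terms that would be something along these lines (the 
mergePolicy line is a guess at this point - BalancedSegmentMergePolicy 
lives in Lucene contrib/misc and I haven't confirmed the exact class 
name or that it drops straight into 1.4.1):

  <indexDefaults>
    <mergeFactor>25</mergeFactor>
    <!-- hypothetical: the balanced merge policy from Lucene contrib/misc -->
    <mergePolicy>org.apache.lucene.index.BalancedSegmentMergePolicy</mergePolicy>
  </indexDefaults>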

The system seems to be stable at the moment which is good but obviously 
we'd like to lower our query times if possible.

Hopefully this might be of some use to somebody out there, sometime.

Simon




Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-02-07 Thread Simon Wistow
On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
 Heap usage can spike after a commit. Existing caches are still in use and new 
 caches are being generated and/or auto warmed. Can you confirm this is the 
 case?

We see spikes after replication which I suspect is, as you say, because 
of the ensuing commit.

What we seem to have found is that when we weren't using the Concurrent 
GC, stop-the-world GC runs would kill the app. Now that we're using CMS 
we occasionally find ourselves in situations where the app still has 
memory left over but the load on the machine spikes, the GC duty cycle 
goes to 100% and the app never recovers.

Restarting usually helps but sometimes we have to take the machine out 
of the load balancer, wait for a number of minutes and then put it back 
in.

We're working on two hypotheses 

Firstly - we're CPU bound somehow and at some point we cross some 
threshold and GC or something else is just unable to keep up. So 
whilst it looks like instantaneous death of the app it's actually 
gradual resource exhaustion where the definition of 'gradual' is 'a very 
short period of time' (as opposed to some cataclysmic infinite loop bug 
somewhere).

Either that or ... Secondly - there's some sort of Query Of Death that 
kills machines. We just haven't found it yet, even when replaying logs. 

Or some combination of both. Or other things. It's maddeningly 
frustrating.

We've also got to try deploying a custom solr.war and using the 
MMapDirectory to see if that helps with anything.







Re: Searching for negative numbers very slow

2011-02-07 Thread Simon Wistow
On Fri, Jan 28, 2011 at 12:29:18PM -0500, Yonik Seeley said:
 That's odd - there should be nothing special about negative numbers.
 Here are a couple of ideas:
   - if you have a really big index and querying by a negative number
 is much more rare, it could just be that part of the index wasn't
 cached by the OS and so the query needs to hit the disk.  This can
 happen with any term and a really big index - nothing special for
 negatives here.
  - if -1 is a really common value, it can be slower.  is fq=uid:\-2 or
 other negative numbers really slow also?

This was my first thought; -1 is relatively common, but we have other 
numbers that are just as common. 


Interestingly enough

fq=uid:-1
fq=foo:bar
fq=alpha:omega

is much (4x) slower than

q=uid:-1 AND foo:bar AND alpha:omega

but only when searching for that number.

I'm going to wave my hands here and say something like "Maybe something 
to do with the field caches?"





Searching for negative numbers very slow

2011-01-27 Thread Simon Wistow
If I do 

qt=dismax
fq=uid:1

(or any other positive number) then queries are as quick as normal - in 
the 20ms range. 

However, any of

fq=uid:\-1

or

fq=uid:[* TO -1]

or 
   
fq=uid:[-1 TO -1]

or

fq=-uid:[0 TO *]

then queries are incredibly slow - in the 9 *second* range.

Anything I can do to mitigate this? Negative numbers have significant 
meaning in our system so it wouldn't be trivial to shift all uids up by 
the number of negative ids.


Thanks, 

Simon




Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-27 Thread Simon Wistow
On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
 Are you sure you need CMS incremental mode? It's only adviced when running on 
 a machine with one or two processors. If you have more you should consider 
 disabling the incremental flags.

I'll test again but we added those to get better performance - not much 
but there did seem to be an improvement.

The problem seems to not be in average use but that occasionally there's 
a huge spike in load (there doesn't seem to be a particular killer 
query) and Solr just never recovers.

Thanks,

Simon




Re: Searching for negative numbers very slow

2011-01-27 Thread Simon Wistow
On Thu, Jan 27, 2011 at 11:32:26PM +, me said:
 If I do 
 
   qt=dismax
 fq=uid:1
 
 (or any other positive number) then queries are as quick as normal - in 
 the 20ms range. 

For what it's worth uid is a TrieIntField with precisionStep=0,
omitNorms=true, positionIncrementGap=0
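
In schema.xml terms the definition is roughly the following (the type 
name is illustrative, and stored="true" is a guess):

  <fieldType name="tint" class="solr.TrieIntField" precisionStep="0"
             omitNorms="true" positionIncrementGap="0"/>
  <field name="uid" type="tint" indexed="true" stored="true"/>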




Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Simon Wistow
We have two slaves replicating off one master every 2 minutes.

Both using the CMS + ParNew Garbage collector. Specifically

-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

but periodically they both get into a GC storm and just keel over.

Looking through the GC logs the amount of memory reclaimed in each GC 
run gets less and less until we get a concurrent mode failure and then 
Solr effectively dies.

Is it possible there's a memory leak? I note that later versions of 
Lucene have fixed a few leaks. Our current versions are relatively old

Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 
18:06:42

Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

so I'm wondering if upgrading to a later version of Lucene might help (of 
course it might not but I'm trying to investigate all options at this 
point). If so what's the best way to go about this? Can I just grab the 
Lucene jars and drop them somewhere (or unpack and then repack the solr 
war file?). Or should I use a nightly solr 1.4?
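
For the repack-the-war route, I'm imagining something along these lines 
(assuming a drop-in compatible 2.9.x point release; the jar names and 
paths are illustrative):

  mkdir solr-war && cd solr-war
  jar xf ../apache-solr-1.4.1.war
  # swap the bundled Lucene 2.9.3 jars for the newer ones
  rm WEB-INF/lib/lucene-*-2.9.3.jar
  cp /path/to/lucene-*-2.9.4.jar WEB-INF/lib/
  jar cf ../solr-custom.war .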

Or am I barking up completely the wrong tree? I'm trawling through heap 
logs and gc logs at the moment trying to see what other tuning I can 
do but any other hints, tips, tricks or cluebats gratefully received. 
Even if it's just "Yeah, we had that problem and we added more slaves 
and periodically restarted them".

thanks,

Simon


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Simon Wistow
On Mon, Jan 24, 2011 at 08:00:53PM +0100, Markus Jelsma said:
 Are you using 3rd-party plugins?

No third party plugins - this is actually pretty much stock tomcat6 + 
solr from Ubuntu. The only difference is that we've adapted the 
directory layout to fit in with our house style


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Simon Wistow
On Mon, Jan 24, 2011 at 10:55:59AM -0800, Em said:
 Could it be possible that your slaves not finished their replicating until
 the new replication-process starts?
 If so, there you got the OOM :).

This was one of my thoughts as well - we're currently running a slave 
which has no queries in it just to see if that exhibits similar 
behaviour.

My reasoning against it is that we're not seeing any 

PERFORMANCE WARNING: Overlapping onDeckSearchers=x

in the logs which is something I'd expect to see.

2 minutes doesn't seem like an unreasonable period of time either - the 
docs at http://wiki.apache.org/solr/SolrReplication suggest 20 seconds.
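
For context, the slave side of our replication config looks more or 
less like this (master URL obviously made up):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master.example.com:8983/solr/replication</str>
      <str name="pollInterval">00:02:00</str>
    </lst>
  </requestHandler>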




Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
I have a fairly classic master/slave set up.

Response times on the slave are generally good with blips periodically, 
apparently when replication is happening.

Occasionally however the process will have one incredibly slow query and 
will peg the CPU at 100%.

The weird thing is that it will remain that way even if we stop querying 
it and stop replication and then wait for over 20 minutes. The only way 
to fix the problem at that point is to restart tomcat.

Looking at slow queries around the time of the incident they don't look 
particularly bad - they're predominantly filter queries running under 
dismax and there doesn't seem to be anything unusual about them.

The index is about 266G and the disk has 30G free. The machine has 
50G of RAM and is running with -Xmx35G.

Looking at the processes running it appears to be the main Java thread 
that's CPU bound, not the child threads. 

Stracing the process gives a lot of brk calls (presumably some 
sort of wait loop) with occasional blips of: 


mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
325, {1294683789, 614186000}, ) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mmap(0x7fc2e023, 121962496, PROT_NONE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
0x7fc2e023
mmap(0x7fbca58e, 237568, PROT_NONE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
0x7fbca58e

Any ideas about what's happening and if there's any way to mitigate it? 
If the box at least recovered then I could run another slave and load 
balance between them working on the principle that the second box 
would pick up the slack whilst the first box restabilised but, as it is, 
that's not reliable.

Thanks,

Simon



Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
 This sounds like it could be garbage collection related, especially 
 with a heap that large.  Depending on your jvm tuning, a FGC could 
 take quite a while, effectively 'pausing' the JVM.
 
 Have you looked at something like jstat -gcutil or similar to monitor 
 the garbage collection?

I think you may have hit the nail on the head. 

Having checked the configuration again I noticed that the -server flag 
didn't appear to be present in the options passed to Java (I'm convinced 
it used to be there). As I understand it, this would mean that the 
Parallel GC wouldn't be implicitly enabled.

If that's true then that's a definite strong candidate for causing the 
root process and only the root process to peg a single CPU.

Anybody have any experience of the differences between

-XX:+UseParallelGC 

and

-XX:+UseConcMarkSweepGC with -XX:+UseParNewGC

?

I believe -XX:+UseParallelGC  is the default with -server so I suppose 
that's a good place to start but I'd appreciate any anecdotes or 
experiences.





Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
On Mon, Jan 10, 2011 at 05:58:42PM -0500, François Schiettecatte said:
 http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html (you 
 need to read this one)
 
 http://java.sun.com/performance/reference/whitepapers/tuning.html (and 
 this one).

Yeah, I have these two pages bookmarked :)

 jstat is also very good for seeing what is going on in the JVM. I also 
 recall there was a way to trace GC in the JVM but can't recall how off 
 the top of my head, maybe it was a JVM option.

You can use -XX:+PrintGC and -XX:+PrintGCDetails (and 
-XX:+PrintGCTimeStamps) as well as -Xloggc:gc.log to log to a file.
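
Put together, that's something like the following on the java command 
line (log path illustrative), plus jstat for a live view:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/solr/gc.log
  jstat -gcutil <solr-pid> 5s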

I'm also finding NewRelic's RPM system great for monitoring Solr - the 
integration is really good, I give it two thumbs up.




Very slow sorting, even on small result sets

2010-11-30 Thread Simon Wistow
We've got a largish corpus (~94 million documents). We'd like to be able 
to sort on one of the string fields. However this takes an incredibly 
long time. A warming query for that field takes ~20 minutes.

However most of the time the result sets are small since we use filters 
heavily - typically a result set is between 2 and 100 documents.

Yet sorting on the string field is still very, very slow.

Now, as I understand it sorting on a field requires building a 
FieldCache for every document no matter how many documents actually 
match the query.

Is there any way round that - is there any way to say just sort the 
matched documents?

We can probably work round this by sorting in application space but I 
wanted to double check that I'm not missing anything before I implement 
that.

thanks,

Simon




Experiencing lots of full GC runs

2010-11-18 Thread Simon Wistow
We currently have a 30G index with 73M of .tii files running on a 
machine with 4 Intel 2.27GHz Xeons with 15G of memory.

About once a second a process indexes ~10-20 smallish documents using 
the XML Update Handler. A commit happens after every update. However we 
see this behaviour even if the indexer isn't running.
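
(For the record, the indexer is just posting to the stock XML update 
handler, i.e. something like the following per batch - host and 
document contents are placeholders:)

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' \
       --data-binary '<add><doc><field name="id">...</field></doc></add>'
  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' \
       --data-binary '<commit/>'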

The system is running under Tomcat6 with Solr 1.4.1 955763M - mark - 
2010-06-17 18:06:42 and Lucene 2.9.3 951790 - 2010-06-06 01:30:55

Our GC settings (the least worst we've found so far) currently look like 

-XX:+UseConcMarkSweepGC 
-XX:+CMSIncrementalMode 
-XX:+UseParNewGC
-XX:NewSize=5G 
-XX:SurvivorRatio=3 
-Xmx10G -Xss10M 
-XX:CMSInitiatingOccupancyFraction=40 
-XX:+UseCMSInitiatingOccupancyOnly

Everything is fine until we start to try and search at which point 
performance goes to hell with multi-second response times and frequent 
full GC runs (approx every 15 seconds) looking like

2372.886: [Full GC 2372.886: [CMS2378.577: [CMS-concurrent-mark: 
5.912/5.913 secs] [Times: user=6.10 sys=0.01, real=5.91 secs] 
 (concurrent mode failure): 5242879K->5242879K(5242880K), 18.2557740 
secs] 9437183K->9409440K(9437184K), [CMS Perm : 30246K->30242K(50552K)] 
icms_dc=100 , 18.2558680 secs] [Times: user=18.20 sys=0.05, real=18.26 
secs] 

Looking at top jsvc is using 100% of CPU.

I'm baffled - I've had way bigger indexes than this before with no 
performance problems. At first it was the frequent updates but the fact 
that it happens even when the indexer isn't running seems to put paid to 
that.

One salient point - because of the frequent updates we don't have a 
queryResultCache configured.

Any ideas? Hints? Tips?

Simon



 


Re: Experiencing lots of full GC runs

2010-11-18 Thread Simon Wistow
On Fri, Nov 19, 2010 at 12:01:09AM +, me said:
 I'm baffled - I've had way bigger indexes than this before with no 
 performance problems. At first it was the frequent updates but the fact 
 that it happens even when the indexer isn't running seems to put paid to 
 that.

More information:

- The index has ~30 million smallish documents 

- Once a slow query has been executed all other queries, even ones which 
had previously been slow but tolerable (response times ~1s) become 
incredibly slow

- Once the process has turned slow only a kill -9 will bring it down

- Upgrading to a recent nightly build of Solr (3.1-2010-11-18_05-27-29 
1036325 - hudson - 2010-11-18 05:41:58) has made things even slower

- I'd check with 4.0.x if someone can point me at a tool that can 
migrate indexes. I seem to be unable to find one and Lucene 3.0 informs 
me that it's incompatible with 2.9.x 









Re: Possible memory leaks with frequent replication

2010-11-02 Thread Simon Wistow
On Mon, Nov 01, 2010 at 05:42:51PM -0700, Lance Norskog said:
 You should query against the indexer. I'm impressed that you got 5s
 replication to work reliably.

That's our current solution - I was just wondering if there was anything 
I was missing. 

Thanks!
 


Possible memory leaks with frequent replication

2010-11-01 Thread Simon Wistow
We've been trying to get a setup in which a slave replicates from a 
master every few seconds (ideally every second but currently we have it 
set at every 5s).

Everything seems to work fine until, periodically, the slave just stops 
responding, from what looks like it has run out of memory:

org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.OutOfMemoryError: Java heap space


(our monitoring seems to confirm this).

Looking around, my suspicion is that it takes new Readers longer to warm 
than the gap between replications and thus they just build up until all 
memory is consumed (which, I suppose, isn't really memory 'leaking' per 
se, more just resource consumption).

That said, we've tried turning off caching on the slave and that didn't 
help either so it's possible I'm wrong.

Is there anything we can do about this? I'm reluctant to increase the 
heap space since I suspect that will mean that there's just a longer 
period between failures. Might Zoie help here? Or should we just query 
against the Master?


Thanks,

Simon


Re: Sorting on arbitrary 'custom' fields

2010-10-15 Thread Simon Wistow
On Mon, Oct 11, 2010 at 07:17:43PM +0100, me said:
 It was just an idea though and I was hoping that there would be a 
 simpler more orthodox way of doing it.

In the end, for anyone who cares, we used dynamic fields.

There are a lot of them but we haven't seen performance impacted that 
badly so far.
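
Concretely, that means a catch-all entry in schema.xml along these 
lines (the prefix and type are illustrative), with the user-supplied 
key folded into the field name at index time:

  <dynamicField name="userfield_*" type="string" indexed="true" stored="true"/>

so a sort then looks like sort=userfield_time_taken_to_write_in_mins asc.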






Re: Sorting on arbitrary 'custom' fields

2010-10-11 Thread Simon Wistow
On Sat, Oct 09, 2010 at 06:31:19PM -0400, Erick Erickson said:
 I'm confused. What do you mean that a user can set any
 number of arbitrarily named fields on a document. It sounds
 like you are talking about a user adding arbitrarily may entries
 to a multi-valued field? Or is it some kind of key:value pairs
 in a field in your schema?

Users can add arbitrary key/values to documents. Kind of like Machine 
Tags.

So whilst a document has some standard fields (e.g title=My Random 
Document, user=Simon, date=2010-10-11) I might have added 
current_temp_in_c=32 to one of my documents but you might have put 
time_taken_to_write_in_mins=30.

We currently don't index these fields but we'd like to and be able to 
have users sort on them. 

Ideas I had:

- Every time a user adds a new field (e.g. time_taken_to_write_in_mins), 
update the global schema

But that would be horrible and would create an index with many thousands 
of fields.

- Give each user their own core and update each individual schema

Better but still inelegant

The multi valued field idea occurred to me because I could have, for 
example

user_field: [time_taken_to_write_in_mins=30, current_temp_in_c=32]

(i.e flatten the key/value)

I could then maybe write something that allowed sorting only on matched 
values of a multi-valued field. 

sort=user_field:time_taken_to_write_in_mins=*

or

fq=user_field:time_taken_to_write_in_mins=*&sort=user_field

It was just an idea though and I was hoping that there would be a 
simpler more orthodox way of doing it.

thanks,

Simon


Problems indexing spatial field - undefined subField

2010-08-31 Thread Simon Wistow
I'm trying to index a latLon field.

I have a fieldType in my schema.xml that looks like

<fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/>

and a field that looks like

<field name="location" type="latLon" indexed="true" stored="true"/>

I'm trying to upload via the JSON update handler but I'm getting a 400 
error

undefined field location_0_latLon

FWIW the JSON looks like

  "location": "38.044337,-103.513824"

Any idea what I'm doing wrong? Maybe I shouldn't be using the JSON 
update handler?

Simon




Re: Problems indexing spatial field - undefined subField

2010-08-31 Thread Simon Wistow
On Wed, Sep 01, 2010 at 01:05:47AM +0100, me said:
 I'm trying to index a latLon field.

 <fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/>
 <field name="location" type="latLon" indexed="true" stored="true"/>

Turns out changing it to 

<fieldType name="latLon" class="solr.LatLonType" subFieldType="double"/>

fixed it.
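
(My guess as to why: with subFieldSuffix the type stores its internal 
coordinate components in fields like location_0_latLon and 
location_1_latLon, which need a matching dynamicField in the schema - 
something like the line below, assuming a "double" fieldType exists - 
whereas subFieldType creates the sub-fields for you. I haven't dug into 
the code to confirm that, though.)

  <dynamicField name="*_latLon" type="double" indexed="true" stored="false"/>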




Re: Getting the character offset from highlighted fragments

2010-04-22 Thread Simon Wistow
On Thu, Apr 22, 2010 at 02:15:08AM +0100, me said:
 It looks like org.apache.lucene.search.highlight.TextFragment has the 
 right information to do this (i.e textStartPos)

Turns out that it doesn't seem to have the right information in that 
textStartPos always seems to be 0 (and textEndPos just seems to be the 
length of the fragment).

Any suggestions?




Getting the character offset from highlighted fragments

2010-04-21 Thread Simon Wistow
Having poked around a little it doesn't look like there's a query param 
to turn this on but it'd be really useful if highlighted fragments could 
have a character offset returned somehow - maybe something like

<lst name="highlighting">
  <lst name="27314523">
    <arr name="content">
      <str offset="600">
      Lorem ipsum dolor sit amet, <em>consectetur</em> adipisicing 
      </str>
    </arr>
  </lst>
</lst>

or even

<lst name="highlighting">
  <lst name="27314523">
    <arr name="content">
      <str>
      Lorem ipsum dolor sit amet, <em>consectetur</em> adipisicing 
      </str>
    </arr>
    <arr name="offsets">
      <int>600</int>
    </arr>
  </lst>
</lst>



It looks like org.apache.lucene.search.highlight.TextFragment has the 
right information to do this (i.e. textStartPos) but before I start 
writing patches ...

 - Am I duplicating existing work?
 - Am I missing some reason why this is a dumb idea?
 - Is this desirable (or, to be more succinct, if I write a patch, is it 
   likely to be accepted?)


Thanks,

Simon




Re: Slow QueryComponent.process() when queries have numbers in them

2010-02-05 Thread Simon Wistow
On Wed, Feb 03, 2010 at 07:38:13PM -0800, Lance Norskog said:
 The debugQuery parameter shows you how the query is parsed into a tree
 of Lucene query objects.

Well, that's kind of what I'm asking - I know how the query is being 
parsed:

<str name="rawquerystring">myers 8e psychology chapter 9</str>

<str name="querystring">myers 8e psychology chapter 9</str>

<str name="parsedquery">
+((DisjunctionMaxQuery((content:myer^0.8 | title:myer^1.5)~0.01) 
DisjunctionMaxQuery((content:"8 e"~2^0.8 | title:"8 e"~2^1.5)~0.01) 
DisjunctionMaxQuery((content:psycholog^0.8 | title:psycholog^1.5)~0.01) 
DisjunctionMaxQuery((content:chapter^0.8 | title:chapter^1.5)~0.01) 
DisjunctionMaxQuery((content:9^0.8 | title:9^1.5)~0.01))~4) ()
</str>

<str name="parsedquery_toString">
+(((content:myer^0.8 | title:myer^1.5)~0.01 (content:"8 e"~2^0.8 | 
title:"8 e"~2^1.5)~0.01 (content:psycholog^0.8 | 
title:psycholog^1.5)~0.01 (content:chapter^0.8 | title:chapter^1.5)~0.01 
(content:9^0.8 | title:9^1.5)~0.01)~4) ()
</str>

But that's sort of beside the point - I was really asking if this is a 
known issue (i.e. queries with numbers in them can be very slow) and 
whether there are any workarounds.







Slow QueryComponent.process() when queries have numbers in them

2010-02-03 Thread Simon Wistow
According to my logs 

org.apache.solr.handler.component.QueryComponent.process()

takes a significant amount of time (5s but I've seen up to 15s) when a 
query has an odd pattern of numbers in it, e.g.

neodymium megagauss-oersteds (MGOe) (1 MG·Oe = 7,958·10³ T·A/m = 7,958 
kJ/m³

myers 8e psychology chapter 9

JOHN PIPER 1 TIMOTEO 3:1?

lab 2.6.2: using wireshark to view protocol data units

malha de aço 3x3 6mm - peso m2

or even looks like it could be a query

An experiment has two outcomes, A and A. If A is three time as likely to occur 
as , what is P(A)?

other params were

fl: *,score
fq: +num_pages:[2 TO *] AND +language:1
hl: true
hl.fl: content title description
hl.simple.post: </strong>
hl.simple.pre: <strong>
hl.snippets: 2
qf: title^1.5 content^0.8
qs: 2
qt: dismax
rows: 10
sort: score desc
start: 0
wt: json 


Is this just something I'm going to have to put up with? Or is there 
something I can do to mitigate it? If it's a bug, any suggestions on how 
to start patching it?




Problems with spellchecker

2010-01-20 Thread Simon Wistow
The spellchecker in my 1.4 install started behaving increasingly 
erratically and suggestions would only be returned some of the time for 
the same query. 

I tried to force a rebuild using

spellcheck.build=yes 

The full request being

/select/?q=alexandr the great
indent=on
fl=title
spellcheck=yes
spellcheck.collate=yes
spellcheck.count=3
qt=dismax
spellcheck.build=yes

and the request spun for a while and then returned 

HTTP Status 500 - this IndexReader is closed 
org.apache.lucene.store.AlreadyClosedException: this IndexReader is 
closed at 
org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:209) at 
org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:624) 
at 
org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:147) 
at 
org.apache.lucene.search.spell.SpellChecker.exist(SpellChecker.java:315) 
at 
org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:339)
 
at 
org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:362)
 
at 
org.apache.solr.spelling.IndexBasedSpellChecker.build(IndexBasedSpellChecker.java:89)
 
at 
org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:102)
 
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
 
at 
[ .. snip .. ]

And the spellcheck directory now has a write.lock file.

I tried a couple more times then stopped tomcat, deleted the write.lock, 
restarted and tried again. Same error. So I stopped tomcat again, nuked 
the spellcheck directory, restarted tomcat and tried again. Same error. 
I tried one more time and got a "This file does not exists _f00.cfs" error, 
tried again and got the "IndexReader is closed" error again.

My source index is 58Gb.

Any ideas?

Simon




Oddly slow replication

2009-12-07 Thread Simon Wistow
I have a Master server with two Slaves populated via Solr 1.4 native 
replication.

Slave1 syncs at a respectable speed, i.e. around 100MB/s, but Slave2 runs 
much, much slower - the peak I've seen is 56KB/s.

Both are running off the same hardware with the same config - 
compression is set to 'internal' and http(Conn|Read)Timeout are defaults 
(5000/1). 

I've checked to see if it was a disk problem using dd and if it was a 
network problem by doing a manual scp and an rsync from the slave to the 
master and the master to the slave. 

I've shut down the replication polling on Slave1 just to see if that was 
causing the problem but there's been no improvement.

Any ideas?




Re: Oddness with Phrase Query

2009-11-23 Thread Simon Wistow
On Mon, Nov 23, 2009 at 12:10:42PM -0800, Chris Hostetter said:
 ...hmm, you shouldn't have to reindex everything.  arey ou sure you 
 restarted solr after making the enablePositionIncrements=true change to 
 the query analyzer?

Yup - definitely restarted
 
 what do the offsets look like when you go to analysis.jsp and past in that 
 sentence?

org.apache.solr.analysis.StopFilterFactory 
{words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true}

term position:     1      4
term text:         Here   Dragons
term type:         word   word
source start,end:  0,4    14,21
payload:


 the other thing to consider: you can increase the slop value on that
 phrase query (to allow looser matching) using the qs param (query slop) 
 ... that could help in this situation (stop words getting stripped out of 
 the query) as well as other situations (ie: what if the user just types 
 here be dragons -- with or without stop words)

After fiddling with the position increments stuff I upped the query 
slop to 2, which now seems to provide better results, but I'm worried 
about that affecting relevancy elsewhere (which I presume is the reason 
why it's not the default value).

If that's the case - is it worth writing something for my app so that if 
it detects a phrase query with lots of stop words it ups the phrase 
slop?

Either way it seems to be working now  - thanks for all the help,

Simon



Re: Oddness with Phrase Query

2009-11-17 Thread Simon Wistow
On Tue, Nov 17, 2009 at 11:09:38AM -0800, Chris Hostetter said:
 
 Several things about your message don't make sense...

Hmm, sorry - a byproduct of building up the mail over time I think.

The query 

?q="Here there be dragons"
fl=id,title,score
debugQuery=on
qt=dismax
qf=title

gets echoed as 

<lst name="params">
 <str name="qf">title</str>
 <str name="fl">id,title,score</str>
 <str name="debugQuery">on</str>
 <str name="q">"Here there be dragons"</str>
 <str name="qt">dismax</str>
</lst>

and gets parsed as 

+DisjunctionMaxQuery((title:"here dragon")~0.01) ()

and gets no results.


Whereas 

?q=Here there be dragons
fl=id,title,score
debugQuery=on
qt=dismax
qf=title

gets echoed as 

<lst name="params">
<str name="debugQuery">on</str>
<str name="fl">id,title,score</str>
<str name="q">Here, there be dragons</str>
<str name="qf">title</str>
<str name="qt">dismax</str>
</lst>

and parsed as 

+((DisjunctionMaxQuery((title:here)~0.01) 
DisjunctionMaxQuery((title:dragon)~0.01))~2) ()

Gets one result

<doc>
<float name="score">6.3863463</float>
<str name="id">20980889</str>
<str name="title">Zelazny, Roger - Here There Be Dragons</str>
</doc>


It looks like it might be related to 

SOLR-879: Enable position increments in the query parser and fix the
  example schema to enable position increments for the stop 
  filter in both the index and query analyzers to fix the bug 
  with phrase queries with stopwords. (yonik)

http://issues.apache.org/jira/browse/SOLR-879

Although I added enablePositionIncrements="true" to

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

in the <analyzer type="query"> for <fieldType name="text"> in the 
schema, which didn't fix it - I presume this means that I have to reindex 
everything (although the StopFilterFactory in <analyzer type="index"> 
already had it).
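
For clarity, the query-side analyzer now reads something like this 
(other tokenizer/filter lines elided):

  <analyzer type="query">
    ...
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    ...
  </analyzer>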





Oddness with Phrase Query

2009-11-09 Thread Simon Wistow
I have a document with the title Here, there be dragons and a body.

When I search for 

q  = Here, there be dragons
qf = title^2.0 body^0.8
qt = dismax

Which is parsed as 

+DisjunctionMaxQuery((content:"here dragon"^0.8 | title:"here 
dragon"^2.0)~0.01) ()

I get the document as the first hit which is what I'd suspect.

However, if I change the query to 

q  = "Here, there be dragons"

(with quotes)

which is parsed as

+DisjunctionMaxQuery((content:"here dragon"^0.8 | title:"here 
dragon"^2.0)~0.01) ()

then I don't get the document at all. Which is not what I'd suspect.

I've tried modifying the phrase slop but still don't get any results 
back.

Am I doing something wrong - do I have to have an untokenized copy of 
fields lying around?

Thanks,

Simon




Re: Issues with SolrJ and IndexReader reopening

2009-11-04 Thread Simon Wistow
On Fri, Oct 30, 2009 at 11:20:19AM +0530, Shalin Shekhar Mangar said:
 That is very strange. IndexReaders do get re-opened after commits. Do you
 see a commit  message in the Solr logs?

Sorry for the delay - I've been trying to puzzle over this some more.

The code looks like

 server.add(docs);
 server.commit();
 server.optimize();

This is what I'm seeing in the logs, which would seem to indicate that 
both a commit and an optimize are happening and that a new searcher is 
getting opened:

documentCache{lookups=0,hits=0,hitratio=0.00,inserts=10,evictions=0,size=10,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Nov 4, 2009 11:38:06 AM 
org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {commit=} 0 264
Nov 4, 2009 11:38:06 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=/update
params={commit=true&waitFlush=true&waitSearcher=true} status=0 QTime=264
Nov 4, 2009 11:38:06 AM org.apache.solr.update.DirectUpdateHandler2 
commit
INFO: start commit(optimize=true,waitFlush=true,waitSearcher=true)
Nov 4, 2009 11:38:08 AM org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@6d3f199b main
Nov 4, 2009 11:38:08 AM org.apache.solr.update.DirectUpdateHandler2 
commit
INFO: end_commit_flush
Nov 4, 2009 11:38:08 AM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@6d3f199b main from Searcher@2f8acca4 main

But I still get the same result.


We ended up going a different route anyway but I'm still slightly 
confused. For what it's worth - the code for instantiating the server is

  // load the config and schema from CONFIG_PATH
  SolrConfig solrConfig = new SolrConfig(CONFIG_PATH, "solrconfig.xml", null);
  IndexSchema indexSchema = new IndexSchema(solrConfig, "schema.xml", null);
  SolrResourceLoader resource =
      new SolrResourceLoader(SolrResourceLoader.locateInstanceDir());

  // register a single, default (unnamed) core and wrap it in an embedded server
  CoreContainer container = new CoreContainer(resource);
  CoreDescriptor dcore = new CoreDescriptor(container, "",
      solrConfig.getResourceLoader().getInstanceDir());
  dcore.setConfigName(solrConfig.getResourceName());
  dcore.setSchemaName(indexSchema.getResourceName());

  core = new SolrCore(null, DATA_PATH, solrConfig, indexSchema, dcore);
  container.register("", core, false);
  server = new EmbeddedSolrServer(container, "");

Thanks, 

Simon




Issues with SolrJ and IndexReader reopening

2009-10-29 Thread Simon Wistow
We've been trying to build an indexing pipeline using SolrJ but we've 
run into a couple of issues - namely that IndexReaders don't seem to get 
reopened after a commit().

After an index or delete the change doesn't show up until I restart 
solr.

I've tried commit() and commit(true, true) just to try and be specific. 
I've also tried adding an optimize(true, true) but nothing doing.

Am I missing something obvious?




Index Corruption (possibly during commit)

2009-10-19 Thread Simon Wistow
We have an indexing script which has been running for a couple of weeks 
now without problems. It indexes documents and then periodically commits 
(which is a tad redundant, I suppose), both via the HTTP interface.

All documents are indexed to a master and a slave rsyncs them off using 
the standard 1.3.0 replication.

Recently the indexing script got into problems when the commit was 
taking longer than the request timeout. I killed the script, did a 
commit by hand (using 
bin/commit) and then started to index again and it still wouldn't 
commit. We then tried to go to the stats page and got the error

org.apache.lucene.index.CorruptIndexException:
doc counts differ for segment _mib: fieldsReader shows 1 but segmentInfo 
shows 718 at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960) at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:470) at 

This is a stock 1.3.0 running off tomcat 6.0.20 with 

java version 1.6.0_13
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

Linux solr.local 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 
2009 x86_64 x86_64 x86_64 GNU/Linux

Plenty of RAM and disk space (usage is 31% - 353G used from 534G)

CheckIndex says

Opening index @ index/

Segments file=segments_c8z numSegments=28 version=FORMAT_HAS_PROX 
[Lucene 2.4]

Checking only these segments: _mib:
  22 of 28: name=_mib docCount=718
compound=false
hasProx=true
numFiles=9
size (MB)=0.029
has deletions [delFileName=_mib_1.del]
test: open reader.FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:
org.apache.lucene.index.CorruptIndexException: doc counts differ for 
segment _mib: fieldsReader shows 1 but segmentInfo shows 718
at 
org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:282)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:591)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

WARNING: 1 broken segments (containing 718 documents) detected
WARNING: would write new segments file, and 718 documents would be lost, 
if -fix were specified
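
(For anyone following along, CheckIndex was run directly against the 
index directory, roughly like this - the Lucene jar path and index path 
are illustrative and will vary with your install:)

  java -cp /path/to/lucene-core-2.4.jar org.apache.lucene.index.CheckIndex \
       /var/solr/data/index -segment _mib
  # add -fix to drop the broken segment (and lose its 718 docs)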


 
Any ideas? We can restore from backups and backfill but really we'd 
love to know what caused this so we can avoid a repetition.

Simon




'Down' boosting shorter docs

2009-10-14 Thread Simon Wistow
Our index has some items in it which basically contain a title and a 
single-word body.

If the user searches for a word in the title (especially if the title is 
itself only one word) then that doc will get scored quite highly, 
despite the fact that, in this case, it's not really relevant.

I've tried something like

qf=title^2.0 content^0.5
bf=num_pages

but that disproportionately boosts long documents to the detriment of 
relevancy

bf=product(num_pages,0.05)

has no effect but 

bf=product(num_pages,0.06)


returns a bunch of long documents which don't seem to have any highlighted 
fields, plus the short document with only the query in the title - which is 
progress in that it's almost exactly the opposite of what I want.

Any suggestions? Am I going to need to reindex and add the length in 
bytes or characters of the document?

Simon






Advantages of different Servlet Containers

2009-10-02 Thread Simon Wistow
I know that the Solr FAQ says 

Users should decide for themselves which Servlet Container they 
consider the easiest/best for their use cases based on their 
needs/experience. For high traffic scenarios, investing time for tuning 
the servlet container can often make a big difference.

but is there anywhere that lists some of the various advantages and 
disadvantages of, say, Tomcat over Jetty for someone who isn't current 
with the Java ecosystem?

Also, I'm currently using Jetty but I've had to do a horrific hack to 
make it work under init.d in that I start it up in the background and 
then tail the output waiting for the line that says the SocketConnector 
has been started

   while [ '' = "$(tail -1 $LOG | grep 'Started SocketConnector')" ] ; 
   do
   sleep 1
   done

There's *got* to be a better way of doing this, right? 
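
One alternative I've been meaning to try is polling Solr itself rather 
than the log - something like the loop below against the stock ping 
handler (host and port are whatever you run on):

   until curl -sf -o /dev/null http://localhost:8983/solr/admin/ping ; do
       sleep 1
   done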

Thanks,

Simon