Re: Broken attachment link on Wiki
Bump? On Mon, Jun 27, 2011 at 06:17:42PM +0100, me said: On the SolrJetty page http://wiki.apache.org/solr/SolrJetty there's a link to a tarball http://wiki.apache.org/solr/SolrJetty?action=AttachFile&do=view&target=DEMO_multiple_webapps_jetty_6.1.3.tgz which fails with the error "You are not allowed to do AttachFile on this page." Can someone fix it somehow? Or put the file elsewhere?
Broken attachment link on Wiki
On the SolrJetty page http://wiki.apache.org/solr/SolrJetty there's a link to a tarball http://wiki.apache.org/solr/SolrJetty?action=AttachFile&do=view&target=DEMO_multiple_webapps_jetty_6.1.3.tgz which fails with the error "You are not allowed to do AttachFile on this page." Can someone fix it somehow? Or put the file elsewhere?
Multiple Solrs on the same box
First, a couple of assumptions. We have boxes with a large amount (~70GB) of memory which we're running Solr on. We've currently set -Xmx to 25GB with the GC settings

    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

We're reluctant to raise -Xmx because when a stop-the-world GC does eventually happen it'll be pretty devastating. But we also have a bunch of spare memory lying around, so we're wondering if running multiple Solrs is the right thing to do - that way we'd be using all our memory without very long GC pauses. Of course, if that assumption is wrong then the rest of this mail is irrelevant. We're currently using Tomcat but we're pondering moving to Jetty. Whilst I've managed to get multiple Solr apps running on different ports under the same Jetty instance, I can't seem to get them configured via JNDI. It looks like someone put a tarball with details of how to do that on the Wiki http://wiki.apache.org/solr/SolrJetty#JNDI_Caveats_Noted_By_Users but the permissions have been set so that you can't actually download it. So - three questions really:

1) Am I barking up the wrong tree, or is running multiple instances a good idea?
2) Is Jetty worth it, or should I just stick with Tomcat?
3) Can someone set the permissions on the wiki so I can download that file? ;)

cheers, Simon
Expunging deletes from a very large index
Due to some emergency maintenance I needed to run a delete on a large number of documents in a 200GB index. The problem is that it's taking an inordinately long time (2+ hours so far and counting) and is steadily eating up disk space - presumably up to 2x the index size, which is getting awfully close to the wire on this machine. Is that inevitable? Is there any way to speed up the process or use less space? Maybe do an optimize with a different maxSegments value? I suspect not, but I thought it was worth asking.
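In case it helps anyone answer: the run looks roughly like this from SolrJ. Just a sketch - the delete query and URL are made up, and I'm assuming the optimize(waitFlush, waitSearcher, maxSegments) overload is available in our SolrJ version:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class EmergencyDelete {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Delete the offending documents...
            server.deleteByQuery("type:obsolete");
            server.commit();
            // ...then merge down to 10 segments rather than 1, in the hope
            // that a partial merge rewrites (and temporarily duplicates)
            // less of the index than a full optimize would.
            server.optimize(true, true, 10);
        }
    }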
Negative OR in fq field not working as expected
I have a field 'type' that has several values. If it's type 'foo' then it also has a field 'restriction_id'. What I want is a filter query which says "either it's not a 'foo', or if it is then it has the restriction '1'". I expect two matches - one of type 'bar' and one of type 'foo'. Neither of

    fq=(-type:foo OR restriction_id:1)
    fq={!dismax q.op=OR}-type:foo restriction_id:1

produces any results. fq=restriction_id:1 gets the 'foo' typed result and fq=type:bar gets the 'bar' typed result. Either of these

    fq=type:[* TO *] OR (type:foo AND restriction_id:1)
    fq=type:(bar OR quux OR fleeg) OR restriction_id:1

does work, but both are very, very slow to the point of unusability (our indexes are pretty large). Searching around, it seems other people have experienced similar issues and the answer has been "Lucene just doesn't work like that": "When dealing with Lucene people are strongly encouraged to think in terms of MUST, MUST_NOT and SHOULD (which are represented in the query parser as the prefixes +, - and the default) instead of in terms of AND, OR, and NOT ... Lucene's Boolean Queries (and thus Lucene's QueryParser) is not a strict Boolean Logic system, so it's best not to try and think of it like one." http://wiki.apache.org/lucene-java/BooleanQuerySyntax Am I just out of luck? Might edismax help here? Simon
Re: Negative OR in fq field not working as expected
On Mon, Apr 25, 2011 at 04:34:05PM -0400, Jonathan Rochkind said: This is what I do instead, to rewrite the query to mean the same thing but not give the lucene query parser trouble:

    fq=((*:* AND -type:foo) OR restriction_id:1)

*:* means everything, so (*:* AND -type:foo) means the same thing as just -type:foo, but can get around the lucene query parser's troubles. So that might work for you.

Thanks for confirming my suspicions. Unfortunately I've tried that as well and, whilst it works, it's also unbelievably slow (~30s query time). Would writing my own Query Parser help here? Simon
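PS - for the archives, the same rewrite drops straight into SolrJ as a filter query. A minimal sketch, with my field names:

    import org.apache.solr.client.solrj.SolrQuery;

    public class NegativeFilterExample {
        public static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            // Give the Lucene query parser a concrete set (*:*) to subtract
            // from - a pure-negative clause inside an OR matches nothing by itself.
            q.addFilterQuery("((*:* AND -type:foo) OR restriction_id:1)");
            return q;
        }
    }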
Re: Negative OR in fq field not working as expected
On Mon, Apr 25, 2011 at 05:02:12PM -0400, Yonik Seeley said: It really shouldn't be that slow... how many documents are in your index, and how many match -type:foo?

    Total docs:  161,000,000
    type:foo:     39,000,000
    -type:foo:   122,200,000
    type:bar:     90,000,000

We're aware it's large and we're in the process of splitting the index up, but I was just hoping that there was a workaround I could use in order to reclaim some performance.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said: Just curious, was there any resolution to this?

Not really. We tuned the GC pretty aggressively - we use these options

    -server -Xmx20G -Xms20G -Xss10M
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
    -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
    -XX:SoftRefLRUPolicyMSPerMB=10

and we've played a little with CompressedOops and AggressiveOpts. We also backported the MMapDirectory factory to 1.4.1 and that helped a lot. We do still get spikes of long queries (5s-20s) a few times an hour which don't appear to be caused by any kind of Query of Death. Occasionally (once every few days) one of the slaves will experience a period of sustained slowness but recovers by itself in less than a minute. According to our GC logs we haven't had a full GC for a long time. Currently the state of play is that we commit on our master every 5000ms and the slaves replicate every 2 minutes. Our response times for searches on the slaves are about 180-270ms, but if we turn off replication then we get 60-90ms, so something is clearly up with that. Having talked to the good people at Lucid we're going to try playing around with commit intervals, upping our mergeFactor from 10 to 25 and maybe using the BalancedSegmentMergePolicy. The system seems to be stable at the moment, which is good, but obviously we'd like to lower our query times if possible. Hopefully this might be of some use to somebody out there, sometime. Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said: Heap usage can spike after a commit. Existing caches are still in use and new caches are being generated and/or autowarmed. Can you confirm this is the case?

We see spikes after replication, which I suspect is, as you say, because of the ensuing commit. What we seem to have found is that when we weren't using the concurrent GC, stop-the-world GC runs would kill the app. Now that we're using CMS we occasionally find ourselves in situations where the app still has memory left over but the load on the machine spikes, the GC duty cycle goes to 100% and the app never recovers. Restarting usually helps but sometimes we have to take the machine out of the load balancer, wait for a number of minutes and then put it back in. We're working on two hypotheses. Firstly - we're CPU bound somehow, and at some point we cross some threshold where GC or something else is just unable to keep up. So whilst it looks like instantaneous death of the app, it's actually gradual resource exhaustion where the definition of 'gradual' is 'a very short period of time' (as opposed to some cataclysmic infinite-loop bug somewhere). Either that or... Secondly - there's some sort of Query of Death that kills machines; we just haven't found it yet, even when replaying logs. Or some combination of both. Or other things. It's maddeningly frustrating. We've also got to try deploying a custom solr.war and using the MMapDirectory to see if that helps with anything.
Re: Searching for negative numbers very slow
On Fri, Jan 28, 2011 at 12:29:18PM -0500, Yonik Seeley said: That's odd - there should be nothing special about negative numbers. Here are a couple of ideas: - if you have a really big index and querying by a negative number is much more rare, it could just be that part of the index wasn't cached by the OS and so the query needs to hit the disk. This can happen with any term and a really big index - nothing special for negatives here. - if -1 is a really common value, it can be slower. is fq=uid:\-2 or other negative numbers really slow also?

This was my first thought too - -1 is relatively common, but we have other numbers just as common. Interestingly enough

    fq=uid:-1&fq=foo:bar&fq=alpha:omega

is much (4x) slower than

    q=uid:-1 AND foo:bar AND alpha:omega

but only when searching for that number. I'm going to wave my hands here and say something like "Maybe something to do with the field caches?"
Searching for negative numbers very slow
If I do qt=dismax&fq=uid:1 (or any other positive number) then queries are as quick as normal - in the 20ms range. However, with any of fq=uid:\-1 or fq=uid:[* TO -1] or fq=uid:[-1 to -1] or fq=-uid:[0 TO *], queries are incredibly slow - in the 9 *second* range. Anything I can do to mitigate this? Negative numbers have significant meaning in our system so it wouldn't be trivial to shift all uids up by the number of negative ids. Thanks, Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said: Are you sure you need CMS incremental mode? It's only advised when running on a machine with one or two processors. If you have more you should consider disabling the incremental flags.

I'll test again, but we added those to get better performance - not much, but there did seem to be an improvement. The problem seems to be not in average use but that occasionally there's a huge spike in load (there doesn't seem to be a particular killer query) and Solr just never recovers. Thanks, Simon
Re: Searching for negative numbers very slow
On Thu, Jan 27, 2011 at 11:32:26PM +0000, me said: If I do qt=dismax&fq=uid:1 (or any other positive number) then queries are as quick as normal - in the 20ms range.

For what it's worth, uid is a TrieIntField with precisionStep=0, omitNorms=true, positionIncrementGap=0
Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
We have two slaves replicating off one master every 2 minutes, both using the CMS + ParNew garbage collector. Specifically

    -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

but periodically they both get into a GC storm and just keel over. Looking through the GC logs, the amount of memory reclaimed in each GC run gets less and less until we get a concurrent mode failure and then Solr effectively dies. Is it possible there's a memory leak? I note that later versions of Lucene have fixed a few leaks. Our current versions are relatively old:

    Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
    Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

so I'm wondering if upgrading to a later version of Lucene might help (of course it might not, but I'm trying to investigate all options at this point). If so, what's the best way to go about it? Can I just grab the Lucene jars and drop them somewhere (or unpack and then repack the solr war file)? Or should I use a nightly Solr 1.4? Or am I barking up completely the wrong tree? I'm trawling through heap logs and GC logs at the moment trying to see what other tuning I can do, but any other hints, tips, tricks or cluebats gratefully received. Even if it's just "Yeah, we had that problem and we added more slaves and periodically restarted them". thanks, Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Mon, Jan 24, 2011 at 08:00:53PM +0100, Markus Jelsma said: Are you using 3rd-party plugins?

No third-party plugins - this is actually pretty much stock tomcat6 + solr from Ubuntu. The only difference is that we've adapted the directory layout to fit in with our house style.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Mon, Jan 24, 2011 at 10:55:59AM -0800, Em said: Could it be possible that your slaves haven't finished their replication before the new replication process starts? If so, there you got the OOM :).

This was one of my thoughts as well - we're currently running a slave which receives no queries just to see if it exhibits similar behaviour. My reasoning against it is that we're not seeing any "PERFORMANCE WARNING: Overlapping onDeckSearchers=x" in the logs, which is something I'd expect to see. 2 minutes doesn't seem like an unreasonable period of time either - the docs at http://wiki.apache.org/solr/SolrReplication suggest 20 seconds.
Box occasionally pegs one cpu at 100%
I have a fairly classic master/slave set up. Response times on the slave are generally good, with periodic blips, apparently when replication is happening. Occasionally, however, the process will have one incredibly slow query and will peg the CPU at 100%. The weird thing is that it will remain that way even if we stop querying it and stop replication and then wait for over 20 minutes. The only way to fix the problem at that point is to restart tomcat. Looking at slow queries around the time of the incident, they don't look particularly bad - they're predominantly filter queries running under dismax and there doesn't seem to be anything unusual about them. The index is about 266GB and there's 30GB of disk free. The machine has 50GB of RAM and is running with -Xmx35G. Looking at the processes running, it appears to be the main Java thread that's CPU bound, not the child threads. Stracing the process gives a lot of brk calls (presumably some sort of wait loop) with occasional blips of:

    mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
    futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 325, {1294683789, 614186000}, ) = 0
    futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
    mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
    mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
    futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
    mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
    futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
    futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
    mmap(0x7fc2e023, 121962496, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc2e023
    mmap(0x7fbca58e, 237568, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fbca58e

Any ideas about what's happening and whether there's any way to mitigate it? If the box at least recovered then I could run another slave and load balance between them, working on the principle that the second box would pick up the slack whilst the first box restabilised but, as it is, that's not reliable. Thanks, Simon
Re: Box occasionally pegs one cpu at 100%
On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said: This sounds like it could be garbage collection related, especially with a heap that large. Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM. Have you looked at something like jstat -gcutil or similar to monitor the garbage collection?

I think you may have hit the nail on the head. Having checked the configuration again, I noticed that the -server flag didn't appear to be present in the options passed to Java (I'm convinced it used to be there). As I understand it, this would mean that the parallel GC wouldn't be implicitly enabled. If that's true then it's a strong candidate for causing the root process, and only the root process, to peg a single CPU. Anybody have any experience of the differences between -XX:+UseParallelGC and -XX:+UseConcMarkSweepGC with -XX:+UseParNewGC? I believe -XX:+UseParallelGC is the default with -server, so I suppose that's a good place to start, but I'd appreciate any anecdotes or experiences.
Re: Box occasionally pegs one cpu at 100%
On Mon, Jan 10, 2011 at 05:58:42PM -0500, François Schiettecatte said: http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html (you need to read this one) and http://java.sun.com/performance/reference/whitepapers/tuning.html (and this one).

Yeah, I have these two pages bookmarked :)

jstat is also very good for seeing what is going on in the JVM. I also recall there was a way to trace GC in the JVM but can't recall how off the top of my head, maybe it was a JVM option.

You can use -XX:+PrintGC and -XX:+PrintGCDetails (and -XX:+PrintGCTimeStamps) as well as -Xloggc:gc.log to log to a file. I'm also finding New Relic's RPM system great for monitoring Solr - the integration is really good; I give it two thumbs up.
Very slow sorting, even on small result sets
We've got a largish corpus (~94 million documents) and we'd like to be able to sort on one of the string fields. However this takes an incredibly long time: a warming query for that field takes about ~20 minutes. Most of the time the result sets are small, since we use filters heavily - typically a result set is between 2 and 100 documents. Yet sorting on the string field is still very, very slow. Now, as I understand it, sorting on a field requires building a FieldCache entry for every document, no matter how many documents actually match the query. Is there any way round that - any way to say "just sort the matched documents"? We can probably work around this by sorting in application space, but I wanted to double-check that I'm not missing anything before I implement that. thanks, Simon
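PS - for completeness, the application-space fallback I have in mind is just fetch-then-sort, since the filtered result sets are tiny. A sketch, with a hypothetical title_sort field:

    import java.util.Collections;
    import java.util.Comparator;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class AppSideSort {
        public static SolrDocumentList search(SolrServer server, String query) throws Exception {
            SolrQuery q = new SolrQuery(query);
            q.setRows(200); // result sets are 2-100 docs, so fetch them all
            QueryResponse rsp = server.query(q);
            SolrDocumentList docs = rsp.getResults();
            // Sort the handful of matches ourselves instead of making Solr
            // build a FieldCache entry for all ~94 million documents.
            Collections.sort(docs, new Comparator<SolrDocument>() {
                public int compare(SolrDocument a, SolrDocument b) {
                    String x = (String) a.getFieldValue("title_sort"); // hypothetical field
                    String y = (String) b.getFieldValue("title_sort");
                    return x.compareTo(y);
                }
            });
            return docs;
        }
    }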
Experiencing lots of full GC runs
We currently have a 30GB index with 73MB of .tii files running on a machine with 4 Intel 2.27GHz Xeons and 15GB of memory. About once a second a process indexes ~10-20 smallish documents using the XML update handler, and a commit happens after every update. However, we see this behaviour even if the indexer isn't running. The system is running under Tomcat6 with Solr 1.4.1 955763M - mark - 2010-06-17 18:06:42 and Lucene 2.9.3 951790 - 2010-06-06 01:30:55. Our GC settings (the least worst we've found so far) currently look like

    -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseParNewGC
    -XX:NewSize=5G -XX:SurvivorRatio=3 -Xmx10G -Xss10M
    -XX:CMSInitiatingOccupancyFraction=40 -XX:+UseCMSInitiatingOccupancyOnly

Everything is fine until we start to try and search, at which point performance goes to hell with multi-second response times and frequent full GC runs (approx every 15 seconds) looking like

    2372.886: [Full GC 2372.886: [CMS2378.577: [CMS-concurrent-mark: 5.912/5.913 secs] [Times: user=6.10 sys=0.01, real=5.91 secs]
    (concurrent mode failure): 5242879K->5242879K(5242880K), 18.2557740 secs] 9437183K->9409440K(9437184K),
    [CMS Perm : 30246K->30242K(50552K)] icms_dc=100 , 18.2558680 secs] [Times: user=18.20 sys=0.05, real=18.26 secs]

Looking at top, jsvc is using 100% of CPU. I'm baffled - I've had way bigger indexes than this before with no performance problems. At first I blamed the frequent updates, but the fact that it happens even when the indexer isn't running seems to put paid to that. One salient point - because of the frequent updates we don't have a queryResultCache configured. Any ideas? Hints? Tips? Simon
Re: Experiencing lots of full GC runs
On Fri, Nov 19, 2010 at 12:01:09AM +0000, me said: I'm baffled - I've had way bigger indexes than this before with no performance problems. At first it was the frequent updates but the fact that it happens even when the indexer isn't running seems to put paid to that.

More information:
- The index has ~30 million smallish documents
- Once a slow query has been executed, all other queries, even ones which had previously been slow but tolerable (response times ~1s), become incredibly slow
- Once the process has turned slow, only a kill -9 will bring it down
- Upgrading to a recent nightly build of Solr (3.1-2010-11-18_05-27-29 1036325 - hudson - 2010-11-18 05:41:58) has made things even slower
- I'd check with 4.0.x if someone can point me at a tool that can migrate indexes; I seem to be unable to find one, and Lucene 3.0 informs me that it's incompatible with 2.9.x
Re: Possible memory leaks with frequent replication
On Mon, Nov 01, 2010 at 05:42:51PM -0700, Lance Norskog said: You should query against the indexer. I'm impressed that you got 5s replication to work reliably. That's our current solution - I was just wondering if there was anything I was missing. Thanks!
Possible memory leaks with frequent replication
We've been trying to get a setup in which a slave replicates from a master every few seconds (ideally every second, but currently we have it set at every 5s). Everything seems to work fine until, periodically, the slave just stops responding from what looks like running out of memory:

    org.apache.catalina.core.StandardWrapperValve invoke
    SEVERE: Servlet.service() for servlet jsp threw exception
    java.lang.OutOfMemoryError: Java heap space

(our monitoring seems to confirm this). Looking around, my suspicion is that it takes new Readers longer to warm than the gap between replications, and thus they just build up until all memory is consumed (which, I suppose, isn't really memory 'leaking' per se, more just resource consumption). That said, we've tried turning off caching on the slave and that didn't help either, so it's possible I'm wrong. Is there anything we can do about this? I'm reluctant to increase the heap space since I suspect that will just mean a longer period between failures. Might Zoie help here? Or should we just query against the master? Thanks, Simon
Re: Sorting on arbitrary 'custom' fields
On Mon, Oct 11, 2010 at 07:17:43PM +0100, me said: It was just an idea though and I was hoping that there would be a simpler more orthodox way of doing it. In the end, for anyone who cares, we used dynamic fields. There are a lot of them but we haven't seen performance impacted that badly so far.
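To make that concrete, indexing now looks something like this. A sketch, assuming *_i integer dynamic fields declared in schema.xml:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DynamicFieldExample {
        public static void index(SolrServer server) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("title", "My Random Document");
            doc.addField("user", "Simon");
            // User-supplied key/values flattened into dynamic fields, which
            // can then be sorted on like any other field,
            // e.g. sort=current_temp_in_c_i asc
            doc.addField("current_temp_in_c_i", 32);
            doc.addField("time_taken_to_write_in_mins_i", 30);
            server.add(doc);
            server.commit();
        }
    }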
Re: Sorting on arbitrary 'custom' fields
On Sat, Oct 09, 2010 at 06:31:19PM -0400, Erick Erickson said: I'm confused. What do you mean that a user can set any number of arbitrarily named fields on a document? It sounds like you are talking about a user adding arbitrarily many entries to a multi-valued field? Or is it some kind of key:value pairs in a field in your schema?

Users can add arbitrary key/values to documents, kind of like Machine Tags. So whilst a document has some standard fields (e.g. title=My Random Document, user=Simon, date=2010-10-11), I might have added current_temp_in_c=32 to one of my documents but you might have put time_taken_to_write_in_mins=30 on one of yours. We currently don't index these fields but we'd like to, and we'd like users to be able to sort on them. Ideas I had:

- Every time a user adds a new field (e.g. time_taken_to_write_in_mins), update the global schema. But that would be horrible and would create an index with many thousands of fields.
- Give each user their own core and update each individual schema. Better, but still inelegant.

The multi-valued field idea occurred to me because I could have, for example,

    user_field: [time_taken_to_write_in_mins=30, current_temp_in_c=32]

(i.e. flatten the key/value). I could then maybe write something that allowed sorting only on matched values of a multi-valued field:

    sort=user_field:time_taken_to_write_in_mins=*

or

    fq=user_field:time_taken_to_write_in_mins=*&sort=user_field

It was just an idea though, and I was hoping that there would be a simpler, more orthodox way of doing it. thanks, Simon
Problems indexing spatial field - undefined subField
I'm trying to index a latLon field. I have a fieldType in my schema.xml that looks like

    <fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/>

and a field that looks like

    <field name="location" type="latLon" indexed="true" stored="true"/>

I'm trying to upload via the JSON update handler but I'm getting a 400 error: undefined field location_0_latLon. FWIW the JSON looks like

    "location": "38.044337,-103.513824"

Any idea what I'm doing wrong? Maybe I shouldn't be using the JSON update handler? Simon
Re: Problems indexing spatial field - undefined subField
On Wed, Sep 01, 2010 at 01:05:47AM +0100, me said: I'm trying to index a latLon field. <fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/> <field name="location" type="latLon" indexed="true" stored="true"/>

Turns out changing it to

    <fieldType name="latLon" class="solr.LatLonType" subFieldType="double"/>

fixed it.
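For anyone else hitting this, the document add itself is then unremarkable. A sketch via SolrJ rather than the JSON handler:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class LatLonExample {
        public static void index(SolrServer server) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1"); // hypothetical id
            // LatLonType takes a single "lat,lon" string and indexes it into
            // two generated subfields behind the scenes.
            doc.addField("location", "38.044337,-103.513824");
            server.add(doc);
            server.commit();
        }
    }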
Re: Getting the character offset from highlighted fragments
On Thu, Apr 22, 2010 at 02:15:08AM +0100, me said: It looks like org.apache.lucene.search.highlight.TextFragment has the right information to do this (i.e. textStartPos)

Turns out that it doesn't seem to have the right information, in that textStartPos always seems to be 0 (and textEndPos just seems to be the length of the fragment). Any suggestions?
Getting the character offset from highlighted fragments
Having poked around a little, it doesn't look like there's a query param to turn this on, but it'd be really useful if highlighted fragments could have a character offset returned somehow - maybe something like

    <lst name="highlighting">
      <lst name="27314523">
        <arr name="content">
          <str offset="600">Lorem ipsum dolor sit amet, <em>consectetur</em> adipisicing</str>
        </arr>
      </lst>
    </lst>

or even

    <lst name="highlighting">
      <lst name="27314523">
        <arr name="content">
          <str>Lorem ipsum dolor sit amet, <em>consectetur</em> adipisicing</str>
        </arr>
        <arr name="offsets">
          <int>600</int>
        </arr>
      </lst>
    </lst>

It looks like org.apache.lucene.search.highlight.TextFragment has the right information to do this (i.e. textStartPos) but before I start writing patches...

- Am I duplicating existing work?
- Am I missing some reason why this is a dumb idea?
- Is this desirable (or, to be more succinct, if I write a patch, is it likely to be accepted)?

Thanks, Simon
Re: Slow QueryComponent.process() when queries have numbers in them
On Wed, Feb 03, 2010 at 07:38:13PM -0800, Lance Norskog said: The debugQuery parameter shows you how the query is parsed into a tree of Lucene query objects.

Well, that's kind of what I'm asking - I know how the query is being parsed:

    <str name="rawquerystring">myers 8e psychology chapter 9</str>
    <str name="querystring">myers 8e psychology chapter 9</str>
    <str name="parsedquery">+((DisjunctionMaxQuery((content:myer^0.8 | title:myer^1.5)~0.01) DisjunctionMaxQuery((content:"8 e"~2^0.8 | title:"8 e"~2^1.5)~0.01) DisjunctionMaxQuery((content:psycholog^0.8 | title:psycholog^1.5)~0.01) DisjunctionMaxQuery((content:chapter^0.8 | title:chapter^1.5)~0.01) DisjunctionMaxQuery((content:9^0.8 | title:9^1.5)~0.01))~4) ()</str>
    <str name="parsedquery_toString">+(((content:myer^0.8 | title:myer^1.5)~0.01 (content:"8 e"~2^0.8 | title:"8 e"~2^1.5)~0.01 (content:psycholog^0.8 | title:psycholog^1.5)~0.01 (content:chapter^0.8 | title:chapter^1.5)~0.01 (content:9^0.8 | title:9^1.5)~0.01)~4) ()</str>

But that's sort of besides the point - I was really asking whether this is a known issue (i.e. queries with numbers in them can be very slow) and whether there are any workarounds.
Slow QueryComponent.process() when queries have numbers in them
According to my logs, org.apache.solr.handler.component.QueryComponent.process() takes a significant amount of time (5s, but I've seen up to 15s) when a query has an odd pattern of numbers in it, e.g.

    neodymium megagauss-oersteds (MGOe) (1 MG·Oe = 7,958·10³ T·A/m = 7,958 kJ/m³
    myers 8e psychology chapter 9
    JOHN PIPER 1 TIMOTEO 3:1?
    lab 2.6.2: using wireshark to view protocol data units
    malha de aço 3x3 6mm - peso m2

or even something that barely looks like a search query at all:

    An experiment has two outcomes, A and A. If A is three time as likely to occur as , what is P(A)?

The other params were

    fl: *,score
    fq: +num_pages:[2 TO *] AND +language:1
    hl: true
    hl.fl: content title description
    hl.simple.post: </strong>
    hl.simple.pre: <strong>
    hl.snippets: 2
    qf: title^1.5 content^0.8
    qs: 2
    qt: dismax
    rows: 10
    sort: score desc
    start: 0
    wt: json

Is this just something I'm going to have to put up with? Or is there something I can do to mitigate it? If it's a bug, any suggestions on how to start patching it?
Problems with spellchecker
The spellchecker in my 1.4 install started behaving increasingly erratically, and suggestions would only be returned some of the time for the same query. I tried to force a rebuild using spellcheck.build=yes, the full request being

    /select/?q=alexandr the great&indent=on&fl=title&spellcheck=yes&spellcheck.collate=yes&spellcheck.count=3&qt=dismax&spellcheck.build=yes

and the request spun for a while and then returned

    HTTP Status 500 - this IndexReader is closed
    org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
        at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:209)
        at org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:624)
        at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:147)
        at org.apache.lucene.search.spell.SpellChecker.exist(SpellChecker.java:315)
        at org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:339)
        at org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:362)
        at org.apache.solr.spelling.IndexBasedSpellChecker.build(IndexBasedSpellChecker.java:89)
        at org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:102)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
        [ .. snip .. ]

And the spellcheck directory now has a write.lock file. I tried a couple more times, then stopped tomcat, deleted the write.lock, restarted and tried again. Same error. So I stopped tomcat again, nuked the spellcheck directory, restarted tomcat and tried again. Same error. I tried one more time and got a "This file does not exists _f00.cfs", tried again and got the "IndexReader is closed" error again. My source index is 58GB. Any ideas? Simon
Oddly slow replication
I have a Master server with two Slaves populated via Solr 1.4 native replication. Slave1 syncs at a respectable speed, i.e. around 100MB/s, but Slave2 runs much, much slower - the peak I've seen is 56KB/s. Both are running on the same hardware with the same config - compression is set to 'internal' and http(Conn|Read)Timeout are at the defaults (5000/1). I've checked whether it was a disk problem using dd, and whether it was a network problem by doing a manual scp and an rsync from the slave to the master and from the master to the slave. I've shut down the replication polling on Slave1 just to see if that was causing the problem, but there's been no improvement. Any ideas?
Re: Oddness with Phrase Query
On Mon, Nov 23, 2009 at 12:10:42PM -0800, Chris Hostetter said: ...hmm, you shouldn't have to reindex everything. Are you sure you restarted solr after making the enablePositionIncrements="true" change to the query analyzer?

Yup - definitely restarted.

what do the offsets look like when you go to analysis.jsp and paste in that sentence?

    org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true}

    term position     1      4
    term text         Here   Dragons
    term type         word   word
    source start,end  0,4    14,21
    payload

the other thing to consider: you can increase the slop value on that phrase query (to allow looser matching) using the qs param (query slop) ... that could help in this situation (stop words getting stripped out of the query) as well as other situations (ie: what if the user just types "here be dragons" - with or without stop words)

After fiddling with the position increments stuff I upped the query slop to 2, which seems to now provide better results, but I'm worried about that affecting relevancy elsewhere (which I presume is the reason why it's not the default value). If that's the case - is it worth writing something for my app so that if it detects a phrase query with lots of stop words it ups the phrase slop? Either way it seems to be working now - thanks for all the help, Simon
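PS - in case anyone wants to do the same, the detection I have in mind is trivial. A sketch, with a made-up subset of our stopwords.txt:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class SlopBumper {
        // made-up subset of stopwords.txt
        static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("a", "an", "be", "here", "there", "the"));

        // Returns a qs (query slop) value for a quoted phrase: 0 normally,
        // 2 when the phrase contains two or more stop words that the
        // analyzer will strip out.
        public static int slopFor(String phrase) {
            int stops = 0;
            for (String term : phrase.toLowerCase().split("\\W+")) {
                if (STOPWORDS.contains(term)) stops++;
            }
            return stops >= 2 ? 2 : 0;
        }
    }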
Re: Oddness with Phrase Query
On Tue, Nov 17, 2009 at 11:09:38AM -0800, Chris Hostetter said: Several things about your message don't make sense...

Hmm, sorry - a byproduct of building up the mail over time, I think. The query

    ?q="Here there be dragons"&fl=id,title,score&debugQuery=on&qt=dismax&qf=title

gets echoed as

    <lst name="params">
      <str name="qf">title</str>
      <str name="fl">id,title,score</str>
      <str name="debugQuery">on</str>
      <str name="q">"Here there be dragons"</str>
      <str name="qt">dismax</str>
    </lst>

and gets parsed as

    +DisjunctionMaxQuery((title:"here dragon")~0.01) ()

and gets no results. Whereas

    ?q=Here, there be dragons&fl=id,title,score&debugQuery=on&qt=dismax&qf=title

gets echoed as

    <lst name="params">
      <str name="debugQuery">on</str>
      <str name="fl">id,title,score</str>
      <str name="q">Here, there be dragons</str>
      <str name="qf">title</str>
      <str name="qt">dismax</str>
    </lst>

and parsed as

    +((DisjunctionMaxQuery((title:here)~0.01) DisjunctionMaxQuery((title:dragon)~0.01))~2) ()

and gets one result:

    <doc>
      <float name="score">6.3863463</float>
      <str name="id">20980889</str>
      <str name="title">Zelazny, Roger - Here There Be Dragons</str>
    </doc>

It looks like it might be related to SOLR-879: "Enable position increments in the query parser and fix the example schema to enable position increments for the stop filter in both the index and query analyzers to fix the bug with phrase queries with stopwords. (yonik)" http://issues.apache.org/jira/browse/SOLR-879

Although I added enablePositionIncrements="true" to

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

in the <analyzer type="query"> for <fieldType name="text"> in the schema, that didn't fix it - I presume this means that I have to reindex everything (the StopFilterFactory in <analyzer type="index"> already had it).
Oddness with Phrase Query
I have a document with the title "Here, there be dragons" and a body. When I search for

    q = Here, there be dragons
    qf = title^2.0 body^0.8
    qt = dismax

which is parsed as

    +DisjunctionMaxQuery((content:"here dragon"^0.8 | title:"here dragon"^2.0)~0.01) ()

I get the document as the first hit, which is what I'd expect. However, if I change the query to q = "Here, there be dragons" (with quotes), which is parsed as

    +DisjunctionMaxQuery((content:"here dragon"^0.8 | title:"here dragon"^2.0)~0.01) ()

then I don't get the document at all, which is not what I'd expect. I've tried modifying the phrase slop but still don't get any results back. Am I doing something wrong - do I have to have an untokenized copy of the fields lying around? Thanks, Simon
Re: Issues with SolrJ and IndexReader reopening
On Fri, Oct 30, 2009 at 11:20:19AM +0530, Shalin Shekhar Mangar said: That is very strange. IndexReaders do get re-opened after commits. Do you see a commit message in the Solr logs?

Sorry for the delay - I've been trying to puzzle over this some more. The code looks like

    server.add(docs);
    server.commit();
    server.optimize();

and I'm seeing the following in the logs, which would seem to indicate that both a commit and an optimize are happening and that a new searcher is getting opened:

    documentCache{lookups=0,hits=0,hitratio=0.00,inserts=10,evictions=0,size=10,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
    Nov 4, 2009 11:38:06 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {commit=} 0 264
    Nov 4, 2009 11:38:06 AM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=null path=/update params={commit=true&waitFlush=true&waitSearcher=true} status=0 QTime=264
    Nov 4, 2009 11:38:06 AM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: start commit(optimize=true,waitFlush=true,waitSearcher=true)
    Nov 4, 2009 11:38:08 AM org.apache.solr.search.SolrIndexSearcher init
    INFO: Opening Searcher@6d3f199b main
    Nov 4, 2009 11:38:08 AM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: end_commit_flush
    Nov 4, 2009 11:38:08 AM org.apache.solr.search.SolrIndexSearcher warm
    INFO: autowarming Searcher@6d3f199b main from Searcher@2f8acca4 main

But I still see the same result. We ended up going a different route anyway, but I'm still slightly confused. For what it's worth, the code for instantiating the server is

    SolrConfig solrConfig = new SolrConfig(CONFIG_PATH, "solrconfig.xml", null);
    IndexSchema indexSchema = new IndexSchema(solrConfig, "schema.xml", null);
    SolrResourceLoader resource = new SolrResourceLoader(SolrResourceLoader.locateInstanceDir());
    CoreContainer container = new CoreContainer(resource);
    CoreDescriptor dcore = new CoreDescriptor(container, "", solrConfig.getResourceLoader().getInstanceDir());
    dcore.setConfigName(solrConfig.getResourceName());
    dcore.setSchemaName(indexSchema.getResourceName());
    core = new SolrCore(null, DATA_PATH, solrConfig, indexSchema, dcore);
    container.register("", core, false);
    server = new EmbeddedSolrServer(container, "");

Thanks, Simon
Issues with SolrJ and IndexReader reopening
We've been trying to build an indexing pipeline using SolrJ but we've run into a couple of issues - namely that IndexReaders don't seem to get reopened after a commit(). After an index or a delete, the change doesn't show up until I restart Solr. I've tried commit() and commit(true, true) just to try and be specific. I've also tried adding an optimize(true, true) but nothing doing. Am I missing something obvious?
Index Corruption (possibly during commit)
We have an indexing script which has been running for a couple of weeks now without problems. It indexes documents and then periodically commits (which is a tad redundant, I suppose), both via the HTTP interface. All documents are indexed to a master and a slave rsyncs them off using the standard 1.3.0 replication. Recently the indexing script got into problems when the commit was taking longer than the request timeout. I killed the script, did a commit by hand (using bin/commit) and then started to index again, and it still wouldn't commit. We then tried to go to the stats page and got the error

    org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _mib: fieldsReader shows 1 but segmentInfo shows 718
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:470)
        at ...

This is a stock 1.3.0 running off tomcat 6.0.20 with

    java version "1.6.0_13"
    Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
    Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

    Linux solr.local 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

and plenty of RAM and disk space (usage is 31% - 353G used of 534G). CheckIndex says

    Opening index @ index/
    Segments file=segments_c8z numSegments=28 version=FORMAT_HAS_PROX [Lucene 2.4]
    Checking only these segments: _mib:
      22 of 28: name=_mib docCount=718
        compound=false
        hasProx=true
        numFiles=9
        size (MB)=0.029
        has deletions [delFileName=_mib_1.del]
        test: open reader.FAILED
        WARNING: fixIndex() would remove reference to this segment; full exception:
    org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _mib: fieldsReader shows 1 but segmentInfo shows 718
        at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:282)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:591)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

    WARNING: 1 broken segments (containing 718 documents) detected
    WARNING: would write new segments file, and 718 documents would be lost, if -fix were specified

Any ideas? We can restore from backups and backfill, but really we'd love to know what caused this so we can avoid a repetition. Simon
'Down' boosting shorter docs
Our index has some items in it which basically contain a title and a single-word body. If the user searches for a word in the title (especially if the title is itself only one word) then that doc will get scored quite highly, despite the fact that, in this case, it's not really relevant. I've tried something like

    qf=title^2.0 content^0.5&bf=num_pages

but that disproportionately boosts long documents to the detriment of relevancy.

    bf=product(num_pages,0.05)

has no effect, but

    bf=product(num_pages,0.06)

returns a bunch of long documents which don't seem to have any highlighted fields, plus the short document with only the query term in the title - which is progress, in that it's almost exactly the opposite of what I want. Any suggestions? Am I going to need to reindex and add the length in bytes or characters of the document? Simon
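PS - if I do end up reindexing, the plan would just be to compute the length at index time. A sketch via SolrJ, with a hypothetical doc_length field:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DocLengthExample {
        public static void index(SolrServer server, String title, String body) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("title", title);
            doc.addField("content", body);
            // Store the body length so short, title-only documents can be
            // down-boosted at query time, e.g. bf=log(doc_length).
            doc.addField("doc_length", body.length());
            server.add(doc);
        }
    }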
Advantages of different Servlet Containers
I know that the Solr FAQ says "Users should decide for themselves which Servlet Container they consider the easiest/best for their use cases based on their needs/experience. For high traffic scenarios, investing time for tuning the servlet container can often make a big difference." but is there anywhere that lists some of the various advantages and disadvantages of, say, Tomcat over Jetty for someone who isn't current with the Java ecosystem? Also, I'm currently using Jetty but I've had to do a horrific hack to make it work under init.d, in that I start it up in the background and then tail the output waiting for the line that says the SocketConnector has been started:

    while [ -z "$(tail -1 $LOG | grep 'Started SocketConnector')" ]; do
        sleep 1
    done

There's *got* to be a better way of doing this, right? Thanks, Simon