Re: speeding up queries (MySQL faster)
FYI, this optimization resulted in a fantastic performance boost! I went from 133 queries/sec to 990 queries/sec. I'm now limited mostly by socket overhead, as I get 1700 queries/sec when I stick the clients right in the same process as the server.

Oddly enough, the performance increased while the CPU utilization decreased to around 55% (in both configurations above). I'll have to look into that later, but any additional performance at this point is pure gravy.

-Yonik

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Doug wrote:
> > For example, Nutch automatically translates such
> > clauses into QueryFilters.
>
> Thanks for the excellent pointer, Doug! I'll
> definitely be implementing this optimization.
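For anyone who finds this thread later, here is a minimal sketch of the kind of rewrite being discussed, written against the Lucene 1.4 API. It is an illustration, not Yonik's actual code: the field names and values come from the example query further down the thread, and the filter cache is just one possible way to reuse a QueryFilter across requests (QueryFilter caches its bit set per IndexReader, so reusing the same instance is what makes the filter cheap).

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Sketch: move the huge, low-selectivity clause (field1:4) out of the
// scored query and into a cached QueryFilter, keeping only the selective
// clauses in the BooleanQuery.
public class FilteredSearchSketch {
  // One filter per distinct field1 value, reused across requests so the
  // underlying bit set is only computed once per reader (assumption: the
  // number of distinct values is small).
  private static final java.util.Map filterCache = new java.util.HashMap();

  public static Hits search(IndexSearcher searcher) throws java.io.IOException {
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("field2", "188453")), true, false); // required
    q.add(new TermQuery(new Term("field3", "1")),      true, false); // required

    Filter f;
    synchronized (filterCache) {
      f = (Filter) filterCache.get("field1:4");
      if (f == null) {
        f = new QueryFilter(new TermQuery(new Term("field1", "4")));
        filterCache.put("field1:4", f);
      }
    }
    // The filter restricts the result set but does not contribute to the
    // score; see Doug's note about Similarity.coord() further down.
    return searcher.search(q, f);
  }
}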
Re: speeding up queries (MySQL faster)
> For example, Nutch automatically translates such
> clauses into QueryFilters.

Thanks for the excellent pointer, Doug! I'll definitely be implementing this optimization.

If anyone cares, I did a 1-minute hprof test with the search server running in a servlet container. Here are the results (sorry about Yahoo's short line length).

-Yonik

resin.hprof.txt: Exclusive Method Times (CPU) (virtual times)

27390 (37.5%) java.net.PlainSocketImpl.socketAccept
14885 (20.4%) org.apache.lucene.index.SegmentTermDocs.skipTo
 6700  (9.2%) org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal
 5810  (8.0%) java.io.UnixFileSystem.list
 4785  (6.5%) org.apache.lucene.store.InputStream.readByte
 3315  (4.5%) java.io.RandomAccessFile.readBytes
 1302  (1.8%) java.net.SocketOutputStream.socketWrite0
 1004  (1.4%) java.io.RandomAccessFile.seek
  546  (0.7%) java.lang.String.intern
  336  (0.5%) com.caucho.vfs.WriteStream.print
  248  (0.3%) org.apache.lucene.search.TermScorer.next
  236  (0.3%) org.apache.lucene.queryParser.QueryParser.jj_scan_token
  232  (0.3%) org.apache.lucene.index.SegmentTermEnum.readTerm
  228  (0.3%) org.apache.lucene.search.ConjunctionScorer.score
  200  (0.3%) org.apache.lucene.queryParser.FastCharStream.refill
  196  (0.3%) org.apache.lucene.store.InputStream.readVInt
  180  (0.2%) java.security.AccessController.doPrivileged
  172  (0.2%) org.apache.lucene.search.ConjunctionScorer.doNext
  152  (0.2%) java.lang.Object.clone
  152  (0.2%) org.apache.lucene.index.SegmentReader.document
  148  (0.2%) java.lang.Throwable.fillInStackTrace
  128  (0.2%) org.apache.lucene.index.SegmentReader.norms
  116  (0.2%) org.apache.lucene.store.InputStream.readString
  112  (0.2%) java.lang.StrictMath.log
  108  (0.1%) java.util.LinkedList.addLast
  100  (0.1%) java.net.SocketInputStream.socketRead0
   88  (0.1%) org.apache.lucene.search.ConjunctionScorer.next
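For reference, a profile like the one above comes from the JVM's built-in hprof agent. The exact flags Yonik used were not posted, so the line below is only an assumed example of the general approach, added to the servlet container's JVM arguments:

java -Xrunhprof:cpu=times,thread=y,file=resin.hprof.txt ...

The per-method timing report is written to the named file when the JVM exits.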
Re: speeding up queries (MySQL faster)
Yonik Seeley wrote:
> Setup info & Stats:
> - 4.3M documents, 12 keyword fields per document, 11
[ ... ]
> "field1:4 AND field2:188453 AND field3:1"
>
> field1:4 done alone selects around 4.2M records
> field2:188453 done alone selects around 1.6M records
> field3:1 done alone selects around 1K records
> The whole query normally selects less than 50 records
> Only the first 10 are returned (or whatever range the client selects).

The "field1:4" clause is probably dominating the cost of query execution. Clauses which match large portions of the collection are slow to evaluate. If there are not too many different such clauses, then you can optimize this by re-using a Filter in place of such clauses, typically a QueryFilter.

For example, Nutch automatically translates such clauses into QueryFilters. See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup

Note that this only converts clauses whose boost is zero. Since filters do not affect ranking, we can only safely convert clauses which do not contribute to the score, i.e., those whose boost is zero. Scores might still be different in the filtered results because of Similarity.coord(). But in Nutch, Similarity.coord() is overridden to always return 1.0, so that the replacement of clauses with filters does not alter the final scores at all.

Doug
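A minimal sketch of the scoring side of this, against the Lucene 1.4 API (an illustration of the idea, not the actual Nutch LuceneQueryOptimizer code):

import org.apache.lucene.search.*;

// If coord() is forced to 1.0, dropping a clause from the BooleanQuery
// and applying it as a Filter instead no longer changes the scores of
// the matching documents.
public class NoCoordSimilarity extends DefaultSimilarity {
  public float coord(int overlap, int maxOverlap) {
    return 1.0f;  // ignore how many clauses matched
  }

  // The conversion is only safe for required, zero-boost clauses;
  // roughly this test (hypothetical helper, not Nutch code):
  static boolean convertible(BooleanClause c) {
    return c.required && !c.prohibited && c.query.getBoost() == 0.0f;
  }
}

Installed with searcher.setSimilarity(new NoCoordSimilarity()), this makes pulling a zero-boost required clause out into a Filter score-neutral.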
Re: speeding up queries (MySQL faster)
Oops, CPU usage is *not* 50%, but closer to 98%. This is due to a bug in the per-process CPU% reported on RHEL 3 on multiprocessor systems (I can run multiple threads in while(1) loops and top will still show only 50% CPU usage for that process). The aggregated (not per-process) statistics shown by top are correct, and they show about 73% user time, 25% system time, and anywhere between 0.5% and 2% idle time.

Unfortunately, this means that I won't be getting any performance improvements from using a second IndexSearcher, and I'm stuck at being 3 times slower than MySQL on the same data/queries.

I guess the next step is some profiling... move the server out of the servlet container, move the clients in with the server, and then try some hprof work.

Does anyone have pointers to Lucene caching and how to tune it?

-Yonik

--- Bernhard Messer <[EMAIL PROTECTED]> wrote:
> Yonik,
>
> there is another "synchronized" block in
> CSInputStream which could block
> your second cpu out.
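For what it's worth, the sanity check Yonik describes can be reproduced with a few lines of Java. This is an illustrative reconstruction of that while(1) test, not his actual code:

// Spin one busy thread per CPU; a process running two of these should be
// using both CPUs, yet the buggy per-process view described above still
// reports only ~50% for the process.
public class SpinTest {
  public static void main(String[] args) {
    int cpus = 2;  // assumption: dual-CPU box, as in this thread
    for (int i = 0; i < cpus; i++) {
      new Thread(new Runnable() {
        public void run() {
          while (true) { /* burn CPU */ }
        }
      }).start();
    }
  }
}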
Re: speeding up queries (MySQL faster)
Yonik,

there is another "synchronized" block in CSInputStream which could block your second cpu out. Do you think there is a chance to recreate the index (maybe a smaller subset) without the compound file option enabled and run your test again, so that we can see if this helps?

regards
Bernhard

Otis Gospodnetic wrote:
> Ah, you may be right (no stack trace in email any more). Somebody
> recently identified a few bottlenecks that, if I recall correctly,
> were related to synchronized blocks. I believe Doug committed some
> improvements, but I can't remember which version of Lucene that is
> in. It's definitely in 1.4.1.
>
> Otis
>
> --- Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > > The bottleneck seems to be disk IO.
> >
> > But it's not. Linux is caching the whole file, and there really
> > isn't any disk activity at all. Most of the threads are blocked on
> > InputStream.refill, not waiting for the disk, but waiting for their
> > turn into the synchronized block to read from the disk (which is
> > why I asked about caching above that level).
> >
> > CPU is a constant 50% on a dual CPU system (meaning 100% of 1 cpu).
> >
> > -Yonik
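For reference, turning the compound file format off for a test index is a one-line switch on the Lucene 1.4 IndexWriter. A minimal sketch; the path and analyzer are placeholders, not details from this thread:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Rebuild (or build a smaller test subset of) the index with separate
// per-extension files per segment instead of a single .cfs compound file,
// so concurrent readers are not all funneled through CSInputStream's
// synchronized read.
public class NonCompoundIndexSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer =
        new IndexWriter("/path/to/test-index",   // placeholder path
                        new StandardAnalyzer(),  // placeholder analyzer
                        true);                   // create a new index
    writer.setUseCompoundFile(false);  // presumably the switch Bernhard means
    // ... addDocument() calls for the test subset go here ...
    writer.optimize();
    writer.close();
  }
}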
Re: speeding up queries (MySQL faster)
Ah, you may be right (no stack trace in email any more). Somebody recently identified a few bottlenecks that, if I recall correctly, were related to synchronized blocks. I believe Doug committed some improvements, but I can't remember which version of Lucene that is in. It's definitely in 1.4.1.

Otis

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > The bottleneck seems to be disk IO.
>
> But it's not. Linux is caching the whole file, and
> there really isn't any disk activity at all. Most of
> the threads are blocked on InputStream.refill, not
> waiting for the disk, but waiting for their turn into
> the synchronized block to read from the disk (which is
> why I asked about caching above that level).
>
> CPU is a constant 50% on a dual CPU system (meaning
> 100% of 1 cpu).
>
> -Yonik
Re: speeding up queries (MySQL faster)
--- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> The bottleneck seems to be disk IO.

But it's not. Linux is caching the whole file, and there really isn't any disk activity at all. Most of the threads are blocked on InputStream.refill, not waiting for the disk, but waiting for their turn into the synchronized block to read from the disk (which is why I asked about caching above that level).

CPU is a constant 50% on a dual CPU system (meaning 100% of 1 cpu).

-Yonik
Re: speeding up queries (MySQL faster)
The bottleneck seems to be disk IO. Since this is a read-only index, why not spread some of the frequently scanned index files over multiple disks, or put the index on SCSI disks hooked up in a RAID? Maybe this is already the case, but you didn't mention it. Oh, I already answered a similar question once before:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg05103.html

Otis
http://www.simpy.com/ -- Index, Search and Share your bookmarks

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm trying to figure out how to speed up queries to a
> large index. I'm currently getting 133 req/sec, which isn't bad,
> but isn't too close to MySQL, which is getting 500 req/sec on the
> same hardware with the same set of documents.
>
> Setup info & Stats:
> - 4.3M documents, 12 keyword fields per document, 11
> unindexed fields per document.
> - lucene index size on disk=1.3G
> - Hardware: dual opteron w/ 16GB memory, running 64
> bit JVM (Sun 1.5 beta)
> - Lucene version 1.4.1
> - Hitting multithreaded server w/ 10 clients at once
> - This is a read-only index... no updating is done
> - Single IndexSearcher that is reused for all requests
>
[ ... ]
>
> Any help would be greatly appreciated. There's not a
> lot of info on searching (much more on updating).
>
> -Yonik
speeding up queries (MySQL faster)
Hi,

I'm trying to figure out how to speed up queries to a large index. I'm currently getting 133 req/sec, which isn't bad, but isn't too close to MySQL, which is getting 500 req/sec on the same hardware with the same set of documents.

Setup info & Stats:
- 4.3M documents, 12 keyword fields per document, 11 unindexed fields per document.
- lucene index size on disk = 1.3G
- Hardware: dual opteron w/ 16GB memory, running 64 bit JVM (Sun 1.5 beta)
- Lucene version 1.4.1
- Hitting multithreaded server w/ 10 clients at once
- This is a read-only index... no updating is done
- Single IndexSearcher that is reused for all requests

Q1) While hitting it with multiple queries at once, lucene is pegged at 50% CPU usage (meaning it is only using 1 out of 2 CPUs on average). I took a thread dump and all of the lucene threads except one are blocked on reading a file (see trace below). I could create two index readers, but that seems like it might be a waste, and fixing a symptom instead of the root problem. Would multiple IndexSearchers or IndexReaders share internal caches? Is there a way to cache more info at a higher level such that it would get rid of this bottleneck? The JVM isn't taking up much space (125M or so), and I have 16GB to work with! The OS (linux) is obviously caching the index file, but that doesn't get rid of the synchronization issues, and the overhead of re-reading. How is caching in lucene configured? Does it internally use FieldCache, or do I have to use that somehow myself?

"tcpConnection-8080-72" daemon prio=1 tid=0x002b24412490 nid=0x34a4 waiting for monitor entry [0x45aba000..0x45abb2d0]
    at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:215)
    - waiting to lock <0x002ae153fa00> (a org.apache.lucene.store.FSInputStream)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:176)
    at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:88)
    at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:53)
    at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
    at org.apache.lucene.search.Scorer.score(Scorer.java:37)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)

Even using only 1 cpu though, MySQL is faster. Here is what the queries look like:

"field1:4 AND field2:188453 AND field3:1"

field1:4 done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1 done alone selects around 1K records
The whole query normally selects less than 50 records
Only the first 10 are returned (or whatever range the client selects).

The fields are all keywords checked for exact matches (no fulltext search is done). Is there anything I can do to speed these queries up, or is the structure just more suited to MySQL (and not an inverted index)?

How is a query like this carried out?

Any help would be greatly appreciated. There's not a lot of info on searching (much more on updating). I'm looking forward to "Lucene in Action"! Too bad it's not out till October.

-Yonik
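To make the question concrete, this is roughly how such a query is built and run against a single shared IndexSearcher in Lucene 1.4. The field names and values are the ones from the example above; the index path is a placeholder:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Three required TermQuery clauses in a BooleanQuery.  Lucene advances the
// three posting lists together (leapfrogging via skipTo -- the
// ConjunctionScorer / SegmentTermDocs.skipTo frames in the thread dump),
// so the ~4.2M-doc field1:4 list is what makes this expensive.
public class KeywordQuerySketch {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/index"); // placeholder

    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("field1", "4")),      true, false); // required
    q.add(new TermQuery(new Term("field2", "188453")), true, false); // required
    q.add(new TermQuery(new Term("field3", "1")),      true, false); // required

    Hits hits = searcher.search(q);
    for (int i = 0; i < Math.min(10, hits.length()); i++) {  // first page only
      System.out.println(hits.score(i) + " " + hits.doc(i));
    }
  }
}

The cost of the low-selectivity field1 clause is what Doug's QueryFilter suggestion earlier in the thread addresses.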