Re: Best Practices for Distributing Lucene Indexing and Searching
6. Index locally and synchronize changes periodically. This is an interesting idea and bears looking into. Lucene can combine multiple indexes into a single one, which can be written out somewhere else and then distributed back to the search nodes to replace their existing index. This is a promising idea for handling a high update volume because it avoids having every search node repeat the analysis phase. Unfortunately, the way addIndexes() is implemented looks like it's going to present some new problems:

  public synchronized void addIndexes(Directory[] dirs) throws IOException {
    optimize();                                // start with zero or 1 seg
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();   // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j));  // add each info
      }
    }
    optimize();                                // final cleanup
  }

We need to deal with some very large indexes (40GB+), and an optimize rewrites the entire index, no matter how few documents were added. Since our strategy calls for deleting some docs from the primary index before calling addIndexes(), *both* calls to optimize() will end up rewriting the entire index! The ideal behavior would be that of addDocument(): segments are only merged occasionally. That said, I'll throw out a replacement implementation that probably doesn't work, but hopefully will spur someone with more knowledge of Lucene internals to take a look at this:

  public synchronized void addIndexes(Directory[] dirs) throws IOException {
    // REMOVED: optimize();
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();   // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j));  // add each info
      }
    }
    maybeMergeSegments();                      // replaces optimize()
  }

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
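[Editor's note] The "index locally, merge centrally" workflow under discussion can be sketched with the Lucene 1.4 API as below. The paths and the choice of StandardAnalyzer are illustrative assumptions, not from the original mail; the point is that addIndexes(Directory[]) is the merge step, and in 1.4 it optimizes before and after, which is the cost being complained about.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeDeltas {
    public static void main(String[] args) throws Exception {
        // Each indexing node has produced a small delta index on disk
        // (hypothetical paths).
        Directory[] deltas = new Directory[] {
            FSDirectory.getDirectory("/indexes/delta1", false),
            FSDirectory.getDirectory("/indexes/delta2", false),
        };

        // Open the existing primary index (create=false) and fold the
        // deltas in. In Lucene 1.4, addIndexes() calls optimize() both
        // before and after the merge, rewriting the whole primary index
        // regardless of how small the deltas are.
        IndexWriter writer = new IndexWriter("/indexes/primary",
                                             new StandardAnalyzer(), false);
        writer.addIndexes(deltas);
        writer.close();
    }
}
```

The merged result could then be copied back out to the search nodes, which is what makes the analysis phase happen only once.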
IndexWriter.addIndexes efficiency
I'd like to use addIndexes(Directory[] dirs) to add batches of documents to a main index. My main problem is that the addIndexes() implementation calls optimize() at the beginning and the end. Now, my main index will be ~25GB in size, so adding a single document and then doing an optimize will mean rewriting 25GB of files, right? That sounds like it is going to be too expensive to do often. What I would really like is to be able to control more explicitly when an optimize happens. Could addIndexes() be easily rewritten to just call maybeMergeSegments()?

-Yonik
Re: Numeric Range Restrictions: Queries vs Filters
I think it depends on the query. If the query (q1) covers a large number of documents and the filter covers a very small number, then using a RangeFilter will probably be slower than a RangeQuery.

-Yonik

See, this is what I'm not getting: what is the advantage of the second world? :) ... in what situations would using...

  s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true)));

...be a better choice than...

  s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true));
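[Editor's note] The two alternatives from the question, written out as a compilable fragment with balanced parentheses. The field names and range bounds are hypothetical; `s` is assumed to be an IndexSearcher and `q1` the main query, as in the thread.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.TermQuery;

// Hypothetical terms bounding the range on a single field:
Term t1 = new Term("price", "010");
Term t2 = new Term("price", "075");
Query q1 = new TermQuery(new Term("type", "book"));

// 1) Range expressed as a query, wrapped in a filter: the RangeQuery
//    rewrites to a BooleanQuery over every matching term, so building
//    the filter's bit set pays query/scoring overhead per term.
Hits h1 = s.search(q1, new QueryFilter(new RangeQuery(t1, t2, true)));

// 2) RangeFilter (new in 1.4): walks the term index directly and just
//    sets bits, with no scoring machinery involved.
Hits h2 = s.search(q1, new RangeFilter(t1.field(), t1.text(), t2.text(),
                                       true, true));
```

As the reply above notes, which one wins depends on how many documents the query and the range each cover.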
Re: Numeric Range Restrictions: Queries vs Filters
Hmmm, scratch that. I explained the tradeoff of a filter vs a range query - not between the different types of filters you talk about.

--- Yonik Seeley [EMAIL PROTECTED] wrote:

I think it depends on the query. If the query (q1) covers a large number of documents and the filter covers a very small number, then using a RangeFilter will probably be slower than a RangeQuery.

-Yonik

See, this is what I'm not getting: what is the advantage of the second world? :) ... in what situations would using...

  s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true)));

...be a better choice than...

  s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true));
Re: version documents
This won't fully work. You still need to delete the original out of the Lucene index to avoid it showing up in searches. Example:

  myfile v1: I want a cat
  myfile v2: I want a dog

If you change cat to dog in myfile and then do a search for cat, you will *only* get v1, and hence the sort on version doesn't help.

-Yonik

--- Justin Swanhart [EMAIL PROTECTED] wrote:

Split the filename into basefilename and version and make each a keyword. Sort your query by version descending, and only use the first basefile you encounter.
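[Editor's note] The delete-before-reindex step the reply is pointing at might look like this with the Lucene 1.4 API. The "basefilename" field follows the keyword scheme suggested in the quoted message; the path is hypothetical.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteOldVersions {
    public static void main(String[] args) throws Exception {
        // Deletes must go through an IndexReader in Lucene 1.4.
        IndexReader reader = IndexReader.open("/indexes/primary");

        // Remove every document whose basefilename keyword matches,
        // i.e. all older versions of myfile, so stale terms ("cat")
        // stop matching.
        reader.delete(new Term("basefilename", "myfile"));
        reader.close();

        // ...then open an IndexWriter and add the new version.
    }
}
```

With the old versions gone, a search for "cat" simply finds nothing, rather than finding a stale v1 that the sort-by-version trick cannot hide.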
Re: WildcardTermEnum skipping terms containing numbers?!
test
Re: Atomicity in Lucene operations
Hi Nader, I would greatly appreciate it if you could CC me on the docs or the code. Thanks!

Yonik

--- Nader Henein [EMAIL PROTECTED] wrote:

It's pretty integrated into our system at this point. I'm working on packaging it and cleaning up my documentation, and then I'll make it available. I can give you the documents, and if you still want the code I'll slap together a rough copy for you and ship it across.

Nader Henein

Roy Shan wrote:

Hello, Nader: I am very interested in how you implement the atomicity. Could you send me a copy of your code? Thanks in advance. Roy
Re: speeding up queries (MySQL faster)
FYI, this optimization resulted in a fantastic performance boost! I went from 133 queries/sec to 990 queries/sec! I'm now more limited by socket overhead, as I get 1700 queries/sec when I stick the clients right in the same process as the server. Oddly enough, the performance increased, but the CPU utilization decreased to around 55% (in both configurations above). I'll have to look into that later, but any additional performance at this point is pure gravy.

-Yonik

--- Yonik Seeley [EMAIL PROTECTED] wrote:

Doug wrote: For example, Nutch automatically translates such clauses into QueryFilters.

Thanks for the excellent pointer, Doug! I'll definitely be implementing this optimization.
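[Editor's note] A sketch of the Nutch-style rewrite being credited for the speedup, using the query shape from later in this thread: pull a required clause that restricts but shouldn't influence scoring (here field3:1) out of the BooleanQuery and apply it as a QueryFilter instead. In Lucene 1.4, QueryFilter caches its bit set per IndexReader, so repeated queries sharing the restriction skip the term enumeration entirely. The searcher variable and field values are assumed from the thread's context.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// The scored part of the query.
Query main = new TermQuery(new Term("field2", "188453"));

// The restriction, moved out of the BooleanQuery. QueryFilter computes
// a BitSet once per IndexReader and caches it, so subsequent searches
// with the same filter instance are nearly free.
Filter restrict = new QueryFilter(new TermQuery(new Term("field3", "1")));

Hits hits = searcher.search(main, restrict);
```

The key to the caching win is reusing the same Filter instance across requests; constructing a fresh QueryFilter per query would rebuild the bits each time.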
Re: speeding up queries (MySQL faster)
Oops, CPU usage is *not* 50%, but closer to 98%. This is due to a bug in CPU% reporting on RHEL 3 on multiprocessor systems (I can run multiple threads in while(1) loops, and it will still only show 50% CPU usage for that process). The aggregated (not per-process) statistics shown by top are correct, and they show about 73% user time, 25% system time, and anywhere between 0.5% and 2% idle time. Unfortunately, this means that I won't be getting any performance improvements from using a second IndexSearcher, and I'm stuck at being 3 times slower than MySQL on the same data/queries. I guess the next step is some profiling... move the server out of the servlet container, move the clients in with the server, and then try some hprof work. Does anyone have pointers to Lucene caching and how to tune it?

-Yonik

--- Bernhard Messer [EMAIL PROTECTED] wrote:

Yonik, there is another synchronized block in CSInputStream which could block your second CPU out.
Re: speeding up queries (MySQL faster)
For example, Nutch automatically translates such clauses into QueryFilters.

Thanks for the excellent pointer, Doug! I'll definitely be implementing this optimization. If anyone cares, I did a 1-minute hprof test with the search server in a servlet container. Here are the results (sorry about Yahoo's short line length).

-Yonik

resin.hprof.txt: Exclusive Method Times (CPU) (virtual times)
  27390 (37.5%) java.net.PlainSocketImpl.socketAccept
  14885 (20.4%) org.apache.lucene.index.SegmentTermDocs.skipTo
   6700  (9.2%) org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal
   5810  (8.0%) java.io.UnixFileSystem.list
   4785  (6.5%) org.apache.lucene.store.InputStream.readByte
   3315  (4.5%) java.io.RandomAccessFile.readBytes
   1302  (1.8%) java.net.SocketOutputStream.socketWrite0
   1004  (1.4%) java.io.RandomAccessFile.seek
    546  (0.7%) java.lang.String.intern
    336  (0.5%) com.caucho.vfs.WriteStream.print
    248  (0.3%) org.apache.lucene.search.TermScorer.next
    236  (0.3%) org.apache.lucene.queryParser.QueryParser.jj_scan_token
    232  (0.3%) org.apache.lucene.index.SegmentTermEnum.readTerm
    228  (0.3%) org.apache.lucene.search.ConjunctionScorer.score
    200  (0.3%) org.apache.lucene.queryParser.FastCharStream.refill
    196  (0.3%) org.apache.lucene.store.InputStream.readVInt
    180  (0.2%) java.security.AccessController.doPrivileged
    172  (0.2%) org.apache.lucene.search.ConjunctionScorer.doNext
    152  (0.2%) java.lang.Object.clone
    152  (0.2%) org.apache.lucene.index.SegmentReader.document
    148  (0.2%) java.lang.Throwable.fillInStackTrace
    128  (0.2%) org.apache.lucene.index.SegmentReader.norms
    116  (0.2%) org.apache.lucene.store.InputStream.readString
    112  (0.2%) java.lang.StrictMath.log
    108  (0.1%) java.util.LinkedList.addLast
    100  (0.1%) java.net.SocketInputStream.socketRead0
     88  (0.1%) org.apache.lucene.search.ConjunctionScorer.next
speeding up queries (MySQL faster)
Hi, I'm trying to figure out how to speed up queries to a large index. I'm currently getting 133 req/sec, which isn't bad, but isn't too close to MySQL, which is getting 500 req/sec on the same hardware with the same set of documents.

Setup info/stats:
- 4.3M documents, 12 keyword fields per document, 11 unindexed fields per document
- Lucene index size on disk = 1.3GB
- Hardware: dual Opteron w/ 16GB memory, running a 64-bit JVM (Sun 1.5 beta)
- Lucene version 1.4.1
- Hitting a multithreaded server w/ 10 clients at once
- This is a read-only index... no updating is done
- Single IndexSearcher that is reused for all requests

Q1) While hitting it with multiple queries at once, Lucene is pegged at 50% CPU usage (meaning it is only using 1 out of 2 CPUs on average). I took a thread dump, and all of the Lucene threads except one are blocked on reading a file (see trace below). I could create two index readers, but that seems like it might be a waste, and fixing a symptom instead of the root problem. Would multiple IndexSearchers or IndexReaders share internal caches? Is there a way to cache more info at a higher level such that it would get rid of this bottleneck? The JVM isn't taking up much space (125MB or so), and I have 16GB to work with! The OS (Linux) is obviously caching the index file, but that doesn't get rid of the synchronization issues and the overhead of re-reading. How is caching in Lucene configured? Does it internally use FieldCache, or do I have to use that somehow myself?
  tcpConnection-8080-72 daemon prio=1 tid=0x002b24412490 nid=0x34a4 waiting for monitor entry [0x45aba000..0x45abb2d0]
    at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:215)
    - waiting to lock 0x002ae153fa00 (a org.apache.lucene.store.FSInputStream)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:176)
    at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:88)
    at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:53)
    at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
    at org.apache.lucene.search.Scorer.score(Scorer.java:37)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.init(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)

Even using only 1 CPU, though, MySQL is faster. Here is what the queries look like:

  field1:4 AND field2:188453 AND field3:1

- field1:4 done alone selects around 4.2M records
- field2:188453 done alone selects around 1.6M records
- field3:1 done alone selects around 1K records
- The whole query normally selects less than 50 records

Only the first 10 are returned (or whatever range the client selects). The fields are all keywords checked for exact matches (no fulltext search is done). Is there anything I can do to speed these queries up, or is the structure just more suited to MySQL (and not an inverted index)? How is a query like this carried out? Any help would be greatly appreciated. There's not a lot of info on searching (much more on updating). I'm looking forward to Lucene in Action! Too bad it's not out till October.
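[Editor's note] For reference, the query above built programmatically with the Lucene 1.4 BooleanQuery API, where each clause is added with (required=true, prohibited=false) to get AND semantics. The searcher variable is assumed; this is equivalent to what QueryParser produces for "field1:4 AND field2:188453 AND field3:1".

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.TermQuery;

// field1:4 AND field2:188453 AND field3:1
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("field1", "4")),      true, false);
q.add(new TermQuery(new Term("field2", "188453")), true, false);
q.add(new TermQuery(new Term("field3", "1")),      true, false);

Hits hits = searcher.search(q);
```

Since field1 and field2 each match millions of documents while only the conjunction is small, the ConjunctionScorer has to skipTo() through very long posting lists, which is exactly where the hprof output in the follow-up messages shows the time going.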
-Yonik
Re: speeding up queries (MySQL faster)
--- Otis Gospodnetic [EMAIL PROTECTED] wrote:

The bottleneck seems to be disk IO.

But it's not. Linux is caching the whole file, and there really isn't any disk activity at all. Most of the threads are blocked in InputStream.refill - not waiting for the disk, but waiting for their turn in the synchronized block that reads from the disk (which is why I asked about caching above that level). CPU is a constant 50% on a dual-CPU system (meaning 100% of 1 CPU).

-Yonik