Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Yonik Seeley
 6. Index locally and synchronize changes periodically. This is an
 interesting idea and bears looking into. Lucene can combine multiple
 indexes into a single one, which can be written out somewhere else, and
 then distributed back to the search nodes to replace their existing
 index.

This is a promising idea for handling a high update volume, because
it avoids having every search node repeat the analysis phase.

Unfortunately, the way addIndexes() is implemented looks like it will
present some new problems:

  public synchronized void addIndexes(Directory[] dirs)
  throws IOException {
optimize();   // start with zero or 1 seg
    for (int i = 0; i < dirs.length; i++) {
  SegmentInfos sis = new SegmentInfos();  // read infos from dir
  sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
segmentInfos.addElement(sis.info(j)); // add each info
  }
}
optimize();   // final cleanup
  }

We need to deal with some very large indexes (40GB+), and an optimize
rewrites the entire index, no matter how few documents were added.
Since our strategy calls for deleting some docs from the primary
index before calling addIndexes(), this means *both* calls to
optimize() will end up rewriting the entire index!

The ideal behavior would be that of addDocument(): segments are only
merged occasionally.  That said, I'll throw out a replacement
implementation that probably doesn't work, but that will hopefully
spur someone with more knowledge of Lucene internals to take a look
at this.

  public synchronized void addIndexes(Directory[] dirs)
  throws IOException {
// REMOVED: optimize();
    for (int i = 0; i < dirs.length; i++) {
  SegmentInfos sis = new SegmentInfos();  // read infos from dir
  sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
segmentInfos.addElement(sis.info(j)); // add each info
  }
}
maybeMergeSegments();   // replaces optimize
  }
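
For concreteness, here is the usage pattern this is meant to enable.
It's just a sketch (the paths, analyzer, and field are illustrative,
not from our actual setup):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeBatch {
    public static void main(String[] args) throws Exception {
      // Build a small batch index locally; only the batch gets analyzed.
      Directory batchDir = FSDirectory.getDirectory("/tmp/batch", true);
      IndexWriter batchWriter =
        new IndexWriter(batchDir, new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(Field.Keyword("id", "12345"));
      batchWriter.addDocument(doc);
      batchWriter.close();

      // Merge the batch into the big index.  With the patched
      // addIndexes() above, this should only trigger an occasional
      // segment merge rather than two full rewrites of a 40GB+ index.
      Directory mainDir = FSDirectory.getDirectory("/index/main", false);
      IndexWriter mainWriter =
        new IndexWriter(mainDir, new StandardAnalyzer(), false);
      mainWriter.addIndexes(new Directory[] { batchDir });
      mainWriter.close();
    }
  }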

-Yonik


IndexWriter.addIndexes efficiency

2004-11-28 Thread Yonik Seeley
I'd like to use addIndexes(Directory[] dirs) to add
batches of documents to a main index.

My main problem is that the addIndexes()
implementation calls optimize() at the beginning and
the end.

Now, my main index will be ~25GB in size, so adding a
single document and then doing an optimize will mean
rewriting 25GB of files, right?  That sounds like it
is going to be too expensive to do often.

What I would really like is to be able to control more
explicitly when an optimize happens.  Could
addIndexes() be easily rewritten to just call
maybeMergeSegments()?
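
In the meantime, the only merge control I can find is the set of
public fields on IndexWriter.  Here's a sketch with illustrative
values; note that these only affect the maybeMergeSegments() merging
done by addDocument(), and have no effect on the forced optimize()
calls inside addIndexes():

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class MergeKnobs {
    public static void main(String[] args) throws Exception {
      IndexWriter writer =
        new IndexWriter("/index/main", new StandardAnalyzer(), false);
      writer.mergeFactor = 10;       // how many segments merge at once
      writer.minMergeDocs = 100;     // docs buffered before a new segment
      writer.maxMergeDocs = 1000000; // cap on the size of merged segments
      writer.close();
    }
  }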

-Yonik





Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Yonik Seeley
I think it depends on the query.  If the query (q1)
covers a large number of documents and the filter
covers a very small number, then using a RangeFilter
will probably be slower than a RangeQuery.
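
To make that concrete, here is a sketch of the two shapes (q1, t1,
and t2 as in the quoted question below; the comments are my
reasoning, not benchmarks):

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.*;

  public class RangeTradeoff {
    static void demo(Searcher s, Query q1, Term t1, Term t2)
        throws IOException {
      // Range as a required query clause: only documents that fall
      // in the range are candidates, so a broad q1 restricted by a
      // narrow range stays cheap.
      BooleanQuery bq = new BooleanQuery();
      bq.add(q1, true, false);
      bq.add(new RangeQuery(t1, t2, true), true, false);
      Hits asQuery = s.search(bq);

      // Range as a filter: every q1 hit is scored and then checked
      // against the filter's BitSet, so a q1 that covers a large
      // number of documents pays its full cost even when the range
      // matches almost nothing.
      Hits asFilter = s.search(q1,
          new RangeFilter(t1.field(), t1.text(), t2.text(), true, true));
    }
  }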

-Yonik


 See, this is what I'm not getting: what is the advantage of the
 second world? :) ... in what situations would using...
 
   s.search(q1, new QueryFilter(new RangeQuery(t1, t2, true)));
 
 ...be a better choice than...
 
   s.search(q1, new RangeFilter(t1.field(), t1.text(), t2.text(), true, true));






Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Yonik Seeley
Hmmm, scratch that.  I explained the tradeoff of a filter vs. a range
query, not the tradeoff between the different types of filters you
asked about.
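
For what it's worth, the main difference I'd look at between those
two filter approaches is caching.  A sketch, using the same
placeholders as the quoted code below:

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.*;

  public class FilterCaching {
    static void demo(Searcher s, Query q1, Term t1, Term t2)
        throws IOException {
      // QueryFilter caches its BitSet per IndexReader, so if you
      // hang on to the instance and reuse it, the range terms are
      // only enumerated once.
      Filter cached = new QueryFilter(new RangeQuery(t1, t2, true));
      Hits first = s.search(q1, cached);   // builds and caches the bits
      Hits second = s.search(q1, cached);  // reuses the cached bits

      // RangeFilter recomputes its BitSet on every search call.
      Filter uncached = new RangeFilter(t1.field(), t1.text(),
          t2.text(), true, true);
      Hits third = s.search(q1, uncached);
    }
  }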

--- Yonik Seeley [EMAIL PROTECTED] wrote:
 I think it depends on the query.  If the query (q1) covers a large
 number of documents and the filter covers a very small number, then
 using a RangeFilter will probably be slower than a RangeQuery.
 
 -Yonik
 
 
  See, this is what I'm not getting: what is the advantage of the
  second world? :) ... in what situations would using...
  
    s.search(q1, new QueryFilter(new RangeQuery(t1, t2, true)));
  
  ...be a better choice than...
  
    s.search(q1, new RangeFilter(t1.field(), t1.text(), t2.text(), true, true));
 
 
 
 
 







Re: version documents

2004-11-18 Thread Yonik Seeley
This won't fully work.  You still need to delete the original from
the Lucene index to avoid it showing up in searches.

Example:
myfile v1:  I want a cat
myfile v2:  I want a dog

If you change cat to dog in myfile, and then do a search for cat, you
will *only* get v1, and hence the sort on version doesn't help.
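
So an update needs a delete-then-add, something like this sketch
(paths and field names are illustrative, and note the two steps are
not atomic):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  public class UpdateDocument {
    public static void main(String[] args) throws Exception {
      // First delete every existing version of the file...
      IndexReader reader = IndexReader.open("/index/main");
      reader.delete(new Term("basefilename", "myfile"));
      reader.close();

      // ...then index the new version.
      IndexWriter writer =
        new IndexWriter("/index/main", new StandardAnalyzer(), false);
      Document doc = new Document();
      doc.add(Field.Keyword("basefilename", "myfile"));
      doc.add(Field.Keyword("version", "2"));
      doc.add(Field.Text("contents", "I want a dog"));
      writer.addDocument(doc);
      writer.close();
    }
  }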

-Yonik


--- Justin Swanhart [EMAIL PROTECTED] wrote:
 Split the filename into basefilename and version
 and make each a keyword.
 
 Sort your query by version descending, and only use
 the first
 basefile you encounter.








Re: Atomicity in Lucene operations

2004-10-18 Thread Yonik Seeley
Hi Nader,
I would greatly appreciate it if you could CC me on
the docs or the code.

Thanks!
Yonik


--- Nader Henein [EMAIL PROTECTED] wrote:

 It's pretty integrated into our system at this point.  I'm working on
 packaging it and cleaning up my documentation, and then I'll make it
 available.  I can give you the documents, and if you still want the
 code I'll slap together a rough copy for you and ship it across.
 
 
 Nader Henein
 
 Roy Shan wrote:
 
 Hello, Nader:
 
 I am very interested in how you implement the
 atomicity. Could you
 send me a copy of your code?
 
 Thanks in advance.
 
 Roy







Re: speeding up queries (MySQL faster)

2004-08-27 Thread Yonik Seeley
FYI, this optimization resulted in a fantastic performance boost!  I
went from 133 queries/sec to 990 queries/sec!  I'm now more limited
by socket overhead, as I get 1700 queries/sec when I stick the
clients right in the same process as the server.

Oddly enough, the performance increased, but the CPU
utilization decreased to around 55% (in both
configurations above).  I'll have to look into that
later, but any additional performance at this point is
pure gravy.
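
For anyone curious, the gist of the optimization is a cached filter
per low-cardinality clause.  A sketch (the cache and field handling
are illustrative, not my actual code):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.*;

  public class ClauseFilters {
    // One QueryFilter per low-cardinality required clause.  QueryFilter
    // caches its BitSet per IndexReader, so each such clause is
    // evaluated once instead of being re-scored on every query.
    static Map filters = new HashMap();

    static synchronized Hits search(Searcher s, Query rest, Term clause)
        throws IOException {
      Filter f = (Filter) filters.get(clause);
      if (f == null) {
        f = new QueryFilter(new TermQuery(clause));
        filters.put(clause, f);
      }
      return s.search(rest, f);
    }
  }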

-Yonik


--- Yonik Seeley [EMAIL PROTECTED] wrote:
 Doug wrote:
  For example, Nutch automatically translates such
  clauses into QueryFilters.
 
 Thanks for the excellent pointer, Doug!  I'll
 definitely be implementing this optimization.







Re: speeding up queries (MySQL faster)

2004-08-22 Thread Yonik Seeley
Oops, CPU usage is *not* 50%, but closer to 98%.
This is due to a bug in CPU% reporting on RHEL 3 on
multiprocessor systems (I can run multiple threads in
while(1) loops, and it will still only show 50% CPU
usage for that process).  The aggregated (not
per-process) statistics shown by top are correct, and
they show about 73% user time, 25% system time, and
anywhere between 0.5% and 2% idle time.

Unfortunately, this means that I won't be getting any
performance improvements from using a second
IndexSearcher, and I'm stuck at being 3 times slower
than MySQL on the same data/queries.

I guess the next step is some profiling... move the
server out of the servlet container and move the
clients in with the server, and then try some hprof
work.

Does anyone have pointers to Lucene caching and how to tune it?

-Yonik 





--- Bernhard Messer [EMAIL PROTECTED]
wrote:
 Yonik,
 
 there is another synchronized block in
 CSInputStream which could block 
 your second cpu out.






Re: speeding up queries (MySQL faster)

2004-08-22 Thread Yonik Seeley

 For example, Nutch automatically translates such
 clauses into QueryFilters.

Thanks for the excellent pointer, Doug!  I'll definitely be
implementing this optimization.

If anyone cares, I did a 1-minute hprof test with the search server
in a servlet container.  Here are the results.

-Yonik

resin.hprof.txt: Exclusive Method Times (CPU) (virtual times)
 27390 (37.5%) java.net.PlainSocketImpl.socketAccept
 14885 (20.4%) org.apache.lucene.index.SegmentTermDocs.skipTo
  6700  (9.2%) org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal
  5810  (8.0%) java.io.UnixFileSystem.list
  4785  (6.5%) org.apache.lucene.store.InputStream.readByte
  3315  (4.5%) java.io.RandomAccessFile.readBytes
  1302  (1.8%) java.net.SocketOutputStream.socketWrite0
  1004  (1.4%) java.io.RandomAccessFile.seek
   546  (0.7%) java.lang.String.intern
   336  (0.5%) com.caucho.vfs.WriteStream.print
   248  (0.3%) org.apache.lucene.search.TermScorer.next
   236  (0.3%) org.apache.lucene.queryParser.QueryParser.jj_scan_token
   232  (0.3%) org.apache.lucene.index.SegmentTermEnum.readTerm
   228  (0.3%) org.apache.lucene.search.ConjunctionScorer.score
   200  (0.3%) org.apache.lucene.queryParser.FastCharStream.refill
   196  (0.3%) org.apache.lucene.store.InputStream.readVInt
   180  (0.2%) java.security.AccessController.doPrivileged
   172  (0.2%) org.apache.lucene.search.ConjunctionScorer.doNext
   152  (0.2%) java.lang.Object.clone
   152  (0.2%) org.apache.lucene.index.SegmentReader.document
   148  (0.2%) java.lang.Throwable.fillInStackTrace
   128  (0.2%) org.apache.lucene.index.SegmentReader.norms
   116  (0.2%) org.apache.lucene.store.InputStream.readString
   112  (0.2%) java.lang.StrictMath.log
   108  (0.1%) java.util.LinkedList.addLast
   100  (0.1%) java.net.SocketInputStream.socketRead0
    88  (0.1%) org.apache.lucene.search.ConjunctionScorer.next








speeding up queries (MySQL faster)

2004-08-20 Thread Yonik Seeley
Hi,

I'm trying to figure out how to speed up queries to a large index.
I'm currently getting 133 req/sec, which isn't bad, but isn't close
to MySQL, which is getting 500 req/sec on the same hardware with the
same set of documents.

Setup info & stats:
- 4.3M documents, 12 keyword fields per document, 11 unindexed
  fields per document
- Lucene index size on disk = 1.3G
- Hardware: dual Opteron w/ 16GB memory, running a 64-bit JVM
  (Sun 1.5 beta)
- Lucene version 1.4.1
- Hitting a multithreaded server w/ 10 clients at once
- This is a read-only index... no updating is done
- Single IndexSearcher that is reused for all requests
 

Q1) While hitting it with multiple queries at once, Lucene is pegged
at 50% CPU usage (meaning it is only using 1 out of 2 CPUs on
average).  I took a thread dump, and all of the Lucene threads except
one are blocked on reading a file (see trace below).  I could create
two index readers, but that seems like it might be a waste, and like
fixing a symptom instead of the root problem.  Would multiple
IndexSearchers or IndexReaders share internal caches?  Is there a way
to cache more info at a higher level such that it would get rid of
this bottleneck?  The JVM isn't taking up much space (125M or so),
and I have 16GB to work with!  The OS (Linux) is obviously caching
the index file, but that doesn't get rid of the synchronization
issues or the overhead of re-reading.  How is caching in Lucene
configured?  Does it internally use FieldCache, or do I have to use
that somehow myself?
 
tcpConnection-8080-72 daemon prio=1 tid=0x002b24412490 nid=0x34a4
waiting for monitor entry [0x45aba000..0x45abb2d0]
  at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:215)
  - waiting to lock 0x002ae153fa00 (a org.apache.lucene.store.FSInputStream)
  at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
  at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
  at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
  at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:176)
  at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:88)
  at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:53)
  at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
  at org.apache.lucene.search.Scorer.score(Scorer.java:37)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
  at org.apache.lucene.search.Hits.<init>(Hits.java:43)
  at org.apache.lucene.search.Searcher.search(Searcher.java:33)
  at org.apache.lucene.search.Searcher.search(Searcher.java:27)


Even using only 1 CPU, though, MySQL is faster.  Here is what the
queries look like:

field1:4 AND field2:188453 AND field3:1

field1:4 done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1 done alone selects around 1K records
The whole query normally selects fewer than 50 records.
Only the first 10 are returned (or whatever range the client selects).

The fields are all keywords checked for exact matches (no fulltext
search is done).  Is there anything I can do to speed these queries
up, or is the structure just better suited to MySQL (and not an
inverted index)?
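
For reference, the query above is just three required term clauses;
built programmatically instead of through the query parser, it would
look like this sketch:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  public class ExactMatchQuery {
    public static BooleanQuery build() {
      // add(query, required, prohibited): all three clauses required.
      BooleanQuery bq = new BooleanQuery();
      bq.add(new TermQuery(new Term("field1", "4")), true, false);
      bq.add(new TermQuery(new Term("field2", "188453")), true, false);
      bq.add(new TermQuery(new Term("field3", "1")), true, false);
      return bq;
    }
  }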

How is a query like this carried out?

Any help would be greatly appreciated.  There's not a lot of info on
searching (much more on updating).  I'm looking forward to Lucene in
Action!  Too bad it's not out till October.

-Yonik






Re: speeding up queries (MySQL faster)

2004-08-20 Thread Yonik Seeley

--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 The bottleneck seems to be disk IO.

But it's not.  Linux is caching the whole file, and there really
isn't any disk activity at all.  Most of the threads are blocked on
InputStream.refill: not waiting for the disk, but waiting for their
turn to enter the synchronized block that reads from the disk (which
is why I asked about caching above that level).

CPU is a constant 50% on a dual-CPU system (meaning 100% of 1 CPU).
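
If it comes to that, the two-reader workaround I mentioned would look
roughly like this sketch (whether it actually helps depends on where
the contention really is):

  import java.io.IOException;
  import org.apache.lucene.search.IndexSearcher;

  public class SearcherPool {
    private IndexSearcher[] searchers;
    private int idx = 0;

    // Each IndexSearcher opens its own IndexReader and thus its own
    // underlying file streams, so two threads are no longer
    // serialized on a single FSInputStream lock.
    public SearcherPool(String path, int n) throws IOException {
      searchers = new IndexSearcher[n];
      for (int i = 0; i < n; i++) {
        searchers[i] = new IndexSearcher(path);
      }
    }

    public synchronized IndexSearcher next() {
      IndexSearcher s = searchers[idx];
      idx = (idx + 1) % searchers.length;
      return s;
    }
  }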

-Yonik
