Hello and quick answers:

See the IndexWriter javadoc, in particular mergeFactor, minMergeDocs,
and maxMergeDocs.  These let you control the size of your segments,
how often segments get merged, how many Documents are buffered in
RAM between segment merges, and so on.  You also ask about calling
optimize periodically - no need, Lucene already merges segments for
you once in a while as you add Documents; just optimize once at the
end.  You can also experiment with different JVM args for the various
GC algorithms.
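
For example, something along these lines (same public-field style you
are already using in your code; the numbers are only illustrative -
tune them for your data and RAM budget):

    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
    writer.mergeFactor  = 10;      // segments that accumulate at one level before a merge
    writer.minMergeDocs = 1000;    // Documents buffered in RAM before a segment hits disk
    writer.maxMergeDocs = 100000;  // upper bound on the size of any merged segment
    // ... addDocument() calls from your indexing threads ...
    writer.optimize();             // once, at the very end
    writer.close();

On the JVM side, a bigger heap (e.g. -Xmx512m) plus trying the
different collectors is usually the first thing to experiment with.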

Otis

--- Chris Hostetter <[EMAIL PROTECTED]> wrote:

> 
> I've been running into an interesting situation that I wanted to ask
> about.
> 
> I've been doing some testing by building up indexes with code that
> looks
> like this...
> 
>      IndexWriter writer = null;
>      try {
>          writer = new IndexWriter("index", new StandardAnalyzer(), true);
>          writer.mergeFactor = MERGE_FACTOR;
>          PooledExecutor queue = new PooledExecutor(NUM_UPDATE_THREADS);
>          queue.waitWhenBlocked();
> 
>          for (int min=low; min < high; min += BATCH_SIZE) {
>              int max = min + BATCH_SIZE;
>              if (high < max) {
>                  max = high;
>              }
>              queue.execute(new BatchIndexer(writer, min, max));
>          }
>          end = new Date();
>          System.out.println("Build Time: " + (end.getTime() - start.getTime()) + "ms");
>          start = end;
>          writer.optimize();
>      } finally {
>          if (null != writer) {
>              try { writer.close(); } catch (Exception ignore) { /*NOOP*/; }
>          }
>      }
>      end = new Date();
>      System.out.println("Optimize Time: " + (end.getTime() - start.getTime()) + "ms");
> 
> 
> (where BatchIndexer is a class I have that gets a DB connection,
> slurps all records from my DB between min and max, builds some
> simple Documents out of them, and calls writer.addDocument(doc) on
> each)
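>
> A stripped-down sketch of roughly what that class looks like (the
> real SQL and connection handling are elided, and the field names
> here are just placeholders):
>
>     class BatchIndexer implements Runnable {
>         private final IndexWriter writer;
>         private final int min, max;
>
>         BatchIndexer(IndexWriter writer, int min, int max) {
>             this.writer = writer;
>             this.min = min;
>             this.max = max;
>         }
>
>         public void run() {
>             try {
>                 Connection conn = getConnection();            // DB connection (details elided)
>                 ResultSet rs = fetchRecords(conn, min, max);  // rows with min <= id < max
>                 while (rs.next()) {
>                     Document doc = new Document();
>                     doc.add(Field.Keyword("id", rs.getString("id")));
>                     doc.add(Field.Text("body", rs.getString("body")));
>                     writer.addDocument(doc);                  // all threads share the one writer
>                 }
>             } catch (Exception e) {
>                 e.printStackTrace();
>             }
>         }
>     }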
> 
> This was working fine with small ranges, but then I tried building
> up a nice big index for doing some performance testing.  I left it
> running overnight and when I came back in the morning I discovered
> that after successfully building up the whole index (~112K docs,
> ~1.5GB disk) it crashed with an OutOfMemory exception while trying
> to optimize.
> 
> I then realized I was only running my JVM with a 256m upper limit
> on RAM, and I figured that PooledExecutor was still in scope, and
> maybe it was maintaining some state that was using up a lot of
> space, so I whipped up a quick little app to solve my problem...
> 
>     public static void main(String[] args) throws Exception {
>         IndexWriter writer = null;
>         try {
>             writer = new IndexWriter("index", new StandardAnalyzer(), false);
>             writer.optimize();
>         } finally {
>             if (null != writer) {
>                 try { writer.close(); } catch (Exception ignore) { /*NOOP*/; }
>             }
>         }
>     }
> 
> ...but I was disappointed to discover that even this couldn't run
> with only 256m of RAM.  I bumped it up to 512m and then it managed
> to complete successfully (the final index was only 1.1GB on disk).
> 
> 
> This raises a few questions in my mind:
> 
> 1) Is there a rule of thumb for knowing how much memory it takes to
>    optimize an index?
> 
> 2) Is there a "Best Practice" to follow when building up a large
>    index from scratch in order to reduce the amount of memory needed
>    to optimize once the whole index is built?  (i.e.: would spinning
>    up a thread that called writer.optimize() every N minutes be a
>    good idea?)
> 
> 3) Given an unoptimized index that's already been built (i.e.: in
>    the case where my builder crashed and I wanted to try and
>    optimize it without having to rebuild from scratch), is there any
>    way to get IndexWriter to use less RAM and more disk (trading
>    speed for a smaller memory footprint -- and apparently greater
>    stability, so that the app doesn't crash)?
> 
> 
> I imagine that the answers to #1 and #2 are largely dependent on the
> nature of the data in the index (i.e.: the frequency of terms), but
> I'm wondering if there is a high-level formula that could be used to
> say "based on the nature of your data, you want to take this
> approach to optimizing when you build".

