Re: Running OutOfMemory while optimizing and searching

2004-09-20 Thread John Z
Doug
 
Thank you for confirming this.
 
ZJ

Doug Cutting [EMAIL PROTECTED] wrote:
John Z wrote:
 We have indexes of around 1 million docs and around 25 searchable fields.
 We noticed that without any searches performed on the indexes, on startup, the
 memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file
 is read into memory as per the code. Our .tii files are around 8-10 MB in size and
 our startup memory footprint is around 60-70 MB.

 Then when we start doing our searches, the memory goes up, depending on the fields
 we search on. We are noticing that if we start searching on new fields, the memory
 goes up further.

 Doug,

 Your calculation below of what is taken up by the searcher -- does it take into
 account the .tii file being read into memory, or am I not making any sense?
 
 1 byte * Number of searchable fields in your index * Number of docs in 
 your index
 plus
 1k bytes * number of terms in query
 plus
 1k bytes * number of phrase terms in query

You make perfect sense. The formula above does not include the .tii. 
My mistake: I forgot that. By default, every 128th Term in the index is 
read into memory, to permit random access to terms. These are stored in 
the .tii file, compressed. So it is not surprising that they require 7x 
the size of the .tii file in memory.

Doug


Re: Running OutOfMemory while optimizing and searching

2004-09-17 Thread Doug Cutting
John Z wrote:
We have indexes of around 1 million docs and around 25 searchable fields.
We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory footprint is around 60-70 MB.
 
Then when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory goes up further.
 
Doug, 
 
Your calculation below of what is taken up by the searcher -- does it take into account the .tii file being read into memory, or am I not making any sense?
 
1 byte * Number of searchable fields in your index * Number of docs in 
your index
plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query
You make perfect sense.  The formula above does not include the .tii. 
My mistake: I forgot that.  By default, every 128th Term in the index is 
read into memory, to permit random access to terms.  These are stored in 
the .tii file, compressed.  So it is not surprising that they require 7x 
the size of the .tii file in memory.

Doug
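
Putting the two replies together, a searcher's baseline heap is roughly the in-memory term index (about 7x the .tii size, per the observation above), plus 1 byte of norms per searchable field per document, plus about 1 KB of I/O buffers per query or phrase term. Below is a minimal back-of-the-envelope sketch of that arithmetic; the class name, the 7x factor and the sample numbers are illustrative assumptions taken from this thread, not Lucene constants.

// Rough searcher memory estimate combining Doug's formula with the ~7x .tii
// expansion reported above. Names, the 7x factor and the sample numbers are
// illustrative assumptions, not Lucene constants.
public class SearcherMemoryEstimate {

    public static long estimateBytes(long tiiFileSizeBytes,
                                     int searchableFields,
                                     int numDocs,
                                     int queryTerms,
                                     int phraseTerms) {
        long termIndex = tiiFileSizeBytes * 7;             // in-memory .tii terms, ~7x on-disk size
        long norms = (long) searchableFields * numDocs;    // 1 byte per searchable field per doc
        long buffers = 1024L * (queryTerms + phraseTerms); // ~1 KB of I/O buffer per term
        return termIndex + norms + buffers;
    }

    public static void main(String[] args) {
        // The case from this thread: ~9 MB .tii, 25 searchable fields, 1M docs, a 5-term query.
        long bytes = estimateBytes(9L * 1024 * 1024, 25, 1000000, 5, 0);
        System.out.println("~" + (bytes / (1024 * 1024)) + " MB before caches and other overhead");
    }
}

For the index described above this works out to roughly 63 MB of term index plus 25 MB of norms; the norms portion presumably only appears as fields are actually searched, which would match the observation that memory grows when new fields are queried.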


Re: Running OutOfMemory while optimizing and searching

2004-09-16 Thread John Z
Hi
 
We are trying to get the memory footprint on our searchers.
 
We have indexes of around 1 million docs and around 25 searchable fields.
We noticed that without any searches performed on the indexes, on startup, the memory 
taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read 
into memory as per the code. Our .tii files are around 8-10 MB in size and our startup 
memory footprint is around 60-70 MB.
 
Then when we start doing our searches, the memory goes up, depending on the fields we 
search on. We are noticing that if we start searching on new fields, the memory
goes up further.
 
Doug, 
 
Your calculation below of what is taken up by the searcher -- does it take into account
the .tii file being read into memory, or am I not making any sense?
 
1 byte * Number of searchable fields in your index * Number of docs in 
your index
plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query


Thank you
ZJ

Doug Cutting [EMAIL PROTECTED] wrote:
 What do your queries look like? The memory required
 for a query can be computed by the following equation:

 1 Byte * Number of fields in your query * Number of
 docs in your index

 So if your query searches on all 50 fields of your 3.5
 Million document index then each search would take
 about 175MB. If your 3-4 searches run concurrently
 then that's about 525MB to 700MB chewed up at once.

That's not quite right. If you use the same IndexSearcher (or 
IndexReader) for all of the searches, then only 175MB are used. The 
arrays in question (the norms) are read-only and can be shared by all 
searches.

In general, the amount of memory required is:

1 byte * Number of searchable fields in your index * Number of docs in 
your index

plus

1k bytes * number of terms in query

plus

1k bytes * number of phrase terms in query

The latter are for i/o buffers. There are a few other things, but these 
are the major ones.

Doug



Re: Running OutOfMemory while optimizing and searching

2004-07-06 Thread Otis Gospodnetic
Note that 'force' is really just 'suggest' -- System.gc() is only a hint.  Regardless,
I have seen apps running under a 1.3.1 JVM where this worked.

Otis

--- David Spencer [EMAIL PROTECTED] wrote:
 This in theory should not help, but anyway, just in case, the idea is to
 call gc() periodically to force gc - this is the code I use which
 tries to force it...
 
 
 public static long gc()
   {
   long bef = mem();
   System.gc();
   sleep( 100);
   System.runFinalization();
   sleep( 100);
   System.gc();
   long aft= mem();
   return aft-bef;
   }
 
 Mark Florence wrote:
 
  Thanks, Jim. I'm pretty sure I'm throwing OOM for real,
  and not because I've run out of file handles. I can easily
  recreate the latter condition, and it is always reported
  accurately. I've also monitored the OOM as it occurs using
  top and I can see memory usage climbing until it is
  exhausted -- if you will excuse the pun!
  
  I'm not familiar with the new compound file format. Where
  can I look to find more information?
  
  -- Mark
  
  -Original Message-
  From: James Dunn [mailto:[EMAIL PROTECTED]
  Sent: Friday, July 02, 2004 01:29 pm
  To: Lucene Users List
  Subject: Re: Running OutOfMemory while optimizing and searching
  
  
  Ah yes, I don't think I made that clear enough.  From
  Mark's original post, I believe he mentioned that he
  used separate readers for each simultaneous query.
  
  His other issue was that he was getting an OOM during
  an optimize, even when he set the JVM heap to 2GB.  He
  said his index was about 10.5GB spread over ~7000
  files on Linux.  
  
  My guess is that the OOM might actually be a "too many
  open files" error.  I have seen that type of error
  being reported by the JVM as an OutOfMemory error on
  Linux before.  I had the same problem but once I
  switched to the new Lucene compound file format, I
  haven't had that problem since.  
  
  Mark, have you tried switching to the compound file
  format?  
  
  Jim
  
  
  
  
  --- Doug Cutting [EMAIL PROTECTED] wrote:
  
   What do your queries look like?  The memory required
   for a query can be computed by the following equation:

   1 Byte * Number of fields in your query * Number of
   docs in your index

   So if your query searches on all 50 fields of your 3.5
   Million document index then each search would take
   about 175MB.  If your 3-4 searches run concurrently
   then that's about 525MB to 700MB chewed up at once.

  That's not quite right.  If you use the same IndexSearcher (or
  IndexReader) for all of the searches, then only 175MB are used.  The
  arrays in question (the norms) are read-only and can be shared by all
  searches.

  In general, the amount of memory required is:

  1 byte * Number of searchable fields in your index * Number of docs in
  your index

  plus

  1k bytes * number of terms in query

  plus

  1k bytes * number of phrase terms in query

  The latter are for i/o buffers.  There are a few other things, but these
  are the major ones.

  Doug
 
 
 
  
 



RE: Running OutOfMemory while optimizing and searching

2004-07-02 Thread Mark Florence
Thanks, Jim. I'm pretty sure I'm throwing OOM for real,
and not because I've run out of file handles. I can easily
recreate the latter condition, and it is always reported
accurately. I've also monitored the OOM as it occurs using
top and I can see memory usage climbing until it is
exhausted -- if you will excuse the pun!

I'm not familiar with the new compound file format. Where
can I look to find more information?

-- Mark

-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching


Ah yes, I don't think I made that clear enough.  From
Mark's original post, I believe he mentioned that he
used separate readers for each simultaneous query.

His other issue was that he was getting an OOM during
an optimize, even when he set the JVM heap to 2GB.  He
said his index was about 10.5GB spread over ~7000
files on Linux.  

My guess is that the OOM might actually be a "too many
open files" error.  I have seen that type of error
being reported by the JVM as an OutOfMemory error on
Linux before.  I had the same problem but once I
switched to the new Lucene compound file format, I
haven't had that problem since.  

Mark, have you tried switching to the compound file
format?  

Jim
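
For reference, switching an existing index to the compound format is a one-line change on the writer in the Lucene 1.4-era API. The sketch below is a hedged illustration, not code from this thread: the index path is a placeholder, and new segments are only rewritten as .cfs files as they are merged or the index is optimized.

// Hedged sketch: enabling Lucene's compound file format (Lucene 1.4-era API).
// The index path is a placeholder; error handling is omitted.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFileSwitch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(),
                                             false /* open existing index */);
        writer.setUseCompoundFile(true);  // new segments are written as single .cfs files
        writer.optimize();                // rewrites the whole index, collapsing the file count
        writer.close();
    }
}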




--- Doug Cutting [EMAIL PROTECTED] wrote:
   What do your queries look like?  The memory required
   for a query can be computed by the following equation:

   1 Byte * Number of fields in your query * Number of
   docs in your index

   So if your query searches on all 50 fields of your 3.5
   Million document index then each search would take
   about 175MB.  If your 3-4 searches run concurrently
   then that's about 525MB to 700MB chewed up at once.

  That's not quite right.  If you use the same IndexSearcher (or
  IndexReader) for all of the searches, then only 175MB are used.  The
  arrays in question (the norms) are read-only and can be shared by all
  searches.

  In general, the amount of memory required is:

  1 byte * Number of searchable fields in your index * Number of docs in
  your index

  plus

  1k bytes * number of terms in query

  plus

  1k bytes * number of phrase terms in query

  The latter are for i/o buffers.  There are a few other things, but these
  are the major ones.

  Doug
 
 




Re: Running OutOfMemory while optimizing and searching

2004-07-02 Thread David Spencer
This in theory should not help, but anyway, just in case, the idea is to 
call gc() periodically to force gc - this is the code I use which 
tries to force it...

public static long gc()
{
    long bef = mem();
    System.gc();
    sleep(100);
    System.runFinalization();
    sleep(100);
    System.gc();
    long aft = mem();
    return aft - bef;
}
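
The snippet relies on mem() and sleep() helpers that aren't shown. The two methods below are a minimal guess at what they might look like (assumptions, not David's actual code), meant to sit in the same class as gc(): mem() reports used heap and sleep() swallows interrupts.

// Guessed companions for the gc() method above; not David's actual helpers.
private static long mem() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();  // heap bytes currently in use
}

private static void sleep(long millis) {
    try {
        Thread.sleep(millis);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();     // preserve the interrupt flag
    }
}
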
Mark Florence wrote:
Thanks, Jim. I'm pretty sure I'm throwing OOM for real,
and not because I've run out of file handles. I can easily
recreate the latter condition, and it is always reported
accurately. I've also monitored the OOM as it occurs using
top and I can see memory usage climbing until it is
exhausted -- if you will excuse the pun!
I'm not familiar with the new compound file format. Where
can I look to find more information?
-- Mark
-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching
Ah yes, I don't think I made that clear enough.  From
Mark's original post, I believe he mentioned that he
used separate readers for each simultaneous query.
His other issue was that he was getting an OOM during
an optimize, even when he set the JVM heap to 2GB.  He
said his index was about 10.5GB spread over ~7000
files on Linux.  

My guess is that the OOM might actually be a "too many
open files" error.  I have seen that type of error
being reported by the JVM as an OutOfMemory error on
Linux before.  I had the same problem but once I
switched to the new Lucene compound file format, I
haven't had that problem since.  

Mark, have you tried switching to the compound file
format?  

Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:
  What do your queries look like?  The memory required
  for a query can be computed by the following equation:

  1 Byte * Number of fields in your query * Number of
  docs in your index

  So if your query searches on all 50 fields of your 3.5
  Million document index then each search would take
  about 175MB.  If your 3-4 searches run concurrently
  then that's about 525MB to 700MB chewed up at once.

 That's not quite right.  If you use the same IndexSearcher (or
 IndexReader) for all of the searches, then only 175MB are used.  The
 arrays in question (the norms) are read-only and can be shared by all
 searches.

 In general, the amount of memory required is:

 1 byte * Number of searchable fields in your index * Number of docs in
 your index

 plus

 1k bytes * number of terms in query

 plus

 1k bytes * number of phrase terms in query

 The latter are for i/o buffers.  There are a few other things, but these
 are the major ones.

 Doug



Re: Running OutOfMemory while optimizing and searching

2004-07-01 Thread Doug Cutting
 What do your queries look like?  The memory required
 for a query can be computed by the following equation:

 1 Byte * Number of fields in your query * Number of
 docs in your index

 So if your query searches on all 50 fields of your 3.5
 Million document index then each search would take
 about 175MB.  If your 3-4 searches run concurrently
 then that's about 525MB to 700MB chewed up at once.
That's not quite right.  If you use the same IndexSearcher (or 
IndexReader) for all of the searches, then only 175MB are used.  The 
arrays in question (the norms) are read-only and can be shared by all 
searches.

In general, the amount of memory required is:
1 byte * Number of searchable fields in your index * Number of docs in 
your index

plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query
The latter are for i/o buffers.  There are a few other things, but these 
are the major ones.

Doug
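
In practice this means opening one IndexSearcher and reusing it for every query, rather than opening a reader per request. A minimal sketch under that assumption follows (Lucene 1.4-era API; the class name and index path are illustrative, not from this thread). The same idea applies to a searcher exported over RMI: export one shared instance.

// Minimal sketch of sharing one IndexSearcher across concurrent queries, so the
// norms arrays are allocated once rather than once per reader. Illustrative only.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SharedSearcher {
    private static IndexSearcher searcher;  // one instance for the whole application

    public static synchronized IndexSearcher getSearcher() throws Exception {
        if (searcher == null) {
            searcher = new IndexSearcher("/path/to/index");  // placeholder path
        }
        return searcher;
    }

    public static int countHits(String field, String queryText) throws Exception {
        Query query = QueryParser.parse(queryText, field, new StandardAnalyzer());
        Hits hits = getSearcher().search(query);  // all searches share the same norms
        return hits.length();
    }
}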


RE: Running OutOfMemory while optimizing and searching

2004-06-30 Thread Paul Smith
Wow, I have to say that those sorts of numbers are concerning to me... Now I
know 3.5 million documents is a lot, but still... What would be causing a
query to require and hold that much memory?  I could understand that it
surely would be doing a lot of memory work, but why would it need to hold
onto/grab that much memory for the length of the query?  (this is getting
into the internals a bit, but it's always good to know what's going on under
the hood from a design decision point of view and how I would have to
structure an app to handle this sort of load).
Cheers,
Paul Smith

 1 Byte * Number of fields in your query * Number of
 docs in your index
 
 So if your query searches on all 50 fields of your 3.5
 Million document index then each search would take
 about 175MB.  If your 3-4 searches run concurrently
 then that's about 525MB to 700MB chewed up at once.
 
 Also, if your queries use wildcards, the memory
 requirements could be much greater.





Re: Running OutOfMemory while optimizing and searching

2004-06-28 Thread Otis Gospodnetic
Mark,

Tough situation.  I hate when things like this happen in production :(.
You don't mention what you are using for the various IndexWriter
parameters.  You may be able to get this working by tweaking them (see
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#field_summary).
Hm, now that I think about it, I am not sure if those are considered
during index optimization.  I'll try checking the sources later.

Otis
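
The parameters Otis points at are, as far as I recall, plain public fields on the 1.4-era IndexWriter. The sketch below is a hedged illustration: the values shown are the defaults, the index path is a placeholder, and whether they are honored during optimize() is exactly the open question above.

// Hedged sketch of tweaking IndexWriter parameters (Lucene 1.4-era public fields).
// Values shown are illustrative; the path is a placeholder.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriter {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(),
                                             false /* open existing index */);
        writer.mergeFactor = 10;       // segments merged at a time; lower = fewer open files
        writer.minMergeDocs = 10;      // docs buffered in memory before a segment is flushed
        writer.maxMergeDocs = Integer.MAX_VALUE;  // cap on documents per merged segment
        writer.optimize();
        writer.close();
    }
}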

--- Mark Florence [EMAIL PROTECTED] wrote:
 Hi, I'm using Lucene to index ~3.5M documents, over about 50 fields.
 The Lucene
 index itself is ~10.5GB, spread over ~7,000 files. Some of these
 files are
 large -- that is, several PRX files are ~1.5GB.
 
 Lucene runs on a dedicated server (Linux on a 1Ghz Dell, with 1GB
 RAM). Clients
 on other machines use RMI to perform reads / writes. Each night the
 server
 automatically performs an optimize.
 
 The problem is that the optimize now dies with an OutOfMemory
 exception, even
 when the JVM heap size is set to its maximum of 2GB. I need to
 optimize, because
 as the number of Lucene files grows, search performance becomes
 unacceptable.
 
 Search performance is also adversely affected because I've had to
 effectively
 single-thread reads and writes. I was using a simple read / write
 lock
 mechanism, allowing multiple readers to simultaneously search, but
 now more than
 3-4 simultaneous readers will also cause an OutOfMemory condition.
 Searches can
 take as long as 30-40 seconds, and with single-threading, that's
 crippling the
 main client application.
 
 Needless to say, the Lucene index is mission-critical, and must run
 24/7.
 
 I've seen other posts along this same vein, but no definite
 consensus. Is my
 problem simply inadequate hardware? Should I run on a 64-bit
 platform, where I
 can allocate a Java heap of more than 2GB?
 
 Or could there be something fundamentally wrong with my index? I
 should add
 that I've just spent about a week (!!) rebuilding from scratch, over
 all 3.5M
 documents.
 
 -- Many thanks for any help! Mark Florence
 
 
 

