Memory usage

2004-12-05 Thread Sreedhar, Dantam
Hi,

I am using Lucene 1.3 final version.

Let us say there are 10,000 files, each 20 MB in size. So, total file
system size = 10,000 * 20 MB = 200 GB. I want to index these files.

Let us say, the merge factor = 10

Min heap size required by the JVM = 10 * 20 MB = 200 MB

From the http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
article,


---

For instance, if we set mergeFactor to 10, a new segment will be created
on the disk for every 10 documents added to the index. When the 10th
segment of size 10 is added, all 10 will be merged into a single segment
of size 100. When 10 such segments of size 100 have been added, they
will be merged into a single segment containing 1000 documents, and so
on. Therefore, at any time, there will be no more than 9 segments in
each power of 10 index size. 


-

If Lucene is indexing the 1000th document, then the current segment size
would be 100. At that point, how many documents would Lucene hold in
memory (10 documents or 100 documents)? If Lucene holds 100 documents,
then the minimum heap required would be 100 * 20 MB = 2 GB, which seems
unlikely.

Is the optimize process memory intensive? How much memory would Lucene
take while doing the optimize?

Is it safe to assume that the maximum heap size required by Lucene is
mergeFactor * maximum_file_size?

I am planning to use the default maxMergeDocs, as stated in the
article.

Any help on the above questions is highly appreciated.

Thanks,
-Sreedhar










Re: IndexWriter.optimize and memory usage

2004-12-03 Thread Paul Elschot
On Friday 03 December 2004 08:43, Paul Elschot wrote:
 On Friday 03 December 2004 07:50, Chris Hostetter wrote:
...
  So, if I'm understanding you (and the javadocs) correctly, the real key
  here is maxMergeDocs.  It seems like addDocument will never merge a
  segment until maxMergeDocs have been added? ... meaning that I need a
  value less than the default (Integer.MAX_VALUE) if I want IndexWriter to
  do incremental merges as I go ...
  
  ...except...
  
  ...if that were the case, then what exactly is the meaning of mergeFactor?

<oops correction=minMergeDocs should be replaced by mergeFactor>

 maxMergeDocs controls the sizes of the intermediate segments
 when adding documents.
 With maxMergeDocs at default, adding a document can take as much time as
 (and have the same effect as) optimize.  Eg. with mergeFactor at 10, the
 1000'th added document will create a segment of size 1000.
 With maxMergeDocs at a lower value than 1000, the last merge (of the 10
 segments with 100 docs each) will not be done.
 optimize() uses mergeFactor for its final merges, but it ignores
 maxMergeDocs.

</oops>

Meanwhile these fields have been deprecated in the development
version in favour of set... methods.
Setting minMergeDocs is deprecated and is to be replaced by
setMaxBufferedDocs(). The javadoc for this reads:

Determines the minimal number of documents required before the buffered 
in-memory documents are merged and a new Segment is created. Since Documents 
are merged in a RAMDirectory, a large value gives faster indexing. At the same 
time, mergeFactor limits the number of files open in a FSDirectory.
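
For illustration only, a minimal sketch of what that looks like in the development version Paul mentions; the index path is a placeholder, and the two companion setters shown next to setMaxBufferedDocs() are assumed to exist in that same version as replacements for the old public fields:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BufferedDocsSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        // Buffer this many documents in a RAMDirectory before a new on-disk segment is created
        // (replaces the deprecated minMergeDocs field).
        writer.setMaxBufferedDocs(100);
        writer.setMergeFactor(10);       // assumed replacement for the public mergeFactor field
        writer.setMaxMergeDocs(100000);  // assumed replacement for the public maxMergeDocs field
        // ... addDocument() calls go here ...
        writer.close();
    }
}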

Regards,
Paul Elschot





IndexWriter.optimize and memory usage

2004-12-02 Thread Chris Hostetter

I've been running into an interesting situation that I wanted to ask
about.

I've been doing some testing by building up indexes with code that looks
like this...

 IndexWriter writer = null;
 try {
     writer = new IndexWriter(index, new StandardAnalyzer(), true);
     writer.mergeFactor = MERGE_FACTOR;
     PooledExecutor queue = new PooledExecutor(NUM_UPDATE_THREADS);
     queue.waitWhenBlocked();

     for (int min = low; min < high; min += BATCH_SIZE) {
         int max = min + BATCH_SIZE;
         if (high < max) {
             max = high;
         }
         queue.execute(new BatchIndexer(writer, min, max));
     }
     end = new Date();
     System.out.println("Build Time: " + (end.getTime() - start.getTime()) + "ms");
     start = end;
     writer.optimize();
 } finally {
     if (null != writer) {
         try { writer.close(); } catch (Exception ignore) { /*NOOP*/ }
     }
 }
 end = new Date();
 System.out.println("Optimize Time: " + (end.getTime() - start.getTime()) + "ms");


(where BatchIndexer is a class I have that gets a DB connection, slurps
all records from my DB between min and max, builds some simple Documents
out of them, and calls writer.addDocument(doc) on each)

This was working fine with small ranges, but then I tried building up a
nice big index for doing some performance testing.  I left it running
overnight and when I came back in the morning I discovered that after
successfully building up the whole index (~112K docs, ~1.5GB on disk) it
crashed with an OutOfMemory exception while trying to optimize.

I then realized I was only running my JVM with a 256m upper limit on RAM,
and I figured that PooledExecutor was still in scope, and maybe it was
maintaining some state that was using up a lot of space, so I whipped up a
quick little app to solve my problem...

public static void main(String[] args) throws Exception {
    IndexWriter writer = null;
    try {
        // open the existing index (create == false) and just optimize it
        writer = new IndexWriter(index, new StandardAnalyzer(), false);
        writer.optimize();
    } finally {
        if (null != writer) {
            try { writer.close(); } catch (Exception ignore) { /*NOOP*/ }
        }
    }
}

...but I was disappointed to discover that even this couldn't run with
only 256m of RAM.  I bumped it up to 512m and then it managed to complete
successfully (the final index was only 1.1GB on disk).


This raises a few questions in my mind:

1) Is there a rule of thumb for knowing how much memory it takes to
   optimize an index?

2) Is there a "Best Practice" to follow when building up a large index
   from scratch in order to reduce the amount of memory needed to optimize
   once the whole index is built?  (ie: would spinning up a thread that
   called writer.optimize() every N minutes be a good idea?)

3) Given an unoptimized index that's already been built (ie: in the case
   where my builder crashed and I wanted to try and optimize it without
   having to rebuild from scratch) is there any way to get IndexWriter to
   use less RAM and more disk (trading speed for a smaller form factor --
   and apparently: greater stability so that the app doesn't crash)?


I imagine that the answers to #1 and #2 are largely dependent on the
nature of the data in the index (ie: the frequency of terms) but I'm
wondering if there is a high level formula that could be used to say
"based on the nature of your data, you want to take this approach to
optimizing when you build".



-Hoss





Re: IndexWriter.optimize and memory usage

2004-12-02 Thread Otis Gospodnetic
Hello and quick answers:

See IndexWriter javadoc and in particular mergeFactor, minMergeDocs,
and maxMergeDocs.  This will let you control the size of your segments,
the frequency of segment merges, the amount of buffered Documents in
RAM between segment merges and such.  Also, you ask about calling
optimize periodically - no need, Lucene should already merge segments
once in a while for you.  Optimize at the end.  You can also experiment
with different JVM args for various GC algorithms.
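
For concreteness, a rough sketch of that tuning against the 1.3/1.4-era API, where these are still public fields on IndexWriter (the values and the index path are only illustrative, not recommendations):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.mergeFactor  = 10;      // segments merged at once; higher = faster adds, more open files
        writer.minMergeDocs = 100;     // documents buffered in RAM before a new segment is written
        writer.maxMergeDocs = 100000;  // cap on segment size produced by incremental merges during adds
        // ... addDocument() calls go here ...
        writer.optimize();             // merge down to a single segment once, at the very end
        writer.close();
    }
}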

Otis




Re: Memory usage: IndexSearcher & Sort

2004-09-30 Thread Erik Hatcher
On Sep 29, 2004, at 3:11 PM, Bryan Dotzour wrote:
3.  Certainly some of you on this list are using Lucene in a web-app
environment.  Can anyone list some best practices on managing
reading/writing/searching a Lucene index in that context?
Beyond the advice already given on this thread, since you said you were 
using Tapestry, I keep an IndexSearcher as a transient lazy-init'd 
property of my Global object.  It needs to be transient in case you are 
scaling with distributed servers in a farm, and lazy-init'd so that it is 
instantiated the first time it is needed.

Global, in Tapestry, makes a good place to put index operations.  As 
for searching - a good first try is to re-query for each page of search 
results (if you're implementing paging of results, that is).  It is 
often fast enough.
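
A rough sketch of that transient, lazy-init'd searcher idea with the Tapestry specifics left out (the class name and index path here are invented for illustration):

import java.io.IOException;
import java.io.Serializable;
import org.apache.lucene.search.IndexSearcher;

public class SearchGlobal implements Serializable {
    // transient: not carried along when state is serialized between servers in a farm
    private transient IndexSearcher searcher;

    public synchronized IndexSearcher getSearcher() throws IOException {
        if (searcher == null) {
            // lazily opened the first time it is needed (and again after deserialization)
            searcher = new IndexSearcher("/path/to/index");
        }
        return searcher;
    }
}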

Erik


RE: Memory usage: IndexSearcher & Sort

2004-09-30 Thread Cocula Remi


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 29, 2004 18:28
To: Lucene Users List
Subject: RE: Memory usage: IndexSearcher & Sort



 2.  How does this approach work with multiple, simultaneous users?

IndexSearcher is thread-safe.

You mean one can invoke the search method of a single Searchable in two
different threads at the same time,
don't you?






RE: Memory usage: IndexSearcher & Sort

2004-09-30 Thread Otis Gospodnetic
Correct.  I think there is a FAQ entry at jguru.com that answers this.

Otis

--- Cocula Remi [EMAIL PROTECTED] wrote:
  2.  How does this approach work with multiple, simultaneous users?
 
 IndexSearcher is thread-safe.
 
 You mean one can invoke the search method of a single Searchable in
 two different threads at the same time,
 don't you?
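
As a small illustration of what is being confirmed here, a sketch that calls search() on one shared IndexSearcher from two threads at once (the index path, field name and query text are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SharedSearcherDemo {
    public static void main(String[] args) throws Exception {
        final IndexSearcher searcher = new IndexSearcher("/path/to/index");
        final Query query = QueryParser.parse("test", "contents", new StandardAnalyzer());
        Runnable job = new Runnable() {
            public void run() {
                try {
                    // the same IndexSearcher instance is used concurrently by both threads
                    Hits hits = searcher.search(query);
                    System.out.println(Thread.currentThread().getName() + ": " + hits.length() + " hits");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        new Thread(job).start();
        new Thread(job).start();
    }
}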





Memory usage: IndexSearcher & Sort

2004-09-29 Thread Bryan Dotzour
I have been investigating a serious memory problem in our web app (using
Tapestry, Hibernate, & Lucene) and have reduced it to being the way in which
we are using Lucene to search on things.  Being a webapp, we have focused on
doing our work within a user's request.  So we basically end up opening at
least one new IndexSearcher on each individual page view.  In one particular
case, we were doing this in a loop, eventually opening ~20-~40
IndexSearchers which caused our memory usage to skyrocket.  After viewing
that one page 3 or 4 times we would exhaust the server's memory allocation.
 
Most helpful in this search was the following thread from Bugzilla:
 
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
 
From this thread, it sounds like constantly opening and closing
IndexSearcher objects is a BAD THING, but it is exactly what we are doing
in our app.  
There are a few things that puzzle me and I'd love it if anyone has some
input that might clear up some of these questions.
 
1.  According to the Bugzilla thread, and from my own testing, you can open
lots of IndexSearchers in a loop and do a search WITHOUT SORTING and not
have this memory problem.  Is there an issue with the Sort code?
2.  Can anyone give a brief, technical explanation as to why opening
multiple IndexSearcher objects is bad?
3.  Certainly some of you on this list are using Lucene in a web-app
environment.  Can anyone list some best practices on managing
reading/writing/searching a Lucene index in that context?
 
 
Thank you all
Bryan
---
Some extra information about my Lucene setup:
 
Lucene 1.4.1
We maintain 5 different indexes, all in RAMDirectories.  The indexes aren't
especially big (< 100,000 total objects combined).
  
 


Re: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Damian Gajda
 Most helpful in this search was the following thread from Bugzilla:
  
 http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
  

We had a similar problem in our webapp.

Please look at the bug
http://issues.apache.org/bugzilla/show_bug.cgi?id=31240

My co-worker Rafa has fixed this bug and submitted a patch today.

Have fun ;)
-- 
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukuki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/





RE: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Cocula Remi
My solution is :

I have bound in an RMI registry one RemoteSearchable object for each index.
Thus I do not have to create any IndexSearcher and I can execute queries from any
application.
This has been implemented in the Lucene Server that I have just begun to create.
http://sourceforge.net/projects/luceneserver/

I use it in a web app.

It would be nice if some people could test it (wouldn't you like to?)
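
For context, the usual shape of that setup with Lucene's RemoteSearchable looks roughly like this; it is a generic sketch of the pattern, not code from the Lucene Server project, and the port and bind name are placeholders:

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;

public class BindSearchable {
    public static void main(String[] args) throws Exception {
        Searchable local = new IndexSearcher("/path/to/index");
        RemoteSearchable remote = new RemoteSearchable(local);  // export the searcher as an RMI object
        LocateRegistry.createRegistry(1099);                    // start a registry in this JVM
        Naming.rebind("//localhost:1099/myIndex", remote);
        // A client can then do:
        //   Searchable s = (Searchable) Naming.lookup("//host:1099/myIndex");
        // and run queries without opening any IndexSearcher locally.
    }
}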
  




Re: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Otis Gospodnetic
Hello,

--- Bryan Dotzour [EMAIL PROTECTED] wrote:

 I have been investigating a serious memory problem in our web app
 (using
 Tapestry, Hibernate, & Lucene) and have reduced it to being the way
 in which
 we are using Lucene to search on things.  Being a webapp, we have
 focused on
 doing our work within a user's request.  So we basically end up
 opening at
 least one new IndexSearcher on each individual page view.  In one
 particular
 case, we were doing this in a loop, eventually opening ~20-~40
 IndexSearchers which caused our memory usage to skyrocket.  After
 viewing
 that one page 3 or 4 times we would exhaust the server's memory
 allocation.
  
 Most helpful in this search was the following thread from Bugzilla:
  
 http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
  
 From this thread, it sounds like constantly opening and closing
 IndexSearcher objects is a BAD THING, but it is exactly what we are
 doing
 in our app.  
 There are a few things that puzzle me and I'd love it if anyone has
 some
 input that might clear up some of these questions.
  
 1.  According to the Bugzilla thread, and from my own testing, you
 can open
 lots of IndexSearchers in a loop and do a search WITHOUT SORTING and
 not
 have this memory problem.  Is there an issue with the Sort code?

Yes, there is a memory leak in Sort code.  A kind person from Poland
contributed a patch earlier today.  It's not in CVS yet.

 2.  Can anyone give a brief, technical explanation as to why opening
 multiple IndexSearcher objects is bad?

Very simple.  A Lucene index consists of X number of files that reside
on a disk.  Every time you open a new IndexSearcher, some of these
files need to be read.  If files do not change (no documents
added/removed), why do this repetitive work?  Just do it once.  When
these files are read, some data is stored in memory.  If you read them
multiple times, you will store the same data in memory multiple times.

 3.  Certainly some of you on this list are using Lucene in a web-app
 environment.  Can anyone list some best practices on managing
 reading/writing/searching a Lucene index in that context?

I use something like this for http://www.simpy.com/ and it works well
for me:

private IndexDescriptor getIndexDescriptor(String indexID) throws SearcherException
{
    File indexDir = validateIndex(indexID);
    IndexDescriptor indexDescriptor = getIndexDescriptorFromCache(indexDir);

    try
    {
        // if this is a known index
        if (indexDescriptor != null)
        {
            // if the index has changed since this Searcher was created, make a new Searcher
            long currentVersion = IndexReader.getCurrentVersion(indexDir);
            if (currentVersion > indexDescriptor.lastKnownVersion)
            {
                indexDescriptor.lastKnownVersion = currentVersion;
                indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
            }
        }
        // if this is a new index
        else
        {
            indexDescriptor = new IndexDescriptor();
            indexDescriptor.indexDir = indexDir;
            indexDescriptor.lastKnownVersion = IndexReader.getCurrentVersion(indexDir);
            indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
        }
        return cacheIndexDescriptor(indexDescriptor);
    }
    catch (IOException e)
    {
        throw new SearcherException("Cannot open index: " + indexDir, e);
    }
}

IndexDescriptor is a simple 'struct' with everything public (not good
practise, you should change that):

final class IndexDescriptor
{
    public File indexDir;
    public long lastKnownVersion;
    public Searcher searcher;

    public String toString()
    {
        return IndexDescriptor.class.getName() + ": index directory: "
            + indexDir.getAbsolutePath()
            + ", last known version: " + lastKnownVersion
            + ", searcher: " + searcher;
    }
}

These two things combined allow me to re-open an IndexSearcher when the
index changes, and re-use the same IndexSearcher while the index
remains unmodified.  Of course, that LuceneUserSearcher could be
Lucene's IndexSearcher, probably.
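
Reduced to plain Lucene classes, the same reuse-until-the-index-changes idea can be sketched as below; the class and field names are invented, and only IndexSearcher and IndexReader.getCurrentVersion() are Lucene API:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class CachedSearcher {
    private final String indexDir;
    private long lastKnownVersion = -1;
    private IndexSearcher searcher;

    public CachedSearcher(String indexDir) {
        this.indexDir = indexDir;
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        long currentVersion = IndexReader.getCurrentVersion(indexDir);
        if (searcher == null || currentVersion > lastKnownVersion) {
            // first use, or the index has changed on disk: open a fresh searcher
            // (closing the previous one is left out here; see the follow-up about when to close)
            searcher = new IndexSearcher(indexDir);
            lastKnownVersion = currentVersion;
        }
        return searcher;  // unmodified index: reuse the already-open searcher
    }
}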

Otis
http://www.simpy.com/ -- Index, Search and Share your bookmarks





RE: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Bryan Dotzour
Thanks very much for the reply Otis.  Your code snippet is pretty
interesting and made me think about a few questions. 

1.  Do you just have one IndexReader for a given index?  It looks like you
are handing out a new IndexSearcher when the IndexReader has been modified.

2.  How does this approach work with multiple, simultaneous users?
3.  When does the reader need to get closed?

Thanks again.  
Bryan


RE: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Otis Gospodnetic
Hello Bryan,

--- Bryan Dotzour [EMAIL PROTECTED] wrote:

 Thanks very much for the reply Otis.  Your code snippet is pretty
 interesting and made me think about a few questions. 
 
 1.  Do you just have one IndexReader for a given index?  It looks
 like you
 are handing out a new IndexSearcher when the IndexReader has been
 modified.

1 index for each index.  Simpy has a LOT of Lucene indices.

 2.  How does this approach work with multiple, simultaneous users?

IndexSearcher is thread-safe.

 3.  When does the reader need to get closed?

Just leave it open.  You can close it when you are sure you no longer
need it, if you can determine that in your application.

Otis


Re: Memory usage

2004-05-27 Thread Otis Gospodnetic
Sorry if I'm stating the obvious.  Is this happening in some
stand-alone unit tests, or are you running things from some application
and in some environment, like Tomcat, Jetty or in some non-web app?

Your queries are pretty big (although I recall some people using even
bigger ones... but it all depends on the hardware they had), but are
you sure running out of memory is due to Lucene, or could it be a leak
in the app from which you are running queries?

Otis





Re: Memory usage

2004-05-27 Thread James Dunn
Otis,

My app does run within Tomcat.  But when I started
getting these OutOfMemoryErrors I wrote a little unit
test to watch the memory usage without Tomcat in the
middle, and I still see the same memory usage.

Thanks,

Jim



Memory usage

2004-05-26 Thread James Dunn
Hello,

I was wondering if anyone has had problems with memory
usage and MultiSearcher.

My index is composed of two sub-indexes that I search
with a MultiSearcher.  The total size of the index is
about 3.7GB with the larger sub-index being 3.6GB and
the smaller being 117MB.

I am using Lucene 1.3 Final with the compound file
format.

Also I search across about 50 fields but I don't use
wildcard or range queries. 

Doing repeated searches in this way seems to
eventually chew up about 500MB of memory which seems
excessive to me.

Does anyone have any ideas where I could look to
reduce the memory my queries consume?

Thanks,

Jim







RE: Memory usage

2004-05-26 Thread wallen
This sounds like a memory leakage situation.  If you are using tomcat I
would suggest you make sure you are on a recent version, as it is known to
have some memory leaks in version 4.  It doesn't make sense that repeated
queries would use more memory that the most demanding query unless objects
are not getting freed from memory.

-Will




RE: Memory usage

2004-05-26 Thread James Dunn
Will,

Thanks for your response.  It may be an object leak. 
I will look into that.

I just ran some more tests and this time I created a
20GB index by repeatedly merging my large index into
itself.

When I ran my test query against that index I got an
OutOfMemoryError on the very first query.  I have my
heap set to 512MB.  Should a query against a 20GB
index require that much memory?  I page through the
results 100 at a time, so I should never have more
than 100 Document objects in memory.  

Any help would be appreciated, thanks!

Jim



Re: Memory usage

2004-05-26 Thread Erik Hatcher
How big are your actual Documents?  Are you caching Hits?  It stores, 
internally, up to 200 documents.

Erik


Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote:
Also I search across about 50 fields but I don't use
wildcard or range queries. 
Lucene uses one byte of RAM per document per searched field, to hold the 
normalization values.  So if you search a 10M document collection with 
50 fields, then you'll end up using 500MB of RAM.

If you're using unanalyzed fields, then an easy workaround to reduce the 
number of fields is to combine many in a single field.  So, instead of, 
e.g., using an "f1" field with value "abc", and an "f2" field with value 
"efg", use a single field named "f" with values "1_abc" and "2_efg".
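
A minimal sketch of that workaround at indexing time, using unanalyzed keyword fields as described; the field and value names simply mirror the example above:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CombinedFieldExample {
    // Instead of two keyword fields f1=abc and f2=efg, store both values,
    // prefixed with the old field name, in one field "f".  A query that was
    // f1:abc becomes f:1_abc, and only one field's norms are held in RAM.
    public static Document makeDoc() {
        Document doc = new Document();
        doc.add(Field.Keyword("f", "1_abc"));
        doc.add(Field.Keyword("f", "2_efg"));
        return doc;
    }
}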

We could optimize this in Lucene.  If no values of an indexed field are 
analyzed, then we could store no norms for the field and hence read none 
into memory.  This wouldn't be too hard to implement...

Doug


Re: Memory usage

2004-05-26 Thread James Dunn
Erik,

Thanks for the response.  

My actual documents are fairly small.  Most docs only
have about 10 fields.  Some of those fields are
stored, however, like the OBJECT_ID, NAME and DESC
fields.  The stored fields are pretty small as well. 
None should be more than 4KB and very few will
approach that limit.

I'm also using the default maxFieldSize value of
1.  

I'm not caching hits, either.

Could it be my query?  I have about 80 total unique
fields in the index although no document has all 80. 
My query ends up looking like this:

+(F1:test F2:test ... F80:test)

From previous mails that doesn't look like an enormous
amount of fields to be searching against.  Is there
some formula for the amount of memory required for a
query based on the number of clauses and terms?

Jim






Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

Thanks!  

I just asked a question regarding how to calculate the
memory requirements for a search.  Does this memory
get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?

Thanks again,

Jim





Re: Memory usage

2004-05-26 Thread Doug Cutting
It is cached by the IndexReader and lives until the index reader is 
garbage collected.  50-70 searchable fields is a *lot*.  How many are 
analyzed text, and how many are simply keywords?

Doug


Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

We only search on analyzed text fields.  There are a
couple of additional fields in the index like
OBJECT_ID that are keywords but we don't search
against those, we only use them once we get a result
back to find the thing that document represents.

Thanks,

Jim




RE: Memory Usage?

2001-11-12 Thread Doug Cutting

This was a single query?  How many terms, and of what type are in the query?
From the trace it looks like there could be over 40,000 terms in the query!
Is this a prefix or wildcard query?  These can generate *very* large
queries...

Doug


 -Original Message-
 From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
 Sent: Sunday, November 11, 2001 6:59 AM
 To: Lucene Users List
 Subject: RE: Memory Usage?
 
 
 I am not very familiar with the output of -Xrunhprof, but I've attached the
 output of a run of a search through an index of 50,000 documents. It gave
 me out-of-memory errors until I allocated 100 megabytes of heap-space.
 
 The top 10:
 
 SITES BEGIN (ordered by live bytes) Sun Nov 11 15:50:31 2001
   percent live   alloc'ed  stack class
  rank   self  accum     bytes objs    bytes objs trace name
 1 26.41% 26.41% 12485200 12005 45566560 43814  1783 [B
 2 25.18% 51.59% 11904880 11447 44867680 43142  1796 [B
 3  4.15% 55.74%  1962904 69214 171546352 5510292  1632 [C
 4  3.83% 59.58%  1812096 3432 1812096 3432  1768 [I
 5  3.83% 63.41%  1812096 3432 1812096 3432  1769 [I
 6  3.34% 66.75%  1580688 65862 130618992 5442458  1631 
 java.lang.String
 7  3.19% 69.95%  1509584 44763 1509584 44763   458 [C
 8  3.03% 72.98%  1432416 44763 1432416 44763   459
 org.apache.lucene.index.TermInfo
 9  2.27% 75.25%  1074312 44763 1074312 44763   457 
 java.lang.String
10  2.23% 77.48%  1053792 65862 87079328 5442458  1631
 org.apache.lucene.index.Term
 
 and the top 3 traces were:
 
 TRACE 1783:
  org.apache.lucene.store.InputStream.refill(InputStream.java:165)
  org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
  org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
  org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:101)
 
 TRACE 1796:
  org.apache.lucene.store.InputStream.refill(InputStream.java:165)
  org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
  org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
  org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:100)
 
 TRACE 1632:
  java.lang.String.<init>(String.java:198)
  org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:134)
  org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
  org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:166)
 
 
 I've attached the whole trace as gzipped.txt
 
 regards,
 Anders Nielsen
 
 -Original Message-
 From: Doug Cutting [mailto:[EMAIL PROTECTED]]
 Sent: 10. november 2001 04:35
 To: 'Lucene Users List'
 Subject: RE: Memory Usage?
 
 
 I'm surprised that your memory use is that high.
 
 An IndexReader requires:
   one byte per field per document in index (norms)
   one open file per file in index
   1/128 of the Terms in the index
 a Term has two pointers (8 bytes)
  and a String (4 pointers = 24 bytes, one to 16-bit chars)
 
 A Search requires:
   1 1024 byte buffer per TermQuery
   2 128 int buffers per TermQuery
   2 1024 byte buffers per PhraseQuery term
   1 1024 element bucket array per BooleanQuery
 each bucket has 5 fields, and hence requires ~20 bytes
   1 bit per document in index per DateFilter
 
 A Hits requires:
   up to n+100 ScoreDocs (float+int, 8 bytes)
 where n is the highest Hits.doc(n) accessed
   up to 200 Document objects
 
 I may have forgotten something...
 
 Let's assume that your 1M document index has 2M unique terms, 
 and that you
 only look at the top-100 hits, that your index has three 
 fields, and that
 the typical document has two stored fields, each 20 characters.  Your
 30-term boolean query over a 1M document index should use around the
 following numbers of bytes:
   IndexReader:
 3,000,000 (norms)
 1,000,000 (1/128 of 2M terms, each requiring ~50 bytes)
   during search
50,000 (TermQuery buffers)
20,000 (BooleanQuery buckets)
   100,000 (DateFilter bit vector)
   in Hits
 2,000 (200 ScoreDocs)
30,000 (up to 200 cached Documents)
 
 So searches should run in a 5Mb heap.  Are my assumptions off?
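
Purely as an illustration, a small program that redoes the arithmetic of the breakdown above for the stated assumptions (1M docs, 2M terms, 3 fields, a 30-term query, top-100 hits); the ~50 bytes per cached term and ~150 bytes per cached Document are rough figures implied by the numbers above, not measurements:

public class HeapEstimate {
    public static void main(String[] args) {
        long docs = 1000000L;      // documents in the index
        long terms = 2000000L;     // unique terms in the index
        int fields = 3;            // indexed fields
        int queryTerms = 30;       // terms in the BooleanQuery

        long norms = docs * fields;                              // 1 byte per document per field
        long termIndex = (terms / 128) * 50;                     // ~1/128 of terms cached, ~50 bytes each
        long termQueryBufs = queryTerms * (1024 + 2 * 128 * 4);  // 1 KB buffer + two 128-int buffers per term
        long booleanBuckets = 1024 * 20;                         // 1024 buckets at ~20 bytes each
        long dateFilter = docs / 8;                              // 1 bit per document
        long hits = 200 * 8 + 200 * 150;                         // 200 ScoreDocs + ~200 small cached Documents

        long total = norms + termIndex + termQueryBufs + booleanBuckets + dateFilter + hits;
        System.out.println("estimated search heap: " + total + " bytes (~" + (total / 1000000) + " MB)");
    }
}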
 
 You can also see why it is useful to keep a single 
 IndexReader and use it
 for all queries.  (IndexReader is thread safe.)
 
 You could also 'java -Xrunhprof:heap=sites' to see what's 
 using memory.
 
 Doug
 
 --
 To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]
 
 





RE: Memory Usage?

2001-11-12 Thread Anders Nielsen

hmm, I seem to be getting a different number of hits when I use the files
you sent out.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: 12. november 2001 20:47
To: 'Lucene Users List'
Subject: RE: Memory Usage?


 From: Anders Nielsen [mailto:[EMAIL PROTECTED]]

 this was a big boolean query, with several prefixqueries but
 no wildcard
 queries in the or-branches.

Well it looks like those prefixes are expanding to a lot of terms, a total
of over 40,000!  (A prefix query expands into a BooleanQuery with all the
terms matching the prefix.)

If most of these expansions are low-frequency, then a simple fix should
improve things considerably.  I've attached an optimized version of
TermQuery that will hold less memory per low-frequency term.  In particular,
if a term occurs fewer than 128 times then a 1024 byte InputStream buffer is
freed immediately.

Tell me how this works.  Please send another heap dump.

Longer term, or if lots of the expanded terms occur more than 128 times,
perhaps BooleanScorer should use a different algorithm when there are
thousands of terms.  In this case it might use less memory to construct an
array of score buckets for all documents.  If (query.termCount() * 1024) >
(12 * getMaxDoc()) then this would use less memory.  In your case, with
500,000 documents and a 40,000 term query, it's currently taking 40MB/query,
and could be done in 6MB/query.  This optimization would not be too
difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer.
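
Spelling that comparison out for the numbers in this thread (this is just the two sides of the inequality above as plain arithmetic; termCount and maxDoc are used as values here, not as calls on real Query or IndexReader objects):

public class BucketVsBuffer {
    public static void main(String[] args) {
        long maxDoc = 500000L;     // documents in the index
        long termCount = 40000L;   // terms the prefix queries expanded into

        long perTermBuffers = termCount * 1024L;  // current cost: ~1 KB of buffers per expanded term
        long scoreBuckets = 12L * maxDoc;         // alternative: one ~12-byte score bucket per document

        System.out.println("per-term buffers: " + (perTermBuffers / 1000000) + " MB/query");
        System.out.println("score buckets:    " + (scoreBuckets / 1000000) + " MB/query");
        if (perTermBuffers > scoreBuckets) {
            System.out.println("a per-document bucket array would use less memory for this query");
        }
    }
}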

Doug








RE: Memory Usage?

2001-11-12 Thread Doug Cutting

 From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
 
 hmm, I seem to be getting a different number of hits when I 
 use the files
 you sent out.

Please provide more information!  Is it larger or smaller than before?  By
how much?  What differences show up in the hits?  That's a terrible bug
report...

I think before it may have been possible to get a spurious hit if a query
term only occurred in deleted documents.  A wildcard query with 40,000 terms
might make this sort of thing happen more often, and unless you tried to
access the Hits.doc() for such a hit, you would not see an error.  If this
was in fact a problem, the code I just sent out would have fixed it.  So
your results may in fact be better.  Or there may be a bug in what I sent.
Or both!

For the cases I have tried I get the same results with and without those
changes.

Doug
