Re: Memory usage: IndexSearcher & Sort

2004-10-01 Thread Bernhard Messer
Hi,
the memory leak patch will be included in release 1.4.2. The new release
should be available on Monday (Doug wrote this on the dev-list).

Bernhard
Praveen Peddi wrote:
Hello all,
Is this patch going to be part of the 1.4.2 release? If so, does anyone
know when this release is due? I am currently using 1.4 final and
wanted to migrate to 1.4.1, but after learning that there is a
memory leak in 1.4.1 sorting, I have decided to wait until the next
release.

Praveen
- Original Message - From: "Damian Gajda" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 29, 2004 9:25 AM
Subject: Re: Memory usage: IndexSearcher & Sort

Most helpful in this search was the following thread from Bugzilla:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
We had a similar problem in our webapp.
Please look at the bug
http://issues.apache.org/bugzilla/show_bug.cgi?id=31240
My co-worker Rafał has fixed this bug and submitted a patch today.
Have fun ;)
--
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukułki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Memory usage: IndexSearcher & Sort

2004-10-01 Thread Praveen Peddi
Hello all,
Is this patch going to be part of the 1.4.2 release? If so, does anyone know
when this release is due? I am currently using 1.4 final and wanted to
migrate to 1.4.1, but after learning that there is a memory leak in 1.4.1
sorting, I have decided to wait until the next release.

Praveen
- Original Message - 
From: "Damian Gajda" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 29, 2004 9:25 AM
Subject: Re: Memory usage: IndexSearcher & Sort


Most helpful in this search was the following thread from Bugzilla:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
We had a similar problem in our webapp.
Please look at the bug
http://issues.apache.org/bugzilla/show_bug.cgi?id=31240
My co-worker Rafał has fixed this bug and submitted a patch today.
Have fun ;)
--
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukułki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Memory usage: IndexSearcher & Sort

2004-09-30 Thread Otis Gospodnetic
Correct.  I think there is a FAQ entry at jguru.com that answers this.

Otis

--- Cocula Remi <[EMAIL PROTECTED]> wrote:
> >> 2.  How does this approach work with multiple, simultaneous users?
> 
> >IndexSearcher is thread-safe.
> 
> You mean one can invoke the search method of a single Searchable from
> two different threads at the same time, don't you?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Memory usage: IndexSearcher & Sort

2004-09-30 Thread Cocula Remi


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 29, 2004 18:28
To: Lucene Users List
Subject: RE: Memory usage: IndexSearcher & Sort



>> 2.  How does this approach work with multiple, simultaneous users?

>IndexSearcher is thread-safe.

You mean one can invoke the search method of a single Searchable from two
different threads at the same time, don't you?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Erik Hatcher
On Sep 29, 2004, at 3:11 PM, Bryan Dotzour wrote:
3.  Certainly some of you on this list are using Lucene in a web-app
environment.  Can anyone list some best practices on managing
reading/writing/searching a Lucene index in that context?
Beyond the advice already given on this thread, since you said you were 
using Tapestry: I keep an IndexSearcher as a transient, lazy-init'd 
property of my Global object.  It needs to be transient in case you are 
scaling with distributed servers in a farm, and lazy-init'd so that it is 
instantiated the first time it is used.

Global, in Tapestry, makes a good place to put index operations.  As 
for searching - a good first try is to re-query for each page of search 
results (if you're implementing paging of results, that is).  It is 
often fast enough.
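
For illustration, here is a minimal sketch of such a lazy-init'd, transient searcher property (the class shape, index path, and method name are assumptions for the example, not taken from the original mail):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// Hypothetical Tapestry "Global" object holding a lazily created searcher.
public class Global implements java.io.Serializable {

    // transient: each server in a farm opens its own searcher instead of
    // trying to serialize one across nodes
    private transient IndexSearcher searcher;

    // lazy init: the searcher is opened the first time it is asked for
    public synchronized IndexSearcher getSearcher() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher("/path/to/index");  // placeholder path
        }
        return searcher;
    }
}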

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Otis Gospodnetic
Hello Bryan,

--- Bryan Dotzour <[EMAIL PROTECTED]> wrote:

> Thanks very much for the reply Otis.  Your code snippet is pretty
> interesting and made me think about a few questions. 
> 
> 1.  Do you just have one IndexReader for a given index?  It looks
> like you
> are handing out a new IndexSearcher when the IndexReader has been
> modified.

One IndexSearcher for each index.  Simpy has a LOT of Lucene indices.

> 2.  How does this approach work with multiple, simultaneous users?

IndexSearcher is thread-safe.
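
As a rough illustration of that thread-safety (a sketch, not from the original mail; the index path and field name are placeholders), several request threads can share a single searcher:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SharedSearcherDemo {
    public static void main(String[] args) throws Exception {
        // one searcher, opened once and shared by every thread
        final IndexSearcher searcher = new IndexSearcher("/path/to/index");

        Runnable task = new Runnable() {
            public void run() {
                try {
                    Query q = new TermQuery(new Term("contents", "test"));
                    Hits hits = searcher.search(q);  // safe to call concurrently
                    System.out.println(Thread.currentThread().getName()
                        + ": " + hits.length() + " hits");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        new Thread(task).start();
        new Thread(task).start();
    }
}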

> 3.  When does the reader need to get closed?

Just leave it open.  You can close it when you are sure you no longer
need it, if you can determine that in your application.

Otis

> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, September 29, 2004 8:47 AM
> To: Lucene Users List
> Subject: Re: Memory usage: IndexSearcher & Sort
> 
> 
> Hello,
> 
> --- Bryan Dotzour <[EMAIL PROTECTED]> wrote:
> 
> > I have been investigating a serious memory problem in our web app 
> > (using Tapestry, Hibernate, & Lucene) and have reduced it to being
> the 
> > way in which
> > we are using Lucene to search on things.  Being a webapp, we have
> > focused on
> > doing our work within a user's request.  So we basically end up
> > opening at
> > least one new IndexSearcher on each individual page view.  In one
> > particular
> > case, we were doing this in a loop, eventually opening ~20-~40
> > IndexSearchers which caused our memory usage to skyrocket.  After
> > viewing
> > that one page 3 or 4 times we would exhaust the server's memory
> > allocation.
> >  
> > Most helpful in this search was the following thread from Bugzilla:
> >  
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
> > <http://issues.apache.org/bugzilla/show_bug.cgi?id=30628>
> >  
> > From this thread, it sounds like constantly opening and closing 
> > IndexSearcher objects is a "BAD THING", but it is exactly what we
> are 
> > doing in our app.
> > There are a few things that puzzle me and I'd love it if anyone has
> > some
> > input that might clear up some of these questions.
> >  
> > 1.  According to the Bugzilla thread, and from my own testing, you
> can 
> > open lots of IndexSearchers in a loop and do a search WITHOUT
> SORTING 
> > and not
> > have this memory problem.  Is there an issue with the Sort code?
> 
> Yes, there is a memory leak in Sort code.  A kind person from Poland
> contributed a patch earlier today.  It's not in CVS yet.
> 
> > 2.  Can anyone give a brief, technical explanation as to why
> opening 
> > multiple IndexSearcher objects is bad?
> 
> Very simple.  A Lucene index consists of X number of files that
> reside on a
> disk.  Every time you open a new IndexSearcher, some of these files
> need to
> be read.  If files do not change (no documents added/removed), why do
> this
> repetitive work?  Just do it once.  When these files are read, some
> data is
> stored in memory.  If you read them multiple times, you will store
> the same
> data in memory multiple times.
> 
> > 3.  Certainly some of you on this list are using Lucene in a
> web-app 
> > environment.  Can anyone list some best practices on managing 
> > reading/writing/searching a Lucene index in that context?
> 
> I use something like this for http://www.simpy.com/ and it works well
> for
> me:
> 
> private IndexDescriptor getIndexDescriptor(String indexID)
> throws SearcherException
> {
> File indexDir = validateIndex(indexID);
> IndexDescriptor indexDescriptor =
> getIndexDescriptorFromCache(indexDir);
> 
> try
> {
> // if this is a known index
> if (indexDescriptor != null)
> {
> // if the index has changed since this Searcher was
> created,
> make a new Searcher
> long currentVersion =
> IndexReader.getCurrentVersion(indexDir);
> if (currentVersion >
> indexDescriptor.lastKnownVersion)
> {
> indexDescriptor.lastKnownVersion =
> currentVersion;
> indexDescriptor.searcher = new
> LuceneUserSearcher(indexDir);
> }
> }
> // if this is a new index
> else
> {
> indexDescriptor = new IndexDescriptor();
> indexDescriptor.indexDir = indexDir;
> indexDescriptor.lastKnownVersion =
> IndexReader.getCurrentVersion

RE: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Bryan Dotzour
Thanks very much for the reply Otis.  Your code snippet is pretty
interesting and made me think about a few questions. 

1.  Do you just have one IndexReader for a given index?  It looks like you
are handing out a new IndexSearcher when the IndexReader has been modified.

2.  How does this approach work with multiple, simultaneous users?
3.  When does the reader need to get closed?

Thanks again.  
Bryan

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 29, 2004 8:47 AM
To: Lucene Users List
Subject: Re: Memory usage: IndexSearcher & Sort


Hello,

--- Bryan Dotzour <[EMAIL PROTECTED]> wrote:

> I have been investigating a serious memory problem in our web app 
> (using Tapestry, Hibernate, & Lucene) and have reduced it to being the 
> way in which
> we are using Lucene to search on things.  Being a webapp, we have
> focused on
> doing our work within a user's request.  So we basically end up
> opening at
> least one new IndexSearcher on each individual page view.  In one
> particular
> case, we were doing this in a loop, eventually opening ~20-~40
> IndexSearchers which caused our memory usage to skyrocket.  After
> viewing
> that one page 3 or 4 times we would exhaust the server's memory
> allocation.
>  
> Most helpful in this search was the following thread from Bugzilla:
>  
> http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
> <http://issues.apache.org/bugzilla/show_bug.cgi?id=30628>
>  
> From this thread, it sounds like constantly opening and closing 
> IndexSearcher objects is a "BAD THING", but it is exactly what we are 
> doing in our app.
> There are a few things that puzzle me and I'd love it if anyone has
> some
> input that might clear up some of these questions.
>  
> 1.  According to the Bugzilla thread, and from my own testing, you can 
> open lots of IndexSearchers in a loop and do a search WITHOUT SORTING 
> and not
> have this memory problem.  Is there an issue with the Sort code?

Yes, there is a memory leak in Sort code.  A kind person from Poland
contributed a patch earlier today.  It's not in CVS yet.

> 2.  Can anyone give a brief, technical explanation as to why opening 
> multiple IndexSearcher objects is bad?

Very simple.  A Lucene index consists of X number of files that reside on a
disk.  Every time you open a new IndexSearcher, some of these files need to
be read.  If files do not change (no documents added/removed), why do this
repetitive work?  Just do it once.  When these files are read, some data is
stored in memory.  If you read them multiple times, you will store the same
data in memory multiple times.

> 3.  Certainly some of you on this list are using Lucene in a web-app 
> environment.  Can anyone list some best practices on managing 
> reading/writing/searching a Lucene index in that context?

I use something like this for http://www.simpy.com/ and it works well for
me:

private IndexDescriptor getIndexDescriptor(String indexID)
    throws SearcherException
{
    File indexDir = validateIndex(indexID);
    IndexDescriptor indexDescriptor = getIndexDescriptorFromCache(indexDir);

    try
    {
        // if this is a known index
        if (indexDescriptor != null)
        {
            // if the index has changed since this Searcher was created,
            // make a new Searcher
            long currentVersion = IndexReader.getCurrentVersion(indexDir);
            if (currentVersion > indexDescriptor.lastKnownVersion)
            {
                indexDescriptor.lastKnownVersion = currentVersion;
                indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
            }
        }
        // if this is a new index
        else
        {
            indexDescriptor = new IndexDescriptor();
            indexDescriptor.indexDir = indexDir;
            indexDescriptor.lastKnownVersion = IndexReader.getCurrentVersion(indexDir);
            indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
        }
        return cacheIndexDescriptor(indexDescriptor);
    }
    catch (IOException e)
    {
        throw new SearcherException("Cannot open index: " + indexDir, e);
    }
}

IndexDescriptor is a simple 'struct' with everything public (not good
practise, you should change that):

final class IndexDescriptor
{
    public File indexDir;
    public long lastKnownVersion;
    public Searcher searcher;

    public String toString()
    {
        return IndexDescriptor.class.getName() + ": index directory: "
            + indexDir.getAbsolutePath()
            + ", last known version: " + lastKnownVersion
            + ", searcher: " + searcher;
    }
}

These two things combined allow me to re-open an IndexSearcher when the
index changes, and re-use the same IndexSearcher while the index remains
unmodified.

Re: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Otis Gospodnetic
Hello,

--- Bryan Dotzour <[EMAIL PROTECTED]> wrote:

> I have been investigating a serious memory problem in our web app
> (using
> Tapestry, Hibernate, & Lucene) and have reduced it to being the way
> in which
> we are using Lucene to search on things.  Being a webapp, we have
> focused on
> doing our work within a user's request.  So we basically end up
> opening at
> least one new IndexSearcher on each individual page view.  In one
> particular
> case, we were doing this in a loop, eventually opening ~20-~40
> IndexSearchers which caused our memory usage to skyrocket.  After
> viewing
> that one page 3 or 4 times we would exhaust the server's memory
> allocation.
>  
> Most helpful in this search was the following thread from Bugzilla:
>  
> http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
>  
>  
> From this thread, it sounds like constantly opening and closing
> IndexSearcher objects is a "BAD THING", but it is exactly what we are
> doing
> in our app.  
> There are a few things that puzzle me and I'd love it if anyone has
> some
> input that might clear up some of these questions.
>  
> 1.  According to the Bugzilla thread, and from my own testing, you
> can open
> lots of IndexSearchers in a loop and do a search WITHOUT SORTING and
> not
> have this memory problem.  Is there an issue with the Sort code?

Yes, there is a memory leak in Sort code.  A kind person from Poland
contributed a patch earlier today.  It's not in CVS yet.

> 2.  Can anyone give a brief, technical explanation as to why opening
> multiple IndexSearcher objects is bad?

Very simple.  A Lucene index consists of X number of files that reside
on a disk.  Every time you open a new IndexSearcher, some of these
files need to be read.  If files do not change (no documents
added/removed), why do this repetitive work?  Just do it once.  When
these files are read, some data is stored in memory.  If you read them
multiple times, you will store the same data in memory multiple times.

> 3.  Certainly some of you on this list are using Lucene in a web-app
> environment.  Can anyone list some best practices on managing
> reading/writing/searching a Lucene index in that context?

I use something like this for http://www.simpy.com/ and it works well
for me:

private IndexDescriptor getIndexDescriptor(String indexID)
    throws SearcherException
{
    File indexDir = validateIndex(indexID);
    IndexDescriptor indexDescriptor = getIndexDescriptorFromCache(indexDir);

    try
    {
        // if this is a known index
        if (indexDescriptor != null)
        {
            // if the index has changed since this Searcher was created,
            // make a new Searcher
            long currentVersion = IndexReader.getCurrentVersion(indexDir);
            if (currentVersion > indexDescriptor.lastKnownVersion)
            {
                indexDescriptor.lastKnownVersion = currentVersion;
                indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
            }
        }
        // if this is a new index
        else
        {
            indexDescriptor = new IndexDescriptor();
            indexDescriptor.indexDir = indexDir;
            indexDescriptor.lastKnownVersion = IndexReader.getCurrentVersion(indexDir);
            indexDescriptor.searcher = new LuceneUserSearcher(indexDir);
        }
        return cacheIndexDescriptor(indexDescriptor);
    }
    catch (IOException e)
    {
        throw new SearcherException("Cannot open index: " + indexDir, e);
    }
}

IndexDescriptor is a simple 'struct' with everything public (not good
practise, you should change that):

final class IndexDescriptor
{
    public File indexDir;
    public long lastKnownVersion;
    public Searcher searcher;

    public String toString()
    {
        return IndexDescriptor.class.getName() + ": index directory: "
            + indexDir.getAbsolutePath()
            + ", last known version: " + lastKnownVersion
            + ", searcher: " + searcher;
    }
}

These two things combined allow me to re-open an IndexSearcher when the
index changes, and re-use the same IndexSearcher while the index
remains unmodified.  Of course, that LuceneUserSearcher could be
Lucene's IndexSearcher, probably.

Otis
http://www.simpy.com/ -- Index, Search and Share your bookmarks


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Cocula Remi
My solution is:

I have bound one RemoteSearchable object for each index in an RMI registry.
Thus I do not have to create any IndexSearcher, and I can execute queries from any 
application.
This has been implemented in the Lucene Server project that I have just begun to create:
http://sourceforge.net/projects/luceneserver/

I use it in a web app.

It would be nice if some people could test it (wouldn't you like to?)
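
For readers who want to try the same idea by hand, here is a hedged sketch of binding and using a RemoteSearchable (the index path and registry name are placeholders, and the LuceneServer project itself may well do this differently):

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

public class SearchServer {
    public static void main(String[] args) throws Exception {
        // server side: export one searcher per index over RMI
        LocateRegistry.createRegistry(1099);
        Searchable local = new IndexSearcher("/path/to/index");  // placeholder path
        Naming.rebind("//localhost/MyIndex", new RemoteSearchable(local));

        // client side (normally in another JVM): look it up and search it
        Searchable remote = (Searchable) Naming.lookup("//localhost/MyIndex");
        Searcher searcher = new MultiSearcher(new Searchable[] { remote });
        // searcher.search(...) as usual, without opening any local IndexSearcher
    }
}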
  

-Original Message-
From: Bryan Dotzour [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 29, 2004 15:11
To: '[EMAIL PROTECTED]'
Subject: Memory usage: IndexSearcher & Sort


I have been investigating a serious memory problem in our web app (using
Tapestry, Hibernate, & Lucene) and have reduced it to being the way in which
we are using Lucene to search on things.  Being a webapp, we have focused on
doing our work within a user's request.  So we basically end up opening at
least one new IndexSearcher on each individual page view.  In one particular
case, we were doing this in a loop, eventually opening ~20-~40
IndexSearchers which caused our memory usage to skyrocket.  After viewing
that one page 3 or 4 times we would exhaust the server's memory allocation.
 
Most helpful in this search was the following thread from Bugzilla:
 
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
 
 
From this thread, it sounds like constantly opening and closing
IndexSearcher objects is a "BAD THING", but it is exactly what we are doing
in our app.  
There are a few things that puzzle me and I'd love it if anyone has some
input that might clear up some of these questions.
 
1.  According to the Bugzilla thread, and from my own testing, you can open
lots of IndexSearchers in a loop and do a search WITHOUT SORTING and not
have this memory problem.  Is there an issue with the Sort code?
2.  Can anyone give a brief, technical explanation as to why opening
multiple IndexSearcher objects is bad?
3.  Certainly some of you on this list are using Lucene in a web-app
environment.  Can anyone list some best practices on managing
reading/writing/searching a Lucene index in that context?
 
 
Thank you all
Bryan
---
Some extra information about my Lucene setup:
 
Lucene 1.4.1
We maintain 5 different indexes, all in RAMDirectories.  The indexes aren't
especially big (< 100,000 total objects combined).
  
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage: IndexSearcher & Sort

2004-09-29 Thread Damian Gajda
> Most helpful in this search was the following thread from Bugzilla:
>  
> http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
>  
>  

We had a similar problem in our webapp.

Please look at the bug
http://issues.apache.org/bugzilla/show_bug.cgi?id=31240

My co-worker Rafał has fixed this bug and submitted a patch today.

Have fun ;)
-- 
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukułki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage

2004-05-27 Thread James Dunn
Otis,

My app does run within Tomcat.  But when I started
getting these OutOfMemoryErrors I wrote a little unit
test to watch the memory usage without Tomcat in the
middle, and I still see the same excessive memory usage.

Thanks,

Jim
--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:
> Sorry if I'm stating the obvious.  Is this happening
> in some
> stand-alone unit tests, or are you running things
> from some application
> and in some environment, like Tomcat, Jetty or in
> some non-web app?
> 
> Your queries are pretty big (although I recall some
> people using even
> bigger ones... but it all depends on the hardware
> they had), but are
> you sure running out of memory is due to Lucene, or
> could it be a leak
> in the app from which you are running queries?
> 
> Otis
> 
> 
> --- James Dunn <[EMAIL PROTECTED]> wrote:
> > Doug,
> > 
> > We only search on analyzed text fields.  There are
> a
> > couple of additional fields in the index like
> > OBJECT_ID that are keywords but we don't search
> > against those, we only use them once we get a
> result
> > back to find the thing that document represents.
> > 
> > Thanks,
> > 
> > Jim
> > 
> > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > It is cached by the IndexReader and lives until
> the
> > > index reader is 
> > > garbage collected.  50-70 searchable fields is a
> > > *lot*.  How many are 
> > > analyzed text, and how many are simply keywords?
> > > 
> > > Doug
> > > 
> > > James Dunn wrote:
> > > > Doug,
> > > > 
> > > > Thanks!  
> > > > 
> > > > I just asked a question regarding how to
> calculate
> > > the
> > > > memory requirements for a search.  Does this
> > > memory
> > > > only get used only during the search operation
> > > itself,
> > > > or is it referenced by the Hits object or
> anything
> > > > else after the actual search completes?
> > > > 
> > > > Thanks again,
> > > > 
> > > > Jim
> > > > 
> > > > 
> > > > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > > 
> > > >>James Dunn wrote:
> > > >>
> > > >>>Also I search across about 50 fields but I
> don't
> > > >>
> > > >>use
> > > >>
> > > >>>wildcard or range queries. 
> > > >>
> > > >>Lucene uses one byte of RAM per document per
> > > >>searched field, to hold the 
> > > >>normalization values.  So if you search a 10M
> > > >>document collection with 
> > > >>50 fields, then you'll end up using 500MB of
> RAM.
> > > >>
> > > >>If you're using unanalyzed fields, then an
> easy
> > > >>workaround to reduce the 
> > > >>number of fields is to combine many in a
> single
> > > >>field.  So, instead of, 
> > > >>e.g., using an "f1" field with value "abc",
> and an
> > > >>"f2" field with value 
> > > >>"efg", use a single field named "f" with
> values
> > > >>"1_abc" and "2_efg".
> > > >>
> > > >>We could optimize this in Lucene.  If no
> values of
> > > >>an indexed field are 
> > > >>analyzed, then we could store no norms for the
> > > field
> > > >>and hence read none 
> > > >>into memory.  This wouldn't be too hard to
> > > >>implement...
> > > >>
> > > >>Doug
> > > >>
> > > >>
> > > > 
> > > >
> > >
> >
>
-
> > > > 
> > > >>To unsubscribe, e-mail:
> > > >>[EMAIL PROTECTED]
> > > >>For additional commands, e-mail:
> > > >>[EMAIL PROTECTED]
> > > >>
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > __
> > > > Do you Yahoo!?
> > > > Friends.  Fun.  Try the all-new Yahoo!
> Messenger.
> > > > http://messenger.yahoo.com/ 
> > > > 
> > > >
> > >
> >
>
-
> > > > To unsubscribe, e-mail:
> > > [EMAIL PROTECTED]
> > > > For additional commands, e-mail:
> > > [EMAIL PROTECTED]
> > > > 
> > > 
> > >
> >
>
-
> > > To unsubscribe, e-mail:
> > > [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > > [EMAIL PROTECTED]
> > > 
> > 
> > 
> > 
> > 
> > 
> > __
> > Do you Yahoo!?
> > Friends.  Fun.  Try the all-new Yahoo! Messenger.
> > http://messenger.yahoo.com/ 
> > 
> >
>
-
> > To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 





__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage

2004-05-27 Thread Otis Gospodnetic
Sorry if I'm stating the obvious.  Is this happening in some
stand-alone unit tests, or are you running things from some application
and in some environment, like Tomcat, Jetty or in some non-web app?

Your queries are pretty big (although I recall some people using even
bigger ones... but it all depends on the hardware they had), but are
you sure running out of memory is due to Lucene, or could it be a leak
in the app from which you are running queries?

Otis


--- James Dunn <[EMAIL PROTECTED]> wrote:
> Doug,
> 
> We only search on analyzed text fields.  There are a
> couple of additional fields in the index like
> OBJECT_ID that are keywords but we don't search
> against those, we only use them once we get a result
> back to find the thing that document represents.
> 
> Thanks,
> 
> Jim
> 
> --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > It is cached by the IndexReader and lives until the
> > index reader is 
> > garbage collected.  50-70 searchable fields is a
> > *lot*.  How many are 
> > analyzed text, and how many are simply keywords?
> > 
> > Doug
> > 
> > James Dunn wrote:
> > > Doug,
> > > 
> > > Thanks!  
> > > 
> > > I just asked a question regarding how to calculate
> > the
> > > memory requirements for a search.  Does this
> > memory
> > > only get used only during the search operation
> > itself,
> > > or is it referenced by the Hits object or anything
> > > else after the actual search completes?
> > > 
> > > Thanks again,
> > > 
> > > Jim
> > > 
> > > 
> > > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > 
> > >>James Dunn wrote:
> > >>
> > >>>Also I search across about 50 fields but I don't
> > >>
> > >>use
> > >>
> > >>>wildcard or range queries. 
> > >>
> > >>Lucene uses one byte of RAM per document per
> > >>searched field, to hold the 
> > >>normalization values.  So if you search a 10M
> > >>document collection with 
> > >>50 fields, then you'll end up using 500MB of RAM.
> > >>
> > >>If you're using unanalyzed fields, then an easy
> > >>workaround to reduce the 
> > >>number of fields is to combine many in a single
> > >>field.  So, instead of, 
> > >>e.g., using an "f1" field with value "abc", and an
> > >>"f2" field with value 
> > >>"efg", use a single field named "f" with values
> > >>"1_abc" and "2_efg".
> > >>
> > >>We could optimize this in Lucene.  If no values of
> > >>an indexed field are 
> > >>analyzed, then we could store no norms for the
> > field
> > >>and hence read none 
> > >>into memory.  This wouldn't be too hard to
> > >>implement...
> > >>
> > >>Doug
> > >>
> > >>
> > > 
> > >
> >
> -
> > > 
> > >>To unsubscribe, e-mail:
> > >>[EMAIL PROTECTED]
> > >>For additional commands, e-mail:
> > >>[EMAIL PROTECTED]
> > >>
> > > 
> > > 
> > > 
> > > 
> > >   
> > >   
> > > __
> > > Do you Yahoo!?
> > > Friends.  Fun.  Try the all-new Yahoo! Messenger.
> > > http://messenger.yahoo.com/ 
> > > 
> > >
> >
> -
> > > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > > 
> > 
> >
> -
> > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > 
> 
> 
> 
>   
>   
> __
> Do you Yahoo!?
> Friends.  Fun.  Try the all-new Yahoo! Messenger.
> http://messenger.yahoo.com/ 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

We only search on analyzed text fields.  There are a
couple of additional fields in the index like
OBJECT_ID that are keywords but we don't search
against those, we only use them once we get a result
back to find the thing that document represents.

Thanks,

Jim

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> It is cached by the IndexReader and lives until the
> index reader is 
> garbage collected.  50-70 searchable fields is a
> *lot*.  How many are 
> analyzed text, and how many are simply keywords?
> 
> Doug
> 
> James Dunn wrote:
> > Doug,
> > 
> > Thanks!  
> > 
> > I just asked a question regarding how to calculate
> the
> > memory requirements for a search.  Does this
> memory
> > only get used only during the search operation
> itself,
> > or is it referenced by the Hits object or anything
> > else after the actual search completes?
> > 
> > Thanks again,
> > 
> > Jim
> > 
> > 
> > --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > 
> >>James Dunn wrote:
> >>
> >>>Also I search across about 50 fields but I don't
> >>
> >>use
> >>
> >>>wildcard or range queries. 
> >>
> >>Lucene uses one byte of RAM per document per
> >>searched field, to hold the 
> >>normalization values.  So if you search a 10M
> >>document collection with 
> >>50 fields, then you'll end up using 500MB of RAM.
> >>
> >>If you're using unanalyzed fields, then an easy
> >>workaround to reduce the 
> >>number of fields is to combine many in a single
> >>field.  So, instead of, 
> >>e.g., using an "f1" field with value "abc", and an
> >>"f2" field with value 
> >>"efg", use a single field named "f" with values
> >>"1_abc" and "2_efg".
> >>
> >>We could optimize this in Lucene.  If no values of
> >>an indexed field are 
> >>analyzed, then we could store no norms for the
> field
> >>and hence read none 
> >>into memory.  This wouldn't be too hard to
> >>implement...
> >>
> >>Doug
> >>
> >>
> > 
> >
>
-
> > 
> >>To unsubscribe, e-mail:
> >>[EMAIL PROTECTED]
> >>For additional commands, e-mail:
> >>[EMAIL PROTECTED]
> >>
> > 
> > 
> > 
> > 
> > 
> > 
> > __
> > Do you Yahoo!?
> > Friends.  Fun.  Try the all-new Yahoo! Messenger.
> > http://messenger.yahoo.com/ 
> > 
> >
>
-
> > To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 





__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage

2004-05-26 Thread Doug Cutting
It is cached by the IndexReader and lives until the index reader is 
garbage collected.  50-70 searchable fields is a *lot*.  How many are 
analyzed text, and how many are simply keywords?

Doug
James Dunn wrote:
Doug,
Thanks!  

I just asked a question regarding how to calculate the
memory requirements for a search.  Does this memory
get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?
Thanks again,
Jim
--- Doug Cutting <[EMAIL PROTECTED]> wrote:
James Dunn wrote:
Also I search across about 50 fields but I don't
use
wildcard or range queries. 
Lucene uses one byte of RAM per document per
searched field, to hold the 
normalization values.  So if you search a 10M
document collection with 
50 fields, then you'll end up using 500MB of RAM.

If you're using unanalyzed fields, then an easy
workaround to reduce the 
number of fields is to combine many in a single
field.  So, instead of, 
e.g., using an "f1" field with value "abc", and an
"f2" field with value 
"efg", use a single field named "f" with values
"1_abc" and "2_efg".

We could optimize this in Lucene.  If no values of
an indexed field are 
analyzed, then we could store no norms for the field
and hence read none 
into memory.  This wouldn't be too hard to
implement...

Doug

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]


	
		
__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Memory usage

2004-05-26 Thread James Dunn
Doug,

Thanks!  

I just asked a question regarding how to calculate the
memory requirements for a search.  Does this memory
get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?

Thanks again,

Jim


--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> James Dunn wrote:
> > Also I search across about 50 fields but I don't
> use
> > wildcard or range queries. 
> 
> Lucene uses one byte of RAM per document per
> searched field, to hold the 
> normalization values.  So if you search a 10M
> document collection with 
> 50 fields, then you'll end up using 500MB of RAM.
> 
> If you're using unanalyzed fields, then an easy
> workaround to reduce the 
> number of fields is to combine many in a single
> field.  So, instead of, 
> e.g., using an "f1" field with value "abc", and an
> "f2" field with value 
> "efg", use a single field named "f" with values
> "1_abc" and "2_efg".
> 
> We could optimize this in Lucene.  If no values of
> an indexed field are 
> analyzed, then we could store no norms for the field
> and hence read none 
> into memory.  This wouldn't be too hard to
> implement...
> 
> Doug
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 





__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage

2004-05-26 Thread James Dunn
Erik,

Thanks for the response.  

My actual documents are fairly small.  Most docs only
have about 10 fields.  Some of those fields are
stored, however, like the OBJECT_ID, NAME and DESC
fields.  The stored fields are pretty small as well. 
None should be more than 4KB and very few will
approach that limit.

I'm also using the default maxFieldSize value of
1.  

I'm not caching hits, either.

Could it be my query?  I have about 80 total unique
fields in the index although no document has all 80. 
My query ends up looking like this:

+(F1:test F2:test ..  F80:test)

From previous mails that doesn't look like an enormous
number of fields to be searching against.  Is there
some formula for the amount of memory required for a
query based on the number of clauses and terms?

Jim



--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> How big are your actual Documents?  Are you caching
> Hits?  It stores, 
> internally, up to 200 documents.
> 
>   Erik
> 
> 
> On May 26, 2004, at 4:08 PM, James Dunn wrote:
> 
> > Will,
> >
> > Thanks for your response.  It may be an object
> leak.
> > I will look into that.
> >
> > I just ran some more tests and this time I create
> a
> > 20GB index by repeatedly merging my large index
> into
> > itself.
> >
> > When I ran my test query against that index I got
> an
> > OutOfMemoryError on the very first query.  I have
> my
> > heap set to 512MB.  Should a query against a 20GB
> > index require that much memory?  I page through
> the
> > results 100 at a time, so I should never have more
> > than 100 Document objects in memory.
> >
> > Any help would be appreciated, thanks!
> >
> > Jim
> > --- [EMAIL PROTECTED] wrote:
> >> This sounds like a memory leakage situation.  If
> you
> >> are using tomcat I
> >> would suggest you make sure you are on a recent
> >> version, as it is known to
> >> have some memory leaks in version 4.  It doesn't
> >> make sense that repeated
> >> queries would use more memory that the most
> >> demanding query unless objects
> >> are not getting freed from memory.
> >>
> >> -Will
> >>
> >> -Original Message-
> >> From: James Dunn [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, May 26, 2004 3:02 PM
> >> To: [EMAIL PROTECTED]
> >> Subject: Memory usage
> >>
> >>
> >> Hello,
> >>
> >> I was wondering if anyone has had problems with
> >> memory
> >> usage and MultiSearcher.
> >>
> >> My index is composed of two sub-indexes that I
> >> search
> >> with a MultiSearcher.  The total size of the
> index
> >> is
> >> about 3.7GB with the larger sub-index being 3.6GB
> >> and
> >> the smaller being 117MB.
> >>
> >> I am using Lucene 1.3 Final with the compound
> file
> >> format.
> >>
> >> Also I search across about 50 fields but I don't
> use
> >> wildcard or range queries.
> >>
> >> Doing repeated searches in this way seems to
> >> eventually chew up about 500MB of memory which
> seems
> >> excessive to me.
> >>
> >> Does anyone have any ideas where I could look to
> >> reduce the memory my queries consume?
> >>
> >> Thanks,
> >>
> >> Jim
> >>
> >>
> >>
> >>
> >> __
> >> Do you Yahoo!?
> >> Friends.  Fun.  Try the all-new Yahoo! Messenger.
> >> http://messenger.yahoo.com/
> >>
> >>
> >
>
-
> >> To unsubscribe, e-mail:
> >> [EMAIL PROTECTED]
> >> For additional commands, e-mail:
> >> [EMAIL PROTECTED]
> >>
> >>
> >
>
-
> >> To unsubscribe, e-mail:
> >> [EMAIL PROTECTED]
> >> For additional commands, e-mail:
> >> [EMAIL PROTECTED]
> >>
> >
> >
> > __
> > Do You Yahoo!?
> > Tired of spam?  Yahoo! Mail has the best spam
> protection around
> > http://mail.yahoo.com
> >
> >
>
-
> > To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 





__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote:
Also I search across about 50 fields but I don't use
wildcard or range queries. 
Lucene uses one byte of RAM per document per searched field, to hold the 
normalization values.  So if you search a 10M document collection with 
50 fields, then you'll end up using 500MB of RAM.

If you're using unanalyzed fields, then an easy workaround to reduce the 
number of fields is to combine many in a single field.  So, instead of, 
e.g., using an "f1" field with value "abc", and an "f2" field with value 
"efg", use a single field named "f" with values "1_abc" and "2_efg".

We could optimize this in Lucene.  If no values of an indexed field are 
analyzed, then we could store no norms for the field and hence read none 
into memory.  This wouldn't be too hard to implement...

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Memory usage

2004-05-26 Thread Erik Hatcher
How big are your actual Documents?  Are you caching Hits?  It stores, 
internally, up to 200 documents.

Erik
On May 26, 2004, at 4:08 PM, James Dunn wrote:
Will,
Thanks for your response.  It may be an object leak.
I will look into that.
I just ran some more tests and this time I create a
20GB index by repeatedly merging my large index into
itself.
When I ran my test query against that index I got an
OutOfMemoryError on the very first query.  I have my
heap set to 512MB.  Should a query against a 20GB
index require that much memory?  I page through the
results 100 at a time, so I should never have more
than 100 Document objects in memory.
Any help would be appreciated, thanks!
Jim
--- [EMAIL PROTECTED] wrote:
This sounds like a memory leakage situation.  If you
are using tomcat I
would suggest you make sure you are on a recent
version, as it is known to
have some memory leaks in version 4.  It doesn't
make sense that repeated
queries would use more memory that the most
demanding query unless objects
are not getting freed from memory.
-Will
-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 3:02 PM
To: [EMAIL PROTECTED]
Subject: Memory usage
Hello,
I was wondering if anyone has had problems with
memory
usage and MultiSearcher.
My index is composed of two sub-indexes that I
search
with a MultiSearcher.  The total size of the index
is
about 3.7GB with the larger sub-index being 3.6GB
and
the smaller being 117MB.
I am using Lucene 1.3 Final with the compound file
format.
Also I search across about 50 fields but I don't use
wildcard or range queries.
Doing repeated searches in this way seems to
eventually chew up about 500MB of memory which seems
excessive to me.
Does anyone have any ideas where I could look to
reduce the memory my queries consume?
Thanks,
Jim


__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Memory usage

2004-05-26 Thread James Dunn
Will,

Thanks for your response.  It may be an object leak. 
I will look into that.

I just ran some more tests and this time I create a
20GB index by repeatedly merging my large index into
itself.

When I ran my test query against that index I got an
OutOfMemoryError on the very first query.  I have my
heap set to 512MB.  Should a query against a 20GB
index require that much memory?  I page through the
results 100 at a time, so I should never have more
than 100 Document objects in memory.  

Any help would be appreciated, thanks!

Jim
--- [EMAIL PROTECTED] wrote:
> This sounds like a memory leakage situation.  If you
> are using tomcat I
> would suggest you make sure you are on a recent
> version, as it is known to
> have some memory leaks in version 4.  It doesn't
> make sense that repeated
> queries would use more memory that the most
> demanding query unless objects
> are not getting freed from memory.
> 
> -Will
> 
> -Original Message-
> From: James Dunn [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 26, 2004 3:02 PM
> To: [EMAIL PROTECTED]
> Subject: Memory usage
> 
> 
> Hello,
> 
> I was wondering if anyone has had problems with
> memory
> usage and MultiSearcher.
> 
> My index is composed of two sub-indexes that I
> search
> with a MultiSearcher.  The total size of the index
> is
> about 3.7GB with the larger sub-index being 3.6GB
> and
> the smaller being 117MB.
> 
> I am using Lucene 1.3 Final with the compound file
> format.
> 
> Also I search across about 50 fields but I don't use
> wildcard or range queries. 
> 
> Doing repeated searches in this way seems to
> eventually chew up about 500MB of memory which seems
> excessive to me.
> 
> Does anyone have any ideas where I could look to
> reduce the memory my queries consume?
> 
> Thanks,
> 
> Jim
> 
> 
>   
>   
> __
> Do you Yahoo!?
> Friends.  Fun.  Try the all-new Yahoo! Messenger.
> http://messenger.yahoo.com/ 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Memory usage

2004-05-26 Thread wallen
This sounds like a memory leakage situation.  If you are using tomcat I
would suggest you make sure you are on a recent version, as it is known to
have some memory leaks in version 4.  It doesn't make sense that repeated
queries would use more memory than the most demanding query unless objects
are not getting freed from memory.

-Will

-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 3:02 PM
To: [EMAIL PROTECTED]
Subject: Memory usage


Hello,

I was wondering if anyone has had problems with memory
usage and MultiSearcher.

My index is composed of two sub-indexes that I search
with a MultiSearcher.  The total size of the index is
about 3.7GB with the larger sub-index being 3.6GB and
the smaller being 117MB.

I am using Lucene 1.3 Final with the compound file
format.

Also I search across about 50 fields but I don't use
wildcard or range queries. 

Doing repeated searches in this way seems to
eventually chew up about 500MB of memory which seems
excessive to me.

Does anyone have any ideas where I could look to
reduce the memory my queries consume?

Thanks,

Jim




__
Do you Yahoo!?
Friends.  Fun.  Try the all-new Yahoo! Messenger.
http://messenger.yahoo.com/ 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Memory Usage?

2001-11-13 Thread Anders Nielsen

Oh, I'm sorry. I was searching through an index of about 50,000 documents. I have no
deleted entries.

With the files you sent out I got 4 hits; with the old lucene.jar file I got
8.

I'll try to investigate further which hits don't show up and why.



Kind regards,
Anders Nielsen


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: 12. november 2001 22:53
To: 'Lucene Users List'
Subject: RE: Memory Usage?


> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
>
> hmm, I seem to be getting a different number of hits when I
> use the files
> you sent out.

Please provide more information!  Is it larger or smaller than before?  By
how much?  What differences show up in the hits?  That's a terrible bug
report...

I think before it may have been possible to get a spurious hit if a query
term only occurred in deleted documents.  A wildcard query with 40,000 terms
might make this sort of thing happen more often, and unless you tried to
access the Hits.doc() for such a hit, you would not see an error.  If this
was in fact a problem, the code I just sent out would have fixed it.  So
your results may in fact be better.  Or there may be a bug in what I sent.
Or both!

For the cases I have tried I get the same results with and without those
changes.

Doug

--
To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




RE: Memory Usage?

2001-11-13 Thread Halácsy Péter



> -Original Message-
> From: Brian Goetz [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, November 13, 2001 8:58 AM
> To: Lucene Users List
> Cc: [EMAIL PROTECTED]
> Subject: Re: Memory Usage?
> 
> 
> > Since this is changing behavior that people are depending 
> on, what about 
> > creating a new QueryParser called QueryParserSafe that 
> excludes option.
> > I don't like the idea of removing functionality with no backward 
> > compatibility.
> 
> I knew this was coming.  
> 
> I'm sorry, but I have to laugh just a little bit.  The new query
> parser has only existed for less than two months -- and people have
> built empires based on it?  I'm perfectly willing to debate whether
it's a good idea or not to remove the wildcard match syntax from the
> query parser, but I think the "backward compatibility" argument is one
> of the less compelling arguments against doing so.  Bear in mind that
> no one is suggesting removing the functionality from the core -- just
> restricting its use to programmatically generated queries.  A strong
> argument can be made for not exposing the "don't try this at home"
> behavior through an interface that is bound to be used by naive
> end-users.
> 

How about this:
"You must have at least four non-wildcard characters in a word before
you introduce a wildcard."  (source:
http://www.northernlight.com/docs/search_help_optimize.html)

I think the best approach would be to have a parameter (on the query
parser, or on the IndexSearcher?) to set the minimum number of non-wildcard
characters required before any wildcard.
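
A minimal sketch of such a check, applied to the raw query string before it reaches QueryParser (the four-character threshold follows the Northern Light rule quoted above; the method name is made up, and it deliberately ignores query syntax such as field prefixes):

// returns true only if every wildcard (* or ?) is preceded by at least
// minPrefix literal characters within its whitespace-separated term
public static boolean wildcardPrefixLongEnough(String queryText, int minPrefix) {
    String[] terms = queryText.split("\\s+");
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        int star = term.indexOf('*');
        int quest = term.indexOf('?');
        int first = (star < 0) ? quest : (quest < 0 ? star : Math.min(star, quest));
        if (first >= 0 && first < minPrefix) {
            return false;  // wildcard appears too early in this term
        }
    }
    return true;
}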

peter




Re: Memory Usage?

2001-11-12 Thread carlson

Since this is changing behavior that people are depending on, what about 
creating a new QueryParser called QueryParserSafe that excludes this option?

Any thoughts?

Any thoughts.

--Peter


On Monday, November 12, 2001, at 12:57 PM, Brian Goetz wrote:

>> This was a single query?  How many terms, and of what type are in the 
>> query?
>>> From the trace it looks like there could be over 40,000 terms in the 
>>> query!
>> Is this a prefix or wildcard query?  These can generate *very* large
>> queries...
>
> I think the fact that prefix / wildcard queries can generate such
> nasty huge queries is a good reason to consider taking the Foo* syntax
> out of the query parser.  I realize that its cool, and there are some
> people who probably rely on it already, but it seems that the prefix
> query is in the category of "professional driver, don't try this at
> home" and therefore should be kept behind the locked cabinet where
> ordinary shoppers won't take it home by accident (to mix some
> metaphors horribly.)
>
> --
> To unsubscribe, e-mail:    [EMAIL PROTECTED]>
> For additional commands, e-mail:  [EMAIL PROTECTED]>
>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Memory Usage?

2001-11-12 Thread Brian Goetz

> This was a single query?  How many terms, and of what type are in the query?
> From the trace it looks like there could be over 40,000 terms in the query!
> Is this a prefix or wildcard query?  These can generate *very* large
> queries...

I think the fact that prefix / wildcard queries can generate such
nasty huge queries is a good reason to consider taking the Foo* syntax
out of the query parser.  I realize that it's cool, and there are some
people who probably rely on it already, but it seems that the prefix
query is in the category of "professional driver, don't try this at
home" and therefore should be kept behind the locked cabinet where
ordinary shoppers won't take it home by accident (to mix some
metaphors horribly.)

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Memory Usage?

2001-11-12 Thread Doug Cutting

> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> 
> hmm, I seem to be getting a different number of hits when I 
> use the files
> you sent out.

Please provide more information!  Is it larger or smaller than before?  By
how much?  What differences show up in the hits?  That's a terrible bug
report...

I think before it may have been possible to get a spurious hit if a query
term only occurred in deleted documents.  A wildcard query with 40,000 terms
might make this sort of thing happen more often, and unless you tried to
access the Hits.doc() for such a hit, you would not see an error.  If this
was in fact a problem, the code I just sent out would have fixed it.  So
your results may in fact be better.  Or there may be a bug in what I sent.
Or both!

For the cases I have tried I get the same results with and without those
changes.

Doug

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Memory Usage?

2001-11-12 Thread Anders Nielsen

hmm, I seem to be getting a different number of hits when I use the files
you sent out.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: 12. november 2001 20:47
To: 'Lucene Users List'
Subject: RE: Memory Usage?


> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
>
> this was a big boolean query, with several prefixqueries but
> no wildcard
> queries in the or-branches.

Well it looks like those prefixes are expanding to a lot of terms, a total
of over 40,000!  (A prefix query expands into a BooleanQuery with all the
terms matching the prefix.)

If most of these expansions are low-frequency, then a simple fix should
improve things considerably.  I've attached an optimized version of
TermQuery that will hold less memory per low-frequency term.  In particular,
if a term occurs fewer than 128 times then a 1024 byte InputStream buffer is
freed immediately.

Tell me how this works.  Please send another heap dump.

Longer term, or if lots of the expanded terms occur more than 128 times,
perhaps BooleanScorer should use a different algorithm when there are
thousands of terms.  In this case it might use less memory to construct an
array of score buckets for all documents.  If (query.termCount() * 1024) >
(12 * getMaxDoc()) then this would use less memory.  In your case, with
500,000 documents and a 40,000 term query, it's currently taking 40MB/query,
and could be done in 6MB/query.  This optimization would not be too
difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer.

Doug




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




RE: Memory Usage?

2001-11-12 Thread Doug Cutting

> From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
> 
> I think something like this would be a HUGE boon for us.  We 
> do a lot of
> complex queries on a lot of different indexes and end up 
> suffering from
> severe garbage collection issues on our system.  I'd be 
> willing to help out
> in any way to make this issue go away as soon as possible.

Did you try the code I just sent out?  Did it help much?

A problem with things like PrefixQuery are that they let folks easily
construct queries which are *very* expensive to evaluate.  It is no
coincidence that Google et. al. do not permit these sort of queries.  So,
while we can remove some of the GC overhead, don't forget that these are
still expensive operations and will still be rather slow.  A feature like
PrefixQuery should thus be used sparingly.

Doug

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Memory Usage?

2001-11-12 Thread Scott Ganyo

I think something like this would be a HUGE boon for us.  We do a lot of
complex queries on a lot of different indexes and end up suffering from
severe garbage collection issues on our system.  I'd be willing to help out
in any way to make this issue go away as soon as possible.

Scott

> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> Sent: Monday, November 12, 2001 2:47 PM
> To: 'Lucene Users List'
> Subject: RE: Memory Usage?
> 
> 
> > From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> > 
> > this was a big boolean query, with several prefixqueries but 
> > no wildcard
> > queries in the or-branches.
> 
> Well it looks like those prefixes are expanding to a lot of 
> terms, a total
> of over 40,000!  (A prefix query expands into a BooleanQuery 
> with all the
> terms matching the prefix.)
> 
> If most of these expansions are low-frequency, then a simple 
> fix should
> improve things considerably.  I've attached an optimized version of
> TermQuery that will hold less memory per low-frequency term.  
> In particular,
> if a term occurs fewer than 128 times then a 1024 byte 
> InputStream buffer is
> freed immediately.
> 
> Tell me how this works.  Please send another heap dump.
> 
> Longer term, or if lots of the expanded terms occur more than 
> 128 times,
> perhaps BooleanScorer should use a different algorithm when there are
> thousands of terms.  In this case it might use less memory to 
> construct an
> array of score buckets for all documents.  If 
> (query.termCount() * 1024) >
> (12 * getMaxDoc()) then this would use less memory.  In your 
> case, with
> 500,000 documents and a 40,000 term query, it's currently 
> taking 40MB/query,
> and could be done in 6MB/query.  This optimization would not be too
> difficult, as it could be mostly isolated to BooleanQuery and 
> BooleanScorer.
> 
> Doug
> 
> 
> 



RE: Memory Usage?

2001-11-12 Thread Doug Cutting

> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> 
> this was a big boolean query, with several prefixqueries but 
> no wildcard
> queries in the or-branches.

Well it looks like those prefixes are expanding to a lot of terms, a total
of over 40,000!  (A prefix query expands into a BooleanQuery with all the
terms matching the prefix.)

If most of these expansions are low-frequency, then a simple fix should
improve things considerably.  I've attached an optimized version of
TermQuery that will hold less memory per low-frequency term.  In particular,
if a term occurs fewer than 128 times then a 1024 byte InputStream buffer is
freed immediately.

Tell me how this works.  Please send another heap dump.

Longer term, or if lots of the expanded terms occur more than 128 times,
perhaps BooleanScorer should use a different algorithm when there are
thousands of terms.  In this case it might use less memory to construct an
array of score buckets for all documents.  If (query.termCount() * 1024) >
(12 * getMaxDoc()) then this would use less memory.  In your case, with
500,000 documents and a 40,000 term query, it's currently taking 40MB/query,
and could be done in 6MB/query.  This optimization would not be too
difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer.

Doug





PhraseQuery.java
Description: Binary data


TermQuery.java
Description: Binary data


TermScorer.java
Description: Binary data
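
The attached files are not reproduced in the archive; as a rough sketch of the
idea only (not the actual patch), a term whose document frequency is below the
128 threshold could have its postings read into small int arrays up front and
the TermDocs closed immediately, so the 1024-byte stream buffer is no longer
held for the rest of the search.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class LowFreqPostingsSketch {

    final int[] docs;   // document numbers for the term
    final int[] freqs;  // term frequency within each document

    LowFreqPostingsSketch(IndexReader reader, Term term, int docFreq) throws IOException {
        docs = new int[docFreq];
        freqs = new int[docFreq];
        TermDocs td = reader.termDocs(term);
        try {
            int i = 0;
            while (i < docFreq && td.next()) {
                docs[i] = td.doc();
                freqs[i] = td.freq();
                i++;
            }
        } finally {
            td.close();  // drops the buffered InputStream right away
        }
    }
}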



RE: Memory Usage?

2001-11-12 Thread Anders Nielsen

this was a big boolean query, with several prefixqueries but no wildcard
queries in the or-branches.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: 12. november 2001 17:41
To: Lucene Users List
Subject: RE: Memory Usage?


This was a single query?  How many terms, and of what type, are in the query?
From the trace it looks like there could be over 40,000 terms in the query!
Is this a prefix or wildcard query?  These can generate *very* large
queries...

Doug


> -Original Message-
> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, November 11, 2001 6:59 AM
> To: Lucene Users List
> Subject: RE: Memory Usage?
>
>
> I am not very familiar with the output of -Xrunhprof, but
> I've attached the
> output of a run of a search through an index of 50,000 
> documents. It gave
> me out-of-memory errors until I allocated 100 megabytes of heap-space.
>
> The top 10:
>
> SITES BEGIN (ordered by live bytes) Sun Nov 11 15:50:31 2001
>   percent live   alloc'ed  stack class
>  rank   self  accum   bytes objs   bytes objs trace name
> 1 26.41% 26.41% 12485200 12005 45566560 43814  1783 [B
> 2 25.18% 51.59% 11904880 11447 44867680 43142  1796 [B
> 3  4.15% 55.74%  1962904 69214 171546352 5510292  1632 [C
> 4  3.83% 59.58%  1812096 3432 1812096 3432  1768 [I
> 5  3.83% 63.41%  1812096 3432 1812096 3432  1769 [I
> 6  3.34% 66.75%  1580688 65862 130618992 5442458  1631
> java.lang.String
> 7  3.19% 69.95%  1509584 44763 1509584 44763   458 [C
> 8  3.03% 72.98%  1432416 44763 1432416 44763   459
> org.apache.lucene.index.TermInfo
> 9  2.27% 75.25%  1074312 44763 1074312 44763   457
> java.lang.String
>10  2.23% 77.48%  1053792 65862 87079328 5442458  1631
> org.apache.lucene.index.Term
>
> and the top 3 traces were:
>
> TRACE 1783:
>
> org.apache.lucene.store.InputStream.refill(InputStream.java:165)
>
> org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
>
> org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
>
> org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:101)
>
> TRACE 1796:
>
> org.apache.lucene.store.InputStream.refill(InputStream.java:165)
>
> org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
>
> org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
>
> org.apache.lucene.index.SegmentTermPositions.next(SegmentTermP
> ositions.java:
> 100)
>
> TRACE 1632:
> java.lang.String.<init>(String.java:198)
>
> org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEn
> um.java:134)
>
> org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
>
> org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosRead
> er.java:166)
>
>
> I've attached the whole trace as gzipped.txt
>
> regards,
> Anders Nielsen
>
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> Sent: 10. november 2001 04:35
> To: 'Lucene Users List'
> Subject: RE: Memory Usage?
>
>
> I'm surprised that your memory use is that high.
>
> An IndexReader requires:
>   one byte per field per document in index (norms)
>   one open file per file in index
>   1/128 of the Terms in the index
> a Term has two pointers (8 bytes)
>  and a String (4 pointers = 24 bytes, one to 16-bit chars)
>
> A Search requires:
>   1 1024 byte buffer per TermQuery
>   2 128 int buffers per TermQuery
>   2 1024 byte buffers per PhraseQuery term
>   1 1024 element bucket array per BooleanQuery
> each bucket has 5 fields, and hence requires ~20 bytes
>   1 bit per document in index per DateFilter
>
> A Hits requires:
>   up to n+100 ScoreDocs (float+int, 8 bytes)
> where n is the highest Hits.doc(n) accessed
>   up to 200 Document objects
>
> I may have forgotten something...
>
> Let's assume that your 1M document index has 2M unique terms,
> and that you
> only look at the top-100 hits, that your index has three
> fields, and that
> the typical document has two stored fields, each 20 characters.  Your
> 30-term boolean query over a 1M document index should use around the
> following numbers of bytes:
>   IndexReader:
> 3,000,000 (norms)
> 1,000,000 (1/128 of 2M terms, each requiring ~50 bytes)
>   during search
>50,000 (TermQuery buffers)
>20,000 (BooleanQuery buckets)
>   100,000 (DateFilter bit vector)
>   in Hits
> 2,000 (200 ScoreDocs)
>30,000 (up to 200 cached Documents)
>
> So searches should run in a 5Mb heap.  Are my assumptions off?
>
> You can also see why it is useful to keep a single IndexReader and use it
> for all queries.  (IndexReader is thread safe.)
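
Summing the figures quoted above comes to roughly 4.2 MB, which is where the
5 MB heap estimate comes from. A back-of-the-envelope version, with every
constant taken straight from the message rather than measured:

public class SearchMemoryEstimate {
    public static void main(String[] args) {
        long reader = 3000000L    // norms: 1 byte per field per doc, 3 fields, 1M docs
                    + 1000000L;   // 1/128 of 2M terms at ~50 bytes each
        long search = 50000L      // TermQuery buffers for a 30-term query
                    + 20000L      // BooleanQuery bucket array
                    + 100000L;    // DateFilter bit vector (1 bit per document)
        long hits   = 2000L       // ~200 ScoreDocs
                    + 30000L;     // up to 200 cached Documents

        System.out.println("total ~" + (reader + search + hits) + " bytes");  // ~4.2 MB
    }
}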

RE: Memory Usage?

2001-11-12 Thread Doug Cutting

This was a single query?  How many terms, and of what type, are in the query?
From the trace it looks like there could be over 40,000 terms in the query!
Is this a prefix or wildcard query?  These can generate *very* large
queries...

Doug


> -Original Message-
> From: Anders Nielsen [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, November 11, 2001 6:59 AM
> To: Lucene Users List
> Subject: RE: Memory Usage?
> 
> 
> I am not very familiar with the output of -Xrunhprof, but 
> I've attached the
> output of a run of a search through an index of 50,000 
> documents. It gave
> me out-of-memory errors until I allocated 100 megabytes of heap-space.
> 
> The top 10:
> 
> SITES BEGIN (ordered by live bytes) Sun Nov 11 15:50:31 2001
>   percent live   alloc'ed  stack class
>  rank   self  accum   bytes objs   bytes objs trace name
> 1 26.41% 26.41% 12485200 12005 45566560 43814  1783 [B
> 2 25.18% 51.59% 11904880 11447 44867680 43142  1796 [B
> 3  4.15% 55.74%  1962904 69214 171546352 5510292  1632 [C
> 4  3.83% 59.58%  1812096 3432 1812096 3432  1768 [I
> 5  3.83% 63.41%  1812096 3432 1812096 3432  1769 [I
> 6  3.34% 66.75%  1580688 65862 130618992 5442458  1631 
> java.lang.String
> 7  3.19% 69.95%  1509584 44763 1509584 44763   458 [C
> 8  3.03% 72.98%  1432416 44763 1432416 44763   459
> org.apache.lucene.index.TermInfo
> 9  2.27% 75.25%  1074312 44763 1074312 44763   457 
> java.lang.String
>10  2.23% 77.48%  1053792 65862 87079328 5442458  1631
> org.apache.lucene.index.Term
> 
> and the top 3 traces were:
> 
> TRACE 1783:
> 
> org.apache.lucene.store.InputStream.refill(InputStream.java:165)
> 
> org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
> 
> org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
> 
> org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:101)
> 
> TRACE 1796:
> 
> org.apache.lucene.store.InputStream.refill(InputStream.java:165)
> 
> org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
> 
> org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
> 
> org.apache.lucene.index.SegmentTermPositions.next(SegmentTermP
> ositions.java:
> 100)
> 
> TRACE 1632:
> java.lang.String.<init>(String.java:198)
> 
> org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEn
> um.java:134)
> 
> org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
> 
> org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosRead
> er.java:166)
> 
> 
> I've attached the whole trace as gzipped.txt
> 
> regards,
> Anders Nielsen
> 
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> Sent: 10. november 2001 04:35
> To: 'Lucene Users List'
> Subject: RE: Memory Usage?
> 
> 
> I'm surprised that your memory use is that high.
> 
> An IndexReader requires:
>   one byte per field per document in index (norms)
>   one open file per file in index
>   1/128 of the Terms in the index
> a Term has two pointers (8 bytes)
>  and a String (4 pointers = 24 bytes, one to 16-bit chars)
> 
> A Search requires:
>   1 1024 byte buffer per TermQuery
>   2 128 int buffers per TermQuery
>   2 1024 byte buffers per PhraseQuery term
>   1 1024 element bucket array per BooleanQuery
> each bucket has 5 fields, and hence requires ~20 bytes
>   1 bit per document in index per DateFilter
> 
> A Hits requires:
>   up to n+100 ScoreDocs (float+int, 8 bytes)
> where n is the highest Hits.doc(n) accessed
>   up to 200 Document objects
> 
> I may have forgotten something...
> 
> Let's assume that your 1M document index has 2M unique terms, 
> and that you
> only look at the top-100 hits, that your index has three 
> fields, and that
> the typical document has two stored fields, each 20 characters.  Your
> 30-term boolean query over a 1M document index should use around the
> following numbers of bytes:
>   IndexReader:
> 3,000,000 (norms)
> 1,000,000 (1/128 of 2M terms, each requiring ~50 bytes)
>   during search
>50,000 (TermQuery buffers)
>20,000 (BooleanQuery buckets)
>   100,000 (DateFilter bit vector)
>   in Hits
> 2,000 (200 ScoreDocs)
>30,000 (up to 200 cached Documents)
> 
> So searches should run in a 5Mb heap.  Are my assumptions off?
> 
> You can also see why it is useful to keep a single 
> IndexReader and use it
> for all queries.  (IndexReader is thread safe.)
> 
> You could also 'java -Xrunhprof:heap=sites' to see what's 
> using memory.
> 
> Doug
> 