date window

2007-05-17 Thread James O'Rourke

Hi All,

I've been thinking about this problem for some time now. I'm trying  
to figure out a way to store date windows in lucene so that I can  
easily filter as follows.


A particular document can have several date windows.
Given a specific date, only return those documents where that date falls within at least one of those windows.


From what I can see, the only way I can think of doing it is to create a special field format and a custom filter. The filter isn't that useful for caching, though, because every query will have a new date (essentially NOW()).


Also, note that there are multiple windows here for a single  
document, we can't just search between min start and max end.
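For what it's worth, a naive sketch of that custom-filter idea (everything here is an assumption: a stored field named "windows" holding space-separated "start:end" pairs as yyyyMMdd strings, and, as noted, the filter has to be rebuilt per query because the date changes):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Keeps only documents where the given date falls inside at least one stored window.
// Reads a stored field for every document, so it is slow on large indexes, and since
// "now" changes per query the resulting BitSet can't usefully be cached.
public class DateWindowFilter extends Filter {
    private final String now;                       // e.g. "20070517"

    public DateWindowFilter(String now) { this.now = now; }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;
            String windows = reader.document(i).get("windows");   // hypothetical field
            if (windows == null) continue;
            String[] pairs = windows.split(" ");
            for (int j = 0; j < pairs.length; j++) {
                int sep = pairs[j].indexOf(':');
                String start = pairs[j].substring(0, sep);
                String end = pairs[j].substring(sep + 1);
                // lexicographic comparison works for yyyyMMdd strings
                if (start.compareTo(now) <= 0 && now.compareTo(end) <= 0) {
                    bits.set(i);
                    break;
                }
            }
        }
        return bits;
    }
}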


Ideas from those more familiar with lucene would be greatly appreciated.

James





Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Paul Elschot
On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> I am currently exploring how to solve performance problems I encounter with
> Lucene document reads.
> 
> We have amongst other fields one field (default) storing all searchable
> fields.  This field can become of considerable size since we are  indexing
> documents and  store the content for display within results.
> 
> I noticed that the read can be very expensive.  I wonder now if it would
> make sense to add this field as Field.Store.Compress to the index.  Can
> someone tell me if this would speed up the document read or if this is
> something only interesting for saving space.

I have not tried the compression yet, but in my experience a good way
to reduce the cost of document reads from disk is to read them
in document number order whenever possible. That way you save
on disk head seeks.
Compression should actually help reduce the cost of disk head seeks
even more.
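A minimal sketch of that ordering trick (docIds stands for whatever document numbers a search produced; how you collected them is up to you):

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class OrderedDocReader {
    // Loads stored documents in ascending document-number order to cut down on disk seeks.
    public static Document[] load(IndexReader reader, int[] docIds) throws IOException {
        int[] sorted = (int[]) docIds.clone();
        Arrays.sort(sorted);
        Document[] docs = new Document[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            docs[i] = reader.document(sorted[i]);   // mostly forward reads through the stored fields file
        }
        return docs;
    }
}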

Regards,
Paul Elschot




Group the search results by a given field

2007-05-17 Thread Sawan Sharma

Hi All,

I was wondering - is it possible to search and group the results by a
given field?

For example, I have an index with several million records. Most of
them are different Features of the same ID.

I'd love to be able to do.. groupby=ID or something like that
in the results, and provide the ID as a clickable link to see
all the Features of that ID.

I have used the HitCollector class to accomplish this goal. In the collect method I use roughly the following, counting hits per ID in a hash table:

public void collect(int docId, float score) {
    try {
        String id = searcher.doc(docId).get("ID");
        Integer count = (Integer) counts.get(id);          // counts is a HashMap
        counts.put(id, new Integer(count == null ? 1 : count.intValue() + 1));
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

But it depends on the hit count: the more hits I get, the longer it takes, and my search performance degrades.

How can this be done with the best performance?

Any ideas?

Sawan


about to get

2007-05-17 Thread 童小军
Hi Luceners,

I want to get the TermFreqVector, but I need the docNum first:

titleVector = reader.getTermFreqVector(docNum, "title");

However, I can't get the docNum from a Lucene Document. How can I get the docNum for a Document object, so I can do something like getTermFreqVector(doc, "title")?

 

xiaojun tong

 

010-64489518-613

[EMAIL PROTECTED]

www.feedsky.com

 



Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Grant Ingersoll
I haven't tried compression either. I know there was some talk a
while ago about deprecating it, but that hasn't happened. The current
implementation uses the highest level of compression. You might
find better results by compressing in your application and storing the
result as a binary field, thus giving you more control over the CPU used.
This is our current recommendation for dealing with compression.
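For illustration only, a sketch of compressing in the application with java.util.zip at a faster compression level and storing the bytes as a binary field (the field name and UTF-8 encoding are assumptions):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class AppSideCompression {
    // Unlike Field.Store.COMPRESS, the compression level is under your control here.
    public static void addCompressed(Document doc, String name, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DeflaterOutputStream out = new DeflaterOutputStream(bytes, new Deflater(Deflater.BEST_SPEED));
        out.write(text.getBytes("UTF-8"));
        out.close();
        doc.add(new Field(name, bytes.toByteArray(), Field.Store.YES));   // binary stored field
    }
}

Reading it back would mean fetching the binary value and running it through an InflaterInputStream.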


If you are not actually displaying that field, you should look into
the FieldSelector API (via IndexReader). It allows you to lazily
load fields or skip them altogether, and can yield a pretty
significant savings when it comes to loading documents.
FieldSelector is available in 2.1.
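A minimal FieldSelector sketch along those lines (the "title" field name is just a placeholder):

import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;

// Loads only "title" eagerly; other fields are loaded lazily (use NO_LOAD to skip them entirely).
public class TitleOnlySelector implements FieldSelector {
    public FieldSelectorResult accept(String fieldName) {
        return "title".equals(fieldName) ? FieldSelectorResult.LOAD : FieldSelectorResult.LAZY_LOAD;
    }
}

// Usage: Document doc = reader.document(docId, new TitleOnlySelector());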


-Grant

On May 17, 2007, at 4:01 AM, Paul Elschot wrote:


On Thursday 17 May 2007 08:10, Andreas Guther wrote:

I am currently exploring how to solve performance problems I encounter with Lucene document reads.

We have amongst other fields one field (default) storing all searchable fields. This field can become of considerable size since we are indexing documents and store the content for display within results.

I noticed that the read can be very expensive. I wonder now if it would make sense to add this field as Field.Store.Compress to the index. Can someone tell me if this would speed up the document read or if this is something only interesting for saving space.

I have not tried the compression yet, but in my experience a good way
to reduce the costs of document reads from a disk is by reading them
in document number order whenever possible. In this way one saves
on the disk head seeks.
Compression should actually help reducing the costs of disk head seeks
even more.

Regards,
Paul Elschot




--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







Re: about to get

2007-05-17 Thread Grant Ingersoll
You can get it from a Hits object (see the id() method), or you can
iterate over the docs from 0 to maxDoc() - 1 (skipping deleted docs).


I have some code at http://www.cnlp.org/apachecon2005/ that shows  
various usages for Term Vector.  The Lucene in Action book has some  
good examples as well.
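A small sketch of the Hits.id() route (assuming the "title" field was indexed with term vectors):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TermVectorLookup {
    public static void printTitleVectors(IndexSearcher searcher, IndexReader reader, Query query)
            throws IOException {
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            int docNum = hits.id(i);                                    // internal document number
            TermFreqVector vector = reader.getTermFreqVector(docNum, "title");
            if (vector != null) {
                System.out.println(docNum + ": " + vector);
            }
        }
    }
}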


-Grant

On May 17, 2007, at 6:10 AM, 童小军 wrote:


Hi Luceners,

I want to get the TermFreqVector, but I need the docNum first:

titleVector = reader.getTermFreqVector(docNum, "title");

However, I can't get the docNum from a Lucene Document. How can I get the docNum for a Document object, so I can do something like getTermFreqVector(doc, "title")?



xiaojun tong



010-64489518-613

[EMAIL PROTECTED]

www.feedsky.com





--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







How to ignore scoring for a Query?

2007-05-17 Thread Benjamin Pasero

Hi,

I have two different use-cases for my queries. For the first,
performance is not too critical
and I want to sort the results by relevance (score). The second however,
is performance critical,
but the score for each result is not interesting. I guess, if it was
possible to disable scoring
for the query, I could improve performance (note that omitNorms on a
Field is not an option, due
to the first use case).

Is there a straightforward way to disable scoring for a query? (It's a
BooleanQuery, by the way, with some clauses, each of which can be any other query.)

Thanks,
Ben





Re: Group the search results by a given field

2007-05-17 Thread Erick Erickson

There has been significant discussion on this topic (way more than
I can remember clearly) on this mailing list, but as I remember it's
usually referred to as "facet" or "faceted" search. I think you would get
a lot of info searching for those terms at...

http://www.gossamer-threads.com/lists/lucene/java-user/

Best
Erick
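One technique that comes up in those facet discussions (not necessarily what the threads settled on) is to pre-load the grouping field with FieldCache, so collect() never has to read a stored document per hit. A rough sketch, assuming "ID" is indexed as a single untokenized term per document:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

// Counts hits per ID using the in-memory FieldCache array instead of stored fields.
public class GroupCountCollector extends HitCollector {
    private final String[] ids;                  // one entry per document number
    private final Map counts = new HashMap();    // ID -> Integer count

    public GroupCountCollector(IndexReader reader) throws IOException {
        this.ids = FieldCache.DEFAULT.getStrings(reader, "ID");
    }

    public void collect(int doc, float score) {
        String id = ids[doc];
        Integer count = (Integer) counts.get(id);
        counts.put(id, new Integer(count == null ? 1 : count.intValue() + 1));
    }

    public Map getCounts() { return counts; }
}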


On 5/17/07, Sawan Sharma <[EMAIL PROTECTED]> wrote:


Hi All,

I was wondering - is it possible to search and group the results by a
given field?

For example, I have an index with several million records. Most of
them are different Features of the same ID.

I'd love to be able to do.. groupby=ID or something like that
in the results, and provide the ID as a clickable link to see
all the Features of that ID.

I have used HitCollector class to accomplish this goal. In Collect method
I
have used following algo...

Collect()
{
if Searcher.Doc(doc_id).get(ID) is not exist in HashKey then
   Add Searcher.Doc(doc_id).get(ID) as new HashKey in hash
table and assign value = 1
else
   increment HashKey( Searcher.Doc(doc_id).get(ID)) value
with 1
}

But, it depends on HitCount. As soon as I get more hits it takes more time
and my search performance is degrade.

How it can be done with best performance..?

Any ideas?

Sawan



Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Erick Erickson

Some time ago I posted the results of using FieldSelector in my
peculiar app, and it gave dramatic improvements in my case (a
factor of about 10). I suspect much of that was peculiar to my
index design, so your mileage may vary.

See a thread titled...

*Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+*


Best
Erick

On 5/17/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


I haven't tried compression either.  I know there was some talk a
while ago about deprecating, but that hasn't happened.  The current
implementation yields the highest level of compression.  You might
find better results by compressing in your application and storing as
a binary field, thus giving you more control over CPU used.  This is
our current recommendation for dealing w/ compression.

If you are not actually displaying that field, you should look into
the FieldSelector API (via IndexReader).  It allows you to lazily
load fields or skip them all together and can yield a pretty
significant savings when it comes to loading documents.
FieldSelector is available in 2.1.

-Grant

On May 17, 2007, at 4:01 AM, Paul Elschot wrote:

> On Thursday 17 May 2007 08:10, Andreas Guther wrote:
>> I am currently exploring how to solve performance problems I
>> encounter with
>> Lucene document reads.
>>
>> We have amongst other fields one field (default) storing all
>> searchable
>> fields.  This field can become of considerable size since we are
>> indexing
>> documents and  store the content for display within results.
>>
>> I noticed that the read can be very expensive.  I wonder now if it
>> would
>> make sense to add this field as Field.Store.Compress to the
>> index.  Can
>> someone tell me if this would speed up the document read or if
>> this is
>> something only interesting for saving space.
>
> I have not tried the compression yet, but in my experience a good way
> to reduce the costs of document reads from a disk is by reading them
> in document number order whenever possible. In this way one saves
> on the disk head seeks.
> Compression should actually help reducing the costs of disk head seeks
> even more.
>
> Regards,
> Paul Elschot
>
>

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Andreas Guther

I am actually using the FieldSelector, and unless I did something wrong it
did not provide me any load performance improvement, which was surprising
and disappointing at the same time.  The only difference I could see was
when I returned NO_LOAD for all fields, which from my understanding is the
same as skipping over the document.

Right now I am looking into fragmentation problems of my huge index files.
I am de-fragmenting the hard drive to see if this brings any read
performance improvements.

I am also wondering if the FieldCache as discussed in
http://www.gossamer-threads.com/lists/lucene/general/28252 would help
improve the situation.

Andreas

On 5/17/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


I haven't tried compression either.  I know there was some talk a
while ago about deprecating, but that hasn't happened.  The current
implementation yields the highest level of compression.  You might
find better results by compressing in your application and storing as
a binary field, thus giving you more control over CPU used.  This is
our current recommendation for dealing w/ compression.

If you are not actually displaying that field, you should look into
the FieldSelector API (via IndexReader).  It allows you to lazily
load fields or skip them all together and can yield a pretty
significant savings when it comes to loading documents.
FieldSelector is available in 2.1.

-Grant

On May 17, 2007, at 4:01 AM, Paul Elschot wrote:

> On Thursday 17 May 2007 08:10, Andreas Guther wrote:
>> I am currently exploring how to solve performance problems I
>> encounter with
>> Lucene document reads.
>>
>> We have amongst other fields one field (default) storing all
>> searchable
>> fields.  This field can become of considerable size since we are
>> indexing
>> documents and  store the content for display within results.
>>
>> I noticed that the read can be very expensive.  I wonder now if it
>> would
>> make sense to add this field as Field.Store.Compress to the
>> index.  Can
>> someone tell me if this would speed up the document read or if
>> this is
>> something only interesting for saving space.
>
> I have not tried the compression yet, but in my experience a good way
> to reduce the costs of document reads from a disk is by reading them
> in document number order whenever possible. In this way one saves
> on the disk head seeks.
> Compression should actually help reducing the costs of disk head seeks
> even more.
>
> Regards,
> Paul Elschot
>
>

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







Indexing Open Office documents

2007-05-17 Thread jim shirreffs


Anyone know how to add OpenOffice documents to a Lucene index? Is there a
parser for OpenOffice?


thanks in advance

jim s. 






Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Erick Erickson

Hmmm. Now that I re-read your first mail, something else
suggests itself. You stated:

"We have amongst other fields one field (default) storing all searchable
fields".

Do you need to store this field at all? You can search fields that are
indexed but NOT stored. I've used something of the same technique
where I index lots of different fields in the same search field so my
queries aren't as complex, but return various stored fields to the
user for display purposes. Often these latter fields are stored but
NOT indexed.

It might also be useful if you'd post some of your relevant code
snippets; perhaps some innocent line is messing you up... Are you,
perhaps, calling get() in a HitCollector? Or iterating through
many documents with a Hits object? Or...

Best
Erick

On 5/17/07, Andreas Guther <[EMAIL PROTECTED]> wrote:


I am actually using the FieldSelector and unless I did something wrong it
did not provide me any load performance improvements which was surprising
to
me and disappointing at the same time.  The only difference I could see
was
when I returned for all fields a NO_LOAD which from my understanding is
the
same as skipping over the document.

Right now I am looking into fragmentation problems of my huge index files.
I am de-fragmenting the hard drive to see if this brings any read
performance improvements.

I am also wondering if the FieldCache as discussed in
http://www.gossamer-threads.com/lists/lucene/general/28252 would help
improve the situation.

Andreas

On 5/17/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> I haven't tried compression either.  I know there was some talk a
> while ago about deprecating, but that hasn't happened.  The current
> implementation yields the highest level of compression.  You might
> find better results by compressing in your application and storing as
> a binary field, thus giving you more control over CPU used.  This is
> our current recommendation for dealing w/ compression.
>
> If you are not actually displaying that field, you should look into
> the FieldSelector API (via IndexReader).  It allows you to lazily
> load fields or skip them all together and can yield a pretty
> significant savings when it comes to loading documents.
> FieldSelector is available in 2.1.
>
> -Grant
>
> On May 17, 2007, at 4:01 AM, Paul Elschot wrote:
>
> > On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> >> I am currently exploring how to solve performance problems I
> >> encounter with
> >> Lucene document reads.
> >>
> >> We have amongst other fields one field (default) storing all
> >> searchable
> >> fields.  This field can become of considerable size since we are
> >> indexing
> >> documents and  store the content for display within results.
> >>
> >> I noticed that the read can be very expensive.  I wonder now if it
> >> would
> >> make sense to add this field as Field.Store.Compress to the
> >> index.  Can
> >> someone tell me if this would speed up the document read or if
> >> this is
> >> something only interesting for saving space.
> >
> > I have not tried the compression yet, but in my experience a good way
> > to reduce the costs of document reads from a disk is by reading them
> > in document number order whenever possible. In this way one saves
> > on the disk head seeks.
> > Compression should actually help reducing the costs of disk head seeks
> > even more.
> >
> > Regards,
> > Paul Elschot
> >
> >
>
> --
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>
>
>
>
>



Re: Indexing Open Office documents

2007-05-17 Thread Enis Soztutar

There is a parser for OpenOffice in Nutch. It is a plugin called parse-oo.
You can find more information on the Nutch mailing lists.

On 5/17/07, jim shirreffs <[EMAIL PROTECTED]> wrote:



Anyone know how to add OpenOffice document to a Lucene index? Is there a
parser for OpenOffice?

thanks in advance

jim s.






Re: date window

2007-05-17 Thread Chris Hostetter
: A particular document can have several date windows.
: Give a specific date, only return those documents where that date
: falls within at least one of those windows.

: Also, note that there are multiple windows here for a single
: document, we can't just search between min start and max end.

This can theoretically be done using a custom variant of
PhraseQuery and "parallel fields", where you have a range_start field and a
range_end field and the positions of the dates in each field "line up" so
that you can tell which start corresponds with which end.

I mentioned this idea (which actually came from an offhand comment Doug
made about PhraseQuery at last year's ApacheCon US) in this Solr thread
when someone asked a similar question...

http://www.nabble.com/One-item%2C-multiple-fields%2C-and-range-queries-tf2969183.html#a8404600


...I have never tried implementing this in practice.
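Purely as a sketch of the indexing half of that idea (the query half still needs the custom PhraseQuery-style scorer described above; a whitespace-style analyzer is assumed so the Nth position in each field refers to the same window):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateWindowDoc {
    // windows[i] = {start, end}, both as yyyyMMdd strings.
    public static Document build(String[][] windows) {
        StringBuffer starts = new StringBuffer();
        StringBuffer ends = new StringBuffer();
        for (int i = 0; i < windows.length; i++) {
            starts.append(windows[i][0]).append(' ');
            ends.append(windows[i][1]).append(' ');
        }
        Document doc = new Document();
        doc.add(new Field("range_start", starts.toString(), Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("range_end", ends.toString(), Field.Store.NO, Field.Index.TOKENIZED));
        // A custom scorer would then keep a document when, for some position i,
        // range_start[i] <= date <= range_end[i].
        return doc;
    }
}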

-Hoss





Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Mike Klaas


On 17-May-07, at 6:43 AM, Andreas Guther wrote:

I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned for all fields a NO_LOAD which from my understanding is the same as skipping over the document.


Note that storing the field as binary or compressed will increase the  
speed gains from lazy loading.  If the stored field is just text,  
then lucene has to scan the characters instead of .seek()ing to a  
byte position.


-Mike




Is it possible to use a custom similarity class to cause extra terms in a field to lower score?

2007-05-17 Thread Daniel Einspanjer

If I have two items in an index:
Terminator 2
Terminator 2: Judgment Day

When I score them against the query +title:(Terminator 2),
they come up with the same score (which makes sense; it just isn't
quite what I want).

Would there be some method or combination of methods in Similarity
that I could easily override to allow me to penalize the second item
because it had "unused terms"?




Re: Is it possible to use a custom similarity class to cause extra terms in a field to lower score?

2007-05-17 Thread Chris Hostetter

: Terminator 2
: Terminator 2: Judgment Day
:
: And I score them against the query +title:(Terminator 2)

: Would there be some method or combination of methods in Similarity
: that I could easily override to allow me to penalize the second item
: because it had "unused terms"?

That's what the DefaultSimilarity does: it uses the (length) norm
information stored when the documents are indexed to know which one is a
better match (because it matches on a shorter field).

If you aren't seeing that behavior then perhaps you turned on omitNorms for
that field, or perhaps the byte encoding is making the distinction between
your various terms too small -- overriding the lengthNorm function and
reindexing might help.
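If you do go down that road, a minimal sketch of overriding lengthNorm in a custom Similarity (the "title" field name and the 1/numTerms curve are just illustrative choices, not a recommendation):

import org.apache.lucene.search.DefaultSimilarity;

// A steeper length normalization than the default 1/sqrt(numTerms), so extra terms
// pull the score down more noticeably. Norms are written at indexing time, so this
// Similarity must be set on the IndexWriter and the index rebuilt, as well as set
// on the Searcher.
public class SteepLengthNormSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        if ("title".equals(fieldName) && numTerms > 0) {
            return 1.0f / numTerms;
        }
        return super.lengthNorm(fieldName, numTerms);
    }
}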



-Hoss





Re: Is it possible to use a custom similarity class to cause extra terms in a field to lower score?

2007-05-17 Thread Daniel Einspanjer

Oops.  I do indeed have omitNorms turned on.  I will re-read the
documentation on it and look at turning it off.

Sorry for the bother. :/

On 5/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: Terminator 2
: Terminator 2: Judgment Day
:
: And I score them against the query +title:(Terminator 2)

: Would there be some method or combination of methods in Similarity
: that I could easily override to allow me to penalize the second item
: because it had "unused terms"?

That's what the DefaultSimilarity does: it uses the (length) norm
information stored when the documents are indexed to know which one is a
better match (because it matches on a shorter field).

If you aren't seeing that behavior then perhaps you turned on omitNorms for
that field, or perhaps the byte encoding is making the distinction between
your various terms too small -- overriding the lengthNorm function and
reindexing might help.



-Hoss









Re: How to ignore scoring for a Query?

2007-05-17 Thread Otis Gospodnetic
Scoring cannot be turned off, currently.  I once thought it was possible to skip
scoring with the patch in the LUCENE-584 JIRA issue, but I was wrong.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Benjamin Pasero <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, May 17, 2007 8:49:15 AM
Subject: How to ignore scoring for a Query?

Hi,

I have two different use-cases for my queries. For the first,
performance is not too critical
and I want to sort the results by relevance (score). The second however,
is performance critical,
but the score for each result is not interesting. I guess, if it was
possible to disable scoring
for the query, I could improve performance (note that omitNorms on a
Field is not an option, due
to the first use case).

Is there a straightforward way to disable scoring for a query (its a
BooleanQuery btw with some
clauses, which can be any other query).

Thanks,
Ben










Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Otis Gospodnetic
- Original Message 
From: Paul Elschot <[EMAIL PROTECTED]>

On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> I am currently exploring how to solve performance problems I encounter with
> Lucene document reads.
> 
> We have amongst other fields one field (default) storing all searchable
> fields.  This field can become of considerable size since we are  indexing
> documents and  store the content for display within results.
> 
> I noticed that the read can be very expensive.  I wonder now if it would
> make sense to add this field as Field.Store.Compress to the index.  Can
> someone tell me if this would speed up the document read or if this is
> something only interesting for saving space.

I have not tried the compression yet, but in my experience a good way
to reduce the costs of document reads from a disk is by reading them
in document number order whenever possible. In this way one saves
on the disk head seeks.
Compression should actually help reducing the costs of disk head seeks
even more.

OG: Does this really help in a multi-user environment where there are multiple 
parallel queries hitting the index and reading data from all over the index and 
the disk?  They will all share the same disk head, so the head will still have 
to jump around to service all these requests, even if each request is being 
careful to read documents in docId order, no?

Otis








Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Otis Gospodnetic
Hi Steve,

You said the OOM happens only when you are indexing.  You don't need 
LuceneIndexAccess for that, so get rid of that to avoid one suspect that is not 
part of Lucene core.  What is your maxBufferedDocs set to?  And since you are 
using JVM 1.6, check out jmap, jconsole & friends, they'll provide insight into 
where your OOM is coming from.  I see your app is a webapp.  How do you know 
it's Lucene and its indexing that are the source of OOM and not something else, 
such as a bug in Tomcat?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Stephen Gray <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, May 15, 2007 2:31:05 AM
Subject: Memory leak (JVM 1.6 only)

Hi everyone,

I have an application that indexes/searches xml documents using Lucene. 
I'm having a problem with what looks like a memory leak, which occurs 
when indexing a large number of documents, but only when the application 
is running under JVM 1.6. Under JVM 1.5 there is no problem. What 
happens is that the memory allocated consistently rises during indexing 
until the JVM crashes with an OutOfMemory exception.

I'm using Lucene 2.1, and am using Maik Schreiber's LuceneIndexAccess 
API, which hands out references to cached IndexWriter/Reader/Searchers 
to objects that need to use them, and handles closing and re-opening 
IndexSearchers after documents are added to the index. The application 
is running under Tomcat 6.

I'm a bit out of my depth determining the source of the leak - I've 
tried using Netbeans profiler, which shows a large number of HashMap 
instances that survive a long time, but these are created by many 
different classes so it's difficult to pinpoint one source.

Has anyone found similar problems with Lucene indexing operations 
running under JVM 1.6? Does anyone have any suggestions re how to deal 
with this?

Any help much appreciated.

Thanks,
Steve









Re: How can I limit the number of hits in my query?

2007-05-17 Thread David Leangen

Thank you, Erick, this is very useful!

Have you ever taken a look at Google Suggest[1]? It's very fast, and the
results are impressive. I think your suggestion will go a long way to
fixing my problem, but there's probably still quite a gap between this
approach and the kind of results that Google Suggest provides.

I wonder how it could be possible to do the same with Lucene...

Anyway, thanks a lot for the help!


[1] http://www.google.com/webhp?complete=1&hl=en
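In case it helps anyone following along, here is a minimal sketch of the term-enumeration idea Erick describes below, using IndexReader.terms() directly (the field name and cutoff are placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class PrefixSuggester {
    // Walks the term dictionary in lexical order starting at the prefix and stops
    // after maxSuggestions terms, or when the prefix no longer matches.
    public static List suggest(IndexReader reader, String field, String prefix, int maxSuggestions)
            throws IOException {
        List suggestions = new ArrayList();
        TermEnum terms = reader.terms(new Term(field, prefix));   // positioned at first term >= prefix
        try {
            do {
                Term t = terms.term();
                if (t == null || !field.equals(t.field()) || !t.text().startsWith(prefix)) {
                    break;
                }
                suggestions.add(t.text());
            } while (suggestions.size() < maxSuggestions && terms.next());
        } finally {
            terms.close();
        }
        return suggestions;
    }
}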


On Tue, 2007-05-15 at 12:46 -0400, Erick Erickson wrote:
> OK, I'm going to go on the assumption that all you're interested in
> is auto completion, so don't try to generalized this to queries..
> 
> Don't use queries, PrefixQuery or otherwise. Use one of the 
> TermEnums, probably WildcardTermEnum. What that will do is 
> allow you to find all the terms, in lexical order, that match 
> your fragment without using queries at all. This has several 
> advantages 
> 
> 1> it's fast. It doesn't require you to do anything but march down
>  some index list.
> 2> it doesn't expand the terms prior to processing. No such thing
> as "TooManyClauses". Perhaps OutOfMemory if you try to 
> return 100,000,000 terms 
> 3> you can stop whenever you've accumulated "enough" terms,
>  where "enough" is up to you.
> 
> NOTE: there's also a RegexTermEnum in the 
> contrib section (last I knew, it may be in core in 2.1) that 
> allows arbitrary regex enumerations, but it's significantly
> slower than WildcardTermEnum, which is hardly surprising
> since it has to do more work...
> 
> It's reasonable to ask how much use auto-completion is when the
> poor user has, say, 10,000 terms to choose from, so I think it's
> entirely reasonable to get the first, say, 100 terms and quit. You 
> should be able to do something like this quite easily with the Enums.
> 
> I think your original solution of using queries would not be
> satisfactory for the user anyway, *assuming* that the user is
> as impatient as I am and wants some auto-complete options 
> RIGHT NOW , even if you solved the TooManyClauses
> issue.
> 
> Along the same lines, another question is whether you should
> try to auto-complete when the user has typed less than, say,
> 3 characters, but that's your design decision. 
> 
> Really, try the WildcardTermEnum. It's pretty neat.
> 
> Hope this helps
> Erick
> 
> On 5/14/07, David Leangen <[EMAIL PROTECTED]> wrote:
> 
> Thank you very much for this. Some more questions inline... 
> 
> 
> >
> > - How can I limit the number of hits? I don't know
> in
> >advance what the data will be, so it's not
> feasible for
> >me to use RangeQuery.
> >
> >
> > You can use a TopDocs or a HitCollector object which allows
> you
> > to process each object as it's hit. But I doubt you need to
> do this.
> 
> > No.  I expect you're using a wildcard, and wildcard handling
> is 
> > complicated.
> 
> 
> Ok, you're right. It's not the limiting of the results that's
> the
> problem, it's the way the search is expanded.
> 
> Since this is an autocomplete, when the user types, for
> example "a" or a 
> Japanese character "あ", I am using PrefixFilter for this, so
> I guess
> the search turns into "a*" and "あ*" respectively.
> 
> In the archive, the related posts I read either refer to a
> DateRange 
> (where it is possible to search first by year, then month...
> etc.), or
> they suggest to increase the max count.
> 
> Neither of these solutions work in my case... It's not a date,
> and I
> have no idea of the results in advance and it would not be
> practical or 
> elegant to speculate on the results (for example first try
> aa*~ab* and
> see what that gives, etc.).
> 
> I can get access to the "weight" values of the terms (a data
> field
> determined by their frequency of use), so I'll try something
> related to 
> that. For people with more experience, would that be a good
> path to
> take?
> 
> Otherwise, would a reasonable solution be to override or
> re-implement
> PrefixFilter?
> 
> 
> Thank you so much!
> David
> 
> 
> 
> 





Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Andreas Guther

I found a similar recommendation about disk access and reading in order
in the following message and implemented it in my code:
http://www.gossamer-threads.com/lists/lucene/general/28268#28268

Since I am dealing with multiple index directories, I sorted the document
references by index number and then by doc id.  This actually improved the
read access and reduced the read time to 50% or less.
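For reference, a sketch of that sort order (DocRef is an assumed little holder for an (index number, doc id) pair, not a Lucene class):

import java.util.Comparator;

public class DocRef {
    final int indexNo;   // which index directory the document lives in
    final int docId;     // document number within that index
    DocRef(int indexNo, int docId) { this.indexNo = indexNo; this.docId = docId; }

    // Sort by index directory first, then by document number, so each index's
    // stored-fields file is read front to back.
    static final Comparator BY_INDEX_THEN_DOC = new Comparator() {
        public int compare(Object a, Object b) {
            DocRef x = (DocRef) a, y = (DocRef) b;
            return x.indexNo != y.indexNo ? x.indexNo - y.indexNo : x.docId - y.docId;
        }
    };
}

// Usage: Arrays.sort(refs, DocRef.BY_INDEX_THEN_DOC);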

I think this is a very interesting performance improvement tip that should
have its place in the FAQ.

Thanks for the input.

Andreas


On 5/17/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


- Original Message 
From: Paul Elschot <[EMAIL PROTECTED]>

On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> I am currently exploring how to solve performance problems I encounter
with
> Lucene document reads.
>
> We have amongst other fields one field (default) storing all searchable
> fields.  This field can become of considerable size since we
are  indexing
> documents and  store the content for display within results.
>
> I noticed that the read can be very expensive.  I wonder now if it would
> make sense to add this field as Field.Store.Compress to the index.  Can
> someone tell me if this would speed up the document read or if this is
> something only interesting for saving space.

I have not tried the compression yet, but in my experience a good way
to reduce the costs of document reads from a disk is by reading them
in document number order whenever possible. In this way one saves
on the disk head seeks.
Compression should actually help reducing the costs of disk head seeks
even more.

OG: Does this really help in a multi-user environment where there are
multiple parallel queries hitting the index and reading data from all over
the index and the disk?  They will all share the same disk head, so the head
will still have to jump around to service all these requests, even if each
request is being careful to read documents in docId order, no?

Otis









Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Stephen Gray

Hi Otis,

Thanks very much for your reply.

I've removed the LuceneIndexAccessor code, and still have the same 
problem, so that at least rules out LuceneIndexAccessor as the source. 
maxBufferedDocs is just set to the default, which I believe is 10.


I've tried jconsole, + jmap/jhat for looking at the contents of the 
heap. One interesting thing is that although the memory allocated as 
reported by the processes tab of Windows Task Manager goes up and up, 
and the JVM eventually crashes with an OutOfMemory error, the total size 
of heap + non-heap as reported by jconsole is constant and much lower 
than the Windows-reported allocated memory. I've also tried Netbeans 
profiler, which suggests that the variables in the heap that are 
continually surviving garbage collection do not all originate from one 
class.


I can't definitely rule out Tomcat. Clearly something is interacting 
with a change in JVM 1.6 and causing the problem. The fact that it only 
occurs during indexing not searching suggested that it might be related 
to the indexing code rather than Tomcat. It's much more likely that it's 
my code than Lucene, but I can't see anything in my code though I'm 
definitely no expert on memory leaks. All the variables created during 
indexing except IndexReader and Searcher instances are local to my 
addDocument function so should be garbage collected after each document 
is added. I did wonder if it might be related to SnowballAnalyzer as 
quite a few long lived variables in the heap were created by this - but 
then the heap is not increasing.


Regards,
Steve

Otis Gospodnetic wrote:

Hi Steve,

You said the OOM happens only when you are indexing.  You don't need 
LuceneIndexAccess for that, so get rid of that to avoid one suspect that is not 
part of Lucene core.  What is your maxBufferedDocs set to?  And since you are using 
JVM 1.6, check out jmap, jconsole & friends, they'll provide insight into where 
your OOM is coming from.  I see your app is a webapp.  How do you know it's Lucene 
and its indexing that are the source of OOM and not something else, such as a bug 
in Tomcat?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Stephen Gray <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, May 15, 2007 2:31:05 AM
Subject: Memory leak (JVM 1.6 only)

Hi everyone,

I have an application that indexes/searches xml documents using Lucene. 
I'm having a problem with what looks like a memory leak, which occurs 
when indexing a large number of documents, but only when the application 
is running under JVM 1.6. Under JVM 1.5 there is no problem. What 
happens is that the memory allocated consistently rises during indexing 
until the JVM crashes with an OutOfMemory exception.


I'm using Lucene 2.1, and am using Maik Schreiber's LuceneIndexAccess 
API, which hands out references to cached IndexWriter/Reader/Searchers 
to objects that need to use them, and handles closing and re-opening 
IndexSearchers after documents are added to the index. The application 
is running under Tomcat 6.


I'm a bit out of my depth determining the source of the leak - I've 
tried using Netbeans profiler, which shows a large number of HashMap 
instances that survive a long time, but these are created by many 
different classes so it's difficult to pinpoint one source.


Has anyone found similar problems with Lucene indexing operations 
running under JVM 1.6? Does anyone have any suggestions re how to deal 
with this?


Any help much appreciated.

Thanks,
Steve







  



--
Stephen Gray
Archive IT Officer
Australian Social Science Data Archive
18 Balmain Crescent (Building #66)
The Australian National University
Canberra ACT 0200

Phone +61 2 6125 2185
Fax +61 2 6125 0627
Web http://assda.anu.edu.au/




Re: snowball (english) and filenames

2007-05-17 Thread Arnold Leung


On 16-May-07, at 11:00 PM, Doron Cohen wrote:


If you enter a.b.c.d.e.f.g.h to that demo you'll see that
the demo simply breaks the input text on '.' - that has
nothing to do with filenames.


That is not what I am seeing from my testing:

a.b.c.d.e.f.g.h is not broken apart the way the snowball demo
indicates it should be.


At http://snowball.tartarus.org/demo.php

"a.b.c.d.e.f.g.h" shows:

a -> a
b -> b
c -> c
d -> d
e -> e
f -> f
g -> g
h -> h

For my lucene testing, I indexed one text file with one   
"a.b.c.d.e.f.g.h" string in it and opened the index up using Luke.   
It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the  
string based on the periods).



As a real world example, Logon.dll is being converted to "Logon.dl"  
rather than "Logon" and "dll" as indicated by the snowball demo.


Also:

Demo:
some-msp.msp

somemsp -> somemsp
msp -> msp

Lucene:
some-msp.msp

some
msp.msp





Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Doron Cohen
Stephen Gray <[EMAIL PROTECTED]> wrote on 17/05/2007 22:40:01:

> One interesting thing is that although the memory allocated as
> reported by the processes tab of Windows Task Manager goes up and up,
> and the JVM eventually crashes with an OutOfMemory error, the total size
> of heap + non-heap as reported by jconsole is constant and much lower
> than the Windows-reported allocated memory. I've also tried Netbeans
> profiler, which suggests that the variables in the heap that are
> continually surviving garbage collection do not all originate from one
> class.

Smells like a native memory leak? Can jconsole/jmap/jhat monitor native memory?
I once spent some time on what finally turned out to be a GZipOutputStream native
memory usage/leak. Moving from Java 1.5 to 1.6 could expose such a problem...





Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Stephen Gray
Thanks. If the extra memory allocated is native memory I don't think 
jconsole includes it in "non-heap" as it doesn't show this as 
increasing, and jmap/jhat just dump/analyse the heap. Do you know of an 
application that can report native memory usage?


Thanks,
Steve

Doron Cohen wrote:

Stephen Gray <[EMAIL PROTECTED]> wrote on 17/05/2007 22:40:01:

  

One interesting thing is that although the memory allocated as
reported by the processes tab of Windows Task Manager goes up and up,
and the JVM eventually crashes with an OutOfMemory error, the total size
of heap + non-heap as reported by jconsole is constant and much lower
than the Windows-reported allocated memory. I've also tried Netbeans
profiler, which suggests that the variables in the heap that are
continually surviving garbage collection do not all originate from one
class.



Smells like native memory leak? Can jconsole/jmap/jhat monitor native mem?
I once spent some time on what finally was a GZipOutputStream native mem
usage/leak. Moving from Java 1.5 to 1.6 could expose such problem...



  



--
Stephen Gray
Archive IT Officer
Australian Social Science Data Archive
18 Balmain Crescent (Building #66)
The Australian National University
Canberra ACT 0200

Phone +61 2 6125 2185
Fax +61 2 6125 0627
Web http://assda.anu.edu.au/




Re: snowball (english) and filenames

2007-05-17 Thread Doron Cohen
> a.b.c.d.e.f.g.h is not broken apart like how the snowball demo
> indicates it should do.

I am not sure about the "should" here - the way I see it, this
is just how the demo works: Snowball stemmers operate on words,
so the demo first breaks the input text into words and only
then applies stemming.

> For my lucene testing, I indexed one text file with one
> "a.b.c.d.e.f.g.h" string in it and opened the index up using Luke.
> It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the
> string based on the periods).

In Lucene, the way text is "broken" into words is up to the
application - it depends on the analyzer being used.
WhitespaceAnalyzer would break on whitespace; StandardAnalyzer
would do more sophisticated work. Analyzers are extensible,
so you could modify their behavior. The wiki page
"AnalysisParalysis" has some relevant info.

Using Lucene's SimpleAnalyzer, by the way, would break "a.b.c" into
"a b c", which seems to be what you are looking for?

HTH,
Doron

