RE: fetching similar wordlist as given word

2004-11-23 Thread Chuck Williams
Lucene does support stemming, but that is not what your example requires
(stemming equates "roaming", "roam", "roamed", etc.).  For stemming,
look at PorterStemFilter or better, the Snowball stemmers in the
sandbox.  For your similar word list, I think you are looking for the
class FuzzyTermEnum.  This should give you the terms you need, although
perhaps only those with a common prefix of a specified length.
Otherwise, you could develop your own algorithm to look for similar
terms in the index.
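
For example, a minimal sketch (the field name "contents" and the index path
are illustrative, not from your setup):

    IndexReader reader = IndexReader.open("/path/to/index");
    FuzzyTermEnum fuzzyTerms = new FuzzyTermEnum(reader, new Term("contents", "roam"));
    List suggestions = new ArrayList();
    try {
        do {
            Term t = fuzzyTerms.term();      // current similar term, or null
            if (t != null) {
                suggestions.add(t.text());   // e.g. "foam", "roams", ...
            }
        } while (fuzzyTerms.next());
    } finally {
        fuzzyTerms.close();
        reader.close();
    }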

Chuck

  > -----Original Message-----
  > From: Santosh [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, November 23, 2004 11:15 PM
  > To: Lucene Users List
  > Subject: fetching similar wordlist as given word
  > 
  > Can Lucene do stemming?
  > If I search for "roam", I know that a fuzzy query can give a result for
  > "foam". But my requirement is: if I search for "roam", can I get the
  > list of similar words as output, so that I can show the end user, in a
  > column, a suggestion like --- do you mean "foam"?
  > How can I get a similar word list from the given content?






Re: modifying existing index

2004-11-23 Thread Cheolgoo Kang
On Wed, 24 Nov 2004 13:04:20 +0530, Santosh <[EMAIL PROTECTED]> wrote:
> I have gone through IndexReader and found the method delete(int docNum),
> but where do I get the document number from? Is this predefined, or do we
> have to assign a number prior to indexing?

The number (aka doc-id) is assigned by Lucene; it's an internal sequential
integer. It is usually retrieved via Hits.id(int) from your search results.

Hits myHits = myIndexSearcher.search( myQuery );
for ( int i = 0; i < myHits.length(); i++ ) {
    int docId = myHits.id( i );       // Lucene's internal document number
    // e.g. myIndexReader.delete( docId );
}
> 
> - Original Message -
> From: "Luke Francl" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, November 24, 2004 1:26 AM
> Subject: Re: modifying existing index
> 
> > On Tue, 2004-11-23 at 13:59, Santosh wrote:
> > > I am using Lucene for indexing. When I create the index, the documents
> are added. But when I modify a single existing document and reindex it, it
> is treated as a new document and added one more time, so that I am getting
> the same document twice in the results.
> > > To overcome this I am deleting the existing index and again recreating
> the whole index. But is it possible to index the modified document again and
> overwrite the existing document without deleting and recreating? Can I do
> this? If so, how?
> >
> > You do not need to recreate the whole index. Just mark the document as
> > deleted using the IndexReader and then add it again with the
> > IndexWriter. Remember to close your IndexReader and IndexWriter after
> > doing this.
> >
> > The deleted document will be removed the next time you optimize your
> > index.
> >
> > Luke Francl
> >
> 


-- 
Cheolgoo, Kang




RE: modifying existing index

2004-11-23 Thread Chuck Williams
A good way to do this is to add a keyword field containing whatever unique id
you have for the document.  Then you can delete the old version from the
index by deleting on the term that holds its unique id (look at
IndexReader.delete(Term)).  For an example of incremental indexing done this
way, look at the demo class IndexHTML.
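
For instance, a minimal sketch of the delete-then-re-add cycle (the field
name "uid" and the index path are illustrative):

    // Delete the stale copy by its unique-id term, then re-add the new one.
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.delete(new Term("uid", uniqueId));   // no-op if the id isn't indexed yet
    reader.close();

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.addDocument(updatedDoc);             // carries the same "uid" keyword field
    writer.close();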

Chuck

  > -----Original Message-----
  > From: Santosh [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, November 23, 2004 11:34 PM
  > To: Lucene Users List
  > Subject: Re: modifying existing index
  > 
  > I have gone through IndexReader and found the method delete(int docNum),
  > but where do I get the document number from? Is this predefined, or do
  > we have to assign a number prior to indexing?
  > - Original Message -
  > From: "Luke Francl" <[EMAIL PROTECTED]>
  > To: "Lucene Users List" <[EMAIL PROTECTED]>
  > Sent: Wednesday, November 24, 2004 1:26 AM
  > Subject: Re: modifying existing index
  > 
  > 
  > > On Tue, 2004-11-23 at 13:59, Santosh wrote:
  > > > I am using Lucene for indexing. When I create the index, the
  > documents are added. But when I modify a single existing document and
  > reindex it, it is treated as a new document and added one more time,
  > so that I am getting the same document twice in the results.
  > > > To overcome this I am deleting the existing index and again
  > recreating the whole index. But is it possible to index the modified
  > document again and overwrite the existing document without deleting
  > and recreating? Can I do this? If so, how?
  > >
  > > You do not need to recreate the whole index. Just mark the document
  > > as deleted using the IndexReader and then add it again with the
  > > IndexWriter. Remember to close your IndexReader and IndexWriter
  > > after doing this.
  > >
  > > The deleted document will be removed the next time you optimize
  > > your index.
  > >
  > > Luke Francl





Re: Help on the Query Parser

2004-11-23 Thread Morus Walter
Terence Lai writes:
> 
> Looks like the wildcard query disappeared. In fact, I am expecting 
> text:"java* developer" to be returned. It seems to me that the QueryParser 
> cannot handle the wildcard within a quoted String.
> 
That's not just QueryParser. 
Lucene itself doesn't handle wildcards within phrases.
You could have a query text:"java* developer" if '*' isn't removed by the 
analyzer, but it would only search for the literal token 'java*', not any 
expansion of it. I guess that is not what you want.
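
If you do need the expansion, one workaround is to enumerate the matching
terms yourself and feed them to PhrasePrefixQuery. A hedged sketch (the field
name and prefix come from the example; everything else is illustrative):

    // Expand "java*" by hand and search the expansions as a phrase.
    IndexReader reader = IndexReader.open("/path/to/index");
    PhrasePrefixQuery phrase = new PhrasePrefixQuery();
    List expansions = new ArrayList();

    TermEnum termEnum = reader.terms(new Term("text", "java"));
    try {
        do {
            Term t = termEnum.term();
            if (t == null || !"text".equals(t.field()) || !t.text().startsWith("java"))
                break;                     // past the "java" prefix range
            expansions.add(t);
        } while (termEnum.next());
    } finally {
        termEnum.close();
    }

    phrase.add((Term[]) expansions.toArray(new Term[0])); // any "java*" term ...
    phrase.add(new Term("text", "developer"));            // ... followed by "developer"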

Morus




Re: modifying existing index

2004-11-23 Thread Santosh
I have gone through IndexReader and found the method delete(int docNum),
but where do I get the document number from? Is this predefined, or do we
have to assign a number prior to indexing?
- Original Message -
From: "Luke Francl" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 24, 2004 1:26 AM
Subject: Re: modifying existing index


> On Tue, 2004-11-23 at 13:59, Santosh wrote:
> > I am using Lucene for indexing. When I create the index, the documents
are added. But when I modify a single existing document and reindex it, it
is treated as a new document and added one more time, so that I am getting
the same document twice in the results.
> > To overcome this I am deleting the existing index and again recreating the
whole index. But is it possible to index the modified document again and
overwrite the existing document without deleting and recreating? Can I do
this? If so, how?
>
> You do not need to recreate the whole index. Just mark the document as
> deleted using the IndexReader and then add it again with the
> IndexWriter. Remember to close your IndexReader and IndexWriter after
> doing this.
>
> The deleted document will be removed the next time you optimize your
> index.
>
> Luke Francl
>
>





fetching similar wordlist as given word

2004-11-23 Thread Santosh
Can Lucene do stemming?
If I search for "roam", I know that a fuzzy query can give a result for 
"foam". But my requirement is: if I search for "roam", can I get the list of 
similar words as output, so that I can show the end user, in a column, a 
suggestion like --- do you mean "foam"?
How can I get a similar word list from the given content?








MERGERINDEX + SOLUTION

2004-11-23 Thread Karthik N S

Hi Guys

Apologies



I have a MERGERINDEX [ merged from 1000 sub-indexes ].


The question is:

Does somebody have any solution for repairing the merged index [ in
case of corruption ]?

If so, please let the forum know, so that developers like us can use
the same approach.


Thx in Advance





  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







Help on the Query Parser

2004-11-23 Thread Terence Lai
Hi all,

I am trying to use the QueryParser.parse() to parse a query string like "java* 
developer". Note that I want the wildcard string, java*, followed by the word 
developer. The following is the code.

-
String qryStr = "\"java* developer\"";
String fieldname = "text";
StandardAnalyzer analyzer = new StandardAnalyzer();

Query qry = org.apache.lucene.queryParser.QueryParser.parse(qryStr, fieldname, 
analyzer);
-

When I call qry.toString() to print out the parsed query, I get the following 
output:

-
text:"java developer"
-

Looks like the wildcard query disappeared. In fact, I am expecting 
text:"java* developer" to be returned. It seems to me that the QueryParser 
cannot handle the wildcard within a quoted String.

Does anyone have a solution for this? Am I missing something in the code?

Thanks,
Terence









RE: lucene Scorers

2004-11-23 Thread Chuck Williams
Hi Ken,

I'm glad our replies were helpful.  It sounds like you looked at the
code in MaxDisjunctionQuery, so you probably noticed that it also
implements skipTo().  Your suggestion sounds like a good thing to do.  I
thought about that when writing MaxDisjunctionQuery, but didn't need the
generality, and it does make the code more complex.  I think Lucene
needs one of these mechanisms in it, at least to solve the problems
associated with the current default use of BooleanQuery for multiple
field expansions.  Your proposal would generalize this to solve
additional cases where different accrual operators are appropriate.

You could write and submit the generalization, although there are no
guarantees anybody would do anything with it.  I didn't get anywhere in
my attempt to submit MaxDisjunctionQuery.  I think there is also a
serious problem in scoring with the current score normalization (it does
not provide meaningfully comparable scores across different searches,
which means that absolute score numbers like 0.8 have no intrinsic
meaning concerning how good a result is or is not).  When I finally get
back to tuning search in my app, that's the next one I'll try a
submission on.

Chuck

  > -----Original Message-----
  > From: Ken McCracken [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, November 23, 2004 4:31 PM
  > To: Lucene Users List
  > Subject: Re: lucene Scorers
  > 
  > Hi,
  > 
  > Thanks for the pointers in your replies.  Would it be possible to
  > include some sort of accrual scorer interface somewhere in the Lucene
  > Query APIs?  This could be passed into a query similar to
  > MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
  > according to the implementor's discretion, to compute the overall
  > score for a document.
  > 
  > -Ken
  > 
  > On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot
  > <[EMAIL PROTECTED]> wrote:
  > > On Friday 12 November 2004 22:56, Chuck Williams wrote:
  > >
  > > > I had a similar need and wrote MaxDisjunctionQuery and
  > > > MaxDisjunctionScorer.  Unfortunately these are not available as a
  > > > patch, but I've included the original message below that has the
  > > > code (modulo line breaks added by simple text email format).
  > > >
  > > > This code is functional -- I use it in my app.  It is optimized
  > > > for its stated use, which involves a small number of clauses.
  > > > You'd want to improve the incremental sorting (e.g., using the
  > > > bucket technique of BooleanQuery) if you need it for large
  > > > numbers of clauses.
  > >
  > > When you're interested, you can also have a look here for
  > > yet another DisjunctionScorer:
  > > http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
  > >
  > > It has the advantage that it implements skipTo() so that it can
  > > be used as a subscorer of ConjunctionScorer, ie. it can be
  > > faster in situations like this:
  > >
  > > aa AND (bb OR cc)
  > >
  > > where bb and cc are treated by the DisjunctionScorer.
  > > When aa is a filter this can also be used to implement
  > > a filtering query.
  > >
  > > > Re. Paul's suggested steps below, I did not integrate this with
  > > > query parser as I didn't need that functionality (since I'm
  > > > generating the multi-field expansions for which max is a much
  > > > better scoring choice than sum).
  > > >
  > > > Chuck
  > > >
  > > > Included message:
  > > >
  > > > -----Original Message-----
  > > > From: Chuck Williams [mailto:[EMAIL PROTECTED]
  > > > Sent: Monday, October 11, 2004 9:55 PM
  > > > To: [EMAIL PROTECTED]
  > > > Subject: Contribution: better multi-field searching
  > > >
  > > > The files included below (MaxDisjunctionQuery.java and
  > > > MaxDisjunctionScorer.java) provide a new mechanism for searching
  > > > across multiple fields.
  > >
  > > The maximum indeed works well, also when the fields differ a lot in
  > > length.
  > >
  > > Regards,
  > > Paul





Re: retrieving added document

2004-11-23 Thread Cheolgoo Kang
On Tue, 23 Nov 2004 22:47:21 +0100, Paul <[EMAIL PROTECTED]> wrote:
> Hi,
> I'm creating a document and adding it with a writer to the index. For
> some reason I need to add data to this specific document later on
> (minutes, not hours or days). Is it possible to retrieve it and add
> additional data?

No, you cannot add additional data to (or modify) a previously added document.
It's easy to delete the old one from the index and add a new document with
additional data included.

> I found the document(int n) method in IndexReader (btw: the
> description makes no sense to me: "Returns the stored fields of the
> nth Document in this index." - but it returns a Document and not a
> list of fields..), but where do I get that number from? (and the
> numbers change, I know..)

Usually you search using IndexSearcher, and its resulting Hits has the doc-id
(the number) within that index. And the Document contains the list of
(stored) fields.

> 
> thanks for any help
> 
> Paul
> 


-- 
Cheolgoo, Kang




Re: lucene Scorers

2004-11-23 Thread Ken McCracken
Hi,

Thanks for the pointers in your replies.  Would it be possible to include
some sort of accrual scorer interface somewhere in the Lucene Query
APIs?  This could be passed into a query similar to
MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
according to the implementor's discretion, to compute the overall
score for a document.

-Ken

On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Friday 12 November 2004 22:56, Chuck Williams wrote:
> 
> 
> > I had a similar need and wrote MaxDisjunctionQuery and
> > MaxDisjunctionScorer.  Unfortunately these are not available as a patch
> > but I've included the original message below that has the code (modulo
> > line breaks added by simple text email format).
> >
> > This code is functional -- I use it in my app.  It is optimized for its
> > stated use, which involves a small number of clauses.  You'd want to
> > improve the incremental sorting (e.g., using the bucket technique of
> > BooleanQuery) if you need it for large numbers of clauses.
> 
> When you're interested, you can also have a look here for
> yet another DisjunctionScorer:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
> 
> It has the advantage that it implements skipTo() so that it can
> be used as a subscorer of ConjunctionScorer, ie. it can be
> faster in situations like this:
> 
> aa AND (bb OR cc)
> 
> where bb and cc are treated by the DisjunctionScorer.
> When aa is a filter this can also be used to implement
> a filtering query.
> 
> 
> 
> 
> > Re. Paul's suggested steps below, I did not integrate this with query
> > parser as I didn't need that functionality (since I'm generating the
> > multi-field expansions for which max is a much better scoring choice
> > than sum).
> >
> > Chuck
> >
> > Included message:
> >
> > -----Original Message-----
> > From: Chuck Williams [mailto:[EMAIL PROTECTED]
> > Sent: Monday, October 11, 2004 9:55 PM
> > To: [EMAIL PROTECTED]
> > Subject: Contribution: better multi-field searching
> >
> > The files included below (MaxDisjunctionQuery.java and
> > MaxDisjunctionScorer.java) provide a new mechanism for searching across
> > multiple fields.
> 
> The maximum indeed works well, also when the fields differ a lot in length.
> 
> Regards,
> Paul
> 
> 
> 
> 




Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Erik Hatcher
On Nov 23, 2004, at 6:02 PM, Kevin A. Burton wrote:
Erik Hatcher wrote:
Also, there is a DBDirectory in the sandbox to store a Lucene index 
inside Berkeley DB.
I assume this would prevent prefix queries from working...
Huh?  Why would you assume that?  As far as I know, and I've tested 
this some, a Lucene index inside Berkeley DB works the same as if it 
had been in RAM or on the filesystem.

Erik


Re: URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Thanks Chuck! I missed the call: getIndexOffset.
I am profiling it again to pin-point where the performance problem is.

-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Are you sure you have a performance problem with
> TermInfosReader.get(Term)?  It looks to me like it scans sequentially
> only within a small buffer window (of size
> SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
> See TermInfosReader.getIndexOffset(Term).
> 
> Chuck
> 
> 
> 
>  > -----Original Message-----
>  > From: John Wang [mailto:[EMAIL PROTECTED]
>  > Sent: Tuesday, November 23, 2004 3:38 PM
>  > To: [EMAIL PROTECTED]
>  > Subject: URGENT: Help indexing large document set
>  >
>  > Hi:
>  >
>  >    I am trying to index 1M documents, with batches of 500 documents.
>  >
>  >    Each document has a unique text key, which is added as a
>  > Field.Keyword(name, value).
>  >
>  >    For each batch of 500, I need to make sure I am not adding a
>  > document with a key that is already in the current index.
>  >
>  >    To do this, I am calling IndexSearcher.docFreq for each document
>  > and delete the document currently in the index with the same key:
>  >
>  >    while (keyIter.hasNext()) {
>  >        String objectID = (String) keyIter.next();
>  >        term = new Term("key", objectID);
>  >        int count = localSearcher.docFreq(term);
>  >
>  >        if (count != 0) {
>  >            localReader.delete(term);
>  >        }
>  >    }
>  >
>  > Then I proceed with adding the documents.
>  >
>  > This turns out to be extremely expensive. I looked into the code and
>  > I see in TermInfosReader.get(Term term) that it is doing a linear
>  > lookup for each term. So as the index grows, the above operation
>  > degrades at a linear rate. So for each commit, we are doing a docFreq
>  > for 500 documents.
>  >
>  > I also tried to create a BooleanQuery composed of 500 TermQueries
>  > and do 1 search for each batch, and the performance didn't get
>  > better. And if the batch size increases to say 50,000, creating a
>  > BooleanQuery composed of 50,000 TermQuery instances may introduce
>  > huge memory costs.
>  >
>  > Is there a better way to do this?
>  >
>  > Can TermInfosReader.get(Term term) be optimized to do a binary
>  > lookup instead of a linear walk? Of course that depends on whether
>  > the terms are stored in sorted order. Are they?
>  >
>  > This is very urgent, thanks in advance for all your help.
>  >
>  > -John




RE: URGENT: Help indexing large document set

2004-11-23 Thread Chuck Williams
Are you sure you have a performance problem with
TermInfosReader.get(Term)?  It looks to me like it scans sequentially
only within a small buffer window (of size
SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
See TermInfosReader.getIndexOffset(Term).
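
As an aside, the docFreq() pre-check can be dropped entirely, since
IndexReader.delete(Term) returns the number of documents it deleted.  A
minimal sketch:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        // a return of 0 just means the key was not in the index yet
        int deleted = localReader.delete(new Term("key", objectID));
    }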

Chuck

  > -----Original Message-----
  > From: John Wang [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, November 23, 2004 3:38 PM
  > To: [EMAIL PROTECTED]
  > Subject: URGENT: Help indexing large document set
  > 
  > Hi:
  > 
  >    I am trying to index 1M documents, with batches of 500 documents.
  > 
  >    Each document has a unique text key, which is added as a
  > Field.Keyword(name, value).
  > 
  >    For each batch of 500, I need to make sure I am not adding a
  > document with a key that is already in the current index.
  > 
  >    To do this, I am calling IndexSearcher.docFreq for each document
  > and delete the document currently in the index with the same key:
  > 
  >    while (keyIter.hasNext()) {
  >        String objectID = (String) keyIter.next();
  >        term = new Term("key", objectID);
  >        int count = localSearcher.docFreq(term);
  > 
  >        if (count != 0) {
  >            localReader.delete(term);
  >        }
  >    }
  > 
  > Then I proceed with adding the documents.
  > 
  > This turns out to be extremely expensive. I looked into the code and
  > I see in TermInfosReader.get(Term term) that it is doing a linear
  > lookup for each term. So as the index grows, the above operation
  > degrades at a linear rate. So for each commit, we are doing a docFreq
  > for 500 documents.
  > 
  > I also tried to create a BooleanQuery composed of 500 TermQueries
  > and do 1 search for each batch, and the performance didn't get
  > better. And if the batch size increases to say 50,000, creating a
  > BooleanQuery composed of 50,000 TermQuery instances may introduce
  > huge memory costs.
  > 
  > Is there a better way to do this?
  > 
  > Can TermInfosReader.get(Term term) be optimized to do a binary
  > lookup instead of a linear walk? Of course that depends on whether
  > the terms are stored in sorted order. Are they?
  > 
  > This is very urgent, thanks in advance for all your help.
  > 
  > -John





URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Hi:

   I am trying to index 1M documents, with batches of 500 documents.

   Each document has a unique text key, which is added as a
Field.Keyword(name, value).

   For each batch of 500, I need to make sure I am not adding a
document with a key that is already in the current index.

   To do this, I am calling IndexSearcher.docFreq for each document and
delete the document currently in the index with the same key:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        term = new Term("key", objectID);
        int count = localSearcher.docFreq(term);

        if (count != 0) {
            localReader.delete(term);
        }
    }

Then I proceed with adding the documents.

This turns out to be extremely expensive. I looked into the code, and I see in
TermInfosReader.get(Term term) that it is doing a linear lookup for each
term. So as the index grows, the above operation degrades at a linear
rate. So for each commit, we are doing a docFreq for 500 documents.

I also tried to create a BooleanQuery composed of 500 TermQueries and
do 1 search for each batch, and the performance didn't get better. And
if the batch size increases to say 50,000, creating a BooleanQuery
composed of 50,000 TermQuery instances may introduce huge memory
costs.

Is there a better way to do this?

Can TermInfosReader.get(Term term) be optimized to do a binary lookup
instead of a linear walk? Of course that depends on whether the terms
are stored in sorted order. Are they?

This is very urgent, thanks in advance for all your help.

-John




Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Kevin A. Burton
Erik Hatcher wrote:
Also, there is a DBDirectory in the sandbox to store a Lucene index 
inside Berkeley DB.
I assume this would prevent prefix queries from working...
Kevin



Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Chris Hostetter

: Note that I said FilteredQuery, not QueryFilter.

Doh .. right, sorry. I confused myself by thinking you were still referring
to your comments 2004-03-29 comparing DateFilter with RangeQuery wrapped
in a QueryFilter.

: I debate (with myself) on whether add-ons that can be done with other
: code is worth adding to Lucene's core.  In this case the utility
: methods are so commonly needed that it makes sense.  But it could be

In particular, having a class of utilities like that in the code base is
useful, because now the javadocs for classes like RangeQuery and
RangeFilter can reference them as necessary to ensure that
ranges work the way you expect ... and hopefully fewer people will be
confused in the future.

: I think there needs to be some discussion on what other utility methods
: should be added.  For example, most of the numerics I index are
: positive integers, and using a zero-padded representation is sufficient.  I'd rather
: have clearly recognizable numbers in my fields than some strange
: contortion that requires a conversion process to see.

I'm of two minds. On one hand, I think there's no big harm in providing
every conceivable utility function known to man so people have their
choice of representation.  On the other hand, I think it would be nice if
Lucene had a much simpler API for dealing with "non-strings" that just did
"the right thing" based on simple expectations -- without the user having
to ask themselves: "Will I ever need negative numbers?  Will I ever need
numbers bigger than 1000?" or to later remember that they padded this field
to 5 digits and that field to 7 digits.

Having clearly recognized values is something that can (should?) be easily
accomplished by indexing the contorted but lexically sortable value, and
storing the more readable value...

Document d = /* some doc */;
Long l = /* some value */;
Field f1 = Field.UnIndexed("field", l.toString());
Field f2 = Field.UnStored("field", NumberTools.longToString(l));
d.add(f1);
d.add(f2);

(I'm not imagining things right?  that should work, correct?)

What would really be sweet is if Lucene had an API that
transparently dealt with all of the major primitive types, both at
indexing time and at query time, so that users didn't have to pay any
attention to the stringification, or when to index a different value
than the one they store...

Field f = Field.Long("field", l); /* indexes one string, stores the other */
d.add(f);
...
Query q = new RangeQuery("field", l1, l2); /* knows to use the contorted 
string */
...
String s = hits.doc(i).getValue("field"); /* returns pretty string */
Long l = hits.doc(i).getValue("field");   /* returns orriginal Long */

--

---
"Oh, you're a tricky one."Chris M Hostetter
 -- Trisha Weir[EMAIL PROTECTED]





retrieving added document

2004-11-23 Thread Paul
Hi,
I'm creating a document and adding it with a writer to the index. For
some reason I need to add data to this specific document later on
(minutes later, not hours or days). Is it possible to retrieve it and add
additional data?
I found the document(int n) method in IndexReader (btw: the
description makes no sense to me: "Returns the stored fields of the
nth Document in this index." - but it returns a Document and not a
list of fields..), but where do I get that number from? (and the
numbers change, I know..)

thanks for any help

Paul




Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Erik Hatcher
On Nov 23, 2004, at 3:41 PM, Erik Hatcher wrote:
On Nov 23, 2004, at 2:16 PM, Chris Hostetter wrote:
First: Is there any reason Matt Quail's "LongField" class hasn't been
added to CVS (or has it and I'm just not seeing it?)
Laziness is the only reason, at least on my part.  I think adding it 
is a great thing.  I'll look into it.
I'm feeling particularly commit-y today.  I dug up Matt Quail's 
original LongField contribution in e-mail and adapted it to a new 
NumberTools class.  I committed it along with the tests he contributed 
also.

I think there needs to be some discussion on what other utility methods 
should be added.  For example, most of the numerics I index are 
positive integers, and using a zero-padded representation is sufficient.  
I'd rather have clearly recognizable numbers in my fields than some strange 
contortion that requires a conversion process to see.

Erik


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Erik Hatcher
On Nov 23, 2004, at 2:16 PM, Chris Hostetter wrote:
: I did a little code cleanup, Chris, renaming some RangeFilter variables
: and correcting typos in the Javadocs.  Let me know if everything looks
: ok.

Wow ... that was fast.  Things look fine to me (typos in javadocs are 
my specialty), but now I wish I'd included more tests.
We can always add more tests.  Anytime.
First: Is there any reason Matt Quail's "LongField" class hasn't been
added to CVS (or has it and I'm just not seeing it?)
Laziness is the only reason, at least on my part.  I think adding it is 
a great thing.  I'll look into it.

I haven't tested it extensively, but it strikes me as being a crucial 
utility for people who want to do any serious sorting or filtering of 
numeric values.
I debate (with myself) whether add-ons that can be done with other 
code are worth adding to Lucene's core.  In this case the utility 
methods are so commonly needed that it makes sense.  But it could be 
argued also that there are classes in Lucene that are not central 
to its operation.

Although I would suggest a few minor tweaks:
  a) Rename to something like NumberTools (to be consistent with the new
     DateTools and because...)
Agreed.
  b) Add some one-line convenience methods like intToString and
     floatToString and doubleToString ala:
     return longToString(Double.doubleToLongBits(d));
No objection to having convenience methods - though I need to look at 
what the LongField code is providing before commenting in detail.

: And now with FilteredQuery you can have the best of both worlds :)
See, this is what I'm not getting: what is the advantage of the second
world? :) ... in what situations would using...
   s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true));
...be a better choice than...
   s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true);
Note that I said FilteredQuery, not QueryFilter.
Certainly RangeFilter is cleaner than using a QueryFilter(RangeQuery) 
combination - that's why we added it.  :)

Erik


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Yonik Seeley
Hmmm, scratch that.  I explained the tradeoff of a
filter vs a range query - not between the different
types of filters you talk about.

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
> I think it depends on the query.  If the query (q1)
> covers a large number of documents and the filter
> covers a very small number, then using a RangeFilter
> will probably be slower than a RangeQuery.
> 
> -Yonik
> 
> 
> > See, this is what I'm not getting: what is the
> > advantage of the second
> > world? :) ... in what situations would using...
> > 
> >    s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true));
> > 
> > ...be a better choice than...
> > 
> >    s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true);
 





Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Yonik Seeley
I think it depends on the query.  If the query (q1)
covers a large number of documents and the filter
covers a very small number, then using a RangeFilter
will probably be slower than a RangeQuery.

-Yonik


> See, this is what I'm not getting: what is the
> advantage of the second
> world? :) ... in what situations would using...
> 
>    s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true));
> 
> ...be a better choice than...
> 
>    s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true);







Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Erik Hatcher
On Nov 23, 2004, at 10:01 AM, Praveen Peddi wrote:
Chris's RangeFilter does not cache anything, whereas QueryFilter does 
caching. Is it better to add the caching functionality to RangeFilter 
also? Or does it not make any difference?
Caching is a different _aspect_.  Filtering and caching are not related 
and should not be intimately tied, in my opinion.  The solution is to 
use the CachingWrapperFilter to wrap a RangeFilter when caching is 
desired.
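
For example (a minimal sketch; the field name and bounds are illustrative):

    // Compute the RangeFilter's BitSet once per reader, then reuse it.
    Filter range = new RangeFilter("price", "000", "100", true, true);
    Filter cached = new CachingWrapperFilter(range);
    Hits hits = searcher.search(query, cached);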

Erik


experiences with PDF files

2004-11-23 Thread Paul
Hi,
I read a lot of mails about time-consuming PDF parsing and tried
some solutions myself. My example PDF file has 181 pages in 1.5 MB
(mostly text, nearly no graphics).
- with pdfbox.org's toolkit it took 17m32s to parse and read its content
- after installing ghostscript and ps2text / ps2ascii, my parsing failed
  after page 54 and 2m51s because of irregular fonts
- installing XPDF and using its pdftotext tool, parsing completed after
  7-10 seconds

My machine is a Celeron 1700 with VMWare Workstation 3.2 (128 MB
assigned) and Linux SuSE 7.3.

I will parse my PDF files with xpdf and something like:
Runtime.getRuntime().exec("pdftotext -nopgbrk -raw " + pdfFileName + " " + txtFileName);


Paul

P.S. Look at http://www.jguru.com/faq/view.jsp?EID=1074237 for links and tips.




RE: modifying existing index

2004-11-23 Thread Will Allen
To update a document you need to insert the modified document, then delete the 
old one.

Here is some code that I use to get you going in the right direction (it won't 
compile as-is, but if you follow it closely you will see how I take an array of 
Lucene documents with new properties, add them, then delete the old ones):


public void updateDocuments( Document[] documentsToUpdate )
{
    if ( documentsToUpdate.length > 0 )
    {
        String updateDate = Dates.formatDate( new Date(), "MMddHHmm" );
        // wait on some other modification to finish
        HashSet failedToAdd = new HashSet();
        waitToModify();
        synchronized ( directory )
        {
            IndexWriter indexWriter = null;
            try
            {
                indexWriter = getWriter();
                // this seems to be needed to accommodate a lucene (ver 1.4.2) bug;
                // otherwise the index does not accurately reflect the change
                indexWriter.mergeFactor = 2;
                // load data from new document into old document
                for ( int i = 0; i < documentsToUpdate.length; i++ )
                {
                    try
                    {
                        Document newDoc = modifyDocument( documentsToUpdate[i], updateDate );
                        if ( newDoc != null )
                        {
                            documentsToUpdate[i] = newDoc;
                            indexWriter.addDocument( newDoc );
                        }
                        else
                        {
                            failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                        }
                    }
                    catch ( IOException addDocException )
                    {
                        // if we fail to add, make a note and don't delete it
                        logger.error( " [" + getContext().getID() + "] error updating message:"
                                + documentsToUpdate[i].get( "messageid" ), addDocException );
                        failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                    }
                    catch ( java.lang.IllegalStateException ise )
                    {
                        // if we fail to add, make a note and don't delete it
                        logger.error( " [" + getContext().getID() + "] error updating message:"
                                + documentsToUpdate[i].get( "messageid" ), ise );
                        failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                    }
                }
                // if we fail to close the writer, we don't want to continue
                closeWriter();
                searcherVersion = -1; // establish that the searcher needs to update
                IndexReader reader = IndexReader.open( indexPath );
                int testid = -1;
                for ( int i = 0; i < documentsToUpdate.length; i++ )
                {
                    Document newDoc = documentsToUpdate[i];
                    try
                    {
                        logger.debug( "delete id:" + newDoc.get( "deleteid" )
                                + " messageid: " + newDoc.get( "messageid" ) );
                        reader.delete( Integer.parseInt( newDoc.get( "deleteid" ) ) );
                        testid = Integer.parseInt( newDoc.get( "deleteid" ) );
                    }
                    catch ( NumberFormatException

Re: modifying existing index

2004-11-23 Thread Luke Francl
On Tue, 2004-11-23 at 13:59, Santosh wrote:
> I am using Lucene for indexing. When I create the index, the documents are 
> added. But when I modify a single existing document and reindex it, it is 
> treated as a new document and added one more time, so that I am getting the 
> same document twice in the results.
> To overcome this I am deleting the existing index and again recreating the 
> whole index. But is it possible to index the modified document again and 
> overwrite the existing document without deleting and recreating? Can I do 
> this? If so, how?

You do not need to recreate the whole index. Just mark the document as
deleted using the IndexReader and then add it again with the
IndexWriter. Remember to close your IndexReader and IndexWriter after
doing this.

The deleted document will be removed the next time you optimize your
index.
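
A minimal sketch of that cycle (paths and variable names are illustrative):

    // Mark the stale copy deleted, then re-add the modified version.
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.delete(docNum);                    // doc number, e.g. from Hits.id(i)
    reader.close();

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.addDocument(modifiedDoc);          // re-add the updated document
    writer.optimize();                        // physically removes deleted docs
    writer.close();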

Luke Francl





modifying existing index

2004-11-23 Thread Santosh
I am using Lucene for indexing. When I create the index, the documents are 
added. But when I modify a single existing document and reindex it, it is 
treated as a new document and added one more time, so that I am getting the 
same document twice in the results.
To overcome this I am deleting the existing index and recreating the whole 
index. But is it possible to index the modified document again and overwrite 
the existing document without deleting and recreating? Can I do this? If so, 
how?

And one more question:
Can Lucene do stemming?
If I search for "roam", I know that a fuzzy query can give a result for 
"foam". But my requirement is: if I search for "roam", can I get the list of 
similar words as output, so that I can show the end user, in a column, a 
suggestion like --- do you mean "foam"?
How can I get a similar word list from the given content?









Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Chris Hostetter

: Done.  I deprecated DateField and DateFilter, and added the RangeFilter
: class contributed by Chris.
:
: I did a little code cleanup, Chris, renaming some RangeFilter variables
: and correcting typos in the Javadocs.  Let me know if everything looks
: ok.

Wow ... that was fast.  Things look fine to me (typos in javadocs are my
specialty), but now I wish I'd included more tests.

I still feel a little confused about two things though...

First: Is there any reason Matt Quail's "LongField" class hasn't been
added to CVS (or has it and I'm just not seeing it?)

I haven't tested it extensively, but strikes me as being a crucial utility
for people who want to do any serious sorting or filtering of numeric
values.

Although I would suggest a few minor tweaks:
  a) Rename to something like NumberTools (to be consistent with the new
 DateTools and because...)
  b) Add some one-line convenience methods like intToString and
 floatToString and doubleToString ala:
 return longToString(Double.doubleToLongBits(d));

Second...

: RangeQuery wrapped inside a QueryFilter is more specifically what I
: said.  I'm not a fan of DateField and how the built-in date support in
: Lucene works, so this is why I don't like DateFilter personally.
:
: Your RangeFilter, however, is nicely done and well worth deprecating
: DateFilter for.
  [...]
: > and RangeQuery. [5] Based on my limited tests, using a Filter to
: > restrict
: > to a Range is a lot faster then using RangeQuery -- independent of
: > caching.
:
: And now with FilteredQuery you can have the best of both worlds :)

See, this is what I'm not getting: what is the advantage of the second
world? :) ... in what situations would using...

   s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true));

...be a better choice than...

   s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true);


?




Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Praveen Peddi
Chris's RangeFilter does not cache anything, whereas QueryFilter does 
caching. Is it better to add the caching functionality to RangeFilter also? 
Or does it not make any difference?

Praveen
- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 23, 2004 9:19 AM
Subject: Re: Numeric Range Restrictions: Queries vs Filters


On Nov 23, 2004, at 4:18 AM, Doug Cutting wrote:
Hoss wrote:
The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)
+1
DateFilter could be deprecated, and replaced with the more generally and 
appropriately named RangeFilter.  Should we also deprecate DateField, in 
preference for DateTools?
Done.  I deprecated DateField and DateFilter, and added the RangeFilter 
class contributed by Chris.

I did a little code cleanup, Chris, renaming some RangeFilter variables 
and correcting typos in the Javadocs.  Let me know if everything looks ok.

Erik




Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Erik Hatcher
On Nov 23, 2004, at 4:18 AM, Doug Cutting wrote:
Hoss wrote:
The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)
+1
DateFilter could be deprecated, and replaced with the more generally 
and appropriately named RangeFilter.  Should we also deprecate 
DateField, in preference for DateTools?
Done.  I deprecated DateField and DateFilter, and added the RangeFilter 
class contributed by Chris.

I did a little code cleanup, Chris, renaming some RangeFilter variables 
and correcting typos in the Javadocs.  Let me know if everything looks 
ok.

Erik


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Erik Hatcher
On Nov 22, 2004, at 9:25 PM, Hoss wrote:
I'm rather new to Lucene (and this list), so if I'm grossly
misunderstanding things, forgive me.
You're spot on!
But I was surprised then to see the following quote from "Erik Hatcher" in
the archives:

  "In fact, DateFilter by itself is practically of no use, I think." [4]

...Erik goes on to suggest that given "a set of canned date ranges", it
doesn't really matter if you use a RangeQuery or a DateFilter -- as long
as you cache them to reuse them (with something like CachingWrapperFilter
or QueryFilter).  I'm hoping that he might elaborate on that comment?
RangeQuery wrapped inside a QueryFilter is more specifically what I
said.  I'm not a fan of DateField and how the built-in date support in
Lucene works, so this is why I don't like DateFilter personally.

Your RangeFilter, however, is nicely done and well worth deprecating
DateFilter for.

As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter
to both convince myself it could work, and to do a comparison between it
and RangeQuery. [5] Based on my limited tests, using a Filter to restrict
to a Range is a lot faster than using RangeQuery -- independent of
caching.
And now with FilteredQuery you can have the best of both worlds :)
Thanks for your detailed code, tests, and contribution.  We'll fold it in.

Erik


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Doug Cutting
Hoss wrote:
The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)
+1
DateFilter could be deprecated, and replaced with the more generally and 
appropriately named RangeFilter.  Should we also deprecate DateField, in 
preference for DateTools?

Doug


Re: too many files open issue

2004-11-23 Thread Neelam Bhatnagar
Hi Dmitry,
 
Thank you so much for your reply. 
 
I'd like to answer your specific questions. 
 
>>It also depends on whether you are using "compound files" or not (this
is a flag on the IndexWriter). >>With compound files flag on, segments
have fixed number of files, regardless of how many fields >>you use.
Without the flag, each field is a separate file.
 
We are using Lucene 1.2 and hence we don't have this compound file
property in IndexWriter class. This would mean that we are having a
separate file for each field. 
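
For reference, in later releases the flag is toggled on the writer; a hedged
sketch of the 1.3+ API:

    IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
    writer.setUseCompoundFile(true);   // caps the per-segment file count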
 
>>By the way, it is usual to have the file descriptors limit set at 9000
or so for unix machines running >>production web applications. By the
way 2, on Solaris, you will need to modify a value in >>/etc/systems to
get up to this level. Not sure about Linux or other flavors.
 
We are using SunOS 5.8 on a Sparc Sunfire280 R machine. Running ulimit
-n gives number 256. 
This is the number we had first tried to reduce to 200 and then bring
back up to 500 without any luck. Then ultimately, everything started to
work on the default number 256. 
We had tried to alter this number using the ulimit command itself instead of
changing it in the /etc/system file. 
 
>>Another suggestion - you may want to look into a tool called "lsof".
It is a utility that will show file >>handles open by a particular
process. It could be that some other part of your process (or of the
>>application server, VM, etc) is not closing files. This tool will help
you see what files are open and >>you can validate that all of the
really need to be open.
 
The "lsof" tool is available through following path 
ftp://vic.cc.purdue.edu/pub/tools/unix/lsof
which is not accepting anonymous access. Hence we have not been able to
download this tool to figure out what's going on with the processes and
the files being opened by them.
 
 
The most worrying aspect of the whole scenario is that there's no
consistency in the way the system behaves. It works fine with the default
settings, then suddenly it stops working. Then after changing the
settings several times, it works again, then breaks again. Our worry is
that we may not be going in the right direction with this approach. 
 
Kindly advise.
 
Thanks and regards
Neelam Bhatnagar
 
 
-----Original Message-----
From: Dmitry [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 22, 2004 8:46 PM
To: Lucene Users List
Subject: Re: Too many open files issue
 
I'm sorry, I wasn't involved in the original conversation but maybe I 
can jump in with some info that will help.
 
The number of files depends on the merge factor, number of segments, and 
number of indexed fields in your index. It also depends on whether you 
are using "compound files" or not (this is a flag on the IndexWriter). 
With the compound files flag on, segments have a fixed number of files, 
regardless of how many fields you use. Without the flag, each field is a 
separate file.
 
Let's say you have 10 segments (per your merge factor) that are being 
merged into a new segment (via an optimize call or just because you have 
reached the merge factor). This means there are 11 segments open at the 
same time. If you have 20 indexed fields and are not using compound 
files, that's 20 * 11 = 220 files. There are a few other files open as 
well, plus whatever other files and sockets that your JVM process is 
holding open at that time. This would include incoming connections, for 
example, if this is running inside a web server. If you are running in 
an application server, this could include connections and files open by 
other applications in that same app server.
 
So the numbers run up quite a bit.
 
By the way, it is usual to have the file descriptors limit set at 9000 
or so for unix machines running production web applications. By the way 
2, on Solaris, you will need to modify a value in /etc/systems to get up 
to this level. Not sure about Linux or other flavors.
 
Another suggestion - you may want to look into a tool called "lsof". It 
is a utility that will show file handles open by a particular process. 
It could be that some other part of your process (or of the application 
server, VM, etc) is not closing files. This tool will help you see what 
files are open and you can validate that all of the really need to be
open.
 
Best of luck.
Dmitry.
 
 
Neelam Bhatnagar wrote:
 
>Hi,
> 
>I had requested help on an issue we have been facing with the "Too many
>open files" Exception garbling the search indexes and crashing the
>search on the web site. 
>As a suggestion, you had asked us to look at the articles on O'Reilly
>Network which had specific context around this exact problem. 
>One of the suggestions was to increase the limit on the number of file
>descriptors on the file system. We tried it by first lowering the limit
>to 200 from 256 in order to reproduce the exception. The exception did
>get reproduced but even after increasing the limit to 500, the
exception
>kept coming until after several rounds of trying to rebuild the index

Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Erik Hatcher
Also, there is a DBDirectory in the sandbox to store a Lucene index 
inside Berkeley DB.

Erik
On Nov 22, 2004, at 6:06 PM, Kevin A. Burton wrote:
It seems that, compared to other datastores, Lucene starts to 
fall down.  For example, Lucene doesn't perform online index 
optimizations, so if you add 10 documents you have to run optimize() 
again, and this isn't exactly a fast operation.

I'm wondering about the potential for a generic JDBCDirectory for 
keeping the lucene index within a database.
It sounds somewhat unconventional, but it would allow you to perform live 
addDirectory updates without performing an optimize() again.

Has anyone looked at this?  How practical would it be?
Kevin



Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Morus Walter
Hoss writes:
> 
> (c) Filtering.  Filters in general make a lot of sense to me.  They are a
> way to specify (at query time) that only a certain subset of the index
> should be considered for results.  The Filter class has a very straight
> forward API that seems very easy to subclass to get the behavior I want.
> The Query API on the other hand ... I freely admit, that I can't make
> heads or tails out of it.  I don't even know where I would begin to try
> and write a new subclass of Query if I wanted to.
> 
> I would think that most people who want to do a "numeric range
> restriction" on their data, probably don't care about the Scoring benefits
> of RangeQuery.  Looking at the code base, the way DateFilter works seems
> like it provides an ideal solution to any sort of Range restriction (not
> just Dates) that *should* be more efficient then using RangeQuery when
> dealing with an unbounded value set. (Both approaches need to iterate over
> all of the terms in the specified field using TermEnum, but RangeQuery has
> to build up an set of BooleanQuery objects for each matching term, and
> then each of those queries have to help score the documents -- DateFilter
> on the other hand only has to maintain a single BitSet of documents that
> it finds as it iterates)
> 
IMO there's another option, at least as long as the number of your documents
isn't too high.
Sorting already creates a list of all field values for some field that 
will be used during the search for sorting.
Nothing prevents you from using that approach for search restriction also.
The advantage is that you can create that list once and use it for different
ranges until the index is changed, whereas a filter can only represent
one range.
The disadvantage is that you have to keep one value for each document in
memory instead of one bit in a filter.

I did that (before the sort code was introduced) for date queries, in order
to be able to sort and restrict searches on dates.
But I haven't thought yet about how a general API for such a solution might 
look.

Of course it depends on a number of questions, which way is preferable.
How often is the index modified, are range queries usually done for the
same or different ranges, how many documents are indexed and so on.
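
A hedged sketch of the idea (all names are illustrative): load one value per
document once per index version, then answer each requested range from that
table.

    public class LongValuesRangeFilter extends Filter {
        private final long[] values;   // indexed by Lucene document number
        private final long lo, hi;

        public LongValuesRangeFilter(long[] values, long lo, long hi) {
            this.values = values;
            this.lo = lo;
            this.hi = hi;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            for (int doc = 0; doc < values.length; doc++) {
                if (values[doc] >= lo && values[doc] <= hi) {
                    bits.set(doc);
                }
            }
            return bits;
        }
    }

The values array would be built once, e.g. by walking a TermEnum/TermDocs
over the field, and shared across filters until the index changes.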

Morus
  




Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Paul Elschot
Chris,

On Tuesday 23 November 2004 03:25, Hoss wrote:
> (NOTE: numbers in [] indicate Footnotes)
> 
> I'm rather new to Lucene (and this list), so if I'm grossly
> misunderstanding things, forgive me.
> 
> One of my main needs as I investigate Search technologies is to restrict
> results based on Ranges of numeric values.  Looking over the archives of
> this list, it seems that lots of people have run into problems dealing
> with this.  In particular, whenever someone asks a question about "Numeric
> Ranges" the question seem to always involve one (or more) of the
> following:
> 
>(a) Lexical sorting puts 11 in the range "1 TO 5"
>(b) Dates (or Dates and Times)
>(c) BooleanQuery$TooManyClauses Exceptions
>(d) Should I use a filter?

FWIW, the javadoc of the development version of
BooleanQuery.maxClauseCount reads:

  The maximum number of clauses permitted. Default value is 1024. Use the  
  org.apache.lucene.maxClauseCount system property to override. 

  TermQuery clauses are generated from for example prefix queries and
  fuzzy queries. Each TermQuery needs some buffer space during search,
  so this parameter indirectly controls the maximum buffer requirements for
  query search. Normally the buffers are allocated by the JVM. When using
  for example MMapDirectory the buffering is left to the operating system.

MMapDirectory uses memory mapped files for the index.

It would be useful to also provide a reference to filters (DateFilter)
and to LongField in case it is added to the code base.

...
> The Query API on the other hand ... I freely admit, that I can't make
> heads or tails out of it.  I don't even know where I would begin to try
> and write a new subclass of Query if I wanted to.

In a nutshell:

A Query either rewrites to another Query, or it provides a Weight.
A Weight first does normalisation and then provides a Scorer
to be used during search.

RangeQuery is a good example:

A RangeQuery rewrites to a BooleanQuery over TermQuery's
for the matching terms.
A BooleanQuery provides a BooleanScorer via its Weight.
A TermQuery provides a TermScorer via its Weight.
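
In code, the chain looks roughly like this (a hedged illustration; the
searcher normally calls rewrite() for you):

    // A RangeQuery becomes searchable by rewriting against a reader.
    Query range = new RangeQuery(new Term("price", "000"),
                                 new Term("price", "100"), true);
    Query rewritten = range.rewrite(reader);  // BooleanQuery of TermQuery clauses
    Hits hits = searcher.search(rewritten);   // each TermQuery scores via a TermScorer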

Regards,
Paul





Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Daniel Naber
On Tuesday 23 November 2004 00:06, Kevin A. Burton wrote:

> I'm wondering about the potential for a generic JDBCDirectory for
> keeping the lucene index within a database.

Such a thing already exists: http://ppinew.mnis.com/jdbcdirectory/, but I 
don't know about its scalability.

Regards
 Daniel

-- 
http://www.danielnaber.de
