RE: fetching similar wordlist as given word
Lucene does support stemming, but that is not what your example requires (stemming equates "roaming", "roam", "roamed", etc.). For stemming, look at PorterStemFilter or, better, the Snowball stemmers in the sandbox. For your similar-word list, I think you are looking for the class FuzzyTermEnum. This should give you the terms you need, although perhaps only those with a common prefix of a specified length. Otherwise, you could develop your own algorithm to look for similar terms in the index.

Chuck

> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 23, 2004 11:15 PM
> To: Lucene Users List
> Subject: fetching similar wordlist as given word
>
> Can Lucene do stemming? If I am searching for "roam" then I know that it
> can give a result for "foam" using a fuzzy query. But my requirement is:
> if I search for "roam", can I get a list of similar words as output, so
> that I can show the end user a column of suggestions (do you mean
> "foam"?). How can I get a similar word list for the given content?
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
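Chuck's FuzzyTermEnum suggestion can be sketched as follows (a sketch against the Lucene 1.x API; the index path and the field name "contents" are assumptions, not from the original message):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyTermEnum;

public class SimilarWords {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        // Walks the terms of "contents" whose edit-distance similarity
        // to "roam" exceeds the default minimum.
        FuzzyTermEnum fuzzy =
            new FuzzyTermEnum(reader, new Term("contents", "roam"));
        try {
            do {
                Term t = fuzzy.term();
                if (t != null) {
                    System.out.println(t.text());  // e.g. "foam", "roams", ...
                }
            } while (fuzzy.next());
        } finally {
            fuzzy.close();
            reader.close();
        }
    }
}
```

The do/while-with-null-check loop is the same enumeration pattern Lucene itself uses when rewriting a FuzzyQuery.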
Re: modifying existing index
On Wed, 24 Nov 2004 13:04:20 +0530, Santosh <[EMAIL PROTECTED]> wrote:
> I have gone through IndexReader; I found the method delete(int docNum),
> but where do I get the document number from? Is this predefined, or do
> we have to give a number prior to indexing?

The number (aka doc-id) is given by Lucene; it's an internal sequential integer. This number is usually retrieved from Hits.id(int) after a search:

  Hits myHits = myIndexSearcher.search(myQuery);
  for (int i = 0; i < myHits.length(); i++) {
    int docId = myHits.id(i);
    // ...
  }

--
Cheolgoo, Kang
RE: modifying existing index
A good way to do this is to add a keyword field with whatever unique id you have for the document. Then you can delete the term containing a unique id to delete the document from the index (look at IndexReader.delete(Term)). For an example, look at the demo class IndexHTML to see how it does incremental indexing.

Chuck

> -Original Message-
> From: Santosh [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 23, 2004 11:34 PM
> To: Lucene Users List
> Subject: Re: modifying existing index
>
> I have gone through IndexReader; I found the method delete(int docNum),
> but where do I get the document number from? Is this predefined, or do
> we have to give a number prior to indexing?
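A minimal sketch of the delete-then-re-add update pattern Chuck describes (the field name "uid", the id value, and the index path are assumptions for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDoc {
    public static void main(String[] args) throws Exception {
        // 1. Delete any existing copy of the document by its unique id.
        IndexReader reader = IndexReader.open("/path/to/index");
        reader.delete(new Term("uid", "doc-42"));
        reader.close();  // commit the deletion before opening a writer

        // 2. Re-add the updated document; the keyword field keeps the
        //    unique id searchable (and deletable) as a single term.
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("uid", "doc-42"));
        doc.add(Field.Text("contents", "updated body text"));
        writer.addDocument(doc);
        writer.close();
    }
}
```

Note the ordering: an IndexReader and an IndexWriter cannot both hold the write lock, so the reader must be closed before the writer is opened.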
Re: Help on the Query Parser
Terence Lai writes:
> Looks like the wildcard query disappeared. In fact, I am expecting
> text:"java* developer" to be returned. It seems to me that the
> QueryParser cannot handle the wildcard within a quoted String.

That's not just QueryParser: Lucene itself doesn't handle wildcards within phrases. You could have a query text:"java* developer" if '*' isn't removed by the analyzer, but it would only search for the literal token 'java*', not any expansion of it. I guess that is not what you want.

Morus
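One possible workaround, sketched against the Lucene 1.4-era API and untested: expand the wildcard yourself with WildcardTermEnum, then feed the expansions to a PhrasePrefixQuery, which accepts multiple alternative terms at a single phrase position. The index path and field name are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhrasePrefixQuery;
import org.apache.lucene.search.WildcardTermEnum;

public class WildcardPhrase {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");

        // Collect every indexed term matching "java*".
        List expansions = new ArrayList();
        WildcardTermEnum wte =
            new WildcardTermEnum(reader, new Term("text", "java*"));
        try {
            do {
                Term t = wte.term();
                if (t != null) {
                    expansions.add(t);
                }
            } while (wte.next());
        } finally {
            wte.close();
        }

        // Position 0: any of the expansions; position 1: "developer".
        PhrasePrefixQuery q = new PhrasePrefixQuery();
        q.add((Term[]) expansions.toArray(new Term[expansions.size()]));
        q.add(new Term("text", "developer"));
        // q can now be passed to IndexSearcher.search(...)
        reader.close();
    }
}
```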
Re: modifying existing index
I have gone through IndexReader; I found the method delete(int docNum), but where do I get the document number from? Is this predefined, or do we have to give a number prior to indexing?

- Original Message -
From: "Luke Francl" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 24, 2004 1:26 AM
Subject: Re: modifying existing index

> On Tue, 2004-11-23 at 13:59, Santosh wrote:
> > I am using Lucene for indexing. When I create the index the documents
> > are added, but when I modify a single existing document and re-index
> > it, it is taken as a new document and added one more time, so I get
> > the same document twice in the results. To overcome this I am deleting
> > the existing index and recreating the whole index. Is it possible to
> > index the modified document again and overwrite the existing document
> > without deleting and recreating? If so, how?
>
> You do not need to recreate the whole index. Just mark the document as
> deleted using the IndexReader and then add it again with the
> IndexWriter. Remember to close your IndexReader and IndexWriter after
> doing this.
>
> The deleted document will be removed the next time you optimize your
> index.
>
> Luke Francl
fetching similar wordlist as given word
Can Lucene do stemming? If I am searching for "roam" then I know that it can give a result for "foam" using a fuzzy query. But my requirement is: if I search for "roam", can I get a list of similar words as output, so that I can show the end user a column of suggestions (do you mean "foam"?). How can I get a similar word list for the given content?

---SOFTPRO DISCLAIMER---

Information contained in this E-MAIL and any attachments is proprietary to SOFTPRO SYSTEMS and is 'privileged' and 'confidential'.

If you are not an intended or authorised recipient of this E-MAIL or have received it in error, you are notified that any use, copying or dissemination of the information contained in this E-MAIL in any manner whatsoever is strictly prohibited. Please delete it immediately and notify the sender by E-MAIL. In such a case, reading, reproducing, printing or further dissemination of this E-MAIL is strictly prohibited and may be unlawful.

SOFTPRO SYSTEMS does not REPRESENT or WARRANT that an attachment hereto is free from computer viruses or other defects.

The opinions expressed in this E-MAIL and any ATTACHMENTS may be those of the author and are not necessarily those of SOFTPRO SYSTEMS.
MERGERINDEX + SOLUTION
Hi guys,

Apologies if this has come up before. I have a merged index (a merge of 1000 sub-indexes). The question is: does somebody have a solution for repairing such a merged index in case of corruption? If so, please let the forum know, so that developers like us can use it.

Thanks in advance.

With warm regards, have a nice day,
N.S. Karthik
Help on the Query Parser
Hi all,

I am trying to use QueryParser.parse() to parse a query string like "java* developer". Note that I want the wildcard string, java*, followed by the word developer. The following is the code:

  String qryStr = "\"java* developer\"";
  String fieldname = "text";
  StandardAnalyzer analyzer = new StandardAnalyzer();
  Query qry = org.apache.lucene.queryParser.QueryParser.parse(qryStr, fieldname, analyzer);

When I do a qry.toString() to print out the contents, I get the following output:

  text:"java developer"

Looks like the wildcard query disappeared. In fact, I am expecting text:"java* developer" to be returned. It seems to me that the QueryParser cannot handle a wildcard within a quoted string. Does anyone have a solution for this? Am I missing something in the code?

Thanks,
Terence
RE: lucene Scorers
Hi Ken,

I'm glad our replies were helpful. It sounds like you looked at the code in MaxDisjunctionQuery, so you probably noticed that it also implements skipTo(). Your suggestion sounds like a good thing to do. I thought about that when writing MaxDisjunctionQuery, but didn't need the generality, and it does make the code more complex.

I think Lucene needs one of these mechanisms in it, at least to solve the problems associated with the current default use of BooleanQuery for multiple-field expansions. Your proposal would generalize this to solve additional cases where different accrual operators are appropriate. You could write and submit the generalization, although there are no guarantees anybody would do anything with it; I didn't get anywhere in my attempt to submit MaxDisjunctionQuery.

I think there is also a serious problem with the current score normalization (it does not provide meaningfully comparable scores across different searches, which means that absolute score numbers like 0.8 have no intrinsic meaning concerning how good a result is or is not). When I finally get back to tuning search in my app, that's the next one I'll try a submission on.

Chuck

> -Original Message-
> From: Ken McCracken [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 23, 2004 4:31 PM
> To: Lucene Users List
> Subject: Re: lucene Scorers
>
> Thanks for the pointers in your replies. Would it be possible to
> include some sort of accrual scorer interface somewhere in the Lucene
> Query APIs? This could be passed into a query similar to
> MaxDisjunctionQuery, and combine the sum, max, tieBreaker, etc.,
> according to the implementor's discretion, to compute the overall
> score for a document.
>
> -Ken
Re: retrieving added document
On Tue, 23 Nov 2004 22:47:21 +0100, Paul <[EMAIL PROTECTED]> wrote:
> Hi,
> I'm creating a document and adding it with a writer to the index. For
> some reason I need to add data to this specific document later on
> (minutes, not hours or days). Is it possible to retrieve it and add
> additional data?

No, you cannot add data to (or modify) a previously added document. It's easy, though, to delete the old one from the index and add a new document with the additional data included.

> I found the document(int n) method within the IndexReader (btw: the
> description makes no sense to me: "Returns the stored fields of the
> nth Document in this index." - but it returns a Document and not a
> list of fields..) but where do I get that number from? (and the
> numbers change, I know..)

Usually you search using IndexSearcher, and the resulting Hits has the doc-id (the number) in that index. And the Document contains the list of (stored) fields.

--
Cheolgoo, Kang
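The Hits-based retrieval Cheolgoo describes looks roughly like this (a sketch; the index path, query term, and field names are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class HitsExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = new TermQuery(new Term("contents", "lucene"));
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            int docNum = hits.id(i);     // internal doc number (not stable!)
            Document doc = hits.doc(i);  // the stored fields, as a Document
            System.out.println(docNum + ": " + doc.get("title"));
        }
        searcher.close();
    }
}
```

The doc numbers can change whenever the index is optimized or segments merge, which is why a unique-id keyword field is the usual way to re-find a document later.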
Re: lucene Scorers
Hi,

Thanks for the pointers in your replies. Would it be possible to include some sort of accrual scorer interface somewhere in the Lucene Query APIs? This could be passed into a query similar to MaxDisjunctionQuery, and combine the sum, max, tieBreaker, etc., according to the implementor's discretion, to compute the overall score for a document.

-Ken

On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Friday 12 November 2004 22:56, Chuck Williams wrote:
> >
> > I had a similar need and wrote MaxDisjunctionQuery and
> > MaxDisjunctionScorer. Unfortunately these are not available as a
> > patch, but I've included the original message below that has the code
> > (modulo line breaks added by simple text email format).
> >
> > This code is functional -- I use it in my app. It is optimized for
> > its stated use, which involves a small number of clauses. You'd want
> > to improve the incremental sorting (e.g., using the bucket technique
> > of BooleanQuery) if you need it for large numbers of clauses.
>
> If you're interested, you can also have a look here for
> yet another DisjunctionScorer:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
>
> It has the advantage that it implements skipTo() so that it can
> be used as a subscorer of ConjunctionScorer, i.e. it can be
> faster in situations like this:
>
> aa AND (bb OR cc)
>
> where bb and cc are treated by the DisjunctionScorer.
> When aa is a filter this can also be used to implement
> a filtering query.
>
> > Re. Paul's suggested steps below, I did not integrate this with the
> > query parser as I didn't need that functionality (since I'm
> > generating the multi-field expansions for which max is a much better
> > scoring choice than sum).
> >
> > Chuck
> >
> > Included message:
> >
> > -Original Message-
> > From: Chuck Williams [mailto:[EMAIL PROTECTED]
> > Sent: Monday, October 11, 2004 9:55 PM
> > To: [EMAIL PROTECTED]
> > Subject: Contribution: better multi-field searching
> >
> > The files included below (MaxDisjunctionQuery.java and
> > MaxDisjunctionScorer.java) provide a new mechanism for searching
> > across multiple fields.
>
> The maximum indeed works well, also when the fields differ a lot in
> length.
>
> Regards,
> Paul
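The accrual interface Ken proposes might look something like this. This is purely hypothetical: nothing like it exists in Lucene's API, and all the names here are invented. It shows how max, sum, and a tie-breaker could be folded behind one hook.

```java
// Hypothetical hook for combining the sub-scores of a disjunction.
interface AccrualCombiner {
    float combine(float[] subScores);
}

// A max-based policy in the spirit of MaxDisjunctionScorer: take the
// best sub-score, plus a tie-breaker fraction of the remaining ones.
class MaxCombiner implements AccrualCombiner {
    private final float tieBreaker;

    MaxCombiner(float tieBreaker) {
        this.tieBreaker = tieBreaker;
    }

    public float combine(float[] subScores) {
        float max = 0f;
        float sum = 0f;
        for (int i = 0; i < subScores.length; i++) {
            sum += subScores[i];
            if (subScores[i] > max) {
                max = subScores[i];
            }
        }
        // tieBreaker == 0 gives pure max; tieBreaker == 1 gives pure sum.
        return max + tieBreaker * (sum - max);
    }
}
```

A sum-based combiner (BooleanQuery's behavior) and the max-based one would then differ only in the implementation passed in.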
Re: JDBCDirectory to prevent optimize()?
On Nov 23, 2004, at 6:02 PM, Kevin A. Burton wrote:
> Erik Hatcher wrote:
> > Also, there is a DBDirectory in the sandbox to store a Lucene index
> > inside Berkeley DB.
>
> I assume this would prevent prefix queries from working...

Huh? Why would you assume that? As far as I know (and I've tested this some), a Lucene index inside Berkeley DB works the same as if it had been in RAM or on the filesystem.

Erik
Re: URGENT: Help indexing large document set
Thanks Chuck! I missed the call to getIndexOffset. I am profiling it again to pinpoint where the performance problem is.

-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Are you sure you have a performance problem with
> TermInfosReader.get(Term)? It looks to me like it scans sequentially
> only within a small buffer window (of size
> SegmentTermEnum.indexInterval) and that it uses binary search
> otherwise. See TermInfosReader.getIndexOffset(Term).
>
> Chuck
RE: URGENT: Help indexing large document set
Are you sure you have a performance problem with TermInfosReader.get(Term)? It looks to me like it scans sequentially only within a small buffer window (of size SegmentTermEnum.indexInterval) and that it uses binary search otherwise. See TermInfosReader.getIndexOffset(Term).

Chuck

> -Original Message-
> From: John Wang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 23, 2004 3:38 PM
> To: [EMAIL PROTECTED]
> Subject: URGENT: Help indexing large document set
>
> Can TermInfosReader.get(Term term) be optimized to do a binary lookup
> instead of a linear walk? Of course that depends on whether the terms
> are stored in sorted order; are they?
>
> This is very urgent; thanks in advance for all your help.
>
> -John
URGENT: Help indexing large document set
Hi:

I am trying to index 1M documents, in batches of 500 documents. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting any document currently in the index with the same key:

  while (keyIter.hasNext()) {
    String objectID = (String) keyIter.next();
    term = new Term("key", objectID);
    int count = localSearcher.docFreq(term);
    if (count != 0) {
      localReader.delete(term);
    }
  }

Then I proceed with adding the documents.

This turns out to be extremely expensive. I looked into the code and I see that TermInfosReader.get(Term term) is doing a linear look-up for each term, so as the index grows the above operation degrades at a linear rate. For each commit, we are doing a docFreq for 500 documents. I also tried to create a BooleanQuery composed of 500 TermQueries and do one search per batch, and the performance didn't get better. And if the batch size increases to, say, 50,000, creating a BooleanQuery composed of 50,000 TermQuery instances may introduce huge memory costs.

Is there a better way to do this?

Can TermInfosReader.get(Term term) be optimized to do a binary lookup instead of a linear walk? Of course that depends on whether the terms are stored in sorted order; are they?

This is very urgent; thanks in advance for all your help.

-John
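As an aside, the docFreq pre-check in the loop above may be unnecessary: IndexReader.delete(Term) deletes every document containing the term and returns how many it deleted, and it is simply a no-op when the term is absent. A sketch of the simplified batch delete (same localReader/keyIter names as the original code):

```java
import java.util.Iterator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteByKey {
    // One term lookup per key instead of docFreq() followed by delete().
    static void deleteBatch(IndexReader localReader, Iterator keyIter)
            throws java.io.IOException {
        while (keyIter.hasNext()) {
            String objectID = (String) keyIter.next();
            // Returns the number of documents deleted; 0 if absent.
            int deleted = localReader.delete(new Term("key", objectID));
        }
    }
}
```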
Re: JDBCDirectory to prevent optimize()?
Erik Hatcher wrote:
> Also, there is a DBDirectory in the sandbox to store a Lucene index
> inside Berkeley DB.

I assume this would prevent prefix queries from working...

Kevin

--
Kevin A. Burton, San Francisco, CA
http://peerfear.org/
Re: Numeric Range Restrictions: Queries vs Filters
: Note that I said FilteredQuery, not QueryFilter.

Doh .. right, sorry. I confused myself by thinking you were still referring to your 2004-03-29 comments comparing DateFilter with RangeQuery wrapped in a QueryFilter.

: I debate (with myself) on whether add-ons that can be done with other
: code is worth adding to Lucene's core. In this case the utility
: methods are so commonly needed that it makes sense. But it could be

In particular, having a class of utilities like that in the code base is useful because now the javadocs for classes like RangeQuery and RangeFilter can reference them as being necessary to ensure that ranges work the way you expect ... and hopefully fewer people will be confused in the future.

: I think there needs to be some discussion on what other utility methods
: should be added. For example, most of the numerics I index are
: positive integers and using a zero-padded is sufficient. I'd rather
: have clearly recognizable numbers in my fields than some strange
: contortion that requires a conversion process to see.

I'm of two minds. On one hand, I think there's no big harm in providing every conceivable utility function known to man so people have their choice of representation. On the other hand, I think it would be nice if Lucene had a much simpler API for dealing with "non-strings" that just did "the right thing" based on simple expectations -- without the user having to ask themselves "Will I ever need negative numbers? Will I ever need numbers bigger than 1000?" or to later remember that they padded this field to 5 digits and that field to 7 digits.

Having clearly recognizable values is something that can (should?) be easily accomplished by indexing the contorted but lexically sortable value, and storing the more readable value...

  Document d = /* some doc */;
  Long l = /* some value */;
  Field f1 = Field.UnIndexed("field", l.toString());
  Field f2 = Field.UnStored("field", NumberTools.longToString(l.longValue()));
  d.add(f1);
  d.add(f2);

(I'm not imagining things, right? That should work, correct?)

What would really be sweet is if Lucene had an API that transparently dealt with all of the major primitive types, both at indexing time and at query time, so that users didn't have to pay any attention to the stringification, or to when to index a different value than they store...

  Field f = Field.Long("field", l);  /* indexes one string, stores the other */
  d.add(f);
  ...
  Query q = new RangeQuery("field", l1, l2);  /* knows to use the contorted string */
  ...
  String s = hits.doc(i).getValue("field");  /* returns pretty string */
  Long l = hits.doc(i).getValue("field");  /* returns original Long */

--
"Oh, you're a tricky one."  -- Trisha Weir

Chris M Hostetter
[EMAIL PROTECTED]
retrieving added document
Hi,

I'm creating a document and adding it with a writer to the index. For some reason I need to add data to this specific document later on (minutes later, not hours or days). Is it possible to retrieve it and add additional data?

I found the document(int n) method within the IndexReader (btw: the description makes no sense to me: "Returns the stored fields of the nth Document in this index." - but it returns a Document and not a list of fields..), but where do I get that number from? (And the numbers change, I know..)

Thanks for any help,
Paul
Re: Numeric Range Restrictions: Queries vs Filters
On Nov 23, 2004, at 3:41 PM, Erik Hatcher wrote:
> On Nov 23, 2004, at 2:16 PM, Chris Hostetter wrote:
> > First: Is there any reason Matt Quail's "LongField" class hasn't been
> > added to CVS (or has it and I'm just not seeing it?)
>
> Laziness is the only reason, at least on my part. I think adding it is
> a great thing. I'll look into it.

I'm feeling particularly commit-y today. I dug up Matt Quail's original LongField contribution in e-mail and adapted it to a new NumberTools class. I committed it along with the tests he contributed.

I think there needs to be some discussion on what other utility methods should be added. For example, most of the numerics I index are positive integers, and using a zero-padded representation is sufficient. I'd rather have clearly recognizable numbers in my fields than some strange contortion that requires a conversion process to see.

Erik
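The zero-padding Erik mentions only needs a fixed width; lexicographic order then matches numeric order for non-negative integers. A minimal sketch (plain Java, not Lucene code; the helper name is invented):

```java
public class ZeroPad {
    // Pad a non-negative int to a fixed width so that string order
    // equals numeric order, e.g. "00042" sorts before "00107".
    static String pad(int n, int width) {
        StringBuffer buf = new StringBuffer(Integer.toString(n));
        while (buf.length() < width) {
            buf.insert(0, '0');
        }
        return buf.toString();
    }
}
```

One could index pad(n, 5) as the term value and store Integer.toString(n) for display. Negative numbers and mixed widths break the ordering, which is exactly the gap a class like NumberTools is meant to close.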
Re: Numeric Range Restrictions: Queries vs Filters
On Nov 23, 2004, at 2:16 PM, Chris Hostetter wrote:
> : I did a little code cleanup, Chris, renaming some RangeFilter
> : variables and correcting typos in the Javadocs. Let me know if
> : everything looks ok.
>
> Wow ... that was fast. Things look fine to me (typos in javadocs are
> my specialty) but now I wish I'd included more tests

We can always add more tests. Anytime.

> First: Is there any reason Matt Quail's "LongField" class hasn't been
> added to CVS (or has it and I'm just not seeing it?)

Laziness is the only reason, at least on my part. I think adding it is a great thing. I'll look into it.

> I haven't tested it extensively, but it strikes me as being a crucial
> utility for people who want to do any serious sorting or filtering of
> numeric values.

I debate (with myself) whether add-ons that can be done with other code are worth adding to Lucene's core. In this case the utility methods are so commonly needed that it makes sense. But it could also be argued that there are classes in Lucene that are not central to its operation.

> Although I would suggest a few minor tweaks:
>
> a) Rename to something like NumberTools (to be consistent with the new
> DateTools and because...)

Agreed.

> b) Add some one-line convenience methods like intToString and
> floatToString and doubleToString a la:
>
>   return longToString(Double.doubleToLongBits(d));

No objections to having convenience methods - though I need to look at what the LongField code is providing before commenting in detail.

> : And now with FilteredQuery you can have the best of both worlds :)
>
> See, this is what I'm not getting: what is the advantage of the second
> world? :) ... in what situations would using...
>
>   s.search(q1, new QueryFilter(new RangeQuery(t1, t2, true)));
>
> ...be a better choice than...
>
>   s.search(q1, new RangeFilter(t1.field(), t1.text(), t2.text(), true, true));

Note that I said FilteredQuery, not QueryFilter. Certainly RangeFilter is cleaner than using a QueryFilter(RangeQuery) combination - that's why we added it. :)

Erik
Re: Numeric Range Restrictions: Queries vs Filters
Hmmm, scratch that. I explained the tradeoff of a filter vs a range query - not between the different types of filters you talk about.

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
> I think it depends on the query. If the query (q1)
> covers a large number of documents and the filter
> covers a very small number, then using a RangeFilter
> will probably be slower than a RangeQuery.
>
> -Yonik
>
> > See, this is what I'm not getting: what is the
> > advantage of the second world? :) ... in what
> > situations would using...
> >
> >    s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true)));
> >
> > ...be a better choice than...
> >
> >    s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true));
Re: Numeric Range Restrictions: Queries vs Filters
I think it depends on the query. If the query (q1) covers a large number of documents and the filter covers a very small number, then using a RangeFilter will probably be slower than a RangeQuery.

-Yonik

> See, this is what I'm not getting: what is the
> advantage of the second world? :) ... in what
> situations would using...
>
>    s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true)));
>
> ...be a better choice than...
>
>    s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true));
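The mechanics behind the tradeoff Yonik describes can be sketched with plain java.util.BitSet (a simplified model of what a Lucene Filter produces, not the actual implementation): a filter is computed once over the whole index as one bit per document, after which restricting any query's hits is a single cheap AND. That pays off when the filter is reused or the query matches many documents; when the query already matches only a handful, the work of building the full filter bitset can exceed what a RangeQuery clause would have cost.

```java
import java.util.BitSet;

public class FilterIntersection {
    // Simplified model: a filter is a BitSet with one bit per document in
    // the index. Restricting a query's hits to the filter is a single AND
    // on a copy, so the query's own hit set is left untouched.
    static BitSet restrict(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    public static void main(String[] args) {
        BitSet queryHits = new BitSet();
        queryHits.set(1); queryHits.set(3); queryHits.set(5);

        BitSet range = new BitSet();
        range.set(0, 4); // documents 0..3 fall inside the "range"

        System.out.println(restrict(queryHits, range)); // {1, 3}
    }
}
```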
Re: Numeric Range Restrictions: Queries vs Filters
On Nov 23, 2004, at 10:01 AM, Praveen Peddi wrote: Chris's RangeFilter does not cache anything whereas QueryFilter does caching. Is it better to add the caching functionality to RangeFilter also? or does it not make any difference?

Caching is a different _aspect_. Filtering and caching are not related and should not be intimately tied, in my opinion. The solution is to use the CachingWrapperFilter to wrap a RangeFilter when caching is desired.

Erik
experiences with PDF files
Hi, I read a lot of mails about time-consuming PDF parsing and tried some solutions myself. My example PDF file has 181 pages in 1.5 MB (mostly text, nearly no graphics).

- with pdfbox.org's toolkit it took 17m32s to parse & read its content
- after installing ghostscript and ps2text / ps2ascii my parsing failed after page 54 and 2m51s because of irregular fonts
- installing XPDF and using its tool pdftotext, parsing completed after 7-10 seconds

My machine is a Celeron 1700 with VMWare Workstation 3.2 (128 MB assigned) and Linux SuSE 7.3. I will parse my PDF files with xpdf and something like

Runtime.getRuntime().exec("pdftotext -nopgbrk -raw "+pdfFileName+" "+txtFileName);

Paul

P.S. look at http://www.jguru.com/faq/view.jsp?EID=1074237 for links and tips
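A slightly fuller sketch of shelling out to pdftotext (the class and method names are mine; only the pdftotext flags come from Paul's message). Passing the command as a String[] avoids shell quoting problems with file names containing spaces, and draining stderr keeps the child process from blocking on a full pipe buffer:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PdfToText {
    // Builds the pdftotext command line; -nopgbrk drops page-break
    // characters and -raw keeps text in content-stream order.
    static String[] buildCommand(String pdfFileName, String txtFileName) {
        return new String[] { "pdftotext", "-nopgbrk", "-raw", pdfFileName, txtFileName };
    }

    // Runs pdftotext and waits for it to finish, discarding stderr so the
    // child cannot stall when its error pipe fills up. Returns the process
    // exit code (0 means success for pdftotext).
    static int convert(String pdfFileName, String txtFileName) throws Exception {
        Process p = Runtime.getRuntime().exec(buildCommand(pdfFileName, txtFileName));
        BufferedReader err = new BufferedReader(new InputStreamReader(p.getErrorStream()));
        while (err.readLine() != null) {
            // discard diagnostics; log them here if you need to debug font errors
        }
        return p.waitFor();
    }
}
```

This requires the pdftotext binary on the PATH, as in Paul's setup.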
RE: modifying existing index
To update a document you need to insert the modified document, then delete the old one. Here is some code that I use to get you going in the right direction (it won't compile, but if you follow it closely you will see how I take an array of Lucene documents with new properties and add them, then delete the old ones):

public void updateDocuments( Document[] documentsToUpdate ) {
    if ( documentsToUpdate.length > 0 ) {
        String updateDate = Dates.formatDate( new Date(), "MMddHHmm" );
        // wait on some other modification to finish
        HashSet failedToAdd = new HashSet();
        waitToModify();
        synchronized ( directory ) {
            IndexWriter indexWriter = null;
            try {
                indexWriter = getWriter();
                // this seems to be needed to accommodate a Lucene (ver 1.4.2) bug;
                // otherwise the index does not accurately reflect the change
                indexWriter.mergeFactor = 2;
                // load data from new document into old document
                for ( int i = 0; i < documentsToUpdate.length; i++ ) {
                    try {
                        Document newDoc = modifyDocument( documentsToUpdate[i], updateDate );
                        if ( newDoc != null ) {
                            documentsToUpdate[i] = newDoc;
                            indexWriter.addDocument( newDoc );
                        } else {
                            failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                        }
                    } catch ( IOException addDocException ) {
                        // if we fail to add, make a note and don't delete the old copy
                        logger.error( " [" + getContext().getID() + "] error updating message:"
                            + documentsToUpdate[i].get( "messageid" ), addDocException );
                        failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                    } catch ( java.lang.IllegalStateException ise ) {
                        // if we fail to add, make a note and don't delete the old copy
                        logger.error( " [" + getContext().getID() + "] error updating message:"
                            + documentsToUpdate[i].get( "messageid" ), ise );
                        failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                    }
                }
                // if we fail to close the writer, we don't want to continue
                closeWriter();
                searcherVersion = -1; // establish that the searcher needs to update
                IndexReader reader = IndexReader.open( indexPath );
                int testid = -1;
                for ( int i = 0; i < documentsToUpdate.length; i++ ) {
                    Document newDoc = documentsToUpdate[i];
                    try {
                        logger.debug( "delete id:" + newDoc.get( "deleteid" )
                            + " messageid: " + newDoc.get( "messageid" ) );
                        reader.delete( Integer.parseInt( newDoc.get( "deleteid" ) ) );
                        testid = Integer.parseInt( newDoc.get( "deleteid" ) );
                    } catch ( NumberFormatEx
Re: modifying existing index
On Tue, 2004-11-23 at 13:59, Santosh wrote:
> I am using lucene for indexing. When I am creating the index the documents
> are added, but when I want to modify a single existing document and re-index
> it, it is taken as a new document and added one more time, so I am getting
> the same document twice in the results.
> To overcome this I am deleting the existing index and recreating the whole
> index. But is it possible to index the modified document again and overwrite
> the existing document without deleting and recreating? Can I do this? If so, how?

You do not need to recreate the whole index. Just mark the document as deleted using the IndexReader and then add it again with the IndexWriter. Remember to close your IndexReader and IndexWriter after doing this. The deleted document will be removed the next time you optimize your index.

Luke Francl
modifying existing index
I am using lucene for indexing. When I am creating the index the documents are added, but when I want to modify a single existing document and re-index it, it is taken as a new document and added one more time, so I am getting the same document twice in the results. To overcome this I am deleting the existing index and recreating the whole index. But is it possible to index the modified document again and overwrite the existing document without deleting and recreating? Can I do this? If so, how?

And one more question: will lucene be able to do stemming? If I am searching for "roam" then I know that it can give a result for "foam" using a fuzzy query. But my requirement is: if I search for "roam", can I get the similar wordlist as output, so that I can show the end user in the column --- do you mean "foam"? How can I get a similar word list in the given content?
Re: Numeric Range Restrictions: Queries vs Filters
: Done. I deprecated DateField and DateFilter, and added the RangeFilter
: class contributed by Chris.
:
: I did a little code cleanup, Chris, renaming some RangeFilter variables
: and correcting typos in the Javadocs. Let me know if everything looks
: ok.

Wow ... that was fast. Things look fine to me (typos in javadocs are my specialty), but now I wish I'd included more tests. I still feel a little confused about two things though...

First: Is there any reason Matt Quail's "LongField" class hasn't been added to CVS (or has it and I'm just not seeing it?) I haven't tested it extensively, but it strikes me as a crucial utility for people who want to do any serious sorting or filtering of numeric values. Although I would suggest a few minor tweaks:

a) Rename to something like NumberTools (to be consistent with the new DateTools and because...)

b) Add some one-line convenience methods like intToString, floatToString and doubleToString, a la: return longToString(Double.doubleToLongBits(d));

Second...

: RangeQuery wrapped inside a QueryFilter is more specifically what I
: said. I'm not a fan of DateField and how the built-in date support in
: Lucene works, so this is why I don't like DateFilter personally.
:
: Your RangeFilter, however, is nicely done and well worth deprecating
: DateFilter for.
[...]
: > and RangeQuery. [5] Based on my limited tests, using a Filter to
: > restrict to a Range is a lot faster than using RangeQuery --
: > independent of caching.
:
: And now with FilteredQuery you can have the best of both worlds :)

See, this is what I'm not getting: what is the advantage of the second world? :) ... in what situations would using...

   s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true)));

...be a better choice than...

   s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true));

?
Re: Numeric Range Restrictions: Queries vs Filters
Chris's RangeFilter does not cache anything whereas QueryFilter does caching. Is it better to add the caching functionality to RangeFilter also? or does it not make any difference?

Praveen

- Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 23, 2004 9:19 AM Subject: Re: Numeric Range Restrictions: Queries vs Filters

On Nov 23, 2004, at 4:18 AM, Doug Cutting wrote: Hoss wrote: The attachment contains my RangeFilter, a unit test that demonstrates it, and a Benchmarking unit test that does a side-by-side comparison with RangeQuery [6]. If developers feel that this class is useful, then by all means roll it into the code base. (90% of it is cut/pasted from DateFilter/RangeQuery anyway)

+1 DateFilter could be deprecated, and replaced with the more generally and appropriately named RangeFilter. Should we also deprecate DateField, in preference for DateTools?

Done. I deprecated DateField and DateFilter, and added the RangeFilter class contributed by Chris. I did a little code cleanup, Chris, renaming some RangeFilter variables and correcting typos in the Javadocs. Let me know if everything looks ok.

Erik
Re: Numeric Range Restrictions: Queries vs Filters
On Nov 23, 2004, at 4:18 AM, Doug Cutting wrote: Hoss wrote: The attachment contains my RangeFilter, a unit test that demonstrates it, and a Benchmarking unit test that does a side-by-side comparison with RangeQuery [6]. If developers feel that this class is useful, then by all means roll it into the code base. (90% of it is cut/pasted from DateFilter/RangeQuery anyway)

+1 DateFilter could be deprecated, and replaced with the more generally and appropriately named RangeFilter. Should we also deprecate DateField, in preference for DateTools?

Done. I deprecated DateField and DateFilter, and added the RangeFilter class contributed by Chris. I did a little code cleanup, Chris, renaming some RangeFilter variables and correcting typos in the Javadocs. Let me know if everything looks ok.

Erik
Re: Numeric Range Restrictions: Queries vs Filters
On Nov 22, 2004, at 9:25 PM, Hoss wrote: I'm rather new to Lucene (and this list), so if I'm grossly misunderstanding things, forgive me.

You're spot on!

But I was surprised then to see the following quote from "Erik Hatcher" in the archives: "In fact, DateFilter by itself is practically of no use, I think." [4] ...Erik goes on to suggest that given "a set of canned date ranges", it doesn't really matter if you use a RangeQuery or a DateFilter -- as long as you cache them to reuse them (with something like CachingWrapperFilter or QueryFilter). I'm hoping that he might elaborate on that comment?

RangeQuery wrapped inside a QueryFilter is more specifically what I said. I'm not a fan of DateField and how the built-in date support in Lucene works, so this is why I don't like DateFilter personally. Your RangeFilter, however, is nicely done and well worth deprecating DateFilter for.

As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter to both convince myself it could work, and to do a comparison between it and RangeQuery. [5] Based on my limited tests, using a Filter to restrict to a Range is a lot faster than using RangeQuery -- independent of caching.

And now with FilteredQuery you can have the best of both worlds :) Thanks for your detailed code, tests, and contribution. We'll fold it in.

Erik
Re: Numeric Range Restrictions: Queries vs Filters
Hoss wrote: The attachment contains my RangeFilter, a unit test that demonstrates it, and a Benchmarking unit test that does a side-by-side comparison with RangeQuery [6]. If developers feel that this class is useful, then by all means roll it into the code base. (90% of it is cut/pasted from DateFilter/RangeQuery anyway)

+1 DateFilter could be deprecated, and replaced with the more generally and appropriately named RangeFilter. Should we also deprecate DateField, in preference for DateTools?

Doug
Re: too many files open issue
Hi Dmitry, Thank you so much for your reply. I'd like to answer your specific questions.

>>It also depends on whether you are using "compound files" or not (this is a flag on the IndexWriter). With compound files flag on, segments have a fixed number of files, regardless of how many fields you use. Without the flag, each field is a separate file.

We are using Lucene 1.2 and hence we don't have this compound file property in the IndexWriter class. This would mean that we are having a separate file for each field.

>>By the way, it is usual to have the file descriptors limit set at 9000 or so for unix machines running production web applications. By the way 2, on Solaris, you will need to modify a value in /etc/system to get up to this level. Not sure about Linux or other flavors.

We are using SunOS 5.8 on a Sparc Sunfire 280R machine. Running ulimit -n gives the number 256. This is the number we had first tried to reduce to 200 and then bring back up to 500 without any luck. Then ultimately, everything started to work on the default number 256. We had tried to alter this number using the ulimit command itself instead of changing it in the /etc/system file.

>>Another suggestion - you may want to look into a tool called "lsof". It is a utility that will show file handles open by a particular process. It could be that some other part of your process (or of the application server, VM, etc) is not closing files. This tool will help you see what files are open and you can validate that all of them really need to be open.

The "lsof" tool is available through the following path ftp://vic.cc.purdue.edu/pub/tools/unix/lsof which is not accepting anonymous access. Hence we have not been able to download this tool to figure out what's going on with the processes and the files being opened by them.

The most worrying aspect about the whole scenario is that there's no consistency in the way the system behaves. It works fine with the default settings, then suddenly it stops working.
Then after changing the settings several times, it works again, then breaks again. Our worry is that we may not be going in the right direction with this approach. Kindly advise.

Thanks and regards
Neelam Bhatnagar

-Original Message- From: Dmitry [mailto:[EMAIL PROTECTED] Sent: Monday, November 22, 2004 8:46 PM To: Lucene Users List Subject: Re: Too many open files issue

I'm sorry, I wasn't involved in the original conversation but maybe I can jump in with some info that will help. The number of files depends on the merge factor, number of segments, and number of indexed fields in your index. It also depends on whether you are using "compound files" or not (this is a flag on the IndexWriter). With the compound files flag on, segments have a fixed number of files, regardless of how many fields you use. Without the flag, each field is a separate file.

Let's say you have 10 segments (per your merge factor) that are being merged into a new segment (via an optimize call or just because you have reached the merge factor). This means there are 11 segments open at the same time. If you have 20 indexed fields and are not using compound files, that's 20 * 11 = 220 files. There are a few other files open as well, plus whatever other files and sockets your JVM process is holding open at that time. This would include incoming connections, for example, if this is running inside a web server. If you are running in an application server, this could include connections and files open by other applications in that same app server. So the numbers run up quite a bit.

By the way, it is usual to have the file descriptors limit set at 9000 or so for unix machines running production web applications. By the way 2, on Solaris, you will need to modify a value in /etc/system to get up to this level. Not sure about Linux or other flavors.

Another suggestion - you may want to look into a tool called "lsof". It is a utility that will show file handles open by a particular process.
It could be that some other part of your process (or of the application server, VM, etc) is not closing files. This tool will help you see what files are open and you can validate that all of them really need to be open.

Best of luck.
Dmitry.

Neelam Bhatnagar wrote:
>Hi,
>
>I had requested help on an issue we have been facing with the "Too many
>open files" Exception garbling the search indexes and crashing the
>search on the web site.
>As a suggestion, you had asked us to look at the articles on O'Reilly
>Network which had specific context around this exact problem.
>One of the suggestions was to increase the limit on the number of file
>descriptors on the file system. We tried it by first lowering the limit
>to 200 from 256 in order to reproduce the exception. The exception did
>get reproduced but even after increasing the limit to 500, the exception
>kept coming until after several rounds of trying to rebuild the inde
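Dmitry's arithmetic above can be captured in a small helper (the class and method names are illustrative, not part of any Lucene API) to estimate the worst-case descriptor count for a non-compound index during a merge:

```java
public class OpenFileEstimate {
    // Worst case while merging without compound files: the mergeFactor
    // segments being merged plus the new segment are open at once, and
    // each indexed field contributes one file per segment. Real processes
    // also hold sockets, JARs, etc., so treat this as a lower bound.
    static int worstCaseOpenFiles(int mergeFactor, int indexedFields) {
        int segmentsOpen = mergeFactor + 1;
        return segmentsOpen * indexedFields;
    }

    public static void main(String[] args) {
        // The example from the thread: merge factor 10, 20 indexed fields.
        System.out.println(worstCaseOpenFiles(10, 20)); // 220
    }
}
```

Comparing that estimate with `ulimit -n` (256 in Neelam's case) makes it clear why the default Solaris limit was marginal for this index.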
Re: JDBCDirectory to prevent optimize()?
Also, there is a DBDirectory in the sandbox to store a Lucene index inside Berkeley DB.

Erik

On Nov 22, 2004, at 6:06 PM, Kevin A. Burton wrote: It seems that when compared to other datastores Lucene starts to fall down. For example, Lucene doesn't perform online index optimizations, so if you add 10 documents you have to run optimize() again, and this isn't exactly a fast operation. I'm wondering about the potential for a generic JDBCDirectory for keeping the Lucene index within a database. It sounds somewhat unconventional but would allow you to perform live addDirectory updates without performing an optimize() again. Has anyone looked at this? How practical would it be?

Kevin
Re: Numeric Range Restrictions: Queries vs Filters
Hoss writes:
>
> (c) Filtering. Filters in general make a lot of sense to me. They are a
> way to specify (at query time) that only a certain subset of the index
> should be considered for results. The Filter class has a very straight
> forward API that seems very easy to subclass to get the behavior I want.
> The Query API on the other hand ... I freely admit, that I can't make
> heads or tails out of it. I don't even know where I would begin to try
> and write a new subclass of Query if I wanted to.
>
> I would think that most people who want to do a "numeric range
> restriction" on their data probably don't care about the scoring benefits
> of RangeQuery. Looking at the code base, the way DateFilter works seems
> like it provides an ideal solution to any sort of range restriction (not
> just dates) that *should* be more efficient than using RangeQuery when
> dealing with an unbounded value set. (Both approaches need to iterate over
> all of the terms in the specified field using TermEnum, but RangeQuery has
> to build up a set of BooleanQuery objects for each matching term, and
> then each of those queries has to help score the documents -- DateFilter
> on the other hand only has to maintain a single BitSet of documents that
> it finds as it iterates)
>

IMO there's another option, at least as long as the number of your documents isn't too high. Sorting already creates a list of all field values for some field that will be used during the search for sorting. Nothing prevents you from using that approach for search restriction also. The advantage is that you can create that list once and use it for different ranges until the index is changed, whereas a filter can only represent one range. The disadvantage is that you have to keep one value for each document in memory instead of one bit in a filter. I did that (before the sort code was introduced) for date queries in order to be able to sort and restrict searches on dates.

But I haven't thought about how a general API for such a solution might look so far. Of course it depends on a number of questions which way is preferable: how often is the index modified, are range queries usually done for the same or different ranges, how many documents are indexed, and so on.

Morus
Re: Numeric Range Restrictions: Queries vs Filters
Chris,

On Tuesday 23 November 2004 03:25, Hoss wrote:
> (NOTE: numbers in [] indicate Footnotes)
>
> I'm rather new to Lucene (and this list), so if I'm grossly
> misunderstanding things, forgive me.
>
> One of my main needs as I investigate Search technologies is to restrict
> results based on Ranges of numeric values. Looking over the archives of
> this list, it seems that lots of people have run into problems dealing
> with this. In particular, whenever someone asks a question about "Numeric
> Ranges" the question seems to always involve one (or more) of the
> following:
>
>    (a) Lexical sorting puts 11 in the range "1 TO 5"
>    (b) Dates (or Dates and Times)
>    (c) BooleanQuery$TooManyClauses Exceptions
>    (d) Should I use a filter?

FWIW, the javadoc of the development version of BooleanQuery.maxClauseCount reads: The maximum number of clauses permitted. Default value is 1024. Use the org.apache.lucene.maxClauseCount system property to override.

TermQuery clauses are generated from, for example, prefix queries and fuzzy queries. Each TermQuery needs some buffer space during search, so this parameter indirectly controls the maximum buffer requirements for query search. Normally the buffers are allocated by the JVM. When using, for example, MMapDirectory the buffering is left to the operating system. MMapDirectory uses memory mapped files for the index.

It would be useful to also provide a reference to filters (DateFilter) and to LongField in case it is added to the code base.

...

> The Query API on the other hand ... I freely admit, that I can't make
> heads or tails out of it. I don't even know where I would begin to try
> and write a new subclass of Query if I wanted to.

In a nutshell: a Query either rewrites to another Query, or it provides a Weight. A Weight first does normalisation and then provides a Scorer to be used during search. RangeQuery is a good example: a RangeQuery rewrites to a BooleanQuery over TermQuery's for the matching terms. A BooleanQuery provides a BooleanScorer via its Weight. A TermQuery provides a TermScorer via its Weight.

Regards, Paul
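The javadoc Paul quotes says the 1024-clause default can be overridden through the org.apache.lucene.maxClauseCount system property. A minimal sketch of that property-override pattern using only java.lang.System (the fallback-on-unparsable behavior here is my assumption, not necessarily what Lucene itself does):

```java
public class MaxClauseConfig {
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Reads the override property, falling back to the default when the
    // property is unset or not a valid integer.
    static int maxClauseCount() {
        String v = System.getProperty("org.apache.lucene.maxClauseCount");
        if (v == null) return DEFAULT_MAX_CLAUSES;
        try {
            return Integer.parseInt(v);
        } catch (NumberFormatException e) {
            return DEFAULT_MAX_CLAUSES;
        }
    }

    public static void main(String[] args) {
        System.out.println(maxClauseCount()); // 1024 unless overridden
        // Raising the limit trades memory for fewer TooManyClauses errors
        // on broad prefix/fuzzy/range expansions.
        System.setProperty("org.apache.lucene.maxClauseCount", "4096");
        System.out.println(maxClauseCount()); // 4096
    }
}
```

In practice you would set the property on the JVM command line (-Dorg.apache.lucene.maxClauseCount=4096) before any query is built.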
Re: JDBCDirectory to prevent optimize()?
On Tuesday 23 November 2004 00:06, Kevin A. Burton wrote: > I'm wondering about the potential for a generic JDBCDirectory for > keeping the lucene index within a database. Such a thing already exists: http://ppinew.mnis.com/jdbcdirectory/, but I don't know about its scalability.

Regards
Daniel

--
http://www.danielnaber.de