Re: best way of reusing IndexSearcher objects
Doug Cutting writes:

Dror Matalon wrote: There are two issues: 1. Having new searches start using the new index only when it's ready, not in a half-baked state, which means that you have to synchronize the switch from the old index to the new one.

That's true. If you're doing updates (as opposed to just additions) then you probably want to do something like:

1. Keep a single open IndexReader used by all searches.
2. Every few minutes, process updates as follows:
   a. open a second IndexReader
   b. delete all documents that will be updated
   c. close this IndexReader, to flush deletions
   d. open an IndexWriter
   e. add all documents that are updated
   f. close the IndexWriter
   g. replace the IndexReader used for searches (1, above)

Right. As long as you can control the reader instance from the update process, it's better to do so than to have the search-side reader itself check whether it is still up to date.

Dror also wrote: 2. It's not trivial to figure out when it's safe to discard the old index while existing searches are still using it. To make things more complicated, the Hits object depends on your IndexSearcher object, so if you have Hits objects in use you probably can't close your IndexSearcher. Is this a correct analysis, or is there an obvious strategy to work around this issue?

Right, you cannot safely close the IndexReader that's being used for searching. Rather, just drop it on the floor and let it get garbage collected. Its files will be closed when that happens. Provided you're not updating more frequently than the garbage collector runs, you should only ever have two IndexReaders open and shouldn't run into file-handle issues.

I guess the alternative would be reference counting that is incremented whenever a search starts and decremented when the Hits object is no longer used. You could then set a flag and close the index when the count reaches 0. Thanks for the comments.
Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
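Doug's swap-and-drop approach and the reference-counting alternative Morus mentions can be modelled in a few lines of plain Java. This is an illustrative sketch only: DummyReader and HotSwap are invented names standing in for an open IndexReader and the code that manages it; this is not Lucene API.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative stand-in for an open index reader; the real Lucene
// IndexReader releases its file handles when closed (or, as Doug
// notes, when it is garbage collected).
class DummyReader {
    final int version;
    final AtomicInteger refCount = new AtomicInteger(1); // 1 = the "current" slot holds a ref
    boolean closed = false;
    DummyReader(int version) { this.version = version; }
    void incRef() { refCount.incrementAndGet(); }
    void decRef() { if (refCount.decrementAndGet() == 0) closed = true; }
}

class HotSwap {
    // All searches go through this reference; the updater swaps it atomically.
    private final AtomicReference<DummyReader> current =
        new AtomicReference<>(new DummyReader(1));

    // A search pins the reader it uses so a concurrent swap cannot close it.
    DummyReader acquire() {
        DummyReader r = current.get();
        r.incRef();
        return r;
    }

    void release(DummyReader r) { r.decRef(); }

    // The updater publishes a new reader and drops the slot's ref on the old one.
    void swap(DummyReader fresh) {
        DummyReader old = current.getAndSet(fresh);
        old.decRef(); // old closes once the last in-flight search releases it
    }
}
```

With explicit reference counting, the old reader closes deterministically as soon as the last in-flight search releases it, instead of waiting for the garbage collector.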
Benchmark (WAS: Indexing Speed: Documents vs. Sentences)
Hello, here is a benchmark. I am not sure if that is proper etiquette, but I will just paste it into this mail and hope that it gets funneled into the right channels. Cheers! Jochen

Hardware environment:
- Dedicated machine for indexing: no; some other work performed on it. Shouldn't influence results much since it's a multiple-processor machine.
- CPU: 2x Intel Xeon 3.05GHz
- RAM: 4GB
- Drive configuration: SCSI

Software environment:
- Java version: 1.4.2-b28
- Java VM: Java HotSpot Client VM 1.4.2
- OS version: Redhat 8
- Location of index: local

Lucene indexing variables:
- Number of source documents: 5,000,000
- Total filesize of source documents: 40GB
- Average filesize of source documents: 8kB
- Source documents storage location: DB on remote server
- File type of source documents: pre-parsed HTML
- Parser(s) used, if any: n/a
- Analyzer(s) used: StandardAnalyzer
- Number of fields per document: 5
- Type of fields: actual text is indexed but not stored in the Lucene index
- Index persistence: where the index is stored, e.g. FSDirectory, SqlDirectory, etc.

Figures:
- Time taken (average of at least 3 indexing runs): 332 minutes
- Time taken per 1000 docs indexed: 4 sec
- Memory consumption: about 100MB

Notes:
- With the above configuration we pretty consistently achieve an indexing rate of 250 docs/sec. The actual text cannot be retrieved from the index; this keeps the index size down (6.1GB) and increases indexing speed. When the actual documents are stored in the index, the rate drops by about 30%, to 160 docs/sec.
FW: Indexing Speed: Documents vs. Sentences
Stephane,

The actual indexing is less glamorous than it sounds. When you index 1TB across 10 machines, you end up with 100GB on each machine. We do not merge the indexes either, since we get better speed on indexing as well as querying when we keep indexes smaller and distributed across different machines. (But somehow I think that I'll sit down and merge all of them together and play with it when I get a chance ... 'cause it's cool :-) I'll keep you posted when it happens.) My test set that I am playing with is 40GB, and I just posted a benchmark. Best, Jochen

-----Original Message----- From: Stephane Vaucher [mailto:[EMAIL PROTECTED]] Sent: Thursday, December 18, 2003 9:01 AM To: Lucene Users List; [EMAIL PROTECTED] Subject: RE: Indexing Speed: Documents vs. Sentences

Jochen, If you have a bit of time, could you post some metrics? (As an example, you can look at http://jakarta.apache.org/lucene/docs/benchmarks.html.) I haven't heard of anyone indexing 1TB yet. I'm sure everyone is interested in any problems you could be facing, and we could probably give you some ideas. I know (oddly enough) I sometimes wish I had a dataset greater than a few million docs to experiment with. cheers, sv

On Thu, 18 Dec 2003, Jochen Frey wrote:

Hi, Yes, this is correct, I am dealing with a few 100GB (close to 1TB). I am, however, distributing the data across several machines and then merging the results from all the machines together (until I find a better, faster solution). Cheers!

-----Original Message----- From: Victor Hadianto [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 17, 2003 10:50 PM To: Lucene Users List Subject: Re: Indexing Speed: Documents vs. Sentences

Hi, I am using Lucene to index a large number of web pages (a few 100GB) and the indexing speed is great. Jochen

.. a few 100GB? Is this correct?
/victor
DoubleMetaphoneQuery
I've seen discussions about using the double metaphone algorithm with Lucene (basically: like Soundex, it is used to find words that sound similar, in English at least) but couldn't find an implementation, so I spent a few minutes and wrote a Query and a TermEnum object for this. I may have missed the prior art, so sorry if I did...

[1] Here are some mail messages that mention double metaphone with respect to Lucene:
http://www.geocrawler.com/archives/3/2626/2000/10/0/4566951/
http://www.geocrawler.com/archives/3/2626/2001/8/50/6382300/
http://www.mail-archive.com/[EMAIL PROTECTED]/msg04648.html

[2] And Phonetix has a double metaphone Analyzer, but not a Query, which I guess is another angle on things:
http://www.tangentum.biz/en/products/phonetix/api/com/tangentum/phonetix/lucene/PhoneticAnalyzer.html

[3] Attached are 2 files (DoubleMetaphoneQuery and DoubleMetaphoneTermEnum) that I think are valid contributions to the Lucene Sandbox. Hopefully all that has to be done is change the package line if the powers that be accept this. Note: my impl uses the Jakarta CODEC package ( http://jakarta.apache.org/commons/codec/ ) for the double metaphone algorithm implementation. Also, any query expansion such as this could exceed the bounds of a boolean query, so BooleanQuery.setMaxClauseCount may need to be used to avoid an exception.

[4] I've updated my Lucene demo site, which has the ~3500 RFCs indexed and searchable by Lucene.
I added an advanced query page to try out the DoubleMetaphoneQuery; it's a few lines down at this URL: http://www.hostmon.com/rfc/advanced.jsp

[5] Most of the above is redundantly stated here as a kind of perma-link: http://www.tropo.com/techno/java/lucene/metaphone.html

[6] While it's easy to write additional Query classes, I suspect they are a kind of dead end and won't really be used unless they are integrated into the QueryParser. Thus one concept is that the Lucene syntax should have some extension mechanism, so you could pass a query like metaphone::protokal to it, and metaphone:: (note the double colons) would mean to use DoubleMetaphoneQuery for this term. Maybe an extensible query parser should be the subject of another email? So: let me know if this is useful and plz enter it into the sandbox... thx, Dave Spencer

package com.tropo.lucene;

/*
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation. All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *    "Apache Lucene" must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *    "Apache Lucene", nor may "Apache" appear in their name, without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation. For more
 * information on the Apache Software Foundation, please see
 * http://www.apache.org/.
 */

import java.io.IOException;
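For readers unfamiliar with phonetic matching, here is a minimal classic Soundex encoder in plain Java, the simpler ancestor that the post compares Double Metaphone to. This is an illustrative sketch only; it is not the Jakarta CODEC implementation that the attached classes actually use.

```java
// Minimal classic Soundex encoder: keep the first letter, map remaining
// consonants to digit codes, skip vowels (and h/w/y), collapse adjacent
// duplicate codes, and pad/truncate to 4 characters. Shown only to
// illustrate the family of algorithms the post mentions.
class Soundex {
    // Digit code for each letter a..z ('0' = vowel-like, not emitted).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String w = word.toUpperCase();
        StringBuilder out = new StringBuilder();
        char lastCode = 0;
        for (int i = 0; i < w.length() && out.length() < 4; i++) {
            char c = w.charAt(i);
            if (c < 'A' || c > 'Z') continue;        // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                        // keep the first letter verbatim
            } else if (code != '0' && code != lastCode) {
                out.append(code);                     // skip vowels, collapse repeats
            }
            lastCode = code;
        }
        while (out.length() < 4) out.append('0');     // pad to fixed length
        return out.toString();
    }
}
```

Words that sound alike collapse to the same code ("Robert" and "Rupert" both become R163), which is exactly the property the DoubleMetaphoneTermEnum exploits when it enumerates index terms with matching encodings.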
Re: syntax of queries.
Erik, Thanks! The article is very good.

I have new questions:

- apiQuery.add(new TermQuery(new Term("contents", "dot")), false, true);

Does the Term class work for only one word? Is that right? Would new Term("contents", "dot java") search for "dot" OR "java" in contents? My problem is that the user enters a phrase, and I search for any word in the phrase, not the entire phrase. Do I need to parse the string, take it word by word, and add a TermQuery for each word? Bye, Ernesto.

----- Original Message ----- From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Saturday, December 13, 2003 4:07 AM Subject: Re: syntax of queries.

Try out the toString(fieldName) trick on your Query instances and pair them up with what you have below - this will be quite insightful for the issue - I promise! :) Look at my QueryParser article and search for toString on that page: http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html

On Friday, December 12, 2003, at 10:38 PM, Ernesto De Santis wrote:

Thanks Otis, I couldn't resolve my problem. I looked at the query syntax page and the FAQ's search section. I tried many alternatives:

body:(imprimir teclado) title:base = 451 hits
body:(imprimir teclado)^5.1 title:base = 248 hits (under 451)
body:(imprimir teclado^5.1) title:base = 451 hits - first document: 3287.html
body:(imprimir^5.1 teclado) title:base = 451 hits - first document: 1545.html

Conclusion: I think the boost is only applicable to one word, not to parentheses and not to a field. I want to make the boost apply to a field. For me, a hit in title is more important than one in body, for example. In the FAQ's search section: Clause ::= [ Modifier ] [ FieldName ':' ] BasicClause [ Boost ] and BasicClause ::= ( Term | Phrase | PrefixQuery | '(' Query ')' ). Then, in my example, BasicClause = (imprimir teclado) and Boost = ^5.1, but it does not work. Regards, Ernesto.
----- Original Message ----- From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED]; Ernesto De Santis [EMAIL PROTECTED] Sent: Friday, December 12, 2003 7:18 PM Subject: Re: syntax of queries.

Maybe it's the spaces after title:? Try title:importar ... instead. Maybe it's the spaces before ^5.0? Try title:importar^5 instead. You shouldn't need the parentheses in this case either, I believe. See the Query Syntax page on Lucene's site. Otis

--- Ernesto De Santis [EMAIL PROTECTED] wrote:

Hello, I am not understanding the syntax of queries. I search with this string: title: (importar) ^5.0 OR title: (arquivos) and it returns 6 hits. And with this: title: (arquivos) OR title: (importar) ^5.0, 27 hits. Why? In the first, I think it works like AND, but why? :-( Regards, Ernesto.
Sentence Endings: IndexWriter.maxFieldLength and Token.setPositionIncrement()
Hi! I hope this is the right forum for this post. I was wondering if other people would consider this a bug (it might be a feature and I am missing the point of it):

- The default IndexWriter.maxFieldLength is 10,000.
- The point of maxFieldLength is to limit memory usage.
- The current position (which is compared against maxFieldLength) is essentially determined by the sum of the position increments of all Tokens added to the index.

Why does this matter? If you have setPositionIncrement(1000) for sentence-ending tokens, only the first 10 sentences of your document will be indexed; the rest will not be searchable (since the position will be greater than 10,000).

Why I think this is a bug: if you skip 1000 positions, no memory is required by the DocumentWriter for the 999 empty positions, so maxFieldLength ends up limiting available positions rather than memory.

I suggest a counter in DocumentWriter that counts the actual number of tokens in the postingTable (probably in DocumentWriter.addPosition), so that maxFieldLength is compared against the number of actual entries, not the number of actual entries plus the number of skipped entries.

Best, Jochen

PS: Please let me know if this is the wrong forum for this so I'll post to the right one next time.
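The arithmetic behind the complaint can be made concrete with a small stand-alone model. This is plain Java, not actual Lucene code; the constants simply mirror the numbers in the post.

```java
// Simulates how a position counter driven by position increments interacts
// with maxFieldLength, per the post's description. A 1000-position increment
// at each sentence boundary exhausts the 10,000-position budget after only
// 10 sentences, even though very few tokens were actually stored.
class FieldLengthSim {
    static final int MAX_FIELD_LENGTH = 10_000;  // IndexWriter.maxFieldLength default
    static final int SENTENCE_GAP = 1_000;       // setPositionIncrement at sentence ends

    // Returns how many whole sentences fit before the position cap is hit,
    // assuming wordsPerSentence ordinary tokens (increment 1) per sentence.
    static int sentencesIndexed(int wordsPerSentence) {
        int position = 0;
        int sentences = 0;
        while (true) {
            position += wordsPerSentence;  // ordinary tokens advance by 1 each
            position += SENTENCE_GAP;      // sentence-end token advances by 1000
            if (position > MAX_FIELD_LENGTH) return sentences;
            sentences++;
        }
    }
}
```

Counting actual tokens instead of summed positions, as the post suggests, would let the whole document through: 10 sentences of 10 words contribute only about 110 tokens against the 10,000-token budget.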
Re: DoubleMetaphoneQuery
Interestingly, I used a MetaphoneAnalyzer as an example in our book in progress. I'm curious whether you have measured performance doing it at analysis time versus query time. Enumerating all terms at query time is basically the same as doing a WildcardQuery or FuzzyQuery and involves a lot of work, although on moderate-size indexes it is probably not too painful. Nice work on this! I'd be happy to add this to the sandbox, and will do so in the next few days hopefully. Erik

On Friday, December 19, 2003, at 02:51 PM, David Spencer wrote: [quoted message trimmed]
Re: syntax of queries.
On Friday, December 19, 2003, at 05:42 PM, Ernesto De Santis wrote: I have new questions: - apiQuery.add(new TermQuery(new Term("contents", "dot")), false, true); new Term("contents", "dot") - does the Term class work for only one word?

Careful with terminology here. It works for only one term. What is a term? That all depends on what happened during analysis. Generally speaking, though, "word" is the right generalization for a term, but we have to be careful technically speaking.

Ernesto also asked whether new Term("contents", "dot java") searches for "dot" OR "java" in contents. Wrong. When constructing a query through the API, if you want an OR you'd need to add two TermQuerys to a BooleanQuery, one for each word, and make them not required.

Ernesto: My problem is that the user enters a phrase, and I search for any word in the phrase, not the entire phrase. Do I need to parse the string, take it word by word, and add a TermQuery for each word?

Yes. If you have a text string of multiple terms that you want added to a boolean query, you could do so programmatically by analyzing the string as I do in my article's AnalyzerDemo, or by parsing it through some mechanism other than QueryParser, and add each as a TermQuery. You're on track! Erik
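Erik's advice, analyze the phrase into terms and add one optional (OR) clause per term, looks roughly like this. The whitespace/lowercase split below is a crude stand-in for a real Lucene Analyzer, and the method renders the clauses as a query string rather than building an actual BooleanQuery, so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Models "one optional TermQuery per analyzed term": split the user's
// phrase into tokens and emit a field:term clause for each, joined by OR.
// In real Lucene code each clause would be a TermQuery added to a
// BooleanQuery with required=false.
class OrQueryBuilder {
    static String build(String field, String phrase) {
        List<String> clauses = new ArrayList<>();
        for (String token : phrase.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // lowercase mimics what an analyzer would do to the token
                clauses.add(field + ":" + token.toLowerCase(Locale.ROOT));
            }
        }
        return String.join(" OR ", clauses);
    }
}
```

So the user's phrase "dot java" against the contents field becomes two optional clauses, matching documents that contain either word.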
Re: Sentence Endings: IndexWriter.maxFieldLength and Token.setPositionIncrement()
Jochen, Someone else recently made a similar, reasonable complaint. I agree that this should be fixed. The fastest way to get it fixed would be to submit a patch to lucene-dev, with a test case, etc. Doug

Jochen Frey wrote: [quoted message trimmed]
Lucene and JavaHelp
Has anyone thought about or used Lucene to build an indexed, searchable help system, either server- or application-based?

-M.
--
Mark Diggory
Software Developer
Harvard MIT Data Center
http://osprey.hmdc.harvard.edu