Re: Pool of IndexReaders or Pool of Searchers?
Can you supply details on the config tested? Vince Anson Lau wrote: Hi, When I did some load testing on a lucene powered search app, using a pool of index searchers doesn't give me any more search per second than just using a singleton index searcher. Anson Quoting [EMAIL PROTECTED]: Hi, I have multiple threads reading an index. Should they all be using the same IndexReader and using a pool of IndexSearchers? Or should they be using a pool of IndexReaders? Basically, one reader or many? Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Why is Field.java final?
Hi: On the same thought, how about the org.apache.lucene.analysis.Token class. Can we make it non-final? I sent out this question 3 different times and still got no responses... Thanks -John On Mon, 12 Jul 2004 18:33:04 -0700, Kevin A. Burton <[EMAIL PROTECTED]> wrote: > Doug Cutting wrote: > > > Kevin A. Burton wrote: > > > >> I was going to create a new IDField class which just calls super( > >> name, value, false, true, false) but noticed I was prevented because > >> Field.java is final? > > > > > > You don't need to subclass to do this, just a static method somewhere. > > > >> Why is this? I can't see any harm in making it non-final... > > > > > > Field and Document are not designed to be extensible. They are > > persisted in such a way that added methods are not available when the > > field is restored. In other words, when a field is read, it always > > constructs an instance of Field, not a subclass. > > Thats fine... I think thats acceptable behavior. I don't think anyone > would assume that inner vars are restored or that the field is serialized. > > Not a big deal but it would be nice... > > > > -- > > Please reply using PGP. > >http://peerfear.org/pubkey.asc > >NewsMonster - http://www.newsmonster.org/ > > Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 > AIM/YIM - sfburtonator, Web - http://peerfear.org/ > GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 > IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: Bug 30058 posted Which of course is here: http://issues.apache.org/bugzilla/show_bug.cgi?id=30058 Is this the source of the revision you modified? http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html Also what version of Lucene? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Why is Field.java final?
Doug Cutting wrote:
> Kevin A. Burton wrote:
>> I was going to create a new IDField class which just calls super(name, value, false, true, false) but noticed I was prevented because Field.java is final?
>
> You don't need to subclass to do this, just a static method somewhere.
>
>> Why is this? I can't see any harm in making it non-final...
>
> Field and Document are not designed to be extensible. They are persisted in such a way that added methods are not available when the field is restored. In other words, when a field is read, it always constructs an instance of Field, not a subclass.

That's fine... I think that's acceptable behavior. I don't think anyone would assume that inner vars are restored or that the field is serialized.

Not a big deal but it would be nice...

--
Please reply using PGP.
http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
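Doug's suggestion of a static method in place of a subclass can be sketched roughly as follows. This is only an illustration, not Lucene code: `Field` here is a hypothetical stand-in for a final class with the five-argument constructor quoted in the thread, and `Fields.id` is an invented helper name.

```java
// Hypothetical stand-in for a final Field class with the
// (name, value, store, index, token) constructor quoted in the thread.
final class Field {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    Field(String name, String value, boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }
}

// Invented helper class: a static factory gives the "IDField" convenience
// without subclassing the final class.
final class Fields {
    private Fields() {}

    // Same flags as the super(name, value, false, true, false) call in the question.
    static Field id(String name, String value) {
        return new Field(name, value, false, true, false);
    }
}

class StaticFactoryDemo {
    public static void main(String[] args) {
        Field f = Fields.id("id", "doc-42");
        System.out.println(f.name + "=" + f.value
                + " stored=" + f.stored
                + " indexed=" + f.indexed
                + " tokenized=" + f.tokenized);
    }
}
```

Because the factory returns a plain `Field`, nothing is lost when the field is later read back as a `Field`, which is the concern Doug raises about subclasses.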
Re: Field.java -> STORED, NOT_STORED, etc...
Doug Cutting wrote:
> It would be best to get the compiler to check the order.
>
> If we change this, why not use type-safe enumerations:
>
> http://www.javapractices.com/Topic1.cjp
>
> The calls would look like:
>
>   new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
>
> Stored could be implemented as the nested class:
>
>   public final class Stored {
>     private Stored() {}
>     public static final Stored YES = new Stored();
>     public static final Stored NO = new Stored();
>   }

+1... I'm not in love with this pattern, but since Java < 1.5 doesn't support enum it's better than nothing.

I also didn't want to submit a recommendation that would break APIs. I assume the old API would be deprecated?

Kevin
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Doug Cutting wrote:
>> I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get 110% improvement in performance (number of searches per second).
>
> That's impressive! Good job finding a bottleneck!

Wow... that's awesome. We have all dual XEONs with Hyperthreading and kernel 2.6 so I imagine in this situation we'd see an improvement too.

I wonder if we could break this out into a patch for legacy Lucene users.

I'd like to see the stacktrace too. We're using a lot of synchronized code (Hashtable, Vector, etc) so I'm willing to bet this is happening in other places.

>> My question is: do the fields byNumber and byName have to be synchronized, and what can happen if I change them to ArrayList and HashMap, which are not synchronized? Can this corrupt the index or the integrity of the results?
>
> I think that is a safe change. FieldInfos is only modified by DocumentWriter and SegmentMerger, and there is no possibility of other threads accessing those instances. Please submit a patch to the developer mailing list.

That would be great!

Kevin
Optimizing the index
Is optimizing the index something you should do periodically even if you are continually adding documents? I guess another way of asking the question is: does optimization have any negative effects on the speed of adding documents? -- Bill Tschumy Otherwise -- Austin, TX http://www.otherwise.com
Re: Field.java -> STORED, NOT_STORED, etc...
I think this is a great idea. I've never used the Field.Keyword and Field.Text type methods because I can never remember what their 3-boolean-argument equivalents are. I always stick the constructor format in a comment somewhere and use it.

>>> Doug Cutting <[EMAIL PROTECTED]> 07/11/04 12:03PM >>>
Doug Cutting wrote:
> The calls would look like:
>
>   new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
>
> Stored could be implemented as the nested class:
>
>   public final class Stored {
>     private Stored() {}
>     public static final Stored YES = new Stored();
>     public static final Stored NO = new Stored();
>   }

Actually, while we're at it, Indexed and Tokenized are confounded. A single entry would be better, something like:

  public final class Index {
    private Index() {}
    public static final Index NO = new Index();
    public static final Index TOKENIZED = new Index();
    public static final Index UN_TOKENIZED = new Index();
  }

then calls would look like just:

  new Field("name", "value", Store.YES, Index.TOKENIZED);

BTW, I think Stored would be better named Store too.

BooleanQuery's required and prohibited flags could get the same treatment, with the addition of a nested class like:

  public final class Occur {
    private Occur() {}
    public static final Occur MUST_NOT = new Occur();
    public static final Occur SHOULD = new Occur();
    public static final Occur MUST = new Occur();
  }

and adding a boolean clause would look like:

  booleanQuery.add(new TermQuery(...), Occur.MUST);

Then we can deprecate the old methods.

Comments?

Doug
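To make the proposal concrete, here is a small self-contained sketch of the type-safe enumeration pattern with the `Store` and `Index` names from Doug's message. `SketchField` is a hypothetical stand-in for the real `Field` class, and the `label` member and `toString()` are additions made only so the demo prints something readable; none of this is the actual Lucene implementation.

```java
// Type-safe enumeration pattern (pre-Java 5 style), using names from the thread.
final class Store {
    private final String label; // added for readable output; not in the proposal
    private Store(String label) { this.label = label; }

    static final Store YES = new Store("YES");
    static final Store NO = new Store("NO");

    @Override
    public String toString() { return "Store." + label; }
}

final class Index {
    private final String label; // added for readable output; not in the proposal
    private Index(String label) { this.label = label; }

    static final Index NO = new Index("NO");
    static final Index TOKENIZED = new Index("TOKENIZED");
    static final Index UN_TOKENIZED = new Index("UN_TOKENIZED");

    @Override
    public String toString() { return "Index." + label; }
}

// Hypothetical stand-in for Field, taking the two typed options.
final class SketchField {
    final String name;
    final String value;
    final Store store;
    final Index index;

    SketchField(String name, String value, Store store, Index index) {
        this.name = name;
        this.value = value;
        this.store = store;
        this.index = index;
    }

    // Instances are singletons, so identity comparison is sufficient.
    boolean isStored() { return store == Store.YES; }
    boolean isIndexed() { return index != Index.NO; }
    boolean isTokenized() { return index == Index.TOKENIZED; }
}

class TypeSafeEnumDemo {
    public static void main(String[] args) {
        SketchField f = new SketchField("name", "value", Store.YES, Index.TOKENIZED);
        System.out.println(f.store + ", " + f.index
                + " -> stored=" + f.isStored()
                + " indexed=" + f.isIndexed()
                + " tokenized=" + f.isTokenized());
    }
}
```

The point of the pattern is that swapping the arguments, as in `new SketchField("n", "v", Index.NO, Store.YES)`, fails to compile, whereas three bare booleans in the wrong order compile silently. On Java 5 and later a plain `enum` gives the same guarantee with less code.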
Re: Could search results give an idea of which field matched
See the explain functionality in the Javadocs and previous threads. You can ask Lucene to explain why it got the results it did for a given hit. >>> [EMAIL PROTECTED] 07/12/04 04:52PM >>> I search the index on multiple fields. Could the search results also tell me which field matched so that the document was selected? From what I can tell, only the document number and a score are returned, is there a way to also find out what was the field(s) of the document matched the query? Sildy
Could search results give an idea of which field matched
I search the index on multiple fields. Could the search results also tell me which field matched so that the document was selected? From what I can tell, only the document number and a score are returned, is there a way to also find out what was the field(s) of the document matched the query? Sildy
Re: Exact match search
On Monday 12 July 2004 21:17, [EMAIL PROTECTED] wrote: > I want to match documents that exactly equal a certain value, not just > contain it. Just don't tokenize your Fields, and make sure that the query also doesn't get tokenized (the easiest way to ensure that is probably to not use QueryParser but just build a TermQuery directly from the user's input). Regards Daniel -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
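Why leaving the field untokenized gives exact matching can be seen with a toy inverted index. The sketch below is plain Java and not Lucene; `ToyIndex` and its methods are invented for illustration, and whitespace splitting stands in for whatever analyzer would really be used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy in-memory inverted index (NOT Lucene), used only to show the effect
// of tokenizing versus not tokenizing a field value.
final class ToyIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final boolean tokenize;
    private int nextDocId = 0;

    ToyIndex(boolean tokenize) {
        this.tokenize = tokenize;
    }

    // Tokenized: one term per whitespace-separated word.
    // Untokenized: the whole value becomes a single term.
    int add(String value) {
        int docId = nextDocId++;
        String[] terms = tokenize ? value.trim().split("\\s+") : new String[] { value };
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Single-term lookup with no analysis of the input, in the spirit of
    // building a term query directly instead of going through a query parser.
    List<Integer> termQuery(String term) {
        TreeSet<Integer> hits = postings.get(term);
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }
}

class ExactMatchDemo {
    public static void main(String[] args) {
        String[] docs = { "foo", "foo bar", "bar foo" };
        ToyIndex tokenized = new ToyIndex(true);
        ToyIndex untokenized = new ToyIndex(false);
        for (String d : docs) {
            tokenized.add(d);
            untokenized.add(d);
        }
        System.out.println("tokenized:   " + tokenized.termQuery("foo"));   // [0, 1, 2]
        System.out.println("untokenized: " + untokenized.termQuery("foo")); // [0]
    }
}
```

In the untokenized index the only term for the second document is the full string "foo bar", so a lookup for "foo" cannot reach it.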
Re: Exact match search
How do you go about getting an exact match for a document that can contain hundreds of words? As I understand it, when you tokenize a document it is broken into words so really all the results you show are exact matches. At 03:17 PM 12/07/2004, you wrote: Hi, I want to match documents that exactly equal a certain value, not just contain it. If I search for "foo" in Lucene I get back documents like these: "foo" "foo bar" "bar foo" Is there a way to just get the ones that exactly equal the value I'm searching for? In this case, I want to only return the first document (ex. "foo"). I have a workaround where I store all the values and then after I get the hits I go through them and skip those that don't match. But this will return result sets of hundreds of documents that I don't need. Help! - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com This email message is intended only for the addressee(s) and contains information that may be confidential and/or copyright. If you are not the intended recipient please notify the sender by reply email and immediately delete this email. Use, disclosure or reproduction of this email by anyone other than the intended recipient(s) is strictly prohibited. No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.
Exact match search
Hi, I want to match documents that exactly equal a certain value, not just contain it. If I search for "foo" in Lucene I get back documents like these: "foo" "foo bar" "bar foo" Is there a way to just get the ones that exactly equal the value I'm searching for? In this case, I want to only return the first document (ex. "foo"). I have a workaround where I store all the values and then after I get the hits I go through them and skip those that don't match. But this will return result sets of hundreds of documents that I don't need. Help! - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene Search has poor cpu utilization on a 4-CPU machine
Bug 30058 posted Aviran -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, July 12, 2004 1:38 PM To: Lucene Users List Subject: Re: Lucene Search has poor cpu utilization on a 4-CPU machine Aviran wrote: > I use Lucene 1.4 final > > Here is the thread dump for one blocked thread (If you want a full > thread dump for all threads I can do that too) Thanks. I think I get the point. I recently removed a synchronization point higher in the stack, so that now this one shows up! Whether or not you submit a patch, please file a bug report in Bugzilla with your proposed change, so that we don't lose track of this issue. Thanks, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Browse by Letter within a Category
On Monday 12 July 2004 17:48, O'Hare, Thomas wrote:
> Does Lucene have a "beginning of line" query syntax, like the regular
> expression ^ symbol? For example,
>
> title:^A*

If your title isn't tokenized the "^" is implicit, I think. As usual, if your title is tokenized you can easily add another field with the same value as title, but in untokenized form.

> What is the best way to sort by a date? I currently have a date field
> that is used for searching in the format MMDD as a Field.Keyword.

Lucene 1.4 added an IndexSearcher.search() method that takes a Sort object, which lets you sort by any field. Your date field can be used for that, as it has the correct format (because sorting it alphabetically will give you the right order already).

Regards
Daniel

--
http://www.danielnaber.de
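Daniel's remark that alphabetical order already gives date order holds when the key is fixed-width with the most significant part first. The format in the quoted question reads "MMDD", which looks truncated, so the sketch below assumes zero-padded yyyyMMdd strings; that format, and the class and method names, are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DateKeySortDemo {
    // Assumed key format: zero-padded yyyyMMdd, e.g. 20040712.
    static final DateTimeFormatter KEY = DateTimeFormatter.BASIC_ISO_DATE;

    static String toKey(LocalDate date) {
        return date.format(KEY);
    }

    // Sorts the keys as plain strings; no date parsing is involved in the sort.
    static List<String> sortedKeys(List<LocalDate> dates) {
        List<String> keys = new ArrayList<>();
        for (LocalDate d : dates) {
            keys.add(toKey(d));
        }
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = new ArrayList<>();
        dates.add(LocalDate.of(2004, 7, 12));
        dates.add(LocalDate.of(2003, 12, 31));
        dates.add(LocalDate.of(2004, 1, 5));
        System.out.println(sortedKeys(dates)); // [20031231, 20040105, 20040712]
    }
}
```

A key without the year, or without zero padding, would not have this property: "0105" sorts before "1231" regardless of year, and "2004-7-12" sorts after "2004-12-01".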
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: I use Lucene 1.4 final Here is the thread dump for one blocked thread (If you want a full thread dump for all threads I can do that too) Thanks. I think I get the point. I recently removed a synchronization point higher in the stack, so that now this one shows up! Whether or not you submit a patch, please file a bug report in Bugzilla with your proposed change, so that we don't lose track of this issue. Thanks, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene Search has poor cpu utilization on a 4-CPU machine
I use Lucene 1.4 final

Here is the thread dump for one blocked thread (If you want a full thread dump for all threads I can do that too)

"Thread-32" daemon prio=1 tid=0x082334c0 nid=0xa66 waiting for monitor entry [4f385000..4f38687c]
    at java.util.Vector.elementAt(Vector.java:430)
    - waiting to lock <0x452b93a8> (a java.util.Vector)
    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
    at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
    at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
    at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
    at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:51)
    at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:364)
    at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:59)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:154)
    at gov.gsa.search.SearcherByPageAndSortedField.search(SearcherByPageAndSortedField.java:317)
    at gov.gsa.search.SearcherByPageAndSortedField.search(SearcherByPageAndSortedField.java:203)
    at gov.gsa.search.grants.SearchGrants.searchByPageAndSortedField(SearchGrants.java:308)
    at gov.gsa.search.grants.SearchServlet.searchByIndex(SearchServlet.java:1541)
    at gov.gsa.search.grants.SearchServlet.getResults(SearchServlet.java:1325)
    at gov.gsa.search.grants.SearchServlet.doGet(SearchServlet.java:500)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2415)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:172)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:223)
    at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:261)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:360)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:604)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:562)
    at org.apache.jk.common.SocketConnection.runIt(ChannelSocket.java:679)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:619)
    at java.lang.Thread.run(Thread.java:534)

And how do I submit a patch to the developer mailing list? Just
Re: Anyone use MultiSearcher class
Hi Don,

Yes, I'm using the MultiSearcher (in Zilverline), and have seen no serious performance issues with it. The app performs well with multiple indexes; it responds so quickly (with 100k+ documents) that I haven't even taken the time to measure the difference from a single-index search.

Michael Franken

Don Vaillancourt wrote:
> Hello,
>
> Has anyone used the Multisearcher class?
>
> I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand if it took 3 to 4 times longer to search due to sorting the two search results and stuff, but why 8 times longer.
>
> Is there some optimization that can be done to hasten the search? Or should I just write my own MultiSearcher. The problem though is that there is no way for me to create my own Hits object (no methods are available and the class is final).
>
> Anyone have any clue?
>
> Thanks
>
> Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com
Re:Anyone use MultiSearcher class
Actually, after I implemented the MultiSearcher, I had totally forgotten about this class, although it isn't clear what it does. I'm assuming that it uses threads to search multiple indexes. I'll have to try it.

Thanks

At 01:10 PM 12/07/2004, you wrote:
> I think there is a ParallelMultiSearcher class that extends MultiSearcher. Have you tried it?
>
> -- Start of original message ---
> From: Don Vaillancourt <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Cc:
> Date: Mon, 12 Jul 2004 12:36:29 -0400
> Subject: Anyone use MultiSearcher class
>
> Hello,
>
> Has anyone used the Multisearcher class?
>
> I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand if it took 3 to 4 times longer to search due to sorting the two search results and stuff, but why 8 times longer.
>
> Is there some optimization that can be done to hasten the search? Or should I just write my own MultiSearcher. The problem though is that there is no way for me to create my own Hits object (no methods are available and the class is final).
>
> Anyone have any clue?
>
> Thanks

Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com
Re:Anyone use MultiSearcher class
I think there is a ParallelMultiSearcher class that extends MultiSearcher. Have you tried it?

-- Start of original message ---
From: Don Vaillancourt <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Cc:
Date: Mon, 12 Jul 2004 12:36:29 -0400
Subject: Anyone use MultiSearcher class

> Hello,
>
> Has anyone used the Multisearcher class?
>
> I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand if it took 3 to 4 times longer to search due to sorting the two search results and stuff, but why 8 times longer.
>
> Is there some optimization that can be done to hasten the search? Or should I just write my own MultiSearcher. The problem though is that there is no way for me to create my own Hits object (no methods are available and the class is final).
>
> Anyone have any clue?
>
> Thanks
>
> Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote:
> First let me explain what I found out. I'm running Lucene on a 4 CPU server. While doing some stress tests I've noticed (by doing a full thread dump) that searching threads are blocked on the method:
>
>   public FieldInfo fieldInfo(int fieldNumber)
>
> This causes significant CPU idle time.

What version of Lucene are you running? Also, can you please send the stack traces of the blocked threads, or at least a description of them? I'd be interested to see what context this happens in. In particular, which IndexReader and Searcher/Scorer/Weight methods does it happen under?

> I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get 110% improvement in performance (number of searches per second).

That's impressive! Good job finding a bottleneck!

> My question is: do the fields byNumber and byName have to be synchronized, and what can happen if I change them to ArrayList and HashMap, which are not synchronized? Can this corrupt the index or the integrity of the results?

I think that is a safe change. FieldInfos is only modified by DocumentWriter and SegmentMerger, and there is no possibility of other threads accessing those instances. Please submit a patch to the developer mailing list.

Doug
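Doug's reasoning is that an unsynchronized collection is fine as long as no thread modifies it while others read it. A minimal sketch of that situation follows; `FieldNames` is a hypothetical stand-in and not Lucene's `FieldInfos`, and the thread and lookup counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a lookup table that is built once and then only read.
final class FieldNames {
    // Unsynchronized list, never mutated after construction. The final field
    // ensures reader threads see the fully built list.
    private final List<String> byNumber;

    FieldNames(List<String> names) {
        this.byNumber = Collections.unmodifiableList(new ArrayList<>(names));
    }

    // No lock is taken here, unlike Vector.elementAt.
    String fieldName(int fieldNumber) {
        return byNumber.get(fieldNumber);
    }

    int size() {
        return byNumber.size();
    }
}

class ReadOnlyLookupDemo {
    // Runs the given number of reader threads and returns the summed length
    // of every name they looked up.
    static long concurrentLookups(FieldNames names, int threads, int lookupsPerThread) {
        AtomicLong totalLength = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                long local = 0;
                for (int i = 0; i < lookupsPerThread; i++) {
                    local += names.fieldName(i % names.size()).length();
                }
                totalLength.addAndGet(local);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for readers", e);
            }
        }
        return totalLength.get();
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>();
        source.add("title");
        source.add("body");
        source.add("date");
        FieldNames names = new FieldNames(source);
        System.out.println(concurrentLookups(names, 4, 300_000));
    }
}
```

The sketch does not measure throughput; it only shows the access pattern in which dropping synchronization is safe. If any thread could add to the list while searches run, this reasoning would no longer apply.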
Anyone use MultiSearcher class
Hello, Has anyone used the Multisearcher class? I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand if it took 3 to 4 times longer to search due to sorting the two search results and stuff, but why 8 times longer. Is there some optimization that can be done to hasten the search? Or should I just write my own MultiSearcher. The problem though is that there is no way for me to create my own Hits object (no methods are available and the class is final). Anyone have any clue? Thanks Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com This email message is intended only for the addressee(s) and contains information that may be confidential and/or copyright. If you are not the intended recipient please notify the sender by reply email and immediately delete this email. Use, disclosure or reproduction of this email by anyone other than the intended recipient(s) is strictly prohibited. No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.
Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size
[EMAIL PROTECTED] wrote: What I really would like to see are some best practices or some advice from some users who are working with really large indices how they handle this situation, or why they don't have to care about it or maybe why I am completely missing the point ;-)) Many folks with really large indexes just don't permit things like wildcard and range searches. For example, Google supports no wildcards and has only recently added limited numeric range searching. Yahoo! supports neither. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Browse by Letter within a Category
You can use http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/spans/SpanFirstQuery.html Pete - Original Message - From: "O'Hare, Thomas" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Monday, July 12, 2004 11:48 AM Subject: RE: Browse by Letter within a Category Thank you for the suggestion. I implemented what you recommended and now having it working. I'm sorting on the first word in the title. Does Lucene have a "beginning of line" query syntax, like the regular expression ^ symbol? For example, title:^A* What is the best way to sort by a date? I currently have a date field that is used for searching in the format MMDD as a Field.Keyword. Thanks, Tom _ From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Friday, July 09, 2004 4:34 AM To: Lucene Users List Subject: Re: Browse by Letter within a Category On Friday 09 July 2004 04:27, O'Hare, Thomas wrote: > Searcher.search("category:\"Products\" AND title:\"A*\"", new > Sort("title")); You can only sort on fields which are not tokenized I think. So add an extra field with the title, but untokenized, just for sorting. Also, "A*" might slow down the query execution so you might want to add another field which just contains the first letter so there's no need for the asterisk. Regards Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Browse by Letter within a Category
Thank you for the suggestion. I implemented what you recommended and now having it working. I'm sorting on the first word in the title. Does Lucene have a "beginning of line" query syntax, like the regular expression ^ symbol? For example, title:^A* What is the best way to sort by a date? I currently have a date field that is used for searching in the format MMDD as a Field.Keyword. Thanks, Tom _ From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Friday, July 09, 2004 4:34 AM To: Lucene Users List Subject: Re: Browse by Letter within a Category On Friday 09 July 2004 04:27, O'Hare, Thomas wrote: > Searcher.search("category:\"Products\" AND title:\"A*\"", new > Sort("title")); You can only sort on fields which are not tokenized I think. So add an extra field with the title, but untokenized, just for sorting. Also, "A*" might slow down the query execution so you might want to add another field which just contains the first letter so there's no need for the asterisk. Regards Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexSearcher usage and caching?
Cache/reuse your IndexSearcher. On every search, check if the index has changed (there are methods for that). If it has changed, create a new IndexSearcher and assign it to your IndexSearcher variable, and do not close the old IndexSearcher, just in case something is still using it. Otis --- Joel Shellman <[EMAIL PROTECTED]> wrote: > I'm working on a document management system using lucene to search > through all the documents. > > This means that I'll be adding/updating/deleting documents at the > same > time searches are going on. > > I thought to create an IndexSearcher and reuse it throughout, but > that > doesn't seem to work. If I do a search, then add a document, and do > another search with the same IndexSearcher, it won't find the newly > added document. > > I'd rather not have to create a new IndexSearcher for every query... > do > I have to? > > Thanks, > > -joel shellman
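The check-and-replace pattern Otis describes can be sketched without any Lucene types. In the sketch below, ReopeningCache is a hypothetical class; the version supplier stands in for whatever "has the index changed" check is used (in Lucene 1.4 this is presumably something like IndexReader.getCurrentVersion, which is an assumption here), and the opener stands in for creating a new IndexSearcher.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical holder: hands out one shared instance until the version changes.
final class ReopeningCache<T> {
    private final LongSupplier currentVersion; // assumed: reads the index version
    private final Supplier<T> opener;          // assumed: opens a new searcher
    private T cached;
    private long cachedVersion;

    ReopeningCache(LongSupplier currentVersion, Supplier<T> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
    }

    /** Returns the cached instance, replacing it first if the version has changed. */
    synchronized T get() {
        long version = currentVersion.getAsLong();
        if (cached == null || version != cachedVersion) {
            // The old instance is deliberately not closed: another thread
            // may still be running a search against it.
            cached = opener.get();
            cachedVersion = version;
        }
        return cached;
    }

    public static void main(String[] args) {
        long[] version = {1};
        ReopeningCache<Object> cache = new ReopeningCache<>(() -> version[0], Object::new);
        Object first = cache.get();
        System.out.println(first == cache.get()); // same instance while the version is unchanged
        version[0] = 2;                           // simulate an index update
        System.out.println(first == cache.get()); // a new instance after the change
    }
}
```

Leaving the old instance unclosed follows the advice above; a production version would need some way to release old searchers once no thread is using them.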
IndexSearcher usage and caching?
I'm working on a document management system using lucene to search through all the documents. This means that I'll be adding/updating/deleting documents at the same time searches are going on. I thought to create an IndexSearcher and reuse it throughout, but that doesn't seem to work. If I do a search, then add a document, and do another search with the same IndexSearcher, it won't find the newly added document. I'd rather not have to create a new IndexSearcher for every query... do I have to? Thanks, -joel shellman
RE: Field.java -> STORED, NOT_STORED, etc...
I have 2 suggestions: 1) use Eclipse, or an IDE that references the javadoc with mouseovers 2) if you are going to create constants, consider using bit flags. Then each constant can have a power-of-two value, i.e. STORED = 1 INDEXED = 2 TOKENIZED = 4 Then you can have the constructor look like: new Field("name", "value", STORED + TOKENIZED) The constructor would break that down bitwise! -Original Message- From: Kevin A. Burton [mailto:[EMAIL PROTECTED] Sent: Sunday, July 11, 2004 5:05 AM To: Lucene Users List Subject: Field.java -> STORED, NOT_STORED, etc... I've been working with the Field class doing index conversions from an old index format to my new external content store proposal (thus the email about the 14M convert). Anyway... I find the whole Field.Keyword, Field.Text thing confusing. The main problem is that the constructor to Field just takes booleans and if you forget the ordering of the booleans it's very confusing. new Field( "name", "value", true, false, true ); So looking at that you have NO idea what it's doing without fetching javadoc. So I added a few constants to my class: new Field( "name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED ); which IMO is a lot easier to maintain. Why not add these constants to Field.java: public static final boolean STORED = true; public static final boolean NOT_STORED = false; public static final boolean INDEXED = true; public static final boolean NOT_INDEXED = false; public static final boolean TOKENIZED = true; public static final boolean NOT_TOKENIZED = false; Of course you still have to remember the order but this becomes a lot easier to maintain. Kevin
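The bit-flag suggestion can be sketched as follows. FieldFlags is a hypothetical class, not part of Lucene; it only shows how a constructor taking a single int could recover the three booleans. The sketch combines flags with | rather than +, since + gives the wrong result when the same flag is added twice (STORED + STORED equals INDEXED).

```java
// Hypothetical flag constants and decoding helpers; not part of Lucene's Field class.
final class FieldFlags {
    static final int STORED = 1;     // binary 001
    static final int INDEXED = 2;    // binary 010
    static final int TOKENIZED = 4;  // binary 100

    static boolean isStored(int flags)    { return (flags & STORED) != 0; }
    static boolean isIndexed(int flags)   { return (flags & INDEXED) != 0; }
    static boolean isTokenized(int flags) { return (flags & TOKENIZED) != 0; }

    /** Renders the flags the way the three-boolean constructor would receive them. */
    static String describe(int flags) {
        return "store=" + isStored(flags)
             + ", index=" + isIndexed(flags)
             + ", token=" + isTokenized(flags);
    }

    public static void main(String[] args) {
        System.out.println(describe(STORED | TOKENIZED));
    }
}
```

Unlike the boolean constants, the order of the flags no longer matters, which addresses the "you still have to remember the order" concern in the original message.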
Lucene Search has poor cpu utilization on a 4-CPU machine
Hi all, First let me explain what I found out. I'm running Lucene on a 4 CPU server. While doing some stress tests I've noticed (by doing a full thread dump) that searching threads are blocked on the method: public FieldInfo fieldInfo(int fieldNumber) This causes significant CPU idle time. I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get a 110% improvement in performance (number of searches per second). My question is: do the fields byNumber and byName have to be synchronized, and what can happen if I change them to ArrayList and HashMap, which are not synchronized? Can this corrupt the index or the integrity of the results? Thanks, Aviran
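Whether the unsynchronized collections are safe depends on whether anything modifies them while searches are running, which the message does not establish. The sketch below only illustrates the general Java rule: an ArrayList that is fully built before it is shared through a final field, and never modified afterwards, can be read by many threads without locking. ReadOnlyFieldTable is a hypothetical stand-in, not the FieldInfos class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical read-only lookup table, populated once and never modified.
final class ReadOnlyFieldTable {
    private final List<String> byNumber; // final field: safely published by the constructor

    ReadOnlyFieldTable(List<String> names) {
        // Defensive copy, then wrapped so later writes fail fast instead of racing.
        this.byNumber = Collections.unmodifiableList(new ArrayList<>(names));
    }

    String fieldName(int fieldNumber) { return byNumber.get(fieldNumber); }

    int size() { return byNumber.size(); }

    /** Runs lock-free lookups from several threads and returns how many succeeded. */
    static int concurrentLookups(ReadOnlyFieldTable table, int threads, int lookupsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Integer> task = () -> {
                int found = 0;
                for (int i = 0; i < lookupsPerThread; i++) {
                    if (table.fieldName(i % table.size()) != null) found++;
                }
                return found;
            };
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) results.add(pool.submit(task));
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        ReadOnlyFieldTable table = new ReadOnlyFieldTable(Arrays.asList("title", "body", "date"));
        System.out.println(concurrentLookups(table, 4, 1000));
    }
}
```

If fields can be added to the table after other threads have started reading it, this guarantee no longer holds and some form of synchronization is still needed.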
RE: Understanding TooManyClauses-Exception and Query-RAM-size
Hi Kevin, thanks for your answer. That could really solve the problem with the modificationDate or similar fields. But what if you create queries that ultimately return only a few hits but contain a RangeQuery that searches, for example, an ID field of some kind, where you have to cover a wide range of IDs? I think in general you will always have fields that contain lots of different terms, and searching even a small range of one of these fields may lead to this Exception. The bottom line, in my opinion, is that you have to take care yourself not to create certain types of queries that could lead to this Exception. The type of query depends completely on the index, which means that as the index grows you have to restrict the ranges of more and more RangeQueries. One way would be to catch this Exception and gracefully present a message asking the user to further restrict his query. But this could lead to some confusion if the user knows that he has entered a very restrictive query in addition to some RangeQuery that internally leads to this Exception. What I really would like to see are some best practices or some advice from users who are working with really large indices: how they handle this situation, or why they don't have to care about it, or maybe why I am completely missing the point ;-)) Thanks, Martin -Original Message- From: Kevin A. Burton [mailto:[EMAIL PROTECTED] Sent: Thursday, July 8, 2004 21:11 To: Lucene Users List Subject: Re: Understanding TooManyClauses-Exception and Query-RAM-size [EMAIL PROTECTED] wrote: >Hi, > >a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went >smoothly, but we are experiencing some problems with that new constant limit > > > maxClauseCount=1024 > >which leads to Exceptions of type > > org.apache.lucene.search.BooleanQuery$TooManyClauses > >when certain RangeQueries are executed (in fact, we get this Exception when >we execute certain Wildcard queries, too). Although we are working with a >fairly small index with about 35,000 documents, we encounter this Exception >when we search for the property "modificationDate". For example > > modificationDate:[00 TO 0dwc970kw] > > > We talked about this the other day. http://wiki.apache.org/jakarta-lucene/IndexingDateFields Find out what type of precision you need and use that. If you only need days or hours or minutes then use that. Millis is just too small. We're only using days and have queries for just the last 7 days as max so this really works out well... Kevin
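Kevin's point about precision can be made concrete with a little arithmetic. A range query that is expanded into one clause per matching term hits the 1024 limit as soon as the range covers more than 1024 distinct terms, so the number of distinct date terms matters. The sketch below is plain Java with hypothetical names (DateTerms, dayTerm); the yyyyMMdd format and the UTC time zone are assumptions made for illustration, not something the thread prescribes.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

// Hypothetical helper: counts how many distinct terms a set of timestamps produces.
final class DateTerms {

    /** Day-precision term, assumed format yyyyMMdd in UTC; sorts chronologically as text. */
    static String dayTerm(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    /** Distinct day terms for `count` timestamps spaced `stepMillis` apart. */
    static int distinctDayTerms(long startMillis, long stepMillis, int count) {
        Set<String> terms = new HashSet<>();
        for (int i = 0; i < count; i++) {
            terms.add(dayTerm(startMillis + i * stepMillis));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        // 10,000 documents, one per minute: at millisecond precision every
        // document has its own term (10,000 terms, far above 1024).
        // At day precision the same documents share a handful of terms.
        System.out.println(distinctDayTerms(0L, 60_000L, 10_000));
    }
}
```

This does not help with Martin's ID-range case, where every value is distinct by design; there the range itself has to be narrowed or the clause limit raised.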
Re: Lucene shouldn't use java.io.tmpdir
On Monday 12 July 2004 09:04, Morus Walter wrote: > Lucene might work around this by creating a directory in java.io.tmpdir > setting appropriate permissions (can that be done with Java, OS > independently?) and put the lock there. But if everybody can delete your lock files, that would be a security problem. Deleting stale locks isn't a problem, but how would one decide if a lock is stale? Regards Daniel -- http://www.danielnaber.de
Re: Lucene shouldn't use java.io.tmpdir
Doug Cutting writes: > Armbrust, Daniel C. wrote: > > The problem I ran into the other day with the new lock location is that Person A > > had started an index, ran into problems, erased the index and asked me to look at > > it. I tried to rebuild the index (in the same place on a Solaris machine) and > > found out that A) - her locks still existed, B) - I didn't have a clue where it > > put the locks on the Solaris machine (since no full path was given with the error > > - has this been fixed?) and C) - I didn't have permission to remove her locks. > > I think these problems have been fixed. When an index is created, all > old locks are first removed. And when a lock cannot be obtained, its > full pathname is printed. Can you replicate this with 1.4-final? > Hmm. If user A creates a lock in /tmp and Lucene crashes leaving the lock, user B won't be able to remove the lock (unless B is root) since /tmp usually has permissions drwxrwxrwt 12 root root 8192 Jul 12 08:50 tmp/ where the 't' means that normal users may delete only their own files (at least on Linux and IIRC Solaris). Or did I miss something? Lucene might work around this by creating a directory in java.io.tmpdir, setting appropriate permissions (can that be done with Java, OS independently?) and putting the lock there. Morus
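On the question of whether directory permissions can be set from Java: the java.nio.file API in current JDKs (which did not exist when this thread was written) can set POSIX permissions, but only on file systems that support them, so it is not fully OS independent. The sketch below is an illustration under that assumption; PrivateLockDir and the directory name passed to it are hypothetical, and nothing here reflects how Lucene itself places its locks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical helper: creates a lock directory under java.io.tmpdir.
final class PrivateLockDir {

    /** Creates (or reuses) tmpdir/name and restricts it to its owner where POSIX is available. */
    static Path create(String name) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), name);
        try {
            Files.createDirectories(dir);
            // Permissions are only applied on POSIX file systems (Linux, Solaris, macOS).
            // Elsewhere the directory keeps the platform defaults.
            if (FileSystems.getDefault().supportedFileAttributeViews().contains("posix")) {
                Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwx------"));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed naming scheme: one directory per user.
        System.out.println(create("lucene-locks-" + System.getProperty("user.name")));
    }
}
```

An owner-only directory keeps other users from deleting the lock files, but it also means user B still cannot clean up user A's stale lock, so the stale-lock question raised in the thread remains open.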