Re: Lucene shouldn't use java.io.tmpdir
On Monday 12 July 2004 09:04, Morus Walter wrote: Lucene might work around this by creating a directory in java.io.tmpdir, setting appropriate permissions (can that be done in Java in an OS-independent way?) and putting the lock there. But if everybody can delete your lock files, that would be a security problem. Deleting stale locks isn't a problem, but how would one decide if a lock is stale? Regards Daniel -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
AW: Understanding TooManyClauses-Exception and Query-RAM-size
Hi Kevin, thanks for your answer. That could really solve the problem with the modificationDate or similar fields. But what if you create queries that ultimately return only a few hits but contain a RangeQuery that searches, for example, an ID field of some kind, where you have to cover a wide range of IDs? I think in general you will always have fields that contain lots of different terms, and searching even a small range of one of these fields may lead to this exception. The bottom line, in my opinion, is that you have to take care yourself not to create certain types of queries that could lead to this exception. Which queries are affected depends entirely on the index, which means that as the index grows you have to restrict the ranges of more and more range queries. One way would be to catch this exception and gracefully present a message asking the user to restrict the query further. But this could lead to some confusion if the user knows that he has entered a very restrictive query in addition to some RangeQuery that internally leads to this exception. What I really would like to see are some best practices or some advice from some users who are working with really large indices how they handle this situation, or why they don't have to care about it or maybe why I am completely missing the point ;-)) Thanks, Martin -----Original Message----- From: Kevin A. Burton [mailto:[EMAIL PROTECTED] Sent: Thursday, 8 July 2004 21:11 To: Lucene Users List Subject: Re: Understanding TooManyClauses-Exception and Query-RAM-size [EMAIL PROTECTED] wrote: Hi, a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went smoothly, but we are experiencing some problems with the new constant limit maxClauseCount=1024, which leads to exceptions of type org.apache.lucene.search.BooleanQuery$TooManyClauses when certain RangeQueries are executed (in fact, we get this exception when we execute certain wildcard queries, too).
Although we are working with a fairly small index of about 35,000 documents, we encounter this exception when we search on the property modificationDate, for example modificationDate:[00 TO 0dwc970kw] We talked about this the other day. http://wiki.apache.org/jakarta-lucene/IndexingDateFields Find out what type of precision you need and use that. If you only need days or hours or minutes then use that. Millis is just too small. We're only using days and have queries for just the last 7 days as max so this really works out well... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
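To make the precision advice concrete, here is a minimal sketch in plain Java (no Lucene calls). It assumes a range query expands into one clause per distinct term in the range, and it uses decimal strings and a yyyyMMdd day key purely for illustration; the class name, the key format and the five-minute document interval are all assumptions, not anything taken from the thread or from Lucene's DateField encoding.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Set;
import java.util.TreeSet;

class DatePrecisionSketch {
    // Assumed limit, mirroring the maxClauseCount=1024 quoted in the thread.
    static final int MAX_CLAUSES = 1024;

    // Day-resolution key; the yyyyMMdd pattern is an assumption for illustration.
    static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Distinct terms when each timestamp is truncated to its day.
    static Set<String> dayTerms(Instant from, Instant to, Duration step) {
        Set<String> terms = new TreeSet<>();
        for (Instant t = from; t.isBefore(to); t = t.plus(step)) {
            terms.add(DAY.format(t));
        }
        return terms;
    }

    // Distinct terms when each timestamp is kept at millisecond resolution.
    static Set<String> milliTerms(Instant from, Instant to, Duration step) {
        Set<String> terms = new TreeSet<>();
        for (Instant t = from; t.isBefore(to); t = t.plus(step)) {
            terms.add(Long.toString(t.toEpochMilli()));
        }
        return terms;
    }

    public static void main(String[] args) {
        Instant from = Instant.parse("2004-07-01T00:00:00Z");
        Instant to = Instant.parse("2004-07-08T00:00:00Z");
        Duration step = Duration.ofMinutes(5); // one document every five minutes
        int millis = milliTerms(from, to, step).size();
        int days = dayTerms(from, to, step).size();
        System.out.println("millisecond terms: " + millis + ", over limit: " + (millis > MAX_CLAUSES));
        System.out.println("day terms: " + days + ", over limit: " + (days > MAX_CLAUSES));
    }
}
```

With one document every five minutes over seven days, the millisecond keys give 2016 distinct terms, which is above the assumed limit, while the day keys give 7.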
Lucene Search has poor cpu utilization on a 4-CPU machine
Hi all, First let me explain what I found out. I'm running Lucene on a 4-CPU server. While doing some stress tests I noticed (by taking a full thread dump) that searching threads are blocked on the method: public FieldInfo fieldInfo(int fieldNumber) This causes significant CPU idle time. I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get a 110% improvement in performance (number of searches per second). My question is: do the fields byNumber and byName have to be synchronized, and what can happen if I change them to ArrayList and HashMap, which are not synchronized? Can this corrupt the index or the integrity of the results? Thanks, Aviran
Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size
[EMAIL PROTECTED] wrote: What I really would like to see are some best practices or some advice from some users who are working with really large indices how they handle this situation, or why they don't have to care about it or maybe why I am completely missing the point ;-)) Many folks with really large indexes just don't permit things like wildcard and range searches. For example, Google supports no wildcards and has only recently added limited numeric range searching. Yahoo! supports neither. Doug
Anyone use MultiSearcher class
Hello, Has anyone used the MultiSearcher class? I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand if it took 3 to 4 times longer to search due to sorting the two search results and stuff, but why 8 times longer? Is there some optimization that can be done to speed up the search? Or should I just write my own MultiSearcher? The problem though is that there is no way for me to create my own Hits object (no methods are available and the class is final). Anyone have any clue? Thanks Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com This email message is intended only for the addressee(s) and contains information that may be confidential and/or copyright. If you are not the intended recipient please notify the sender by reply email and immediately delete this email. Use, disclosure or reproduction of this email by anyone other than the intended recipient(s) is strictly prohibited. No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: First let me explain what I found out. I'm running Lucene on a 4 CPU server. While doing some stress tests I've noticed (by doing full thread dump) that searching threads are blocked on the method: public FieldInfo fieldInfo(int fieldNumber) This causes for a significant cpu idle time. What version of Lucene are you running? Also, can you please send the stack traces of the blocked threads, or at least a description of them? I'd be interested to see what context this happens in. In particular, which IndexReader and Searcher/Scorer/Weight methods does it happen under? I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get 110% improvement in performance (number of searches per second). That's impressive! Good job finding a bottleneck! My question is: do the fields byNumber and byName have to be synchronized and what can happen if I'll change them to be ArrayList and HashMap which are not synchronized ? Can this corrupt the index or the integrity of the results? I think that is a safe change. FieldInfos is only modified by DocumentWriter and SegmentMerger, and there is no possibility of other threads accessing those instances. Please submit a patch to the developer mailing list. Doug
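The reasoning behind "safe change" can be sketched with a small stand-in class: if the lookup tables are filled once and never modified while other threads read them, unsynchronized ArrayList and HashMap reads need no locking. This is a hypothetical illustration, not Lucene's FieldInfos; the class name, methods and field names below are invented, and whether the real class is never mutated during search rests on Doug's statement above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a field lookup table; not Lucene's FieldInfos.
final class FieldTableSketch {
    // Unsynchronized collections, filled in the constructor and never changed afterwards.
    private final List<String> byNumber = new ArrayList<>();
    private final Map<String, Integer> byName = new HashMap<>();

    FieldTableSketch(List<String> names) {
        for (String name : names) {
            byName.put(name, byNumber.size());
            byNumber.add(name);
        }
    }

    int size() { return byNumber.size(); }

    String fieldName(int number) { return byNumber.get(number); }

    int fieldNumber(String name) {
        Integer number = byName.get(name);
        return number == null ? -1 : number;
    }

    // Runs read-only lookups from several threads and returns how many were consistent.
    static int concurrentLookups(FieldTableSketch table, int threads, int lookupsPerThread)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Integer> task = () -> {
                int ok = 0;
                for (int j = 0; j < lookupsPerThread; j++) {
                    int number = j % table.size();
                    if (table.fieldNumber(table.fieldName(number)) == number) {
                        ok++;
                    }
                }
                return ok;
            };
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        FieldTableSketch table = new FieldTableSketch(Arrays.asList("title", "body", "date"));
        System.out.println(concurrentLookups(table, 4, 1000));
    }
}
```

The sketch only covers the read-only case. If any thread added fields while others were searching, the unsynchronized collections would no longer be safe.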
Re:Anyone use MultiSearcher class
I think there is a ParallelMultiSearcher class that extends MultiSearcher. Have you tried it?
Re:Anyone use MultiSearcher class
Actually, after I implemented the MultiSearcher, I had totally forgotten about this class, although it isn't clear what it does. I'm assuming that it uses threads to search multiple indexes. I'll have to try it. Thanks At 01:10 PM 12/07/2004, you wrote: I think there is a ParallelMultiSearcher class that extends MultiSearcher. Have you tried it? Don Vaillancourt
Re: Anyone use MultiSearcher class
Hi Don, Yes, I'm using the MultiSearcher (in Zilverline), and have seen no serious performance issues with it. The app performs well with multiple indexes; it responds so quickly (with 100k+ documents) that I haven't even taken the time to measure the difference from a single-index search. Michael Franken
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: I use Lucene 1.4 final. Here is the thread dump for one blocked thread (if you want a full thread dump for all threads I can do that too). Thanks. I think I get the point. I recently removed a synchronization point higher in the stack, so that now this one shows up! Whether or not you submit a patch, please file a bug report in Bugzilla with your proposed change, so that we don't lose track of this issue. Thanks, Doug
Re: Browse by Letter within a Category
On Monday 12 July 2004 17:48, O'Hare, Thomas wrote: Does Lucene have a beginning of line query syntax, like the regular expression ^ symbol? For example, title:^A* If your title isn't tokenized the ^ is implicit, I think. As usual, if your title is tokenized you can easily add another field with the same value as title, but in untokenized form. What is the best way to sort by a date? I currently have a date field that is used for searching in the format MMDD as a Field.Keyword. Lucene 1.4 added an IndexSearcher.search() method that takes a Sort object, which lets you sort by any field. Your date field can be used for that, as it has the correct format (because sorting it alphabetically will give you the right order already). Regards Daniel
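The point about alphabetical order can be checked with a few lines of plain Java. The sketch below assumes fixed-width, zero-padded keys with the most significant part first (a yyyyMMdd layout is used here as an assumption; the exact format of the poster's field is not known), and it contrasts them with day-first keys, for which string order and date order differ. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DateKeySortSketch {
    // Returns a copy of the keys in plain String (lexicographic) order.
    static List<String> sortedKeys(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Year-first keys (assumed yyyyMMdd): string order equals date order.
        List<String> yearFirst = Arrays.asList("20040712", "20031231", "20040105");
        System.out.println(sortedKeys(yearFirst));

        // Day-first keys (ddMMyyyy) for the same dates: string order is not date order.
        List<String> dayFirst = Arrays.asList("12072004", "31122003", "05012004");
        System.out.println(sortedKeys(dayFirst));
    }
}
```

The first list sorts to 20031231, 20040105, 20040712, which is chronological. The second sorts to 05012004, 12072004, 31122003, which puts December 2003 last.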
Exact match search
Hi, I want to match documents that exactly equal a certain value, not just contain it. If I search for foo in Lucene I get back documents like these: "foo", "foo bar", "bar foo". Is there a way to just get the ones that exactly equal the value I'm searching for? In this case, I want to return only the first document (i.e. "foo"). I have a workaround where I store all the values and then, after I get the hits, I go through them and skip those that don't match. But this will return result sets of hundreds of documents that I don't need. Help!
Re: Exact match search
On Monday 12 July 2004 21:17, [EMAIL PROTECTED] wrote: I want to match documents that exactly equal a certain value, not just contain it. Just don't tokenize your Fields, and make sure that the query also doesn't get tokenized (the easiest way to ensure that is probably to not use QueryParser but just build a TermQuery directly from the user's input). Regards Daniel
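Why leaving the field untokenized gives exact matching can be shown with a toy model in plain Java. It is not Lucene code: it simply assumes that a tokenized field is indexed as one term per whitespace-separated word, while an untokenized field is indexed as a single term holding the whole value. The class and method names are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExactMatchSketch {
    // Tokenized field: a document matches if any whitespace-separated token equals the term.
    static List<Integer> tokenizedMatches(List<String> docs, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (Arrays.asList(docs.get(i).split("\\s+")).contains(term)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Untokenized field: the whole value is the only term, so only an equal value matches.
    static List<Integer> untokenizedMatches(List<String> docs, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).equals(term)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("foo", "foo bar", "bar foo");
        System.out.println(tokenizedMatches(docs, "foo"));
        System.out.println(untokenizedMatches(docs, "foo"));
    }
}
```

For the three example documents from the question, the tokenized model matches all of them for foo, while the untokenized model matches only the first.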
Re: Could search results give an idea of which field matched
See the explain functionality in the Javadocs and previous threads. You can ask Lucene to explain why it got the results it did for a given hit. [EMAIL PROTECTED] 07/12/04 04:52PM I search the index on multiple fields. Could the search results also tell me which field matched so that the document was selected? From what I can tell, only the document number and a score are returned; is there a way to also find out which field(s) of the document matched the query? Sildy
Re: Field.java - STORED, NOT_STORED, etc...
Doug Cutting wrote: It would be best to get the compiler to check the order. If we change this, why not use type-safe enumerations: http://www.javapractices.com/Topic1.cjp The calls would look like: new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES); Stored could be implemented as the nested class: public final class Stored { private Stored() {} public static final Stored YES = new Stored(); public static final Stored NO = new Stored(); } +1... I'm not in love with this pattern, but since Java 1.4 doesn't support enum it's better than nothing. I also didn't want to submit a recommendation that would break APIs. I assume the old API would be deprecated? Kevin
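The proposal quoted above can be expanded into a compilable sketch. Only Stored and the example call come from the message; Indexed and Tokenized are written by analogy, and FieldSketch is a hypothetical stand-in for the proposed constructor, not Lucene's Field class.

```java
// Type-safe enumeration in the pre-Java-5 style: a private constructor
// and a fixed set of public constants.
final class Stored {
    private Stored() {}
    public static final Stored YES = new Stored();
    public static final Stored NO = new Stored();
}

// Written by analogy with Stored; an assumption, not taken from the message.
final class Indexed {
    private Indexed() {}
    public static final Indexed YES = new Indexed();
    public static final Indexed NO = new Indexed();
}

// Written by analogy with Stored; an assumption, not taken from the message.
final class Tokenized {
    private Tokenized() {}
    public static final Tokenized YES = new Tokenized();
    public static final Tokenized NO = new Tokenized();
}

// Hypothetical stand-in for the proposed constructor; not Lucene's Field.
final class FieldSketch {
    private final String name;
    private final String value;
    private final Stored stored;
    private final Indexed indexed;
    private final Tokenized tokenized;

    FieldSketch(String name, String value, Stored stored, Indexed indexed, Tokenized tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    String name() { return name; }
    String value() { return value; }
    boolean isStored() { return stored == Stored.YES; }
    boolean isIndexed() { return indexed == Indexed.YES; }
    boolean isTokenized() { return tokenized == Tokenized.YES; }

    public static void main(String[] args) {
        FieldSketch field = new FieldSketch("title", "Lucene", Stored.YES, Indexed.NO, Tokenized.YES);
        System.out.println(field.isStored() + " " + field.isIndexed() + " " + field.isTokenized());
        // new FieldSketch("title", "Lucene", Indexed.NO, Stored.YES, Tokenized.YES);
        // would not compile: the argument types are in the wrong order.
    }
}
```

With three boolean parameters, swapping two arguments compiles silently. With one type per flag the compiler rejects the swapped call, which is the order check the message asks for.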
Re: Why is Field.java final?
Doug Cutting wrote: Kevin A. Burton wrote: I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? You don't need to subclass to do this, just a static method somewhere. Why is this? I can't see any harm in making it non-final... Field and Document are not designed to be extensible. They are persisted in such a way that added methods are not available when the field is restored. In other words, when a field is read, it always constructs an instance of Field, not a subclass. That's fine... I think that's acceptable behavior. I don't think anyone would assume that inner vars are restored or that the field is serialized. Not a big deal but it would be nice... Kevin A. Burton
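The "just a static method somewhere" suggestion can be sketched as follows. SimpleField is a hypothetical stand-in for a final field class, and the parameter order (store, index, token) is an assumption chosen to mirror the super(name, value, false, true, false) call quoted above; neither the class nor the helper is Lucene API.

```java
// Hypothetical stand-in for a final field class; not Lucene's Field.
final class SimpleField {
    final String name;
    final String value;
    final boolean store;
    final boolean index;
    final boolean token;

    SimpleField(String name, String value, boolean store, boolean index, boolean token) {
        this.name = name;
        this.value = value;
        this.store = store;
        this.index = index;
        this.token = token;
    }
}

// Helper holding the static factory, used instead of an IDField subclass.
final class FieldFactorySketch {
    private FieldFactorySketch() {}

    // An ID field in this sketch: not stored, indexed, not tokenized.
    static SimpleField idField(String name, String value) {
        return new SimpleField(name, value, false, true, false);
    }

    public static void main(String[] args) {
        SimpleField id = idField("docId", "42");
        System.out.println(id.name + "=" + id.value
                + " store=" + id.store + " index=" + id.index + " token=" + id.token);
    }
}
```

Because the factory returns the plain field type, nothing is lost when a field is later read back as the base class, which is the concern raised in the reply.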
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: Bug 30058 posted Which of course is here: http://issues.apache.org/bugzilla/show_bug.cgi?id=30058 Is this the source of the revision you modified? http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html Also what version of Lucene? Kevin