Lucene Bugs
I reported several Lucene bugs in the fall of 2001, but no Lucene developer has responded. I am sending this summary as a reminder.

My original message to the mailing list is here:

[Lucene-dev] More bugs
http://www.geocrawler.com/archives/3/2626/2001/8/0/6409669/

The bugs at SourceForge are here:

DateFilter: call enum.next() first
http://sourceforge.net/tracker/index.php?func=detail&aid=451314&group_id=3922&atid=103922

SegmentTermEnum.clone(), term == null
http://sourceforge.net/tracker/index.php?func=detail&aid=451315&group_id=3922&atid=103922

Wrong ordering from Document.fields()
http://sourceforge.net/tracker/index.php?func=detail&aid=451317&group_id=3922&atid=103922

No software is bug free; I just want to help make Lucene better. If I can be of any help, please ask.

~ David Smiley
MITRE

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Indexing and Duplication
Kelvin,

> I've got a little problem with indexing that I'd like to throw to everyone.
>
> My objects have a unique identifier. When indexing, before I create a new document, I'd like to check if a document has already been created with this identifier. If so, I'd like to retrieve the document corresponding to this identifier, add the fields I currently have to this document's fields, and write it. If no such document exists, then I'd create a new document, add my fields and write it.
>
> What this really does, I guess, is ensure that a document object represents a body of information which really belongs together, eliminating duplication.
>
> With the current API, writing and retrieving are performed by the IndexWriter and IndexReader respectively. This effectively means that in order to do the above, I'd have to close the writer and create a new instance of the index reader after each document has been added, in order for the reader to have the most up-to-date version of the index (!).
>
> Does anyone have any suggestions how I might approach this?

Avoid closing and opening too much by batching n docs at a time: do the lookups on the index reader first, then do the things needed for those n docs on the index writer. You might have to delete docs on the reader, too.

The reasons for using the reader for reading/searching/deleting and the writer for adding were discussed some time ago on this list. I can't provide a pointer into the list archives as I don't recall the original subject header, sorry.

Regards,
Ype
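The batching idea in Ype's reply can be sketched in plain Java. This is only an illustration under stated assumptions: BatchMerger, its sink callback, and the map-of-fields document shape are hypothetical and not part of the Lucene API. The sketch merges fields that share a unique identifier within one batch, so the reader and the writer would each need to be opened once per batch instead of once per document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper, not part of Lucene: merges fields by unique id
// within a batch so each id is handed on once per batch.
class BatchMerger {
    // id -> (field name -> values), kept in first-seen order
    private final Map<String, Map<String, List<String>>> pending = new LinkedHashMap<>();
    private final int batchSize;
    private final Consumer<Map<String, Map<String, List<String>>>> sink;

    BatchMerger(int batchSize, Consumer<Map<String, Map<String, List<String>>>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Adds one field value to the pending document with this id.
    void add(String id, String field, String value) {
        // Flush only when a new id arrives and the batch is full, so fields
        // for ids already in the batch keep merging into the same document.
        if (!pending.containsKey(id) && pending.size() >= batchSize) {
            flush();
        }
        pending.computeIfAbsent(id, k -> new LinkedHashMap<>())
               .computeIfAbsent(field, k -> new ArrayList<>())
               .add(value);
    }

    // Hands the merged batch to the sink. In a real indexer (assumption) this
    // is where one reader pass (look up and delete old documents) and one
    // writer pass (add the merged documents) would take place.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new LinkedHashMap<>(pending));
        pending.clear();
    }
}
```

An identifier that shows up again in a later batch still needs the reader-side lookup and delete that Ype mentions; the sketch only removes duplication inside a single batch.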
Re: Wildcard Searching
Hello,

This was a thread on lucene-user initially, but I'm copying lucene-dev as well. Sorry about duplicates.

--- Stefan Bergstrand [EMAIL PROTECTED] wrote:
> Doug Cutting [EMAIL PROTECTED] writes:
>
> Just noticed this problem in my program. It seems as if the analyzer passed to QueryParser.parse() is never passed to PrefixQuery (which is what my test case is parsed to). A quick look in QueryParser.jj confirms this:
>
>   q = new PrefixQuery(new Term(field, term.image.substring(0, term.image.length()-1)));

I thought that queries such as 'rou?d' are considered wildcard queries by QueryParser.jj, and not prefix queries, no?

In the default definition of token in QueryParser.jj I see this:

  | PREFIXTERM: _TERM_START_CHAR (_TERM_CHAR)* *
  | WILDTERM: _TERM_START_CHAR (_TERM_CHAR | ( [ *, ? ] ))*

Then further down in QueryParser.jj we have this:

  if (wildcard) q = new WildcardQuery(new Term(field, term.image));

So a WildcardQuery is being constructed, not a PrefixQuery, I think.

What I don't understand is why the definition of _TERM_START_CHAR looks like this:

  | #_TERM_START_CHAR: ~[ , \t, +, -, !, (, ), :, ^, [, ], \, {, }, ~, * ]

Maybe the name is misleading, but it seems like _TERM_START_CHAR are the characters that a TERM can start with, because later in QueryParser.jj we have TERM defined as:

  | TERM: _TERM_START_CHAR (_TERM_CHAR)*

and _TERM_CHAR has this definition:

  | #_TERM_CHAR: _TERM_START_CHAR

So how can we have a * in _TERM_START_CHAR when terms are not allowed to start with a *, and if we do have *, how come we do not have ? as well?

Can somebody correct me in every place where I made false statements, assumptions, and conclusions?

Thanks,
Otis

> From: Howk, Michael [mailto:[EMAIL PROTECTED]]
>
> Also, Lucene returns the parsed version of each of our searches. When we search by rou*d, Lucene parses it as rou*d (which is what we would expect). But when we search by rou?d, Lucene parses it as rou d. It seems to wrap the term in quotes and replace the question mark with a space.
>
> Any ideas? Or can someone give us an idea of how to understand WildcardQuery or WildcardTermEnum?

> It sounds like the problem is in the query parser. Brian?
>
> Doug
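To make the prefix/wildcard distinction discussed in this thread concrete, here is a simplified Java sketch. It is not the JavaCC-generated parser and ignores escaping and the start-character restrictions; TermKind, classify and prefixOf are hypothetical names. It assumes only what the quoted token rules suggest: a term whose sole wildcard is a trailing * is a prefix term, and any other use of * or ? makes it a wildcard term.

```java
// Simplified model of the PREFIXTERM / WILDTERM distinction quoted above.
// Hypothetical helper for illustration; not Lucene's actual QueryParser.
class TermKind {
    enum Kind { TERM, PREFIX, WILDCARD }

    // Classifies a raw query term by where its wildcard characters appear.
    static Kind classify(String text) {
        if (text.isEmpty()) {
            throw new IllegalArgumentException("empty term");
        }
        int last = text.length() - 1;
        boolean wildInBody = false;
        // Look for * or ? anywhere before the final character.
        for (int i = 0; i < last; i++) {
            char c = text.charAt(i);
            if (c == '*' || c == '?') {
                wildInBody = true;
            }
        }
        char end = text.charAt(last);
        if (!wildInBody && end == '*') {
            return Kind.PREFIX;      // e.g. rou*
        }
        if (wildInBody || end == '?') {
            return Kind.WILDCARD;    // e.g. rou?d, rou*d, rou?
        }
        return Kind.TERM;            // e.g. round
    }

    // Mirrors the substring call quoted from QueryParser.jj:
    // drops the trailing * before the prefix term is built.
    static String prefixOf(String text) {
        return text.substring(0, text.length() - 1);
    }
}
```

Under this model 'rou?d' is a wildcard term, which matches Otis's reading of the grammar; it does not explain why the parser printed it as a quoted phrase.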
Re: Lucene Bugs
Oh I *have* downloaded the CVS source, and I actually did *fix* (maybe) two of these three bugs, and I did *submit* exactly what I did to fix them to SourceForge and the mailing list for public review (but not in diff/patch format, since they were one-liners).

The problem is that much of Lucene is very complicated (understandably so), and I never got someone more familiar with Lucene's more complicated parts (like Doug, or perhaps some others here) to respond and check whether my fix was correct and completely addresses the issue. Not one person responded, except for some other guy who said he experienced the same bug and that nobody responded to his bug report either :-(.

For the 3rd bug, the one that I didn't fix, I took the time to write a test program that shows the bug.

What's needed now for these bugs to be squashed is someone who really knows Lucene's complicated parts to verify whether my 2 fixes are sufficient and to at least investigate the 3rd bug. I'm not the one with years of search-engine writing experience ;-).

I really appreciate your response by the way, it's a welcome change... and an initial step.

~ Dave Smiley

On Saturday, March 16, 2002, at 08:59 PM, Andrew C. Oliver wrote:
> You need not be asked, help is always wanted. How about instead of submitting bugs, submit patches. Simply get the sources via CVS (click on CVS Repository on the Jakarta front page), fix the bugs and then do cvs diff -u to create patches. Post those into bugzilla and put [PATCH] on the summary line and I think you'll find them applied rather quickly.
>
> -Andy
Re: corrupted index
Otis,

> You can remove the .lock file and try re-indexing or continuing indexing where you left off. I am not sure about the corrupt index. I have never seen it happen, and I believe I recall reading some messages from Doug Cutting saying that index should never be left in an inconsistent state.

Obviously never should be, but if something's pulling the rug out from under his JRE, changes could be only partially written, right? Or is the writing format in some sense transactionally safe? I've never worked directly on something like this, but I worked at a database software company where they used transaction semantics and a journaling scheme to fake a bulletproof file system. Is this how the index-writing code is implemented?

In general, I can guess Doug's response: just torch the old index directory and rebuild it; Lucene's indexing is fast enough that you don't need to get clever. This seems to be Doug's stance in general (i.e. don't get fancy, I already put all the fanciness you'll need into extremely fast indexing and searching). So far, it seems to work :-).

> I could be making this up, though, so I suggest you search through lucene-user and lucene-dev archives on www.mail-archive.com. A search for "corrupt" should do it. Once you figure things out maybe you can post a summary here.

I got a little curious, so I went and did the searches. There is exactly one message in each list archive (dev and users) with the keyword "corrupt" in it.

The lucene-users instance is irrelevant:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

The lucene-dev instance is more useful:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

It's a post from Doug, dated Sept 27, 2001, about adding not just thread-safety but process-safety:

> It should be impossible to corrupt an index through the Lucene API. However if a Lucene process exits unexpectedly it can leave the index locked. The remedy is simply to, at a time when it is certain that no processes are accessing the index, remove all lock files.

So it sounds like it's worth trying just removing the lock files.

Hm, is there a way to come up with a sanity check you can run on an index to make sure it's not corrupted? This might be an excellent thing to reassure yourself with: something went wrong? Run a sanity check, if it fails just reindex.

Steven J. Owens
[EMAIL PROTECTED]
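The remedy Doug describes (removing leftover lock files once no process is using the index) could be scripted roughly as follows. This is a sketch under assumptions: StaleLockCleaner is a hypothetical helper, and the "*.lock" pattern is a guess based on the ".lock file" wording in this thread rather than a documented Lucene naming rule. It must only be run when it is certain that nothing is reading or writing the index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: deletes leftover lock files from an index directory.
// Assumes lock files end in ".lock"; adjust the glob if they are named differently.
class StaleLockCleaner {
    // Returns the lock files that were removed.
    static List<Path> removeLockFiles(Path indexDir) throws IOException {
        List<Path> found = new ArrayList<>();
        // Collect matches first, then delete, to avoid mutating the
        // directory while it is still being listed.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "*.lock")) {
            for (Path lock : stream) {
                found.add(lock);
            }
        }
        for (Path lock : found) {
            Files.delete(lock);
        }
        return found;
    }
}
```

Other files in the directory are left untouched, so the index itself is not modified by the cleanup.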
Re: corrupted index
Oh, I just thought of something (wine does body good). Perhaps one could use Runtime (the class) to catch the JVM shutdown and do whatever is needed to prevent index corruption. I believe there are some shutdown hook methods in there that may let you do that. I'm too lazy to look up the API docs now, but I remember reading about that once, and perhaps it was even mentioned on one of the 2 Lucene mailing lists.

On the other hand, it would be great to have a tool that can verify an existing index. I don't know enough about the actual file structure yet to write something like that, but maybe somebody else has done that already or would like to contribute.

Otis
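Otis's shutdown-hook idea can be sketched with Runtime.addShutdownHook, which is a standard Java API. IndexShutdownHook and the AutoCloseable resource below are hypothetical stand-ins for whatever object holds the index open. Note that shutdown hooks run on normal exit and on signals such as Ctrl-C, but not when the JVM is killed outright or the machine loses power, so this cannot cover every "rug pulled out" case.

```java
// Hypothetical helper: closes a resource when the JVM shuts down normally.
class IndexShutdownHook {
    // Registers a hook that closes the given resource and returns the hook
    // thread, so the caller can deregister it after an orderly close.
    static Thread register(AutoCloseable resource) {
        Thread hook = new Thread(() -> {
            try {
                resource.close();
            } catch (Exception e) {
                // Nothing useful can be rethrown during shutdown; just report it.
                System.err.println("Failed to close index resource: " + e);
            }
        }, "index-close-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

If the application closes the resource itself, it can call Runtime.getRuntime().removeShutdownHook(hook) afterwards so the resource is not closed a second time at exit.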
Re: Lucene Bugs
Hola,

I don't have years of search engine writing experience either, but I did look at your reports on SourceForge earlier, and I will try to look at the source to see if they are the right fixes. I haven't used DateFilter, which, I think, you said contains the bug, so no promises, but I'll look. That part of the code might have changed since your reports, and I may have trouble locating the lines you mentioned, so I may ask you to point me to the right lines in the new source.

Tomorrow or Monday. Right now I have to go kill some crapes and go to bed.

Otis