Lucene Bugs

2002-03-16 Thread David Smiley

I reported several bugs in Lucene in the fall of 2001, but no Lucene 
developer has responded.  I am sending this summary as a reminder.

My original message to the mailing list is here:

[Lucene-dev] More bugs
http://www.geocrawler.com/archives/3/2626/2001/8/0/6409669/

The bugs at SourceForge are here:

DateFilter: call enum.next() first
http://sourceforge.net/tracker/index.php?func=detail&aid=451314&group_id=3922&atid=103922

SegmentTermEnum.clone(), term == null
http://sourceforge.net/tracker/index.php?func=detail&aid=451315&group_id=3922&atid=103922

Wrong ordering from Document.fields()
http://sourceforge.net/tracker/index.php?func=detail&aid=451317&group_id=3922&atid=103922


No software is bug free; I just want to help make Lucene better.  If 
I can be of any help, please ask.

~ David Smiley
   MITRE


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Indexing and Duplication

2002-03-16 Thread Ype Kingma

Kelvin,

> I've got a little problem with indexing that I'd like to throw to everyone.
>
> My objects have a unique identifier. When indexing, before I create a new
> document, I'd like to check if a document has already been created with this
> identifier. If so, I'd like to retrieve the document corresponding to this
> identifier, and add the fields I currently have to this document's fields
> and write it. If no such document exists, then I'd create a new document,
> add my fields and write it. What this really does, I guess, is ensure that a
> document object represents a body of information which really belongs
> together, eliminating duplication.
>
> With the current API, writing and retrieving is performed by the IndexWriter
> and IndexReader respectively. This effectively means that in order to do the
> above, I'd have to close the writer, create a new instance of the index
> reader after each document has been added in order for the reader to have
> the most updated version of the index (!).
>
> Does anyone have any suggestions how I might approach this?

Avoid closing and opening too often by batching n docs at a time
on the index reader and then do the things needed for the n docs on the
index writer. You might have to delete docs on the reader, too.

The reasons for using the reader for reading/searching/deleting
and using the writer for adding were discussed some time ago on this
list. I can't provide a pointer into the list archives as I don't recall
the original subject header, sorry.
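
The batching idea can be sketched roughly as below. This is plain Java with no Lucene calls: the class BatchedMerger, its in-memory "index" map and the method names are all hypothetical stand-ins, assuming the lookup/delete phase would go through the index reader and the add phase through the index writer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index: maps a unique id to that document's fields.
// In a real setup the lookup/delete phase would use the index reader and the
// add phase the index writer; none of that API is shown here.
class BatchedMerger {
    private final Map<String, List<String>> index = new LinkedHashMap<>();
    private final Map<String, List<String>> pending = new LinkedHashMap<>();
    private final int batchSize;
    private int flushes = 0;

    BatchedMerger(int batchSize) {
        this.batchSize = batchSize;
    }

    // Queue fields for an id; fields for the same id within one batch are merged in memory.
    void add(String id, List<String> fields) {
        pending.computeIfAbsent(id, k -> new ArrayList<>()).addAll(fields);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // One "reader phase" followed by one "writer phase" per batch.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Reader phase: fetch and delete any existing document for each pending id.
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : pending.entrySet()) {
            List<String> fields = new ArrayList<>();
            List<String> existing = index.remove(e.getKey());
            if (existing != null) {
                fields.addAll(existing);
            }
            fields.addAll(e.getValue());
            merged.put(e.getKey(), fields);
        }
        // Writer phase: add the merged documents back.
        index.putAll(merged);
        pending.clear();
        flushes++;
    }

    List<String> fieldsFor(String id) {
        return index.get(id);
    }

    int flushCount() {
        return flushes;
    }
}
```

With a batch size of n, the reader and writer would each be opened once per n distinct ids instead of once per document.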

Regards,
Ype





Re: Wildcard Searching

2002-03-16 Thread Otis Gospodnetic

Hello,

This was a thread on lucene-user initially, but I'm copying lucene-dev
as well.  Sorry about duplicates.

--- Stefan Bergstrand [EMAIL PROTECTED] wrote:
> Doug Cutting [EMAIL PROTECTED] writes:
>
> Just noticed this problem in my program.
>
> It seems as if the analyzer passed to QueryParser.parse(), never is
> passed to PrefixQuery (which is what my test case is parsed to).
>
> A quick look in QueryParser.jj confirms this:
>
>   q = new PrefixQuery(new Term(field, term.image.substring
>    (0, term.image.length()-1)));

I thought that queries such as 'rou?d' are considered wildcard queries
by QueryParser.jj, and not Prefix queries, no?
In the default definition of token in QueryParser.jj I see this:

| <PREFIXTERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
| <WILDTERM:  <_TERM_START_CHAR>
              (<_TERM_CHAR> | ( [ "*", "?" ] ))* >

Then further down in QueryParser.jj we have this:

   if (wildcard)
     q = new WildcardQuery(new Term(field, term.image));

So a WildcardQuery is being constructed, not a PrefixQuery, I think.
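
To make the distinction concrete, here is a rough sketch of how a single term would be classified under the two token definitions quoted above. It assumes that when both tokens match the same text the one declared first (PREFIXTERM) wins, so only a lone trailing * yields a prefix term. The class, enum and method names are made up for illustration and are not part of QueryParser; leading wildcards and escaping are not handled.

```java
// Rough classification of one query term, mirroring the token order quoted above.
// Assumption: a term whose only wildcard is a trailing '*' is a prefix term;
// any other use of '*' or '?' makes it a wildcard term.
class TermKindSketch {
    enum Kind { TERM, PREFIX, WILDCARD }

    static Kind classify(String term) {
        // Find the first wildcard character, if any.
        int firstWildcard = -1;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                firstWildcard = i;
                break;
            }
        }
        if (firstWildcard < 0) {
            return Kind.TERM;
        }
        // The only wildcard is a '*' in the last position: prefix term.
        if (firstWildcard == term.length() - 1 && term.charAt(firstWildcard) == '*') {
            return Kind.PREFIX;
        }
        return Kind.WILDCARD;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"round", "rou*", "rou*d", "rou?d"}) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```

Under these assumptions rou?d and rou*d are both wildcard terms, and only something like rou* would reach the PrefixQuery branch.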

What I don't understand is why the definition of _TERM_START_CHAR looks
like this:

| <#_TERM_START_CHAR: ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                         "[", "]", "\"", "{", "}", "~", "*" ] >

Maybe the name is misleading, but it seems like _TERM_START_CHAR are
the characters that a TERM can start with, because later in
QueryParser.jj we have TERM defined as:

| <TERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* >

and _TERM_CHAR has this definition:

| <#_TERM_CHAR: <_TERM_START_CHAR> >

So how can we have a "*" in _TERM_START_CHAR when terms are not allowed
to start with a "*", and if we do have "*", how come we do not have "?"
as well?

Can somebody correct me in every place where I made false statements,
assumptions, and conclusions?

Thanks,
Otis

> > > From: Howk, Michael [mailto:[EMAIL PROTECTED]]
> > >
> > > Also, Lucene returns the parsed version of each of our searches.
> > > When we search by rou*d, Lucene parses it as rou*d (which is what
> > > we would expect). But when we search by rou?d, Lucene parses it as
> > > "rou d". It seems to wrap the term in quotes and replace the
> > > question mark with a space. Any ideas? Or can someone give us an
> > > idea of how to understand WildcardQuery or WildcardTermEnum?
> >
> > It sounds like the problem is in the query parser.  Brian?
> >
> > Doug
  
  
  
 
 -- 
 ---
 Stefan Bergstrand
 Polopoly - Cultivating the information garden
 Ph:   +46 8 506 782 67
 Cell: +46 704 47 82 67
 Fax:  +46 8 506 782 51
 [EMAIL PROTECTED], http://www.polopoly.com
 
 







Re: Lucene Bugs

2002-03-16 Thread David Smiley

   Oh I *have* downloaded the CVS source and I actually did *fix* 
(maybe) two of these three bugs and I did *submit* what I did exactly 
to fix them to the sourceforge / mailing-list for public review (but 
not in diff/patch format since they were one-liners).  The problem is 
that much of Lucene is very complicated (understandably so) and I 
never got someone more familiar with Lucene's more complicated parts 
(like Doug, or perhaps some others here) to respond to see if my fix 
was correct and completely addresses the issue.  Not one person 
responded except for some other guy to say he experienced the same 
bug and that nobody responded to his bug report either :-(.  The 3rd 
bug, the one that I didn't fix, I took the time to write a test 
program that showed the bug.  What's needed now for these bugs to be 
squashed, is someone that really knows Lucene's complicated parts to 
verify if my 2 fixes are sufficient and to at least investigate the 
3rd bug.  I'm not the one with years of search-engine writing 
experience ;-).

I really appreciate your response by the way, it's a welcome 
change... and an initial step.

~ Dave Smiley

On Saturday, March 16, 2002, at 08:59  PM, Andrew C. Oliver wrote:

> You need not be asked, help is always wanted.  How about instead of
> submitting bugs, submit patches.  Simply get the sources via CVS (click
> on CVS Repository on the Jakarta front page), fix the bugs and then do
> cvs diff -u to create patches.  Post those into bugzilla and put [PATCH]
> on the summary line and I think you'll find them applied rather quickly.
>
> -Andy






Re: corrupted index

2002-03-16 Thread Steven J. Owens

Otis,

> You can remove the .lock file and try re-indexing or continuing
> indexing where you left off.
> I am not sure about the corrupt index.  I have never seen it happen,
> and I believe I recall reading some messages from Doug Cutting saying
> that index should never be left in an inconsistent state.

 Obviously never should be, but if something's pulling the rug
out from under his JRE, changes could be only partially written,
right?  

 Or is the writing format in some sense transactionally safe?
I've never worked directly on something like this, but I worked at a
database software company where they used transaction semantics and a
journaling scheme to fake a bulletproof file system.  Is this how
the index-writing code is implemented?

 In general, I can guess Doug's response - just torch the old
index directory and rebuild it; Lucene's indexing is fast enough that
you don't need to get clever.  This seems to be Doug's stance in
general (i.e. don't get fancy, I already put all the fanciness you'll
need into extremely fast indexing and searching).  So far, it seems
to work :-).

> I could be making this up, though, so I suggest you search through
> lucene-user and lucene-dev archives on www.mail-archive.com.
> A search for "corrupt" should do it.
> Once you figure things out maybe you can post a summary here.

 I got a little curious, so I went and did the searches.  There is
exactly one message in each list archive (dev and users) with the
keyword "corrupt" in it.  The lucene-user instance is irrelevant:

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

 The lucene-dev instance is more useful:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

 It's a post from Doug, dated Sept. 27, 2001, about adding not just
thread-safety but process-safety:

  It should be impossible to corrupt an index through the Lucene API.
  However if a Lucene process exits unexpectedly it can leave the index
  locked.  The remedy is simply to, at a time when it is certain that no
  processes are accessing the index, remove all lock files.
  
 So it sounds like it's worth trying just removing the lock files.
Hm, is there a way to come up with a sanity check you can run on an
index to make sure it's not corrupted?  This might be an excellent
thing to reassure yourself with: something went wrong?  Run a sanity
check, if it fails just reindex.
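
For the lock-removal step, something along these lines might do. It is only a sketch: it assumes the lock files sit directly in the index directory and have names ending in ".lock", and the class and method names are invented here. As Doug's post says, it should only be run at a time when it is certain that no process is accessing the index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LockCleanupSketch {
    // Deletes regular files matching "*.lock" directly inside indexDir and
    // returns the names of the files it removed.
    // Assumption: lock files live in the index directory with a ".lock" suffix.
    // Only call this when no process is reading or writing the index.
    static List<String> removeLockFiles(Path indexDir) throws IOException {
        List<String> removed = new ArrayList<>();
        try (DirectoryStream<Path> locks = Files.newDirectoryStream(indexDir, "*.lock")) {
            for (Path lock : locks) {
                if (Files.isRegularFile(lock)) {
                    Files.delete(lock);
                    removed.add(lock.getFileName().toString());
                }
            }
        }
        return removed;
    }
}
```

This does nothing to verify the index itself; a real sanity check would need knowledge of the index file format.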

Steven J. Owens
[EMAIL PROTECTED]





Re: corrupted index

2002-03-16 Thread Otis Gospodnetic

Oh, I just thought of something (wine does body good).
Perhaps one could use Runtime (the class) to catch the JVM shutdown and
do whatever is needed to prevent index corruption.  I believe there are
some shutdown hook methods in there that may let you do that.  I'm too
lazy to look up the API docs now, but I remember reading about that
once, and perhaps it was even mentioned on one of the two Lucene mailing
lists.
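
A minimal sketch of the shutdown hook idea, using Runtime.addShutdownHook from the standard library. The ShutdownHookSketch class and its cleanup Runnable are hypothetical; what the cleanup should actually do for an index (closing a writer, for instance) is an assumption and is left as a placeholder.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Registers a JVM shutdown hook that runs a cleanup action at most once.
// The cleanup body is a placeholder supplied by the caller.
class ShutdownHookSketch {
    private final Runnable cleanup;
    private final AtomicBoolean done = new AtomicBoolean(false);
    private final Thread hook;

    ShutdownHookSketch(Runnable cleanup) {
        this.cleanup = cleanup;
        this.hook = new Thread(this::runOnce, "index-cleanup-hook");
    }

    // Ask the JVM to run the hook thread during an orderly shutdown.
    void register() {
        Runtime.getRuntime().addShutdownHook(hook);
    }

    // Returns true if the hook had been registered and is now removed.
    boolean unregister() {
        return Runtime.getRuntime().removeShutdownHook(hook);
    }

    // Runs the cleanup the first time it is called; later calls do nothing.
    void runOnce() {
        if (done.compareAndSet(false, true)) {
            cleanup.run();
        }
    }
}
```

Usage would be something like new ShutdownHookSketch(() -> { /* close the index here */ }).register(). Hooks run on a normal exit or an interrupt such as Ctrl-C, but not when the process is killed outright or the machine loses power, so this would not cover every case of the rug being pulled out from under the JRE.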

On the other hand, it would be great to have a tool that can verify an
existing index.  I don't know enough about the actual file structure
yet to write something like that, but maybe somebody else has done that
already or would like to contribute.

Otis








Re: Lucene Bugs

2002-03-16 Thread Otis Gospodnetic

Hola,

I don't have years of search engine writing experience either, but I did
look at your reports on SourceForge earlier and I will try to look at
the source to see if they are the right fixes.  I haven't used
DateFilter, which, I think, you said contains the bug, so no promises,
but I'll look.
That part of the code might have changed since your reports, and I may have
trouble locating the lines you mentioned, so I may ask you to point me
to the right lines in the new source.
Tomorrow or Monday.  Right now I have to go kill some crapes and go
to bed.

Otis

 

