RE: [htdig3-dev] RE: [htdig] RE: [3.2.0b2] AND operator notworking as it should?

Geoff Hutchison Mon, 21 Aug 2000 07:13:48 -0700

At 3:44 PM +0200 8/21/00, Quim Sanmarti wrote:
>I've being following this thread, since I've detected the same problems as
>Arthur.
>
>After some examination of parser.cc, I believe I found some aspects that
>are unclear. I would like to post a patched pair of parser.h/parser.cc that
>(as far as I've tested) correct the issues listed below. I annex them to
>this mail -- er, if this is not the correct way, please tell me. I'm a
>freshman here and not yet acquainted with your development process.

First off, it's customary to CC: the list as well--for example you 
wouldn't be able to follow the thread if it wasn't on the list. :-) 
It's also usually better for us bandwidth-impaired folk to just use 
diffs to files rather than the files themselves. See 
<http://dev.htdig.org/patches.html> for suggestions.

As far as parser.cc, it's all rather unclear and hacked-together. I'm 
working on a revised parser/lookup framework in the new ParseTree 
classes. You can see an example of how they work in the parsetest 
program. Hopefully by tonight, the parsing will be bug-free and I'll 
get to a point where it can start to replace parser.cc.

>1- the methods perform_and(), perform_or(), deal with results popped/peeked
>from a stack, comparing them to null before checking whether they are to be
>ignored or trying to combine result lists. It seems as if a null pointer in
>the stack corresponds to an empty result. Following the code, I can't find
>null pointers pushed in the stack. Empty lists are pushed instead. This is
>a pity, since the comparisons with null pointers lose their sense.

I think it checks for NULLs because you'd get segfaults in certain 
places otherwise. I believe the 3.0.8b2 code (and before) which used 
GDBM might have used NULLs instead of empty lists.

>2- In method perform_push(), no results are pushed if the word under
>examination is to be ignored. This is fatal for the stack's health. An
>empty ResultList with isIgnore set should be pushed in that case, so that
>further evaluation in perform_and, etc be successful.

True. Interesting that this one hasn't come up before. Of course I 
didn't notice it when I was hacking in the phrase code either.

>3- On the other hand, in score(), a word with no matches yields an empty
>list with isIgnore set -- this blows up any further 'and' logic, since 'a
>and <ignore> == a', which is not what we want.

>4- The call to 'phrase()' in factor() is quite intempestive. Following the
>current logic, the parser *does* evaluate without errors the following
>boolean query ["this is a phrase" (foo or baz)], which is clearly wrong.
>The call to phrase() should be an alternative to evaluating words or
>expressions in parens.

Phrase searching was poorly hacked in. I can't say that I want an 
excuse, but it was also written at 2AM. The current parser made it a 
bit difficult to hack in phrase searching at the level I wanted w/o 
scrapping it all.

>5- Ignored words inside phrases make the phrase search to fail. The
>distance between two words in a phrase should be set to 1 + (number of
>ignored words in between).

There are no ignored words in phrases. Earlier code in htsearch.cc 
should prevent this. It's still a hack, but it's not a bug.

>6- Optimisation issue. perform_phrase() can stop if a (non-ignored) word
>with no matches is found, or after finding that two words are nowhere
>consecutive.

Good point.

>7- Style issue. having a perform_and() that does 'and' and 'not' operations
>is at least tricky. IMHO is clearer to write two separate perform_and() and
>perform_not().

Yes, though you'll note that adding operators to the parser requires 
pulling teeth. So it's no wonder NOT -> NAND.

>#################
>
>In order to solve 1,2,3, I modified the code using the following rules:
>       an expression with no matches is represented as a null pointer
>       an ignored expression is represented as an empty ResultList 
>with isIgnore
>= 1
>       a sucessful match set is represented as a non-empty ResultList with
>isIgnore = 0
>
>I intended to follow the following truth tables for and, or, not
>operations.
>0 - no matches
>1 - some match
>x - ignore
>
>       a       b       a and b
>-------------------------------------------
>       0       0       0
>       0       1       0
>       0       x       0
>       1       0       0
>       1       1       intersect(a,b)
>       1       x       a
>       x       0       0
>       x       1       b
>       x       x       x
>
>       a       b       a not b
>-------------------------------------------
>       0       0       0
>       0       1       0
>       0       x       0
>       1       0       a
>       1       1       intersect(a,not b)
>       1       x       a
>       x       0       x
>       x       1       x
>       x       x       x
>
>       a       b       a or b
>--------------------------------------------
>       0       0       0
>       0       1       b
>       0       x       x
>       1       0       a
>       1       1       union(a, b)
>       1       x       a
>       x       0       x
>       x       1       b
>       x       x       x
>
>#################
>
>To solve 6, I modified slightly the call mechanism to perform_phrase. A
>reference to a pointer to the resulting list is passed. This pointer is
>initialized to null for each phrase. The resulting word reference list is
>allocated by perform_phrase on the first occurrence of a non-ignored word.
>Subsequent lookup and filtering is only performed if the resulting list has
>some entry.
>
>#################
>
>The corrections for 4,5,7 are implicit in the problem description (?).
>
>#################
>
>Hope all this helps. However, IMHO the parser class is far from clear and
>optimum. I'm happy to learn that there's an initiative to rewrite this part
>of ht://dig. I would like to help in that sense; I'll follow the adequate
>thread(s).

We'll be glad to merge in your patches, though hopefully they'll only 
need to exist for a snapshot or two. Any help in writing code for the 
ParseTree framework would be greatly appreciated. I would love it if 
people used the parsetest program and gave my code a review. I 
already know that there's a bug in boolean testing b/c parens are 
gobbled up by the HtWordToken. That's resolved in a version I was 
testing last night.

I'd be happy to explain any parts of the new parser that seem 
mysterious to people.

-Geoff

%parser.h.diff

parser.h.diff

%parser.cc.diff

parser.cc.diff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this.

RE: [htdig3-dev] RE: [htdig] RE: [3.2.0b2] AND operator notworking as it should?

Reply via email to