At 3:44 PM +0200 8/21/00, Quim Sanmarti wrote:
>I've being following this thread, since I've detected the same problems as
>Arthur.
>
>After some examination of parser.cc, I believe I found some aspects that
>are unclear. I would like to post a patched pair of parser.h/parser.cc that
>(as far as I've tested) correct the issues listed below. I annex them to
>this mail -- er, if this is not the correct way, please tell me. I'm a
>freshman here and not yet acquainted with your development process.
First off, it's customary to CC: the list as well--for example you
wouldn't be able to follow the thread if it wasn't on the list. :-)
It's also usually better for us bandwidth-impaired folk to just use
diffs to files rather than the files themselves. See
<http://dev.htdig.org/patches.html> for suggestions.
As far as parser.cc, it's all rather unclear and hacked-together. I'm
working on a revised parser/lookup framework in the new ParseTree
classes. You can see an example of how they work in the parsetest
program. Hopefully by tonight, the parsing will be bug-free and I'll
get to a point where it can start to replace parser.cc.
>1- the methods perform_and(), perform_or(), deal with results popped/peeked
>from a stack, comparing them to null before checking whether they are to be
>ignored or trying to combine result lists. It seems as if a null pointer in
>the stack corresponds to an empty result. Following the code, I can't find
>null pointers pushed in the stack. Empty lists are pushed instead. This is
>a pity, since the comparisons with null pointers lose their sense.
I think it checks for NULLs because you'd get segfaults in certain
places otherwise. I believe the 3.0.8b2 code (and before) which used
GDBM might have used NULLs instead of empty lists.
>2- In method perform_push(), no results are pushed if the word under
>examination is to be ignored. This is fatal for the stack's health. An
>empty ResultList with isIgnore set should be pushed in that case, so that
>further evaluation in perform_and, etc be successful.
True. Interesting that this one hasn't come up before. Of course I
didn't notice it when I was hacking in the phrase code either.
>3- On the other hand, in score(), a word with no matches yields an empty
>list with isIgnore set -- this blows up any further 'and' logic, since 'a
>and <ignore> == a', which is not what we want.
>4- The call to 'phrase()' in factor() is quite intempestive. Following the
>current logic, the parser *does* evaluate without errors the following
>boolean query ["this is a phrase" (foo or baz)], which is clearly wrong.
>The call to phrase() should be an alternative to evaluating words or
>expressions in parens.
Phrase searching was poorly hacked in. I can't say that I want an
excuse, but it was also written at 2AM. The current parser made it a
bit difficult to hack in phrase searching at the level I wanted w/o
scrapping it all.
>5- Ignored words inside phrases make the phrase search to fail. The
>distance between two words in a phrase should be set to 1 + (number of
>ignored words in between).
There are no ignored words in phrases. Earlier code in htsearch.cc
should prevent this. It's still a hack, but it's not a bug.
>6- Optimisation issue. perform_phrase() can stop if a (non-ignored) word
>with no matches is found, or after finding that two words are nowhere
>consecutive.
Good point.
>7- Style issue. having a perform_and() that does 'and' and 'not' operations
>is at least tricky. IMHO is clearer to write two separate perform_and() and
>perform_not().
Yes, though you'll note that adding operators to the parser requires
pulling teeth. So it's no wonder NOT -> NAND.
>#################
>
>In order to solve 1,2,3, I modified the code using the following rules:
> an expression with no matches is represented as a null pointer
> an ignored expression is represented as an empty ResultList
>with isIgnore
>= 1
> a sucessful match set is represented as a non-empty ResultList with
>isIgnore = 0
>
>I intended to follow the following truth tables for and, or, not
>operations.
>0 - no matches
>1 - some match
>x - ignore
>
> a b a and b
>-------------------------------------------
> 0 0 0
> 0 1 0
> 0 x 0
> 1 0 0
> 1 1 intersect(a,b)
> 1 x a
> x 0 0
> x 1 b
> x x x
>
> a b a not b
>-------------------------------------------
> 0 0 0
> 0 1 0
> 0 x 0
> 1 0 a
> 1 1 intersect(a,not b)
> 1 x a
> x 0 x
> x 1 x
> x x x
>
> a b a or b
>--------------------------------------------
> 0 0 0
> 0 1 b
> 0 x x
> 1 0 a
> 1 1 union(a, b)
> 1 x a
> x 0 x
> x 1 b
> x x x
>
>#################
>
>To solve 6, I modified slightly the call mechanism to perform_phrase. A
>reference to a pointer to the resulting list is passed. This pointer is
>initialized to null for each phrase. The resulting word reference list is
>allocated by perform_phrase on the first occurrence of a non-ignored word.
>Subsequent lookup and filtering is only performed if the resulting list has
>some entry.
>
>#################
>
>The corrections for 4,5,7 are implicit in the problem description (?).
>
>#################
>
>Hope all this helps. However, IMHO the parser class is far from clear and
>optimum. I'm happy to learn that there's an initiative to rewrite this part
>of ht://dig. I would like to help in that sense; I'll follow the adequate
>thread(s).
We'll be glad to merge in your patches, though hopefully they'll only
need to exist for a snapshot or two. Any help in writing code for the
ParseTree framework would be greatly appreciated. I would love it if
people used the parsetest program and gave my code a review. I
already know that there's a bug in boolean testing b/c parens are
gobbled up by the HtWordToken. That's resolved in a version I was
testing last night.
I'd be happy to explain any parts of the new parser that seem
mysterious to people.
-Geoff
%parser.h.diff
parser.h.diff
%parser.cc.diff
parser.cc.diff
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.