Any updates on this?
On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant...@gmail.com> wrote:

> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do. Why didn't you code the url-part mechanism using the
> > existing support for compound words?
>
> I am not familiar with the compound word implementation, so I am not sure
> how to split a URL with compound word support. The documentation for
> compound words does not say much about how to identify the components of
> a token. Does a compound word split by matching against a list of words?
> If so, we will not be able to use that, since we do not know all the
> words that can appear in a URL/host/email/file.
>
> Another approach could be to use the dict_regex dictionary support.
> However, we would have to match the regex against something the parser
> is doing.
>
> The current patch does not invent any new mechanism. It uses the special
> handler mechanism already present in the parser. For example, when the
> current parser finds a URL, it runs a special handler called SpecialFURL,
> which resets the parser position to the start of the token to find the
> hostname. After finding the host, it moves on to finding the path. So
> you first get the URL, then the host, and finally the path.
>
> Similarly, we reset the parser to the start of the token on finding a
> URL and output the URL parts before entering the state that can lead to
> a URL. The state machine modification is similar for the other token
> types (file/email/host).
>
> > The changes made to parsetext()
> > seem particularly scary: it's not clear at all that that's not breaking
> > unrelated behaviors. In fact, the changes in the regression test
> > results suggest strongly to me that it *is* breaking things. Why are
> > there so many diffs in examples that include no URLs at all?
>
> I think some of the difference comes from the fact that pos now starts
> at 0, whereas it used to start at 1. That is easily fixable, though.
>
> > An issue that's nearly as bad is the 100% lack of documentation,
> > which makes the patch difficult to review because it's hard to tell
> > what it intends to accomplish or whether it's met the intent.
> > The patch is not committable without documentation anyway, but right
> > now I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation because I could not find a place in
> the code to put the documentation (the patch is just a modification of
> the state machine). Should I do a separate write-up explaining the
> desired output and the changes made to achieve it?
>
> > In line with the lack of documentation, I would say that the choice of
> > the name "parttoken" for the new token type is not helpful. Part of
> > what? And none of the other token type names include the word "token",
> > so that's not a good decision either. Possibly "url_part" would be a
> > suitable name.
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is agreement on the rest of the issues. So let me know whether I
> should go ahead with this.
>
> -Sushant.
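For readers following the thread, the re-scanning behavior described above (SpecialFURL emitting the whole URL token, then the host, then the path) can be sketched in a few lines. This is a hypothetical Python illustration of the desired token output, not the actual C parser code; the function name `url_subtokens` and the regex are my own, and the real parser uses a state machine rather than a regex:

```python
import re

def url_subtokens(token):
    """Illustrative sketch only (not the actual C parser): split a URL
    token into the whole-URL token plus host and path sub-tokens, roughly
    the way the text-search parser's SpecialFURL handler re-scans the
    token from its start."""
    # Optional scheme, then host (everything up to the first '/'),
    # then an optional path.
    m = re.match(r'^(?:[a-z][a-z0-9+.-]*://)?([^/]+)(/.*)?$', token)
    host, path = m.group(1), m.group(2)
    parts = [('url', token), ('host', host)]
    if path:
        parts.append(('url_path', path))
    return parts

print(url_subtokens('example.com/docs/index.html'))
# → [('url', 'example.com/docs/index.html'),
#    ('host', 'example.com'),
#    ('url_path', '/docs/index.html')]
```

The proposed patch, as I read the thread, would additionally emit finer-grained "url-part" style tokens from the same re-scan, using the same reset-to-token-start trick for file/email/host tokens.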