Any updates on this?
On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant...@gmail.com> wrote:

> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do. Why didn't you code the url-part mechanism using the
> > existing support for compound words?
>
> I am not familiar with the compound word implementation, so I am not sure
> how to split a URL with compound word support. The documentation for
> compound words does not say much about how to identify the components of
> a token. Does a compound word split by matching against a list of words?
> If so, we will not be able to use that, since we do not know all the
> words that can appear in a URL/host/email/file.
>
> Another approach could be to use the dict_regex dictionary support.
> However, we would have to match the regex against something the parser
> is doing.
>
> The current patch does not invent any new mechanism. It uses the special
> handler mechanism already present in the parser. For example, when the
> current parser finds a URL, it runs a special handler called SpecialFURL,
> which resets the parser position to the start of the token to find the
> hostname. After finding the host, it moves on to finding the path. So
> you first get the URL, then the host, and finally the path.
>
> Similarly, we reset the parser to the start of the token on finding a
> URL and output the URL parts before entering the state that can lead to
> a URL. The state machine modification is similar for the other token
> types (file/email/host).
>
> > The changes made to parsetext()
> > seem particularly scary: it's not clear at all that that's not breaking
> > unrelated behaviors. In fact, the changes in the regression test
> > results suggest strongly to me that it *is* breaking things. Why are
> > there so many diffs in examples that include no URLs at all?
>
> I think some of the difference comes from the fact that pos now starts
> at 0, whereas it used to start at 1. That is easily fixable, though.
>
> > An issue that's nearly as bad is the 100% lack of documentation,
> > which makes the patch difficult to review because it's hard to tell
> > what it intends to accomplish or whether it's met the intent.
> > The patch is not committable without documentation anyway, but right
> > now I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation because I could not find a place in
> the code to put the documentation (the patch is just a modification of
> the state machine). Should I do a separate write-up explaining the
> desired output and the changes made to achieve it?
>
> > In line with the lack of documentation, I would say that the choice of
> > the name "parttoken" for the new token type is not helpful. Part of
> > what? And none of the other token type names include the word "token",
> > so that's not a good decision either. Possibly "url_part" would be a
> > suitable name.
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is agreement on the rest of the issues. So let me know whether I
> should go ahead with this.
>
> -Sushant.
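For readers following the thread, the re-scanning behavior described above (SpecialFURL emitting the whole URL token, then the host, then the path) can be sketched in a few lines. This is a hypothetical Python illustration of the desired token output, not the actual C parser code; the function name `url_subtokens` and the regex are my own, and the real parser uses a state machine rather than a regex:

```python
import re

def url_subtokens(token):
    """Illustrative sketch only (not the actual C parser): split a URL
    token into the whole-URL token plus host and path sub-tokens, roughly
    the way the text-search parser's SpecialFURL handler re-scans the
    token from its start."""
    # Optional scheme, then host (everything up to the first '/'),
    # then an optional path.
    m = re.match(r'^(?:[a-z][a-z0-9+.-]*://)?([^/]+)(/.*)?$', token)
    host, path = m.group(1), m.group(2)
    parts = [('url', token), ('host', host)]
    if path:
        parts.append(('url_path', path))
    return parts

print(url_subtokens('example.com/docs/index.html'))
# → [('url', 'example.com/docs/index.html'),
#    ('host', 'example.com'),
#    ('url_path', '/docs/index.html')]
```

The proposed patch, as I read the thread, would additionally emit finer-grained "url-part" style tokens from the same re-scan, using the same reset-to-token-start trick for file/email/host tokens.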