[ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539135
 ] 

Andrzej Bialecki  commented on NUTCH-567:
-----------------------------------------

I'm slightly worried about losing track of what has been changed in the
patched version, because we don't keep the TagSoup patch in our svn, and the
patch is unlikely to end up in an official release soon. At the very least I
think we should add a README-tagsoup-patched.txt that points to this issue and
the patch.

> Proper (?) handling of URIs in TagSoup.
> ---------------------------------------
>
>                 Key: NUTCH-567
>                 URL: https://issues.apache.org/jira/browse/NUTCH-567
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. There
> is more discussion on the list and on TagSoup's mailing list:
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.
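
The workaround described in the issue (remember the parsing state, skip entity
resolution while inside a URI) can be illustrated with a toy decoder. This is
only a sketch of the idea, not TagSoup code and not the attached patch: the
class EntityStateSketch, the State enum, and the tiny entity table are all
hypothetical names made up for the example, and the sample URI assumes the
problem shows up when a query parameter name happens to look like an entity
name.

```java
import java.util.Map;

// Toy illustration only; not TagSoup code and not the attached patch.
final class EntityStateSketch {
    // Hypothetical parsing states; the names are assumptions for this sketch.
    enum State { TEXT, URI_ATTRIBUTE }

    // Tiny entity table, just enough for the example.
    private static final Map<String, String> ENTITIES =
            Map.of("amp", "&", "lt", "<", "gt", ">", "copy", "\u00A9");

    // Resolves entities in input unless the remembered state is URI_ATTRIBUTE.
    static String decode(String input, State state) {
        if (state == State.URI_ATTRIBUTE) {
            return input; // inside a URI: leave '&' sequences untouched
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                // Read the run of letters after '&' as a candidate entity name.
                int j = i + 1;
                while (j < input.length() && Character.isLetter(input.charAt(j))) {
                    j++;
                }
                String replacement = ENTITIES.get(input.substring(i + 1, j));
                if (replacement != null) {
                    out.append(replacement);
                    // Consume the terminating ';' if present (lenient parsing).
                    i = (j < input.length() && input.charAt(j) == ';') ? j + 1 : j;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String href = "page?x=1&copy=2";
        System.out.println(decode(href, State.TEXT));          // "&copy" is resolved
        System.out.println(decode(href, State.URI_ATTRIBUTE)); // left unchanged
    }
}
```

The trade-off in this naive form is that legitimate entities such as &amp;
inside a URI also stay unresolved, so a caller would have to handle those
separately.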

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
