[
https://issues.apache.org/jira/browse/NUTCH-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402264#comment-16402264
]
ASF GitHub Bot commented on NUTCH-2522:
---------------------------------------
okedoki commented on issue #290: NUTCH-2522
URL: https://github.com/apache/nutch/pull/290#issuecomment-373794763
@sebastian-nagel
Thanks for the suggestion to use urlnormalizer-regex. I rewrote the plugin
based on this approach (it now makes sense to refactor urlnormalizer-regex and
this plugin to share the same code base).
The usage is correct: at the moment we apply the same regex to both the input
and the output URL and check whether the results match.
In the future this could be improved by using two separate regexes, one for
the input and one for the output.
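
As a rough illustration of the check described above, here is a minimal sketch. It is not the plugin's actual code: the class name, the method names, and the single pattern/substitution pair are hypothetical and chosen only to show the idea of transforming both URLs with the same rule and comparing the results.

```java
import java.util.regex.Pattern;

public class BidirectionalExemptionSketch {

    // Hypothetical rule (an assumption, not taken from the plugin):
    // drop the scheme and an optional "www." prefix, keep only the host.
    private static final Pattern RULE =
        Pattern.compile("^https?://(?:www\\.)?([^/:?#]+).*$");
    private static final String SUBSTITUTION = "$1";

    // Apply the same pattern/substitution to a URL.
    // A URL that does not match the pattern is returned unchanged.
    public static String normalize(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    // Exempt the link when both URLs transform to the same string.
    public static boolean isExempted(String fromUrl, String toUrl) {
        return normalize(fromUrl).equals(normalize(toUrl));
    }

    public static void main(String[] args) {
        // Same host with and without "www": prints true
        System.out.println(isExempted(
            "http://www.website.com/home", "http://website.com/about"));
        // Different hosts: prints false
        System.out.println(isExempted(
            "http://www.website.com/home", "http://other.org/about"));
    }
}
```

With two separate rules, `normalize` would presumably take the rule as a parameter, one for fromUrl and one for toUrl.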
> Bidirectional URL exemption filter
> -----------------------------------
>
> Key: NUTCH-2522
> URL: https://issues.apache.org/jira/browse/NUTCH-2522
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Reporter: Semyon Semyonov
> Priority: Minor
>
> The current Nutch URL exemption plugin exempts links based on toUrl only. The
> new plugin uses both fromUrl and toUrl and, after the regex transformation,
> exempts a link when regex(fromUrl) == regex(toUrl).
> This approach allows more complex URL exemption checks, such as allowing
> links like:
> http://www.website.com/home -> http://website.com/about (with/without www).