[
https://issues.apache.org/jira/browse/NUTCH-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2770:
-----------------------------------
Fix Version/s: 1.17
> Subcollection logic allows empty string as a whitelist value, thus matching
> every incoming document.
> ----------------------------------------------------------------------------------------------------
>
> Key: NUTCH-2770
> URL: https://issues.apache.org/jira/browse/NUTCH-2770
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Jason Grey
> Priority: Minor
> Fix For: 1.17
>
>
> If the whitelist element in subcollections.xml contains empty or
> whitespace-only lines at the end (e.g. because the XML was formatted nicely),
> those lines can become an empty string in the string matching logic. That
> logic uses String.contains, which returns true when given an empty string.
> This then causes that subcollection to be tagged on EVERY incoming document.
> Here is a POC to show the issue in isolation, since I do not yet have a dev
> environment set up for Nutch.
> {code:java}
> /**
> This snippet reproduces the matching logic of Subcollection.java in Nutch.
> https://github.com/apache/nutch/blob/fdee94d8e0894384f1fca7c9f16c7593a5bc928c/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
> **/
> import java.util.StringTokenizer;
>
> public class HelloWorld
> {
>     public static void main(String[] args)
>     {
>         String urlToTest = "https://www.example.com/test/url/here";
>         // Whitelist text as read from the XML, including the trailing "\r\n\t".
>         String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
>         StringTokenizer st = new StringTokenizer(text, "\n\r");
>         while (st.hasMoreTokens()) {
>             // The last token is "\t", which trims to the empty string.
>             String line = st.nextToken().trim();
>             boolean matched = urlToTest.contains(line);
>             System.out.println("line: [" + line + "] = " + matched);
>         }
>     }
> }
> /**
> output:
> line: [//research.xyz.com/] = false
> line: [/research/] = false
> line: [] = true
> As we can see, the text from our XML config produces an extra, empty line,
> and that line matches EVERYTHING...
> **/
> {code}
> As a workaround, you can collapse the whitespace in the XML file, but I
> think we should fix this anyway. I will try to do so and submit a patch soon
> that filters out empty strings.
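
The filtering proposed above could look roughly like the following. This is only a sketch under stated assumptions, not the actual patch: the class name WhitelistSketch and the helpers parsePatterns and matchesAny are hypothetical and do not exist in Subcollection.java. It reuses the tokenizing approach from the POC and skips any line that is empty after trimming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch; names are illustrative and not taken from Nutch.
public class WhitelistSketch {

    // Split the element text into patterns, dropping lines that are
    // empty after trimming so they can never match every URL.
    static List<String> parsePatterns(String text) {
        List<String> patterns = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, "\n\r");
        while (st.hasMoreTokens()) {
            String line = st.nextToken().trim();
            if (!line.isEmpty()) {
                patterns.add(line);
            }
        }
        return patterns;
    }

    // Substring match against each pattern, as in the POC.
    static boolean matchesAny(String url, List<String> patterns) {
        for (String pattern : patterns) {
            if (url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
        List<String> patterns = parsePatterns(text);
        System.out.println("patterns: " + patterns);
        System.out.println(matchesAny("https://www.example.com/test/url/here", patterns));
        System.out.println(matchesAny("https://research.xyz.com/paper", patterns));
    }
}
```

With the same whitelist text as in the POC, only the two real patterns are kept, so the example.com URL no longer matches while a URL on research.xyz.com still does.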