[ https://issues.apache.org/jira/browse/NUTCH-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Giuseppe Totaro updated NUTCH-1989:
-----------------------------------
    Attachment: NUTCH-1989.patch

> Handling invalid URLs in CommonCrawlDataDumper
> ----------------------------------------------
>
>                 Key: NUTCH-1989
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1989
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>         Attachments: NUTCH-1989.patch
>
>
> Hi all,
> while running the {{CommonCrawlDataDumper}} tool ({{bin/nutch commoncrawldump}})
> with the new options (as described in
> [NUTCH-1975|https://issues.apache.org/jira/browse/NUTCH-1975]), I noticed
> some problems when an invalid URL is encountered.
> For example, the following URLs (which I found in crawled data) break the
> naming schema used by the {{-epochFilename}} command-line option:
> * http://www/
> * http:/
> In more detail, with the {{-epochFilename}} option, extracted files are
> organized in a reversed-DNS tree based on the FQDN of the webpage, followed
> by a SHA1 hash of the complete URL. When the tool encounters URLs like those
> above, it cannot build the reversed-DNS tree.
> Attached is a simple patch for detecting invalid URLs. The patch uses the
> [Apache Commons Validator|http://commons.apache.org/proper/commons-validator/]
> APIs to detect invalid URLs:
> {code}
> UrlValidator urlValidator = new UrlValidator();
> if (!urlValidator.isValid(url)) {
>   LOG.warn("Not valid URL detected: " + url);
> }
> {code}
> The tool logs a warning message if an invalid URL is detected. I am just
> wondering if we can perform a specific action if invalid URLs occur.
> We could skip invalid URLs, but I noticed that the following URLs are also
> detected as invalid:
> {noformat}
> 2015-04-15 13:49:40,386 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
> 2015-04-15 13:49:41,603 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
> 2015-04-15 13:49:41,632 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http:/
> 2015-04-15 13:49:44,601 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
> 2015-04-15 13:50:34,821 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
> 2015-04-15 13:50:35,847 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
> 2015-04-15 13:50:35,866 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http:/
> 2015-04-15 13:50:38,605 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
> 2015-04-15 13:51:20,013 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://antilop.cc/sr/users/nomad bloodbath
> 2015-04-15 13:51:20,499 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
> 2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
> 2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com\/gaming\/2015\/04\/mortal-kombat-x-charges-players-for-easy-fatalities\/
> 2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
> 2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/civis
> 2015-04-15 13:51:20,588 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/ars.to\/1tECmHU
> 2015-04-15 13:51:20,589 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com
> 2015-04-15 13:51:20,589 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com\/tech-policy\/2014\/11\/prosecutor-silk-road-2-0-suspect-did-admit-to-everything\/
> 2015-04-15 13:51:20,590 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
> 2015-04-15 13:51:20,590 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/civis
> {noformat}
> I would be very pleased to get your feedback on what action to perform when
> invalid URLs are detected, so that we avoid dropping data and breaking the
> naming schema when the {{-epochFilename}} option is used.
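One possible action, sketched below purely as an illustration, is to keep the data but route URLs without a usable FQDN into a fallback directory, so that the reversed-DNS tree is never asked to handle them. This is not the attached patch and not the tool's actual code: the class name {{EpochFilenameSketch}}, the {{_invalid}} bucket name, the "/" separator, and the "host must contain a dot" rule are all assumptions made for the example, and it uses {{java.net.URI}} from the JDK rather than Commons Validator.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: reversed-DNS path plus SHA-1 of the URL,
// with a fallback bucket for URLs that have no usable FQDN.
class EpochFilenameSketch {
    // Assumed name of the fallback directory (not defined by Nutch).
    static final String INVALID_BUCKET = "_invalid";

    // Returns the host if it looks like an FQDN, otherwise null.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            // "http:/" has no host; "http://www/" has a host without a dot.
            if (host == null || !host.contains(".")) {
                return null;
            }
            return host;
        } catch (URISyntaxException e) {
            // Spaces, backslashes, etc. make the URL unparsable.
            return null;
        }
    }

    // "www.example.com" -> "com/example/www"
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('/');
            }
        }
        return sb.toString();
    }

    // Lower-case hexadecimal SHA-1 digest of the UTF-8 bytes of s.
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Directory is the reversed host, or the fallback bucket; the file
    // name is always the hash of the complete URL, so no data is dropped.
    static String pathFor(String url) {
        String host = hostOf(url);
        String dir = (host == null) ? INVALID_BUCKET : reverseHost(host);
        return dir + "/" + sha1Hex(url);
    }

    public static void main(String[] args) {
        System.out.println(pathFor("http://www.example.com/page"));
        System.out.println(pathFor("http://www/"));
        System.out.println(pathFor("http:/"));
    }
}
```

Because the hash is computed from the complete URL string, every record still gets a distinct file name even when the host part cannot be used. A counter for invalid URLs could be incremented at the point where {{hostOf}} returns null.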
> Now I am going to add a counter for invalid URLs. Thanks [~lewismc] for
> supporting me on this work.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)