[jira] [Commented] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662720#comment-13662720 ] Hudson commented on NUTCH-1053: --- Integrated in Nutch-trunk #2210 (See [https://builds.apache.org/job/Nutch-trunk/2210/]) NUTCH-1053 Parsing of RSS feeds fails (Revision 1484628) Result = SUCCESS tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1484628 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/feed/ivy.xml Parsing of RSS feeds fails --- Key: NUTCH-1053 URL: https://issues.apache.org/jira/browse/NUTCH-1053 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.7 Attachments: nutch-1053.patch, NUTCH-1053.trunk.patch, seed.txt See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html Have been able to reproduce the problem and will look into it -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662722#comment-13662722 ] Hudson commented on NUTCH-1513: --- Integrated in Nutch-trunk #2210 (See [https://builds.apache.org/job/Nutch-trunk/2210/]) NUTCH-1513 Support Robots.txt for Ftp urls (Revision 1484638) Result = SUCCESS tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1484638 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java * /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java Support Robots.txt for Ftp urls --- Key: NUTCH-1513 URL: https://issues.apache.org/jira/browse/NUTCH-1513 Project: Nutch Issue Type: Improvement Affects Versions: 1.7, 2.2 Reporter: Tejas Patil Assignee: Tejas Patil Priority: Minor Labels: robots.txt Fix For: 2.3, 1.8 Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, NUTCH-1513.trunk.v2.patch As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs. In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_ {noformat} public RobotRules getRobotRules(Text url, CrawlDatum datum) { return EmptyRobotRules.RULES; }{noformat} It's not clear if this was part of the design or if it's a bug. [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt [1] : ftp://example.com/robots.txt
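To illustrate what honouring robots.txt for FTP URLs involves, here is a minimal sketch. It is not the FtpRobotRulesParser committed for this issue, and it uses no Nutch API: the class name FtpRobotsSketch and both method names are hypothetical. It also assumes a simplified robots.txt in which only "User-agent: *" groups are honoured and matching is by plain path prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the FtpRobotRulesParser from the Nutch commit.
class FtpRobotsSketch {

    /** Collects the Disallow path prefixes listed under "User-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToAll = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: each User-agent line starts a new group.
                appliesToAll = value.equals("*");
            } else if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** Returns true unless the path starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch the robots file once per FTP host (for example ftp://example.com/robots.txt from the issue description), build the prefix list, and call isAllowed on the path of each URL before fetching it. Returning an empty rule set, as the quoted getRobotRules does, is equivalent to an empty prefix list here: every path is allowed.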
[jira] [Commented] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint argument
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662721#comment-13662721 ] Hudson commented on NUTCH-1249: --- Integrated in Nutch-trunk #2210 (See [https://builds.apache.org/job/Nutch-trunk/2210/]) NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac -Xlint argument (Revision 1484634) Result = SUCCESS tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1484634 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/build.xml * /nutch/trunk/src/java/org/apache/nutch/crawl/FetchScheduleFactory.java * /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java * /nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java * /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDbReader.java * /nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java * /nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java * /nutch/trunk/src/java/org/apache/nutch/crawl/SignatureComparator.java * /nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java * /nutch/trunk/src/java/org/apache/nutch/fetcher/OldFetcher.java * /nutch/trunk/src/java/org/apache/nutch/indexer/NutchField.java * /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java * /nutch/trunk/src/java/org/apache/nutch/metadata/Metadata.java * /nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java * /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java * /nutch/trunk/src/java/org/apache/nutch/parse/HTMLMetaTags.java * /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java * /nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java * /nutch/trunk/src/java/org/apache/nutch/plugin/Extension.java * /nutch/trunk/src/java/org/apache/nutch/plugin/PluginDescriptor.java * /nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LoopReader.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/Loops.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/NodeReader.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java * /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMergeFilter.java * /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMergeFilters.java * /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java * /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java * /nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java * /nutch/trunk/src/java/org/apache/nutch/tools/ResolveUrls.java * /nutch/trunk/src/java/org/apache/nutch/tools/proxy/SegmentHandler.java * /nutch/trunk/src/java/org/apache/nutch/util/GenericWritableConfigurable.java * /nutch/trunk/src/java/org/apache/nutch/util/PrefixStringMatcher.java * /nutch/trunk/src/java/org/apache/nutch/util/SuffixStringMatcher.java * /nutch/trunk/src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCParseFilter.java * /nutch/trunk/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java * /nutch/trunk/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagParser.java * /nutch/trunk/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java * /nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java * /nutch/trunk/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java * /nutch/trunk/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf/SWFParser.java * /nutch/trunk/src/plugin/parse-swf/src/test/org/apache/nutch/parse/swf/TestSWFParser.java * /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java * /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java * /nutch/trunk/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java * /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java * /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java * /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java * /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java * /nutch/trunk/src/plugin/protocol-httpclient/src/test/org/apache/nutch/protocol/httpclient/TestProtocolHttpClient.java * /nutch/trunk/src/plugin/urlfilter-prefix/src/java/org/apache/nutch/urlfilter/prefix/PrefixURLFilter.java * /nutch/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java *
[jira] [Commented] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662727#comment-13662727 ] Hudson commented on NUTCH-1053: --- Integrated in Nutch-nutchgora #613 (See [https://builds.apache.org/job/Nutch-nutchgora/613/]) NUTCH-1053 Parsing of RSS feeds fails (Revision 1484627) Result = SUCCESS tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1484627 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/src/plugin/feed/ivy.xml Parsing of RSS feeds fails --- Key: NUTCH-1053 URL: https://issues.apache.org/jira/browse/NUTCH-1053 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.7 Attachments: nutch-1053.patch, NUTCH-1053.trunk.patch, seed.txt See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html Have been able to reproduce the problem and will look into it
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662728#comment-13662728 ] Hudson commented on NUTCH-1513: --- Integrated in Nutch-nutchgora #613 (See [https://builds.apache.org/job/Nutch-nutchgora/613/]) NUTCH-1513 Support Robots.txt for Ftp urls (Revision 1484637) Result = SUCCESS tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1484637 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java * /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java Support Robots.txt for Ftp urls --- Key: NUTCH-1513 URL: https://issues.apache.org/jira/browse/NUTCH-1513 Project: Nutch Issue Type: Improvement Affects Versions: 1.7, 2.2 Reporter: Tejas Patil Assignee: Tejas Patil Priority: Minor Labels: robots.txt Fix For: 2.3, 1.8 Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, NUTCH-1513.trunk.v2.patch As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs. In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_ {noformat} public RobotRules getRobotRules(Text url, CrawlDatum datum) { return EmptyRobotRules.RULES; }{noformat} It's not clear if this was part of the design or if it's a bug. [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt [1] : ftp://example.com/robots.txt
Jenkins build is back to normal : Nutch-nutchgora #613
See https://builds.apache.org/job/Nutch-nutchgora/613/changes
[jira] [Commented] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint argument
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663031#comment-13663031 ] Sebastian Nagel commented on NUTCH-1249: Great work, [~tejas.patil]! Resolve all issues flagged up by adding javac -Xlint argument -- Key: NUTCH-1249 URL: https://issues.apache.org/jira/browse/NUTCH-1249 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.7 Attachments: NUTCH-1249.trunk.patch There are a heap of issues flagged up by NUTCH-1237; I think over time it would be great to get these addressed and resolved. What is interesting is that adding the same arguments to /src/plugin/plugin-build.xml actually breaks my build, as tests begin to fail. Some of this is documented at the link below: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663413#comment-13663413 ] Chris Hairfield commented on NUTCH-1569: I applied your v2 patch an hour ago and, after re-running my crawls, everything looks great. For reference, here's my setup: Nutch 2.x (fully updated to rev. 1484961) + NUTCH-1486-2.x.v2.patch (Upgrade to Solr 4.2.1) HBase 0.90.4 Upgrade 2.x to Gora 0.3 --- Key: NUTCH-1569 URL: https://issues.apache.org/jira/browse/NUTCH-1569 Project: Nutch Issue Type: Improvement Components: build, storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch We just released the Maven artifacts and I would like to upgrade before we push the RC for 2.2 :) Patch coming up
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663431#comment-13663431 ] Lewis John McGibbney commented on NUTCH-1569: - Dynamite Chris, thanks for the heads up. Upgrade 2.x to Gora 0.3 --- Key: NUTCH-1569 URL: https://issues.apache.org/jira/browse/NUTCH-1569 Project: Nutch Issue Type: Improvement Components: build, storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch We just released the Maven artifacts and I would like to upgrade before we push the RC for 2.2 :) Patch coming up
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663561#comment-13663561 ] Lewis John McGibbney commented on NUTCH-1569: - I will commit this patch and push the 2.2RC tonight unless there are any objections. Upgrade 2.x to Gora 0.3 --- Key: NUTCH-1569 URL: https://issues.apache.org/jira/browse/NUTCH-1569 Project: Nutch Issue Type: Improvement Components: build, storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch We just released the Maven artifacts and I would like to upgrade before we push the RC for 2.2 :) Patch coming up
[jira] [Resolved] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1569. - Resolution: Fixed Committed @revision 1485044 in 2.x HEAD Thank you to everyone who commented on this issue. Upgrade 2.x to Gora 0.3 --- Key: NUTCH-1569 URL: https://issues.apache.org/jira/browse/NUTCH-1569 Project: Nutch Issue Type: Improvement Components: build, storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch We just released the Maven artifacts and I would like to upgrade before we push the RC for 2.2 :) Patch coming up
[jira] [Commented] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663741#comment-13663741 ] Lewis John McGibbney commented on NUTCH-1275: - Hi Tejas, just curious. Has this been addressed in 2.X? Great work on these issues BTW. Really great work. Fix [unchecked] javac warnings -- Key: NUTCH-1275 URL: https://issues.apache.org/jira/browse/NUTCH-1275 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.7 We can simply suppress these warnings using {code} @SuppressWarnings("unchecked") {code} However, if there is another method for resolving these warnings, it should be implemented if deemed beneficial to code quality. Some resources: http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
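The two options the issue describes, suppressing the warning or resolving it another way, can be contrasted in a small sketch. This is not code from the Nutch patches; the class UncheckedSketch and its methods are hypothetical. It only shows the general Java idiom: prefer parameterized types so that no warning is raised, and where an unchecked cast is unavoidable, put the suppression on the narrowest possible scope.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Nutch patches.
class UncheckedSketch {

    // Resolving the warning: a raw "new ArrayList()" assigned to List<String>
    // triggers [unchecked]; a parameterized type does not, so nothing is suppressed.
    static List<String> names() {
        List<String> out = new ArrayList<>();
        out.add("nutch");
        return out;
    }

    // Suppressing the warning: when a cast from Object cannot be avoided,
    // annotate only the local variable rather than the whole method or class.
    static List<String> fromObject(Object o) {
        @SuppressWarnings("unchecked")
        List<String> typed = (List<String>) o;
        return typed;
    }
}
```

Keeping the annotation on a single declaration means that any new unchecked operation added elsewhere in the method is still reported when compiling with javac -Xlint:unchecked.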
[jira] [Assigned] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1275: -- Assignee: Tejas Patil Fix [unchecked] javac warnings -- Key: NUTCH-1275 URL: https://issues.apache.org/jira/browse/NUTCH-1275 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Minor Fix For: 1.7 We can simply suppress these warnings using {code} @SuppressWarnings("unchecked") {code} However, if there is another method for resolving these warnings, it should be implemented if deemed beneficial to code quality. Some resources: http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
[jira] [Commented] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663792#comment-13663792 ] Tejas Patil commented on NUTCH-1275: Hi [~lewismc], I am working on a patch for 2.x. Fix [unchecked] javac warnings -- Key: NUTCH-1275 URL: https://issues.apache.org/jira/browse/NUTCH-1275 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Minor Fix For: 1.7 We can simply suppress these warnings using {code} @SuppressWarnings("unchecked") {code} However, if there is another method for resolving these warnings, it should be implemented if deemed beneficial to code quality. Some resources: http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772