[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1053:
---
Attachment: NUTCH-1053.trunk.patch

A tiny change in the ivy file for the feed plugin fixes the problem. Attached a patch for trunk.

{noformat}
$ wget http://feeds.bbci.co.uk/news/scotland/rss.xml
$ bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser rss.xml
key: http://www.bbc.co.uk/sport/0/football/22477429
data: Version: 5
Status: success(1,0)
Title: The man who floored Alex Ferguson
Outlinks: 0
Content Metadata:
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 feed=http://www.bbc.co.uk/news/scotland/ published=1368226806000
text: How Sir Alex's temper helped build his legend - and success
{noformat}

Parsing of RSS feeds fails
---
Key: NUTCH-1053
URL: https://issues.apache.org/jira/browse/NUTCH-1053
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.7
Attachments: nutch-1053.patch, NUTCH-1053.trunk.patch, seed.txt

See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
Have been able to reproduce the problem and will look into it

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1053:
-
Fix Version/s: (was: 1.5)
               1.6

20120304-push-1.6

Parsing of RSS feeds fails
---
Key: NUTCH-1053
URL: https://issues.apache.org/jira/browse/NUTCH-1053
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.6
Attachments: nutch-1053.patch, seed.txt

See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
Have been able to reproduce the problem and will look into it
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Kazekin updated NUTCH-1053:
---
Attachment: nutch-1053.patch

The problem is that the feed plugin's plugin.xml doesn't support multiple 'contentType' parameters (yet). I joined the values into one string, separated by |.

Parsing of RSS feeds fails
---
Key: NUTCH-1053
URL: https://issues.apache.org/jira/browse/NUTCH-1053
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.5
Attachments: nutch-1053.patch, seed.txt

See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
Have been able to reproduce the problem and will look into it
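
To illustrate the idea of a single 'contentType' value that carries several MIME types joined with |, here is a minimal Java sketch. It is not Nutch's ParserFactory code and not the content of the attached patch: the class name `ContentTypeMatch`, its methods, and every content type other than application/rss+xml (the one named in the ParserFactory warning on this issue) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: checks whether a '|'-joined
// contentType attribute value lists a given MIME type.
public class ContentTypeMatch {

    /** Splits a '|'-joined attribute value into its individual content types. */
    static List<String> split(String joined) {
        // The pipe is escaped because String.split takes a regular expression.
        return Arrays.asList(joined.split("\\|"));
    }

    /** Returns true if the joined value lists exactly this content type. */
    static boolean supports(String joined, String contentType) {
        return split(joined).contains(contentType);
    }

    public static void main(String[] args) {
        // Assumed example value; only application/rss+xml appears in this issue.
        String joined = "application/rss+xml|application/atom+xml|text/xml";

        System.out.println(supports(joined, "application/rss+xml")); // true
        System.out.println(supports(joined, "text/html"));           // false

        // Using the joined string directly as a regex does not match,
        // because '+' is read as a quantifier rather than a literal plus sign.
        System.out.println("application/rss+xml".matches(joined));   // false
    }
}
```

Splitting on the literal pipe and comparing strings exactly sidesteps regex metacharacters such as the `+` in application/rss+xml. Whether the real matching code treats the attribute as a regular expression is not stated in this thread.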
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1053:
-
Fix Version/s: 1.5

I'd happily give an example or fix it myself if only I could find it :-)
Moved to 1.5 and left open for now

Parsing of RSS feeds fails
---
Key: NUTCH-1053
URL: https://issues.apache.org/jira/browse/NUTCH-1053
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.5
Attachments: seed.txt

See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
Have been able to reproduce the problem and will look into it
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1053:
-
Fix Version/s: (was: 1.4)

Parsing of RSS feeds fails
---
Key: NUTCH-1053
URL: https://issues.apache.org/jira/browse/NUTCH-1053
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.5
Attachments: seed.txt

See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
Have been able to reproduce the problem and will look into it
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1053:
Attachment: seed.txt

I attach a seed file which I've used with the crawl command to parse and index several feed URLs. Using the crawl command, the only warning in my logs was as follows:

{code}
2011-10-10 22:10:37,853 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/rss+xml
{code}

Additionally, I've used the command line to attempt to parse the feeds, but I'm getting the following. Any thoughts? Can you give a use case or a URL which will reproduce the problem you mention with the RSS parser?

{code}
lewis@lewis:~/ASF/trunk/runtime/local$ bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser http://feeds.bbci.co.uk/news/scotland/rss.xml
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.io.FileNotFoundException: http:/feeds.bbci.co.uk/news/scotland/rss.xml (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:106)
	at org.apache.nutch.parse.feed.FeedParser.main(FeedParser.java:209)
	... 5 more
{code}

Parsing of RSS feeds fails
---
Key: NUTCH-1053
URL: https://issues.apache.org/jira/browse/NUTCH-1053
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.4
Attachments: seed.txt

See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
Have been able to reproduce the problem and will look into it
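
The FileNotFoundException in the trace above comes from java.io.FileInputStream, called from FeedParser.main. That suggests the command-line entry point treats its argument as a local file path, so a URL cannot be passed to it directly. A minimal Java sketch of one workaround follows: resolve the argument to a local file first, downloading it when it looks like a URL. The class `FeedArgument` and the method `toLocalFile` are hypothetical names for illustration and are not part of Nutch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Nutch: turns a command-line argument
// into a local file that a file-based parser entry point can open.
public class FeedArgument {

    /**
     * Returns a local path for the argument. If the argument has the form
     * scheme://..., its content is copied to a temporary file first;
     * otherwise it is taken to be a local path and returned as is.
     */
    static Path toLocalFile(String arg) throws IOException {
        if (arg.matches("[a-zA-Z][a-zA-Z0-9+.-]*://.*")) {
            Path tmp = Files.createTempFile("feed", ".xml");
            try (InputStream in = URI.create(arg).toURL().openStream()) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            return tmp;
        }
        return Paths.get(arg);
    }

    public static void main(String[] args) throws IOException {
        // With no argument, fall back to a local file name such as rss.xml.
        System.out.println(toLocalFile(args.length > 0 ? args[0] : "rss.xml"));
    }
}
```

The resulting path could then be handed to `bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser` in place of the URL, which has the same effect as fetching the feed with a separate download tool beforehand.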