[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2013-05-12 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1053:
---

Attachment: NUTCH-1053.trunk.patch

A tiny change to the ivy file for the feed plugin fixes the problem. I have 
attached a patch for trunk.

{noformat}$ wget http://feeds.bbci.co.uk/news/scotland/rss.xml
$ bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser rss.xml 
key: http://www.bbc.co.uk/sport/0/football/22477429
data: Version: 5
Status: success(1,0)
Title: The man who floored Alex Ferguson
Outlinks: 0
Content Metadata: 
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 feed=http://www.bbc.co.uk/news/scotland/ published=1368226806000 

text: How Sir Alex's temper helped build his legend - and success


{noformat}

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.7

 Attachments: nutch-1053.patch, NUTCH-1053.trunk.patch, seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1053:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.6

 Attachments: nutch-1053.patch, seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2012-02-20 Thread Michael Kazekin (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kazekin updated NUTCH-1053:
---

Attachment: nutch-1053.patch

The problem is that the feed plugin's plugin.xml doesn't support multiple 
'contentType' parameters (yet). I joined the values into one string, separated 
with |.
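
A minimal sketch of why a |-joined value can cover several content types. It assumes the declared 'contentType' value is applied to the detected type as a regular expression, which this thread does not confirm; the class name, the helper methods, and every MIME type other than application/rss+xml are hypothetical and are not taken from the patch.

```java
import java.util.regex.Pattern;

public class ContentTypeMatch {

    // '+' and '.' are literal characters in MIME types but special in
    // regular expressions, so they are escaped before matching.
    static String escape(String declared) {
        return declared.replace("+", "\\+").replace(".", "\\.");
    }

    // True if the detected type fully matches the declared value.
    // ASSUMPTION: the declared value is interpreted as a regex, so '|'
    // acts as alternation between the joined content types.
    static boolean supports(String declared, String detected) {
        return Pattern.matches(escape(declared), detected);
    }

    public static void main(String[] args) {
        // Hypothetical joined value; only application/rss+xml appears in this thread.
        String declared = "application/rss+xml|application/atom+xml|text/xml";
        System.out.println(supports(declared, "application/rss+xml"));
        System.out.println(supports(declared, "text/html"));
    }
}
```

Under that assumption a single attribute value is enough, because each alternative is tried against the whole detected type.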

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: nutch-1053.patch, seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2011-10-11 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1053:
-

Fix Version/s: 1.5

I'd happily give an example or fix it myself, if only I could find it :-)
Moved to 1.5 and left open for now.

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2011-10-11 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1053:
-

Fix Version/s: (was: 1.4)

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2011-10-10 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1053:


Attachment: seed.txt

I've attached a seed file that I used with the crawl command to parse and index 
several feed URLs. With the crawl command, the only warning in my logs was as 
follows:
{code}
2011-10-10 22:10:37,853 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/rss+xml
{code} 

Additionally, I tried to parse the feeds from the command line, but I'm getting 
the error below. Any thoughts? Can you give a use case or a URL that reproduces 
the problem you mention with the RSS parser?
{code}
lewis@lewis:~/ASF/trunk/runtime/local$ bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser http://feeds.bbci.co.uk/news/scotland/rss.xml
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.io.FileNotFoundException: http:/feeds.bbci.co.uk/news/scotland/rss.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at org.apache.nutch.parse.feed.FeedParser.main(FeedParser.java:209)
    ... 5 more
{code}
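
The FileNotFoundException in the trace above comes from java.io.FileInputStream, which suggests the command-line argument is opened as a local file rather than fetched over HTTP. The following is a minimal sketch of that behaviour; the class and the helper looksLikeUrl are hypothetical and are not part of Nutch.

```java
import java.io.File;

public class UrlAsFile {

    // Hypothetical helper: true if the argument starts with a URI scheme
    // followed by "://", i.e. it is probably a URL and not a local path.
    static boolean looksLikeUrl(String arg) {
        return arg.matches("^[a-zA-Z][a-zA-Z0-9+.-]*://.*");
    }

    public static void main(String[] args) {
        String arg = "http://feeds.bbci.co.uk/news/scotland/rss.xml";
        File f = new File(arg);
        // java.io.File treats the string as a path. On Unix-like systems it
        // collapses the duplicate slash, which would explain the
        // "http:/feeds..." spelling in the exception message.
        System.out.println(f.getPath());
        // No such local file exists, so opening it with FileInputStream fails.
        System.out.println(f.exists());
        System.out.println(looksLikeUrl(arg));
    }
}
```

Saving the feed to disk first and passing the local file name avoids the problem.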

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.4

 Attachments: seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it
