[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-444:
--------------------------------

    Attachment: parse-feed-v2.tar.bz2

Updated parse-feed plugin. Still not ready for any serious use, but I think I 
fixed the problems with indexing and dedup. Use it with NUTCH-443's v5 patch.

nutch.newbie: I change parse-plugins.xml as you do. For this plugin to work, 
you also have to change default signature to TextProfileSignature(because 
MD5Signature takes the hash of content, which is the same for every element in 
a parseMap). This is done by adding:
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>

to your nutch-site.xml.


> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

Reply via email to