[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034162#comment-16034162 ]
Lewis John McGibbney commented on NUTCH-2389:
---------------------------------------------

[~kaidul], I think that the plugin should be a ParseFilter. Do you happen to have a pull request we can review? Unit tests would also be VERY welcome.
Thank you

> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
>                 Key: NUTCH-2389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2389
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature to
> extract/parse exact content for specific websites. I've developed a plugin,
> {{parse-jsoup}}, using Jsoup for my current project to extract precise content
> for site-specific crawling, driven by a detailed XML configuration (field name,
> CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know if this feature seems relevant and is not already present
> in Nutch. I also plan to port it to Nutch 1.x.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
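
For illustration only, the per-field XML configuration mentioned in the issue description (field name, CSS selector, attribute, data type, default value) might be loaded roughly as in the sketch below. This is not the code of the {{parse-jsoup}} plugin, which is not shown in this thread: the class name `FieldRuleConfig`, the `<field>` element, and all of its attribute names are assumptions made up for the example. It uses only the JDK's XML parser and does not touch any Nutch or Jsoup API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: reads per-field extraction rules from an XML string. */
public class FieldRuleConfig {

  /** One extraction rule. Every field name here is an illustrative assumption. */
  public static final class FieldRule {
    public final String name;         // name of the output field
    public final String cssSelector;  // CSS selector that locates the element
    public final String attribute;    // attribute to read; "" means element text
    public final String dataType;     // e.g. "string", "double"
    public final String defaultValue; // value to use when nothing matches

    FieldRule(String name, String cssSelector, String attribute,
              String dataType, String defaultValue) {
      this.name = name;
      this.cssSelector = cssSelector;
      this.attribute = attribute;
      this.dataType = dataType;
      this.defaultValue = defaultValue;
    }
  }

  /** Parses every <field .../> element in the given XML into a FieldRule. */
  public static List<FieldRule> parse(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList nodes = doc.getElementsByTagName("field");
      List<FieldRule> rules = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        Element e = (Element) nodes.item(i);
        // getAttribute returns "" when the attribute is absent.
        rules.add(new FieldRule(
            e.getAttribute("name"),
            e.getAttribute("css-selector"),
            e.getAttribute("attribute"),
            e.getAttribute("data-type"),
            e.getAttribute("default-value")));
      }
      return rules;
    } catch (Exception ex) {
      throw new IllegalArgumentException("Invalid field configuration", ex);
    }
  }

  public static void main(String[] args) {
    String xml = "<fields>"
        + "<field name=\"title\" css-selector=\"h1.product-title\" data-type=\"string\"/>"
        + "<field name=\"image\" css-selector=\"img.main\" attribute=\"src\""
        + " data-type=\"string\" default-value=\"\"/>"
        + "</fields>";
    for (FieldRule r : parse(xml)) {
      System.out.println(r.name + " <- " + r.cssSelector
          + (r.attribute.isEmpty() ? " (text)" : " @" + r.attribute));
    }
  }
}
```

In a real plugin, each rule's selector would presumably be handed to Jsoup (for example `Element.select(String)` followed by `text()` or `attr(String)`), and the result attached to the parse output from inside a ParseFilter, as suggested in the comment above. How the actual plugin does this is not stated here.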