[ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838234#comment-15838234 ]
Lewis John McGibbney commented on NUTCH-1870:
---------------------------------------------

Would be really nice to get your patch as a GitHub PR, [~wastl-nagel]. Are you able to do it, or do you want me to?

> Generic xsl parser plugin
> -------------------------
>
>                 Key: NUTCH-1870
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1870
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.9
>            Reporter: Albinscode
>         Attachments: NUTCH-1870-trunk-v3.patch, NUTCH-1870-trunk-v4.patch, nutch-site.xml, xsl-parse-plugin2.patch, xsl-parse-plugin.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.
> | Your Data | --> | Parse-html plugin or TIKA plugin | --> | DOM structure | --> | XSLT plugin |
>
> The main advantages are:
> - You won't have to write any Java code, only XSLT and configuration.
> - It can process a DOM structure from a DocumentFragment (@see NekoHtml and @see TagSoup).
> - It is HtmlParseFilter plugin compatible and can be plugged in like any other plugin (parse-js, parse-swf, etc.).
> This topic has been discussed on
> http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)