[ https://issues.apache.org/jira/browse/NIFI-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004618#comment-15004618 ]
ASF GitHub Bot commented on NIFI-1156:
--------------------------------------

GitHub user jdye64 opened a pull request:

    https://github.com/apache/nifi/pull/124

    NIFI-1156: HTML Parsing Processors Bundle

    NIFI-1156: HTML Parsing Processors Bundle. GetHTMLElement, ModifyHTMLElement, and PutHTMLElement

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jdye64/nifi NIFI-1156

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #124

----
commit c82fc18f8e306c5a31345856e529cfd9fe4c81ef
Author: Jeremy Dyer <jdy...@gmail.com>
Date:   2015-11-13T20:01:10Z

    HTML Parsing Processors Bundle

    NIFI-1156 HTML Parsing Processors Bundle

commit f8d26d205901d8cd782c02a5bc57e9ec66ee2f33
Author: Jeremy Dyer <jdy...@gmail.com>
Date:   2015-11-13T20:02:23Z

    Merge branch 'master' into NIFI-1156

----

> HTML Parsing Processors Bundle
> ------------------------------
>
>                 Key: NIFI-1156
>                 URL: https://issues.apache.org/jira/browse/NIFI-1156
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Jeremy Dyer
>            Priority: Minor
>
> NiFi provides the ability to ingest HTML but lacks the convenience to easily
> interact with that HTML once it has entered the flow. There should be an HTML
> Processing Bundle that provides mechanisms for manipulating and interacting
> with HTML data once it has entered the flow. Jsoup http://jsoup.org/ seems
> like a logical tool to use since it is mature and has an MIT license, which
> would allow it to be incorporated into NiFi.
>
> “GetHTMLElement” should use the CSS selector-syntax
> (http://www.w3schools.com/cssref/css_selectors.asp) built into Jsoup to
> extract 0-N HTML elements from the original HTML input. This processor should
> support a delimited string of selectors allowing the user to build compound
> HTML element output. Each HTML element (or compound element result) extracted
> will create a new Flowfile where the element will be in either the Flowfile
> content or an attribute depending on the user configuration.
>
> “ModifyHTMLElement” should provide the ability to modify the original input
> HTML and overwrite any existing element values. The HTML element that will be
> modified can be selected by using the CSS selector-syntax.
>
> “PutHTMLElement” should provide the ability to put a new HTML element
> anywhere in the original input HTML using CSS selector-syntax to indicate the
> position that the new HTML element should be placed.
>
> There seems to be a potential for adding more processors but this seems like
> a good start. Since there is a dependency on Jsoup and a potential for more
> processors to come, I think it makes sense to add this logic as its own nar
> bundle, but I could be wrong.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
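
For readers who want a concrete picture of the extraction behaviour proposed
for GetHTMLElement, here is a minimal sketch. It is not the code from the pull
request and it does not use Jsoup: it swaps in Python's standard-library
html.parser so that it runs without dependencies, and it understands only
three simple selector forms (tag, #id, .class) rather than the full CSS
selector syntax. The names get_html_elements and ElementExtractor, the comma
as the selector delimiter, and returning one string per matched element (where
the processor would create one Flowfile) are all assumptions made for
illustration. It also assumes well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match one simple selector: 'tag', '#id' or '.class' (illustrative subset)."""
    attr_map = dict(attrs)
    if selector.startswith("#"):
        return attr_map.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in (attr_map.get("class") or "").split()
    return tag == selector.lower()


class ElementExtractor(HTMLParser):
    """Collects the outer HTML of every element matching one simple selector."""

    def __init__(self, selector):
        # Keep character references as written so the output mirrors the input.
        super().__init__(convert_charrefs=False)
        self.selector = selector
        self.depth = 0            # number of currently open non-void elements
        self.open_captures = []   # (depth at start, parts) for unfinished matches
        self.results = []         # one list of text parts per match, in document order

    def _emit(self, text):
        # Every unfinished match contains whatever is parsed next.
        for _, parts in self.open_captures:
            parts.append(text)

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        is_void = tag in VOID_TAGS
        if matches(self.selector, tag, attrs):
            parts = []
            self.results.append(parts)
            if is_void:
                parts.append(raw)  # a void element is complete immediately
            else:
                self.open_captures.append((self.depth, parts))
        self._emit(raw)
        if not is_void:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        raw = self.get_starttag_text()  # self-closing form such as <br/>
        if matches(self.selector, tag, attrs):
            self.results.append([raw])
        self._emit(raw)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return  # ignore stray end tags
        self._emit(f"</{tag}>")
        self.depth -= 1
        # A match ends when the depth returns to where the match started.
        if self.open_captures and self.open_captures[-1][0] == self.depth:
            self.open_captures.pop()

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")


def get_html_elements(html, selectors, delimiter=","):
    """Return the outer HTML of each match, grouped by selector order (0-N results)."""
    found = []
    for selector in (s.strip() for s in selectors.split(delimiter)):
        if not selector:
            continue
        parser = ElementExtractor(selector)
        parser.feed(html)
        parser.close()
        found.extend("".join(parts) for parts in parser.results)
    return found
```

Calling get_html_elements(page, "#title, b") returns the element with id
"title" followed by every b element, and an empty list when nothing matches,
which corresponds to the "0-N HTML elements" wording in the issue. How the
actual processor combines several selectors into "compound" output is not
specified in this message, so the concatenation used here is only one possible
reading.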