[ 
https://issues.apache.org/jira/browse/NIFI-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004618#comment-15004618
 ] 

ASF GitHub Bot commented on NIFI-1156:
--------------------------------------

GitHub user jdye64 opened a pull request:

    https://github.com/apache/nifi/pull/124

    NIFI-1156: HTML Parsing Processors Bundle

    NIFI-1156: HTML Parsing Processors Bundle. GetHTMLElement,
    ModifyHTMLElement, and PutHTMLElement

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jdye64/nifi NIFI-1156

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #124
    
----
commit c82fc18f8e306c5a31345856e529cfd9fe4c81ef
Author: Jeremy Dyer <jdy...@gmail.com>
Date:   2015-11-13T20:01:10Z

    HTML Parsing Processors Bundle
    
    NIFI-1156 HTML Parsing Processors Bundle

commit f8d26d205901d8cd782c02a5bc57e9ec66ee2f33
Author: Jeremy Dyer <jdy...@gmail.com>
Date:   2015-11-13T20:02:23Z

    Merge branch 'master' into NIFI-1156

----


> HTML Parsing Processors Bundle
> ------------------------------
>
>                 Key: NIFI-1156
>                 URL: https://issues.apache.org/jira/browse/NIFI-1156
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Jeremy Dyer
>            Priority: Minor
>
> NiFi can ingest HTML but offers no convenient way to interact with that HTML
> once it has entered the flow. There should be an HTML Processing Bundle that
> provides mechanisms for manipulating and interacting with HTML data once it
> has entered the flow. Jsoup (http://jsoup.org/) seems like a logical tool to
> use, since it is mature and has an MIT license, which would allow it to be
> incorporated into NiFi.
> “GetHTMLElement” should use the CSS selector syntax
> (http://www.w3schools.com/cssref/css_selectors.asp) built into Jsoup to
> extract 0-N HTML elements from the original HTML input. This processor should
> support a delimited string of selectors, allowing the user to build compound
> HTML element output. Each HTML element (or compound element result) extracted
> will create a new FlowFile, with the element placed in either the FlowFile
> content or an attribute, depending on the user configuration.
> “ModifyHTMLElement” should provide the ability to modify the original input
> HTML and overwrite any existing element values. The HTML element to be
> modified can be selected using the CSS selector syntax.
> “PutHTMLElement” should provide the ability to put a new HTML element
> anywhere in the original input HTML, using the CSS selector syntax to indicate
> the position where the new HTML element should be placed.
> There is potential for adding more processors, but this seems like a good
> start. Since there is a dependency on Jsoup and a potential for more
> processors to come, I think it makes sense to add this logic as its own NAR
> bundle, but I could be wrong.
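
The three operations described in the issue above can be illustrated directly against Jsoup. The following is only a minimal sketch, assuming Jsoup is on the classpath; it is not the code from the pull request and does not use the NiFi processor API. The class name HtmlElementSketch and its method names are hypothetical, chosen for illustration, and the choices of overwriting element text (for the modify case) and appending inside the first match (for the put case) are assumptions about what "modify" and "put" could mean.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; names are illustrative, not taken from the pull request.
public class HtmlElementSketch {

    // GetHTMLElement idea: return the outer HTML of each element that
    // matches the CSS selector (zero or more results).
    public static List<String> getElements(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        List<String> results = new ArrayList<>();
        for (Element element : doc.select(cssSelector)) {
            results.add(element.outerHtml());
        }
        return results;
    }

    // ModifyHTMLElement idea (assumed semantics): overwrite the text of
    // every element that matches the CSS selector.
    public static String modifyElements(String html, String cssSelector, String newText) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select(cssSelector)) {
            element.text(newText);
        }
        return doc.outerHtml();
    }

    // PutHTMLElement idea (assumed semantics): append new HTML as the last
    // child of the first element that matches the CSS selector. If nothing
    // matches, the document is returned without the new element.
    public static String putElement(String html, String cssSelector, String newHtml) {
        Document doc = Jsoup.parse(html);
        Element target = doc.select(cssSelector).first();
        if (target != null) {
            target.append(newHtml);
        }
        return doc.outerHtml();
    }
}
```

In an actual processor, each string returned by getElements would presumably become the content or an attribute of a new FlowFile, as the issue describes; that wiring is left out here because it depends on the NiFi API.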



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
