[ 
https://issues.apache.org/jira/browse/MINIFICPP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099135#comment-17099135
 ] 

Marc Parisi edited comment on MINIFICPP-1199 at 5/4/20, 5:03 PM:
-----------------------------------------------------------------

[~joewitt] gotcha, I think we're on the same page. I was likely confusing 
myself via the Apache e-mail chain and, by being too terse, probably confusing 
you. Thanks!


was (Author: phrocker):
[~joewitt] gotcha I think we're on the same page and I was likely confusing 
myself via the Apache E-mail chain. Thanks!

> Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML 
> Inference on Edge
> -----------------------------------------------------------------------------------------------
>
>                 Key: MINIFICPP-1199
>                 URL: https://issues.apache.org/jira/browse/MINIFICPP-1199
>             Project: Apache NiFi MiNiFi C++
>          Issue Type: New Feature
>    Affects Versions: master
>         Environment: Ubuntu 18.04 in AWS EC2
> MiNiFi C++ 0.7.0
>            Reporter: James Medel
>            Priority: Blocker
>             Fix For: 0.8.0
>
>
> *MiNiFi C++ and H2O Driverless AI Integration* via Custom Python Processors:
> Integrates MiNiFi C++ with H2O's Driverless AI by Using Driverless AI's 
> Python Scoring Pipeline and MiNiFi's Custom Python Processors. Uses the 
> Python Processors to execute the Python Scoring Pipeline scorer to do batch 
> scoring and real-time scoring for one or more predicted labels on test data 
> in the incoming flow file content. I would like to contribute my processors 
> to MiNiFi C++ as a new feature.
>  
> *3 custom python processors* created for MiNiFi:
> *H2oPspScoreRealTime* - Executes H2O Driverless AI's Python Scoring Pipeline 
> to do interactive (real-time) scoring on an individual row or list of 
> test data within each incoming flow file. Uses H2O's open-source Datatable 
> library to load test data into a frame, then converts it to a pandas dataframe. 
> Pandas is used to convert the pandas dataframe rows to a list of lists, but 
> since each flow file passing through this processor should have only 1 row, 
> we extract the 1st list. Then that list is passed into the Driverless AI's 
> Python scorer.score() function to predict one or more labels. The 
> prediction is returned as a list. The number of predicted labels is specified 
> when the user builds the Python Scoring Pipeline in Driverless AI. With that 
> knowledge, there is a property for the user to pass in one or more predicted 
> label names that will be used as the predicted header. I create a 
> comma-separated string using the predicted header and predicted value. The 
> predicted header(s) is on one line followed by a newline and the predicted 
> value(s) is on the next line followed by a newline. The string is written to 
> the flow file content. Flow File attributes are added to the flow file for 
> the number of lists scored and the predicted label name and its associated 
> score. Finally, the flow file is transferred on a success relationship.
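The real-time flow described above can be sketched roughly as follows. This is a minimal illustration, not the contributed processor: `StubScorer`, `score_real_time`, and the attribute name `num_lists_scored` are hypothetical, the input is assumed to be CSV with a header line, and the standard-library `csv` module stands in for Datatable and pandas so the sketch runs without extra dependencies.

```python
import csv
import io


class StubScorer:
    """Hypothetical stand-in for the Driverless AI scorer; only score() is mimicked."""

    def score(self, row):
        # Placeholder "model": a single prediction, the mean of the inputs.
        return [sum(row) / len(row)]


def score_real_time(content, scorer, predicted_header):
    """Score the first data row of CSV text; return (new_content, attributes)."""
    reader = csv.reader(io.StringIO(content))
    next(reader)  # skip the input header line (assumed to be present)
    list_of_lists = [[float(v) for v in row] for row in reader if row]
    first_row = list_of_lists[0]  # each flow file should carry only one row
    predictions = scorer.score(first_row)

    labels = [name.strip() for name in predicted_header.split(",")]
    # Header(s) on one line, value(s) on the next, each followed by a newline.
    new_content = ",".join(labels) + "\n"
    new_content += ",".join(str(p) for p in predictions) + "\n"

    # Attribute names are illustrative; only the first list was scored.
    attributes = {"num_lists_scored": "1"}
    for label, prediction in zip(labels, predictions):
        attributes[label] = str(prediction)
    return new_content, attributes
```

In an actual processor the returned string would be written to the flow file content and the dictionary entries added as flow file attributes before transferring to success.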
>  
> *H2oPspScoreBatches* - Executes H2O Driverless AI's Python Scoring Pipeline 
> to do batch scoring on a frame of data within each incoming flow file. Uses 
> H2O's open-source Datatable library to load test data into a frame. Each 
> frame from the flow file passing through this processor should have multiple 
> rows. That frame is passed into the Driverless AI's Python 
> scorer.score_batch() function to predict one or more labels. The 
> prediction is returned as a pandas dataframe, then that dataframe is 
> converted to a string, so it can be written to the flow file content. Flow 
> File attributes are added to the flow file for the number of rows scored. 
> There are also flow file attributes added for the predicted label name and 
> its associated score for the first row in the frame. Finally, the flow file 
> is transferred on a success relationship.
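A comparable sketch for the batch case is shown here, under the same caveats: `StubBatchScorer`, `score_batches`, and the attribute name `num_rows_scored` are hypothetical, the input is assumed to be CSV with a header line, and plain lists stand in for the Datatable frame and the pandas dataframe that the real score_batch() call works with.

```python
import csv
import io


class StubBatchScorer:
    """Hypothetical stand-in; the real score_batch() takes and returns frames."""

    def score_batch(self, rows):
        # Placeholder "model": one prediction per row, the row's maximum.
        return [[max(row)] for row in rows]


def score_batches(content, scorer, predicted_header):
    """Score every data row of CSV text; return (new_content, attributes)."""
    reader = csv.reader(io.StringIO(content))
    next(reader)  # skip the input header line (assumed to be present)
    rows = [[float(v) for v in row] for row in reader if row]
    predictions = scorer.score_batch(rows)

    labels = [name.strip() for name in predicted_header.split(",")]
    # Convert the predictions to a string so they can become flow file content.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(labels)
    writer.writerows(predictions)

    # Number of rows scored, plus label/score pairs for the first row only.
    attributes = {"num_rows_scored": str(len(predictions))}
    for label, value in zip(labels, predictions[0]):
        attributes[label] = str(value)
    return out.getvalue(), attributes
```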
>  
> *ConvertDsToCsv* - Converts data source of incoming flow file to csv. Uses 
> H2O's open-source Datatable library to load data into a frame, then converts 
> it to a pandas dataframe. Pandas is used to convert the pandas dataframe to 
> CSV and store it, without the dataframe index, in an in-memory StringIO text 
> stream. The CSV string is then read back from the StringIO object with its 
> read() function, so it can be written to the flow file content. The flow file 
> is transferred on a success relationship.
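The StringIO step can be illustrated with a small sketch. It assumes a tab-separated input and uses the standard-library `csv` module in place of Datatable and pandas; `convert_to_csv` is a hypothetical helper name. One detail worth noting is that a StringIO object has to be rewound before read() returns what was just written.

```python
import csv
import io


def convert_to_csv(content, delimiter="\t"):
    """Convert delimiter-separated text to CSV text via an in-memory stream."""
    rows = csv.reader(io.StringIO(content), delimiter=delimiter)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    # Rewind first: read() on a freshly written StringIO starts at the end
    # of the stream and would return an empty string.
    buffer.seek(0)
    return buffer.read()
```

Calling `buffer.getvalue()` instead of `seek(0)` followed by `read()` would give the same string without moving the stream position.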
>  
> *Hydraulic System Condition Monitoring* Data used in MiNiFi Flow:
> The sensor test data I used in this integration comes from [Kaggle: Condition 
> Monitoring of Hydraulic 
> Systems|https://www.kaggle.com/jjacostupa/condition-monitoring-of-hydraulic-systems#description.txt].
> I was able to predict hydraulic system cooling efficiency through the MiNiFi 
> and H2O integration described above. The use case here is hydraulic system 
> predictive maintenance.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
