[ https://issues.apache.org/jira/browse/MINIFICPP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099135#comment-17099135 ]
Marc Parisi edited comment on MINIFICPP-1199 at 5/4/20, 5:03 PM:
-----------------------------------------------------------------

[~joewitt] gotcha I think we're on the same page and I was likely confusing myself via the Apache E-mail chain and being too terse probably confusing you. Thanks!


was (Author: phrocker):
[~joewitt] gotcha I think we're on the same page and I was likely confusing myself via the Apache E-mail chain. Thanks!


> Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML
> Inference on Edge
> -----------------------------------------------------------------------------------------------
>
>                 Key: MINIFICPP-1199
>                 URL: https://issues.apache.org/jira/browse/MINIFICPP-1199
>             Project: Apache NiFi MiNiFi C++
>          Issue Type: New Feature
>    Affects Versions: master
>         Environment: Ubuntu 18.04 in AWS EC2
> MiNiFi C++ 0.7.0
>            Reporter: James Medel
>            Priority: Blocker
>             Fix For: 0.8.0
>
>
> *MiNiFi C++ and H2O Driverless AI Integration* via Custom Python Processors:
> Integrates MiNiFi C++ with H2O's Driverless AI by using Driverless AI's
> Python Scoring Pipeline and MiNiFi's custom Python processors. The Python
> processors execute the Python Scoring Pipeline scorer to do batch scoring
> and real-time scoring for one or more predicted labels on test data in the
> incoming flow file content. I would like to contribute my processors to
> MiNiFi C++ as a new feature.
>
> *3 custom Python processors* created for MiNiFi:
> *H2oPspScoreRealTime* - Executes H2O Driverless AI's Python Scoring Pipeline
> to do interactive (real-time) scoring on an individual row or list of test
> data within each incoming flow file. Uses H2O's open-source Datatable
> library to load test data into a frame, then converts it to a pandas
> dataframe. Pandas is used to convert the dataframe rows to a list of lists,
> but since each flow file passing through this processor should have only 1
> row, we extract the 1st list.
> Then that list is passed into Driverless AI's Python scorer.score()
> function to predict one or more labels. The prediction is returned as a
> list. The number of predicted labels is fixed when the user builds the
> Python Scoring Pipeline in Driverless AI. With that knowledge, there is a
> property for the user to pass in one or more predicted label names that
> will be used as the predicted header. I create a comma-separated string
> using the predicted header and predicted value. The predicted header(s) is
> on one line followed by a newline, and the predicted value(s) is on the next
> line followed by a newline. The string is written to the flow file content.
> Flow file attributes are added for the number of lists scored and for each
> predicted label name and its associated score. Finally, the flow file is
> transferred on a success relationship.
>
> *H2oPspScoreBatches* - Executes H2O Driverless AI's Python Scoring Pipeline
> to do batch scoring on a frame of data within each incoming flow file. Uses
> H2O's open-source Datatable library to load test data into a frame. Each
> frame from the flow file passing through this processor should have multiple
> rows. That frame is passed into Driverless AI's Python
> scorer.score_batch() function to predict one or more labels. The prediction
> is returned as a pandas dataframe, then that dataframe is converted to a
> string so it can be written to the flow file content. Flow file attributes
> are added for the number of rows scored. There are also flow file
> attributes for the predicted label name and its associated score for the
> first row in the frame. Finally, the flow file is transferred on a success
> relationship.
>
> *ConvertDsToCsv* - Converts the data source of the incoming flow file to
> csv. Uses H2O's open-source Datatable library to load data into a frame,
> then converts it to a pandas dataframe.
> Pandas is used to convert the dataframe to csv, without the dataframe
> index, and store it in an in-memory StringIO text stream. The csv string is
> obtained by calling read() on the StringIO object, so it can be written to
> the flow file content. The flow file is transferred on a success
> relationship.
>
> *Hydraulic System Condition Monitoring* Data used in MiNiFi Flow:
> The sensor test data I used in this integration comes from [Kaggle: Condition
> Monitoring of Hydraulic
> Systems|https://www.kaggle.com/jjacostupa/condition-monitoring-of-hydraulic-systems#description.txt].
> I was able to predict hydraulic system cooling efficiency through the MiNiFi
> and H2O integration described above. The use case here is hydraulic system
> predictive maintenance.
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
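
For reference, the flow file content that the issue describes for H2oPspScoreRealTime (predicted label names on one line, predicted values on the next, each line comma separated and newline terminated) could be built roughly as follows. This is only a sketch under assumptions, not the contributed processor code: the function names and the label name "cooling_condition" are hypothetical, and the MiNiFi and Driverless AI calls are left out entirely.

```python
def format_prediction(label_names, scores):
    """Build the two-line, comma-separated content described in the issue.

    label_names: predicted label names supplied by the user (hypothetical input).
    scores: predicted values for one row, in the same order as label_names.
    """
    if len(label_names) != len(scores):
        raise ValueError("expected one label name per predicted value")
    header = ",".join(label_names)
    values = ",".join(str(score) for score in scores)
    # Header line, then value line, each followed by a newline.
    return header + "\n" + values + "\n"


def prediction_attributes(label_names, scores):
    """Pair each predicted label name with its score, as strings.

    In the processor these pairs would become flow file attributes;
    here they are returned as a plain dict.
    """
    return {name: str(score) for name, score in zip(label_names, scores)}


# Hypothetical label name and score, for illustration only.
content = format_prediction(["cooling_condition"], [2.5])
attributes = prediction_attributes(["cooling_condition"], [2.5])
```

Keeping the string formatting separate from the scoring call makes the output layout easy to check without a Driverless AI license or a running MiNiFi agent.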