On Tue, Oct 6, 2015 at 7:17 AM, Bryan Bende <bbe...@gmail.com> wrote:
> Hi Russell,
>
> I understand what you are getting at... I don't think the current
> processors we have are designed to handle this bulk load scenario.
>

That's what I'm gathering so far myself, having been poring over the
history of related processors in Jira. I'm surprised it hadn't come up
as an issue, but then there's always a first time for everything.

> The series of processors you outlined in your original email would
> likely work, but I *think* it would suffer the same problem of
> producing a lot of FlowFiles because ConvertJSONToSQL will produce a
> FlowFile with one SQL statement for each entry of the JSON array. You
> would have to try this out to see if you are getting the performance
> you want with your data set.

Looks like that's what I'll need to do in the short term.

> PutSQL does have a configurable batch size that lets multiple inserts
> happen through a single database transaction.
>

That's good to know.
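For my own notes, if I'm reading the PutSQL description right (it's
quoted in full in my original message below), each per-row FlowFile
headed into PutSQL carries a parameterized statement as its content,
with the values supplied as attributes. A hypothetical, untested
sketch -- the table, columns, and values are placeholders, and the
type codes are java.sql.Types constants:

  FlowFile content:
    INSERT INTO my_destination_table (id, name) VALUES (?, ?)

  FlowFile attributes:
    sql.args.1.type  = 4       (java.sql.Types.INTEGER)
    sql.args.1.value = 42
    sql.args.2.type  = 12      (java.sql.Types.VARCHAR)
    sql.args.2.value = widget

With the batch size you mention set to something like 1000, those
per-row statements should at least be grouped into far fewer
transactions, even if the FlowFile count stays high.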
> I'm not that familiar with the "LOAD DATA INFILE" command, but I'm
> wondering if another option is to somehow use ExecuteStreamCommand to
> execute that statement based on the filename from an incoming FlowFile.
> Just thinking out loud about what else could be done.
>

I'm not sure how ExecuteStreamCommand would help in this case, since
the statement above is DML, and specific to mysql at that. If I did go
that route, though, I imagine the invocation would look roughly like
the sketch below.
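An untested sketch, for the record -- host, credentials, and database
are placeholders, it assumes the mysql client is installed on the NiFi
host, and the file path would be filled in from the flowfile's
absolute.path and filename attributes. Note the LOCAL keyword: the
client, not the server, would be reading the file NiFi fetched to
local disk:

  mysql -h db-host -u etl_user -p mydb -e "
    LOAD DATA LOCAL INFILE '/path/to/ingested.csv'
    INTO TABLE my_destination_table
    FIELDS TERMINATED BY ','
    ENCLOSED BY '\"'
    LINES TERMINATED BY '\n'
    IGNORE 3 ROWS;"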
So, I'll have a go at working with the approach you've suggested, and
if in the meantime I discover - or write - a workaround, I'll post it
in a follow-up.

My overarching concern, I should mention here, has been to provide our
business analysts with a template for doing Sqoop-like egress of data
from processing into endpoint storage, something they can reason about
given their experiences with Hadoop.

Thanks Bryan,

Russell

> -Bryan
>
>
> On Mon, Oct 5, 2015 at 5:24 PM, Russell Whitaker
> <russell.whita...@gmail.com> wrote:
>
>> Really, what I'd like to do is this type of mysql bread 'n butter
>> task:
>>
>> LOAD DATA INFILE <my_csv_file_ingested_as_a_flowfile>
>> INTO TABLE <my_destination_table>
>> FIELDS TERMINATED BY ','
>> ENCLOSED BY '"'
>> LINES TERMINATED BY '\n'
>> IGNORE 3 ROWS;
>>
>> Russell
>>
>>
>> On Mon, Oct 5, 2015 at 2:09 PM, Russell Whitaker
>> <russell.whita...@gmail.com> wrote:
>> > Bryan,
>> >
>> > Some of the CSV files are as small as 6 columns and a thousand or
>> > so lines of entries; some are many more columns and thousands of
>> > lines. I'm hoping to avoid the necessity of spawning a flowfile
>> > per line; I'm hoping there's a NiFi equivalent of the SQL DML
>> > statement LOAD DATA INFILE. (Relatedly, being able to toggle off
>> > foreign key & uniqueness checks and transaction isolation
>> > guarantees during bulk load would be very nice...)
>> >
>> > Russell
>> >
>> > On Mon, Oct 5, 2015 at 1:53 PM, Bryan Bende <bbe...@gmail.com> wrote:
>> >> Russell,
>> >>
>> >> How big are these CSVs in terms of rows and columns?
>> >>
>> >> If they aren't too big, another option could be to use SplitText +
>> >> ReplaceText to split the csv into a FlowFile per line, and then
>> >> convert each line into SQL in ReplaceText. The downside is that
>> >> this would create a lot of FlowFiles for very large CSVs.
>> >>
>> >> -Bryan
>> >>
>> >> On Mon, Oct 5, 2015 at 4:14 PM, Russell Whitaker
>> >> <russell.whita...@gmail.com> wrote:
>> >>
>> >>> Use case I'm attempting:
>> >>>
>> >>> 1.) ingest a CSV file with header lines;
>> >>> 2.) remove header lines (i.e. remove N lines at head);
>> >>> 3.) SQL INSERT each remaining line as a row in an existing mysql
>> >>> table.
>> >>>
>> >>> My thinking so far:
>> >>>
>> >>> #1 is given (CSV fetched already);
>> >>> #2 is simple, and should be handled in the context of
>> >>> ExecuteStreamCommand;
>> >>>
>> >>> #3 is where I'm scratching my head: I keep re-reading the
>> >>> Description field for the PutSQL processor in
>> >>> http://nifi.apache.org/docs.html but can't seem to parse it into
>> >>> what I need to do to prepare a flowfile comprising lines of
>> >>> comma-separated text into a series of INSERT statements:
>> >>>
>> >>> "Executes a SQL UPDATE or INSERT command. The content of an
>> >>> incoming FlowFile is expected to be the SQL command to execute.
>> >>> The SQL command may use the ? to escape parameters. In this case,
>> >>> the parameters to use must exist as FlowFile attributes with the
>> >>> naming convention sql.args.N.type and sql.args.N.value, where N
>> >>> is a positive integer. The sql.args.N.type is expected to be a
>> >>> number indicating the JDBC Type."
>> >>>
>> >>> Of related interest: there seems to be only one CSV-relevant
>> >>> processor type in v0.3.0, ConvertCSVToAvro; I fear the need to do
>> >>> something like this:
>> >>>
>> >>> ConvertCSVToAvro->ConvertAvroToJSON->ConvertJSONToSQL->PutSQL
>> >>>
>> >>> Guidance, suggestions? Thanks!
>> >>>
>> >>> Russell
>> >>>
>> >>> --
>> >>> Russell Whitaker
>> >>> http://twitter.com/OrthoNormalRuss
>> >>> http://www.linkedin.com/pub/russell-whitaker/0/b86/329
>> >
>> >
>> > --
>> > Russell Whitaker
>> > http://twitter.com/OrthoNormalRuss
>> > http://www.linkedin.com/pub/russell-whitaker/0/b86/329
>>
>>
>> --
>> Russell Whitaker
>> http://twitter.com/OrthoNormalRuss
>> http://www.linkedin.com/pub/russell-whitaker/0/b86/329

--
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329