Re: A flow to PutSQL the lines of a CSV file?

2015-10-06 Thread Russell Whitaker
On Tue, Oct 6, 2015 at 7:17 AM, Bryan Bende  wrote:
> Hi Russell,
>
> I understand what you are getting at... I don't think the current
> processors we have are designed to handle this bulk load scenario.
>

That's what I'm gathering so far myself, having pored over the history of
related processors in Jira. I'm surprised it hadn't come up as an issue
before, but then there's a first time for everything.

> The series of processors you outlined in your original email would likely
> work, but I *think* it would suffer the same problem of producing a lot of
> FlowFiles because ConvertJSONToSQL will produce a FlowFile with one SQL
> statement for each entry of the JSON array. You would have to try this out
> to see if you are getting the performance you want with your data set.

Looks like that's what I'll need to do in the short term.

> PutSQL does have a configurable batch size that lets multiple inserts
> happen through a single database transaction.
>

That's good to know.
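To make the batching concrete: at the driver level, PutSQL's batch is roughly one parameterized statement executed for many rows inside a single transaction. A minimal stand-in sketch, using SQLite in place of MySQL (table and column names are hypothetical):

```python
import sqlite3

# Stand-in for the target MySQL table; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (host TEXT, value INTEGER)")

# One parameterized statement, many rows, one transaction --
# the shape of what PutSQL's batch size buys you.
rows = [("web01", 10), ("web02", 20), ("web03", 30)]
with conn:
    conn.executemany("INSERT INTO metrics (host, value) VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(count)  # 3 rows landed in a single transaction
```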

> I'm not that familiar with the "LOAD DATA INFILE" command, but I'm
> wondering if another option is to somehow use ExecuteStreamCommand to
> execute that statement based on the filename from an incoming FlowFile.
> Just thinking out loud about what else could be done.
>

I'm not sure ExecuteStreamCommand would help in this case, since the
statement above is DML, and MySQL-specific DML at that.
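For completeness, if one did go the ExecuteStreamCommand route, the processor would essentially shell out to the mysql client with that DML. A hedged sketch of assembling such a command line (paths, database, and table names are hypothetical, and this only builds the argv rather than running anything):

```python
# Build the argv an ExecuteStreamCommand invocation might use for the
# MySQL-specific bulk load. Nothing here is executed; all names are
# hypothetical stand-ins.
def build_bulk_load_command(csv_path, table, database):
    sql = (
        "LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' IGNORE 3 ROWS"
    ).format(path=csv_path, table=table)
    return ["mysql", database, "-e", sql]

cmd = build_bulk_load_command("/tmp/incoming.csv", "events", "analytics")
print(cmd[0], cmd[-1])
```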

So, I'll have a go at working with the approach you've suggested, and if in the
meantime I discover - or write - a workaround, I'll post it in a follow-up. My
overarching concern, I should mention here, has been to provide our business
analysts with a template for doing Sqoop-like egress of data from processing
into endpoint storage, something they can reason about given their experiences
with Hadoop.

Thanks Bryan,
Russell

> -Bryan
>
>
>
> On Mon, Oct 5, 2015 at 5:24 PM, Russell Whitaker wrote:
>
>> Really, what I'd like to do is this type of mysql bread 'n' butter task:
>>
>> LOAD DATA INFILE '<filename>'
>> INTO TABLE <table_name>
>> FIELDS TERMINATED BY ','
>> ENCLOSED BY '"'
>> LINES TERMINATED BY '\n'
>> IGNORE 3 ROWS;
>>
>> Russell
>>
>>
>> On Mon, Oct 5, 2015 at 2:09 PM, Russell Whitaker
>>  wrote:
>> > Bryan,
>> >
>> > Some of the CSV files are as small as 6 columns and a thousand or so
>> > lines of entries; some have many more columns and thousands of lines.
>> > I'm hoping to avoid spawning a flowfile per line; I'm hoping there's a
>> > NiFi equivalent of the SQL DML statement LOAD DATA INFILE. (Relatedly,
>> > being able to toggle off foreign key & uniqueness checks and
>> > transaction isolation guarantees during bulk load would be very
>> > nice...)
>> >
>> > Russell
>> >
>> > On Mon, Oct 5, 2015 at 1:53 PM, Bryan Bende  wrote:
>> >> Russell,
>> >>
>> >> How big are these CSVs in terms of rows and columns?
>> >>
>> >> If they aren't too big, another option could be to use SplitText +
>> >> ReplaceText to split the csv into a FlowFile per line, and then convert
>> >> each line into SQL in ReplaceText. The downside is that this would
>> >> create a lot of FlowFiles for very large CSVs.
>> >>
>> >> -Bryan
>> >>
>> >> On Mon, Oct 5, 2015 at 4:14 PM, Russell Whitaker
>> >> <russell.whita...@gmail.com> wrote:
>> >>
>> >>> Use case I'm attempting:
>> >>>
>> >>> 1.) ingest a CSV file with header lines;
>> >>> 2.) remove header lines (i.e. remove N lines at head);
>> >>> 3.) SQL INSERT each remaining line as a row in an existing mysql table.
>> >>>
>> >>> My thinking so far:
>> >>>
>> >>> #1 is given (CSV fetched already);
>> >>> #2 simple, should be handled in the context of ExecuteStreamCommand;
>> >>>
>> >>> #3 is where I'm scratching my head: I keep re-reading the Description
>> >>> field for the PutSQL processor in http://nifi.apache.org/docs.html but
>> >>> can't seem to parse out what I need to do to turn a flowfile comprising
>> >>> lines of comma-separated text into a series of INSERT statements:
>> >>>
>> >>> "Executes a SQL UPDATE or INSERT command. The content of an incoming
>> >>> FlowFile is expected to be the SQL command to execute. The SQL command
>> >>> may use the ? to escape parameters. In this case, the parameters to
>> >>> use must exist as FlowFile attributes with the naming convention
>> >>> sql.args.N.type and sql.args.N.value, where N is a positive integer.
>> >>> The sql.args.N.type is expected to be a number indicating the JDBC
>> >>> Type."
>> >>>
>> >>> Of related interest: there seems to be only one CSV-relevant processor
>> >>> type in v0.3.0, ConvertCSVToAvro; I fear needing to do something like
>> >>> this:
>> >>>
>> >>> ConvertCSVToAvro->ConvertAvroToJSON->ConvertJSONToSQL->PutSQL
>> >>>
>> >>> Guidance, suggestions? Thanks!
>> >>>
>> >>> Russell
>> >>>
>> >>> --
>> >>> Russell Whitaker
>> >>> http://twitter.com/OrthoNormalRuss
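The sql.args convention PutSQL describes above can be made concrete with a small sketch: one CSV line becomes one parameterized INSERT as FlowFile content, plus an attribute map. Table and column names here are illustrative; 12 and 4 are java.sql.Types.VARCHAR and java.sql.Types.INTEGER:

```python
# Turn one CSV line into the FlowFile content + attributes PutSQL expects:
# content is a parameterized INSERT, attributes follow sql.args.N.{type,value}.
# Table and column names are hypothetical; 12 and 4 are java.sql.Types
# VARCHAR and INTEGER.
def csv_line_to_putsql(line, table, columns, jdbc_types):
    values = line.rstrip("\n").split(",")
    placeholders = ", ".join("?" for _ in columns)
    content = "INSERT INTO {t} ({c}) VALUES ({p})".format(
        t=table, c=", ".join(columns), p=placeholders)
    attrs = {}
    for i, (value, jdbc_type) in enumerate(zip(values, jdbc_types), start=1):
        attrs["sql.args.%d.type" % i] = str(jdbc_type)
        attrs["sql.args.%d.value" % i] = value
    return content, attrs

content, attrs = csv_line_to_putsql(
    "web01,42\n", "metrics", ["host", "value"], [12, 4])
print(content)
print(attrs["sql.args.2.value"])
```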

Re: A flow to PutSQL the lines of a CSV file?

2015-10-05 Thread Bryan Bende
Russell,

How big are these CSVs in terms of rows and columns?

If they aren't too big, another option could be to use SplitText +
ReplaceText to split the csv into a FlowFile per line, and then convert
each line into SQL in ReplaceText. The downside is that this would create a
lot of FlowFiles for very large CSVs.
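The SplitText + ReplaceText pattern amounts to this per-line transformation; a minimal sketch with a hypothetical table, where values are naively quoted (a real flow would also need to escape quotes inside the values):

```python
# What SplitText + ReplaceText do, in miniature: drop the header lines,
# then template each remaining CSV line into an INSERT. Table and column
# names are hypothetical; quoting is naive and must be hardened for real use.
def csv_to_inserts(csv_text, table, columns, header_lines=1):
    lines = csv_text.strip().splitlines()[header_lines:]
    inserts = []
    for line in lines:
        quoted = ", ".join("'%s'" % v for v in line.split(","))
        inserts.append("INSERT INTO {t} ({c}) VALUES ({v});".format(
            t=table, c=", ".join(columns), v=quoted))
    return inserts

stmts = csv_to_inserts("host,value\nweb01,10\nweb02,20\n",
                       "metrics", ["host", "value"])
print(len(stmts))  # one statement per data line
print(stmts[0])
```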

-Bryan

On Mon, Oct 5, 2015 at 4:14 PM, Russell Whitaker  wrote:

> Use case I'm attempting:
>
> 1.) ingest a CSV file with header lines;
> 2.) remove header lines (i.e. remove N lines at head);
> 3.) SQL INSERT each remaining line as a row in an existing mysql table.
>
> My thinking so far:
>
> #1 is given (CSV fetched already);
> #2 simple, should be handled in the context of ExecuteStreamCommand;
>
> #3 is where I'm scratching my head: I keep re-reading the Description
> field for the PutSQL processor in http://nifi.apache.org/docs.html but
> can't seem to parse out what I need to do to turn a flowfile comprising
> lines of comma-separated text into a series of INSERT statements:
>
> "Executes a SQL UPDATE or INSERT command. The content of an incoming
> FlowFile is expected to be the SQL command to execute. The SQL command
> may use the ? to escape parameters. In this case, the parameters to
> use must exist as FlowFile attributes with the naming convention
> sql.args.N.type and sql.args.N.value, where N is a positive integer.
> The sql.args.N.type is expected to be a number indicating the JDBC
> Type."
>
> Of related interest: there seems to be only one CSV-relevant processor
> type in v0.3.0, ConvertCSVToAvro; I fear needing to do something like
> this:
>
> ConvertCSVToAvro->ConvertAvroToJSON->ConvertJSONToSQL->PutSQL
>
> Guidance, suggestions? Thanks!
>
> Russell
>
> --
> Russell Whitaker
> http://twitter.com/OrthoNormalRuss
> http://www.linkedin.com/pub/russell-whitaker/0/b86/329
>


Re: A flow to PutSQL the lines of a CSV file?

2015-10-05 Thread Russell Whitaker
Bryan,

Some of the CSV files are as small as 6 columns and a thousand or so
lines of entries; some have many more columns and thousands of lines.
I'm hoping to avoid spawning a flowfile per line; I'm hoping there's a
NiFi equivalent of the SQL DML statement LOAD DATA INFILE. (Relatedly,
being able to toggle off foreign key & uniqueness checks and
transaction isolation guarantees during bulk load would be very
nice...)
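The session-level toggles wished for here do exist in MySQL (SET foreign_key_checks, SET unique_checks). A hedged sketch of the statement sequence such a bulk-load session might send, assembled as plain strings, with hypothetical file and table names:

```python
# Assemble the statement sequence a MySQL bulk-load session might run:
# disable checks, LOAD DATA, re-enable. File and table names are
# hypothetical; nothing here talks to a database.
def bulk_load_script(csv_path, table, skip_rows=3):
    load = (
        "LOAD DATA INFILE '{p}' INTO TABLE {t} "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' IGNORE {n} ROWS;"
    ).format(p=csv_path, t=table, n=skip_rows)
    return [
        "SET foreign_key_checks = 0;",
        "SET unique_checks = 0;",
        load,
        "SET unique_checks = 1;",
        "SET foreign_key_checks = 1;",
    ]

script = bulk_load_script("/tmp/incoming.csv", "events")
print(script[2])
```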

Russell

On Mon, Oct 5, 2015 at 1:53 PM, Bryan Bende  wrote:
> Russell,
>
> How big are these CSVs in terms of rows and columns?
>
> If they aren't too big, another option could be to use SplitText +
> ReplaceText to split the csv into a FlowFile per line, and then convert
> each line into SQL in ReplaceText. The downside is that this would create a
> lot of FlowFiles for very large CSVs.
>
> -Bryan
>
> On Mon, Oct 5, 2015 at 4:14 PM, Russell Whitaker wrote:
>
>> Use case I'm attempting:
>>
>> 1.) ingest a CSV file with header lines;
>> 2.) remove header lines (i.e. remove N lines at head);
>> 3.) SQL INSERT each remaining line as a row in an existing mysql table.
>>
>> My thinking so far:
>>
>> #1 is given (CSV fetched already);
>> #2 simple, should be handled in the context of ExecuteStreamCommand;
>>
>> #3 is where I'm scratching my head: I keep re-reading the Description
>> field for the PutSQL processor in http://nifi.apache.org/docs.html but
>> can't seem to parse out what I need to do to turn a flowfile comprising
>> lines of comma-separated text into a series of INSERT statements:
>>
>> "Executes a SQL UPDATE or INSERT command. The content of an incoming
>> FlowFile is expected to be the SQL command to execute. The SQL command
>> may use the ? to escape parameters. In this case, the parameters to
>> use must exist as FlowFile attributes with the naming convention
>> sql.args.N.type and sql.args.N.value, where N is a positive integer.
>> The sql.args.N.type is expected to be a number indicating the JDBC
>> Type."
>>
>> Of related interest: there seems to be only one CSV-relevant processor
>> type in v0.3.0, ConvertCSVToAvro; I fear needing to do something like
>> this:
>>
>> ConvertCSVToAvro->ConvertAvroToJSON->ConvertJSONToSQL->PutSQL
>>
>> Guidance, suggestions? Thanks!
>>
>> Russell
>>
>> --
>> Russell Whitaker
>> http://twitter.com/OrthoNormalRuss
>> http://www.linkedin.com/pub/russell-whitaker/0/b86/329
>>



-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329