Re: Architecting the MS SQL CDC Processor

2017-10-16 Thread Matt Burgess
Peter,

This is great to hear, I'm sure the community is looking forward to
such a solution!  I worked on the first offering of the
CaptureChangeMySQL processor, so here are some notes, comments, and
(hopefully!) answers to your questions:

* If you support a RecordSetWriter controller service as your output,
then you won't need JdbcCommon per se; instead you would create
Records and pass those to the user-selected RecordSetWriter. In that
sense you can support Avro, CSV, JSON, or anything else for which
there is a RecordSetWriter implementation.

* Depending on how often you'll be updating state, you may want to
implement something similar to the State Update Interval property in
CaptureChangeMySQL, which came about due to similar concerns about the
overhead of state updates vs the amount of processing beforehand.
This allows the user to tune the tradeoff based on their own
requirements, performance needs, and so on.

* I have no concerns with having a different output format from
CaptureChangeMySQL; in fact the only reason it doesn't have a
RecordSetWriter output interface is that those capabilities were being
developed in parallel, so rather than have to wait for the
record-aware API stuff, I chose to output JSON. I have written
NIFI-4491 [1] to improve/augment CDC processor(s) with RecordSetWriter
support. This would be very helpful by supporting various output
formats as well as generating the accompanying schema. If your
processor were the first to support this, it could be the exemplar for
past and future CDC processors :)
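To make the State Update Interval idea concrete, here is a minimal sketch of throttled state updates. It is not the CaptureChangeMySQL implementation: the class name, the Consumer standing in for a call such as StateManager.setState, and the table names are all assumptions made for illustration. Max values are remembered on every update but only pushed to the state store once the configured interval has elapsed, with a final flush when the run ends.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper: buffers per-table max values and saves them at most
// once per interval. The Consumer is a stand-in for the real state store.
final class ThrottledStateUpdater {
    private final long intervalMillis;
    private final Consumer<Map<String, String>> stateSink;
    private final Map<String, String> pending = new HashMap<>();
    private long lastFlushMillis;
    private boolean dirty;

    ThrottledStateUpdater(long intervalMillis, long nowMillis,
                          Consumer<Map<String, String>> stateSink) {
        this.intervalMillis = intervalMillis;
        this.lastFlushMillis = nowMillis;
        this.stateSink = stateSink;
    }

    // Records the latest max value; returns true if this call saved state.
    boolean update(String table, String maxValue, long nowMillis) {
        pending.put(table, maxValue);
        dirty = true;
        if (nowMillis - lastFlushMillis >= intervalMillis) {
            flush(nowMillis);
            return true;
        }
        return false;
    }

    // Saves unsaved values immediately, e.g. at the end of a processor run.
    void flush(long nowMillis) {
        if (!dirty) {
            return;
        }
        stateSink.accept(new HashMap<>(pending)); // copy so later updates do not alter what was saved
        lastFlushMillis = nowMillis;
        dirty = false;
    }

    public static void main(String[] args) {
        List<Map<String, String>> saved = new ArrayList<>();
        ThrottledStateUpdater updater = new ThrottledStateUpdater(1000L, 0L, saved::add);
        updater.update("dbo.orders", "10", 100L);     // buffered only
        updater.update("dbo.customers", "7", 1200L);  // interval elapsed, saved
        updater.update("dbo.orders", "25", 1300L);    // buffered only
        updater.flush(1300L);                         // final save
        System.out.println("state saves: " + saved.size());
    }
}
```

The tradeoff is the one described above: a longer interval means fewer state writes, but after a failure the processor would resume from the last saved value and could emit some changes again.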

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4491

On Mon, Oct 16, 2017 at 10:36 PM, Peter Wicks (pwicks)
 wrote:
> I've been working on a new processor that does Change Data Capture with 
> Microsoft SQL Server. I followed Microsoft's documentation on how CDC works, 
> and I've got some code that gets me the changes and is testing well. Right 
> now, I don't actually have a processor, but a number of scripts that generate 
> SQL and I put it into ExecuteSQL and QueryDatabaseTable processors; with QDB 
> using my as-yet incomplete 
> NIFI-1706.
>
> One of the reasons I don't have a processor yet is because I don't want to 
> use the same output format as the MySQL CDC Processor, but I didn't want to 
> put in the time if it was not going to get merged. The MySQL CDC processor 
> uses JSON messages as the output format, but in MS SQL the CDC messages are 
> rows in a table; and it's much more convenient to output them as records. 
> Currently, I'm using Avro.
>
> Questions:
>
>   *   My output format doesn't have to be Avro, but given the source is rows 
> in a table being returned by a ResultSet, using the JdbcCommon class makes a 
> lot of sense to me. Can I move JdbcCommon to somewhere useful like 
> nifi-avro-record-utils?
>   *   I'll be looping through a list of tables and plan on committing the 
> files immediately to the success relationship as that table's CDC records are 
> pulled. I want to make sure that the max value tracking gets updated 
> immediately too. Does calling setState on the State Manager cause an 
> immediate state save? Is this safe to call repeatedly, assuming single 
> threaded, during the execution of the processor?
>   *   Concerns with using a different output format than the MySQL CDC 
> Processor?
>
> Thanks,
>   Peter


Architecting the MS SQL CDC Processor

2017-10-16 Thread Peter Wicks (pwicks)
I've been working on a new processor that does Change Data Capture with 
Microsoft SQL Server. I followed Microsoft's documentation on how CDC works, 
and I've got some code that gets me the changes and is testing well. Right now, 
I don't actually have a processor, but a number of scripts that generate SQL 
and I put it into ExecuteSQL and QueryDatabaseTable processors; with QDB using 
my as-yet incomplete NIFI-1706.

One of the reasons I don't have a processor yet is because I don't want to use 
the same output format as the MySQL CDC Processor, but I didn't want to put in 
the time if it was not going to get merged. The MySQL CDC processor uses JSON 
messages as the output format, but in MS SQL the CDC messages are rows in a 
table; and it's much more convenient to output them as records. Currently, I'm 
using Avro.

Questions:

  *   My output format doesn't have to be Avro, but given the source is rows in 
a table being returned by a ResultSet, using the JdbcCommon class makes a lot 
of sense to me. Can I move JdbcCommon to somewhere useful like 
nifi-avro-record-utils?
  *   I'll be looping through a list of tables and plan on committing the files 
immediately to the success relationship as that table's CDC records are pulled. 
I want to make sure that the max value tracking gets updated immediately too. 
Does calling setState on the State Manager cause an immediate state save? Is 
this safe to call repeatedly, assuming single threaded, during the execution of 
the processor?
  *   Concerns with using a different output format than the MySQL CDC 
Processor?

Thanks,
  Peter


RE: [EXT] Re: JAVA_HOME trouble in nifi.sh

2017-10-16 Thread Peter Wicks (pwicks)
Aldrin,

My branch isn't pure; in fact it has a few build-only changes... I'll
play around with it.

I'm working on moving my builds to a Unix box with Jenkins to reduce the CPU 
strain on my development box. But my builds were failing due to the NPM/Node 
Maven plugin. So I upgraded the package version to get the build to complete.

Since the build did not complete before this update I have no idea how it may 
have affected the build.

Thanks,
  Peter

-Original Message-
From: Aldrin Piri [mailto:aldrinp...@gmail.com] 
Sent: Friday, October 13, 2017 10:25 PM
To: dev 
Subject: [EXT] Re: JAVA_HOME trouble in nifi.sh

Can't say I've seen this before and certainly have JAVA_HOME set on most of the 
places where I've performed builds.  Would you mind please opening up a ticket 
as well as capturing the salient environmental bits (OS, Maven, JDK versions, 
etc)?

That locateJava blurb is something we used from another ASF project and use 
heavily throughout our executables, so definitely need to track it down.

Did you happen to see this substitution in any other files?

On Thu, Oct 12, 2017 at 10:51 PM, Peter Wicks (pwicks) 
wrote:

> Only when building on Linux, during build my "${JAVA_HOME}" string in 
> nifi.sh is getting overwritten by the current value of my environment 
> variable for JAVA_HOME on my build box... not sure if this is 
> something others have run into.
>
> I built on one box, where JAVA_HOME is set to 
> "/var/spe/tools/jdk1.8.0_144". I then copied the tar.gz directly from 
> nifi-assembly to another box. I only extracted it after getting to the 
> other server where JAVA_HOME is not set.
>
> Here is a snippet from nifi.sh, I've bolded the sections where the raw 
> file has ${JAVA_HOME}. Maybe this is a system config issue? Obviously
> this isn't happening for everyone else building on Linux...?
>
> locateJava() {
> # Setup the Java Virtual Machine
> if $cygwin ; then
> [ -n "${JAVA}" ] && JAVA=$(cygpath --unix "${JAVA}")
> [ -n "/var/spe/tools/jdk1.8.0_144" ] && JAVA_HOME=$(cygpath 
> --unix
> "/var/spe/tools/jdk1.8.0_144")
> fi
>
> if [ "x${JAVA}" = "x" ] && [ -r /etc/gentoo-release ] ; then
> JAVA_HOME=$(java-config --jre-home)
> fi
> if [ "x${JAVA}" = "x" ]; then
> if [ "x/var/spe/tools/jdk1.8.0_144" != "x" ]; then
> if [ ! -d "/var/spe/tools/jdk1.8.0_144" ]; then
> die "JAVA_HOME is not valid: /var/spe/tools/jdk1.8.0_144"
> fi
> JAVA="/var/spe/tools/jdk1.8.0_144/bin/java"
> else
> warn "JAVA_HOME not set; results may vary"
> JAVA=$(type java)
> JAVA=$(expr "${JAVA}" : '.* \(/.*\)$')
> if [ "x${JAVA}" = "x" ]; then
> die "java command not found"
> fi
> fi
> fi
> # if command is env, attempt to add more to the classpath
> if [ "$1" = "env" ]; then
> [ "x${TOOLS_JAR}" =  "x" ] && [ -n 
> "/var/spe/tools/jdk1.8.0_144" ] && TOOLS_JAR=$(find -H 
> "/var/spe/tools/jdk1.8.0_144" -name "tools.jar")
> [ "x${TOOLS_JAR}" =  "x" ] && [ -n 
> "/var/spe/tools/jdk1.8.0_144" ] && TOOLS_JAR=$(find -H 
> "/var/spe/tools/jdk1.8.0_144" -name "classes.jar")
> if [ "x${TOOLS_JAR}" =  "x" ]; then
>  warn "Could not locate tools.jar or classes.jar. Please 
> set manually to avail all command features."
> fi
> fi
>
> }
>


RE: [EXT] Re: Please refresh my memory on NAR dependencies

2017-10-16 Thread Peter Wicks (pwicks)
I gave this a shot and it worked well for me.
https://github.com/apache/nifi/pull/2194

-Original Message-
From: Koji Kawamura [mailto:ijokaruma...@gmail.com] 
Sent: Monday, October 16, 2017 12:03 PM
To: dev 
Subject: Re: [EXT] Re: Please refresh my memory on NAR dependencies

Peter, Matt,

If the goal is sharing org.apache.nifi.csv.CSVUtils among modules, an
alternative approach is moving CSVUtils to nifi-standard-record-util and adding
an ordinary JAR dependency from nifi-poi-processors. What do you think?

Thanks,
Koji

On Mon, Oct 16, 2017 at 12:17 PM, Peter Wicks (pwicks)  
wrote:
> Matt,
>
> I am trying to re-use most of CSVUtils, including most of the property 
> descriptors and CSVUtils.createCSVFormat.
>
> It seemed like a waste to duplicate the entire class. I can try making it the 
> parent, what are the implications if I do that?
>
> Thanks,
>   Peter
>
> -Original Message-
> From: Matt Burgess [mailto:mattyb...@apache.org]
> Sent: Monday, October 16, 2017 10:58 AM
> To: dev@nifi.apache.org
> Subject: [EXT] Re: Please refresh my memory on NAR dependencies
>
> Do you have a hard requirement on the implementations in 
> nifi-record-serialization-services? Otherwise, the existing examples have the 
> processor POM pointing at the following:
>
> <dependency>
>     <groupId>org.apache.nifi</groupId>
>     <artifactId>nifi-record-serialization-service-api</artifactId>
> </dependency>
>
> which is the API JAR I think. If you need the implementations behind 
> it, you will probably need to declare that as a parent (not a
> dependency) and perhaps still use the API JAR (though I'm guessing about the 
> latter).
>
> Regards,
> Matt
>
>
> On Sun, Oct 15, 2017 at 10:27 PM, Peter Wicks (pwicks)  
> wrote:
>> For NIFI-4465 I want the nifi-poi-bundle to include a Maven dependency on 
>> nifi-record-serialization-services. So I start by adding the dependency to 
>> the pom.xml.
>>
>> <dependency>
>>     <groupId>org.apache.nifi</groupId>
>>     <artifactId>nifi-record-serialization-services</artifactId>
>> </dependency>
>>
>> I've tried several variations on this, with version numbers, putting it at 
>> higher pom levels, including it in the nifi-nar-bundles pom and marking it 
>> as included, etc...
>>
>> Throughout all this compiling is no problem, and all my unit tests run 
>> correctly. But when I try to start NiFi I immediately get Class not found 
>> exceptions from the nifi-poi classes related to the 
>> nifi-record-serialization libraries.
>>
>> I feel like I've run into this in the past, and it was due to how NARs
>> work. Can't remember though.
>>
>> Help would be appreciated!
>>
>> Thanks,
>>   Peter


RE: [EXT] Re: Funnel Queue Slowness

2017-10-16 Thread Peter Wicks (pwicks)
Pierre,

I agree with you all around. It would be nice if it was a little smarter.

--Peter


-Original Message-
From: Pierre Villard [mailto:pierre.villard...@gmail.com] 
Sent: Monday, October 16, 2017 4:00 PM
To: dev 
Subject: Re: [EXT] Re: Funnel Queue Slowness

Peter,

This behaviour is by design and it's the case for processors as well.

Back pressure is only checked by the component each time it is scheduled to see 
whether the component can run or not. If yes, the component will run as 
configured and will process as many flow files as it is supposed to process. In 
case of funnels, a funnel will always perform actions on a batch of 100 flow 
files (
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core-api/src/main/java/org/apache/nifi/controller/StandardFunnel.java#L372
).

You would have the same with other components. Let's say you have a SplitText 
creating 10k flow files for each incoming flow file. Even though backpressure 
is configured with 1k flow files on the downstream connection, if back pressure
thresholds are not reached, the processor will be triggered and produce the 
expected number of flow files (which is over back pressure threshold).

I agree this hard-coded number of 100 for funnels could be improved (something 
like min(100, backpressure threshold - number of queued flow
files)) but I'm not sure that's really an issue.

Pierre

2017-10-16 5:05 GMT+02:00 Peter Wicks (pwicks) :

> Joe,
>
> It really is about just forgetting that penalization is a thing. 
> Penalized files are fairly well marked when you do a List Queue.
>
> I think Funnels need an overall re-examination. I noticed another
> quirk the other day when moving queues around that already contained
> FlowFiles; Funnels ignore back pressure settings if there is any
> space available in the down-stream queue.
>
> Prep the FlowFiles: https://photos.app.goo.gl/Fu3EBDtQZ5wurQNt2
> Configure the Queue to only allow Back Pressure of 10 files:
> https://photos.app.goo.gl/17OlJSu2NXkxQ8lZ2
> Funnel grabs 100 FlowFiles no matter what and shoves them through:
> https://photos.app.goo.gl/vEwoZYETH6iMImBJ3
>
> If you let the down-stream processor run until there is space for 1 
> FlowFile available then it loads in another 100 flow files:
> https://photos.app.goo.gl/R4P5mdXr3L5oJnSw2
>
> I created a ticket: NIFI-4486.
>
> Thanks,
>   Peter
>
> -Original Message-
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Tuesday, October 10, 2017 10:01 AM
> To: dev@nifi.apache.org
> Subject: Re: [EXT] Re: Funnel Queue Slowness
>
> Peter,
>
> I see your point that it feels not natural or at least surprising.
> There are two challenges I see with what you propose.  One is user 
> oriented and the other is technical.
>
> The user oriented one is that penalized objects are penalized as a 
> function of the thing that last operated on them.  The further away we 
> let the data get the harder it would be to reason over why they were 
> penalized in the first place.
>
> The technical one is that once something is penalized and placed into 
> the queue there is prioritization and polling logic that kicks in as a factor.
> I'm not sure how we'd tweak it for that to be ok in some cases and in 
> others not.  Perhaps we could just make funnels truly a pass-through 
> and when calculating the queue we're storing on figure out the first 
> non-funnel queue provided there is no cloning/branching we'd have to 
> account for.  But even then it brings us back to the previous point 
> which is the user challenge of knowing what thing penalized objects in 
> queue in the first place.
>
> Alternatively, we should review whether it is obvious enough (or at
> all) that items within a queue at a given moment in time are penalized.
> I've worked with NiFi for a very long time and I'll be honest and
> state I've forgotten that penalization was a thing more than a few times too.
>
> What do you think?
>
> Thanks
>
> On Mon, Oct 9, 2017 at 9:01 PM, Peter Wicks (pwicks) 
> 
> wrote:
> > Bryan,
> >
> > Yes, it was the penalty causing the issue. This feels like weird
> behavior for Funnels, and I’m not sure if it makes sense for
> penalties to work this way.
> >
> > Would it make more sense if penalties were generally kept as is, but 
> > not
> applied at Funnels, then the penalty would kick back in at the first
> non-funnel queue?
> >
> > Thanks,
> >   Peter
> >
> > From: Bryan Bende [mailto:bbe...@gmail.com]
> > Sent: Monday, October 09, 2017 7:33 PM
> > To: dev@nifi.apache.org
> > Subject: [EXT] Re: Funnel Queue Slowness
> >
> > Peter,
> >
> > The images didn’t come across for me, but since you mentioned that a
> failure queue is involved, is it possible all the flow files going to 
> failure are being penalized which would cause them to not be processed 
> immediately?
> >
> > -Bryan
> >
> >
> > On Oct 8, 2017, 

Re: [EXT] Re: Funnel Queue Slowness

2017-10-16 Thread Pierre Villard
Peter,

This behaviour is by design and it's the case for processors as well.

Back pressure is only checked by the component each time it is scheduled to
see whether the component can run or not. If yes, the component will run as
configured and will process as many flow files as it is supposed to
process. In case of funnels, a funnel will always perform actions on a
batch of 100 flow files (
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core-api/src/main/java/org/apache/nifi/controller/StandardFunnel.java#L372
).

You would have the same with other components. Let's say you have a
SplitText creating 10k flow files for each incoming flow file. Even though
backpressure is configured with 1k flow files on the downstream connection,
if back pressure thresholds are not reached, the processor will be
triggered and produce the expected number of flow files (which is over back
pressure threshold).

I agree this hard-coded number of 100 for funnels could be improved
(something like min(100, backpressure threshold - number of queued flow
files)) but I'm not sure that's really an issue.
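The min(100, backpressure threshold - number of queued flow files) idea can be sketched as below. This is only an illustration of the arithmetic, not code from StandardFunnel: the class and method names are hypothetical, and clamping the result to zero when the queue is already at or over the threshold is an added assumption.

```java
// Hypothetical sketch of a back-pressure-aware funnel batch size.
final class FunnelBatchSize {
    // The hard-coded batch size mentioned above.
    static final int MAX_BATCH = 100;

    // Returns how many flow files to transfer in one run so that the
    // downstream queue does not exceed its object-count threshold.
    static int batchSize(long backPressureThreshold, long queuedCount) {
        long remaining = backPressureThreshold - queuedCount;
        if (remaining <= 0) {
            return 0; // assumption: transfer nothing once the threshold is reached
        }
        return (int) Math.min(MAX_BATCH, remaining);
    }

    public static void main(String[] args) {
        System.out.println(batchSize(10, 0));      // prints 10
        System.out.println(batchSize(10, 9));      // prints 1
        System.out.println(batchSize(10_000, 50)); // prints 100
        System.out.println(batchSize(10, 10));     // prints 0
    }
}
```

With a threshold of 10 and an empty downstream queue, this would move 10 flow files instead of 100, which is the case Peter reported in NIFI-4486.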

Pierre

2017-10-16 5:05 GMT+02:00 Peter Wicks (pwicks) :

> Joe,
>
> It really is about just forgetting that penalization is a thing. Penalized
> files are fairly well marked when you do a List Queue.
>
> I think Funnels need an overall re-examination. I noticed another quirk
> the other day when moving queues around that already contained FlowFiles;
> Funnels ignore back pressure settings if there is any space available in
> the down-stream queue.
>
> Prep the FlowFiles: https://photos.app.goo.gl/Fu3EBDtQZ5wurQNt2
> Configure the Queue to only allow Back Pressure of 10 files:
> https://photos.app.goo.gl/17OlJSu2NXkxQ8lZ2
> Funnel grabs 100 FlowFiles no matter what and shoves them through:
> https://photos.app.goo.gl/vEwoZYETH6iMImBJ3
>
> If you let the down-stream processor run until there is space for 1
> FlowFile available then it loads in another 100 flow files:
> https://photos.app.goo.gl/R4P5mdXr3L5oJnSw2
>
> I created a ticket: NIFI-4486.
>
> Thanks,
>   Peter
>
> -Original Message-
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Tuesday, October 10, 2017 10:01 AM
> To: dev@nifi.apache.org
> Subject: Re: [EXT] Re: Funnel Queue Slowness
>
> Peter,
>
> I see your point that it feels not natural or at least surprising.
> There are two challenges I see with what you propose.  One is user
> oriented and the other is technical.
>
> The user oriented one is that penalized objects are penalized as a
> function of the thing that last operated on them.  The further away we let
> the data get the harder it would be to reason over why they were penalized
> in the first place.
>
> The technical one is that once something is penalized and placed into the
> queue there is prioritization and polling logic that kicks in as a factor.
> I'm not sure how we'd tweak it for that to be ok in some cases and in
> others not.  Perhaps we could just make funnels truly a pass-through and
> when calculating the queue we're storing on figure out the first non-funnel
> queue provided there is no cloning/branching we'd have to account for.  But
> even then it brings us back to the previous point which is the user
> challenge of knowing what thing penalized objects in queue in the first
> place.
>
> Alternatively, we should review whether it is obvious enough (or at
> all) that items within a queue at a given moment in time are penalized.
> I've worked with NiFi for a very long time and I'll be honest and state
> I've forgotten that penalization was a thing more than a few times too.
>
> What do you think?
>
> Thanks
>
> On Mon, Oct 9, 2017 at 9:01 PM, Peter Wicks (pwicks) 
> wrote:
> > Bryan,
> >
> > Yes, it was the penalty causing the issue. This feels like weird
> behavior for Funnels, and I’m not sure if it makes sense for penalties to
> work this way.
> >
> > Would it make more sense if penalties were generally kept as is, but not
> applied at Funnels, then the penalty would kick back in at the first
> non-funnel queue?
> >
> > Thanks,
> >   Peter
> >
> > From: Bryan Bende [mailto:bbe...@gmail.com]
> > Sent: Monday, October 09, 2017 7:33 PM
> > To: dev@nifi.apache.org
> > Subject: [EXT] Re: Funnel Queue Slowness
> >
> > Peter,
> >
> > The images didn’t come across for me, but since you mentioned that a
> failure queue is involved, is it possible all the flow files going to
> failure are being penalized which would cause them to not be processed
> immediately?
> >
> > -Bryan
> >
> >
> > On Oct 8, 2017, at 10:49 PM, Peter Wicks (pwicks) wrote:
> >
> > I’ve been running into an issue on 1.4.0 where my Funnel sometimes runs
> slow. I haven’t been able to create a nice reproducible test case to pass
> on.
> > What I’m seeing is that my failure queue on the right will start to fill
>