Hi Peter,

I'm making a few assumptions about batch sizes in the flow, mainly that a
single flowfile (representing a batch) goes into JsonToSql and then many
flowfiles are produced.

An inline ExecuteStreamCommand processor placed before the PutSQL
processor may work.  The ESC would run a script that deletes the records
from the old table, then pass the 'original' flowfile on to PutSQL.  That
keeps the kill and the fill almost within the same transaction.  An
additional validation flow could help, since I could see the potential for
database downtime occurring mid-batch.
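
For concreteness, here is a rough sketch of what that delete script might
look like.  Everything in it is a placeholder: I'm assuming a PostgreSQL
destination reached through psycopg2, and the connection string and
destination_table name would be whatever your environment actually uses.
ExecuteStreamCommand would hand the untouched flowfile to PutSQL via its
'original' relationship.

    #!/usr/bin/env python
    # Sketch only: driver, credentials, and table name are placeholders.
    # ExecuteStreamCommand runs this once per incoming flowfile; the flowfile
    # content itself is not needed here and is passed downstream unchanged.
    import psycopg2  # assumed PostgreSQL destination; swap in your own driver

    def kill_old_rows():
        conn = psycopg2.connect("host=dbhost dbname=target user=nifi password=secret")
        try:
            with conn:  # commits on success, rolls back on error
                with conn.cursor() as cur:
                    # Kill step: clear out the previous batch entirely.
                    cur.execute("DELETE FROM destination_table")
        finally:
            conn.close()

    if __name__ == "__main__":
        kill_old_rows()

One caveat: if JsonToSql has already fanned the batch out into many
flowfiles at this point, you'd want the delete to fire only once per batch
rather than once per statement; how you gate that depends on how the batch
is split upstream.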

For the second issue (preventing batches from mixing), you could set the
back pressure object threshold to "1" on the connection into PutSQL.  That
way back pressure is applied whenever the queue is not empty.  Additionally,
a 1-minute run schedule on the penultimate processor would further ensure
you pull one flowfile at a time.
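
Roughly, the settings I have in mind are below.  The property names are
from a recent NiFi build, so double-check them against your version:

    Connection feeding PutSQL:
        Back Pressure Object Threshold: 1
    Processor feeding that connection (Scheduling tab):
        Run Schedule: 1 min
        Concurrent Tasks: 1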

The delete-the-first-flowfile idea is an interesting one.
A PutFile/GetFile combo, where new flowfiles overwrite older flowfiles of
the same filename on disk, would accomplish this.  I don't think there is
currently an in-memory version of that.
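
On disk, something like the following could work.  The directory and fixed
filename are only placeholders, and the UpdateAttribute ahead of PutFile is
there to pin every batch to the same filename so newer batches overwrite
older ones:

    UpdateAttribute (before PutFile):
        filename: latest-batch
    PutFile:
        Directory: /data/nifi/killfill-staging
        Conflict Resolution Strategy: replace
    GetFile:
        Input Directory: /data/nifi/killfill-staging
        Keep Source File: false
        Run Schedule: long enough that a newer batch can land and replace an
        older one before GetFile picks it up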

Alternatively, putting on my Rube Goldberg hat, I'm envisioning two
UpdateAttribute processors and one RouteOnAttribute processor in a loop.
Leveraging flowfile prioritization (newest first) on the connections and
the run schedules of the UpdateAttribute processors, one *could* engineer
conditions where non-new flowfiles do not get an attribute "refreshed" in
time and are subsequently routed out (or aged out) of the loop.  A few new
challenges arise in this thought experiment, namely applying proper
backpressure to the loop as a whole, and failures when a heavily loaded
instance takes more time than expected.  I'll think about this a bit more,
run some tests, and get back to you.
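
To make the thought experiment a bit more concrete, here is the shape of
the loop I'm picturing.  The attribute name, threshold, and schedules are
purely illustrative:

    UpdateAttribute (on entry to the loop):
        last.touched: ${now():toNumber()}
    Loop connections:
        Prioritizer: NewestFlowFileFirstPrioritizer
    UpdateAttribute (inside the loop, with a slow Run Schedule so only the
    newest flowfile gets re-stamped in time):
        last.touched: ${now():toNumber()}
    RouteOnAttribute (Routing Strategy: Route to Property name):
        stale: ${now():toNumber():minus(${last.touched:toNumber()}):gt(30000)}
        'stale' exits the loop (or is dropped); 'unmatched' loops back around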

Thanks,
Lee



On Aug 26, 2016, at 7:45 AM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:

I have a source SQL table that I’m reading with a SQL select statement.  I
want to kill and fill a destination SQL table with this source data on an
interval.



My non-kill-and-fill pattern is: ExecuteSQL -> Avro To JSON -> JSON To SQL
-> PutSQL.



I’m trying to come up with a good way to delete existing data first before
loading new data.

One option I’ve considered is to mark the original Avro file with a UUID
and add this attribute as a field in the destination table; then do a split
off, ReplaceText, and delete all rows where the UUID doesn’t match this
batch.  I think this could work, but I’m worried about timing the SQL
DELETE.  I kind of want the kill and the fill steps to happen in a single
transaction.



The other issue is what happens if PutSQL has to go down for a while due to
database downtime and I get several kill-and-fill batches piled up.  Is
there a way I can use backpressure to make sure only a single file gets
converted from JSON to SQL at a time in order to avoid mixing batches?

I also considered FlowFile expiration, but is there a way I can tell NiFi
to only expire a FlowFile when a new FlowFile has entered the queue?  Ex: 1
flowfile in queue, no expiration occurs; a 2nd (newer) FlowFile enters the
queue, then the first file expires itself.



Thanks,

  Peter
