On Tue, Dec 5, 2017 at 11:55 PM, Peter Wicks (pwicks) <pwi...@micron.com>
wrote:

> Alberto,
>
Hello Peter,

Thanks for your answer.

>
>
> Since it sounds like you have control over the structure of the tables,
> this should be doable.
>
>
>
> If you have a changelog table for each table this will probably be easier,
> and in your changelog table you’ll need to make sure you have a good
> transaction timestamp column and a change type column (I/U/D). Then use
> QueryDatabaseTable to tail your change log table, one copy of
> QueryDatabaseTable for each change table.
>

Yes, this is the way I'm trying to do it. I have the TimeStamp and
Operation type columns as "metadata columns", plus all the other "data
columns" of each table.

>
>
> Now your changes are in easy to ingest Avro files. For HIVE I’d probably
> use an external table with the Avro schema, this makes it easy to use
> PutHDFS to load the file and make it accessible from HIVE. I haven’t used
> Phoenix, sorry.
>

Hmm. Sounds interesting.
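
If I understood the idea correctly, the external table over the Avro files
would be something like this (database, table and path names below are just
an illustration on my side, not our real setup):

-- Sketch only: external Hive table over the Avro files that PutHDFS writes.
-- Names and the HDFS location are made up for the example.
CREATE EXTERNAL TABLE staging.my_table_changelog
STORED AS AVRO
LOCATION '/data/changelog/my_table'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/my_table_changelog.avsc');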

I was planning to use ORC because it allows transactions (to make updates /
deletes). Avro does not allow transactions, but changing data through HDFS
instead of HiveQL could be an option.
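
What I had in mind for the transactional target tables was roughly this
(table name, columns and bucketing are only an example, matching the
changelog sample further below):

-- Sketch of a transactional ORC target table (Hive ACID); illustrative names only.
CREATE TABLE datalake.my_table (
  Column_A INT,
  Column_B INT,
  Column_C INT
)
CLUSTERED BY (Column_A) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');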

Would it be possible to update fields of specific records using PutHDFS?

In my changelog table I do not have the entire row data when the trigger
fires for an update. I only have the values of the changed fields (unchanged
fields have <null> values in the changelog tables), for example:

_TimeStamp                         _Operation  Column_A  Column_B  Column_C
2017-12-01 14:35:56:204 - 02:00    3           7501      <null>    <null>
2017-12-01 14:35:56:211 - 02:00    4           7501      1234      <null>
2017-12-01 15:25:35:945 - 02:00    3           7503      <null>    <null>
2017-12-01 15:25:35:945 - 02:00    4           7503      5678      <null>

In the example above, we had two update operations (_Operation = 4).
Column_B was changed, Column_C was not; Column_C could hold any prior value,
which the changelog does not capture.
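
If I end up applying the updates with HiveQL, I imagine something along
these lines, assuming <null> in the changelog always means "field not
changed" and that _Operation = 4 is an update (names are illustrative and I
have not tested this yet):

-- Sketch only: apply update rows, keeping the current value whenever the
-- changelog column is NULL (i.e. the field was not touched by the update).
MERGE INTO datalake.my_table AS t
USING staging.my_table_changelog AS c
ON t.Column_A = c.Column_A
WHEN MATCHED AND c._Operation = 4 THEN UPDATE SET
  Column_B = COALESCE(c.Column_B, t.Column_B),
  Column_C = COALESCE(c.Column_C, t.Column_C);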


> If you have a single change table for all tables, then you can still use
> the above pattern, but you’ll need a middle step where you extract and
> rebuild the changes. Maybe if you store the changes in JSON you could
> extract them using one of the Record parsers and then rebuild the data row.
> Much harder though.
>

I have one changelog table for each table.

Considering that I would use HiveQL to update the tables in the Data Lake,
could I use a RouteOnContent processor to route the changes by _Operation
type and build the corresponding SQL queries?
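
Just to illustrate what I have in mind, the statements generated after
routing could look more or less like this (again illustrative names, and
assuming _Operation = 4 is an update as in the sample above):

-- Insert:
INSERT INTO datalake.my_table VALUES (7501, 1234, NULL);
-- Update (only the changed column):
UPDATE datalake.my_table SET Column_B = 1234 WHERE Column_A = 7501;
-- Delete:
DELETE FROM datalake.my_table WHERE Column_A = 7501;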

> Thanks,
>
>   Peter
>
Thank you!

Alberto


> *From:* Alberto Bengoa [mailto:albe...@propus.com.br]
> *Sent:* Wednesday, December 06, 2017 06:24
> *To:* users@nifi.apache.org
> *Subject:* [EXT] CDC like updates on Nifi
>
>
>
> Hey folks,
>
>
>
> I read about the NiFi CDC processor for MySQL and other CDC "solutions" with
> NiFi found on Google, like these:
>
>
>
> https://community.hortonworks.com/idea/53420/apache-nifi-processor-to-address-cdc-use-cases-for.html
>
> https://community.hortonworks.com/questions/88686/change-data-capture-using-nifi-1.html
>
> https://community.hortonworks.com/articles/113941/change-data-capture-cdc-with-apache-nifi-version-1-1.html
>
>
>
> I'm trying a different approach to acquire fresh information from tables,
> using triggers on source database's tables to write changes to a "changelog
> table".
>
>
>
> This is done, but my questions are:
>
>
>
> Would NiFi be capable of reading these tables and transforming the data to
> generate an equivalent SQL query (insert/update/delete) to send to Hive
> and/or Phoenix with the currently available processors?
>
>
>
> Which would be the best / suggested flow?
>
>
>
> The objective is to keep the tables in the Data Lake as up to date as possible
> for real-time analysis.
>
>
>
> Cheers,
>
> Alberto
>
