Alberto,

You probably just need to try out the options and see what works best (Avro or ORC, etc.).
With the Avro option, you wouldn't need to change the type of your main HIVE table; keep that as ORC. Only the staging table would use Avro. Then call HiveQL to merge the data from your staging table into your main table. Let your cluster's CPU power crunch through the data to do the merge.

If you split the data using SplitRecord into individual rows, then you could probably route on the transaction type. But working with individual rows in NiFi adds a lot of overhead, and just imagine executing 10k HiveQL statements instead of 1 big one… If you have ACID enabled I guess it would all get recombined, but the overhead of calling that many statements would be really high.

--Peter

From: Alberto Bengoa [mailto:albe...@propus.com.br]
Sent: Thursday, December 07, 2017 02:27
To: users@nifi.apache.org
Subject: Re: [EXT] CDC like updates on Nifi

On Tue, Dec 5, 2017 at 11:55 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:

> Alberto,

Hello Peter,

Thanks for your answer.

> Since it sounds like you have control over the structure of the tables, this should be doable. If you have a changelog table for each table this will probably be easier, and in your changelog table you'll need to make sure you have a good transaction timestamp column and a change type column (I/U/D). Then use QueryDatabaseTable to tail your changelog table, one copy of QueryDatabaseTable for each change table.

Yes, this is the way I'm trying to do it. I have the TimeStamp and Operation type columns as "metadata columns", plus all the other "data columns" of each table.

> Now your changes are in easy-to-ingest Avro files. For HIVE I'd probably use an external table with the Avro schema; this makes it easy to use PutHDFS to load the file and make it accessible from HIVE. I haven't used Phoenix, sorry.

Hmm, sounds interesting. I was planning to use ORC because it allows transactions (to make updates / deletes).
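For reference, the staging-into-main merge Peter describes could be sketched in HiveQL roughly like this. All table and column names here are hypothetical, and Hive's MERGE statement requires Hive 2.2+ with an ACID-enabled (transactional ORC) target table:

```sql
-- Hypothetical sketch of the staging merge: main_orc is the ACID ORC
-- table, staging_avro the Avro-backed external table loaded via PutHDFS.
-- The change-type column follows Peter's I/U/D suggestion.
MERGE INTO main_orc AS t
USING staging_avro AS s
  ON t.id = s.id
WHEN MATCHED AND s.change_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET column_b = s.column_b, column_c = s.column_c
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.column_b, s.column_c);
```

One statement like this per batch is the "1 big one" Peter contrasts with executing 10k row-level statements.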
Avro does not allow transactions, but changing data via HDFS instead of HiveQL would be an option. Would it be possible to update fields of specific records using PutHDFS?

In my changelog table I do not have the entire row data when triggered by an update; I only have the values of the changed fields (unchanged fields have <null> values in the changelog table):

_TimeStamp                       _Operation  Column_A  Column_B  Column_C
2017-12-01 14:35:56:204 - 02:00  3           7501      <null>    <null>
2017-12-01 14:35:56:211 - 02:00  4           7501      1234      <null>
2017-12-01 15:25:35:945 - 02:00  3           7503      <null>    <null>
2017-12-01 15:25:35:945 - 02:00  4           7503      5678      <null>

In the example above, we had two update operations (_Operation = 4). Column_B was changed; Column_C was not. Column_C would keep whatever prior value it had.

> If you have a single change table for all tables, then you can still use the above pattern, but you'll need a middle step where you extract and rebuild the changes. Maybe if you store the changes in JSON you could extract them using one of the Record parsers and then rebuild the data row. Much harder though.

I have one changelog table for each table.

Considering that I would use HiveQL to update tables on the Data Lake, could I use a RouteOnContent processor to create SQL queries according to the _Operation type?

> Thanks,
> Peter

Thank you!
Alberto

From: Alberto Bengoa [mailto:albe...@propus.com.br]
Sent: Wednesday, December 06, 2017 06:24
To: users@nifi.apache.org
Subject: [EXT] CDC like updates on Nifi

Hey folks,

I read about the NiFi CDC processor for MySQL and other CDC "solutions" with NiFi found on Google, like these:

https://community.hortonworks.com/idea/53420/apache-nifi-processor-to-address-cdc-use-cases-for.html
https://community.hortonworks.com/questions/88686/change-data-capture-using-nifi-1.html
https://community.hortonworks.com/articles/113941/change-data-capture-cdc-with-apache-nifi-version-1-1.html

I'm trying a different approach to acquire fresh information from tables: using triggers on the source database's tables to write changes to a "changelog table". This is done, but my questions are:

Would NiFi be capable of reading these tables and transforming the data to generate equivalent SQL (insert/update/delete) statements to send to Hive and/or Phoenix with the currently available processors? Which would be the best / suggested flow?

The objective is to keep tables on the Data Lake as up-to-date as possible for real-time analyses.

Cheers,
Alberto
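To make the question concrete, the per-row HiveQL such a flow would have to emit for the changelog example in this thread might look like the following. This is a sketch only: datalake_table is a hypothetical ACID-enabled target table, it assumes Column_A is the key, and the _Operation codes follow the example above, where 4 is an update:

```sql
-- Hypothetical statements a NiFi flow would generate from the two
-- update rows (_Operation = 4) in the changelog example; only the
-- non-null "data columns" of each row are applied. Inserts and
-- deletes would map to INSERT / DELETE statements the same way.
UPDATE datalake_table SET Column_B = 1234 WHERE Column_A = 7501;
UPDATE datalake_table SET Column_B = 5678 WHERE Column_A = 7503;
```

This row-at-a-time pattern is exactly what Peter cautions against at volume; landing the changelog rows in a staging table and running one MERGE per batch avoids issuing thousands of individual statements.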