Hi Team,

We at Microsoft open source are developing a custom Azure Data Explorer (ADX) 
sink connector for Apache NiFi. What we want to achieve is transactional data 
ingestion. The source for the processor can be TBs of telemetry data as well 
as CDC logs. Transactionality here means that if ingestion of any one 
partition into ADX fails, we must delete/clean up the data already ingested 
for the other partitions. Since Azure Data Explorer is an append-only 
database, unfortunately we can't delete the ingested data of the same or any 
other partition.

So, to achieve this kind of transactionality for large ingestions, we are 
thinking of implementing something similar to what we did for the Apache 
Spark ADX connector: we create temporary tables inside Azure Data Explorer 
before ingesting into the actual tables. The Spark worker nodes ingest into 
these temporary tables and report their ingestion status to the driver node. 
Once the driver node has received a success status from all worker nodes, it 
performs the ingestion into the actual table; otherwise the ingestion is 
aborted and the temporary tables are cleaned up. In short, we aggregate the 
worker nodes' task statuses on the driver node to decide whether or not to 
ingest the data into the ADX table.
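For clarity, here is a rough sketch of that driver-side commit/abort decision. All names (`PartitionResult`, `finalize`, the `move_to_actual`/`drop_temp` callbacks) are illustrative, not our actual connector code:

```python
# Sketch of the two-phase ingestion pattern described above (illustrative,
# not the real Spark/ADX connector code). Each worker ingests its partition
# into a temporary table and reports a status; the driver commits only if
# every partition succeeded, otherwise all temporary tables are dropped.

from dataclasses import dataclass

@dataclass
class PartitionResult:
    partition_id: int
    temp_table: str
    success: bool

def decide_commit(results):
    """Driver-side aggregation: commit iff all partitions succeeded."""
    return all(r.success for r in results)

def finalize(results, move_to_actual, drop_temp):
    """Commit or abort the whole ingestion based on aggregated statuses."""
    if decide_commit(results):
        for r in results:
            # e.g. issue the control command that moves data from the
            # temporary table into the destination table
            move_to_actual(r.temp_table)
        return "committed"
    for r in results:
        # abort: clean up every partition's temporary table
        drop_temp(r.temp_table)
    return "aborted"
```

In ADX, the commit step could be backed by the `.move extents` control command from the temporary table into the destination table (a metadata-only operation); the callbacks above just stand in for those control commands.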
Question 1: Is this possible in Apache NiFi, which follows a zero-master 
cluster strategy as opposed to the master-slave architecture of Apache Spark?
Question 2: In our custom NiFi processor, is it possible to run custom code 
on one particular node, say the cluster coordinator node? Also, is it 
possible to get the details of the partition inside the processor?
Question 3: Is it possible to get the details of the tasks executed on the 
various partitions and take decisions based on their statuses? Can all of 
this be done inside the same processor?

Thanks,
Tanmaya
