-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73081/
-----------------------------------------------------------

Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nikhil Bonte, and 
Sarath Subramanian.


Bugs: ATLAS-4076
    https://issues.apache.org/jira/browse/ATLAS-4076


Repository: atlas


Description
-------

Observations:
=============
Create a Hive table and attach a classification to it in Atlas. Enable 
propagation on the attached classification.
When you derive a new table from this Hive table, the new table has the 
propagated classification, as expected.
However, the entity audits of the newly derived table have multiple "Propagated 
Classification Added" entries. 

If table derivation is done using Hive Beeline, there are 5 such entries per 
propagated classification.
Using Spark-shell, 3 such entries were observed per propagated classification.

Expected behaviour is to have just 1 entry per propagated classification.

Analysis:
=========
After detecting a relationship and creating the relationship edge, the 
propagated entities (classifications) are reported to entityChangeListener 
through entityChangeNotifier. However, the details of the propagated entities 
are not passed directly to the notifier; they are passed through the request 
context (buffered in the addedPropagation list). 

After processing every edge, the AtlasRelationshipStore manager sends a 
notification to entityChangeListener, which simply reads all the items in the 
request context buffer list. 

In this issue, Hive sends an event that has multiple relationships, and only 
one relationship has propagated entities. Because a notification is sent for 
each relationship (which is correct), the same buffer list is processed 
multiple times (which is wrong).

The following relationships were created:
Created relationship edge from [hive_table] --> [hive_storagedesc] using edge 
label: [__hive_table.sd] 
Created relationship edge from [hive_table] --> [hive_column] using edge label: 
[__hive_table.columns] 
Created relationship edge from [hive_table] --> [hive_table_ddl] using edge 
label: [r:hive_table_ddl_queries] 
Created relationship edge from [hive_table] --> [hive_db] using edge label: 
[__hive_table.db] 
Created relationship edge from [hive_process] --> [hive_process_execution] 
using edge label: [r:hive_process_process_executions] 
Created relationship edge from [hive_process] --> [hive_table] using edge 
label: [__Process.outputs]
Created relationship edge from [hive_process] --> [hive_table] using edge 
label: [__Process.inputs]
===================================================================================================
Created relationship edge from [hive_column_lineage] --> [hive_column] using 
edge label: [__Process.outputs] 
Created relationship edge from [hive_column_lineage] --> [hive_column] using 
edge label: [__Process.inputs] 
Created relationship edge from [hive_column_lineage] --> [hive_process] using 
edge label: [__hive_column_lineage.query] 

In the above list, the highlighted relationship (the one just above the 
separator line) has the propagated classification, but the 3 subsequent 
relationships send 3 more notifications, resulting in 3 extra entries for the 
same classification in the entity audits.

At the end, entityChangeNotifier, while processing mutated entities, explicitly 
notifies for any pending propagated entities, and the buffer list in the 
request context is processed once again. This results in a 4th extra entry in 
the audits.
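
The sequence described in this analysis can be illustrated with a minimal, 
self-contained Java sketch. This is not Atlas code: the class name, the method, 
and the "PII" classification are hypothetical, and a plain list stands in for 
the request context buffer. It assumes 6 edges before the propagating edge and 
3 edges after it, as in the Hive case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Atlas.
public class DuplicateAuditSketch {

    /** Simulates one request and returns the number of audit entries written. */
    public static int countAuditEntries(int edgesBefore, int edgesAfter) {
        List<String> buffer = new ArrayList<>(); // stand-in for the request context buffer
        List<String> audits = new ArrayList<>(); // stand-in for the entity audit log

        int totalEdges = edgesBefore + 1 + edgesAfter;
        for (int edge = 0; edge < totalEdges; edge++) {
            if (edge == edgesBefore) {
                buffer.add("PII"); // only this edge propagates a classification
            }
            // A notification is sent after every edge; it reads the whole
            // buffer and never clears it.
            for (String classification : buffer) {
                audits.add("Propagated Classification Added: " + classification);
            }
        }

        // Final pass while processing mutated entities: the same buffer is
        // read one more time.
        for (String classification : buffer) {
            audits.add("Propagated Classification Added: " + classification);
        }
        return audits.size();
    }

    public static void main(String[] args) {
        // 1 expected entry + 3 from later edges + 1 from the final pass
        System.out.println(countAuditEntries(6, 3)); // prints 5
    }
}
```

Under these assumptions the sketch yields 5 entries, which matches the count 
observed with Hive Beeline.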

Fix:
====

One option was to send the details of the propagated entities directly to the 
notifier and not rely on the request context. This required a lot of code 
changes.
The other option was to clear the buffer in the request context after 
processing it in entityChangeNotifier.

This review request uses the second approach.
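
The idea behind the second approach can be sketched as follows. This is a 
simplified illustration under stated assumptions, not the actual patch: the 
class and method names are hypothetical, and a plain list stands in for the 
buffer kept in the request context. See the diff for the real change.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Atlas.
public class ClearBufferSketch {
    private final List<String> buffer = new ArrayList<>(); // stand-in for the request context buffer
    private final List<String> audits = new ArrayList<>(); // stand-in for the entity audit log

    void addPropagation(String classification) {
        buffer.add(classification);
    }

    /** Writes one audit entry per buffered item, then clears the buffer. */
    void notifyPropagations() {
        for (String classification : buffer) {
            audits.add("Propagated Classification Added: " + classification);
        }
        buffer.clear(); // later notifications no longer see already processed items
    }

    /** Simulates one request and returns the number of audit entries written. */
    public static int countAuditEntries(int edgesBefore, int edgesAfter) {
        ClearBufferSketch request = new ClearBufferSketch();
        int totalEdges = edgesBefore + 1 + edgesAfter;
        for (int edge = 0; edge < totalEdges; edge++) {
            if (edge == edgesBefore) {
                request.addPropagation("PII"); // only this edge propagates
            }
            request.notifyPropagations(); // still sent after every edge
        }
        request.notifyPropagations(); // final pass over mutated entities
        return request.audits.size();
    }

    public static void main(String[] args) {
        System.out.println(countAuditEntries(6, 3)); // prints 1
    }
}
```

The number of notifications is unchanged; only the first one after the 
propagation finds anything in the buffer, so a single audit entry is written.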


Diffs
-----

  
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java 32ad65e7a 
  server-api/src/main/java/org/apache/atlas/RequestContext.java 32ffddde1 


Diff: https://reviews.apache.org/r/73081/diff/1/


Testing
-------

Manual testing was done using both Hive and Spark.
Pre-commit tests passed:
https://ci-builds.apache.org/job/Atlas/job/PreCommit-ATLAS-Build-Test/263/console


Thanks,

Deep Singh
