[ 
https://issues.apache.org/jira/browse/ATLAS-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suma Shivaprasad updated ATLAS-568:
-----------------------------------
    Description: 
Maintaining the same order of operations that were executed in Hive is crucial 
in Atlas as well. If messages are processed out of order, it can easily lead 
to correctness issues in the Atlas repository. For example: table columns 
being dropped and then the table being renamed, tables and databases being 
dropped, etc. all need to be executed in the same order as they were in the 
Hive metastore. There are multiple issues that need to be addressed here:

1. How do we ensure the order of messages on the producer/hook side?
2. Once the producer/hook publishes these messages onto Kafka, how do we 
ensure that the order of processing is the same as the order of publishing?
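For question 2, one common Kafka idiom (an assumption here, not something the 
issue prescribes) is to key every message by a stable entity identifier, such 
as the qualified table name, so that all operations on one entity land on one 
partition; Kafka only guarantees ordering within a single partition. A minimal 
sketch of that routing, using a simple hash in place of Kafka's actual 
partitioner:

```java
import java.nio.charset.StandardCharsets;

// Sketch of producer-side ordering via partition keying. The key scheme
// (qualified table name) is an illustrative assumption. This mimics, but is
// not identical to, Kafka's default partitioner (which uses murmur2).
public class PartitionRouting {
    // Hash the key bytes and map to a partition; mask keeps the value non-negative.
    static int partitionFor(String entityKey, int numPartitions) {
        int hash = java.util.Arrays.hashCode(entityKey.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 4;
        // All operations on the same table hash to the same partition,
        // so ALTER/RENAME/DROP for that table stay in publish order there.
        int p1 = partitionFor("default.customers", partitions);
        int p2 = partitionFor("default.customers", partitions);
        System.out.println(p1 == p2); // true: same partition for the same entity
    }
}
```

This addresses ordering per entity only; operations spanning entities (e.g. a 
rename that touches two names) still need the cross-message ordering discussed 
below.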

One suggested approach is to assign a timestamp to each message on the 
producer side and to window/batch these messages on the consumer/Atlas server 
side: order the messages by timestamp within the window (a configured time 
period), then execute the operations in that order.
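The suggested consumer-side approach can be sketched with a priority queue 
ordered by the producer-assigned timestamp. The message shape, class names, 
and window handling below are assumptions for illustration, not the Atlas 
implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch: buffer hook messages for a configured window, then drain
// them in timestamp order so operations replay in producer order.
public class WindowedReorderer {
    // A hook message carrying the producer-assigned timestamp and operation.
    record HookMessage(long timestamp, String operation) {}

    private final PriorityQueue<HookMessage> window =
            new PriorityQueue<>((a, b) -> Long.compare(a.timestamp(), b.timestamp()));

    // Messages may arrive in any order within the window.
    public void receive(HookMessage msg) {
        window.add(msg);
    }

    // Called when the configured window period elapses: returns the buffered
    // operations in the order the producer executed them.
    public List<String> drainInOrder() {
        List<String> ordered = new ArrayList<>();
        while (!window.isEmpty()) {
            ordered.add(window.poll().operation());
        }
        return ordered;
    }

    public static void main(String[] args) {
        WindowedReorderer r = new WindowedReorderer();
        // Messages arrive out of order (e.g. from parallel consumers).
        r.receive(new HookMessage(300, "RENAME_TABLE"));
        r.receive(new HookMessage(100, "DROP_COLUMN"));
        r.receive(new HookMessage(200, "ALTER_TABLE"));
        System.out.println(r.drainInOrder()); // [DROP_COLUMN, ALTER_TABLE, RENAME_TABLE]
    }
}
```

Note the inherent trade-off: the window length bounds both the added latency 
and the amount of out-of-orderness the consumer can correct.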



  was:Currently, operations like rename fire multiple messages to the 
repository to accomplish the required task (e.g. create the table by the old 
name if it doesn't exist). This can lead to issues when multiple Kafka 
consumers are running, since the messages could then be processed in any 
order, resulting in inconsistent behaviour.


>  Parallelize Hive hook operations
> ---------------------------------
>
>                 Key: ATLAS-568
>                 URL: https://issues.apache.org/jira/browse/ATLAS-568
>             Project: Atlas
>          Issue Type: Sub-task
>    Affects Versions: 0.7-incubating
>            Reporter: Suma Shivaprasad
>             Fix For: 0.7-incubating
>
>
> Maintaining the same order of operations that were executed in Hive is 
> crucial in Atlas as well. If messages are processed out of order, it can 
> easily lead to correctness issues in the Atlas repository. For example: 
> table columns being dropped and then the table being renamed, tables and 
> databases being dropped, etc. all need to be executed in the same order as 
> they were in the Hive metastore. There are multiple issues that need to be 
> addressed here:
> 1. How do we ensure the order of messages on the producer/hook side?
> 2. Once the producer/hook publishes these messages onto Kafka, how do we 
> ensure that the order of processing is the same as the order of publishing?
> One suggested approach is to assign a timestamp to each message on the 
> producer side and to window/batch these messages on the consumer/Atlas 
> server side: order the messages by timestamp within the window (a 
> configured time period), then execute the operations in that order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
