[ 
https://issues.apache.org/jira/browse/ATLAS-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304252#comment-15304252
 ] 

Hemanth Yamijala commented on ATLAS-503:
----------------------------------------

An update on what I’ve investigated so far:

*tl;dr*

In the interim, I am thinking of a retry-based solution to balance the 
concurrency requirements driven by performance against the correctness 
requirements uncovered by this bug. The longer-term fix will likely come either 
from ATLAS-496 or from a deeper understanding of the Titan graph model.

*Longer read, with excuses*

When using HBase as the storage backend, we have observed two specific 
scenarios in which we get lock-related exceptions:
* Creating traits concurrently.
* Ingesting data from Hive with more than one topic partition and consumer 
thread.

The exceptions are triggered when a transaction is committed - which makes 
sense, because Titan enforces consistency constraints only at commit time, as 
described 
[here|http://s3.thinkaurelius.com/docs/titan/0.5.4/eventual-consistency.html]. 
The commits happen from two specific places, each corresponding to one of the 
use cases above, respectively (a minimal sketch of the two paths follows the 
list):
* ManagementSystem.commit
* TitanGraph.commit
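
To make the two paths concrete, here is a minimal sketch assuming the Titan 
0.5.x API; the property names and the properties file path are placeholders I 
made up for illustration, not values from the Atlas code:

{code:java}
// Minimal sketch of the two commit paths that show up in the stack traces.
// Assumes Titan 0.5.x; names and the config path are placeholders.
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import com.tinkerpop.blueprints.Vertex;

public class CommitPaths {
    public static void main(String[] args) {
        TitanGraph graph = TitanFactory.open("conf/titan-hbase.properties");

        // Path 1: trait/type creation goes through the management system;
        // consistency is checked when ManagementSystem.commit runs.
        TitanManagement mgmt = graph.getManagementSystem();
        mgmt.makePropertyKey("exampleTraitProperty").dataType(String.class).make();
        mgmt.commit();   // ManagementSystem.commit

        // Path 2: data ingest (e.g. hive entities) runs in a regular graph
        // transaction; the lock is checked when this transaction commits.
        Vertex table = graph.addVertex(null);
        table.setProperty("qualifiedName", "default.customers@cl1");
        graph.commit();  // TitanGraph.commit

        graph.shutdown();
    }
}
{code}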

Also, with [~suma.shivaprasad]’s help, I understood that we have changed the 
HBase store manager configuration of Atlas (from the defaults of Titan) to 
indicate that we will take care of locking ourselves. This was done because 
Titan's own locking implementation was otherwise found to degrade performance 
heavily. (I have confirmed this with tests from my end as well.)

To take over this management of locking, we have implemented a pessimistic 
locking mechanism in {{HBaseKeyColumnValueStore.acquireLock}}. Further, if 
there is a lock conflict, we immediately throw a {{PermanentLockingException}} 
and the transaction fails. The granularity of the lock is at the store, key and 
column level. In the tests above, we are running into the scenario where 
multiple threads try to concurrently acquire a lock at the same granularity (a 
rough sketch of this failure mode follows the list). Specifically, for the two 
scenarios above, I've observed respectively:
* lock on the edgestore database (that stores the adjacency graph of Titan)
* lock on the graph index database (that maps from property value to vertex)
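
The following is only an illustrative sketch of that failure mode, not the 
actual Atlas/Titan code; the in-memory map stands in for whatever mechanism the 
store really uses to claim a (store, key, column) triple:

{code:java}
// Illustrative sketch only: a pessimistic lock keyed on (store, key, column)
// that fails the transaction immediately on conflict, with no retry.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.thinkaurelius.titan.diskstorage.locking.PermanentLockingException;

public class LockSketch {
    // one entry per (store, key, column) triple that is currently locked
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();

    public void acquireLock(String store, String key, String column, String txId)
            throws PermanentLockingException {
        String lockId = store + ":" + key + ":" + column;
        // putIfAbsent is atomic: the first transaction to claim the triple wins
        String owner = locks.putIfAbsent(lockId, txId);
        if (owner != null && !owner.equals(txId)) {
            // no retry today: the losing thread's transaction fails outright
            throw new PermanentLockingException(
                    "Lock contention on " + lockId + ", held by " + owner);
        }
    }
}
{code}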

What has been difficult is identifying the specific key and column on which the 
lock is being acquired. The key and column values appear to be heavily encoded, 
and apart from some printable characters, it has not been easy to decode them.

The general fixes for locking conflicts, as described [here by Stephen 
Mallette|https://groups.google.com/d/msg/aureliusgraphs/LbOx0wKhULc/u6q63GQrkg0J],
 include:
* Retry transactions (a minimal sketch of this option follows the list).
* Keep committing transactions regularly - which I think we already do for the 
most part.
* Change the schema to eliminate the need for locking.

In my mind, these are in increasing order of complexity.
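
For the first of these options, a retry loop around the whole unit of work 
would look roughly like the sketch below; the retry count, backoff and the 
catch-all {{TitanException}} are my assumptions, since the locking failure 
surfaces at commit time wrapped by Titan:

{code:java}
// Sketch of the "retry transactions" option at the application level.
// Retry count, backoff and the work done inside the transaction are
// placeholders; only the catch-rollback-retry shape is the point.
import com.thinkaurelius.titan.core.TitanException;
import com.thinkaurelius.titan.core.TitanGraph;

public class RetryingWriter {
    public static void writeWithRetry(TitanGraph graph, Runnable work, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                work.run();        // mutate the graph (create a trait, ingest an entity, ...)
                graph.commit();    // locking failures surface here
                return;
            } catch (TitanException e) {
                graph.rollback();  // discard the failed transaction
                if (attempt == maxAttempts) {
                    throw e;       // give up after the configured number of attempts
                }
                Thread.sleep(100L * attempt);  // simple linear backoff before retrying
            }
        }
    }
}
{code}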

(Also note ATLAS-496, that [~dkantor] opened)

In the interim, I tried two experiments to fix this issue of concurrent updates:
* *Synchronize the commits* - within a JVM instance, this will clearly work, 
but will most likely impact performance. However, my experiments show that this 
is still faster than letting Titan manage the locking. A slightly more 
sophisticated fix here could be to synchronize on the specific values of store, 
key and column to minimize contention. This has the risk of causing deadlocks, 
though, since I don't know whether we can assume a uniform locking order across 
threads.
* *Add retries to {{HBaseKeyColumnValueStore.acquireLock}}* (a sketch follows 
this list). This worked too, to an extent - the number of retries should equal 
the amount of concurrency expected in the worst case of all concurrent threads 
trying to lock the same store, key and column. This is configurable via the 
option {{atlas.graph.storage.lock.retries}}.
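
Continuing the earlier illustrative sketch (still not the actual Atlas code), 
adding retries to the lock acquisition might look roughly like this; the wait 
interval is a placeholder, and only the retry count corresponds to 
{{atlas.graph.storage.lock.retries}}:

{code:java}
// Illustrative only - extends the earlier LockSketch with retries instead of
// failing on the first conflict.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.thinkaurelius.titan.diskstorage.locking.PermanentLockingException;

public class RetryingLockSketch {
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();
    private final int lockRetries;      // would come from atlas.graph.storage.lock.retries
    private final long lockWaitMillis;  // placeholder wait between attempts

    public RetryingLockSketch(int lockRetries, long lockWaitMillis) {
        this.lockRetries = lockRetries;
        this.lockWaitMillis = lockWaitMillis;
    }

    public void acquireLock(String store, String key, String column, String txId)
            throws PermanentLockingException {
        String lockId = store + ":" + key + ":" + column;
        for (int attempt = 0; attempt <= lockRetries; attempt++) {
            String owner = locks.putIfAbsent(lockId, txId);
            if (owner == null || owner.equals(txId)) {
                return;  // lock acquired (or already held by this transaction)
            }
            try {
                Thread.sleep(lockWaitMillis);  // wait for the competing commit to finish
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new PermanentLockingException("Interrupted while waiting for " + lockId);
            }
        }
        // Retries exhausted: in the worst case every concurrent thread wants the same
        // (store, key, column), so the retry count should match the expected concurrency.
        throw new PermanentLockingException(
                "Could not lock " + lockId + " after " + lockRetries + " retries");
    }
}
{code}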

The *right* solution of changing the schema to eliminate locking requires us to 
understand *when* Titan tries to lock. I find this difficult to determine 
currently. (For example, it doesn't seem to be just for enforcing uniqueness 
constraints.) I will try to get an answer to this, but it could take a while - 
any inputs others have will help here, of course.

To move forward, I am thinking of implementing the safer second option of 
retries, while I try to understand if elimination of locking is possible from a 
model perspective. Any other ideas are welcome - just please keep the short 
term perspective in mind.

> Not all Hive tables are imported into Atlas when interrupted with search 
> queries while importing.  
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-503
>                 URL: https://issues.apache.org/jira/browse/ATLAS-503
>             Project: Atlas
>          Issue Type: Bug
>            Reporter: Sharmadha Sainath
>            Assignee: Hemanth Yamijala
>            Priority: Critical
>             Fix For: 0.7-incubating
>
>         Attachments: hiv2atlaslogs.rtf
>
>
> On running a file containing 100 table creation commands using beeline -f, 
> all Hive tables are created. But only 81 of them are imported into Atlas 
> (HiveHook enabled) when queries like "hive_table" are searched frequently 
> while the import process for the tables is going on.  


