[ https://issues.apache.org/jira/browse/ATLAS-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304252#comment-15304252 ]
Hemanth Yamijala commented on ATLAS-503:
----------------------------------------
An update on what I’ve investigated so far:

*tl;dr* I am considering a retry-based solution in the interim, to balance the concurrency requirements driven by performance against the correctness requirements uncovered by this bug. The longer-term fix will likely come either from ATLAS-496 or from a deeper understanding of the Titan graph model.

*Longer read, with excuses*

When using HBase as a storage backend, we have observed lock-related exceptions in two specific scenarios:
* Creating traits concurrently.
* Ingesting data from Hive with more than one topic partition and consumer thread.

The exceptions are triggered when a transaction is committed, which is expected because Titan enforces consistency constraints only on commit, as described [here|http://s3.thinkaurelius.com/docs/titan/0.5.4/eventual-consistency.html]. The commits happen from two specific places, each corresponding to one of the use cases above:
* ManagementSystem.commit
* TitanGraph.commit

Also, with [~suma.shivaprasad]’s help, I understood that we have changed the HBase store manager configuration of Atlas (from Titan’s defaults) to indicate that we will take care of locking ourselves. This was done because Titan’s own locking implementation was otherwise found to degrade performance heavily. (I have confirmed this with tests on my end as well.) To manage locking ourselves, we have implemented a pessimistic locking mechanism in {{HBaseKeyColumnValueStore.acquireLock}}. Further, if there is a lock conflict, we immediately throw a {{PermanentLockingException}} and the transaction fails. The granularity of the lock is at the store, key, and column level. In the tests above, we run into the scenario where multiple threads try to concurrently acquire a lock at the same granularity.
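To make the fail-fast behaviour concrete, here is a minimal sketch of pessimistic locking at (store, key, column) granularity that throws on conflict instead of waiting. This is illustrative only: apart from the method name {{acquireLock}} and the exception name {{PermanentLockingException}}, all class and field names are my own, not Titan's actual implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of fail-fast pessimistic locking, assuming locks are
// tracked at (store, key, column) granularity as described above.
public class FailFastLockSketch {

    // Stand-in for Titan's PermanentLockingException.
    public static class PermanentLockingException extends Exception {
        public PermanentLockingException(String msg) { super(msg); }
    }

    // One entry per currently held (store, key, column) lock.
    private final Set<String> heldLocks = ConcurrentHashMap.newKeySet();

    public void acquireLock(String store, String key, String column)
            throws PermanentLockingException {
        String lockId = store + "/" + key + "/" + column;
        // add() returns false if another transaction already holds this lock:
        // fail immediately rather than block or retry.
        if (!heldLocks.add(lockId)) {
            throw new PermanentLockingException("Lock conflict on " + lockId);
        }
    }

    public void releaseLock(String store, String key, String column) {
        heldLocks.remove(store + "/" + key + "/" + column);
    }
}
```

Under this scheme, two threads contending for the same store, key, and column will always cause one of them to fail its transaction, which matches the exceptions seen in the concurrent trait-creation and Hive-ingest tests.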
Specifically, for the two scenarios above, I’ve observed respectively:
* a lock on the edgestore database (which stores Titan’s adjacency graph)
* a lock on the graph index database (which maps property values to vertices)

What has been difficult is identifying the specific key and column on which the lock is being acquired. The key and column values appear to be heavily encoded, and except for some printable characters they have not been easy to identify.

The general fixes for locking issues, as described [here by Stephen Mallette|https://groups.google.com/d/msg/aureliusgraphs/LbOx0wKhULc/u6q63GQrkg0J], include:
* Retry transactions.
* Keep committing transactions regularly - which I think we do for the most part.
* Change the schema to eliminate the need for locking.

In my mind, these are in increasing order of complexity. (Also note ATLAS-496, which [~dkantor] opened.)

In the interim, I tried two experiments to fix this issue of concurrent updates:
* *Synchronize the commits* - within a JVM instance this will clearly work, but it will most likely impact performance. However, my experiments show that this is still faster than letting Titan manage the locking. A slightly more sophisticated fix would be to synchronize on the specific values of store, key, and column to minimize contention. That carries a risk of deadlock, though, as I don’t know whether we can assume a uniform locking order across threads.
* *Add retries to {{HBaseKeyColumnValueStore.acquireLock}}* - this worked too, to an extent: the number of retries should equal the expected concurrency for the worst case of all concurrent threads trying to lock the same store, key, and column. This is configurable via the option {{atlas.graph.storage.lock.retries}}.

The *right* solution of changing the schema to eliminate locking requires us to understand *when* Titan tries to lock, which I currently find difficult to determine. (For example,
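The second experiment above can be sketched as a bounded retry loop around the fail-fast acquisition. This is a hypothetical illustration, not Atlas code: only the retry-count idea mirrors the {{atlas.graph.storage.lock.retries}} option; the interface, backoff, and exception names are my own assumptions.

```java
// Hypothetical sketch of the interim retry approach: wrap a fail-fast lock
// acquisition in a bounded retry loop with a simple linear backoff.
public class RetryingLockSketch {

    // Stand-in for the locking exception thrown when all retries are spent.
    public static class LockConflictException extends Exception {
        public LockConflictException(String msg) { super(msg); }
    }

    // One attempt to take the lock; returns true on success, false on conflict.
    public interface LockAttempt {
        boolean tryAcquire();
    }

    // In the worst case, maxRetries should match the number of threads
    // contending for the same (store, key, column) triple.
    public static void acquireWithRetries(LockAttempt attempt, int maxRetries)
            throws LockConflictException, InterruptedException {
        for (int i = 0; i <= maxRetries; i++) {
            if (attempt.tryAcquire()) {
                return; // lock acquired
            }
            Thread.sleep(10L * (i + 1)); // back off before the next attempt
        }
        throw new LockConflictException(
                "Could not acquire lock after " + maxRetries + " retries");
    }
}
```

The design trade-off is visible here: retries convert an immediate transaction failure into bounded extra latency, but a retry count lower than the actual contention level still fails, which is why the option must be sized for the worst expected concurrency.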
it doesn’t seem to be just for enforcing uniqueness constraints.) I will try to get an answer to this, but it could take a while - any input others have will help here, of course. To move forward, I am thinking of implementing the safer second option of retries, while I try to understand whether eliminating locking is possible from a model perspective. Any other ideas are welcome - just please keep the short-term perspective in mind.

> Not all Hive tables are imported into Atlas when interrupted with search queries while importing.
> -------------------------------------------------------------------------------------------------------
>
> Key: ATLAS-503
> URL: https://issues.apache.org/jira/browse/ATLAS-503
> Project: Atlas
> Issue Type: Bug
> Reporter: Sharmadha Sainath
> Assignee: Hemanth Yamijala
> Priority: Critical
> Fix For: 0.7-incubating
>
> Attachments: hiv2atlaslogs.rtf
>
> On running a file containing 100 table creation commands using beeline -f, all hive tables are created. But only 81 of them are imported into Atlas (HiveHook enabled) when queries like "hive_table" are searched frequently while the import process is going on.