[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899918#action_12899918 ]

Namit Jain commented on HIVE-1293:

Agreed on the bug in getLockObjects() - will have a new patch.
Filed a new patch for the followup: https://issues.apache.org/jira/browse/HIVE-1293

Concurreny Model for Hive

Key: HIVE-1293
URL: https://issues.apache.org/jira/browse/HIVE-1293
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Fix For: 0.7.0
Attachments: hive.1293.1.patch, hive.1293.2.patch, hive.1293.3.patch, hive.1293.4.patch, hive.1293.5.patch, hive_leases.txt

Concurrency model for Hive:

Currently, Hive does not provide a good concurrency model. The only guarantee provided in case of concurrent readers and writers is that a reader will not see partial data from the old version (before the write) and partial data from the new version (after the write). This has come across as a big problem, especially for background processes performing maintenance operations.

The following possible solutions come to mind.

1. Locks: Acquire read/write locks - they can be acquired at the beginning of the query, or the write locks can be delayed till the move task (when the directory is actually moved). Care needs to be taken to avoid deadlocks.
2. Versioning: The writer can create a new version if the current version is being read. Note that this is not equivalent to snapshots: the old version can only be accessed by the current readers, and will be deleted when all of them have finished.

Comments.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900128#action_12900128 ]

Namit Jain commented on HIVE-1293:

Fixed a lot of bugs, added a lot of comments, and tested it with a ZooKeeper cluster of 3 nodes.
select * currently performs a dirty read; we can add a new parameter to change that behavior if need be.

Attachments: hive.1293.1.patch, hive.1293.2.patch, hive.1293.3.patch, hive.1293.4.patch, hive.1293.5.patch, hive.1293.6.patch, hive_leases.txt
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899320#action_12899320 ]

Joydeep Sen Sarma commented on HIVE-1293:

a little bummed that locks need to be held for the entire query execution. that could mean a writer blocking readers for hours.

hive's query plans seem to have two distinct stages:

1. read a bunch of stuff, compute intermediate/final data
2. move final data into output locations

i.e. a single query never reads what it writes (into a final output location). even if #1 and #2 are mingled today, they can easily be put in order. in that sense, we only need to get shared locks for all read entities involved in #1 to begin with. once phase #1 is done, we can drop all the read locks and get the exclusive locks for all the write entities in #2, perform #2 and quit. that way exclusive locks are held for a very short duration.

i think this scheme is similarly deadlock free (now there are two independent lock acquire/release phases, and each of them can lock stuff in lexicographic order).
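The two-phase scheme proposed in this comment can be sketched roughly as follows. This is only an illustration, not code from any attached patch: the class TwoPhaseLockPlan, its runQuery method and the string-based log are all hypothetical, and entities are represented by plain names. It records the order in which locks would be taken and dropped, with each phase acquiring its locks in lexicographic order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of the two-phase locking idea; names are not from the patch. */
final class TwoPhaseLockPlan {
    /** Records each step so the acquire/release ordering can be inspected. */
    final List<String> log = new ArrayList<>();

    void runQuery(List<String> inputs, List<String> outputs) {
        // TreeSet sorts and de-duplicates, giving a lexicographic lock order.
        TreeSet<String> reads = new TreeSet<>(inputs);
        TreeSet<String> writes = new TreeSet<>(outputs);

        // Phase 1: shared locks on read entities, held while results are computed.
        for (String r : reads) {
            log.add("acquire S " + r);
        }
        log.add("compute");
        for (String r : reads.descendingSet()) {
            log.add("release S " + r);
        }

        // Phase 2: exclusive locks on write entities, held only for the move.
        for (String w : writes) {
            log.add("acquire X " + w);
        }
        log.add("move");
        for (String w : writes.descendingSet()) {
            log.add("release X " + w);
        }
    }
}
```

Because every client sorts its entities the same way within a phase, two clients cannot each hold a lock the other is waiting for, which is the informal deadlock-freedom argument made in the comment.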
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899347#action_12899347 ]

Joydeep Sen Sarma commented on HIVE-1293:

also - i am missing something here:

    + for (WriteEntity output : plan.getOutputs()) {
    +   lockObjects.addAll(getLockObjects(output.getTable(), output.getPartition(), HiveLockMode.EXCLUSIVE));
    + }

getLockObjects():

    + if (p != null) {
    ...
    +   locks.add(new LockObject(new HiveLockObject(p.getTable()), mode));
    + }

doesn't this end up locking the table in exclusive mode if a partition is being written to? (whereas the design talks about locking the table in shared mode only?)
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899640#action_12899640 ]

Namit Jain commented on HIVE-1293:

The partition being written to is locked in exclusive mode - the table should be locked in shared mode.
The write entity should only consist of the partition.

There might be a bug there - https://issues.apache.org/jira/browse/HIVE-1548 should populate the inputs and outputs appropriately.
I will start on this now.
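The intended behaviour described here (partition in the requested mode, parent table in shared mode) might look something like the sketch below. It is not the patch's getLockObjects(): the Lock class, the Mode enum and the string names for tables and partitions are simplified stand-ins introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; Lock and Mode are local stand-ins, not Hive classes. */
final class LockObjectsSketch {
    enum Mode { SHARED, EXCLUSIVE }

    /** A lock request: the name of the object plus the mode to lock it in. */
    static final class Lock {
        final String name;
        final Mode mode;

        Lock(String name, Mode mode) {
            this.name = name;
            this.mode = mode;
        }

        @Override
        public String toString() {
            return name + ":" + mode;
        }
    }

    /** Returns the locks needed for a table or, if partition is non-null, one of its partitions. */
    static List<Lock> getLockObjects(String table, String partition, Mode mode) {
        List<Lock> locks = new ArrayList<>();
        if (partition != null) {
            // The partition takes the requested mode ...
            locks.add(new Lock(table + "/" + partition, mode));
            // ... but the parent table is only ever locked in shared mode.
            locks.add(new Lock(table, Mode.SHARED));
        } else {
            locks.add(new Lock(table, mode));
        }
        return locks;
    }
}
```

With this shape, a writer to one partition would not block readers or writers of sibling partitions, since they all hold compatible shared locks on the table.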
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899647#action_12899647 ]

Joydeep Sen Sarma commented on HIVE-1293:

can you check the getLockObjects() routine? it seemed to me that even if you called it with a partition in X mode, it would add the table to the list of objects to be locked as well (in the same X mode).

i think we should, at least as a follow-on, make the optimization to not lock write entities for the duration of the query.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897917#action_12897917 ]

John Sichi commented on HIVE-1293:

I think ZK default client port would be 2181; see HBASE-2305.
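One way to fall back to that port when hive.zookeeper.client.port is unset is sketched below. This is an assumption about how a default could be applied, not how Hive actually resolves the setting: the ZkPortDefault class and its clientPort helper are hypothetical, and java.util.Properties stands in for Hive's configuration object.

```java
import java.util.Properties;

/** Hypothetical sketch: Properties stands in for the real configuration class. */
final class ZkPortDefault {
    static final String KEY = "hive.zookeeper.client.port";
    /** The ZooKeeper client port suggested in the comment above. */
    static final int DEFAULT_PORT = 2181;

    /** Returns the configured port, or the default when the key is missing or blank. */
    static int clientPort(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_PORT;
        }
        return Integer.parseInt(raw.trim());
    }
}
```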
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897946#action_12897946 ]

John Sichi commented on HIVE-1293:

From testing: the parsed lock mode seems to be case-sensitive:

    hive> lock table blah shared;
    Failed with exception No enum const class org.apache.hadoop.hive.ql.lockmgr.HiveLockMode.shared
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

If I use lock table blah SHARED it works.
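The error message is what Enum.valueOf produces when the string does not match a constant name exactly. A minimal sketch of a case-insensitive parse follows; the HiveLockMode enum here is a local stand-in with assumed constants SHARED and EXCLUSIVE, and LockModeParser is a hypothetical helper, not the code path used by DDLTask.

```java
import java.util.Locale;

/** Hypothetical sketch of case-insensitive lock mode parsing. */
final class LockModeParser {
    /** Local stand-in for org.apache.hadoop.hive.ql.lockmgr.HiveLockMode. */
    enum HiveLockMode { SHARED, EXCLUSIVE }

    /** Normalizes the keyword before the enum lookup, so "shared" and "SHARED" both work. */
    static HiveLockMode parse(String text) {
        // Locale.ROOT avoids locale-specific case mappings (e.g. the Turkish dotless i).
        return HiveLockMode.valueOf(text.trim().toUpperCase(Locale.ROOT));
    }
}
```

An unknown keyword still raises IllegalArgumentException, so the caller keeps a clear failure for genuinely invalid modes.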
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897953#action_12897953 ]

Basab Maulik commented on HIVE-1293:

Re: One lib question: Zookeeper

hbase-handler with hbase 0.20.x does not work with zk 3.3.1 but works fine with the version it ships with, zk 3.2.2. Have not investigated what breaks.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897977#action_12897977 ]

John Sichi commented on HIVE-1293:

Namit, I tried testing with a standalone zookeeper via CLI. Locking a table succeeded, but then show locks didn't show anything, and unlock said the lock didn't exist. I think the reason is that CLI is creating a new Driver for each statement executed, and when the old Driver is closed, the lock manager is closed along with it (closing the ZooKeeper client instance). As a result, locks are released immediately after LOCK TABLE is executed.

When I tested with a thrift server plus two JDBC clients, all was well. I was able to take a lock from one client and prevent the other client from getting the same lock. So I guess the thrift server is keeping one Driver around per connection.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897988#action_12897988 ]

John Sichi commented on HIVE-1293:

Here's a scenario which is not working correctly. (Tested with thrift server plus JDBC clients.)

Existing table foo.

Client 1: LOCK TABLE foo EXCLUSIVE;
Client 2: DROP TABLE foo;

According to the doc, the DROP TABLE should fail, but it succeeds. Same is true for LOAD DATA. Probably the same reason in both cases: for these commands we don't register the output in the PREHOOK (only the POSTHOOK). INSERT is getting blocked correctly since it's in the PREHOOK.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898010#action_12898010 ]

John Sichi commented on HIVE-1293:

After seeing some other issues, had a chat with Namit about semantics; here's what we worked out.

* Normally, locks should only be held for duration of statement execution.
* However, LOCK TABLE should take a global lock (not tied to any particular session or statement).
* UNLOCK TABLE should remove both kinds of lock (statement-level and global). Likewise, SHOW LOCKS shows all.
* For fetching results, we'll need a parameter to control whether a dirty read is possible. Normally, this is not an issue since we're fetching from saved temp results, but when using select * from t to fetch directly from the original table, this behavior makes a difference. To prevent dirty reads, we'll need the statement-level lock to span the duration of the fetch.

To avoid leaks, we need to make sure that once we create a ZooKeeper client, we always close it.
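The agreed semantics can be modelled with a small in-memory sketch. Everything here is hypothetical: LockScopes, Statement and the method names are invented for illustration, plain lists replace ZooKeeper, and lock modes and conflict checks are left out so that only the two lock lifetimes are shown. The try-with-resources pattern in the usage note is the same one that would guarantee a client is always closed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of statement-level versus global locks. */
final class LockScopes {
    private final List<String> statementLocks = new ArrayList<>();
    private final List<String> globalLocks = new ArrayList<>();

    /** Holds locks for one statement and releases them when closed. */
    final class Statement implements AutoCloseable {
        private final List<String> held = new ArrayList<>();

        void lock(String table) {
            held.add(table);
            statementLocks.add(table);
        }

        @Override
        public void close() {
            // Release only the locks this statement took.
            for (String table : held) {
                statementLocks.remove(table);
            }
            held.clear();
        }
    }

    Statement beginStatement() {
        return new Statement();
    }

    /** LOCK TABLE: a global lock, not tied to any statement. */
    void lockTable(String table) {
        globalLocks.add(table);
    }

    /** UNLOCK TABLE: removes both statement-level and global locks on the table. */
    void unlockTable(String table) {
        statementLocks.removeIf(table::equals);
        globalLocks.removeIf(table::equals);
    }

    /** SHOW LOCKS: global locks first, then statement-level locks. */
    List<String> showLocks() {
        List<String> all = new ArrayList<>(globalLocks);
        all.addAll(statementLocks);
        return all;
    }
}
```

A caller would wrap each statement as try (LockScopes.Statement st = scopes.beginStatement()) { ... }, so its locks are dropped even if execution fails, while a lock taken with lockTable survives until unlockTable is called.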
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898012#action_12898012 ]

John Sichi commented on HIVE-1293:

Also, as a followup, need to add client info such as hostname, process ID to SHOW LOCKS.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897548#action_12897548 ]

John Sichi commented on HIVE-1293:

Two configuration questions:

* You have hive.support.concurrency=true in hive-default.xml. Probably we want it false instead (only on during tests), since most people using Hive won't have a ZooKeeper quorum set up?
* Isn't there a default value we can use for hive.zookeeper.client.port?

One lib question:

* ZooKeeper is now available from maven. Maybe we should delete the one in hbase-handler/lib and get it via ivy instead of adding it in the top-level lib? The version we have checked in is 3.2.2, but the maven availability is 3.3.x, so we'd need to test to make sure everything (including hbase-handler) still works with the newer version. http://mvnrepository.com/artifact/org.apache.hadoop/zookeeper

Two cleanups:

* In QTestUtil.java, you left the following code commented out; can we get rid of it?

    + // for (int i = 0; i < qfiles.length; i++) {
    + //   qsetup[i].tearDown();
    + // }

* In DDLTask.java, you left some commented-out debugging code (two instances):

    + //console.printError("conflicting lock present " + tbl + " cannot be locked in mode " + mode);
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897577#action_12897577 ]

Namit Jain commented on HIVE-1293:

Did the cleanups and changed the default value of hive.support.concurrency to false.

Not sure how we can set a default value for hive.zookeeper.client.port.

Let us do the lib cleanup in a follow-up - I will file a jira.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896182#action_12896182 ]

John Sichi commented on HIVE-1293:

Added comments here: https://review.cloudera.org/r/563/ (doesn't seem to be adding comments here)
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885631#action_12885631 ]

He Yongqiang commented on HIVE-1293:
------------------------------------

I am going to commit this patch in the next few days. Please post your comments if you have any, so we can fix them before this patch gets in.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884316#action_12884316 ]

Namit Jain commented on HIVE-1293:
----------------------------------

1. No, two clients cannot get conflicting locks. ZooKeeper will guarantee that; I will read the code again to double-check.

2. If client A cannot get the lock, it will unlock only its own lock. However, there is no security: if client A explicitly unlocks table A, all locks on A are released.
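The two release behaviours described in this comment can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `LockTable` and its method names are hypothetical, it is not the code in the patch, and because it runs in a single process it says nothing about how ZooKeeper provides atomicity across clients. It assumes the usual shared/exclusive rule (S is compatible only with S).

```python
class LockTable:
    """Toy in-memory stand-in for a lock manager (hypothetical, not Hive's code)."""

    def __init__(self):
        self._held = {}  # object name -> list of (client, mode) pairs

    def acquire(self, client, name, mode):
        """Try to take an 'S' or 'X' lock; return True on success."""
        holders = self._held.setdefault(name, [])
        # S is compatible only with S; X is compatible with nothing.
        if any(mode == "X" or held_mode == "X" for _, held_mode in holders):
            return False
        holders.append((client, mode))
        return True

    def release_own(self, client, name):
        """Release only the locks that `client` itself holds on `name`."""
        self._held[name] = [h for h in self._held.get(name, []) if h[0] != client]

    def unlock_all(self, name):
        """Explicit unlock with no ownership check: drops every lock on `name`."""
        self._held.pop(name, None)

    def holders(self, name):
        """Return a copy of the current (client, mode) pairs on `name`."""
        return list(self._held.get(name, []))
```

In this model `release_own` corresponds to a client cleaning up after itself, while `unlock_all` corresponds to the explicit unlock that, as noted, releases every lock on the table regardless of who took it.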
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884342#action_12884342 ]

Namit Jain commented on HIVE-1293:
----------------------------------

Confirmed point 1.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884343#action_12884343 ]

Prasad Chakka commented on HIVE-1293:
-------------------------------------

Is this the same as https://issues.apache.org/jira/browse/HIVE-829 ?
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884111#action_12884111 ]

Namit Jain commented on HIVE-1293:
----------------------------------

The unit tests are running right now (they should succeed). Submitted a patch for review.

Also, all the jar files (3 of them) from hbase-handler/lib should be moved to lib. That is not part of the patch, since those files are binary.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884112#action_12884112 ]

He Yongqiang commented on HIVE-1293:
------------------------------------

I will take a look.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884165#action_12884165 ]

He Yongqiang commented on HIVE-1293:
------------------------------------

A few questions so far:

1) Can the lock implementation guarantee atomicity? Since the locking logic runs on the client side, it seems possible for two concurrent clients to get conflicting locks.

2) About releasing locks: if a client does an unlock, will it also release locks taken by other clients? That is, if client A takes a lock, then client B takes another lock, and client B then does an unlock, will client A still hold its lock?

Still reading the code.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864541#action_12864541 ]

John Sichi commented on HIVE-1293:
----------------------------------

Right, you get this if, for partition p.q.r in table t, you add the following to the flat lock list:

t (S)
t.p (S)
t.p.q (S)
t.p.q.r (S or X, depending on what the operation is)

This doesn't add a lot of extra locks in general, since there are more children than parents. It makes the low-level recipe a little simpler, and maybe makes the show locks output clearer.

It might be exactly what you are already proposing, in which case we're in agreement.
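The flat lock list proposed in this comment can be built mechanically from the table name and the partition path. The following is a minimal sketch, assuming dotted names as written in the comment; the function name `lock_list` and its parameters are hypothetical and are not taken from the patch.

```python
def lock_list(table, partition_parts, leaf_mode):
    """Return (object, mode) pairs, parents first, for one partition.

    Every ancestor is locked in shared ('S') mode; only the last object
    takes the mode the operation actually needs ('S' or 'X').
    """
    if leaf_mode not in ("S", "X"):
        raise ValueError("leaf_mode must be 'S' or 'X'")
    names = [table]
    for part in partition_parts:
        # Build t, t.p, t.p.q, ... by extending the previous name.
        names.append(names[-1] + "." + part)
    return [(name, "S") for name in names[:-1]] + [(names[-1], leaf_mode)]


# Example: an exclusive operation on partition p.q.r of table t.
for name, mode in lock_list("t", ["p", "q", "r"], "X"):
    print(name, mode)
```

Because the hierarchy is expanded before the locks are requested, the lock manager itself only ever sees independent (object, mode) pairs and needs no knowledge of parent-child relationships.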
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864554#action_12864554 ]

Namit Jain commented on HIVE-1293:
----------------------------------

Agreed, this is cleaner than what I had.

I was checking the parents; you are suggesting locking the parents in 'S' mode, which achieves the desired effect but removes the need for hierarchy in the lock manager. It is even better given that we may have different lock manager implementations.

I will update the wiki.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864145#action_12864145 ]

Namit Jain commented on HIVE-1293:
----------------------------------

The initial writeup is at http://wiki.apache.org/hadoop/Hive/Locking. Please comment.
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863510#action_12863510 ]

Namit Jain commented on HIVE-1293:
----------------------------------

One option is to use ZooKeeper for locking. We don't need to worry about leases, since ZooKeeper supports ephemeral nodes: a node created by a client is removed when that client's session ends.

The ZooKeeper quorum can be specified via configuration parameters, and these need to be set for concurrency to be enabled.
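Why ephemeral nodes remove the need for leases can be seen in a toy model. This sketch does not use the ZooKeeper client API at all; `EphemeralStore`, its methods, and the path and session names are hypothetical and only imitate the one property that matters here, namely that a node disappears when the session that created it ends.

```python
class EphemeralStore:
    """In-memory imitation of session-bound (ephemeral) nodes. Hypothetical."""

    def __init__(self):
        self._owner = {}  # node path -> id of the session that created it

    def create_ephemeral(self, session, path):
        """Create `path` for `session`; fail if the node already exists."""
        if path in self._owner:
            return False
        self._owner[path] = session
        return True

    def expire_session(self, session):
        """Drop every node owned by `session`, as when a client crashes."""
        for path in [p for p, s in self._owner.items() if s == session]:
            del self._owner[path]

    def exists(self, path):
        return path in self._owner


store = EphemeralStore()
store.create_ephemeral("session-1", "/locks/t")   # client 1 takes the lock
store.expire_session("session-1")                 # client 1 crashes
print(store.exists("/locks/t"))                   # lock node is gone
```

With a lease-based design the lock holder would have to renew its lease periodically and some component would have to reap expired leases; here the cleanup is a side effect of session expiry.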
[jira] Commented: (HIVE-1293) Concurreny Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857110#action_12857110 ]

Ashish Thusoo commented on HIVE-1293:
-------------------------------------

I would vote for versioning. Since we do not have to deal with the complexity of a buffer cache, I think this would be much simpler to implement here than it is in traditional databases.

At the same time, for locks we will have to implement a lease-based mechanism anyway, to protect against locks leaking because of client crashes. Once you account for that, it seems that locking would not be significantly simpler to implement than versioning.
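The versioning option from the issue description (a writer creates a new version while the current one is being read, and the old version is deleted once its readers finish) amounts to reference counting per version. The sketch below is one possible reading of that description, not a design from this thread; `VersionedTable` and its methods are hypothetical, and real data directories are replaced by integer version numbers.

```python
class VersionedTable:
    """Reader-counted versions; old versions vanish when unread. Hypothetical."""

    def __init__(self):
        self.current = 0
        self._readers = {0: 0}  # version number -> active reader count

    def begin_read(self):
        """Pin the current version for a new reader and return its number."""
        self._readers[self.current] += 1
        return self.current

    def end_read(self, version):
        """Unpin `version`; delete it if it is old and has no readers left."""
        self._readers[version] -= 1
        if version != self.current and self._readers[version] == 0:
            del self._readers[version]

    def write(self):
        """Publish a new version without waiting for readers of the old one."""
        old = self.current
        self.current = old + 1
        self._readers[self.current] = 0
        if self._readers[old] == 0:
            del self._readers[old]  # nobody was reading it; drop it now
        return self.current

    def live_versions(self):
        return sorted(self._readers)
```

As the description notes, this is not equivalent to snapshots: a superseded version is reachable only by the readers that had already pinned it, and new readers always get the current version.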