[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202719#comment-13202719 ]

Kevin Wilfong commented on HIVE-2612:
-------------------------------------

I attached a patch to this JIRA which provides scripts to update the metastore for Derby, MySQL and PostgreSQL. It also changes the default cluster name to '' (empty string) and fixes an inconsistency where the PRIMARY_CLUSTER_NAME column in SDS had a different size than the CLUSTER_NAME column in CLUSTER_SDS.

support hive table/partitions coexistes in more than one clusters
-----------------------------------------------------------------

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch, HIVE-2612.2.patch.txt, HIVE-2612.D1569.1.patch, HIVE-2612.D1569.2.patch, HIVE-2612.D1569.3.patch, HIVE-2612.D1569.4.patch, HIVE-2612.D1569.5.patch

1) add cluster object into hive metastore
2) each partition/table has a creation cluster and a list of living clusters, and also data location in each cluster

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202726#comment-13202726 ]

Namit Jain commented on HIVE-2612:
----------------------------------

All the existing APIs will continue to work.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch, HIVE-2612.2.patch.txt, HIVE-2612.D1569.1.patch, HIVE-2612.D1569.2.patch, HIVE-2612.D1569.3.patch, HIVE-2612.D1569.4.patch, HIVE-2612.D1569.5.patch, HIVE-2612.D1569.6.patch
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202724#comment-13202724 ]

Namit Jain commented on HIVE-2612:
----------------------------------

Can everyone concerned please take a look?

Anyone not using clusters needs to run the scripts provided in this patch to upgrade the metastore. The time taken for the upgrade depends on the size of the metastore (number of tables/partitions), but it should be fairly small: it is less than 10 minutes for the Facebook cluster.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch, HIVE-2612.2.patch.txt, HIVE-2612.D1569.1.patch, HIVE-2612.D1569.2.patch, HIVE-2612.D1569.3.patch, HIVE-2612.D1569.4.patch, HIVE-2612.D1569.5.patch, HIVE-2612.D1569.6.patch
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202783#comment-13202783 ]

Kevin Wilfong commented on HIVE-2612:
-------------------------------------

This change will require users to update their metastore schema. The scripts in the patch should be sufficient, provided the schema is already up to date. The only schema changes needed are a new table and a new column added to SDS. The update should not take long: no more than five minutes, depending on the size of the SDS table.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch, HIVE-2612.2.patch.txt, HIVE-2612.D1569.1.patch, HIVE-2612.D1569.2.patch, HIVE-2612.D1569.3.patch, HIVE-2612.D1569.4.patch, HIVE-2612.D1569.5.patch, HIVE-2612.D1569.6.patch
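The two schema changes described in the comment above (one new table, one new column on SDS) can be sketched against an in-memory SQLite database. This is not the upgrade script from the patch. Only the names SDS, CLUSTER_SDS, PRIMARY_CLUSTER_NAME and CLUSTER_NAME and the '' default come from this thread; the SD_ID and LOCATION columns, the column types, the 128-character width and the composite key are assumptions made for illustration.

```python
import sqlite3


def upgrade(conn: sqlite3.Connection) -> None:
    """Apply the two schema changes to an existing (toy) metastore."""
    # New column on SDS. The '' default mirrors the default cluster name
    # mentioned in this thread; the type and width are assumed.
    conn.execute(
        "ALTER TABLE SDS "
        "ADD COLUMN PRIMARY_CLUSTER_NAME VARCHAR(128) NOT NULL DEFAULT ''"
    )
    # New table: one row per (storage descriptor, cluster) pair.
    # CLUSTER_NAME uses the same width as PRIMARY_CLUSTER_NAME so the two
    # columns stay consistent. The layout is hypothetical.
    conn.execute(
        """
        CREATE TABLE CLUSTER_SDS (
            SD_ID        INTEGER      NOT NULL REFERENCES SDS (SD_ID),
            CLUSTER_NAME VARCHAR(128) NOT NULL,
            LOCATION     VARCHAR(4000),
            PRIMARY KEY (SD_ID, CLUSTER_NAME)
        )
        """
    )


# A toy pre-upgrade metastore with a single storage descriptor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, LOCATION VARCHAR(4000))")
conn.execute("INSERT INTO SDS (SD_ID, LOCATION) VALUES (1, 'hdfs://example/warehouse/t1')")
upgrade(conn)
```

In this sketch, existing SDS rows pick up the empty-string cluster name automatically, which is one way an installation that does not use clusters could keep working after the upgrade.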
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202935#comment-13202935 ]

Kevin Wilfong commented on HIVE-2612:
-------------------------------------

I attached a patch which fixes an error where JDO was looking for a column that does not exist in the schema created by the update scripts provided. The collection of MClusterStorageDescriptors was changed from a List to a Set, and a primary key was indicated in package.jdo. This fixes the error by removing the need to order the MClusterStorageDescriptors and by providing a way to uniquely identify them. The primary key is already present in the upgrade scripts provided.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch, HIVE-2612.2.patch.txt, HIVE-2612.3.patch.txt, HIVE-2612.D1569.1.patch, HIVE-2612.D1569.2.patch, HIVE-2612.D1569.3.patch, HIVE-2612.D1569.4.patch, HIVE-2612.D1569.5.patch, HIVE-2612.D1569.6.patch, HIVE-2612.D1569.7.patch
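The effect of moving from an ordered list to a set whose members are identified by a key can be illustrated with a small sketch. It is not the JDO mapping from the patch: the class ClusterStorageDescriptor and its fields are hypothetical stand-ins for MClusterStorageDescriptor, and the choice of (sd_id, cluster_name) as the key is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClusterStorageDescriptor:
    """Hypothetical stand-in for MClusterStorageDescriptor."""

    # Assumed key: the pair (sd_id, cluster_name) identifies a descriptor.
    sd_id: int
    cluster_name: str
    # The location is data, not identity, so it is excluded from
    # equality and hashing.
    location: str = field(compare=False, default="")


# A set needs no position column, only a way to tell members apart.
descriptors = set()
descriptors.add(ClusterStorageDescriptor(1, "c1", "hdfs://c1/warehouse/t1"))
descriptors.add(ClusterStorageDescriptor(1, "c2", "hdfs://c2/warehouse/t1"))
# Same key as the first entry, so the set does not grow.
descriptors.add(ClusterStorageDescriptor(1, "c1", "hdfs://c1/other/path"))
```

An ordered collection typically needs somewhere to persist each element's position, which would be consistent with the missing-column error described above; a keyed set avoids that requirement.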
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200101#comment-13200101 ]

Namit Jain commented on HIVE-2612:
----------------------------------

https://cwiki.apache.org/confluence/display/Hive/Hive+across+Multiple+Data+Centers+%28Physical+Clusters%29

is the correct link to the wiki.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch, HIVE-2612.D1569.1.patch
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198175#comment-13198175 ]

Namit Jain commented on HIVE-2612:
----------------------------------

https://cwiki.apache.org/confluence/display/Hive/Hive+across+Multiple+Data+Centers+(Physical+Clusters)

Added a new document which explains some of the thinking and the design. Please comment.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
Attachments: HIVE-2612.1.patch
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196422#comment-13196422 ]

Namit Jain commented on HIVE-2612:
----------------------------------

bq. A table T1's primary cluster is C1 meaning: 1) C1 contains all data that is available in all other clusters.

Does this mean that if T1's primary cluster is C1, then all of the partitions in T1 must also have their primary cluster set to C1? If that's the case then primary cluster should probably be a table level property, and the list of replica clusters can be a table/partition level property.

I agree.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196447#comment-13196447 ]

Namit Jain commented on HIVE-2612:
----------------------------------

bq. write is only allowed in this cluster for table T1. but need to allow exceptions here. What are the exceptions?

Currently, there should be no exceptions. Eventually, if we provide something in Hive to do a cross-cluster write, that would count as an exception. There may be a Hive command like: Replicate T@P from cluster1 to cluster2.

bq. all data changes to T1 happened in the primary cluster should be replicated to other clusters if there are any secondary clusters. but there should be a conf to disable it as there are some exception situations.

This question should not be relevant now.

A much simpler way to visualize this is: for every table, there is a primary cluster and a list of secondary clusters. All the partitions belong to the primary cluster, and may belong to one or more secondary clusters. Every Hive session has a current cluster, and reads happen from the current cluster. An error is thrown if the partition is missing from the current cluster but is present in the primary cluster.

I will write a new wiki and attach it; it might be simpler to understand that way.

Dynamic partitions should not require anything different.

bq. overwrite database name for the purpose of cluster name. And allow a table co-exist in multiple databases. But that require to promote table to top level citizen, and degrade database. For example, show tables used to scan all tables in current db, but now need to scan all tables in all databases. I don't think this is an option since it breaks backwards compatibility and effectively changes the whole notion of what a db/schema is. A lot of people in the community already depend on this feature.

Agreed.

bq. add a cluster parameter to existing thrift interfaces. This sounds like the best option to me. I think Thrift supports API evolution via default values for missing parameters, but setting a default value in this case may be a little tricky.

Agreed.

bq. Also, instead of modifying the Thrift interface, is it possible that you could instead leverage the work that's being done in HIVE-2720?

Will look into it.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: Namit Jain
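The read rule described in the comment above (each session reads from its current cluster, and a partition missing from that cluster is an error) can be sketched as follows. This is a minimal illustration, not Hive code: the classes Table and Partition, the function resolve_read_location, the exception name and the example paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class PartitionNotInClusterError(Exception):
    """Raised when a partition has no copy in the session's current cluster."""


@dataclass
class Partition:
    name: str
    # Cluster name -> data location for that cluster's copy.
    locations: Dict[str, str]


@dataclass
class Table:
    name: str
    primary_cluster: str
    secondary_clusters: List[str]
    partitions: Dict[str, Partition] = field(default_factory=dict)


def resolve_read_location(table: Table, partition_name: str, current_cluster: str) -> str:
    """Return where to read the partition from, always in the current cluster."""
    partition = table.partitions[partition_name]
    if current_cluster not in partition.locations:
        # The data exists in the primary cluster, but reads are not
        # redirected there; the session gets an error instead.
        raise PartitionNotInClusterError(
            f"{table.name}/{partition_name} is not available in cluster "
            f"{current_cluster!r}; primary cluster is {table.primary_cluster!r}"
        )
    return partition.locations[current_cluster]


# Example: partition ds=1 lives only in the primary cluster c1.
table = Table(
    name="t1",
    primary_cluster="c1",
    secondary_clusters=["c2"],
    partitions={"ds=1": Partition("ds=1", {"c1": "hdfs://c1/warehouse/t1/ds=1"})},
)
```

A session whose current cluster is c1 resolves the partition normally, while a session on c2 gets PartitionNotInClusterError until the partition is replicated there.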
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189526#comment-13189526 ]

Carl Steinbach commented on HIVE-2612:
--------------------------------------

bq. A table T1's primary cluster is C1 meaning: 1) C1 contains all data that is available in all other clusters.

Does this mean that if T1's primary cluster is C1, then all of the partitions in T1 must also have their primary cluster set to C1? If that's the case then primary cluster should probably be a table level property, and the list of replica clusters can be a table/partition level property.

bq. 2) write is only allowed in this cluster for table T1. but need to allow exceptions here

What are the exceptions?

bq. 4) all data changes to T1 happened in the primary cluster should be replicated to other clusters if there are any secondary clusters. but there should be a conf to disable it as there are some exception situations.

What are the exceptions?

How will dynamic partitions work? Where will new partitions get the list of replica clusters from? Will they inherit it from the table definition?

Hive now supports insert-append into a partition (HIVE-306). Suppose that the metadata for a particular partition indicates that it is replicated to clusters C2 and C3. If I insert new data into the partition in the primary cluster C1, then the metadata is now invalid. How is this going to be handled?

bq. 2) add new interfaces which do exactly the same set of functionalities as old ones but using a different name (use _on_cluster suffix maybe?) and have a cluster parameter

This is going to introduce new codepaths that need to be tested separately, and also double the amount of work people need to do every time a new metastore API call is created. I don't think this is a good approach.

bq. 3) overwrite database name for the purpose of cluster name. And allow a table co-exist in multiple databases. But that require to promote table to top level citizen, and degrade database. For example, show tables used to scan all tables in current db, but now need to scan all tables in all databases.

I don't think this is an option since it breaks backwards compatibility and effectively changes the whole notion of what a db/schema is. A lot of people in the community already depend on this feature.

bq. 1) add a cluster parameter to existing thrift interfaces

This sounds like the best option to me. I think Thrift supports API evolution via default values for missing parameters, but setting a default value in this case may be a little tricky.

Also, instead of modifying the Thrift interface, is it possible that you could instead leverage the work that's being done in HIVE-2720?

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: He Yongqiang
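The backward-compatibility idea behind option 1 (an extra parameter that old callers simply omit) can be sketched in plain Python. This is neither Thrift IDL nor the metastore API: get_table_location, CATALOG and the paths are hypothetical, and using the empty string as the default is borrowed from the default cluster name mentioned elsewhere in this thread.

```python
# Assumed default: the empty string stands for "no cluster specified".
DEFAULT_CLUSTER = ""

# Hypothetical catalog keyed by (database, table, cluster).
CATALOG = {
    ("default", "t1", ""): "hdfs://legacy/warehouse/t1",
    ("default", "t1", "c2"): "hdfs://c2/warehouse/t1",
}


def get_table_location(db: str, table: str, cluster: str = DEFAULT_CLUSTER) -> str:
    """Look up a table's location; callers unaware of clusters omit `cluster`."""
    return CATALOG[(db, table, cluster)]


# An old-style call keeps working without changes.
legacy_location = get_table_location("default", "t1")
# A cluster-aware call passes the extra argument.
c2_location = get_table_location("default", "t1", cluster="c2")
```

What the default should resolve to is the part described above as tricky; the sketch sidesteps it by treating the empty string as its own cluster entry.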
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188589#comment-13188589 ]

He Yongqiang commented on HIVE-2612:
------------------------------------

Not just authorization. It requires a complete redesign of the existing database feature, and the only benefit of doing that is backward interface compatibility.

@Carl, this should not be just a metastore change. We should also provide a utility to replicate data. But we can do that in a follow-up and focus on the metastore changes in this JIRA.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: He Yongqiang
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188156#comment-13188156 ]

Carl Steinbach commented on HIVE-2612:
--------------------------------------

This is a big feature to implement. I really think it needs a design document before we start committing patches. Yongqiang, do you think you can write something up?

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Reporter: He Yongqiang
Assignee: He Yongqiang
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188173#comment-13188173 ]

He Yongqiang commented on HIVE-2612:
------------------------------------

Yeah, sure. Internally we have actually had some discussions around this. I can write something about it. Right now, our main concern is that the cluster feature may not be very useful for most people, so we want to know what other people think about the potential incompatibilities that it could introduce.

I actually sent out a discussion to dev@; I am copying it here:

We are planning to make Hive run across multiple data centers (physical clusters). We prefer to use the Hive metastore to provide a unified namespace. Tables/partitions can exist in more than one cluster, and one cluster is defined as the primary cluster. The primary cluster is a table level property.

A table T1 having C1 as its primary cluster means:
1) C1 contains all data that is available in all other clusters.
2) Writes are only allowed in this cluster for table T1, but we need to allow exceptions here.
3) New partitions are only allowed to be created in C1.
4) All data changes to T1 that happen in the primary cluster should be replicated to the other clusters if there are any secondary clusters. But there should be a conf to disable this, as there are some exception situations.

The first thing that needs to be done is to give the Hive metastore a concept of cluster. That also means all Thrift calls to the metastore need to provide a cluster parameter. So we have three options here:
1) add a cluster parameter to the existing Thrift interfaces, or
2) add new interfaces which provide exactly the same functionality as the old ones but use a different name (an _on_cluster suffix, maybe?) and have a cluster parameter, or
3) overload the database name to serve as the cluster name, and allow a table to co-exist in multiple databases. But that requires promoting tables to top-level citizens and demoting databases. For example, show tables used to scan all tables in the current db, but would now need to scan all tables in all databases.

We would like to get more ideas about which one to choose, and we are definitely open to other alternatives that we missed here. We are also looking for other systems that have solved similar problems. If anyone knows of such a system, we would like to hear about it. Appreciate that!

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Reporter: He Yongqiang
Assignee: He Yongqiang
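Rules 2) to 4) of the primary-cluster proposal above can be sketched as a small write-side check. This is only an illustration under stated assumptions: TableClusters, WriteNotAllowedError, add_partition and the replicate flag (standing in for the proposed conf) are hypothetical names, and no exceptions to the write rule are modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


class WriteNotAllowedError(Exception):
    """Raised when a write is attempted outside the table's primary cluster."""


@dataclass
class TableClusters:
    primary: str
    secondaries: List[str]


def check_write(table: TableClusters, current_cluster: str) -> None:
    # Rule 2: writes are only allowed in the primary cluster.
    if current_cluster != table.primary:
        raise WriteNotAllowedError(
            f"writes must go through primary cluster {table.primary!r}, "
            f"not {current_cluster!r}"
        )


def add_partition(
    table: TableClusters,
    partitions: Dict[str, List[str]],
    name: str,
    current_cluster: str,
    replicate: bool = True,
) -> List[str]:
    """Create a partition and return the clusters that should hold it."""
    # Rule 3: new partitions can only be created in the primary cluster.
    check_write(table, current_cluster)
    # Rule 4: changes go to the secondary clusters unless replication is
    # disabled (the hypothetical `replicate` flag plays the role of the conf).
    targets = [table.primary] + (list(table.secondaries) if replicate else [])
    partitions[name] = targets
    return targets


clusters = TableClusters(primary="c1", secondaries=["c2", "c3"])
partitions: Dict[str, List[str]] = {}
replicated = add_partition(clusters, partitions, "ds=1", current_cluster="c1")
local_only = add_partition(clusters, partitions, "ds=2", current_cluster="c1", replicate=False)
```

The function only records which clusters should receive the change; copying the data itself would be the job of a separate replication utility.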
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188183#comment-13188183 ]

Steven Wong commented on HIVE-2612:
-----------------------------------

Yongqiang, please clarify what you mean by option 3. Currently, databases contain tables (db1.foo is unrelated to db2.foo). Is option 3 saying make tables span databases (db1.foo is the same table as db2.foo) instead? That would be a radical change, so maybe I've misunderstood it.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Reporter: He Yongqiang
Assignee: He Yongqiang
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188220#comment-13188220 ]

He Yongqiang commented on HIVE-2612:
------------------------------------

Yes, your understanding is correct: db1.foo is the same table as db2.foo. It is listed as one potential option.

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: He Yongqiang
[jira] [Commented] (HIVE-2612) support hive table/partitions coexistes in more than one clusters
[ https://issues.apache.org/jira/browse/HIVE-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188322#comment-13188322 ]

Steven Wong commented on HIVE-2612:
-----------------------------------

Won't that conflict with authorization?

Key: HIVE-2612
URL: https://issues.apache.org/jira/browse/HIVE-2612
Project: Hive
Issue Type: New Feature
Components: Metastore
Reporter: He Yongqiang
Assignee: He Yongqiang