[jira] [Created] (HIVE-21314) Hive Replication not retaining the owner in the replicated table
mahesh kumar behera created HIVE-21314: -- Summary: Hive Replication not retaining the owner in the replicated table Key: HIVE-21314 URL: https://issues.apache.org/jira/browse/HIVE-21314 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Hive Replication not retaining the owner in the replicated table. The owner for the target table is set same as the user executing the load command. The user information should be read from the dump metadata and should be used while creating the table at target cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622991 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java ## @@ -112,6 +118,12 @@ public void run() { continue; } + if (replIsCompactionDisabledForTable(t)) { Review comment: not done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622969 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) { if (work.getAlterMaterializedViewDesc() != null) { return alterMaterializedView(db, work.getAlterMaterializedViewDesc()); } + + if (work.getReplSetFirstIncLoadFlagDesc() != null) { Review comment: its done in a separate task to make it simpler This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622407 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String tableName, String loadP lastEventid); } } + + ReplSetFirstIncLoadFlagDesc desc = new ReplSetFirstIncLoadFlagDesc(dbName, tableName); Review comment: new ddl task is created to make it simpler with table level and warehouse level support This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622413 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException { // Add the checkpoint key to the Database binding it to current dump directory. // So, if retry using same dump, we shall skip Database object update. parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory); + +if (needSetIncFlag) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622357 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -135,7 +135,8 @@ private boolean isDbEmpty(String dbName) throws HiveException { } private Task alterDbTask(Database dbObj) { -return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory), context.hiveConf); +return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory, false), Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622274 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622236 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622204 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: table level is taken care of now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622210 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", driver); +compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'false')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: table level is taken care of now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622190 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r25960 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java ## @@ -71,6 +74,16 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { +try { + Database database = rs.getDatabase(getDefaultCatalog(conf), dbName); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622183 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java ## @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext driverContext) { // If incremental events are already applied, then check and perform if need to bootstrap any tables. if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) { +// No need to set incremental load pending flag for external tables as the files will be copied to the same path Review comment: todo not required as table level load is taken care now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259621882 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Review comment: the code is removed This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259621844 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Task preCursor) { HashMap mapProp = new HashMap<>(); -mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); -AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState)); +// if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check. +ReplicationSpec replicationSpec = null; +if (incLoadPendFlag == null) { + mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); + replicationSpec = new ReplicationSpec(replState, replState); +} else { + assert replState == null; + mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag); Review comment: done. Dump will fail if the inc pending flag is set to true This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259618838 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259618842 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259617900 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -61,6 +62,21 @@ public ReplCopyTask(){ super(); } + // If file is already present in base directory, then remove it from the list. + // Check HIVE-21197 for more detail + private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles, + long writeId, int stmtId) throws IOException { +ListIterator iter = srcFiles.listIterator(); +Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId)); +while (iter.hasNext()) { + Path filePath = new Path(basePath, iter.next().getSourcePath().getName()); + if (dstFs.exists(filePath)) { Review comment: the i/o exception retry case is handled specifically at 2 places only. there are many other i/o failure scenarios which are not handled. I think its not required here. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services