[GitHub] ZhangXinJason opened a new pull request #548: HIVE-21313: Use faster method to prevent copying bytes twice
ZhangXinJason opened a new pull request #548: HIVE-21313: Use faster method to prevent copying bytes twice URL: https://github.com/apache/hive/pull/548 In file ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorAssignRow.java we may find code like this: ``` Text text = (Text) convertTargetWritable; if (text == null) { text = new Text(); } text.set(string); ((BytesColumnVector) columnVector).setVal( batchIndex, text.getBytes(), 0, text.getLength()); ``` The `setVal` method copies the byte array returned by `text.getBytes()`, which is unnecessary here: since that byte array is not modified afterwards, we can simply use the `setRef` method to point at the existing byte array, which also lowers memory usage. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
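The difference between the two calls can be sketched with a minimal stand-in for Hive's `BytesColumnVector` (the `MiniBytesColumnVector` class below is a hypothetical simplification for illustration, not Hive's actual implementation):

```java
import java.util.Arrays;

public class SetRefVsSetVal {
    // Hypothetical, simplified stand-in for Hive's BytesColumnVector:
    // one byte[] per row, plus a start offset and a length.
    static class MiniBytesColumnVector {
        byte[][] vector = new byte[1024][];
        int[] start = new int[1024];
        int[] length = new int[1024];

        // setVal copies the incoming bytes into a freshly allocated array.
        void setVal(int row, byte[] src, int off, int len) {
            vector[row] = Arrays.copyOfRange(src, off, off + len);
            start[row] = 0;
            length[row] = len;
        }

        // setRef only records a reference to the caller's array.
        void setRef(int row, byte[] src, int off, int len) {
            vector[row] = src;
            start[row] = off;
            length[row] = len;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes();
        MiniBytesColumnVector cv = new MiniBytesColumnVector();

        cv.setVal(0, bytes, 0, bytes.length);
        cv.setRef(1, bytes, 0, bytes.length);

        // setVal allocated a copy; setRef shares the caller's array.
        if (cv.vector[0] == bytes) throw new AssertionError("setVal should copy");
        if (cv.vector[1] != bytes) throw new AssertionError("setRef should share");
        System.out.println("setVal copies, setRef shares");
    }
}
```

One caveat worth noting: sharing a reference is only safe if the source buffer is not overwritten before the batch is consumed, which presumably is what the review of this PR would need to verify, since `Text.set` reuses its internal buffer.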
[jira] [Created] (HIVE-21313) Use faster function to prevent copying bytes twice
ZhangXin created HIVE-21313: --- Summary: Use faster function to prevent copying bytes twice Key: HIVE-21313 URL: https://issues.apache.org/jira/browse/HIVE-21313 Project: Hive Issue Type: Improvement Reporter: ZhangXin -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606409 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); +return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag); Review comment: I think it is better to keep this check, as null and empty have special meaning in this case.
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606337 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationEx { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { Review comment: no
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606331 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606013 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: As per discussion, we need to skip duplicate check for replace case, as we may end up with skipping copy of files with same name but different content. Also, we may need to move those matching files to new base path which is complicated. So, we just skip duplicate check for replace flow.
[jira] [Created] (HIVE-21312) FSStatsAggregator::connect is slow
Rajesh Balamohan created HIVE-21312: --- Summary: FSStatsAggregator::connect is slow Key: HIVE-21312 URL: https://issues.apache.org/jira/browse/HIVE-21312 Project: Hive Issue Type: Improvement Components: Statistics Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587931 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." 
+ tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", driver); +compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'false')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: Alter table doesn't make sense now as we removed it from use.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587889 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationEx { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { Review comment: Can we have a common base class just like BaseReplicationAcrossInstances instead of duplicating in both classes?
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588731 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); +return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag); Review comment: String.equalsIgnoreCase handles null and empty input arguments, but it returns false in both cases. We can remove these checks, but we need to be careful.
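The reviewers' point can be checked directly: `String.equalsIgnoreCase` returns `false` for both `null` and the empty string, so dropping the explicit null/empty checks would silently invert the meaning of an absent flag. A minimal sketch (the method is adapted from the diff above, taking a single string instead of the parameter map):

```java
public class FlagCheckDemo {
    // Sketch of the check under review: a missing or empty flag
    // is meant to count as "first incremental load is done".
    static boolean isFirstIncDone(String flag) {
        return flag == null || flag.isEmpty() || "false".equalsIgnoreCase(flag);
    }

    public static void main(String[] args) {
        // equalsIgnoreCase alone returns false for null and "" ...
        System.out.println("false".equalsIgnoreCase(null));  // false
        System.out.println("false".equalsIgnoreCase(""));    // false
        // ... so without the explicit checks, an absent flag would
        // wrongly be treated as "pending" rather than "done":
        System.out.println(isFirstIncDone(null));   // true
        System.out.println(isFirstIncDone(""));     // true
        System.out.println(isFirstIncDone("TRUE")); // false
    }
}
```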
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588983 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Review comment: Shall change incLoadPendFlag to firstIncLoadPendingFlag.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588369 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: Consider the insert-overwrite case (writeId=2) where, say, a few of the listed files are already present in base_1. This deleteDestIfExist marks to create base_2 instead of delta_2_2, as it is IOW, so only partial data would be copied to base_2. After the commit of writeId=2, all readers would see partial data from base_2. So we need to handle this case.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587946 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: Alter table doesn't make sense now as we removed it from use.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587771 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java ## @@ -71,6 +74,16 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { +try { + Database database = rs.getDatabase(getDefaultCatalog(conf), dbName); Review comment: Shall add a comment on why compaction is disabled if first incremental is not done.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587562 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259589411 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { Review comment: All the callers negate this result while using it. Why don't we change it to isFirstIncPending? That would also make it uniform with the stored DB parameter name.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587524 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259589143 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Task preCursor) { HashMap mapProp = new HashMap<>(); -mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); -AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState)); +// if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check. +ReplicationSpec replicationSpec = null; +if (incLoadPendFlag == null) { + mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); + replicationSpec = new ReplicationSpec(replState, replState); +} else { + assert replState == null; + mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag); Review comment: There is one corner case for A->B->C where EVENT_ALTER_DATABASE should skip this additional parameter while dumping. Let's say bootstrap and the first incremental are done in B, so this flag is false. Now we trigger a bootstrap dump from B->C and, concurrently, there is an alterDb operation which logs an event with this flag as false. When we bootstrap load in C, we set it there as true. During the first incremental load in C, when we process the AlterDb event, we set it to false, so that we open up compaction after this event, even though later events may not guarantee the duplicate check.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259586952 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -135,7 +135,8 @@ private boolean isDbEmpty(String dbName) throws HiveException { } private Task alterDbTask(Database dbObj) { -return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory), context.hiveConf); +return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory, false), Review comment: We need to set it in the alter case as well, e.g. when bootstrapping onto an empty DB. Maybe we can remove the flag and just set it to true always.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588001 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -61,6 +62,21 @@ public ReplCopyTask(){ super(); } + // If file is already present in base directory, then remove it from the list. + // Check HIVE-21197 for more detail + private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles, + long writeId, int stmtId) throws IOException { +ListIterator iter = srcFiles.listIterator(); +Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId)); +while (iter.hasNext()) { + Path filePath = new Path(basePath, iter.next().getSourcePath().getName()); + if (dstFs.exists(filePath)) { Review comment: Do we need to have a retry mechanism if it throws IOException due to a network glitch?
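A bounded-retry wrapper along the lines the reviewer suggests might look like this (a hypothetical sketch only, not Hive's actual retry utility; the `IoCall` interface, attempt count, and sleep interval are all made up for illustration):

```java
import java.io.IOException;

public class RetryDemo {
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    // Retry an IO operation a few times before giving up, to ride out
    // transient network glitches. Parameters are illustrative only.
    static <T> T withRetry(IoCall<T> op, int maxAttempts, long sleepMs) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted during retry", ie);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // A fake flaky existence check that fails twice, then succeeds.
        int[] calls = {0};
        boolean exists = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("transient glitch");
            return true;
        }, 5, 1L);
        System.out.println(exists + " after " + calls[0] + " attempts");
    }
}
```

In the code under review, such a wrapper would presumably enclose the `dstFs.exists(filePath)` call inside the loop.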
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588476 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java ## @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext driverContext) { // If incremental events are already applied, then check and perform if need to bootstrap any tables. if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) { +// No need to set incremental load pending flag for external tables as the files will be copied to the same path Review comment: Add a TODO here saying this flag needs to be set for non-txn to txn migrated tables if they are bootstrapped along with incremental load (table-level replication). With the TODO, the current comment makes sense as to why it is not needed for external tables.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259586959 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException { // Add the checkpoint key to the Database binding it to current dump directory. // So, if retry using same dump, we shall skip Database object update. parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory); + +if (needSetIncFlag) { Review comment: Need to set it in the alter case too (the empty-DB bootstrap case).
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259589263 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); Review comment: compFlag doesn't look relevant. Can change it.
[jira] [Created] (HIVE-21311) upgrade netty 4.1.33
t oo created HIVE-21311: --- Summary: upgrade netty 4.1.33 Key: HIVE-21311 URL: https://issues.apache.org/jira/browse/HIVE-21311 Project: Hive Issue Type: Improvement Affects Versions: 3.1.1 Reporter: t oo upgrade netty to 4.1.33 -- This message was sent by Atlassian JIRA (v7.6.3#76005)