Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8312 )
Change subject: KUDU-2191 (6/n): Hive Metastore catalog manager integration ...................................................................... Patch Set 9: (55 comments) http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/hms_client.cc File src/kudu/hms/hms_client.cc: http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/hms_client.cc@193 PS8, Line 193: // Also check that the HMS is configured to allow changing the type of : // columns, which is required to support dropping columns. Could you elaborate on this a bit? It's not intuitive why it needs to be legal to change a column's type in order to drop it. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/hms_client.cc@202 PS8, Line 202: if (boost::iequals(disallow_incompatible_column_type_changes, "true")) { Hmm, is it really possible for the config's value to have a mixed case? Asking because L185 doesn't do a case-insensitive comparison when verifying that the required listeners are in hive.metastore.transactional.event.listeners. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/hms_client.cc@218 PS8, Line 218: bool HmsClient::IsConnected() { Maybe add a bit of test coverage for this new method? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/mini_hms.cc File src/kudu/hms/mini_hms.cc: http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/mini_hms.cc@186 PS8, Line 186: // Configures the HMS to allow altering and dropping columns. Nit: indentation is off by one character. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/hms/mini_hms.cc@214 PS8, Line 214: <value>jdbc:derby:$1/metadb;create=true</value> Why the change from 'memory' back to 'directory'? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/CMakeLists.txt File src/kudu/integration-tests/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/CMakeLists.txt@87 PS8, Line 87: ADD_KUDU_TEST(master_hms-itest) Could you run this new test with KUDU_MEASURE_TEST_CPU_CONSUMPTION=1 (see run-test.sh) to figure out whether to set RUN_SERIAL or to enumerate the number of PROCESSORS? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/integration-tests/master-stress-test.cc File src/kudu/integration-tests/master-stress-test.cc: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/integration-tests/master-stress-test.cc@343 PS9, Line 343: return Substitute("default.table_$0", oid_generator_.Next()); Not necessary yet? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc File src/kudu/integration-tests/master_hms-itest.cc: http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@122 PS8, Line 122: Could RETURN_NOT_OK instead. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@162 PS8, Line 162: string table_name = Substitute("$0.$1", hms_database_name, hms_table_name); : : // Attempt to create the table before the database is created. : Status s = CreateTable(hms_database_name, hms_table Could be its own test (i.e. TestCreateTableWithoutDatabase) http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@190 PS8, Line 190: : // Shutdown the HMS and try to create a table. : ASSERT_OK(StopHms()); : : s = CreateTable(hms_database_name, "foo"); : ASSERT_TRUE(s.IsNetworkError()) << s.ToString(); : : // Start the HMS and try again. : ASSERT_OK(StartHms()); : ASSERT_OK(CreateTable(hms_database_name, "foo")); Could break out into a separate test since it only shares HMS database state with the previous test. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@202 PS8, Line 202: Maybe add a test that if the HMS is down renaming fails? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@207 PS8, Line 207: : // Create the database. : hive::Database db; : db.name = hms_database_name; : ASSERT_OK(hms_client_->CreateDatabase(db)); Every test does this; consider doing it in the test fixture. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@243 PS8, Line 243: // TODO(HIVE-18852): match on the error message. : : // Attempt to rename the Kudu table to an invalid table name. : table_alterer.reset(client_->NewTableAlterer(table_name)); : s = table_alterer->RenameTo(Substitute("$0.☃", hms_database_name))->Alter(); : ASSERT_TRUE(s.IsIllegalState()) << s.ToString(); Would be a good negative test for creation too, right? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@255 PS8, Line 255: hive::EnvironmentContext env_ctx; Why is it necessary to provide the table's ID here, on L270, and on L371, but not when dropping on L331? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@268 PS8, Line 268: : // Drop the HMS table entry, then c I was confused when I first read this, but now I think I understand: it's legal to post facto create an HMS external table for an existing Kudu table, and doing so causes the two to become connected such that future operations will affect both. Is that right? May want to elaborate the comment a bit to make that behavior clear. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@285 PS8, Line 285: hms_external_table_name, Why does this one exist? It was an external table created on L274, but didn't it get renamed to hms_rename_table_name on L276? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@309 PS8, Line 309: hms_table.sd.cols.clear(); After this step could you use CheckCatalogs (or a variant thereof) to assert that the column entries do NOT match across HMS and Kudu? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@339 PS8, Line 339: s = table_alterer->DropColumn("int32_val")->Alter(); Why does this fail? The non-Kudu HMS table entry has the same database and name as the Kudu table, so shouldn't they be "connected" to one another? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@354 PS8, Line 354: Maybe ensure it's gone from Kudu too, just to make sure that the HMS integration doesn't do something silly like drop the table from HMS but not Kudu? http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@374 PS8, Line 374: ASSERT_TRUE(s.IsNotFound()) << s.ToString(); Likewise, make sure it was really dropped from Kudu. http://gerrit.cloudera.org:8080/#/c/8312/8/src/kudu/integration-tests/master_hms-itest.cc@384 PS8, Line 384: NO_FATALS(CheckCatalogs(hms_database_name, hms_table_name)); Check that it's gone from both Kudu and HMS? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/CMakeLists.txt File src/kudu/master/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/CMakeLists.txt@76 PS9, Line 76: ADD_KUDU_TEST(hms_catalog-test) Likewise, please assess the perf impact of this test and annotate it accordingly. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc@697 PS9, Line 697: CHECK_OK(HostPort::ParseStrings(FLAGS_hive_metastore_addresses, Since you're inside Init, may as well convert these CHECK_OKs to RETURN_NOT_OK. The failure will be fatal either way; it'll just be more graceful. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc@1548 PS9, Line 1548: // Create the table in the HMS. Can you expand this comment a bit to explain why this should happen before step e and not after? This is a meaty enough operation that I'd also give it its own step letter and reletter the others. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc@1743 PS9, Line 1743: // Drop the HMS table entry. As in Create, could you expand this to explain why this should happen here, and not after step 3? Also, please give it its own number and renumber the others. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc@1755 PS9, Line 1755: auto abort_hms = MakeScopedCleanup([&] { What does it mean for a ScopedCleanup to return something? It's run on scope exit, so where is it returning that value to exactly? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc@2254 PS9, Line 2254: if (hms_catalog_ != nullptr && has_metadata_changes) { Add a comment explaining why we do this here and not after the catalog write, give it a number, renumber the rest, etc. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/catalog_manager.cc@2272 PS9, Line 2272: RETURN_NOT_OK_LOG(SchemaFromPB(l.data().pb.schema(), &schema), WARNING, See above comment re: returning a value from a ScopedCleanup. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog-test.cc File src/kudu/master/hms_catalog-test.cc: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog-test.cc@38 PS9, Line 38: Even though master_hms-test provides integration testing between CatalogManager and HmsCatalog, I think it'd still be worthwhile to have standalone HmsCatalog tests here to verify basic functionality. It'd certainly make it easier to test future changes to HmsCatalog. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.h File src/kudu/master/hms_catalog.h: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.h@61 PS9, Line 61: // Create a new table entry in the HMS. Doc that this method fails if HMS is unreachable? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.h@66 PS9, Line 66: // Drops a table entry from the HMS, if it exists. Nit: it would be good to use the same tense when describing the functionality of each method. Here you're using present tense "[DropTable()] drops a table entry from the HMS..." but elsewhere you use future tense "[Start()] will start the HmsCatalog instance." I happen to prefer present tense (I think that's the Javadoc standard); could you apply it consistently in your method documentation? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.h@91 PS9, Line 91: Status Reconnect(); Doc this? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.h@103 PS9, Line 103: static Status ParseTableName(const std::string& table, And this? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.h@110 PS9, Line 110: gscoped_ptr<ThreadPool> threadpool_; unique_ptr http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc File src/kudu/master/hms_catalog.cc: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@52 PS9, Line 52: const int HmsCatalog::kNumRetries = 1; Do you think it would it be useful to convert this into an advanced gflag? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@76 PS9, Line 76: .set_min_threads(0) This is the default; it can be omitted. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@77 PS9, Line 77: .set_max_threads(1) Doc why capping at one thread is important. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@95 PS9, Line 95: return Execute([=] (hms::HmsClient* client) { Execute() waits for the provided functor to run, so why not capture variables by reference in all of these calls? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@162 PS9, Line 162: if (!s.ok()) { I'm going to claim laziness and ask whether all of these edge cases are covered by hms_catalog-test rather than checking myself. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@180 PS9, Line 180: // Handle all other errors. Nit: not really "handling" per se; it's just that all other errors are fatal. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@201 PS9, Line 201: If : // the original table object and the new table object match exactly then : // we don't need to alter the table in the HMS. Does this short circuit make us vulnerable to a TOCTOU race, where we return OK to the client because we think orig_table and table are identical even though someone else may have altered the Hive table in the meantime? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@222 PS9, Line 222: // TODO(Todd): wrapping this in a TRACE_EVENT scope and a LOG_IF_SLOW and such Nit: $ git grep TODO\([a-z] | wc -l 278 $ git grep TODO\([A-Z] | grep -v TODO\(KUDU- | wc -l 5 http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@238 PS9, Line 238: // Since every task submitted to the (single thread) pool runs this, it's : // essentially a single iteration of a run loop which handles HMS client : // reconnection and task processing. Do you think the pool should have a min_thread count of 1 so that we don't pay the thread startup cost for HMS operations that are spaced out by the pool's idle timeout? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@253 PS9, Line 253: // network or IO fault (not an applicaiton level failure). The HMS client application http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@258 PS9, Line 258: for (int i = 0; i <= kNumRetries; i++) { Seems like kNumRetries should depend at least somewhat on the number of HMSes. If we have three HMS instances and the first two are down, the very first submitted task will fail while the rest will succeed. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@291 PS9, Line 291: // A fatal error occurred. Tear down the connection, and try again. We Is it possible for reconnect to succeed, for the task to exhibit a fatal error, and for this to repeat over and over? If so, we'll perform the retries in a tight loop; shouldn't we wait in between? I can see the tight loop being a "feature" rather than a bug if we're retrying the task on a different HMS each time, but not if it's on the same HMS over and over. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@300 PS9, Line 300: return callback(s); This passes in the last failure; should we instead pass in the first? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@316 PS9, Line 316: hms_client_ = hms::HmsClient(address, hms::HmsClientOptions()); Nit: might consider making HmsClientOptions() the default value of the constructor's second argument, for better ergonomics. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@324 PS9, Line 324: LOG(WARNING) << "Failed to connect to Hive Metastore (" : << address.ToString() << "): " << s.ToString(); More terse: WARN_NOT_OK(s, Substitute("Failed to connect to Hive Metastore ($0)", address.ToString()); http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@330 PS9, Line 330: return s; If we can't connect to any of the HMSs, should we return the first failure? Currently we're returning the last one. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@368 PS9, Line 368: return string(); Hmm, so what would happen at runtime if we returned an empty string here? Any reason not to promote the DFATAL above into a FATAL? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@387 PS9, Line 387: table->tableType = "MANAGED_TABLE"; Should this constant be given a little more visibility? What's its significance? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/master/hms_catalog.cc@420 PS9, Line 420: "When the Hive Metastore integration is enabled, Kudu table names must be a " Nit: as a convention we should start a status message with a lower case letter, since they're almost always prepended with other colon-delimited messages on the return path. http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/mini-cluster/external_mini_cluster-test.cc File src/kudu/mini-cluster/external_mini_cluster-test.cc: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/mini-cluster/external_mini_cluster-test.cc@19 PS9, Line 19: #include <ostream> Revert? http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/util/net/net_util.cc File src/kudu/util/net/net_util.cc: http://gerrit.cloudera.org:8080/#/c/8312/9/src/kudu/util/net/net_util.cc@18 PS9, Line 18: #include <cstdint> Nit: this belongs in the next group of includes. -- To view, visit http://gerrit.cloudera.org:8080/8312 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1b8f360408b05d6ac4f7359a8ac0dd889b196baf Gerrit-Change-Number: 8312 Gerrit-PatchSet: 9 Gerrit-Owner: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: Hao Hao <hao....@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Fri, 09 Mar 2018 00:36:36 +0000 Gerrit-HasComments: Yes