Adar Dembo has posted comments on this change. Change subject: Add basic Hive MetaStore client ......................................................................
Patch Set 5: (52 comments) http://gerrit.cloudera.org:8080/#/c/7053/5/build-support/dist_test.py File build-support/dist_test.py: Line 82: # Hive MetaStore tests require hadoop and hive libraries. These directories will be recursively copied? Or are these just symlinks? Can you add to the comment? http://gerrit.cloudera.org:8080/#/c/7053/5/build-support/run_dist_test.py File build-support/run_dist_test.py: Line 127: env['HIVE_HOME'] = os.path.abspath(glob.glob("../../thirdparty/src/apache-hive-*-bin")[0]) Can you use ROOT to derive the paths instead of relying on the cwd? Line 129: env['JAVA_HOME'] = glob.glob("/usr/lib/jvm/java-1.8.0-*")[0] Can we run detect-java-home.sh from bigtop instead? This seems rather fragile. http://gerrit.cloudera.org:8080/#/c/7053/5/cmake_modules/FindJavaHome.cmake File cmake_modules/FindJavaHome.cmake: Line 19: # https://github.com/apache/bigtop/blob/38e1571b2f73bbfa6ab0c01a689fae967b8399d9/bigtop-packages/src/common/bigtop-utils/bigtop-detect-javahome If this is a specific version for a reason, please explain why. IIRC, it's because this was the last version to support JDK7? http://gerrit.cloudera.org:8080/#/c/7053/5/cmake_modules/FindThrift.cmake File cmake_modules/FindThrift.cmake: PS5, Line 106: #list(APPEND ${SRCS} "${THRIFT_CC_OUT}") : #list(APPEND ${HDRS} "${THRIFT_H_OUT}") Remove this? http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/hms/CMakeLists.txt File src/kudu/hms/CMakeLists.txt: PS5, Line 21: set(HMS_THRIFT_LIBS : thrift) This is only used in one place, just inline it there? Line 23: ADD_EXPORTABLE_LIBRARY(hms_thrift Is hms_thrift going to be linked into the Kudu client library? If not, it doesn't need to be an exportable library; add_library() and target_link_library() should suffice. Line 35: ADD_EXPORTABLE_LIBRARY(hms Likewise. Line 44: ADD_CUSTOM_TARGET(hive_link_target ALL Nit: use lowercase for cmake built-ins (i.e. add_custom_target(...)). Line 49: ADD_CUSTOM_TARGET(hadoop_link_target ALL So this means the symlinks will be recreated on every invocation of "make" that doesn't include a specific target, or an invocation of "make all". Isn't that overkill though? Wouldn't it be sufficient to create the symlink on the first run of cmake? You could do that via execute_process(). http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/hms/hive_metastore-test.cc File src/kudu/hms/hive_metastore-test.cc: Line 23: // See https://stackoverflow.com/questions/1156003/c-namespace-collision-with-gtest-and-boost But we're not explicitly including boost here. Is this from ThriftHiveMetastore.h? If so, could we add the workaround directly to the generated header? Line 42: TEST_F(HiveMetastoreTest, TestCreateDatabase) { Also test that Stop() works? Line 52: ASSERT_EQ(vector<string>({ "default" }), databases) Surprised the compiler needs the vector<string> part and isn't able to figure out the type of the initializer list itself. http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/hms/hive_metastore.cc File src/kudu/hms/hive_metastore.cc: Line 49: #define HMS_RET_NOT_OK(call, msg) \ Are these all of the exception classes that may be thrown? Is there some way to specify a catch-all in case we're wrong? Line 63: : client_(nullptr) { So client_ is actually a pointer? Is the pointer hidden behind a typedef or something? Line 71: CHECK_OK(Stop()); Maybe WARN_NOT_OK() instead? I think that's the convention we use. http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/hms/hive_metastore.h File src/kudu/hms/hive_metastore.h: Line 33: // A client for the Hive MetaStore. Maybe HiveMetastoreClient then? Or HMSClient? HiveMetastore suggests the actual server-side component. Line 39: What happens if Start or Stop are called multiple times, or if Stop is called before Start? Is that something we need to protect against? Line 51: Status CreateDatabase(const Apache::Hadoop::Hive::Database& db); Forward declare and maybe avoid some includes? Line 57: Apache::Hadoop::Hive::ThriftHiveMetastoreClient client_; Guess we can't forward declare this, so maybe forward declaring the above isn't useful. http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/hms/mini_hms-test.cc File src/kudu/hms/mini_hms-test.cc: Line 24: class MiniKdcTest : public KuduTest {}; MiniHmsTest Line 26: TEST_F(MiniKdcTest, TestBasicOperation) { Shouldn't we also test some sort of RPC? Oh I see, that's what hive_metastore-test.cc is for. But, I still think we should do something here to verify that the HMS is indeed running properly. Maybe we can look for the presence of a file? Or read its contents? http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/hms/mini_hms.cc File src/kudu/hms/mini_hms.cc: Line 20: #include <csignal> Don't need? Line 27: #include "kudu/gutil/strings/split.h" Don't need? Line 28: #include "kudu/gutil/strings/strip.h" Don't need? Line 33: #include "kudu/util/stopwatch.h" Don't need? Line 47: proc->KillAndWait(SIGKILL); WARN_NOT_OK() here? Line 60: if (env == nullptr) { Use a ternary? Line 63: home_dir = env; Extra space here. Line 65: CHECK(Env::Default()->FileExists(home_dir)) << env_var << " not found at " << home_dir; Shouldn't we gracefully return this sort of error via Status? Also, if this is a symlink, could we make sure it's not broken? For example, if I upgrade my JVM so my java-home symlink is now broken, would be good to catch that before trying to run Hive. Line 72: //SCOPED_LOG_SLOW_EXECUTION(WARNING, 10000, "Starting HMS"); Remove? Or uncomment? Line 77: Env* env = Env::Default(); How about asking users to supply this in the class constructor? Line 106: int rc = 0; Nit: do you actually need to initialize the value? Would prefer to adhere to the calling convention of "value is guaranteed to be set if function does not fail". Line 109: return Status::RuntimeError("schematool failed"); Maybe include the exit status? Should use GetExitStatus() anyway. Line 112: // Start the Kerberos HMS. Kerberos HMS? http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/security/test/mini_kdc.cc File src/kudu/security/test/mini_kdc.cc: Line 151: RETURN_NOT_OK(kdc_process_->WaitForUdpBind(&options_.port, MonoDelta::FromSeconds(1))); This seems like a rather short timeout. The old code was even more aggressive; any reason not to wait longer (and perhaps avoid some future flakiness in doing so)? http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/util/subprocess.cc File src/kudu/util/subprocess.cc: Line 749: return this->WaitForBind(port, "4TCP", timeout); Nit: remove unnecessary this prefixes, here and below. Line 756: Status Subprocess::WaitForBind(uint16_t* port, const char* kind, MonoDelta timeout) { I don't really think this or GetExecutablePath belong in util/subprocess. They don't strike me as functionality we'd want in production, so test_util feels like a better home. AFAICT the only reason to put WaitForBind() in util/subprocess is to get access to the pid, but pid() is already exposed in the subprocess API. http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/util/subprocess.h File src/kudu/util/subprocess.h: Line 205: // Attempts to find the path the specified executable, searching the provided "to the specified executable" http://gerrit.cloudera.org:8080/#/c/7053/5/src/kudu/util/test_util.cc File src/kudu/util/test_util.cc: Line 27: New includes not needed? http://gerrit.cloudera.org:8080/#/c/7053/5/thirdparty/LICENSE.txt File thirdparty/LICENSE.txt: Line 620: Source: http://www.boost.org/ Copy/pasta Line 658: Source: http://www.python.org/ Copy/pasta? http://gerrit.cloudera.org:8080/#/c/7053/5/thirdparty/build-definitions.sh File thirdparty/build-definitions.sh: Line 501: CXXFLAGS="$EXTRA_CPPFLAGS $OPENSSL_CFLAGS" \ Isn't this wrong though? Curl is a C program so I'd expect CXXFLAGS has no effect. Line 670: # Build the date_time boost lib. Update/remove this. Line 671: ./bootstrap.sh --prefix=$PREFIX threading=multi --with-libraries=date_time,thread Is this actually needed? I don't see a corresponding change to FindKuduBoost.cmake. Oh, is it only for Thrift? If so, could you add a comment? Could you double check that it's indeed needed? PS5, Line 698: -I$PREFIX/include -I$THRIFT_BDIR/lib/cpp/src Why do we need these two additions? I can take my answer in the form of a code comment. Line 732: cp $THRIFT_SOURCE/contrib/fb303/if/fb303.thrift $PREFIX/share/fb303/if Should be cp -f, right? Otherwise will it overwrite? http://gerrit.cloudera.org:8080/#/c/7053/5/thirdparty/download-thirdparty.sh File thirdparty/download-thirdparty.sh: Line 323: if [ ! -d "$THRIFT_SOURCE" ]; then No autoreconf? Line 331: if [ ! -d "$BISON_SOURCE" ]; then No autoreconf? Line 342: # Remove the share/doc dir which contains 2GiB of HTML project documentation. Perhaps we should follow our LLVM tarball's lead and remove it a priori, to avoid downloading this unnecessary stuff in the first place? http://gerrit.cloudera.org:8080/#/c/7053/5/thirdparty/vars.sh File thirdparty/vars.sh: Line 180: BISON_VERSION=2.7.1 Why use this version of bison? The latest is 3.0.4 and it's quite old already (~2 years). Line 185: HIVE_NAME=apache-hive-$HIVE_VERSION-bin Why not just hive-$HIVE_VERSION? -- To view, visit http://gerrit.cloudera.org:8080/7053 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I155223da912bc18a759df2f1f6bc25d1132a99ee Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes