Hi all,

For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
to uploading data into HDFS and using MapReduce to load/transform the data,
I'd like to integrate more closely with Hive. Specifically, to run the
CREATE TABLE statements needed to automatically inject table definitions into
Hive's metastore for the data files that sqoop loads into HDFS. Doing this
requires linking against Hive in some way (either directly by using one of
their API libraries, or "loosely" by piping commands into a Hive instance).
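
For concreteness, here's a rough sketch of what the "loose" option might look
like: locating a standalone Hive install via HIVE_HOME and handing a generated
CREATE TABLE statement to the Hive CLI with its -e flag. The table name,
columns, and delimiter below are just placeholders for whatever sqoop would
derive from the source database; this is only to illustrate the shape of the
integration, not a final design.

import java.io.IOException;

public class HiveDdlRunner {
  public static void main(String[] args)
      throws IOException, InterruptedException {
    // Locate a standalone Hive deployment; assumes HIVE_HOME is set.
    String hiveHome = System.getenv("HIVE_HOME");

    // Placeholder DDL; sqoop would generate this from the imported schema.
    String ddl = "CREATE TABLE employees (id INT, name STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\001' "
        + "STORED AS TEXTFILE";

    // Run the Hive CLI and pass it the statement to execute.
    ProcessBuilder pb = new ProcessBuilder(hiveHome + "/bin/hive", "-e", ddl);
    pb.inheritIO();
    Process hive = pb.start();
    int status = hive.waitFor();
    System.out.println("hive exited with status " + status);
  }
}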

In either case, there's a dependency there. I was hoping someone on this
list with more Ivy experience than I have can suggest the best way to make
this happen. Hive isn't in the maven2 repository that Hadoop pulls most of its
dependencies from. It might be necessary for sqoop to have access to a full
build of Hive. It doesn't seem like a good idea to check that binary
distribution into Hadoop svn, but I'm not sure what's the most expedient
alternative. Is it acceptable to just require that developers who wish to
compile/test/run sqoop have a separate standalone Hive deployment and a
proper HIVE_HOME variable? This would keep our source repo "clean." The
downside is that it makes it difficult to test Hive-specific integration
functionality with Hudson and requires extra legwork from developers.

Thanks,
- Aaron Kimball
