I have been working with Hive for the past week. The ability to wrap
an SQL-like tool over HDFS is very powerful. Now that I am comfortable
with the concept, I am looking at how to implement it.

Currently I have a three-node test cluster: hadoop1, hadoop2, and
hadoop3. I have Hive installed on hadoop1, and Derby is working as the
metastore on the local filesystem. I am not able to run more than one
instance of Hive. That makes sense, because Hive wants exclusive
access to the meta_db (the embedded Derby database only allows one
connection at a time).
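
For reference, this is the stock embedded-Derby setup; the
hive-site.xml properties below are the Hive defaults, not anything
custom on my end:

  <!-- hive-site.xml: default embedded Derby metastore.
       Only one Hive session can open this database at a time. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>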

This is a big downside, as I can only run one job at a time. These are
the solutions I am looking at:

Option 1: Different Hive instances with different warehouse directories on HDFS.
instance1 /users/hive/warehouse1
instance2 /users/hive/warehouse2
I could, for example, install one copy of Hive on each server.
Upside: I can now execute three jobs at once.
Downside: I have three separate warehouses. Even though they live on
HDFS together, they are unaware of each other.
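
Each instance would just point at its own warehouse directory. A
minimal sketch, assuming the stock Hive CLI and the layout above:

  # point this instance at its own warehouse directory at launch
  hive --hiveconf hive.metastore.warehouse.dir=/users/hive/warehouse1

  -- or set it per session from inside the CLI
  SET hive.metastore.warehouse.dir=/users/hive/warehouse1;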

Option 2: Always use external tables and build my own schema 'replication' system.
In this case the layout and install are the same:
instance1 /users/hive/warehouse1
instance2 /users/hive/warehouse2

Also, if any instance creates a table, it should create the table
outside the warehouse, e.g.
/users/hive/shared/table1
Then I need some external process that runs a matching 'CREATE
EXTERNAL TABLE ... LOCATION /users/hive/shared/table1' on all the
other nodes, so that all nodes can query the table. I am really not
worried about table mutations; once the data goes in, the tables will
almost never be mutated.
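
A minimal sketch of what that replication process would run on each of
the other nodes; the table name and columns here are just placeholders:

  -- registers the shared data with this instance's metastore;
  -- EXTERNAL means dropping the table leaves the HDFS files intact
  CREATE EXTERNAL TABLE table1 (
    id INT,
    payload STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/users/hive/shared/table1';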

IIRC the meta_db might be able to live on HDFS in a future version.
Am I overthinking something? Have I missed a way to execute multiple
Hive queries at once?
