Hi Edward,

You can have multiple instances of Hive by pointing the Hive CLI to different
configs (this is very similar to the Hadoop model). Take a look at
hive-default.xml in your Hive installation. You can create different copies of
this file and change the following properties:

hive.metastore.warehouse.dir - defines the path in the file system where your
warehouse files are stored.
javax.jdo.option.ConnectionURL - defines the connection URL for the embedded
Derby instance. You can change it to point to a different non-local metastore
server, or to a different local directory (Prasad is an expert in this and can
probably answer this better).

You can override these with hive-site.xml.

So once you create these different site.xmls in different conf directories
(say conf1 ... conf3; note that each file needs to be named hive-site.xml,
since that is the name the CLI looks up)...
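
For example, something like this (a minimal sketch; the directory names,
warehouse paths, and Derby database paths are made-up placeholders):

# one conf directory per Hive instance, each with its own hive-site.xml
mkdir -p conf1 conf2

cat > conf1/hive-site.xml <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/users/hive/warehouse1</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore1_db;create=true</value>
  </property>
</configuration>
EOF

# conf2/hive-site.xml is the same, but with warehouse2 and metastore2_db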

you can run different CLIs by

bin/hive --config <directory containing the hive-site.xml>
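
e.g., with the conf directories from the sketch above:

bin/hive --config conf1    # session against warehouse1
bin/hive --config conf2    # session against warehouse2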

At some point we may think about backing up the metastore to HDFS for
availability reasons, but since the metastore is an online system that needs
low latencies, at this point we may not actually store it on HDFS.

Additionally, you should be able to run multiple jobs against the same Hive
instance concurrently without any problems (just spawn a different instance of
the CLI for each job).

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:[EMAIL PROTECTED]
Sent: Wed 10/1/2008 5:04 PM
To: core-user@hadoop.apache.org
Subject: Hive questions about the meta db
 
I have been working with Hive for the past week. The ability to wrap an
SQL-like tool over HDFS is very powerful. Now that I am comfortable with the
concept, I am looking at an implementation of it.

Currently I have a three node cluster for testing: hadoop1, hadoop2, and
hadoop3. I have Hive installed on hadoop1, and Derby is working as the
metastore on the local filesystem. I am not able to run more than one instance
of Hive. That makes sense, because Hive probably wants exclusive access to the
meta_db.

This is a big downside, as I can only run one job at a time. These are the
solutions I am looking at:

Option 1: Different Hive instances with different warehouse directories on HDFS.
instance1 /users/hive/warehouse1
instance2 /users/hive/warehouse2
I could, for example, install one copy of Hive on each server.
Upside: I can now execute three jobs at once.
Downside: I have three separate warehouses. Even though they live on HDFS
together, they are unaware of each other.

Option 2: Always use external tables and create my own schema 'replication'
system. In this case the layout and install are the same:
instance1 /users/hive/warehouse1
instance2 /users/hive/warehouse2

Also, if any instance creates a table, it should create the table outside the
warehouse:
/users/hive/shared/table1
Now I need some external process that runs a 'create external table' statement
pointing at /users/hive/shared/table1 on all the other nodes. This way all
nodes can query the table. I am really not worried about table mutations; once
the data goes into the tables, it will almost never be mutated.
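
Something like this, run once on each of the other instances, would register
the shared data (a rough sketch; the column names, types, and delimiter are
made up and would have to match the real table):

bin/hive -e "CREATE EXTERNAL TABLE table1 (key INT, value STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
             LOCATION '/users/hive/shared/table1';"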

IIRC, the meta_db might be able to be stored on HDFS in a future version.
Am I overthinking something, or have I missed a way to execute multiple Hive
queries at once?
