Hi Edward, You can have multiple instances of hive by pointing the hive cli to different configs (This is very similar to the hadoop model). Take a look at hive-default.xml in you hive instance. You can create different copies of this file and change the following properties:
hive.metastore.warehouse.dir - defines the path in the file system where your warehouse files are stored. javax.jdo.option.ConnectionURL - defines the connection url for the embedded derby instance. You can change that to point to different instances of non local metastore servers or change that to point to different directories (Prasad is an expert in this and can probably answer this better). you can override these with hive-site.xml so once you create these different site.xmls say (hive-site1.xml,... hive-site3.xml) in different directories... you can run different CLIs by bin/hive --config <directory of hive-site1.xml).. And some point we may think of backing up the metastore in hdfs for availability reasons but metastore being a online system which needs low latencies, at this point we may not actually store it on hdfs. Additionally, you should be able to run multiple jobs to the same hive instance concurrently without any problems (just spawn different instance of the CLI). Ashish -----Original Message----- From: Edward Capriolo [mailto:[EMAIL PROTECTED] Sent: Wed 10/1/2008 5:04 PM To: core-user@hadoop.apache.org Subject: Hive questions about the meta db I have been working with Hive for the past week. The ability to wrap an SQL like tool over HDFS as very powerful. Now that I am comfortable with the concept, I am looking at an implementation of it. Currently I have a three node cluster for testing hadoop1, hadoop2, and hadoop3. I have hive installed on hadoop1 and derby is working as the metastore on the local filesystem. I am not able to run more then one instance of Hive. That makes sense because hive probably wants exclusive access to the meta_db. This is a big downside as I can only run 1 job at once. These are the solutions I am looking at: Different hive instances different warehouse directories on hdfs. instance1 /users/hive/warehouse1 instance2 /users/hive/warehouse2 I could for example install one copy of hive on each server. Upside: I can now execute three jobs at once. Downside: I have three separate warehouses. Even though they live on hdfs together they are unaware of each other. Option 2: Always use external tables, create my own schema 'replication' system In this case the layout and install is the same. instance1 /users/hive/warehouse1 instance2 /users/hive/warehouse2 Also if any instance creates a table, it should create the table outside the warehouse /users/hive/shared/table1 Now I need some external process that runs 'create external table /users/hive/shared/table1' on all the other nodes. This way all nodes can query the table. I am really not woried about table mutations once that data goes in the tables will almost never be mutated. IIRC meta_db might be able to be stored on hdfs in a future version. Am I over thinking something, have I missed a way to execute multiple hive queries at once?