I have been working with Hive for the past week. The ability to wrap an SQL-like tool over HDFS is very powerful. Now that I am comfortable with the concept, I am looking at how to actually deploy it.
Currently I have a three-node test cluster: hadoop1, hadoop2, and hadoop3. I have Hive installed on hadoop1, with Derby acting as the metastore on the local filesystem. I am not able to run more than one instance of Hive at a time. That makes sense, because embedded Derby probably wants exclusive access to the meta_db, but it is a big downside: I can only run one job at once.

These are the solutions I am looking at:

Option 1: Different Hive instances with different warehouse directories on HDFS

instance1: /users/hive/warehouse1
instance2: /users/hive/warehouse2

I could, for example, install one copy of Hive on each server (a rough configuration sketch is at the end of this post).

Upside: I can now execute three jobs at once.
Downside: I have three separate warehouses. Even though they live on HDFS together, they are unaware of each other.

Option 2: Always use external tables, and create my own schema "replication" system

In this case the layout and install are the same as in Option 1:

instance1: /users/hive/warehouse1
instance2: /users/hive/warehouse2

In addition, whenever an instance creates a table, it should create it outside its warehouse, e.g. at /users/hive/shared/table1. I would then need some external process that runs a matching CREATE EXTERNAL TABLE statement pointing at /users/hive/shared/table1 on all the other instances, so that every node can query the table (a sketch of what I mean is at the end of this post). I am really not worried about table mutations; once the data goes into the tables, it will almost never be mutated.

IIRC, the meta_db might be able to be stored on HDFS in a future version.

Am I overthinking something? Have I missed a way to execute multiple Hive queries at once?
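For concreteness, here is roughly how I picture Option 1 working. The hive.metastore.warehouse.dir property is real, but my assumption that a per-session SET is enough may be wrong; it may need to go into each instance's hive-site.xml instead:

    -- On the hadoop1 instance (repeat with warehouse2 on hadoop2, etc.):
    SET hive.metastore.warehouse.dir=/users/hive/warehouse1;

    -- Managed tables created by this instance would then land under
    -- warehouse1, invisible to the metastores on the other two nodes.
    CREATE TABLE local_only_table (id INT, payload STRING);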
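And here is a minimal sketch of the Option 2 "replication" step. The /users/hive/shared/table1 path is from my layout above; the table name, columns, and delimiter are hypothetical, purely for illustration:

    -- Run on the instance that produces the data, then run the same
    -- statement on every other instance so that each local Derby
    -- metastore registers the shared HDFS directory.
    CREATE EXTERNAL TABLE table1 (
      id INT,          -- hypothetical schema
      payload STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/users/hive/shared/table1';

Because the table is external, DROP TABLE on any one instance only removes that instance's metadata; the files under /users/hive/shared/table1 stay put for everyone else, which is why I am not worried about mutations.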