You can plug in custom mapper and reducer scripts through Hive's TRANSFORM/MAP/REDUCE 
facilities. Check the wiki for how to use them. Or do you want something 
different?
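
For example, something along these lines (just a rough sketch -- the table, 
column, and script names are placeholders, and the target table is assumed to 
exist already). The scripts read tab-separated fields on stdin and write 
tab-separated fields on stdout, so the per-record logic can stay in whatever 
language you like:

  -- ship the scripts to the cluster with the job
  ADD FILE my_mapper.py;
  ADD FILE my_reducer.py;

  FROM (
    FROM web_logs
    MAP ip, url, ts                -- columns streamed to the mapper script
    USING 'python my_mapper.py'
    AS key, value                  -- columns the mapper writes back
    CLUSTER BY key                 -- shuffle/sort before the reduce side
  ) mapped
  INSERT OVERWRITE TABLE log_summary
  REDUCE key, value
  USING 'python my_reducer.py'
  AS key, result;

When the log format changes you mostly just adjust the table definition and the 
scripts, instead of rewriting a whole MapReduce job; the wiki page on the 
Transform/Map-Reduce syntax has the details.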




________________________________
From: Min Zhou <coderp...@gmail.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Sun, 22 Feb 2009 19:42:50 -0800
To: <hive-user@hadoop.apache.org>
Subject: How to simplify our development flow under the means of using Hive?

Hi list,

    I'm going to take Hive into production to analyze our web logs, which run to 
hundreds of gigabytes per day. Previously we did this job with plain Apache 
Hadoop, running our own raw MapReduce code. It worked, but it hurt our 
productivity: we kept writing code with similar logic again and again, and it 
got worse whenever the log format changed. For example, when we want to insert 
one more field into each line of the log, the previous code becomes useless and 
we have to redo it. So we are thinking about using Hive as a persistence layer, 
to store and retrieve the schemas of the data easily. But we found that Hive 
sometimes cannot handle certain kinds of complex analysis because of the limited 
expressive power of SQL. We have to write our own UDFs, and even then there are 
cases Hive cannot cover, so we still need to write raw MapReduce code, which 
brings us to another problem. Since one part is a set of SQL scripts and the 
other is Java or hybrid code, how do we coordinate Hive and raw MapReduce code, 
and how do we schedule them? How does Facebook use Hive? And what is your 
solution when you run into similar problems?

    In the end, we are considering using Hive as our data warehouse. Any 
suggestions?

Thanks in advance!
Min

--
My research interests are distributed systems, parallel computing and 
bytecode-based virtual machines.

http://coderplay.javaeye.com
