Hey Martin,

MapReduce was designed to perform serial scans over large quantities of data, so it's not clear that the programming paradigm will prove useful for interactive queries over "huge" amounts of analytical data. You will most likely need to materialize the results of frequent and complex queries, and get creative with table statistics and indexing, to achieve your goals.
That being said, the Hive query language was designed to naturally incorporate Map and Reduce code written in any language via Hadoop Streaming. See the "Custom Map Reduce Scripts" section of http://wiki.apache.org/hadoop/Hive/HiveQL. Note that the syntax is evolving and may change, as indicated in https://issues.apache.org/jira/browse/HIVE-37.

Regards,
Jeff

On Sun, Dec 14, 2008 at 1:32 PM, Martin Matula <[email protected]> wrote:
> So is Hive not suitable for this at all and I should rather look for
> something else, or maybe try to build something on top of Hadoop and HBase?
> Or would it make any sense for me to check out the pluggable serialization
> in Hive? I am assuming I would also need to add my own MapReduce routines,
> so the value of using Hive would be questionable, right?
> Thanks,
> Martin
>
> On Sun, Dec 14, 2008 at 10:06 PM, Joydeep Sen Sarma <[email protected]> wrote:
>
>> We have done some preliminary work with indexing, but that's not the
>> focus right now and no code is available in the open source trunk for
>> this purpose. I think it's fair to say that Hive is not optimized for
>> online processing right now (and we are quite some ways off from
>> columnar storage).
>>
>> ------------------------------
>> *From:* Martin Matula [mailto:[email protected]]
>> *Sent:* Sunday, December 14, 2008 6:54 AM
>> *To:* [email protected]
>> *Subject:* OLAP with Hive
>>
>> Hi,
>> Is Hive capable of indexing the data and storing it in a way optimized
>> for querying (like a columnar database - bitmap indexes, compression,
>> etc.)? I need to be able to get decent response times for queries (up
>> to a few seconds) over huge amounts of analytical data. Is that
>> achievable (with an appropriate number of machines in a cluster)? I saw
>> that the serialization/deserialization of tables is pluggable. Is that
>> the way to make the storage more efficient? Any existing implementation
>> (either ready or in progress) that would be targeted at this?
>> Or any hints on what I may want to take a look at among the things that
>> are currently available in Hive/Hadoop?
>> Thanks,
>> Martin
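P.S. In case it helps to make the "Custom Map Reduce Scripts" feature concrete: a script plugged into a Hive query is just a program that reads tab-separated rows on stdin and writes tab-separated rows on stdout, exactly as under plain Hadoop Streaming. A minimal sketch (the table, column names, and script name below are illustrative, not something from this thread):

```python
#!/usr/bin/env python
# weekday_mapper.py -- illustrative streaming script for Hive.
# A query could wire it in roughly like:
#   SELECT TRANSFORM (userid, ts) USING 'python weekday_mapper.py'
#   AS (userid, weekday) FROM clicks;
# (hypothetical table/columns; check the wiki page above for current syntax)
import sys
import datetime

def transform(line):
    """Map one tab-separated (userid, unix_ts) row to (userid, weekday)."""
    userid, ts = line.strip().split('\t')
    # isoweekday(): Monday=1 .. Sunday=7; use UTC so output is deterministic
    weekday = datetime.datetime.fromtimestamp(
        int(ts), tz=datetime.timezone.utc).isoweekday()
    return '\t'.join([userid, str(weekday)])

if __name__ == '__main__':
    # Hive/Hadoop Streaming feeds rows on stdin and reads rows from stdout
    for line in sys.stdin:
        print(transform(line))
```

Because the contract is just stdin/stdout with tab-separated fields, you can swap in any language, which is the point Jeff makes above.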
