Er, I lost the end of my first sentence below; it should have read: queries over "huge" amounts of analytical data [that are expected to return in seconds].
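In case concrete examples help, here is roughly what I had in mind. These are sketches only: all table, column, and script names below are made up, and per HIVE-37 the exact syntax may change.

By "materialize frequent and complex queries" I mean precomputing summary tables so that the interactive query becomes a cheap scan over a small result, e.g.:

    -- Sketch: 'pv_users' and 'pv_daily_sum' are hypothetical tables.
    FROM pv_users
    INSERT OVERWRITE TABLE pv_daily_sum
    SELECT pv_users.date, COUNT(DISTINCT pv_users.userid)
    GROUP BY pv_users.date;

Your interactive queries would then hit pv_daily_sum rather than the raw table.

And the custom map/reduce scripts I mentioned plug into a query like the one below. The scripts can be any executables that read and write tab-separated lines on stdin/stdout, since they run via Hadoop Streaming:

    -- Sketch: 'map_script' and 'reduce_script' are placeholder executables.
    FROM (
      FROM pv_users
      MAP pv_users.userid, pv_users.date
      USING 'map_script'              -- emits (dt, uid) per input row
      AS dt, uid
      CLUSTER BY dt                   -- routes all rows for a date to one reducer
    ) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
      REDUCE map_output.dt, map_output.uid
      USING 'reduce_script'           -- emits one (date, count) row per date
      AS date, count;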
On Sun, Dec 14, 2008 at 3:14 PM, Jeff Hammerbacher <[email protected]> wrote:

> Hey Martin,
>
> MapReduce was designed to perform serial scans over large quantities of
> data, so it's not clear that the programming paradigm will prove useful
> for queries over "huge" amounts of analytical data. You'll most likely
> need to materialize frequent and complex queries and get creative with
> table statistics and indexing to achieve your purposes.
>
> That being said, the Hive query language was designed to naturally
> incorporate Map and Reduce code written in any language via Hadoop
> Streaming. See the "Custom Map Reduce Scripts" section of
> http://wiki.apache.org/hadoop/Hive/HiveQL. Note that the syntax is
> evolving and may change, as indicated in
> https://issues.apache.org/jira/browse/HIVE-37.
>
> Regards,
> Jeff
>
> On Sun, Dec 14, 2008 at 1:32 PM, Martin Matula <[email protected]> wrote:
>
>> So is Hive not suitable for this at all, and should I rather look for
>> something else, or maybe try to build something on top of Hadoop and
>> HBase? Or would it make any sense for me to check out the pluggable
>> serialization in Hive? I am assuming I would also need to add my own
>> MapReduce routines, so the value of using Hive would be questionable,
>> right?
>> Thanks,
>> Martin
>>
>> On Sun, Dec 14, 2008 at 10:06 PM, Joydeep Sen Sarma <[email protected]> wrote:
>>
>>> We have done some preliminary work with indexing, but that's not the
>>> focus right now and no code is available in the open source trunk for
>>> this purpose. I think it's fair to say that Hive is not optimized for
>>> online processing right now (and we are quite some ways off from
>>> columnar storage).
>>>
>>> ------------------------------
>>>
>>> *From:* Martin Matula [mailto:[email protected]]
>>> *Sent:* Sunday, December 14, 2008 6:54 AM
>>> *To:* [email protected]
>>> *Subject:* OLAP with Hive
>>>
>>> Hi,
>>> Is Hive capable of indexing the data and storing them in a way
>>> optimized for querying (like a columnar database: bitmap indexes,
>>> compression, etc.)? I need to be able to get decent response times for
>>> queries (up to a few seconds) over huge amounts of analytical data. Is
>>> that achievable (with an appropriate number of machines in a cluster)?
>>> I saw that the serialization/deserialization of tables is pluggable.
>>> Is that the way to make the storage more efficient? Is there any
>>> existing implementation (either ready or in progress) targeted at
>>> this? Or any hints on what I may want to take a look at among the
>>> things that are currently available in Hive/Hadoop?
>>> Thanks,
>>> Martin
