For example, we can store each field (or set of fields) of a table in its
own directory.
A table (string a, string b) could then be stored like this:
a/part-00000
a/part-00001
a/part-00002
...
b/part-00000
b/part-00001
b/part-00002
...

The columnar organization can give us a huge performance gain when the
table is very wide and most of the queries access only a small number of
columns.
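
To make the gain concrete, here is a minimal sketch in Python (not Hive code). It assumes plain local files with one value per line; the helper names write_columnar and read_column are made up for illustration. A query that only needs column a never opens the files under b/.

```python
import os
import tempfile

def write_columnar(root, columns, rows, rows_per_part=2):
    """Write each column of `rows` to its own directory as part-NNNNN files.

    Assumption for this sketch: values are strings without newlines.
    """
    for idx, name in enumerate(columns):
        col_dir = os.path.join(root, name)
        os.makedirs(col_dir, exist_ok=True)
        for part, start in enumerate(range(0, len(rows), rows_per_part)):
            path = os.path.join(col_dir, "part-%05d" % part)
            with open(path, "w") as f:
                for row in rows[start:start + rows_per_part]:
                    f.write(row[idx] + "\n")

def read_column(root, name):
    """Read a single column; files of the other columns are never opened."""
    col_dir = os.path.join(root, name)
    values = []
    for part in sorted(os.listdir(col_dir)):
        with open(os.path.join(col_dir, part)) as f:
            values.extend(line.rstrip("\n") for line in f)
    return values

root = tempfile.mkdtemp()
rows = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]
write_columnar(root, ["a", "b"], rows)
print(read_column(root, "a"))  # ['a0', 'a1', 'a2']
```

With a table of hundreds of columns, the amount of data read for such a query shrinks roughly in proportion to the number of columns skipped.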

The drawback is that when we do need to access a lot of fields, we have to
concatenate the fields back into rows by reading multiple files at the
same time. This increases the number of disk seeks, and some of the data
needs to go over the network. There was a discussion about whether we want
to co-locate the file system blocks of the different fields of the same
rows, but it's not clear what the best way to do that is.
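
As a rough sketch of that row reconstruction (again plain Python over local files, not Hive code; read_rows is a hypothetical helper, and it assumes every column directory has the same part files with the same number of lines), the reader has to keep one file open per column and advance them in lockstep:

```python
import os
import tempfile

def read_rows(root, columns):
    """Rebuild rows by reading the same part file from every column directory."""
    first_dir = os.path.join(root, columns[0])
    for part in sorted(os.listdir(first_dir)):
        # One open file per requested column: this is where the extra
        # seeks (and, on a distributed file system, remote reads) come from.
        files = [open(os.path.join(root, c, part)) for c in columns]
        try:
            # zip() advances all column files together, one row at a time.
            for values in zip(*files):
                yield tuple(v.rstrip("\n") for v in values)
        finally:
            for f in files:
                f.close()

# Tiny demo layout: a/part-00000 and b/part-00000, one value per line.
root = tempfile.mkdtemp()
for name, values in {"a": ["a0", "a1"], "b": ["b0", "b1"]}.items():
    os.makedirs(os.path.join(root, name))
    with open(os.path.join(root, name, "part-00000"), "w") as f:
        f.write("\n".join(values) + "\n")

rows = list(read_rows(root, ["a", "b"]))
print(rows)  # [('a0', 'b0'), ('a1', 'b1')]
```

The number of files open at once grows with the number of columns requested, which is why co-locating the blocks of the same rows would help.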

Zheng

On Sun, Dec 14, 2008 at 3:51 PM, Josh Ferguson <[email protected]> wrote:

> What would columnar organization look like and what are the benefits and
> drawbacks to this?
> Josh
>
> On Dec 14, 2008, at 3:13 PM, Joydeep Sen Sarma wrote:
>
> That's a hard one. We can wish whatever we want to – but I guess it's all a
> question of who has the resources to contribute to it and what they want
> from Hive.
>
> I can speak a little bit about Facebook. The reason we invested in indexing
> was not that it was the primary usage (or even a bottleneck for, say,
> performance optimization) – but because once you have so much data in one
> place – chances are that someone will come along and want to have quick
> lookups over some part of it (and u don't want to kill ur cluster by doing
> scans all the time). So that definitely makes indexing useful. We are also
> seeing that with dimensional analysis – where there is a need to drill down
> into detailed data – multidimensional indexes can be very useful. So in the
> long term – I think this is one of the desired features.
>
> That doesn't make it akin to hbase though (in the sense that we still
> wouldn't have row level updates or real-time index updates). Katta may be
> complementary and we were actually interested in investigating it for
> indexing (instead of rolling things from scratch).
>
> Columnar organization is also very interesting. With all the hooks in
> hadoop (inputformatters) and hive(serdes) – I think it's fairly tractable
> to do this ..
>
> ------------------------------
> *From:* Josh Ferguson [mailto:[email protected] <[email protected]>]
> *Sent:* Sunday, December 14, 2008 1:20 PM
> *To:* [email protected]
> *Subject:* Re: OLAP with Hive
>
> I'd honestly like to see hive remain a partitioned flat file store. I
> don't think indexing what's inside the files is too incredibly useful in
> most situations where you'd use hive. I also think this kind of store is
> just the right fit for the hadoop and large scale analytics situation. I
> don't want to see hive go toward hbase or katta. What is the long term
> vision for hive?
>
> Josh
>
> On Dec 14, 2008, at 1:06 PM, Joydeep Sen Sarma wrote:
>
>
> We have done some preliminary work with indexing – but that's not the focus
> right now and no code is available in the open source trunk for this
> purpose. I think it's fair to say that hive is not optimized for online
> processing right now. (and we are quite some ways off from columnar
> storage).
>
> ------------------------------
> *From:* Martin Matula [mailto:[email protected] <[email protected]>]
> *Sent:* Sunday, December 14, 2008 6:54 AM
> *To:* [email protected]
> *Subject:* OLAP with Hive
>
> Hi,
> Is Hive capable of indexing the data and storing them in a way optimized
> for querying (like a columnar database - bitmap indexes, compression, etc.)?
> I need to be able to get decent response times for queries (up to a few
> seconds) over huge amounts of analytical data. Is that achievable (with
> appropriate number of machines in a cluster)? I saw the
> serialization/deserialization of tables is pluggable. Is that the way to
> make the storage more efficient? Any existing implementation (either ready
> or in progress) that would be targeted at this? Or any hints on what I may
> want to take a look at among the things that are currently available in
> Hive/Hadoop?
> Thanks,
> Martin
>
>
>
>


-- 
Yours,
Zheng
