There has been a lot of work on a technology called materialized views, which can 
provide many of these benefits in a very usable manner, though at the expense 
of using more disk storage. Essentially, the idea is to materialize views of the 
data and keep them updated when the data changes. Sometimes this is easy to do; at 
other times the entire view has to be reconstructed from scratch. These work 
really well for read-intensive data that does not change much over time. I 
would think that when the time comes, we should look at that more closely than 
simply columnar storage.
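[Editorial sketch: the two maintenance modes described above — cheap incremental updates vs. a full rebuild — in a toy example. All names are illustrative, not Hive APIs.]

```python
# A toy materialized view: total amount per region, kept in sync with a base table.
base_table = []   # list of (region, amount) rows
view = {}         # region -> sum(amount): the materialized view

def insert(region, amount):
    """Append a row and update the view in place (the easy, incremental case)."""
    base_table.append((region, amount))
    view[region] = view.get(region, 0) + amount

def rebuild():
    """Reconstruct the whole view from scratch (the expensive case,
    needed after deletes or arbitrary updates)."""
    view.clear()
    for region, amount in base_table:
        view[region] = view.get(region, 0) + amount

insert("us", 10)
insert("eu", 5)
insert("us", 7)
# Reads hit the precomputed view instead of scanning base_table.
assert view["us"] == 17
```

Read-heavy queries pay only a dictionary lookup; the cost shows up as extra storage and write-time work, as noted above.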

Ashish


________________________________
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 8:12 PM
To: [email protected]
Subject: Re: OLAP with Hive

If you assume all the rows match across all of the column files (using null 
values where values don't exist), it is indeed very fast and quite simple in 
concept. I think ease of use could be (and probably will be) debated for 
a long time. I like the idea of being able to choose which method is best for a 
particular table. I guess we can discuss it when the time comes.

Josh

On Dec 14, 2008, at 6:14 PM, Joydeep Sen Sarma wrote:

Regular joins are very expensive (Hive doesn't implement either map-side joins or 
joins optimized for pre-sorted data yet). Joining corresponding rows of 
multiple files together is very cheap (it would be linear if disks are not a 
bottleneck). It should be better than a merge join of multiple files (using unix 
join, for example), since we would just read the head of each file and 
concatenate.
Ease of use is the other factor. If performance/compaction can be transparently 
improved, then it's better for everyone (but views can get us there too).
Like Zheng says, disk seeks are a bit of a wildcard (but our cluster, 
for example, is not disk-IOPS bound, and with good buffering/prefetching one 
should be able to hide these costs to a large extent). And clearly things start 
falling apart with a high enough number of columns.
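[Editorial sketch: reassembling rows positionally really is just a linear zip over the per-column files. In-memory lists stand in for the column files; contents are made up for illustration.]

```python
# Each column lives in its own "file"; row i of every file belongs to the
# same logical row, so reassembly is a single linear pass with no key lookups.
col_a = ["alice", "bob", "carol"]   # contents of a/part-00000
col_b = ["10", "20", "30"]          # contents of b/part-00000

rows = [tuple(vals) for vals in zip(col_a, col_b)]
# rows == [("alice", "10"), ("bob", "20"), ("carol", "30")]
```

There is no comparison or sort step at all, which is why this beats a key-based merge join: the row correspondence is implicit in the file positions.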
________________________________
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 4:26 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: OLAP with Hive
So before I was using Hive, I was doing this on a normal file system using the 
standard unix "join" command-line application, which requires the two files to 
be sorted. It was indeed great when you didn't have to deal with more than one 
or two columns of data at a time. After that, the performance started degrading 
pretty rapidly, since the running time of joining M rows N times grows 
quickly.
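[Editorial sketch: what unix join does over two sorted inputs is a standard merge join — roughly the following, assuming unique keys. A sketch, not the actual join(1) implementation.]

```python
def merge_join(left, right):
    """Merge-join two key-sorted lists of (key, value) pairs, assuming
    unique keys; a single pass, linear in len(left) + len(right)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

# Joining N column files this way means N-1 passes, re-reading the growing
# intermediate result each time -- which is where it degrades past a
# couple of columns.
result = merge_join([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")])
```

One join is cheap; it's the repeated passes over an ever-wider intermediate that grow quickly, as described above.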
I think what you'd want is the idea of a column family, or grouping of columns that 
are always stored together, so that you could keep the columns you use regularly 
together. The ones not used regularly could be joined in later if a query called 
for them. Also, if you created identifiers (by way of file name, perhaps) on the 
column family and not the individual columns, you would only have to join by family, 
not per column. So this is sort of a middle ground between having it all in 
one file and all in separate files.
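[Editorial sketch of the column-family middle ground: a query only opens (and joins) the family files it actually needs. Family and column names are hypothetical.]

```python
# Columns grouped into families; frequently co-accessed columns share a file.
families = {
    "core":   ["user_id", "ts", "event"],   # read by almost every query
    "extras": ["referrer", "user_agent"],   # joined in only when asked for
}

def families_for(columns):
    """Return which family files must be opened (and joined) for a query."""
    needed = set(columns)
    return sorted(f for f, cols in families.items() if needed & set(cols))

assert families_for(["user_id", "event"]) == ["core"]               # one file, no join
assert families_for(["user_id", "referrer"]) == ["core", "extras"]  # one join, by family
```

The join count scales with the number of families touched, not the number of columns, which is the whole point of the grouping.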
Of course, you can sort of already do this by just having separate tables and 
using the built-in join functionality, so you could implement this with 
implicit table names like tablename_familyname, or even by just adding an 
implicit partition to the file scheme. Then you could either make the user 
responsible for knowing which partitions to select from to get the proper 
columns, or do some magic in the background.
Anyways, I'm sure you knew all of this; I was just stating it for the sake of 
the group. :)
Josh
On Dec 14, 2008, at 4:07 PM, Zheng Shao wrote:


For example, we can store one field (or a set of fields) of a table in one 
directory.
So table (string a, string b) can be stored like this:
a/part-00000
a/part-00001
a/part-00002
...
b/part-00000
b/part-00001
b/part-00002
...

The columnar organization can give us a huge performance gain when the table 
is very wide and most of the queries access only a small number of columns.

The drawback is that when we do need to access a lot of fields, we need to 
concatenate the fields back into rows by reading multiple files at the same 
time. This increases the number of disk seeks we do, and some of the 
data needs to go through the network. There was a discussion about whether we 
want to co-locate the file-system blocks of the different fields of the same 
rows, but it's not very clear what the best way to do that is.
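[Editorial sketch of the read path under that layout: a query touching only column a never opens the b/ files, while a multi-column query opens one file per column. A dict stands in for the directory tree; paths follow the layout above.]

```python
# One directory per column, mirroring the a/part-NNNNN, b/part-NNNNN layout.
fs = {
    "a/part-00000": ["a1", "a2"],
    "a/part-00001": ["a3"],
    "b/part-00000": ["b1", "b2"],
    "b/part-00001": ["b3"],
}

def scan(column):
    """Read one column: touches only that column's files."""
    out = []
    for path in sorted(fs):                 # part files in order
        if path.startswith(column + "/"):
            out.extend(fs[path])
    return out

def scan_rows(columns):
    """Reassemble rows for several columns -- one open stream per column,
    hence the extra disk seeks mentioned above."""
    return list(zip(*(scan(c) for c in columns)))

assert scan("a") == ["a1", "a2", "a3"]      # b/ files never touched
assert scan_rows(["a", "b"]) == [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
```

The win and the drawback fall straight out of the same structure: narrow queries read a fraction of the bytes, wide queries multiply the open files and seeks.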

Zheng
On Sun, Dec 14, 2008 at 3:51 PM, Josh Ferguson 
<[email protected]<mailto:[email protected]>> wrote:
What would columnar organization look like and what are the benefits and 
drawbacks to this?
Josh
On Dec 14, 2008, at 3:13 PM, Joydeep Sen Sarma wrote:


That's a hard one. We can wish whatever we want to, but I guess it's all a 
question of who has the resources to contribute to it and what they want from 
Hive.
I can speak a little bit about Facebook. The reason we invested in indexing was 
not that it was the primary usage (or even a bottleneck for, say, performance 
optimization), but because once you have so much data in one place, chances 
are that someone will come along and want to have quick lookups over some part 
of it (and you don't want to kill your cluster by doing scans all the time). So 
that definitely makes indexing useful. We are also seeing that with dimensional 
analysis, where there is a need to drill down into detailed data, 
multidimensional indexes can be very useful. So in the long term, I think this 
is one of the desired features.
That doesn't make it akin to HBase, though (in the sense that we still wouldn't 
have row-level updates or real-time index updates). Katta may be complementary, 
and we were actually interested in investigating it for indexing (instead of 
rolling things from scratch).
Columnar organization is also very interesting. With all the hooks in Hadoop 
(InputFormats) and Hive (SerDes), I think it's fairly tractable to do this.
________________________________
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 1:20 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: OLAP with Hive
I'd honestly like to see Hive remain a partitioned flat-file store. I don't 
think indexing what's inside the files is all that useful in most 
situations where you'd use Hive. I also think this kind of store is just the 
right fit for the Hadoop and large-scale analytics situation. I don't want to 
see Hive go toward HBase or Katta. What is the long-term vision for Hive?
Josh
On Dec 14, 2008, at 1:06 PM, Joydeep Sen Sarma wrote:
We have done some preliminary work with indexing, but that's not the focus 
right now, and no code is available in the open-source trunk for this purpose. I 
think it's fair to say that Hive is not optimized for online processing right 
now (and we are quite some ways off from columnar storage).
________________________________
From: Martin Matula [mailto:[email protected]]
Sent: Sunday, December 14, 2008 6:54 AM
To: [email protected]<mailto:[email protected]>
Subject: OLAP with Hive
Hi,
Is Hive capable of indexing the data and storing it in a way optimized for 
querying (like a columnar database: bitmap indexes, compression, etc.)?
I need to be able to get decent response times for queries (up to a few 
seconds) over huge amounts of analytical data. Is that achievable (with an 
appropriate number of machines in a cluster)? I saw that the 
serialization/deserialization of tables is pluggable. Is that the way to make 
the storage more efficient? Is there any existing implementation (either ready 
or in progress) targeted at this? Or any hints on what I might want to 
take a look at among the things currently available in Hive/Hadoop?
Thanks,
Martin



--
Yours,
Zheng
