I haven't used sorted columns before. I expect them to optimize some queries by avoiding a sort step, but not by pruning input. In that sense, sorted columns give you a very different kind of optimization than partitioned, bucketed, or indexed columns do.
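[Editor's sketch, not part of the original thread: how the bucketing and sort order under discussion might be declared so Hive's metastore knows about them, plus the kind of query bucketing does speed up. The table name `clicks`, the bucket count 32, and the column types are illustrative assumptions; the optimizer settings are real Hive properties, but which ones are needed varies by Hive version.]

```sql
-- Declare bucketing and sort order in the table metadata, so Hive can
-- exploit them (plain data that merely happens to be sorted is invisible
-- to the optimizer).
CREATE TABLE clicks (
  click_time STRING,
  ip STRING,
  user_id STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id, click_time) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- The query class bucketing reliably helps: bucket sampling reads only
-- 1 of the 32 buckets instead of scanning the whole partition.
SELECT * FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);

-- With both join sides bucketed and sorted on the join key, these
-- settings allow a sort-merge bucket map join that avoids a shuffle/sort.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```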
-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Monday, September 12, 2011 10:09 AM
To: user@hive.apache.org
Cc: Steven Wong; Travis Powell; Baiju Devani; Bob Tiernay
Subject: Re: Best practices for storing data on Hive

Thanks, Steven. So, am I correct in understanding that even the sorting columns (user_id, time in my example) would not be used to optimize the query shown below?

On 11-09-09 07:00 PM, Steven Wong wrote:
> Bucketing only speeds up sampling queries. Hive doesn't know/remember the
> hash function(s) you use to bucket the data, so it cannot use that to speed
> up lookups by a bucketed column.
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Friday, September 09, 2011 4:18 AM
> To: user@hive.apache.org
> Cc: Travis Powell; Baiju Devani; Bob Tiernay
> Subject: Re: Best practices for storing data on Hive
>
> Edward, Steven, or anyone else on the mailing list:
>
> Is it possible to optimize queries like the one below with bucketing?
>
> select * from <table> where user_id = 'blah' and dt >= '2011-05-26' and dt <= '2011-05-28';
>
> where <table> is partitioned by dt (which represents day), bucketed by
> user_id, and within each bucket the data is sorted by (user_id, time).
>
> The Hive wiki says that bucketing can improve performance on certain kinds
> of queries. What kinds of queries are these? Only sampling queries? Group by
> on a bucketed column? Where clause on a bucketed column? All of the above?
>
> Thanks in advance!
> Mark
>
> ----- Original Message -----
> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> To: user@hive.apache.org
> Cc: "Travis Powell" <tpow...@tealeaf.com>, "Baiju Devani" <bdev...@oanda.com>, "Bob Tiernay" <btier...@oanda.com>
> Sent: Thursday, September 8, 2011 9:26:10 PM
> Subject: Re: Best practices for storing data on Hive
>
> On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong <sw...@netflix.com> wrote:
>
> I think this statement is not true: "By distributing by (and preferably
> ordering by) user_id, we can minimize seek time in the table because Hive
> knows where all entries pertaining to a specific user are stored." I think it
> is not true whether the table is bucketed on user_id or not (assuming that
> user_id is not a partition column or an indexed column).
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, September 06, 2011 2:36 PM
> To: user@hive.apache.org
> Cc: Travis Powell; Baiju Devani; Bob Tiernay
> Subject: Re: Best practices for storing data on Hive
>
> Thanks for your reply, Travis.
>
> I was under the impression that for Hive to make use of the sorted structure
> of the data (i.e. for the table named "data" in your example), the metadata
> of the table (specified during table creation) has to advertise that
> property. However, I don't see any special metadata specifying such a
> property when the "data" table was created.
>
> Is that true? If so, is such metadata specified by using the CLUSTERED BY
> and SORTED BY clauses during table creation?
>
> On 11-09-06 03:50 PM, Travis Powell wrote:
>> Hi Mark,
>>
>> When we load data into Hive, we use a staging table to dynamically partition
>> our data. This might help you too.
>>
>> We create our initial table and our staging table:
>>
>> DROP TABLE IF EXISTS staging_data;
>> CREATE TABLE staging_data ( ... )
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
>>
>> CREATE TABLE data ( ...
>> )
>> PARTITIONED BY (dt STRING, hour INT)
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>> STORED AS SEQUENCEFILE;
>>
>> INSERT OVERWRITE TABLE data PARTITION (dt, hour)
>> SELECT q.*,
>>        to_date(q.session_timestamp) AS dt,
>>        hour(q.session_timestamp) AS hour
>> FROM staging_data q
>> DISTRIBUTE BY user_id SORT BY user_id;
>>
>> So.....
>> By distributing by (and preferably sorting by) user_id, we can minimize
>> seek time in the table because Hive knows where all entries pertaining to a
>> specific user are stored. Partitions by time have the best performance,
>> because chances are almost every query will have some time-related component
>> in it (and it spreads the data among partitions fairly well).
>>
>> Let me know if this works for you. We start every job with those first few
>> lines of Hive script. It works well for us.
>>
>> Thanks,
>>
>> Travis Powell
>>
>> -----Original Message-----
>> From: Mark Grover [mailto:mgro...@oanda.com]
>> Sent: Tuesday, September 06, 2011 12:39 PM
>> To: user@hive.apache.org
>> Cc: wd; Bob Tiernay; Baiju Devani
>> Subject: Re: Best practices for storing data on Hive
>>
>> Thanks for the response, wd.
>>
>> I would REALLY APPRECIATE it if other people could share their views as well.
>>
>> Here are the possible solutions I have thought about for the problem
>> (see the original email for a description of the problem):
>>
>> 1) Multiple partitions: We could partition the table by day and userId.
>> However, given the number of users that visit our website (hundreds of
>> thousands of unique users every day), this would lead to a very large number
>> of partitions (and rather small file sizes, ranging from a couple of
>> bytes to a couple of KB). From the documentation I've read online, it
>> seems that Hive/Hadoop weren't designed to deal with such small file
>> sizes, and such a situation should be avoided if possible.
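[Editor's note, not from the thread: Travis's INSERT ... PARTITION (dt, hour) relies on Hive's dynamic partitioning, which is disabled by default in Hive of this era (0.6/0.7). A script like his would typically be preceded by settings along these lines; the property names are real Hive configuration properties, and the limit values shown are just the defaults.]

```sql
-- Dynamic partitioning is off by default; enable it before the INSERT.
SET hive.exec.dynamic.partition = true;
-- 'nonstrict' allows all partition columns (dt, hour) to be dynamic;
-- the default 'strict' mode requires at least one static partition value.
SET hive.exec.dynamic.partition.mode = nonstrict;
-- Guard rails against accidentally creating huge numbers of partitions.
SET hive.exec.max.dynamic.partitions = 1000;
SET hive.exec.max.dynamic.partitions.pernode = 100;
```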
>> We had a scenario previously where we were partitioning by day and hour,
>> and because of the sheer number of partitions, queries like
>> "select * from <table> LIMIT 1;" were taking very long and even failed with
>> "Java out of Heap space" errors. My guess is that the master node was
>> munching through all these partitions and couldn't deal with their large
>> number.
>>
>> 2) Use of data locality: We could keep the data partitioned by day and
>> bucketed by userId, and within each bucket sort the data by (userId,
>> time). This way we could keep the data related to each userId together
>> within a daily partition, and if Hive could be made aware of this sort
>> order and use it to improve search/query times, that would alleviate the
>> problem quite a bit. The big question here is: does Hive leverage the sort
>> order of data within a partition's bucket when running (most/all?) queries,
>> where possible?
>>
>> 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion of an
>> index. If I index on userId, given that we see hundreds of thousands
>> of unique users per day, would indexing prove to be a good move? Are
>> there people who are using it for similar purposes or on a similar scale?
>>
>> 4) Using 2 "orthogonal" tables: As mentioned in my original email (see
>> below), we could have 2 independent tables, one which stores data
>> partitioned by day and the other partitioned by userId. For maintaining
>> partitions in the userId-partitioned table, I am planning to do the following:
>> in the nightly job, if userId=X visited the website the previous day, we
>> create a partition for userId=X if it doesn't already exist. Once the
>> partition is created, all clicks for that userId on the day in
>> question are put in a single file and dropped into the userId=X folder on
>> HDFS. This method could be used to simulate an "append" to the Hive
>> table.
>> The file would only be a few bytes to a few KB, and the format of
>> the table would be sequence file.
>>
>> What are your thoughts about the above 4 methods? Any particular likes
>> or dislikes? Any comments or suggestions would be helpful.
>>
>> Thank you again in advance!
>>
>> Mark
>>
>> On 11-09-04 04:01 AM, wd wrote:
>>> Hive supports more than one partition column; have you tried that? Maybe
>>> you can create two partition columns, named date and user.
>>>
>>> Hive 0.7 also supports indexes; maybe you can give them a try.
>>>
>>> On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <mgro...@oanda.com> wrote:
>>>> Hello folks,
>>>> I am fairly new to Hive and am wondering if you could share some of the
>>>> best practices for storing/querying data with Hive.
>>>>
>>>> Here is an example of the problem I am trying to solve.
>>>>
>>>> The traffic to our website is logged in files that contain information
>>>> about clicks from various users. Simplified, the log file looks like:
>>>>
>>>> t_1, ip_1, userid_1
>>>> t_2, ip_2, userid_2
>>>> t_3, ip_3, userid_3
>>>> ...
>>>>
>>>> where t_i represents the time of the click, ip_i represents the IP address
>>>> the click originated from, and userid_i represents the user ID of the user.
>>>>
>>>> Since the clicks are logged on an ongoing basis, partitioning our Hive
>>>> table by day seemed like the obvious choice. Every night we upload the
>>>> data from the previous day into a new partition.
>>>>
>>>> However, we also want the capability to find all log lines
>>>> corresponding to a particular user. With our present partitioning scheme,
>>>> all day partitions are searched for that user ID, but this takes a long
>>>> time. I am looking for ideas/suggestions/thoughts/comments on how to
>>>> reduce this time.
>>>>
>>>> As a solution, I am thinking that perhaps we could have 2 independent
>>>> tables, one which stores data partitioned by day and the other partitioned
>>>> by userId.
>>>> With the second table partitioned by userId, I will have to
>>>> find some way of maintaining the partitions, since Hive doesn't support
>>>> appending to files. Also, this seems suboptimal, since we are doubling
>>>> the amount of data that we store. What do you folks think of this
>>>> idea?
>>>>
>>>> Do you have any other suggestions on how we can approach this problem?
>>>>
>>>> What have other people in similar situations done? Please share.
>>>>
>>>> Thank you in advance!
>>>> Mark
>>>>
>>>> --
>>>> Mark Grover, Business Intelligence Analyst
>>>> OANDA Corporation
>>>> www: oanda.com www: fxtrade.com
>>>> e: mgro...@oanda.com
>>>>
>>>> "Best Trading Platform" - World Finance's Forex Awards 2009.
>>>> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.