I haven't used sorted columns before. I expect them to optimize some queries by avoiding a sort step, but not by pruning input. In that sense, sorted columns give you a very different kind of optimization than partitioned, bucketed, or indexed columns do.
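[Editor's sketch, not part of the original thread: how the bucketing and sort order under discussion might be declared so Hive's metastore knows about them, plus the kind of query bucketing does speed up. The table name `clicks`, the bucket count 32, and the column types are illustrative assumptions; the optimizer settings are real Hive properties, but which ones are needed varies by Hive version.]

```sql
-- Declare bucketing and sort order in the table metadata, so Hive can
-- exploit them (plain data that merely happens to be sorted is invisible
-- to the optimizer).
CREATE TABLE clicks (
  click_time STRING,
  ip STRING,
  user_id STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id, click_time) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- The query class bucketing reliably helps: bucket sampling reads only
-- 1 of the 32 buckets instead of scanning the whole partition.
SELECT * FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);

-- With both join sides bucketed and sorted on the join key, these
-- settings allow a sort-merge bucket map join that avoids a shuffle/sort.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```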
-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Monday, September 12, 2011 10:09 AM
To: user@hive.apache.org
Cc: Steven Wong; Travis Powell; Baiju Devani; Bob Tiernay
Subject: Re: Best practices for storing data on Hive

Thanks, Steven. So, am I correct in understanding that even the sorting columns (user_id, time in my example) would not be used to optimize the query shown below?

On 11-09-09 07:00 PM, Steven Wong wrote:
> Bucketing only speeds up sampling queries. Hive doesn't know/remember the
> hash function(s) you use to bucket the data, so it cannot use that to speed
> up lookups by a bucketed column.
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Friday, September 09, 2011 4:18 AM
> To: user@hive.apache.org
> Cc: Travis Powell; Baiju Devani; Bob Tiernay
> Subject: Re: Best practices for storing data on Hive
>
> Edward, Steven, or anyone else on the mailing list:
>
> Is it possible to optimize queries like the one below with bucketing?
>
> select * from <table> where user_id = 'blah' and dt >= '2011-05-26' and dt <= '2011-05-28';
>
> where <table> is partitioned by dt (which represents day), bucketed by
> user_id, and within each bucket the data is sorted by (user_id, time).
>
> The Hive wiki says that bucketing can improve performance on certain kinds
> of queries. What kinds of queries are these? Only sampling queries? Group by
> on a bucketed column? Where clause on a bucketed column? All of the above?
>
> Thanks in advance!
> Mark
>
> ----- Original Message -----
> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> To: user@hive.apache.org
> Cc: "Travis Powell" <tpow...@tealeaf.com>, "Baiju Devani" <bdev...@oanda.com>, "Bob Tiernay" <btier...@oanda.com>
> Sent: Thursday, September 8, 2011 9:26:10 PM
> Subject: Re: Best practices for storing data on Hive
>
> On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong <sw...@netflix.com> wrote:
>
> I think this statement is not true: "By distributing by (and preferably
> ordering by) user_id, we can minimize seek time in the table because Hive
> knows where all entries pertaining to a specific user are stored." I think it
> is not true whether the table is bucketed on user_id or not (assuming that
> user_id is not a partition column or an indexed column).
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, September 06, 2011 2:36 PM
> To: user@hive.apache.org
> Cc: Travis Powell; Baiju Devani; Bob Tiernay
> Subject: Re: Best practices for storing data on Hive
>
> Thanks for your reply, Travis.
>
> I was under the impression that for Hive to make use of the sorted structure
> of the data (i.e. for the table named "data" in your example), the metadata
> of the table (specified during table creation) has to advertise that
> property. However, I don't see any special metadata specifying such a
> property when the "data" table was created.
>
> Is that true? If so, is such metadata specified by using the CLUSTERED BY
> and SORTED BY clauses during table creation?
>
> On 11-09-06 03:50 PM, Travis Powell wrote:
>> Hi Mark,
>>
>> When we load data into Hive, we use a staging table to dynamically partition
>> our data. This might help you too.
>>
>> We create our initial table and our staging table:
>>
>> DROP TABLE IF EXISTS staging_data;
>> CREATE TABLE staging_data ( ... )
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
>>
>> CREATE TABLE data ( ...
>> )
>> PARTITIONED BY (dt STRING, hour INT)
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>> STORED AS SEQUENCEFILE;
>>
>> INSERT OVERWRITE TABLE data PARTITION (dt, hour)
>> SELECT q.*,
>>        to_date(q.session_timestamp) AS dt,
>>        hour(q.session_timestamp) AS hour
>> FROM staging_data q
>> DISTRIBUTE BY user_id SORT BY user_id;
>>
>> So.....
>> By distributing by (and preferably sorting by) user_id, we can minimize
>> seek time in the table because Hive knows where all entries pertaining to a
>> specific user are stored. Partitions by time have the best performance,
>> because chances are almost every query will have some time-related component
>> in it (and it spreads the data among partitions fairly well).
>>
>> Let me know if this works for you. We start every job with those first few
>> lines of Hive script. It works well for us.
>>
>> Thanks,
>>
>> Travis Powell
>>
>> -----Original Message-----
>> From: Mark Grover [mailto:mgro...@oanda.com]
>> Sent: Tuesday, September 06, 2011 12:39 PM
>> To: user@hive.apache.org
>> Cc: wd; Bob Tiernay; Baiju Devani
>> Subject: Re: Best practices for storing data on Hive
>>
>> Thanks for the response, wd.
>>
>> I would REALLY APPRECIATE it if other people could share their views as well.
>>
>> Here are the possible solutions I have thought about for the problem
>> (see the original email for a description of the problem):
>>
>> 1) Multiple partitions: We could partition the table by day and userId.
>> However, given the number of users that visit our website (hundreds of
>> thousands of unique users every day), this would lead to a very large number
>> of partitions (and rather small file sizes, ranging from a couple of
>> bytes to a couple of KB). From the documentation I've read online, it
>> seems that Hive/Hadoop weren't designed to deal with such small file
>> sizes, and such a situation should be avoided if possible.
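[Editor's note, not from the thread: Travis's INSERT ... PARTITION (dt, hour) relies on Hive's dynamic partitioning, which is disabled by default in Hive of this era (0.6/0.7). A script like his would typically be preceded by settings along these lines; the property names are real Hive configuration properties, and the limit values shown are just the defaults.]

```sql
-- Dynamic partitioning is off by default; enable it before the INSERT.
SET hive.exec.dynamic.partition = true;
-- 'nonstrict' allows all partition columns (dt, hour) to be dynamic;
-- the default 'strict' mode requires at least one static partition value.
SET hive.exec.dynamic.partition.mode = nonstrict;
-- Guard rails against accidentally creating huge numbers of partitions.
SET hive.exec.max.dynamic.partitions = 1000;
SET hive.exec.max.dynamic.partitions.pernode = 100;
```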
>> We had a scenario previously where we were partitioning by day and hour,
>> and because of the sheer number of partitions, queries like
>> "select * from <table> LIMIT 1;" were taking very long and even failed with
>> "Java out of Heap space" errors. My guess is that the master node was
>> munching through all these partitions and couldn't deal with their large
>> number.
>>
>> 2) Use of data locality: We could keep the data partitioned by day and
>> bucketed by userId, and within each bucket sort the data by (userId,
>> time). This way we could keep the data related to each userId together
>> within a daily partition, and if Hive could be made aware of this sort
>> order and use it to improve search/query times, that would alleviate the
>> problem quite a bit. The big question here is: does Hive leverage the sort
>> order of data within a partition's bucket when running (most/all?) queries,
>> where possible?
>>
>> 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion of an
>> index. If I index on userId, given that we see hundreds of thousands
>> of unique users per day, would indexing prove to be a good move? Are
>> there people who are using it for similar purposes or on a similar scale?
>>
>> 4) Using 2 "orthogonal" tables: As mentioned in my original email (see
>> below), we could have 2 independent tables, one which stores data
>> partitioned by day and the other partitioned by userId. For maintaining
>> partitions in the userId-partitioned table, I am planning to do the following:
>> in the nightly job, if userId=X visited the website the previous day, we
>> create a partition for userId=X if it doesn't already exist. Once the
>> partition is created, all clicks for that userId on the day in
>> question are put in a single file and dropped into the userId=X folder on
>> HDFS. This method could be used to simulate an "append" to the Hive
>> table.
>> The file would only be a few bytes to a few KB, and the format of
>> the table would be sequence file.
>>
>> What are your thoughts about the above 4 methods? Any particular likes
>> or dislikes? Any comments or suggestions would be helpful.
>>
>> Thank you again in advance!
>>
>> Mark
>>
>> On 11-09-04 04:01 AM, wd wrote:
>>> Hive supports more than one partition column; have you tried that? Maybe
>>> you can create two partition columns, named date and user.
>>>
>>> Hive 0.7 also supports indexes; maybe you can give them a try.
>>>
>>> On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <mgro...@oanda.com> wrote:
>>>> Hello folks,
>>>> I am fairly new to Hive and am wondering if you could share some of the
>>>> best practices for storing/querying data with Hive.
>>>>
>>>> Here is an example of the problem I am trying to solve.
>>>>
>>>> The traffic to our website is logged in files that contain information
>>>> about clicks from various users. Simplified, the log file looks like:
>>>>
>>>> t_1, ip_1, userid_1
>>>> t_2, ip_2, userid_2
>>>> t_3, ip_3, userid_3
>>>> ...
>>>>
>>>> where t_i represents the time of the click, ip_i represents the IP address
>>>> the click originated from, and userid_i represents the user ID of the user.
>>>>
>>>> Since the clicks are logged on an ongoing basis, partitioning our Hive
>>>> table by day seemed like the obvious choice. Every night we upload the
>>>> data from the previous day into a new partition.
>>>>
>>>> However, we also want the capability to find all log lines
>>>> corresponding to a particular user. With our present partitioning scheme,
>>>> all day partitions are searched for that user ID, but this takes a long
>>>> time. I am looking for ideas/suggestions/thoughts/comments on how to
>>>> reduce this time.
>>>>
>>>> As a solution, I am thinking that perhaps we could have 2 independent
>>>> tables, one which stores data partitioned by day and the other partitioned
>>>> by userId.
>>>> With the second table partitioned by userId, I will have to
>>>> find some way of maintaining the partitions, since Hive doesn't support
>>>> appending to files. Also, this seems suboptimal, since we are doubling
>>>> the amount of data that we store. What do you folks think of this
>>>> idea?
>>>>
>>>> Do you have any other suggestions on how we can approach this problem?
>>>>
>>>> What have other people in similar situations done? Please share.
>>>>
>>>> Thank you in advance!
>>>> Mark
>>>>
>>>> --
>>>> Mark Grover, Business Intelligence Analyst
>>>> OANDA Corporation
>>>> www: oanda.com www: fxtrade.com
>>>> e: mgro...@oanda.com
>>>>
>>>> "Best Trading Platform" - World Finance's Forex Awards 2009.
>>>> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.