Re: Selecting data based on the clustered columns

Deepak A Fri, 17 Jul 2009 00:02:32 -0700

That would be great! Thanks a lot.

-Deepak


On Fri, Jul 17, 2009 at 11:46 AM, Prasad Chakka <pcha...@facebook.com> wrote:

> Yeah, we know of this optimization and will add it as we start
> optimizing filter queries using indexes.
>
> ------------------------------
> *From: *Namit Jain <nj...@facebook.com>
> *Reply-To: *<hive-user@hadoop.apache.org>
> *Date: *Thu, 16 Jul 2009 22:42:25 -0700
> *To: *<hive-user@hadoop.apache.org>
> *Subject: *Re: Selecting data based on the clustered columns
>
>
> I am not sure if they are handling this. Let me talk to Prasad offline and
> get back to you.
>
>
>
> On 7/16/09 9:49 PM, "Deepak A" <deepa...@gmail.com> wrote:
>
> Hi Namit,
>
> I checked JIRA for any existing tickets on this and figured out that there
> are plans to support indexing on queries. This is being discussed at
> https://issues.apache.org/jira/browse/HIVE-417
>
> Can you please check whether what we are discussing makes sense in this
> context, or whether it is orthogonal to it?
>
> -Deepak
>
> On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <deepa...@gmail.com> wrote:
>
> Hi Namit,
>
> Thanks a lot for the update.
> Will do that for sure.
>
> -Deepak
>
>
> On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Right now, bucketing information is not used in many places; it is
> only used in sampling.
> For example:
>
> If your query was:
>
> Select .. From Posts TABLESAMPLE(BUCKET 1 OUT OF 256) a;
>
> Then only the first bucket will be scanned.
>
> Your query can be optimized, but currently it is not. Can you file a JIRA
> for that?
> It will help us prioritize this.
>
>
>
> -namit
>
>
>
> On 7/16/09 3:25 AM, "Deepak A" <deepa...@gmail.com> wrote:
>
> Hi,
>
> I have the following table in Hive
> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
> INTO 256 BUCKETS;
>
> Since the data is hash partitioned based on the 'UserId' column, buckets
> were created based on the hash value of 'UserId'.
>
> Now, when I issue a Select query to fetch all the posts by a particular
> 'UserId' (say, Select count(Id) from Posts where UserId=1), does it scan
> only the bucket to which that 'UserId' is hashed? When I run this query,
> I can see all the buckets being searched for the UserId.
>
> Moreover, I see that there is a way to sample the table based on the
> buckets. Why can't Hive automatically figure out the bucket to which UserId
> is hashed and search only in that bucket?
>
> Can someone clarify this for me?
>
> Thanks,
> Deepak
>
>
>
>
>
>
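The optimization discussed in this thread (scanning only the bucket that a given UserId hashes to) can be sketched roughly as below. This is an illustration only, not Hive's code: the function names are hypothetical, and the hash rule (an integer key hashes to its own value, masked to be non-negative, then taken modulo the bucket count) is an assumption about how an integer clustering column might be bucketed.

```python
NUM_BUCKETS = 256  # bucket count from the Posts table definition in the thread


def bucket_for_user_id(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return a 0-based bucket index for an integer UserId.

    Assumption: the hash of an integer key is the value itself, masked to
    31 bits so the result is never negative.
    """
    hash_code = user_id  # assumed hash rule for integer keys
    return (hash_code & 0x7FFFFFFF) % num_buckets


def buckets_to_scan(user_id=None, num_buckets: int = NUM_BUCKETS) -> list:
    """List the bucket indexes a query would have to read.

    Without an equality filter on the clustering column, every bucket is
    read (the behavior reported in the thread). With such a filter, a
    pruning planner would only need the single matching bucket.
    """
    if user_id is None:
        return list(range(num_buckets))
    return [bucket_for_user_id(user_id, num_buckets)]


if __name__ == "__main__":
    # "where UserId=1" would touch one bucket instead of all 256.
    print(len(buckets_to_scan()), buckets_to_scan(user_id=1))
```

Under these assumptions, a filter such as UserId=1 narrows the scan from 256 buckets to one. Note that the indexes here are 0-based, whereas the sampling clause quoted earlier counts buckets from 1.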
