cc: user@hive.apache.org
-- Lefty
On Mon, Dec 28, 2015 at 11:00 PM, Varadharajan Mukundan <
srinath...@gmail.com> wrote:
> Hi All,
>
> Say i have a table with below schema:
>
> CREATE TABLE foo (id INT) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC;
>
> and when we issue the following query, its doing a "Full table scan"
>
> SELECT * FROM foo WHERE id=
>
> After doing some searching on the net, i found that "table sample" seems to
> be one of the ways to resolve this and the query would be written in this
> manner:
>
> SELECT * FROM foo TABLE SAMPLE(BUCKET of
> 8) where id=
>
> I was expecting the above query to read only the 2nd bucket but to my
> surprise it did full table scan again. I understand that partitioning in
> hive is the way to go for such queries by i have two points on why we need
> such filtering techniques for buckets as well.
>
> 1. Partitioning may not be suitable / preferred when there are lots of
> partitions (high cardinality columns)
> 2. Buckets just works out of the box, without specifying things like
> partitioning keys etc.. in the queries like "insert clauses".
>
> I was wondering if there are any technical constraints on why we were not
> able to restrict the scan only to that bucket for such pointed queries?
>
> --
> Thanks,
> M. Varadharajan
>
>
>
> "Experience is what you get when you didn't get what you wanted"
>-By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>