cc: user@hive.apache.org -- Lefty
On Mon, Dec 28, 2015 at 11:00 PM, Varadharajan Mukundan < srinath...@gmail.com> wrote: > Hi All, > > Say i have a table with below schema: > > CREATE TABLE foo (id INT) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC; > > and when we issue the following query, its doing a "Full table scan" > > SELECT * FROM foo WHERE id=<some integer> > > After doing some searching on the net, i found that "table sample" seems to > be one of the ways to resolve this and the query would be written in this > manner: > > SELECT * FROM foo TABLE SAMPLE(BUCKET <bucket id of some_integer: say 2> of > 8) where id=<some_integer> > > I was expecting the above query to read only the 2nd bucket but to my > surprise it did full table scan again. I understand that partitioning in > hive is the way to go for such queries by i have two points on why we need > such filtering techniques for buckets as well. > > 1. Partitioning may not be suitable / preferred when there are lots of > partitions (high cardinality columns) > 2. Buckets just works out of the box, without specifying things like > partitioning keys etc.. in the queries like "insert clauses". > > I was wondering if there are any technical constraints on why we were not > able to restrict the scan only to that bucket for such pointed queries? > > -- > Thanks, > M. Varadharajan > > ------------------------------------------------ > > "Experience is what you get when you didn't get what you wanted" > -By Prof. Randy Pausch in "The Last Lecture" > > My Journal :- http://varadharajan.in >