You may use percent based (block sampling) sampling for non-bucketed tables, though there are some restrictions.
https://cwiki.apache.org/Hive/languagemanual-sampling.html Regards, Ramki. On Wed, Mar 20, 2013 at 12:27 PM, Mark Grover <grover.markgro...@gmail.com>wrote: > Hey Dean, > I am not a power user of the sampling feature but my understanding was > that sampling in Hive only works on bucketed tables. I am happy to be > corrected though. > > Mark > > > On Wed, Mar 20, 2013 at 12:20 PM, Dean Wampler < > dean.wamp...@thinkbiganalytics.com> wrote: > >> Mark, >> >> Aside from what might be wrong here, isn't it true that sampling with the >> bucket clause still works on non-bucketed tables; it's just inefficient >> because it still scans the whole table? Or am I an idiot? ;) >> >> dean >> >> On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover <grover.markgro...@gmail.com >> > wrote: >> >>> Hi Robert, >>> Sampling in Hive is based on buckets. Therefore, you table needs to be >>> appropriately bucketed. >>> >>> I would recommend storing the results of your inner query in a bucketed >>> table. See how to populate a bucketed table at >>> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html >>> >>> Then you will be able to be sample through it. >>> >>> If the predicate is on partition column, you may be able to get around >>> the intermediate table requirement but in general, as far as I know, >>> intermediate bucketed table might be the only choice. >>> >>> Mark >>> >>> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <robert...@kontagent.com>wrote: >>> >>>> Hi Everyone >>>> >>>> I'm trying to use the TABLESAMPLE function to sample data, however it's >>>> a little more complicated and I am having trouble getting it to run. >>>> >>>> I know that this works fine and it will give me about 25% of the whole >>>> dataset >>>> >>>> select distinct s >>>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) >>>> where month <= 201211 >>>> >>>> However, in my situation I need to do a TABLESAMPLE on an outer query, >>>> a simple example is >>>> >>>> *select mytest.s * >>>> *from * >>>> * (select distinct s from testtable where month <= 201211)mytest* >>>> >>>> or something like >>>> >>>> *select table1.s* >>>> *from * >>>> * (select distinct s from testtable)table1* >>>> * join* >>>> * (select distinct s from test2table)table2* >>>> * on table1.s=table2.s* >>>> >>>> >>>> How do I use TABLESAMPLE in this case to sample the results of the >>>> outer query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various >>>> places of my query but it always returns some sort of syntax error and thus >>>> not allowing the query to run. >>>> >>>> Any help is appreciated. >>>> >>>> Robert >>>> ** >>>> >>> >>> >> >> >> -- >> *Dean Wampler, Ph.D.* >> thinkbiganalytics.com >> +1-312-339-1330 >> >> > >