Mark, Aside from what might be wrong here, isn't it true that sampling with the bucket clause still works on non-bucketed tables; it's just inefficient because it still scans the whole table? Or am I an idiot? ;)
dean On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover <grover.markgro...@gmail.com>wrote: > Hi Robert, > Sampling in Hive is based on buckets. Therefore, you table needs to be > appropriately bucketed. > > I would recommend storing the results of your inner query in a bucketed > table. See how to populate a bucketed table at > https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html > > Then you will be able to be sample through it. > > If the predicate is on partition column, you may be able to get around the > intermediate table requirement but in general, as far as I know, > intermediate bucketed table might be the only choice. > > Mark > > On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <robert...@kontagent.com>wrote: > >> Hi Everyone >> >> I'm trying to use the TABLESAMPLE function to sample data, however it's a >> little more complicated and I am having trouble getting it to run. >> >> I know that this works fine and it will give me about 25% of the whole >> dataset >> >> select distinct s >> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) >> where month <= 201211 >> >> However, in my situation I need to do a TABLESAMPLE on an outer query, a >> simple example is >> >> *select mytest.s * >> *from * >> * (select distinct s from testtable where month <= 201211)mytest* >> >> or something like >> >> *select table1.s* >> *from * >> * (select distinct s from testtable)table1* >> * join* >> * (select distinct s from test2table)table2* >> * on table1.s=table2.s* >> >> >> How do I use TABLESAMPLE in this case to sample the results of the outer >> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various >> places of my query but it always returns some sort of syntax error and thus >> not allowing the query to run. >> >> Any help is appreciated. >> >> Robert >> ** >> > > -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330