Re: Block Sampling

2012-06-15 Thread Carl Steinbach
Done!

On Fri, Jun 15, 2012 at 12:26 PM, Ladda, Anand wrote:

> Thanks Carl. Could you give me edit rights to the wiki
> (ala...@microstrategy.com) so I can update the sampling page with this info?
>
> *From:* Carl Steinbach [mailto:c...@cloudera.com]
> *Sent:* Friday, June 15, 2012 3:20 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Block Sampling
>
> Hi Anand,
>
> This feature was implemented in HIVE-2121 and appeared in Hive 0.8.0.
>
> Ref: https://issues.apache.org/jira/browse/HIVE-2121
>
> Thanks.
>
> Carl
>
> On Fri, Jun 15, 2012 at 11:59 AM, Ladda, Anand wrote:
>
> Has the block sampling feature been added to one of the latest (Hive 0.8
> or Hive 0.9) releases? The wiki has the blurb below on block sampling:
>
> *Block Sampling*
>
> It is a feature that is still on trunk and is not yet in any release
> version.
>
> block_sample: TABLESAMPLE (n PERCENT)
>
> This allows Hive to pick up at least n% of the data size (note that this
> does not necessarily mean n% of the rows) as input. Only
> CombineHiveInputFormat is supported, and some special compression formats
> are not handled. If sampling fails, the input of the MapReduce job will be
> the whole table/partition. Sampling is done at the HDFS block level, so the
> sampling granularity is the block size. For example, if the block size is
> 256MB, even if n% of the input size is only 100MB, you get 256MB of data.
>
> In the following example, 0.1% or more of the input size will be used for
> the query.
>
> SELECT *
> FROM source TABLESAMPLE(0.1 PERCENT) s;
>
> If you want to sample the same data with different blocks, you can change
> this seed number:
>
> set hive.sample.seednumber=;
>
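
The block-granularity arithmetic in the quoted wiki text (n% of 100MB still reads a full 256MB block) can be sketched roughly as follows. This is only a simplified model, not Hive's actual implementation: it assumes the sample is rounded up to whole HDFS blocks and capped at the table size, and the function name `estimated_sample_bytes` and its parameters are hypothetical.

```python
import math

MB = 1024 * 1024


def estimated_sample_bytes(total_bytes, percent, block_bytes):
    """Estimate how many bytes TABLESAMPLE(n PERCENT) reads.

    Simplified model (an assumption, not Hive's real code path):
    the requested n% is rounded up to a whole number of HDFS blocks,
    and the result can never exceed the size of the input itself.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if total_bytes <= 0 or block_bytes <= 0:
        raise ValueError("sizes must be positive")

    # Bytes the user asked for: n% of the input size.
    target = total_bytes * percent / 100
    # Sampling granularity is one block, so at least one block is read.
    blocks = max(1, math.ceil(target / block_bytes))
    return min(total_bytes, blocks * block_bytes)


if __name__ == "__main__":
    # 0.1% of a 100,000MB input is 100MB, but with 256MB blocks
    # the model reads one full block.
    size = estimated_sample_bytes(100_000 * MB, 0.1, 256 * MB)
    print(size // MB, "MB")  # prints: 256 MB
```

Under this model, the actual fraction sampled can be well above the requested n% on small tables or with large block sizes, which matches the "at least n%" wording in the blurb.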


RE: Block Sampling

2012-06-15 Thread Ladda, Anand
Thanks Carl. Could you give me edit rights to the wiki
(ala...@microstrategy.com) so I can update the sampling page with this info?





Re: Block Sampling

2012-06-15 Thread Carl Steinbach
Hi Anand,

This feature was implemented in HIVE-2121 and appeared in Hive 0.8.0.

Ref: https://issues.apache.org/jira/browse/HIVE-2121

Thanks.

Carl
