Hi Bejoy,

Thanks for your help. You are right, partitions would help. The problem is
that P_PARTKEY has 200000000 distinct values in my table, and I expect the
table to get bigger. I tried to issue the dynamic partition insert query
with many different configurations, but it fails every time.

I followed the instructions on the wiki, which suggest that if the number
of partitions (and therefore files) is very large, DISTRIBUTE BY on the
partition column should be used so that the partitioning is done on the
reducers. I also played with the number of reducers, the maximum number of
dynamic partitions per node, and related parameters, but I was not able to
create the files. At first I was getting errors about the maximum number of
files allowed, and once I fixed that I got errors related to the JVM heap
size or the garbage collector (for every heap size I tried).
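
For reference, the kind of insert I was attempting looked roughly like the
sketch below. The partitioned target table (partByKey here) and the exact
limit values are only illustrative -- I varied them between runs:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=200000000;
set hive.exec.max.dynamic.partitions.pernode=1000000;
set hive.exec.max.created.files=200000000;
set mapred.reduce.tasks=512;

-- partByKey is a hypothetical table created with
-- PARTITIONED BY (P_PARTKEY BIGINT); the partition column goes last
-- in the select list, and DISTRIBUTE BY pushes the partitioning work
-- onto the reducers
INSERT OVERWRITE TABLE partByKey PARTITION (P_PARTKEY)
SELECT P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER,
       P_RETAILPRICE, P_COMMENT, P_PARTKEY
FROM part
DISTRIBUTE BY P_PARTKEY;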

It seems that creating one file per partition doesn't work for that many
values. Merging the files into larger files would help, but to my
understanding this can only be done after the 200000000 files have been
created, not while creating the partitions.

Has anyone encountered this problem? Is there something I can do to create
partitions for that many values?

Thanks,
Avrilia



> Hi Avrilia,
>            In your SELECT query you need to append the following:
> TABLESAMPLE(BUCKET x OUT OF y). It specifies which buckets to scan and to
> get the data from. Only in the case of partitions, if you issue a select
> query (on a partition) without any hint, would just that partition be
> scanned. In the case of SAMPLING/BUCKETS you need to give the query a hint
> about which bucket to scan. You can adjust the values of 'x' and 'y' to
> scan more than one bucket in a query.
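>
> For example, something like the query below would scan only one of the 512
> buckets instead of the whole table (you would still have to pick the x that
> the value you are after hashes into; bucket 1 here is only illustrative):
>
> select * from partRC TABLESAMPLE(BUCKET 1 OUT OF 512 ON P_PARTKEY)
> where P_PARTKEY = 3;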
>        Just a suggestion from my end: from my understanding of the
> requirement you put forth, I feel PARTITIONS would be better than SAMPLING
> for you, as the query is centered around a particular value of a column.
> SAMPLING is good when you would like to do some operation on a small sample
> of the whole data; say you need an approximate average over the whole data,
> then you can use sampling to avoid scanning the whole table. With
> partitions, use dynamic partition insert to load data from the source table
> into the target table's partitions on the fly, as in the sketch below.
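>
> A minimal sketch of that pattern (table and column names here are just
> placeholders, not your actual schema):
>
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
>
> -- target_table is assumed to be declared PARTITIONED BY (part_col);
> -- the partition column comes last in the select list
> INSERT OVERWRITE TABLE target_table PARTITION (part_col)
> SELECT col1, col2, part_col FROM source_table;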
>
>
> Hope it helps!..
>
> Regards
> Bejoy.K.S
>
>
>
> ________________________________
> From: Avrilia Floratou <flora...@cs.wisc.edu>
> To: user@hive.apache.org
> Sent: Monday, October 10, 2011 7:13 AM
> Subject: Problem with query on bucketed table
>
> Hello,
>
> I have a question regarding the execution of some queries on bucketed
> tables.
>
> I've created a compressed bucketed table using the following statement:
>
> create external table partRC (P_PARTKEY BIGINT,P_NAME STRING, P_MFGR
> STRING, P_BRAND STRING, P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING,
> P_RETAILPRICE DOUBLE, P_COMMENT STRING)
> CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
> stored as RCFILE
> LOCATION '/tpch/partRC';
>
> I inserted all the data by reading from another table (PART) stored as
> textfile using the following script:
>
> set hive.enforce.bucketing = true;
> set hive.enforce.sorting = true;
> set mapred.output.compress=true;
> set
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> set hive.exec.compress.output=true;
>
>
> INSERT OVERWRITE TABLE partRC
> select * from part
> cluster by (P_PARTKEY);
>
> I verified that 512 buckets were created (512 reducers were used) and
> then I ran the following query:
>
> select * from partRC where P_PARTKEY = 3;
>
> I noticed that the whole table is scanned. Is this expected? I was
> expecting that only the bucket that contains the value 3 would be scanned
> (and actually not even the whole bucket -- since it is sorted). Am I doing
> something wrong here?
>
> Thanks,
> Avrilia

