Hi Avrilia,
         If you have 200M partitions then your data volume should be really
huge as well. When partitioning, make sure each partition holds a sufficiently
large volume of data. Based on your data and the queries you expect to run most
often, choose a partitioning column that guarantees a sufficiently large volume
of data in each partition. It is very important.
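
         For instance, something like the following could be a starting point
(an illustrative sketch only -- the table name partP, the choice of P_MFGR as
the partition column and the location are just my assumptions; pick whatever
column fits your data and queries, and note that the partition column is left
out of the regular column list):

create external table partP (P_PARTKEY BIGINT, P_NAME STRING, P_BRAND STRING,
P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING, P_RETAILPRICE DOUBLE,
P_COMMENT STRING)
PARTITIONED BY (P_MFGR STRING)
STORED AS RCFILE
LOCATION '/tpch/partP';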

        Honestly, I haven't tried with that many partitions. But for reasonably
large numbers it did work just by including DISTRIBUTE BY. What steps did you
take to try to overcome the error caused by the large number of files created?
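
        For reference, when it worked for me the statement was roughly of this
form (again only a sketch, reusing the assumed partP table and P_MFGR
partition column from above):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE partP PARTITION (P_MFGR)
select P_PARTKEY, P_NAME, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER,
P_RETAILPRICE, P_COMMENT, P_MFGR
from part
distribute by P_MFGR;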

Regards
Bejoy.K.S



________________________________
From: Avrilia Floratou <flora...@cs.wisc.edu>
To: Bejoy Ks <bejoy...@yahoo.com>
Cc: "user@hive.apache.org" <user@hive.apache.org>
Sent: Monday, October 10, 2011 10:28 AM
Subject: Re: Problem with query on bucketed table

Hi Bejoy,

Thanks for your help. You are right, partitions would help. The thing is
that P_PARTKEY has 200,000,000 distinct values in my table, and I expect the
table to get bigger. I tried to issue the dynamic partition insert query
with many configurations, but it fails every time.

I followed the instructions on the wiki, which suggest that if the number
of files is very large, DISTRIBUTE BY should be used so that the partitioning
is done on the reducers. I also played with the number of reducers, the number
of dynamic partitions per node, and other related parameters, but I was not
able to create the files. I was getting errors about the maximum number of
files allowed, and once I fixed that I got errors related to the JVM heap size
or the garbage collector (for all the heap sizes I tried).
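
Roughly, the settings I was varying were of this form (the values shown here
are placeholders rather than the exact ones I used):

set mapred.reduce.tasks=512;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.max.created.files=1000000;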

It seems that creating one file per partition doesn't work for that many
values. Merging the files into larger files would help, but to my
understanding this can only be done after the 200,000,000 files have been
created, not while creating the partitions.

Has anyone encountered this problem? Is there something I can do to create
partitions for that many values?

Thanks,
Avrilia



> Hi Avrilia,
>            In your SELECT query you need to append TABLESAMPLE(BUCKET x OUT
> OF y); it specifies which buckets to scan and get the data from. Only in the
> case of partitions, if you issue a select query (on a partition) without any
> hint, just that partition would be scanned. In the case of SAMPLING/BUCKETS
> you need to give the query a hint about which bucket to scan. You can adjust
> the values of 'x' and 'y' to scan more than one bucket in a query.
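>
> For instance, something along these lines (an illustrative sketch only; the
> bucket number here is arbitrary, and you would still need to work out which
> bucket a given key value actually hashes to):
>
> SELECT * FROM partRC TABLESAMPLE(BUCKET 1 OUT OF 512 ON P_PARTKEY) t;
>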
>        Just a suggestion from my end: from my understanding of the
> requirement you put forth, I feel PARTITIONS would be better than SAMPLING
> for you, as the query is centered around a particular value of a column.
> SAMPLING is good when you want to run some operation on a small sample of
> the whole data; say you need an approximate average of the whole data, then
> you can use sampling to avoid scanning the whole table. With partitions, use
> dynamic partitions to load data from the source table into the target table,
> creating the partitions on the fly.
>
>
> Hope it helps!..
>
> Regards
> Bejoy.K.S
>
>
>
> ________________________________
> From: Avrilia Floratou <flora...@cs.wisc.edu>
> To: user@hive.apache.org
> Sent: Monday, October 10, 2011 7:13 AM
> Subject: Problem with query on bucketed table
>
> Hello,
>
> I have a question regarding the execution of some queries on bucketed
> tables.
>
> I've created a compressed bucketed table using the following statement:
>
> create external table partRC (P_PARTKEY BIGINT,P_NAME STRING, P_MFGR
> STRING, P_BRAND STRING, P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING,
> P_RETAILPRICE DOUBLE, P_COMMENT STRING)
> CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
> stored as RCFILE
> LOCATION '/tpch/partRC';
>
> I inserted all the data by reading from another table (PART) stored as
> textfile using the following script:
>
> set hive.enforce.bucketing = true;
> set hive.enforce.sorting = true;
> set mapred.output.compress=true;
> set
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> set hive.exec.compress.output=true;
>
>
> INSERT OVERWRITE TABLE partRC
> select * from part
> cluster by (P_PARTKEY);
>
> I verified that 512 buckets were created (512 reducers were used) and
> then I ran the following selection:
>
> select * from partRC where P_PARTKEY = 3;
>
> I noticed that the whole table is scanned. Is this expected? I was
> expecting that only the bucket containing the value 3 would be scanned
> (and actually not even the whole bucket -- since it is sorted). Am I doing
> something wrong here?
>
> Thanks,
> Avrilia
