[
https://issues.apache.org/jira/browse/HIVE-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198542#comment-13198542
]
xiaoyu wang commented on HIVE-2775:
-----------------------------------
{code}
index d0ff67e..bcddc5b 100644
@@ -349,7 +349,25 @@ public class Partition implements Serializable {
* we are just storing it as a property of the table as a short term measure.
*/
public int getBucketCount() {
- return table.getNumBuckets();
+ int logicalBucketNumber = table.getNumBuckets();
+ String pathPattern = this.getPartitionPath().toString() + "/*";
+ try {
+ FileSystem fs = FileSystem.get(this.table.getDataLocation(), Hive.get().getConf());
+ FileStatus srcs[] = fs.globStatus(new Path(pathPattern));
+ int physicalBucketNumber = srcs.length;
+ if ((physicalBucketNumber/logicalBucketNumber) * logicalBucketNumber == physicalBucketNumber){
+ return physicalBucketNumber;
+ } else {
+ throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName() +
+ " logical bucket is " + logicalBucketNumber + " physical bucket number is " + physicalBucketNumber);
+ }
+ }catch (Exception e)
+ {
+ throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName(), e);
+ }
+
+
+// return table.getNumBuckets();
/*
* TODO: Keeping this code around for later use when we will support
* sampling on tables which are not created with CLUSTERED INTO clause
{code}
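The divisibility test in the patch can be isolated from the file-system lookup. The following is only a sketch under assumptions: the class `BucketCountCheck` and the method `resolveBucketCount` are hypothetical names, not part of Hive, and the caller is assumed to have already obtained the file count (for example from `globStatus`). It uses `%` in place of the divide-and-multiply comparison, rejects a non-positive logical bucket count up front instead of relying on an `ArithmeticException`, and throws the mismatch error outside any `catch (Exception e)` so it is not wrapped a second time, as happens in the patch.

```java
public final class BucketCountCheck {
    private BucketCountCheck() {}

    /**
     * Hypothetical helper: returns the physical file count when it is a
     * multiple of the logical bucket count, and throws otherwise.
     * An empty partition (0 files) yields 0, matching the patch's arithmetic.
     */
    public static int resolveBucketCount(String tableName,
                                         int logicalBuckets,
                                         int physicalFiles) {
        if (logicalBuckets <= 0) {
            throw new IllegalArgumentException(
                "Table " + tableName + " has no positive logical bucket count: "
                    + logicalBuckets);
        }
        if (physicalFiles < 0 || physicalFiles % logicalBuckets != 0) {
            throw new IllegalStateException(
                "Cannot get bucket count for table " + tableName
                    + ": logical bucket number is " + logicalBuckets
                    + ", physical bucket number is " + physicalFiles);
        }
        return physicalFiles;
    }
}
```

With 4 logical buckets, 8 files are accepted and 8 is returned, while 6 files raise an `IllegalStateException`.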
> allow the number of files to be a multiple of bucketed table
> ------------------------------------------------------------
>
> Key: HIVE-2775
> URL: https://issues.apache.org/jira/browse/HIVE-2775
> Project: Hive
> Issue Type: New Feature
> Components: Metastore
> Reporter: xiaoyu wang
>
> Currently, a Hive bucketed table requires the number of files to match the
> bucket number in order for sampling to be correct. This is very restrictive:
> e.g. we can only populate the table using a fixed number of reducers, which
> can be a bottleneck.
> The idea is to introduce the concepts of "physical bucket" and "logical
> bucket". "Physical bucket" is the number of files and "logical bucket" is the
> number of buckets stored in the metadata for the bucketed table. By allowing
> "physical bucket" to be a multiple of "logical bucket", we can do correct
> sampling as well as scale up.
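Why a multiple is enough for correct sampling can be illustrated with a small sketch. It rests on an assumption the issue does not state: that each row lands in the file with 0-based index `hash mod P` for a non-negative hash, where `P` is the physical bucket count. If `P` is a multiple of the logical bucket count `L`, then `(hash mod P) mod L == hash mod L`, so file `i` holds only rows of logical bucket `i mod L`. The class and method names below are hypothetical and not part of Hive.

```java
import java.util.ArrayList;
import java.util.List;

public final class LogicalBucketMapping {
    private LogicalBucketMapping() {}

    /** Logical bucket that a 0-based physical file index belongs to. */
    public static int logicalBucketOf(int physicalIndex, int logicalBuckets) {
        return physicalIndex % logicalBuckets;
    }

    /** All 0-based physical file indices that make up one logical bucket. */
    public static List<Integer> filesForLogicalBucket(int logicalBucket,
                                                      int logicalBuckets,
                                                      int physicalFiles) {
        if (logicalBuckets <= 0 || physicalFiles % logicalBuckets != 0) {
            throw new IllegalArgumentException(
                "physical file count must be a multiple of the logical bucket count");
        }
        List<Integer> files = new ArrayList<>();
        // Step through the files in strides of the logical bucket count.
        for (int i = logicalBucket; i < physicalFiles; i += logicalBuckets) {
            files.add(i);
        }
        return files;
    }
}
```

For example, with 4 logical buckets and 8 files, `filesForLogicalBucket(1, 4, 8)` returns `[1, 5]`: sampling logical bucket 1 would read both files.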