Unsubscribe

2018-10-16 Thread Richard A. Bross
Unsubscribe

- Original Message -
From: "Kevin Risden" 
To: user@hive.apache.org
Sent: Tuesday, October 16, 2018 9:44:17 AM
Subject: Re: HiveServer2 performance references?


Thanks for the responses. I'll wait to test with HIVE-20621 to see how much 
that helps. For HIVE-17194, it doesn't look like that is documented outside of 
JIRA currently. It looks like I may have picked up over-the-wire compression by 
default; I had assumed it wasn't compressed. 



Kevin Risden 




On Tue, Oct 16, 2018 at 12:44 AM Gopal Vijayaraghavan < gop...@apache.org > 
wrote: 






Hi, 


> I was looking at HiveServer2 performance going through Knox in KNOX-1524 and 
> found that HTTP mode is significantly slower. 



HTTP mode did re-auth for every row before HIVE-20621 was fixed – Knox 
should be doing cookie-auth to prevent ActiveDirectory/LDAP from throttling 
this. 

I fixed it for the open-source JDBC drivers, but I think the proprietary ODBC 
drivers are still broken (the C++ code doesn’t save cookies). 
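To make the cost concrete, here is a minimal sketch, assuming a toy in-process simulation (the `FakeLdap` and `HttpThriftServer` classes are invented stand-ins, not Knox or HiveServer2 code). It shows why saving the session cookie matters: a client that discards cookies triggers one LDAP bind per fetch, while a client that replays the cookie binds once.

```python
class FakeLdap:
    """Stand-in for an ActiveDirectory/LDAP backend that counts binds."""
    def __init__(self):
        self.binds = 0

    def bind(self, user, password):
        self.binds += 1
        return password == "secret"


class HttpThriftServer:
    """Toy HTTP-mode server: authenticates every request unless a valid
    session cookie is presented (what cookie-auth enables)."""
    def __init__(self, ldap):
        self.ldap = ldap
        self.sessions = set()

    def fetch_rows(self, user, password, cookie=None):
        if cookie in self.sessions:
            return cookie                       # valid cookie: no re-auth
        if not self.ldap.bind(user, password):  # otherwise: LDAP bind per call
            raise PermissionError("auth failed")
        new_cookie = f"hive.server2.auth={user}-{len(self.sessions)}"
        self.sessions.add(new_cookie)
        return new_cookie


ldap = FakeLdap()
server = HttpThriftServer(ldap)

# Client that discards cookies (the broken driver behavior): 100 binds.
for _ in range(100):
    server.fetch_rows("kevin", "secret", cookie=None)
naive_binds = ldap.binds

# Client that saves and replays the cookie: only 1 additional bind.
cookie = None
for _ in range(100):
    cookie = server.fetch_rows("kevin", "secret", cookie=cookie)

print(naive_binds, ldap.binds - naive_binds)  # 100 1
```

The throttling Gopal describes follows directly: without cookie reuse, row fetches scale the bind rate linearly.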

> I also didn't see any information about reducing amount of data over the wire 
> with compression. 

It’s possible Knox doesn’t generate/forward the Accept-Encoding headers. 



https://issues.apache.org/jira/browse/HIVE-17194 
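As a rough illustration of what honoring `Accept-Encoding: gzip` buys (a sketch with a made-up repetitive payload, not a real HiveServer2 response): row-oriented text results compress dramatically because column values repeat.

```python
import gzip

# Simulated query result: many rows with highly repetitive column values.
rows = "\n".join(
    "2018-10-16\tsome_repeated_column_value\t42" for _ in range(10_000)
)
raw = rows.encode()

# What a gzip Content-Encoding response body would carry on the wire.
compressed = gzip.compress(raw)

print(len(raw), len(compressed))  # compressed is a small fraction of raw
```

Whether this helps end to end depends on the proxy forwarding the `Accept-Encoding` header and the server honoring it, which is exactly the Knox question above.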


Cheers, 

Gopal 




From: Prasanth Jayachandran < pjayachand...@hortonworks.com > 
Reply-To: " user@hive.apache.org " < user@hive.apache.org > 
Date: Monday, October 15, 2018 at 12:34 PM 
To: " user@hive.apache.org " < user@hive.apache.org >, " user@hive.apache.org " 
< user@hive.apache.org > 
Subject: Re: HiveServer2 performance references? 







One related performance fix that landed recently is HIVE-20621 






Thanks 


Prasanth 






On Mon, Oct 15, 2018 at 12:11 PM -0700, " 



Kevin Risden " < kris...@apache.org > wrote: 









Are there any resources on HiveServer2 performance - specifically binary vs 
HTTP mode? I would also be interested in any information about enabling 
compression between HiveServer2 and a client like beeline. 





I was looking at HiveServer2 performance going through Knox in KNOX-1524 and 
found that HTTP mode is significantly slower. I searched and didn't find any 
references to performance of HiveServer2 out of the box. I also didn't see any 
information about reducing the amount of data over the wire with compression. 





Any pointers would be very helpful. Thanks! 







Kevin Risden


Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

2018-04-12 Thread Richard A. Bross
To be honest, my advice was just based on your post - we only use Hive in AWS 
EMR, so I couldn't tell you.

Glad that at least you're back up though.

- Original Message -
From: "Anup Tiwari" 
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:50:23 AM
Subject: Re: Unable to read hive external table data which is linked to s3 
after upgradation from 2.1.1 to 2.3.3








Hi Richard , 

After looking at the Hive 2.3.3 logs, I found that when all configuration 
parameters are loaded, the message below is printed :- 

Values omitted for security reason if present: [fs.s3n.awsAccessKeyId, 
fs.s3a.access.key, fs.s3.awsAccessKeyId, hive.server2.keystore.password, 
fs.s3a.proxy.password, javax.jdo.option.ConnectionPassword, 
fs.s3.awsSecretAccessKey, fs.s3n.awsSecretAccessKey, fs.s3a.secret.key] 


while in the Hive 2.1.1 logs I found the message below :- 

Values omitted for security reason if present: [hive.server2.keystore.password, 
javax.jdo.option.ConnectionPassword] 

Can this be the reason why Hive 2.3.3 is not able to read the s3-related params 
from hive-site.xml? 

I found a related JIRA : https://issues.apache.org/jira/browse/HIVE-14588 



Also, I have set the property below in hive-site.xml, excluding the s3 
variables above (fs.s3a.access.key, fs.s3a.secret.key) from the list, and it worked. 

<property>
  <name>hive.conf.hidden.list</name>
  <value>javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password</value>
  <description>Comma separated list of configuration options which should not be
  read by normal user like passwords. Anup has excluded 2 variables, which were
  fs.s3a.access.key,fs.s3a.secret.key</description>
</property>

Let me know if there is any other solution, because if these variables 
are part of hidden.list by default, I would expect there to be some other proper 
workaround for this. 
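The behavior Anup is describing can be sketched as follows (a simplified model of how a `hive.conf.hidden.list`-style setting blanks values before the configuration is exposed; this is illustrative Python, not Hive's actual HiveConf code, and the key values are invented):

```python
# Hidden list after removing fs.s3a.access.key and fs.s3a.secret.key,
# as in the hive-site.xml workaround above.
hidden_list = (
    "javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,"
    "fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,"
    "fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password"
)


def strip_hidden(conf, hidden):
    """Return a view of the config with hidden keys blanked out."""
    hidden_keys = set(hidden.split(","))
    return {k: ("" if k in hidden_keys else v) for k, v in conf.items()}


conf = {
    "fs.s3a.access.key": "AKIA_EXAMPLE",        # hypothetical value
    "fs.s3a.secret.key": "example-secret",      # hypothetical value
    "fs.s3a.proxy.password": "proxy-pass",      # still on the hidden list
    "fs.s3a.connection.maximum": "100",
}
stripped = strip_hidden(conf, hidden_list)

print(stripped["fs.s3a.access.key"])        # survives: not on the list anymore
print(stripped["fs.s3a.proxy.password"])    # blanked: still hidden
```

Under this model, any component that reads the stripped view rather than the raw configuration would see empty S3 credentials, which matches the symptom.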









Regards, 
Anup Tiwari 


On Thu, Apr 12, 2018 at 7:44 PM, Richard A. Bross < r...@oaktreepeak.com > 
wrote: 


I hear you, but given the exception log, it does seem that it can't 
authenticate you. You can try using the AWS environment variables. If that 
resolves the issue then you'll have some more to go on. According to 
Hortonworks here: 

https://hortonworks.github.io/hdp-aws/s3-security/ 

"AWS CLI supports authentication through environment variables. These same 
environment variables will be used by Hadoop if no configuration properties are 
set." 
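The fallback order quoted above can be sketched like this (a simplified illustration of "config first, then environment variables"; it is not the actual hadoop-aws provider chain, and the credential values are placeholders):

```python
import os


def resolve_s3_credentials(conf):
    """Resolve credentials: explicit config properties win, then the
    AWS CLI environment variables, then fail like the provider chain."""
    key = conf.get("fs.s3a.access.key")
    secret = conf.get("fs.s3a.secret.key")
    if key and secret:
        return key, secret, "config"

    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return key, secret, "environment"

    raise RuntimeError("No AWS credentials provided")


# Simulate the suggested test: no usable fs.s3a.* properties, but the
# AWS environment variables are set (placeholder values).
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA_EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret"

key, secret, source = resolve_s3_credentials({})
print(source)  # environment
```

If the job succeeds with environment variables but fails with hive-site.xml properties, that isolates the problem to how Hive is (not) passing the configured values through.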


- Original Message - 
From: "Anup Tiwari" < anupsdtiw...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 12, 2018 10:06:33 AM 
Subject: Re: Unable to read hive external table data which is linked to s3 
after upgradation from 2.1.1 to 2.3.3 


We are not using EMR. Also, we have set the params below for accessing the s3 
bucket in hive-site.xml, which are the same as what we set in Hive 2.1.1. 


* fs.s3a.access.key 
* fs.s3a.secret.key 
* fs.s3a.connection.maximum 
* fs.s3a.impl 








Regards, 
Anup Tiwari 


On Thu, Apr 12, 2018 at 7:19 PM, Richard A. Bross < r...@oaktreepeak.com > 
wrote: 


Based on the exception, it looks more like an AWS credentials issue than a Hive 
issue. Are you running in AWS EMR, on-prem? 

In AWS the resource accessing the S3 bucket would have to have an IAM that gave 
permission. If you are running somewhere else whatever AWS login you use would 
have to have the correct permissions in the IAM. 



- Original Message - 
From: "Anup Tiwari" < anupsdtiw...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 12, 2018 9:11:37 AM 
Subject: Unable to read hive external table data which is linked to s3 after 
upgradation from 2.1.1 to 2.3.3 



Hi All, 

When I try to read an s3-linked external table in 2.3.3, I get errors. 
It was working properly in 2.1.1. Please find the details below and let me 
know if I am missing something :- 


Hadoop Version :- 2.8.0 

Query :- 

select log_date,count(1) as cnt from test.tt1 group by log_date; 

Error :- 

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, 
diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, 
vertex=vertex_1523502631429_0029_3_00 [Map 1], 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials 
provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS 
Credentials provided by BasicAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider 
: com.amazonaws.AmazonClientException: The requested metadata is

Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

2018-04-12 Thread Richard A. Bross
I hear you, but given the exception log, it does seem that it can't 
authenticate you.  You can try using the AWS environment variables.  If that 
resolves the issue then you'll have some more to go on.  According to 
Hortonworks here:

https://hortonworks.github.io/hdp-aws/s3-security/

"AWS CLI supports authentication through environment variables. These same 
environment variables will be used by Hadoop if no configuration properties are 
set."


- Original Message -
From: "Anup Tiwari" 
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:06:33 AM
Subject: Re: Unable to read hive external table data which is linked to s3 
after upgradation from 2.1.1 to 2.3.3


We are not using EMR. Also we have set below params for accessing s3 bucket in 
hive-site.xml which are same as what we have set in hive 2.1.1. 


* fs.s3a.access.key 
* fs.s3a.secret.key 
* fs.s3a.connection.maximum 
* fs.s3a.impl 






Regards, 
Anup Tiwari 


On Thu, Apr 12, 2018 at 7:19 PM, Richard A. Bross < r...@oaktreepeak.com > 
wrote: 


Based on the exception, it looks more like an AWS credentials issue than a Hive 
issue. Are you running in AWS EMR, on-prem? 

In AWS the resource accessing the S3 bucket would have to have an IAM that gave 
permission. If you are running somewhere else whatever AWS login you use would 
have to have the correct permissions in the IAM. 



- Original Message - 
From: "Anup Tiwari" < anupsdtiw...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 12, 2018 9:11:37 AM 
Subject: Unable to read hive external table data which is linked to s3 after 
upgradation from 2.1.1 to 2.3.3 



Hi All, 

When I try to read an s3-linked external table in 2.3.3, I get errors. 
It was working properly in 2.1.1. Please find the details below and let me 
know if I am missing something :- 


Hadoop Version :- 2.8.0 

Query :- 

select log_date,count(1) as cnt from test.tt1 group by log_date; 

Error :- 

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, 
diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, 
vertex=vertex_1523502631429_0029_3_00 [Map 1], 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials 
provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS 
Credentials provided by BasicAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider 
: com.amazonaws.AmazonClientException: The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128) 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
 
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236) 
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811) 
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) 
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848) 
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830) 
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) 
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) 
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442) 
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561) 
at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
a

Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

2018-04-12 Thread Richard A. Bross
Based on the exception, it looks more like an AWS credentials issue than a Hive 
issue.  Are you running in AWS EMR, on-prem?

In AWS the resource accessing the S3 bucket would have to have an IAM that gave 
permission.  If you are running somewhere else whatever AWS login you use would 
have to have the correct permissions in the IAM.

- Original Message -
From: "Anup Tiwari" 
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 9:11:37 AM
Subject: Unable to read hive external table data which is linked to s3 after 
upgradation from 2.1.1 to 2.3.3



Hi All, 

When I try to read an s3-linked external table in 2.3.3, I get errors. 
It was working properly in 2.1.1. Please find the details below and let me 
know if I am missing something :- 


Hadoop Version :- 2.8.0 

Query :- 

select log_date,count(1) as cnt from test.tt1 group by log_date; 

Error :- 

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, 
diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, 
vertex=vertex_1523502631429_0029_3_00 [Map 1], 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials 
provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS 
Credentials provided by BasicAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider 
: com.amazonaws.AmazonClientException: The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128) 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
 
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236) 
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811) 
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) 
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848) 
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830) 
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) 
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) 
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
 
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
 
at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: com.amazonaws.AmazonClientException: No AWS Credentials provided by 
BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
at 
org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:151)
 
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779) 
at 
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107) 
at 
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
 
... 24 more 
Caused by: com.amazonaws.AmazonClientException: The requested metadata is not 
found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
a

Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
I can't duplicate the issue now.  Works like it always has...

- Original Message -
From: "Furcy Pin" 
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:57:39 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3


Indeed. 


If I remember correctly, S3 does not really have the concept of a "folder" like 
HDFS has, and Hive sort of makes up for it by creating a descriptor file 
where the partition "folder" is supposed to be. Maybe this is what is missing 
here. 


Perhaps you could try doing a "MSCK REPAIR TABLE tablename" to make sure that 
the partitions are correctly loaded and then try again dropping that particular 
partition? 


Or look at your s3 folder if you see any such "partition folder file" and check 
if it is missing for this particular partition? 
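Checking what actually sits under the partition prefix is a quick way to spot the stray file. A minimal sketch (the key names are hypothetical; a real check would use an S3 client such as `aws s3 ls` against the table's location):

```python
# Hypothetical object keys under the table location, including one file
# uploaded manually outside of Hive.
keys = [
    "warehouse/api_measurements/daydate=2018-04-04/epoch=1522876500/000000_0",
    "warehouse/api_measurements/daydate=2018-04-04/epoch=1522876500/manual_upload.csv",
    "warehouse/api_measurements/daydate=2018-04-04/epoch=1522877400/000000_0",
]


def partition_prefix(table_root, spec):
    """Build the key prefix Hive uses for a partition spec."""
    return table_root + "/" + "/".join(f"{k}={v}" for k, v in spec.items()) + "/"


prefix = partition_prefix(
    "warehouse/api_measurements",
    {"daydate": "2018-04-04", "epoch": "1522876500"},
)
in_partition = [k for k in keys if k.startswith(prefix)]
print(in_partition)  # both the Hive-written file and the manual upload
```

Anything under the prefix that Hive did not write (like the `manual_upload.csv` placeholder here) is a candidate explanation for data surviving the drop.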




On 5 April 2018 at 15:40, Richard A. Bross < r...@oaktreepeak.com > wrote: 


Leaving the column list out, here you go: 

# Detailed Table Information 
Database: default 
Owner: hadoop 
CreateTime: Thu Apr 05 13:24:33 UTC 2018 
LastAccessTime: UNKNOWN 
Retention: 0 
Location: s3://zoomi-proto-warehouse-measurements/ 
Table Type: MANAGED_TABLE 
Table Parameters: 
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} 
numFiles 10 
numPartitions 7 
numRows 153 
orc.compress ZLIB 
orc.create.index true 
rawDataSize 113563 
totalSize 37801 
transient_lastDdlTime 1522934673 

# Storage Information 
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde 
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
Compressed: No 
Num Buckets: 61 
Bucket Columns: [crs_id] 
Sort Columns: [] 
Storage Desc Params: 
serialization.format 1 
Time taken: 0.467 seconds, Fetched: 98 row(s) 


- Original Message - 
From: "Furcy Pin" < pin.fu...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:21:06 AM 


Subject: Re: ALTER TABLE DROP PARTITION not working on S3 


Hi Richard, 


could you please check if your table is EXTERNAL? 
You can see it with a "DESCRIBE FORMATTED table_name ;" 


That's what external tables are for, they don't delete underlying data when you 
drop them. 




On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote: 


I think that someone put a file in there manually. Would that prevent Hive from 
dropping the partition? I also did a "drop table" and the s3 object keys 
persisted. 



- Original Message - 
From: "Richard A. Bross" < r...@oaktreepeak.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:14:52 AM 
Subject: ALTER TABLE DROP PARTITION not working on S3 

Hi, 

I have a Hive managed table on S3, "api_measurements". I've tried dropping a 
partition like so: 

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500); 
Dropped the partition daydate=2018-04-04/epoch=1522876500 
OK 
Time taken: 2.109 seconds 

Yet the data is still on S3. Because object keys on S3 are always strings, I 
also tried this: 

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500'); 
OK 
Time taken: 0.135 seconds 

Yet the object keys and data are still there. I assume that I'm missing 
something really simple. Can anyone shed some light on this? 

Thanks 





Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
Leaving the column list out, here you go:

# Detailed Table Information 
Database:   default  
Owner:  hadoop   
CreateTime: Thu Apr 05 13:24:33 UTC 2018 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   s3://zoomi-proto-warehouse-measurements/ 
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles10  
numPartitions   7   
numRows 153 
orc.compressZLIB
orc.create.indextrue
rawDataSize 113563  
totalSize   37801   
transient_lastDdlTime   1522934673  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No   
Num Buckets:61   
Bucket Columns: [crs_id] 
Sort Columns:   []   
Storage Desc Params: 
serialization.format1   
Time taken: 0.467 seconds, Fetched: 98 row(s)


- Original Message -
From: "Furcy Pin" 
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3


Hi Richard, 


could you please check if your table is EXTERNAL? 
You can see it with a "DESCRIBE FORMATTED table_name ;" 


That's what external tables are for, they don't delete underlying data when you 
drop them. 




On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote: 


I think that someone put a file in there manually. Would that prevent Hive from 
dropping the partition? I also did a "drop table" and the s3 object keys 
persisted. 



- Original Message - 
From: "Richard A. Bross" < r...@oaktreepeak.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:14:52 AM 
Subject: ALTER TABLE DROP PARTITION not working on S3 

Hi, 

I have a Hive managed table on S3, "api_measurements". I've tried dropping a 
partition like so: 

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500); 
Dropped the partition daydate=2018-04-04/epoch=1522876500 
OK 
Time taken: 2.109 seconds 

Yet the data is still on S3. Because object keys on S3 are always strings, I 
also tried this: 

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500'); 
OK 
Time taken: 0.135 seconds 

Yet the object keys and data are still there. I assume that I'm missing 
something really simple. Can anyone shed some light on this? 

Thanks 




Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
No definitely internal.  It's loaded from an external table.

- Original Message -
From: "Furcy Pin" 
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3


Hi Richard, 


could you please check if your table is EXTERNAL? 
You can see it with a "DESCRIBE FORMATTED table_name ;" 


That's what external tables are for, they don't delete underlying data when you 
drop them. 




On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote: 


I think that someone put a file in there manually. Would that prevent Hive from 
dropping the partition? I also did a "drop table" and the s3 object keys 
persisted. 



----- Original Message - 
From: "Richard A. Bross" < r...@oaktreepeak.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:14:52 AM 
Subject: ALTER TABLE DROP PARTITION not working on S3 

Hi, 

I have a Hive managed table on S3, "api_measurements". I've tried dropping a 
partition like so: 

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500); 
Dropped the partition daydate=2018-04-04/epoch=1522876500 
OK 
Time taken: 2.109 seconds 

Yet the data is still on S3. Because object keys on S3 are always strings, I 
also tried this: 

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500'); 
OK 
Time taken: 0.135 seconds 

Yet the object keys and data are still there. I assume that I'm missing 
something really simple. Can anyone shed some light on this? 

Thanks 




Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
I think that someone put a file in there manually.  Would that prevent Hive 
from dropping the partition?  I also did a "drop table" and the s3 object keys 
persisted.

- Original Message -----
From: "Richard A. Bross" 
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:14:52 AM
Subject: ALTER TABLE DROP PARTITION not working on S3

Hi,

I have a Hive managed table on S3, "api_measurements".  I've tried dropping a 
partition like so:

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3.  Because object keys on S3 are always strings, I 
also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there.  I assume that I'm missing 
something really simple.  Can anyone shed some light on this?

Thanks



ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
Hi,

I have a Hive managed table on S3, "api_measurements".  I've tried dropping a 
partition like so:

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3.  Because object keys on S3 are always strings, I 
also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there.  I assume that I'm missing 
something really simple.  Can anyone shed some light on this?

Thanks



Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Richard A. Bross
This is great information, Gopal, thank you.

I wish I had the time to create a comparison for our use case between Hive 
buckets and ORC files on S3 and ORC files without bucket.  Unfortunately it's a 
chicken and egg issue, since I won't have enough data volume until we are in 
production, which should be shortly.

Thanks again.

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so they're asking "where is the Hive bucketing spec".  Is it just to read the 
> code for that function? 

This worked the other way around in time: rather than a spec being written first, 
ACIDv1 implemented streaming ingest via Storm, and it used an explicit "bucket_" 
naming for the filename.

Since until the compaction runs the actual base files don't exist, the ACID 
bucketing implementation has to handle missing buckets as 0 rows in base file + 
possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work 
similarly, for the ability to do bucket map-joins between ACID & non-ACID 
bucketed tables. Particularly about the modulus for negative numbers, which was 
broken in Hive-1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for 
bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec now lives in the comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored 
as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered, Presto should be expecting more than one 
file per bucket. 

I saw a JIRA comment which said "sort in file order and assign buckets"; you 
can see that is only applicable to the first insert into the table (and the 
regex will remove the copy numbering).
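The contract Gopal describes (file name -> hash % buckets) can be sketched in a few lines. This is a simplified, hedged rendition for illustration, not the Hive source: the regex pattern and the sign-bit masking are modeled on the behavior described above (`Utilities::getBucketIdFromFile` and Java-style `(hash & Integer.MAX_VALUE) % numBuckets`), and the file names are examples.

```python
import re

# Matches names like 000001_0, 000001_0_copy_1, bucket_00001;
# group 1 is the bucket id, and copy numbering is ignored.
BUCKET_FILE = re.compile(r"^(?:bucket_)?(\d+)(?:_\d+)?(?:_copy_\d+)?$")


def bucket_id_from_file(name):
    """Recover the bucket id from a bucket file name (-1 if not a bucket file)."""
    m = BUCKET_FILE.match(name)
    return int(m.group(1)) if m else -1


def bucket_for_int(value, num_buckets):
    """Java-style bucket assignment: mask the sign bit so negative keys
    still land in a valid bucket (the Hive-1.0 modulus bug was here)."""
    return (value & 0x7FFFFFFF) % num_buckets


print(bucket_id_from_file("000001_0"))         # 1
print(bucket_id_from_file("000001_0_copy_1"))  # 1 (copy suffix ignored)
print(bucket_for_int(-3, 4))                   # 1 (not a negative bucket)
```

Under this model, a reader that sorts files and assigns buckets positionally breaks as soon as a second insert adds `_copy_N` files, which is exactly the Presto failure mode discussed.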

And oddly enough this week, I saw an academic paper with a negative analysis of 
Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross"  wrote:

Gopal,

The Presto devs say they are willing to make the changes to adhere to the 
Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER 
BY.  My understanding was that this was related to the number of buckets, but 
you are relating it to ORC stripes.  It is odd that no examples that I've seen 
include the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is; regardless of whether efficient ORC stripes 
are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this 
to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e same device has 
natural progressions for metrics), which when combined with a c

Re: Building Datwarehouse Application in Spark

2018-04-04 Thread Richard A. Bross
Mahender,

To really address your question I think that you'd have to supply a bit more 
information, such as the kind of data that you want to save: RDBMS-type 
lookups, key/value/index-type lookups, insert velocity, etc.  These wide 
choices of technologies are suited to different use cases, although they 
overlap in some areas.

In a previous position that I held we used Spark on Cassandra to solve a 
similar problem.  The Datastax distribution puts Spark worker nodes directly on 
Cassandra nodes.  Because Cassandra partitions the data across nodes based on a 
row key, it's a nice match.  If the key is chosen properly, the Spark nodes are 
typically accessing local data on the Cassandra nodes, meaning that there are 
typically very few shuffles for direct queries and also that inserts go 
directly to the proper Cassandra nodes.  We had time series data based on 
unique row keys.  So our row keys were unique and our column keys were the time 
stamps. In that case our queries were done directly with the Cassandra clients 
for the most part, with SparkQL primarily used for ad-hoc queries.  

At my current position, we directly load raw data into Hive (using HiveQL) and 
then use Presto for queries.  That's our OLAP data store.  You can use any 
number of other tools to query Hive created data stores as well.

Then we have another pipeline that takes the same raw data, uses Spark for the 
ETL, and then inserts the results into Aurora (MySQL).  The schema is designed 
for specific queries, so the Spark ETL is designed to transform the data to 
optimize for the schema so as to allow efficient updates to those tables.  
That's our OLTP data store and we use standard SQL for queries.

Rick


- Original Message -
From: "Furcy Pin" 
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 6:58:58 AM
Subject: Re: Building Data Warehouse Application in Spark


Hi Mahender, 


Did you look at this? https://www.snappydata.io/blog/the-spark-database 


But I believe that most people handle this use case by either using: 
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...) if 
the data is not too big 
- Their favorite NoSQL storage (Cassandra, HBase) if the data is too big and 
needs to be distributed 


Spark generally makes it easy enough to query these other databases to allow 
you to perform analytics. 


Hive and Spark have been designed as OLAP tools, not OLTP. 
I'm not sure what features you are seeking for your SCD but they probably won't 
be part of Spark's core design. 


Hope this helps, 


Furcy 






On 4 April 2018 at 11:29, Mahender Sarangam < mahender.bigd...@outlook.com > 
wrote: 




Hi, 
Does anyone have a good architecture document or design principles for 
building a warehouse application using Spark? 


Is it better to create a Hive context and perform the transformations with 
HQL, or to load the files directly into a DataFrame and perform the data 
transformations there? 


We need to implement SCD Type 2 in Spark. Is there a good document/reference 
for building Type 2 warehouse objects? 


Thanks in advance 


/Mahender 


Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Richard A. Bross
Gopal,

The Presto devs say they are willing to make the changes to adhere to the Hive 
bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  
> My understanding was that this was related to the number of buckets, but you 
> are relating it to ORC stripes.  It is odd that no examples that I've seen 
> include the SORTED BY statement other than in relation to ORC indexes (that I 
> understand).  So the question is; regardless of whether efficient ORC stripes 
> are created (wouldn't I have to also specify 'orc.create.index'='true' for 
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e. same device has 
natural progressions for metrics), which when combined with a columnar format & 
ordering within files produces better storage and runtimes together (which I 
guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not 
order in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long, so it sorts by (modulus,) which for a 
quick-sort also loses the input order into the shuffle & each bucket file is 
produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more 
predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as 
key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for 
integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in 
general, bucketing on a high cardinality column + sorting on a low cardinality 
together has given good results to my customers.

>I hadn't thought of the even number issue, not having looked at the 
> function; I had assumed that it was a hash, not a modulus; shame on me.  
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java 
hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, 
> which is can I force Tez to create all bucket files so Presto queries can 
> succeed?  Anyway, I will be testing today and the solution will either be to 
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 
3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust 
the Hive bucketing spec & get you out of this mess - the bucketing contract for 
Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly 
- either you trust the user to have properly bucketed the table or you don't 
use it. Failing to work on valid tables does look pretty bad, instead of soft 
fallbacks.

I wrote a few UDFs which were used to validate suspect tables and fix them for 
customers who had bad historical data, which was loaded with 
"enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where 
uncovered buckets are assumed to have 0 rows without a file present & not error 
out instead.

If you do have the ability to redeploy Hive, the change you are looking for is 
a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Richard A. Bross
Gopal,

Thanks for this.  Great information and something to look at more closely to 
better understand the internals.

Rick

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  
> My understanding was that this was related to the number of buckets, but you 
> are relating it to ORC stripes.  It is odd that no examples that I've seen 
> include the SORTED BY statement other than in relation to ORC indexes (that I 
> understand).  So the question is; regardless of whether efficient ORC stripes 
> are created (wouldn't I have to also specify 'orc.create.index'='true' for 
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e. same device has 
natural progressions for metrics), which when combined with a columnar format & 
ordering within files produces better storage and runtimes together (which I 
guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not 
order in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long, so it sorts by (modulus,) which for a 
quick-sort also loses the input order into the shuffle & each bucket file is 
produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more 
predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as 
key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for 
integers.
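The effect of that composite shuffle key can be sketched in a few lines of Python - a toy model only (the row layout, column names, and the identity hash are illustrative assumptions, not Hive's actual shuffle code):

```python
from collections import defaultdict

def bucketed_sorted_write(rows, num_buckets, bucket_col, sort_cols):
    """Toy model: partition rows by the modulus, then sort each bucket
    file by the explicit SORTED BY columns -> predictable order per file."""
    buckets = defaultdict(list)
    for row in rows:
        # partition key for the shuffle: the modulus (identity hash for ints)
        buckets[row[bucket_col] % num_buckets].append(row)
    for b in buckets:
        # order key: (key1, key2) instead of a 0-byte sort key
        buckets[b].sort(key=lambda r: tuple(r[c] for c in sort_cols))
    return dict(buckets)

rows = [{"r_crs_id": 7 * i % 20, "id": 100 - i} for i in range(10)]
files = bucketed_sorted_write(rows, 4, "r_crs_id", ["r_crs_id", "id"])
```

Without the sort step, each bucket list would simply retain arrival order, which is the "random order within itself" described above.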

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in 
general, bucketing on a high cardinality column + sorting on a low cardinality 
together has given good results to my customers.

>I hadn't thought of the even number issue, not having looked at the 
> function; I had assumed that it was a hash, not a modulus; shame on me.  
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java 
hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42
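In other words (as I read ObjectInspectorUtils.getBucketNumber - treat the exact formula as an assumption), the hash is masked to a non-negative value and taken modulo the bucket count, so with Java's identity hash an integer column's bucket is effectively value % buckets. A quick Python sketch:

```python
INT_MAX = 0x7FFFFFFF  # Java Integer.MAX_VALUE

def java_int_hash(v):
    # Java's Integer.hashCode is the identity function, as shown above
    return v

def bucket_number(v, num_buckets):
    # mask to non-negative, then take the modulus
    return (java_int_hash(v) & INT_MAX) % num_buckets

print(bucket_number(42, 64))  # -> 42: an integer's bucket is just value % 64
```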

> Finally, I'm not sure that I got a specific answer to my original question, 
> which is can I force Tez to create all bucket files so Presto queries can 
> succeed?  Anyway, I will be testing today and the solution will either be to 
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 
3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust 
the Hive bucketing spec & get you out of this mess - the bucketing contract for 
Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
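That contract can be sketched as follows (assuming the usual bucket-file naming such as 000042_0 with an optional _copy_N suffix; the exact pattern Utilities.getBucketIdFromFile accepts is an assumption here):

```python
import re

# The leading digit run of a bucket file name is the bucket id;
# "_copy" files belong to the same bucket, so the file count is irrelevant.
BUCKET_FILE = re.compile(r"^(\d+)_\d+(_copy_\d+)?$")

def bucket_id_from_file(name):
    m = BUCKET_FILE.match(name)
    return int(m.group(1)) if m else -1  # -1: not a bucket file

print(bucket_id_from_file("000042_0"))         # -> 42
print(bucket_id_from_file("000042_0_copy_1"))  # -> 42
```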

The file-count is a very flaky way to check if the table is bucketed correctly 
- either you trust the user to have properly bucketed the table or you don't 
use it. Failing to work on valid tables does look pretty bad, instead of soft 
fallbacks.

I wrote a few UDFs which were used to validate suspect tables and fix them for 
customers who had bad historical data, which was loaded with 
"enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where 
uncovered buckets are assumed to have 0 rows without a file present & not error 
out instead.

If you do have the ability to redeploy Hive, the change you are looking for is 
a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal




Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Richard A. Bross
I'm really confused and could use help understanding.  The Hive documentation 
here 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

says:

"Bucketed tables are fantastic in that they allow much more efficient sampling 
than do non-bucketed tables, and they may later allow for time saving 
operations such as mapside joins. However, the bucketing specified at table 
creation is not enforced when the table is written to, and so it is possible 
for the table's metadata to advertise properties which are not upheld by the 
table's actual layout. This should obviously be avoided. Here's how to do it 
right."

"This should obviously be avoided".  Yet with the Tez engine it seems to be 
unavoidable when the data does not allow for all the buckets to be populated.

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition 
for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). 
CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when 
> the partition is created there are always 64 bucket files, even if some are 0 
> length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.
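That minimum size is just the ORC magic bytes. A sketch of such a check (the 3-byte magic at the start of the file matches what's described above; the helper name is mine):

```python
def looks_like_orc(path):
    # An ORC file begins with the ASCII magic "ORC"; a zero-length
    # bucket file therefore cannot be a valid ORC file.
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"
```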

If you have a bucketing impl which is producing empty files when hash(20k) 
values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are 
> created, and Presto fails since the table properties in the metastore (64 
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real 
money to store (because storage is usually at the 4k block level, this uses 
1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table 
(in the >64 case, at least some files have a _Copy suffix, but the bucket 
prefix is consistent).

Presto just bails out when it finds something strange, because they do not 
trust the Hive bucketing impl - there is a really good reason why there are 2 
bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & 
hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can 
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want 
to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
> always set to "true", but I can't seem to find a description of its intent.  
> Anyway, if it's supposed to force the creation of buckets it's not working 
> with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data 
won't use the fixed number of buckets & can play other tricks with the data 
layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit 
called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth 
explaining right now (but just that it is possible to use it, but not very 
easily).

> configuration so that r_crs_id was at least row optimized and sorted within 
> the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim 
sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets 
have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - 
https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal



Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Richard A. Bross
Gopal,

Thanks for taking the time to try and help.  A few things in relation to your 
response:

* Yes, the 'epoch' column is an hourly timestamp.  Clustering by a column with 
high cardinality would make little sense.
* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My 
understanding was that this was related to the number of buckets, but you are 
relating it to ORC stripes.  It is odd that no examples that I've seen include 
the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is; regardless of whether efficient ORC stripes 
are created (wouldn't I have to also specify 'orc.create.index'='true' for this 
to have much of an effect)
* Thanks for the hint about using primes for specifying the buckets.  Given 
that r_crs_id is a wide range I think that it would have worked Ok, but any 
optimizations are appreciated and I hadn't thought of the even number issue, 
not having looked at the function; I had assumed that it was a hash, not a 
modulus; shame on me.  Reading the docs I see that hash is only used on string 
columns

Finally, I'm not sure that I got a specific answer to my original question, 
which is can I force Tez to create all bucket files so Presto queries can 
succeed?  Anyway, I will be testing today and the solution will either be to 
forgo buckets completely or to simply rely on ORC indexes.

Thanks.

Rick

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition 
for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). 
CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when 
> the partition is created there are always 64 bucket files, even if some are 0 
> length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) 
values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are 
> created, and Presto fails since the table properties in the metastore (64 
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real 
money to store (because storage is usually at the 4k block level, this uses 
1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table 
(in the >64 case, at least some files have a _Copy suffix, but the bucket 
prefix is consistent).

Presto just bails out when it finds something strange, because they do not 
trust the Hive bucketing impl - there is a really good reason why there are 2 
bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & 
hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can 
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want 
to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
> always set to "true", but I can't seem to find a description of its intent.  
> Anyway, if it's supposed to force the creation of buckets it's not working 
> with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data 
won't use the fixed number of buckets & can play other tricks with the data 
layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit 
called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth 
explaining right now (but just that it is possible to use it, but not very 
easily).

> configuration so that r_crs_id was at least row optimized and sorted within 
> the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim 
sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets 
have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.
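The same experiment in Python 3 (range replacing the Python 2 xrange above):

```python
# Even-valued keys modulo a power of two can only land on even residues,
# so half the buckets stay empty; a prime modulus spreads them over all.
even_keys = [x * 2 for x in range(1000)]
filled_64 = len({k % 64 for k in even_keys})
filled_61 = len({k % 61 for k in even_keys})
print(filled_64, filled_61)  # -> 32 61
```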

FYI any prime number other than 31 works nicer than a 2^n - 
https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal   

Hive, Tez, clustering, buckets, and Presto

2018-03-31 Thread Richard A. Bross
Hello there, 

I've done a ton of reading and testing between Hive, Presto, Tez, etc. In our 
test environment I have 8 tables being loaded with Hive that works fine under 
certain circumstances. 

Most of our data is taken from JSON files in from S3, creates a temporary 
external table, and then performs an INSERT OVERWRITE to transform the data to 
ORC format in Hive managed tables. 

A simple example of the CREATE statement for the internal table looks like this 
(I've left out a number of irrelevant columns): 

CREATE TABLE IF NOT EXISTS measurements (
id STRING,
session STRING,
tt_ts INT,
r_crs_id INT,
user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
tblproperties ("orc.compress"="ZLIB");


If I perform the inserts with the mr engine, which is now deprecated, when the 
partition is created there are always 64 bucket files, even if some are 0 
length.  In this case I can query with Presto just fine.

However, when the Tez engine is used, only buckets that contain data are 
created, and Presto fails since the table properties in the metastore (64 
buckets) are a mismatch with the actual bucket count.

I've seen some other postings in StackExchange and via Google where others have 
run into this.  We really need to use Presto to query Hive, and we'd really 
like to use buckets, since the r_crs_id in production can contain 20,000 
values.

So a few questions (thanks for any help):

The easiest fix for us would be a way to force Tez to create all buckets like 
mr so the behavior didn't change. Is that possible? "hive.enforce.bucketing" 
appears to have been deprecated as of 2.0 and is always set to "true", but I 
can't seem to find a description of its intent.  Anyway, if it's supposed to 
force the creation of buckets it's not working with the Tez engine.

Alternately, I was thinking that maybe someone could help me with the ORC file 
configuration so that r_crs_id was at least row optimized and sorted within the 
ORC files.  I've been reading up on ORC and the Hive statements related to ORC 
files but I seem to have hit my limit of understanding.  It appears that I 
should at the least have the rows sorted by r_crs_id when written, but I'm 
having problems figuring out how to apply that with the INSERT statement.

It appears that the ORC table should be created with 'orc.create.index'='true' 
for a start.

Currently our INSERT looks like this (we never UPDATE or DELETE):

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements 
PARTITION(daydate='{2}', epoch={3}) SELECT *;

Something like this?

FROM raw_measurements_external SORT BY r_crs_id INSERT OVERWRITE TABLE 
raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT *;

Thanks for any help from more experienced Hive developers (most of you).