Unsubscribe

2018-10-16 Thread Richard A. Bross
Unsubscribe

- Original Message -
From: "Kevin Risden" 
To: user@hive.apache.org
Sent: Tuesday, October 16, 2018 9:44:17 AM
Subject: Re: HiveServer2 performance references?


Thanks for the responses. I'll wait to test with HIVE-20621 to see how much 
that helps. For HIVE-17194, it doesn't look like that is documented outside of 
JIRA currently. It looks like I may have picked up over-the-wire compression by 
default; I had assumed it wasn't compressed. 



Kevin Risden 




On Tue, Oct 16, 2018 at 12:44 AM Gopal Vijayaraghavan < gop...@apache.org > 
wrote: 






Hi, 


> I was looking at HiveServer2 performance going through Knox in KNOX-1524 and 
> found that HTTP mode is significantly slower. 



The HTTP mode did re-auth for every row before HIVE-20621 was fixed – Knox 
should be doing cookie-auth to prevent ActiveDirectory/LDAP from throttling 
this. 

I fixed it for the open-source JDBC drivers, but I think the proprietary ODBC 
drivers are still broken (the C++ code doesn’t save cookies). 

> I also didn't see any information about reducing amount of data over the wire 
> with compression. 

It’s possible Knox doesn’t generate/forward the Accept-Encoding headers. 



https://issues.apache.org/jira/browse/HIVE-17194 


Cheers, 

Gopal 




From: Prasanth Jayachandran < pjayachand...@hortonworks.com > 
Reply-To: " user@hive.apache.org " < user@hive.apache.org > 
Date: Monday, October 15, 2018 at 12:34 PM 
To: " user@hive.apache.org " < user@hive.apache.org >, " user@hive.apache.org " 
< user@hive.apache.org > 
Subject: Re: HiveServer2 performance references? 







One related performance fix that went in recently is HIVE-20621 






Thanks 


Prasanth 






On Mon, Oct 15, 2018 at 12:11 PM -0700, "Kevin Risden" < kris...@apache.org > wrote: 









Are there any resources on HiveServer2 performance - specifically binary vs 
HTTP mode? I would also be interested in any information about enabling 
compression between HiveServer2 and a client like beeline. 





I was looking at HiveServer2 performance going through Knox in KNOX-1524 and 
found that HTTP mode is significantly slower. I searched and didn't find any 
references to out-of-the-box HiveServer2 performance. I also didn't see any 
information about reducing the amount of data over the wire with compression. 





Any pointers would be very helpful. Thanks! 







Kevin Risden


Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

2018-04-12 Thread Richard A. Bross
To be honest, my advice was just based on your post - we only use Hive in AWS 
EMR, so I couldn't tell you.

Glad that at least you're back up though.

- Original Message -
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:50:23 AM
Subject: Re: Unable to read hive external table data which is linked to s3 
after upgradation from 2.1.1 to 2.3.3








Hi Richard , 

After looking at the Hive 2.3.3 logs, I found that the message below is printed 
when all the configuration parameters are loaded :- 

Values omitted for security reason if present: [fs.s3n.awsAccessKeyId, 
fs.s3a.access.key, fs.s3.awsAccessKeyId, hive.server2.keystore.password, 
fs.s3a.proxy.password, javax.jdo.option.ConnectionPassword, 
fs.s3.awsSecretAccessKey, fs.s3n.awsSecretAccessKey, fs.s3a.secret.key] 


while in the Hive 2.1.1 logs I found this message :- 

Values omitted for security reason if present: [hive.server2.keystore.password, 
javax.jdo.option.ConnectionPassword] 

Can this be the reason why Hive 2.3.3 is not able to read the S3-related params 
from hive-site.xml? 

I found this JIRA: https://issues.apache.org/jira/browse/HIVE-14588 



Also, I have set the property below in hive-site.xml, excluding the above S3 
variables (fs.s3a.access.key, fs.s3a.secret.key) from the list, and it worked. 

 
<property>
  <name>hive.conf.hidden.list</name>
  <value>javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password</value>
  <description>Comma separated list of configuration options which should not be read by normal user like passwords. Anup has excluded 2 variables, which were fs.s3a.access.key,fs.s3a.secret.key</description>
</property>

Let me know if there is any other solution, because I think that if these 
variables are part of the hidden list by default, then there must be some other, 
more proper workaround for this. 
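
For anyone hitting the same thing, a minimal way to confirm the override took 
effect (assuming HiveServer2 and the metastore are restarted so the modified 
hive-site.xml is re-read) is to re-run the failing query from the original report: 

-- should now reach S3 instead of failing with 
-- "No AWS Credentials provided by BasicAWSCredentialsProvider ..." 
select log_date, count(1) as cnt from test.tt1 group by log_date; 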









Regards, 
Anup Tiwari 


On Thu, Apr 12, 2018 at 7:44 PM, Richard A. Bross < r...@oaktreepeak.com > 
wrote: 


I hear you, but given the exception log, it does seem that it can't 
authenticate you. You can try using the AWS environment variables. If that 
resolves the issue then you'll have some more to go on. According to 
Hortonworks here: 

https://hortonworks.github.io/hdp-aws/s3-security/ 

"AWS CLI supports authentication through environment variables. These same 
environment variables will be used by Hadoop if no configuration properties are 
set." 


- Original Message - 
From: "Anup Tiwari" < anupsdtiw...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 12, 2018 10:06:33 AM 
Subject: Re: Unable to read hive external table data which is linked to s3 
after upgradation from 2.1.1 to 2.3.3 


We are not using EMR. Also, we have set the params below for accessing the S3 
bucket in hive-site.xml, which are the same as what we had set in Hive 2.1.1. 


* fs.s3a.access.key 
* fs.s3a.secret.key 
* fs.s3a.connection.maximum 
* fs.s3a.impl 








Regards, 
Anup Tiwari 


On Thu, Apr 12, 2018 at 7:19 PM, Richard A. Bross < r...@oaktreepeak.com > 
wrote: 


Based on the exception, it looks more like an AWS credentials issue than a Hive 
issue. Are you running in AWS EMR, on-prem? 

In AWS the resource accessing the S3 bucket would have to have an IAM that gave 
permission. If you are running somewhere else whatever AWS login you use would 
have to have the correct permissions in the IAM. 



- Original Message - 
From: "Anup Tiwari" < anupsdtiw...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 12, 2018 9:11:37 AM 
Subject: Unable to read hive external table data which is linked to s3 after 
upgradation from 2.1.1 to 2.3.3 



Hi All, 

When I am trying to read an S3-linked external table in 2.3.3, I am getting 
errors. It was working properly in 2.1.1. Please find the details below and let 
me know if I am missing something :- 


Hadoop Version :- 2.8.0 

Query :- 

select log_date,count(1) as cnt from test.tt1 group by log_date; 

Error :- 

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, 
diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, 
vertex=vertex_1523502631429_0029_3_00 [Map 1], 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials 
provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS 
Credentials provided by BasicAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider 
: com.amazonaws.AmazonClientException: The requested metadata is not foun

Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

2018-04-12 Thread Richard A. Bross
I hear you, but given the exception log, it does seem that it can't 
authenticate you.  You can try using the AWS environment variables.  If that 
resolves the issue then you'll have some more to go on.  According to 
Hortonworks here:

https://hortonworks.github.io/hdp-aws/s3-security/

"AWS CLI supports authentication through environment variables. These same 
environment variables will be used by Hadoop if no configuration properties are 
set."


- Original Message -
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:06:33 AM
Subject: Re: Unable to read hive external table data which is linked to s3 
after upgradation from 2.1.1 to 2.3.3


We are not using EMR. Also, we have set the params below for accessing the S3 
bucket in hive-site.xml, which are the same as what we had set in Hive 2.1.1. 


* fs.s3a.access.key 
* fs.s3a.secret.key 
* fs.s3a.connection.maximum 
* fs.s3a.impl 






Regards, 
Anup Tiwari 


On Thu, Apr 12, 2018 at 7:19 PM, Richard A. Bross < r...@oaktreepeak.com > 
wrote: 


Based on the exception, it looks more like an AWS credentials issue than a Hive 
issue. Are you running in AWS EMR, on-prem? 

In AWS the resource accessing the S3 bucket would have to have an IAM that gave 
permission. If you are running somewhere else whatever AWS login you use would 
have to have the correct permissions in the IAM. 



- Original Message - 
From: "Anup Tiwari" < anupsdtiw...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 12, 2018 9:11:37 AM 
Subject: Unable to read hive external table data which is linked to s3 after 
upgradation from 2.1.1 to 2.3.3 



Hi All, 

When I am trying to read an S3-linked external table in 2.3.3, I am getting 
errors. It was working properly in 2.1.1. Please find the details below and let 
me know if I am missing something :- 


Hadoop Version :- 2.8.0 

Query :- 

select log_date,count(1) as cnt from test.tt1 group by log_date; 

Error :- 

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, 
diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, 
vertex=vertex_1523502631429_0029_3_00 [Map 1], 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials 
provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS 
Credentials provided by BasicAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider 
: com.amazonaws.AmazonClientException: The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128) 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
 
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236) 
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811) 
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) 
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848) 
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830) 
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) 
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) 
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442) 
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561) 
at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
 
at java.ut

Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

2018-04-12 Thread Richard A. Bross
Based on the exception, it looks more like an AWS credentials issue than a Hive 
issue.  Are you running in AWS EMR, on-prem?

In AWS, the resource accessing the S3 bucket would have to have an IAM role that 
grants permission.  If you are running somewhere else, whatever AWS login you use 
would have to have the correct permissions in IAM.

- Original Message -
From: "Anup Tiwari" 
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 9:11:37 AM
Subject: Unable to read hive external table data which is linked to s3 after 
upgradation from 2.1.1 to 2.3.3



Hi All, 

When I am trying to read an S3-linked external table in 2.3.3, I am getting 
errors. It was working properly in 2.1.1. Please find the details below and let 
me know if I am missing something :- 


Hadoop Version :- 2.8.0 

Query :- 

select log_date,count(1) as cnt from test.tt1 group by log_date; 

Error :- 

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, 
diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, 
vertex=vertex_1523502631429_0029_3_00 [Map 1], 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials 
provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS 
Credentials provided by BasicAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider 
: com.amazonaws.AmazonClientException: The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128) 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
 
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236) 
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811) 
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) 
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848) 
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830) 
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) 
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) 
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
 
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
 
at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
 
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: com.amazonaws.AmazonClientException: No AWS Credentials provided by 
BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider 
SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: 
The requested metadata is not found at 
http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ 
at 
org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:151)
 
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779) 
at 
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107) 
at 
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
 
... 24 more 
Caused by: com.amazonaws.AmazonClientException: The requested metadata is not 
found at 

Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
I can't duplicate the issue now.  It works like it always has . . 

- Original Message -
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:57:39 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3


Indeed. 


If I remember correctly, s3 does not really have the concept of "folder" like 
HDFS has, and Hive sort of makes up for it by creating a descriptor file 
where the partition "folder" is supposed to be. Maybe this is what is missing 
here. 


Perhaps you could try doing a "MSCK REPAIR TABLE tablename" to make sure that 
the partitions are correctly loaded and then try again dropping that particular 
partition? 


Or look at your s3 folder if you see any such "partition folder file" and check 
if it is missing for this particular partition? 
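
A minimal sketch of that sequence, using the table and partition from the 
original message (illustrative only, not a verified fix): 

MSCK REPAIR TABLE api_measurements; 
SHOW PARTITIONS api_measurements; 
-- then retry dropping the specific partition 
ALTER TABLE api_measurements DROP IF EXISTS PARTITION (daydate='2018-04-04', epoch=1522876500); 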




On 5 April 2018 at 15:40, Richard A. Bross < r...@oaktreepeak.com > wrote: 


Leaving the column list out, here you go: 

# Detailed Table Information 
Database: default 
Owner: hadoop 
CreateTime: Thu Apr 05 13:24:33 UTC 2018 
LastAccessTime: UNKNOWN 
Retention: 0 
Location: s3://zoomi-proto-warehouse-measurements/ 
Table Type: MANAGED_TABLE 
Table Parameters: 
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} 
numFiles 10 
numPartitions 7 
numRows 153 
orc.compress ZLIB 
orc.create.index true 
rawDataSize 113563 
totalSize 37801 
transient_lastDdlTime 1522934673 

# Storage Information 
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde 
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
Compressed: No 
Num Buckets: 61 
Bucket Columns: [crs_id] 
Sort Columns: [] 
Storage Desc Params: 
serialization.format 1 
Time taken: 0.467 seconds, Fetched: 98 row(s) 


- Original Message - 
From: "Furcy Pin" < pin.fu...@gmail.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:21:06 AM 


Subject: Re: ALTER TABLE DROP PARTITION not working on S3 


Hi Richard, 


could you please check if your table is EXTERNAL? 
You can see it with a "DESCRIBE FORMATTED table_name ;" 


That's what external tables are for, they don't delete underlying data when you 
drop them. 




On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote: 


I think that someone put a file in there manually. Would that prevent Hive from 
dropping the partition. I also did a "drop table" and the s3 object keys 
persisted. 



- Original Message - 
From: "Richard A. Bross" < r...@oaktreepeak.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:14:52 AM 
Subject: ALTER TABLE DROP PARTITION not working on S3 

Hi, 

I have a Hive managed table on S3, "api_measurements". I've tried dropping a 
partition like so: 

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500); 
Dropped the partition daydate=2018-04-04/epoch=1522876500 
OK 
Time taken: 2.109 seconds 

Yet the data is still on S3. Because object keys on S3 are always strings, I 
also tried this: 

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500'); 
OK 
Time taken: 0.135 seconds 

Yet the object keys and data are still there. I assume that Im missing 
something really simple. Can anyone shed some light on this? 

Thanks 





Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
Leaving the column list out, here you go:

# Detailed Table Information 
Database:   default  
Owner:  hadoop   
CreateTime: Thu Apr 05 13:24:33 UTC 2018 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   s3://zoomi-proto-warehouse-measurements/ 
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles10  
numPartitions   7   
numRows 153 
orc.compressZLIB
orc.create.indextrue
rawDataSize 113563  
totalSize   37801   
transient_lastDdlTime   1522934673  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No   
Num Buckets:61   
Bucket Columns: [crs_id] 
Sort Columns:   []   
Storage Desc Params: 
serialization.format1   
Time taken: 0.467 seconds, Fetched: 98 row(s)


- Original Message -
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3


Hi Richard, 


could you please check if your table is EXTERNAL? 
You can see it with a "DESCRIBE FORMATTED table_name ;" 


That's what external tables are for, they don't delete underlying data when you 
drop them. 




On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote: 


I think that someone put a file in there manually. Would that prevent Hive from 
dropping the partition. I also did a "drop table" and the s3 object keys 
persisted. 



- Original Message - 
From: "Richard A. Bross" < r...@oaktreepeak.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:14:52 AM 
Subject: ALTER TABLE DROP PARTITION not working on S3 

Hi, 

I have a Hive managed table on S3, "api_measurements". I've tried dropping a 
partition like so: 

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500); 
Dropped the partition daydate=2018-04-04/epoch=1522876500 
OK 
Time taken: 2.109 seconds 

Yet the data is still on S3. Because object keys on S3 are always strings, I 
also tried this: 

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500'); 
OK 
Time taken: 0.135 seconds 

Yet the object keys and data are still there. I assume that Im missing 
something really simple. Can anyone shed some light on this? 

Thanks 




Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
No, definitely internal.  It's loaded from an external table.

- Original Message -
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3


Hi Richard, 


could you please check if your table is EXTERNAL? 
You can see it with a "DESCRIBE FORMATTED table_name ;" 


That's what external tables are for, they don't delete underlying data when you 
drop them. 




On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote: 


I think that someone put a file in there manually. Would that prevent Hive from 
dropping the partition. I also did a "drop table" and the s3 object keys 
persisted. 



----- Original Message - 
From: "Richard A. Bross" < r...@oaktreepeak.com > 
To: user@hive.apache.org 
Sent: Thursday, April 5, 2018 9:14:52 AM 
Subject: ALTER TABLE DROP PARTITION not working on S3 

Hi, 

I have a Hive managed table on S3, "api_measurements". I've tried dropping a 
partition like so: 

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500); 
Dropped the partition daydate=2018-04-04/epoch=1522876500 
OK 
Time taken: 2.109 seconds 

Yet the data is still on S3. Because object keys on S3 are always strings, I 
also tried this: 

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500'); 
OK 
Time taken: 0.135 seconds 

Yet the object keys and data are still there. I assume that Im missing 
something really simple. Can anyone shed some light on this? 

Thanks 




Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
I think that someone put a file in there manually.  Would that prevent Hive 
from dropping the partition?  I also did a "drop table" and the S3 object keys 
persisted.

- Original Message -----
From: "Richard A. Bross" <r...@oaktreepeak.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:14:52 AM
Subject: ALTER TABLE DROP PARTITION not working on S3

Hi,

I have a Hive managed table on S3, "api_measurements".  I've tried dropping a 
partition like so:

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3.  Because object keys on S3 are always strings, I 
also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there.  I assume that Im missing 
something really simple.  Can anyone shed some light on this?

Thanks



ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Richard A. Bross
Hi,

I have a Hive managed table on S3, "api_measurements".  I've tried dropping a 
partition like so:

hive> alter table api_measurements drop if exists 
partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3.  Because object keys on S3 are always strings, I 
also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', 
epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there.  I assume that I'm missing 
something really simple.  Can anyone shed some light on this?

Thanks



Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Richard A. Bross
This is great information, Gopal, thank you.

I wish I had the time to create a comparison for our use case between bucketed 
ORC files on S3 and ORC files without buckets.  Unfortunately it's a 
chicken-and-egg issue, since I won't have enough data volume until we are in 
production, which should be shortly.

Thanks again.

- Original Message -
From: "Gopal Vijayaraghavan" <gop...@apache.org>
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so there asking "where is the Hive bucketing spec".  Is it just to read the 
> code for that function? 

This worked the other way around in time, rather than a spec being written first - 
ACIDv1 implemented streaming ingest via Storm, and it used an explicit "bucket_" 
naming convention for the filename.

Since until the compaction runs the actual base files don't exist, the ACID 
bucketing implementation has to handle missing buckets as 0 rows in base file + 
possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work 
similarly, for the ability to do bucket map-joins between ACID & non-ACID 
bucketed tables. Particularly about the modulus for -ve numbers, which was 
broken in Hive-1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for 
bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec lives in the comments now as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored 
as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered, Presto should be expecting more than one file 
per bucket.

I saw a JIRA comment which said "sort in file order and assign buckets", you 
can see that is only applicable for the 1st insert to table (& the regex will 
remove the copy numbering).

And oddly enough this week, I saw an academic paper with a negative analysis of 
Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <r...@oaktreepeak.com> wrote:

Gopal,

The Presto devs say they are willing to make the changes to adhere to the 
Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan" <gop...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER 
BY.  My understanding was that this was related to the number of buckets, but 
you are relating it to ORC stripes.  It is odd that no examples that I've seen 
include the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is; regardless of whether efficient ORC stripes 
are created (wouldn't I have to also specify 'orc.create.index'='true' for this 
to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e same device has 
natural progressions for metrics), w

Re: Building Datwarehouse Application in Spark

2018-04-04 Thread Richard A. Bross
Mahender,

To really address your question I think that you'd have to supply a bit more 
information, such as the kind of data that you want to save; RBDMS type look 
ups, key/value/index type look ups, insert velocity, etc.  These wide choices 
of technologies are suited to different use cases, although they overlap in 
some areas.

In a previous position that I held we used Spark on Cassandra to solve a 
similar problem.  The Datastax distribution puts Spark worker nodes directly on 
Cassandra nodes.  Because Cassandra partitions the data across nodes based on a 
row key, it's a nice match.  If the key is chosen properly, the Spark nodes are 
typically accessing local data on the Cassandra nodes, meaning that there are 
typically very few shuffles for direct queries and also that inserts go 
directly to the proper Cassandra nodes.  We had time series data based on 
unique row keys.  So our row keys were unique and our column keys were the time 
stamps. In that case our queries were done directly with the Cassandra clients 
for the most part, with SparkQL primarily used for ad-hoc queries.  

At my current position, we directly load raw data into Hive (using HiveQL) and 
then use Presto for queries.  That's our OLAP data store.  You can use any 
number of other tools to query Hive created data stores as well.

Then we have another pipeline that takes the same raw data, uses Spark for the 
ETL, and then inserts the results into Aurora (MySQL).  The schema is designed 
for specific queries, so the Spark ETL is designed to transform the data to 
optimize for the schema so as to allow efficient updates to those tables.  
That's our OLTP data store and we use standard SQL for queries.

Rick


- Original Message -
From: "Furcy Pin" 
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 6:58:58 AM
Subject: Re: Building Datwarehouse Application in Spark


Hi Mahender, 


Did you look at this? https://www.snappydata.io/blog/the-spark-database 


But I believe that most people handle this use case by either using: 
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...) if 
the data is not too big 
- Their favorite NoSQL storage (Cassandra, HBase) if the data is too big and 
needs to be distributed 


Spark generally makes it easy enough to query these other databases to allow 
you to perform analytics. 


Hive and Spark have been designed as OLAP tools, not OLTP. 
I'm not sure what features you are seeking for your SCD but they probably won't 
be part of Spark's core design. 


Hope this helps, 


Furcy 






On 4 April 2018 at 11:29, Mahender Sarangam < mahender.bigd...@outlook.com > 
wrote: 




Hi, 
Does anyone have a good architecture document or design principles for building 
a warehouse application using Spark? 


Is it better to create a HiveContext and perform the transformations with HQL, 
or to load the files directly into a DataFrame and perform the data 
transformations there? 


We need to implement SCD Type 2 in Spark. Is there any good 
document/reference for building Type 2 warehouse objects? 


Thanks in advance 


/Mahender 


Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Richard A. Bross
Gopal,

The Presto devs say they are willing to make the changes to adhere to the Hive 
bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  
> My understanding was that this was related to the number of buckets, but you 
> are relating it to ORC stripes.  It is odd that no examples that I've seen 
> include the SORTED BY statement other than in relation to ORC indexes (that I 
> understand).  So the question is; regardless of whether efficient ORC stripes 
> are created (wouldn't I have to also specify 'orc.create.index'='true' for 
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e same device has 
natural progressions for metrics), which when combined with a columnar format & 
ordering within files produces better storage and runtimes together (which I 
guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not 
order in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long, so it sorts by (modulus,) which for a 
quick-sort also loses the input order into the shuffle & each bucket file is 
produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more 
predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as 
key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for 
integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in 
general, bucketing on a high cardinality column + sorting on a low cardinality 
together has given good results to my customers.

>I hadn't thought of the even number issue, not having looked at the 
> function; I had assumed that it was a hash, not a modulus; shame on me.  
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java 
hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, 
> which is can I force Tez to create all bucket files so Presto queries can 
> succeed?  Anyway, I will be testing today and the solution will either be to 
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 
3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust 
the Hive bucketing spec & get you out of this mess - the bucketing contract for 
Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly 
- either you trust the user to have properly bucketed the table or you don't 
use it. Failing to work on valid tables does look pretty bad, instead of soft 
fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for 
customers who had bad historical data, which was loaded with 
"enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where 
uncovered buckets are assumed to have 0 rows without a file present & not error 
out instead.

If 

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Richard A. Bross
Gopal,

Thanks for this.  Great information and something to look at more closely to 
better understand the internals.

Rick

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  
> My understanding was that this was related to the number of buckets, but you 
> are relating it to ORC stripes.  It is odd that no examples that I've seen 
> include the SORTED BY statement other than in relation to ORC indexes (that I 
> understand).  So the question is; regardless of whether efficient ORC stripes 
> are created (wouldn't I have to also specify 'orc.create.index'='true' for 
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e same device has 
natural progressions for metrics), which when combined with a columnar format & 
ordering within files produces better storage and runtimes together (which I 
guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not 
order in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long, so it sorts by (modulus,) which for a 
quick-sort also loses the input order into the shuffle & each bucket file is 
produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more 
predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as 
key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for 
integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in 
general, bucketing on a high cardinality column + sorting on a low cardinality 
together has given good results to my customers.

>I hadn't thought of the even number issue, not having looked at the 
> function; I had assumed that it was a hash, not a modulus; shame on me.  
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java 
hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, 
> which is can I force Tez to create all bucket files so Presto queries can 
> succeed?  Anyway, I will be testing today and the solution will either be to 
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 
3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust 
the Hive bucketing spec & get you out of this mess - the bucketing contract for 
Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly 
- either you trust the user to have properly bucketed the table or you don't 
use it. Failing to work on valid tables does look pretty bad, instead of soft 
fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for 
customers who had bad historical data, which was loaded with 
"enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where 
uncovered buckets are assumed to have 0 rows without a file present & not error 
out instead.

If you do have the ability to redeploy Hive, the change you are looking for is 
a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal




Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Richard A. Bross
I'm really confused and could use help understanding.  The Hive documentation 
here 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

says:

"Bucketed tables are fantastic in that they allow much more efficient sampling 
than do non-bucketed tables, and they may later allow for time saving 
operations such as mapside joins. However, the bucketing specified at table 
creation is not enforced when the table is written to, and so it is possible 
for the table's metadata to advertise properties which are not upheld by the 
table's actual layout. This should obviously be avoided. Here's how to do it 
right."

"This should obviously be avoided".  Yet with the Tez engine it seems to be 
unavoidable when the data does not allow for all the buckets to be populated.

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition 
for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). 
CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when 
> the partition is created there are always 64 bucket files, even if some are 0 
> length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) 
values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are 
> created, and Presto fails since the table properties in the metastore (64 
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real 
money to store (because storage is usually at the 4k block level, this uses 
1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table 
(in the >64 case, at least some files have a _Copy suffix, but the bucket 
prefix is consistent).

Presto just bails out when it finds something strange, because they do not 
trust the Hive bucketing impl - there is a really good reason why there are 2 
bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & 
hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can 
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want 
to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
> always set to "true", but I can't seem to find a description of its intent.  
> Anyway, if it's supposed to force the creation of buckets it's not working 
> with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data 
won't use the fixed number of buckets & can play other tricks with the data 
layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit 
called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth 
explaining right now (but just that it is possible to use it, but not very 
easily).

> configuration so that r_crs_id was at least row optimized and sorted within 
> the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim 
sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, here is what happens when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets 
have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - 
https://issues.apache.org/jira/browse/HIVE-7074
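
Putting those pieces together on the measurements table from the original post, 
the DDL would look something like the sketch below (the 61-bucket count and the 
extra id sort key are illustrative, not a tested recommendation): 

CREATE TABLE IF NOT EXISTS measurements ( 
id STRING, 
session STRING, 
tt_ts INT, 
r_crs_id INT, 
user_agent STRING 
) 
PARTITIONED BY(daydate STRING, epoch BIGINT) 
CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 61 BUCKETS 
STORED AS ORC 
LOCATION 's3://warehouse/' 
tblproperties ("orc.compress"="ZLIB"); 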

Cheers,
Gopal



Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Richard A. Bross
Gopal,

Thanks for taking the time to try and help.  A few things in relation to your 
response:

* Yes, the 'epoch' column is an hourly timestamp.  Clustering by a column with 
high cardinality would make little sense.
* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My 
understanding was that this was related to the number of buckets, but you are 
relating it to ORC stripes.  It is odd that no examples that I've seen include 
the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is: regardless of whether efficient ORC stripes 
are created, wouldn't I have to also specify 'orc.create.index'='true' for this 
to have much of an effect?
* Thanks for the hint about using primes for specifying the buckets.  Given 
that r_crs_id is a wide range I think that it would have worked Ok, but any 
optimizations are appreciated and I hadn't thought of the even number issue, 
not having looked at the function; I had assumed that it was a hash, not a 
modulus; shame on me.  Reading the docs I see that hash is only used on string 
columns

Finally, I'm not sure that I got a specific answer to my original question, 
which is can I force Tez to create all bucket files so Presto queries can 
succeed?  Anyway, I will be testing today and the solution will either be to 
forgo buckets completely or to simply rely on ORC indexes.

Thanks.

Rick

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition 
for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). 
CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when 
> the partition is created there are always 64 bucket files, even if some are 0 
> length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) 
values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are 
> created, and Presto fails since the table properties in the metastore (64 
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real 
money to store (because storage is usually at the 4k block level, this uses 
1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table 
(in the >64 case, at least some files have a _Copy suffix, but the bucket 
prefix is consistent).

Presto just bails out when it finds something strange, because they do not 
trust the Hive bucketing impl - there is a really good reason why there are 2 
bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & 
hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can 
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want 
to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
> always set to "true", but I can't seem to find a description of its intent.  
> Anyway, if it's supposed to force the creation of buckets it's not working 
> with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data 
won't use the fixed number of buckets & can play other tricks with the data 
layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit 
called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth 
explaining right now (but just that it is possible to use it, but not very 
easily).

> configuration so that r_crs_id was at least row optimized and sorted within 
> the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim 
sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, here is what happens when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets 
have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - 

Hive, Tez, clustering, buckets, and Presto

2018-03-31 Thread Richard A. Bross
Hello there, 

I've done a ton of reading and testing between Hive, Presto, Tez, etc. In our 
test environment I have 8 tables being loaded with Hive, which works fine under 
certain circumstances. 

Most of our data is taken from JSON files in S3; we create a temporary 
external table and then perform an INSERT OVERWRITE to transform the data to 
ORC format in Hive managed tables. 

A simple example of the CREATE statement for the internal table looks like this 
(I've left out a number of irrelevant columns): 

CREATE TABLE IF NOT EXISTS measurements (
id STRING,
session STRING,
tt_ts INT,
r_crs_id INT,
user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
tblproperties ("orc.compress"="ZLIB");


If I perform the inserts with the mr engine, which is now deprecated, when the 
partition is created there are always 64 bucket files, even if some are 0 
length.  In this case I can query with Presto just fine.

However, when the Tez engine is used, only buckets that contain data are 
created, and Presto fails since the table properties in the metastore (64 
buckets) are a mismatch with the actual bucket count.

I've seen some other postings in StackExchange and via Google where others have 
run into this.  We really need to use Presto to query Hive, and we'd really 
like to use buckets, since the r_crs_id in production can contain 20,000 
values.

So a few questions (thanks for any help):

The easiest fix for us would be a way to force Tez to create all buckets like 
mr so the behavior didn't change. Is that possible? "hive.enforce.bucketing" 
appears to have been deprecated as of 2.0 and is always set to "true", but I 
can't seem to find a description of its intent.  Anyway, if it's supposed to 
force the creation of buckets it's not working with the Tez engine.

Alternately, I was thinking that maybe someone could help me with the ORC file 
configuration so that r_crs_id was at least row optimized and sorted within the 
ORC files.  I've been reading up on ORC and the Hive statements related to ORC 
files but I seem to have hit my limit of understanding.  It appears that I 
should at the least have the rows sorted by r_crs_id when written, but I'm 
having problems figuring out how to apply that with the INSERT statement.

It appears that the ORC table should be created with 'orc.create.index'='true' 
for a start.

Currently our INSERT looks like this (we never UPDATE or DELETE):

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements 
PARTITION(daydate='{2}', epoch={3}) SELECT *;

Something like this?

FROM raw_measurements_external SORT BY r_crs_id INSERT OVERWRITE TABLE 
raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT *;

Thanks for any help from more experienced Hive developers (most of you).
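
Regarding the SORT BY sketch above: in Hive's FROM-first insert form, the DISTRIBUTE BY / SORT BY clauses usually go after the SELECT, not before the INSERT. A hedged version of the same statement, keeping the daydate/epoch placeholders from the question; whether the explicit DISTRIBUTE BY is needed at all depends on the Hive version and on whether the target table is declared SORTED BY, in which case Hive should arrange the sort on its own:

FROM raw_measurements_external
INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3})
SELECT *
DISTRIBUTE BY r_crs_id
SORT BY r_crs_id;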




Re: RE: Re: For Apache Hive HS2 , what is the largest heap size setting that works well?

2017-11-29 Thread Richard Xin
 Use JDK8 and try G1 first; we saw lower GC pauses under G1, especially in our 
large-memory environment. 
On Wednesday, November 29, 2017, 8:14:03 AM PST, eric wong 
 wrote:  
 
 



-- Forwarded message --
From: eric wong 
To: user@hive.apache.org
Cc: 
Bcc: 
Date: Thu, 30 Nov 2017 00:07:47 +0800
Subject: Re: For Apache Hive HS2 , what is the largest heap size setting that 
works well?

1) A long GC pause should have a specific reason, like InitialMark/Remark, or a Full GC due 
to a Concurrent Mode failure/Promotion failure. Please check which it is.
2) CMS is nice and steady in our production. Please troubleshoot case by 
case; G1 is just harder to reason about.
3) You should update to JDK8; for one thing, the concurrent initial mark is disabled by 
default, or can only run sequentially, in JDK7.


2017-11-29 15:14 GMT+08:00 Jörn Franke :

I also recommend it; you will also see performance improvements with JDK8 in 
general (use the latest version). Keep in mind as well that more and more big data 
libraries will drop JDK7 support soon (aside from that, JDK7 is not 
maintained anymore anyway).
On 29. Nov 2017, at 01:31, Johannes Alberti  wrote:


Yes, I would recommend going to Java 8, giving it a shot with G1, and reporting 
back :)
Sent from my iPhone
On Nov 28, 2017, at 3:30 PM, Sharanya Santhanam  wrote:


Hi Johannes,
We are running on Java version jdk1.7.0_67. We are using the 
Concurrent Mark Sweep (CMS) collector.  Would you recommend using G1GC? 


These are our current settings:  -XX:NewRatio=8 -XX:+UseParNewGC 
-XX:-UseGCOverheadLimit -XX:PermSize=256m -Xloggc:<> -XX:HeapDumpPath=oom 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:ErrorFile=/oom/hs2jvm error%p.log 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 
-XX:GCLogFileSize=128M -XX:+CMSClassUnloadingEnabled 
-XX:+CMSPermGenSweepingEnabled -XX:+HeapDumpOnOutOfMemoryError 
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=1024m 
-Xmx69427m -Xms128m -XX:MaxHeapFreeRatio=30 -XX:MinHeapFreeRatio=10 
-XX:+UseParNewGC -XX:-UseGCOverheadLimit -XX:PermSize=256m 


Thanks, Sharanya 
On Tue, Nov 28, 2017 at 2:19 PM, Johannes Alberti  
wrote:

Hi Sharanya,
Can you share your current GC settings and Java version. Are you using Java 8/9 
w/ G1 already?
Regards,
Johannes

Sent from my iPhone
On Nov 28, 2017, at 12:57 PM, Sharanya Santhanam  wrote:


Hello, 
I am currently trying to upgrade the hive version on our prod clusters from v1.2 to 
v2.1. We also want to adopt HS2 on the new upgraded cluster. Earlier, all queries 
were submitted via the Hive cli. 
Would like to understand how large a single HS2 heap size can be. And is there 
any formula to figure out how many concurrent sessions I can support with 
this particular heap setting? 

We currently have an upper limit of supporting 300 concurrent sessions 
(hive.server2.thrift.max.worker.threads=300). Based on this we set the max 
heap size to 70 GB, but are seeing many long GC pauses. 

Would like to understand what the industry standard is for a max HS2 heap size. 
Are there any recommendations on what JVM GC settings work best for supporting 
such a high number of concurrent sessions? 
Thanks, Sharanya 








-- 
王海华




-- 
王海华  



Re: metastore (mysql) connection problems - hive v0.12

2013-12-17 Thread Richard Nadeau
Hi Stephen,

Try increasing the max connections limit in MySQL. I ran into a similar
problem and this resolved things after a Hive 0.12 upgrade. You can
increase the connection limit without a restart, but pay attention to other
settings as well so that you don't over allocate resources on your db
server.

Regards,
Rick
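
If it helps, here is a minimal sketch of checking and raising that limit on the MySQL side. The 500 is only a placeholder, not a recommendation; size it against the metastore's connection pool settings, and mirror the change in my.cnf so it survives the next restart.

-- run as a MySQL admin user against the metastore database server
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Max_used_connections';
SET GLOBAL max_connections = 500;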
On Dec 17, 2013 6:12 PM, Stephen Sprague sprag...@gmail.com wrote:

 hi guys,
 I'm pretty much at my wits' end on this one. I'm using hive v0.12 against a
 mysql metastore.  The error manifests itself (mostly randomly, which is the
 problem) as this error in the client:

 2013-12-17 16:54:03,158 ERROR metastore.RetryingRawStore
 (RetryingRawStore.java:invoke(146)) - JDO datastore error. Retrying
 metastore command after 1000 ms (attempt 1 of 1)
 2013-12-17 16:54:04,345 ERROR metastore.RetryingHMSHandler
 (RetryingHMSHandler.java:invoke(141)) -
 MetaException(message:java.lang.RuntimeException: commitTransaction was
 called but openTransactionCalls = 0. This probably indicates that there are
 unbalanced calls to openTransaction/commitTransaction)


 but I believe this to be just a symptom of something happening further
 upstream.


 Examining the hive.log file for this user (under /tmp/user/hive.log) I
 find this hideous traceback, with bonecp being the focus of attention.
 There's nothing wrong with the mysql server metastore, especially given it
 works sometimes.  I think it has something to do with the connection
 pooling logic, but that's only speculation.


 Before I surrender and roll back to a previous version I ask:  Has anyone
 encountered this before?  Or are there any nuggets in the traceback worthy

 thanks,
 Stephen.


 2013-12-17 16:54:03,142 ERROR bonecp.ConnectionHandle
 (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem.
 Killing off all remaining connections in the connection pool. SQL State =
 08S01
 2013-12-17 16:54:03,144 ERROR bonecp.ConnectionHandle
 (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem.
 Killing off all remaining connections in the connection pool. SQL State =
 08007
 2013-12-17 16:54:03,146 ERROR bonecp.ConnectionHandle
 (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem.
 Killing off all remaining connections in the connection pool. SQL State =
 08007
 2013-12-17 16:54:03,147 ERROR bonecp.ConnectionHandle
 (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem.
 Killing off all remaining connections in the connection pool. SQL State =
 08007
 2013-12-17 16:54:03,148 ERROR bonecp.BoneCP
 (BoneCP.java:terminateAllConnections(199)) - Error in attempting to close
 connection
 com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:
 Communications link failure during rollback(). Transaction resolution
 unknown.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
 at com.mysql.jdbc.Util.getInstance(Util.java:386)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
 at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4730)
 at
 com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4325)
 at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1557)
 at
 com.jolbox.bonecp.ConnectionHandle.internalClose(ConnectionHandle.java:396)
 at
 com.jolbox.bonecp.BoneCP.terminateAllConnections(BoneCP.java:197)
 at
 com.jolbox.bonecp.ConnectionHandle.markPossiblyBroken(ConnectionHandle.java:298)
 at
 com.jolbox.bonecp.ConnectionHandle.internalClose(ConnectionHandle.java:404)
 at
 com.jolbox.bonecp.BoneCP.terminateAllConnections(BoneCP.java:197)
 at
 com.jolbox.bonecp.ConnectionHandle.markPossiblyBroken(ConnectionHandle.java:298)
 at
 com.jolbox.bonecp.ConnectionHandle.internalClose(ConnectionHandle.java:404)
 at
 com.jolbox.bonecp.BoneCP.terminateAllConnections(BoneCP.java:197)
 at
 com.jolbox.bonecp.ConnectionHandle.markPossiblyBroken(ConnectionHandle.java:298)
 at
 com.jolbox.bonecp.ConnectionHandle.getTransactionIsolation(ConnectionHandle.java:635)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:422)
 at
 org.datanucleus.store.rdbms.SQLController.getStatementForQuery(SQLController.java:316)
 at
 

Re: How to prevent user drop table in Hive metadata?

2013-11-22 Thread Richard Nadeau
You can use: ALTER TABLE {table_name} ENABLE NO_DROP;

And it will keep a user from dropping the table - but it can be overridden.

Rick
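
A hedged sketch with a made-up table name (web_logs). If I remember the table-protection feature correctly, the CASCADE form extends the protection to existing partitions and there is a related OFFLINE flag, but as noted above any user with enough privileges can simply disable it again.

-- protect the table (and, with CASCADE, its existing partitions) from DROP
ALTER TABLE web_logs ENABLE NO_DROP CASCADE;
-- optionally block reads of the data as well
ALTER TABLE web_logs ENABLE OFFLINE;
-- either protection can be switched back off
ALTER TABLE web_logs DISABLE NO_DROP CASCADE;
ALTER TABLE web_logs DISABLE OFFLINE;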
On Nov 22, 2013 12:36 PM, Echo Li echo...@gmail.com wrote:

 Good Friday!

 I was trying to apply a certain level of security in our hive data
 warehouse by modifying the access mode of directories and files on hdfs to 755.
 I think that's good enough to keep a new user from removing data; however, the user
 can still drop the table definition in the hive cli, and it seems the revoke doesn't
 help much. Is there any way to prevent this?


 Thanks,
 Echo



histogram_numeric find the most frequent value

2013-10-16 Thread Richard
I want to find the most frequent value of a column. I noticed histogram_numeric,
but I cannot specify the bin boundary. The result is not what I want. 


take an example as follows,  I want something like


select gid, most_frequent(category) from mytable group by gid.


where category is a column with discretized values.


thanks.
Richard
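
For what it is worth, a sketch of one way to express a per-group "most frequent" in plain HiveQL, assuming a Hive release with windowing functions (0.11 or later); gid, category and mytable are the names from the question above. Ties are broken arbitrarily; use RANK() instead of ROW_NUMBER() if all tied categories should come back.

SELECT gid, category
FROM (
  SELECT gid, category,
         ROW_NUMBER() OVER (PARTITION BY gid ORDER BY cnt DESC) AS rn
  FROM (
    SELECT gid, category, COUNT(*) AS cnt
    FROM mytable
    GROUP BY gid, category
  ) counted
) ranked
WHERE rn = 1;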

Re:Re: histogram_numeric find the most frequent value

2013-10-16 Thread Richard
good idea, I will try. thanks

At 2013-10-16 19:12:30,Ed Soniat eson...@liveperson.com wrote:

You could use modular math to transform the data into single-value 
representations of each range you intend to represent with your boundaries, using 
a sub-select.





On Wed, Oct 16, 2013 at 7:09 AM, Richard codemon...@163.com wrote:

I want to find the most frequent value of a column. I noticed histogram_numeric,
but I cannot specify the bin boundary. The result is not what I want. 


take an example as follows,  I want something like


select gid, most_frequent(category) from mytable group by gid.


where category is a column with discretized values.


thanks.
Richard








Hive external table to HBase multiple versions

2013-04-29 Thread Richard Payne
Hi

Is it possible to create a Hive external table pointing to an HBase column
family and have all versions of the data for the keys returned, rather than
just (I presume) the latest version of each key/column?

Regards


Re: Upgrade from Hive 0.9 to Hive 0.10 Heap Error on show tables;

2013-04-03 Thread Richard Nadeau
Hi John,

Do you have a copy of the MySQL JDBC driver in your Hive library path?

Rick
On Apr 3, 2013 3:57 PM, John Omernik j...@omernik.com wrote:

 Not sure what the issue is; the conf is good, I validated that I can log in to mysql
 with the username in the hive-site, and I ran the metastore update scripts.

 show tables;
 java.lang.OutOfMemoryError: Java heap space
  at
 org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:353)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:215)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:412)
  at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:399)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:736)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:74)
 at $Proxy6.getDatabase(Unknown Source)
  at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1110)
 at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1099)
  at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:2206)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:334)
  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
 FAILED: Execution Error, return code -101 from
 org.apache.hadoop.hive.ql.exec.DDLTask


 Thoughts?



Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread Richard Nadeau
What do you have set in core-site.xml for io.sort.mb, io.sort.factor, and
io.file.buffer.size? You should be able to adjust these and get past the
heap issue. Be careful about how much RAM you have though, and don't set them
too high.

Rick
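
In case it helps, those knobs can also be overridden per session from the Hive CLI before running the transform query, so they end up in the job configuration of the forked task JVMs on an MR1-era cluster. The numbers below are placeholders rather than recommendations; in particular, io.sort.mb has to fit comfortably inside the child task heap.

SET io.sort.mb=256;
SET io.sort.factor=25;
SET io.file.buffer.size=131072;
SET mapred.child.java.opts=-Xmx1024m;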
On Jan 30, 2013 8:55 AM, John Omernik j...@omernik.com wrote:

 So it's filling up in the emitting stage, so I need to look at the task
 logs and/or my script that's printing to stdout as the likely culprits, I am
 guessing.



 On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 That particular OutOfMemoryError is happening on one of your hadoop
 nodes. It's the heap within the process forked by the hadoop tasktracker, I
 think.

 Phil.


 On 30 January 2013 14:28, John Omernik j...@omernik.com wrote:

 So just a follow-up. I am less looking for specific troubleshooting on
 how to fix my problem, and more looking for a general understanding of heap
 space usage with Hive.  When I get an error like this, is it heap space on
 a node, or heap space on my hive server?  Is it the heap space of the
 tasktracker? Heap of the job kicked off on the node?  Which heap is being
 affected? If it's not clear in my output, where can I better understand
 this? I am sorely out of my league here when it comes to understanding the
 JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
 trackers are run etc.

 Thanks is advance!



 On Tue, Jan 29, 2013 at 7:43 AM, John Omernik j...@omernik.com wrote:

 I am running a transform script that parses through a bunch of binary
 data. In 99% of the cases it runs fine, but on certain files I get
 a failure (as seen below).  Funny thing is, I can run a job with only the
 problem source file, and it will work fine, but when run as part of a group of files, I
 get these warnings.  I guess what I am asking here is this: Where is the
 heap error? Is this occurring on the nodes themselves or, since this is
 where the script is emitting records (and potentially large ones at that)
 and in this case my hive server running the job may be memory light, could
 the issue actually be due to heap on the hive server itself?   My setup is
 1 Hive node (that is woefully underpowered, under memoried, and under disk
 I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue on
 the sender or the receiver :)




 2013-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
 brand-new compressor
 2013-01-29 08:20:24,107 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
 2013-01-29 08:20:24,412 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
 2013-01-29 08:25:47,532 INFO
 

how may map-reduce needed in a hive query

2013-01-22 Thread Richard
I am wondering how to determine the number of map-reduce jobs for a hive query.


for example, the following query


select 
sum(c1),
sum(c2),
k1
from
(
select transform(*) using 'mymapper'  as c1, c2, k1
from t1
) a group by k1; 


when I run this query, it takes two map-reduce jobs, but I expect it to take only one:
in the map stage, use 'mymapper' as the mapper, then shuffle the mapper 
output by k1 and perform the sum in the reducer.


so why does hive take 2 map-reduce jobs?

Re:how may map-reduce needed in a hive query

2013-01-22 Thread Richard
thanks. I used the explain command and got the plan, but I am still confused.
Below is the description of the two map-reduce stages:


it seems that in stage-1 the aggregation has already been done, so why does stage-2 have 
aggregation again?




==
STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Alias - Map Operator Tree:
a:t1 
  TableScan
alias: t1
Select Operator
  expressions:
expr: f
type: string
  outputColumnNames: _col0
  Transform Operator
command: mymapper
output info:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Select Operator
  expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
  outputColumnNames: _col0, _col1, _col2
  Group By Operator
aggregations:
  expr: sum(_col0)
  expr: sum(_col1)
bucketGroup: false
keys:
  expr: _col2
  type: string
mode: hash
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
  key expressions:
expr: _col0
type: string
  sort order: +
  Map-reduce partition columns:
expr: rand()
type: double
  tag: -1
  value expressions:
expr: _col1
type: double
expr: _col2
type: double
  Reduce Operator Tree:
Group By Operator
  aggregations:
expr: sum(VALUE._col0)
expr: sum(VALUE._col1)
  bucketGroup: false
  keys:
expr: KEY._col0
type: string
  mode: partials
  outputColumnNames: _col0, _col1, _col2
  File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
Map Reduce
  Alias - Map Operator Tree:

hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
 
Reduce Output Operator
  key expressions:
expr: _col0
type: string
  sort order: +
  Map-reduce partition columns:
expr: _col0
type: string
  tag: -1
  value expressions:
expr: _col1
type: double
expr: _col2
type: double
  Reduce Operator Tree:
Group By Operator
  aggregations:
expr: sum(VALUE._col0)
expr: sum(VALUE._col1)
  bucketGroup: false
  keys:
expr: KEY._col0
type: string
  mode: final
  outputColumnNames: _col0, _col1, _col2
  Select Operator
expressions:
  expr: _col1
  type: double
  expr: _col2
  type: double
  expr: _col0
  type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
  compressed: false
  GlobalTableId: 0
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat










At 2013-01-23 11:45:13,Richard codemon...@163.com wrote:

I am wondering how to determine the number of map-reduce jobs for a hive query.


for example, the following query


select 
sum(c1),
sum(c2),
k1
from
(
select transform(*) using 'mymapper'  as c1, c2, k1
from t1
) a group by k1; 


when I run this query, it takes two map-reduce jobs, but I expect it to take only one:
in the map stage, use 'mymapper' as the mapper, then shuffle the mapper 
output by k1 and perform the sum in the reducer.


so why does hive take 2 map-reduce jobs?
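
A note on the plan quoted above: the Stage-1 reducer runs in mode: partials and its map output is partitioned on rand(), which is what the plan looks like when hive.groupby.skewindata is enabled. With that setting Hive deliberately splits the group-by into two jobs: the first spreads rows across reducers at random and computes partial sums, and the second re-shuffles on k1 and computes the final sums. If that is the case here and the data is not badly skewed, a sketch of forcing the single-job plan (with the subquery written in parentheses):

SET hive.groupby.skewindata=false;
-- with map-side aggregation on, the partial sums happen inside the mappers
SET hive.map.aggr=true;
SELECT k1, sum(c1), sum(c2)
FROM (
  SELECT TRANSFORM(*) USING 'mymapper' AS (c1, c2, k1)
  FROM t1
) a
GROUP BY k1;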




Re:Re: create a hive table: always a tab space before each line

2013-01-14 Thread Richard
thanks.
it seems that as long as I use sequencefile as the storage format, there
will be a \t before the first column. If this output is continuously used by
hive, it is fine. The problem is that I may use a self-defined map-reduce
job to read these files.  Does that mean I have to take care of 
this \t myself?


Is there any option to disable this \t in hive?




At 2013-01-09 22:38:11,Dean Wampler dean.wamp...@thinkbiganalytics.com 
wrote:
To add to what Nitin said, there is no key output by Hive in front of the tab.


On Wed, Jan 9, 2013 at 3:07 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

you may want to look at the sequencefile format 
http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432



that tab is to separate the key from the value in the record (I may be wrong, but this 
is how I interpreted it) 



On Wed, Jan 9, 2013 at 12:49 AM, Richard codemon...@163.com wrote:

more information:


if I set the format as textfile, there is no tab space. 
if I set the format as sequencefile and view the content via hadoop fs -text, I 
saw a tab space in the head of each line.



At 2013-01-09 15:44:00,Richard codemon...@163.com wrote:

hi there


I have a problem with creating a hive table.
no matter what field delimiter I used, I always got a tab space in the head of 
each line (a line is a record).
something like this:
\t f1 \001 f2 \001 f3 ...
where f1 , f2 , f3 denotes the field value and \001 is the field separator.


here is the clause I used 
create external table if not exists ${HIVETBL_my_table}
(
nid string, 
userid string, 
spv bigint, 
sipv bigint, 
pay bigint, 
spay bigint, 
ipv bigint, 
sellerid string, 
cate string
)
partitioned by(ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';


thanks for help.


Richard











--
Nitin Pawar






--
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330



Re:create a hive table: always a tab space before each line

2013-01-09 Thread Richard
more information:


if I set the format as textfile, there is no tab space. 
if I set the format as sequencefile and view the content via hadoop fs -text, I 
saw a tab space in the head of each line.


At 2013-01-09 15:44:00,Richard codemon...@163.com wrote:

hi there


I have a problem with creating a hive table.
no matter what field delimiter I used, I always got a tab space in the head of 
each line (a line is a record).
something like this:
\t f1 \001 f2 \001 f3 ...
where f1 , f2 , f3 denotes the field value and \001 is the field separator.


here is the clause I used 
create external table if not exists ${HIVETBL_my_table}
(
nid string, 
userid string, 
spv bigint, 
sipv bigint, 
pay bigint, 
spay bigint, 
ipv bigint, 
sellerid string, 
cate string
)
partitioned by(ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';


thanks for help.


Richard




Re:Re: create a hive table: always a tab space before each line

2013-01-09 Thread Richard
I am trying to create a table and insert overwrite it, so the data is supposed 
to be generated.






At 2013-01-09 17:17:06,Anurag Tangri tangri.anu...@gmail.com wrote:

Hi Richard,
You should set the format in the create external table command based on the format 
of your data on HDFS.


Is your data a text file or a seq file on HDFS?


Thanks,
Anurag Tangri

Sent from my iPhone

On Jan 9, 2013, at 12:49 AM, Richard  codemon...@163.com wrote:


more information:


if I set the format as textfile, there is no tab space. 
if I set the format as sequencefile and view the content via hadoop fs -text, I 
saw a tab space in the head of each line.


At 2013-01-09 15:44:00,Richard codemon...@163.com wrote:

hi there


I have a problem with creating a hive table.
no matter what field delimiter I used, I always got a tab space in the head of 
each line (a line is a record).
something like this:
\t f1 \001 f2 \001 f3 ...
where f1 , f2 , f3 denotes the field value and \001 is the field separator.


here is the clause I used 
create external table if not exists ${HIVETBL_my_table}
(
nid string, 
userid string, 
spv bigint, 
sipv bigint, 
pay bigint, 
spay bigint, 
ipv bigint, 
sellerid string, 
cate string
)
partitioned by(ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';


thanks for help.


Richard







create a hive table: always a tab space before each line

2013-01-08 Thread Richard
hi there

I have a problem with creating a hive table.
no matter what field delimiter I used, I always got a tab space in the head of 
each line (a line is a record).
something like this:
\t f1 \001 f2 \001 f3 ...
where f1 , f2 , f3 denotes the field value and \001 is the field separator.


here is the clause I used 
create external table if not exists ${HIVETBL_my_table}
(
nid string, 
userid string, 
spv bigint, 
sipv bigint, 
pay bigint, 
spay bigint, 
ipv bigint, 
sellerid string, 
cate string
)
partitioned by(ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';


thanks for help.


Richard

user define data format

2012-05-21 Thread Richard
 Hi, I want to use Hive on some data in the following format:
<doc>\0x01
field1=val1\0x01
field2=val2\0x01
...
</doc>\0x01

the lines between <doc> and </doc> form a record. How should I define the table?

thanks.
Richard


Re:Re: update a hive table

2012-03-18 Thread Richard
I see, thanks.




At 2012-03-16 20:30:24,Bejoy Ks bejoy...@yahoo.com wrote:

Hi Robert,
Sorry I missed answering your question:
"If mytable has many columns but I only need to update one of them, how can I 
keep the statement short?"
It is not possible in hive. At a bare minimum hive would overwrite a whole partition 
(never just a record or file), even if you frame an optimized query.



Hope it helps!...


Regards
Bejoy.K.S
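
A hedged sketch of what "overwrite only a partition" can look like in practice. The ds partition column here is hypothetical, since the mytable in the question is not shown as partitioned, and every column still has to be spelled out in the SELECT:

-- rewrite only one day's partition, lower-casing col1 and copying the rest
INSERT OVERWRITE TABLE mytable PARTITION (ds='2012-03-16')
SELECT lower(col1), col2, col3
FROM mytable
WHERE ds='2012-03-16';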


From: Bejoy Ks bejoy...@yahoo.com
To: user@hive.apache.org user@hive.apache.org
Sent: Friday, March 16, 2012 5:56 PM
Subject: Re: update a hive table



Hey Richard
 First, don't treat hive as a database. Hive is just an interface above 
map reduce that helps you write SQL-like queries that are in turn converted to MR 
jobs. That said, hive doesn't support update. 


If you are just experimenting, the query that you wrote would overwrite a whole 
record (in a broader context, a whole partition/table).


Regards
Bejoy.K.S


From: Richard codemon...@163.com
To:user@hive.apache.org
Sent: Friday, March 16, 2012 4:26 PM
Subject: update a hive table



if I want to update a table, e.g.,
 
insert overwrite table mytable
select lower(col1), col2, col3 from mytable;
 
if mytable has many columns but I only need to update one of them, how can I 
keep the statement short?
 










update a hive table

2012-03-16 Thread Richard
if I want to update a table, e.g.,
 
insert overwrite table mytable
select lower(col1), col2, col3 from mytable;
 
if mytable has many columns but I only need to update one of them, how can I 
keep the statement short?
 

Re: Multi character delimiter for Hive Columns and Rows

2011-04-28 Thread Richard Nadeau
A custom SerDe would be your best bet. We're using one to do exactly that.

Regards,
Rick
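
For the column side, here is a hedged sketch that uses RegexSerDe rather than a fully custom SerDe. The table name, column names, jar path and location are made up, and the class name assumes a build that ships the contrib RegexSerDe (hive-contrib on the classpath). Note that the row delimiter is still the newline expected by TextInputFormat, so the ||# row-delimiter part would still need a custom InputFormat or a preprocessing step, and writing back out through this SerDe would additionally need an output.format.string property.

ADD JAR /path/to/hive-contrib.jar;  -- hypothetical path to the contrib jar

CREATE EXTERNAL TABLE source_events (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- |# separates columns; the backslashes escape the pipe for the regex engine
  "input.regex" = "(.*?)\\|#(.*?)\\|#(.*)"
)
STORED AS TEXTFILE
LOCATION '/user/hive/demo/source_events';  -- hypothetical location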

On Apr 28, 2011 11:29 AM, Shantian Purkad shantian_pur...@yahoo.com
wrote:
 Any suggestions?



 
 From: Shantian Purkad shantian_pur...@yahoo.com
 To: user@hive.apache.org
 Sent: Tue, April 26, 2011 11:05:46 PM
 Subject: Multi character delimiter for Hive Columns and Rows


 Hello,

 We have a situation where the data coming from source systems to hive may
 contain the common characters and delimiters such as |, tabs, new line
 characters etc.

 We may have to use multi character delimiters such as |# for columns and

 ||# for rows.

 How can we achieve this? In this case our single rows may look like below (|# is
 the column delimiter and ||# is the row delimiter):

 row 1 col1 |# row 1 col2 |# row 1 col 3 has
 two
 new line characters |# and this is
 the last column of row 1 ||# row 2 col1 |# row 2 col2 |# row 2 col 3 has
 one tab and one new line character |# and this is
 the last column of row 2 ||#

 Would a custom SerDe help us handle this situation?

 Thanks and Regards,
 Shantian