- Original Message -
From: "Kevin Risden"
To: user@hive.apache.org
Sent: Tuesday, October 16, 2018 9:44:17 AM
Subject: Re: HiveServer2 performance references?

Thanks for the responses. I'll wait to test with HIVE-20621 to see how much that helps. For HIVE-17194, it doesn't look like that is documented outside of JIRA currently. It looks like I may have picked up over-the-wire compression by default; I had assumed it wasn't compressed.

Kevin Risden

On Tue, Oct 16, 2018 at 12:44 AM Gopal Vijayaraghavan < gop...@apache.org > wrote:

Hi,

> I was looking at HiveServer2 performance going through Knox in KNOX-1524 and
> found that HTTP mode is significantly slower.

The HTTP mode does re-auth for every row before HIVE-20621 was fixed - Knox should be doing cookie-auth to prevent ActiveDirectory/LDAP from throttling this. I fixed it for the open-source JDBC drivers, but I think the proprietary ODBC drivers are still broken (the C++ code doesn't save cookies).

> I also didn't see any information about reducing amount of data over the wire
> with compression.

It's possible Knox doesn't generate/forward the Accept-Encoding headers.

https://issues.apache.org/jira/browse/HIVE-17194

Cheers,
Gopal

From: Prasanth Jayachandran < pjayachand...@hortonworks.com >
Reply-To: " user@hive.apache.org " < user@hive.apache.org >
Date: Monday, October 15, 2018 at 12:34 PM
To: " user@hive.apache.org " < user@hive.apache.org >
Subject: Re: HiveServer2 performance references?

One related performance fix that went in recently is HIVE-20621.

Thanks
Prasanth

On Mon, Oct 15, 2018 at 12:11 PM -0700, " Kevin Risden " < kris...@apache.org > wrote:

Are there any resources on HiveServer2 performance - specifically binary vs HTTP mode? I would also be interested in any information about enabling compression between HiveServer2 and a client like beeline. I was looking at HiveServer2 performance going through Knox in KNOX-1524 and found that HTTP mode is significantly slower. I searched and didn't find any references to the performance of HiveServer2 out of the box. I also didn't see any information about reducing the amount of data over the wire with compression. Any pointers would be very helpful.

Thanks!
Kevin Risden
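For reference, the two transport modes discussed in this thread are selected via the JDBC URL; a typical pair of beeline connection strings looks like the sketch below (host names and the database are placeholders, and the httpPath value must match your hive.server2.thrift.http.path setting):

```
# Binary (Thrift over TCP) mode, default port 10000
jdbc:hive2://hs2-host:10000/default

# HTTP mode, default port 10001, e.g. when routing through a gateway like Knox
jdbc:hive2://hs2-host:10001/default;transportMode=http;httpPath=cliservice
```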
Re: Unable to read hive external table data which is linked to s3 after upgrading from 2.1.1 to 2.3.3
To be honest, my advice was just based on your post - we only use Hive in AWS EMR, so I couldn't tell you. Glad that at least you're back up though.

- Original Message -
From: "Anup Tiwari"
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:50:23 AM
Subject: Re: Unable to read hive external table data which is linked to s3 after upgrading from 2.1.1 to 2.3.3

Hi Richard,

After looking at the Hive 2.3.3 logs I found that when all configuration parameters are loaded, the following message is printed:

Values omitted for security reason if present: [fs.s3n.awsAccessKeyId, fs.s3a.access.key, fs.s3.awsAccessKeyId, hive.server2.keystore.password, fs.s3a.proxy.password, javax.jdo.option.ConnectionPassword, fs.s3.awsSecretAccessKey, fs.s3n.awsSecretAccessKey, fs.s3a.secret.key]

while the Hive 2.1.1 logs show:

Values omitted for security reason if present: [hive.server2.keystore.password, javax.jdo.option.ConnectionPassword]

Can this be the reason why Hive 2.3.3 is not able to read the s3-related params from hive-site.xml? I found a related JIRA: https://issues.apache.org/jira/browse/HIVE-14588

I have also set the property below in hive-site.xml, excluding the two s3 variables (fs.s3a.access.key, fs.s3a.secret.key) from the list, and it worked:

hive.conf.hidden.list
javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password
(Comma separated list of configuration options which should not be read by normal user, like passwords.)

Let me know if there is any other solution, because if these variables are part of hidden.list by default then there should be some other proper workaround for this.

Regards,
Anup Tiwari
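A minimal sketch of that hive-site.xml override, reusing the property value quoted in the message above (i.e. the default hidden list from HIVE-14588 with the two fs.s3a credential keys dropped):

```
<!-- hive-site.xml: override hive.conf.hidden.list so the fs.s3a
     credentials set in hive-site.xml remain readable by jobs.
     Note: this also makes them visible via SET, which is a security
     trade-off; a Hadoop credential provider is the safer alternative. -->
<property>
  <name>hive.conf.hidden.list</name>
  <value>javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password</value>
  <description>Comma separated list of configuration options which should not be read by normal user, like passwords</description>
</property>
```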
Re: Unable to read hive external table data which is linked to s3 after upgrading from 2.1.1 to 2.3.3

I hear you, but given the exception log, it does seem that it can't authenticate you. You can try using the AWS environment variables. If that resolves the issue then you'll have some more to go on. According to Hortonworks here: https://hortonworks.github.io/hdp-aws/s3-security/

"AWS CLI supports authentication through environment variables. These same environment variables will be used by Hadoop if no configuration properties are set."

- Original Message -
From: "Anup Tiwari" < anupsdtiw...@gmail.com >
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:06:33 AM
Subject: Re: Unable to read hive external table data which is linked to s3 after upgrading from 2.1.1 to 2.3.3

We are not using EMR. Also, we have set the params below for accessing the s3 bucket in hive-site.xml, the same as what we had set in Hive 2.1.1:

* fs.s3a.access.key
* fs.s3a.secret.key
* fs.s3a.connection.maximum
* fs.s3a.impl

Regards,
Anup Tiwari
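As a quick way to try the environment-variable route suggested above, export the standard AWS variables in the shell that launches the Hive CLI / HiveServer2. The values here are placeholders, not real credentials:

```shell
# Standard AWS SDK environment variables; the
# EnvironmentVariableCredentialsProvider visible in the exception's
# provider chain picks these up when no fs.s3a.* properties are set.
export AWS_ACCESS_KEY_ID="AKIA_EXAMPLE_ONLY"
export AWS_SECRET_ACCESS_KEY="example-secret-key"

# Sanity-check that the shell actually exported them
env | grep '^AWS_ACCESS_KEY_ID='
```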
Re: Unable to read hive external table data which is linked to s3 after upgrading from 2.1.1 to 2.3.3

Based on the exception, it looks more like an AWS credentials issue than a Hive issue. Are you running in AWS EMR, or on-prem? In AWS, the resource accessing the S3 bucket would have to have an IAM role that grants permission. If you are running somewhere else, whatever AWS login you use would have to have the correct permissions in IAM.

- Original Message -
From: "Anup Tiwari"
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 9:11:37 AM
Subject: Unable to read hive external table data which is linked to s3 after upgrading from 2.1.1 to 2.3.3

Hi All,

When I try to read an s3-linked external table in 2.3.3, I get errors. It was working properly in 2.1.1. Please find the details below and let me know if I am missing something:

Hadoop Version: 2.8.0

Query:
select log_date, count(1) as cnt from test.tt1 group by log_date;

Error:
Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, vertex=vertex_1523502631429_0029_3_00 [Map 1], org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:151)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
... 24 more
Caused by: com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/
Re: ALTER TABLE DROP PARTITION not working on S3
I can't duplicate the issue now. Works like it always has.

- Original Message -
From: "Furcy Pin"
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:57:39 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3

Indeed. If I remember correctly, S3 does not really have the concept of a "folder" like HDFS has, and Hive sort of makes up for it by creating a descriptor file where the partition "folder" is supposed to be. Maybe this is what is missing here.

Perhaps you could try doing a "MSCK REPAIR TABLE tablename" to make sure that the partitions are correctly loaded, and then try dropping that particular partition again? Or look at your S3 folder to see if there is any such "partition folder file" and check whether it is missing for this particular partition?
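The two checks suggested in this thread can be run straight from the Hive CLI; a sketch, using the table name from the messages below:

```
-- Re-sync the metastore with whatever partition directories exist on S3
MSCK REPAIR TABLE api_measurements;

-- Confirm whether the table is MANAGED_TABLE or EXTERNAL_TABLE
-- (only managed tables delete underlying data on DROP PARTITION / DROP TABLE)
DESCRIBE FORMATTED api_measurements;
```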
Re: ALTER TABLE DROP PARTITION not working on S3
Leaving the column list out, here you go:

# Detailed Table Information
Database:            default
Owner:               hadoop
CreateTime:          Thu Apr 05 13:24:33 UTC 2018
LastAccessTime:      UNKNOWN
Retention:           0
Location:            s3://zoomi-proto-warehouse-measurements/
Table Type:          MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    numFiles               10
    numPartitions          7
    numRows                153
    orc.compress           ZLIB
    orc.create.index       true
    rawDataSize            113563
    totalSize              37801
    transient_lastDdlTime  1522934673

# Storage Information
SerDe Library:       org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:        org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:          No
Num Buckets:         61
Bucket Columns:      [crs_id]
Sort Columns:        []
Storage Desc Params:
    serialization.format  1
Time taken: 0.467 seconds, Fetched: 98 row(s)

- Original Message -
From: "Furcy Pin"
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3

Hi Richard,

Could you please check if your table is EXTERNAL? You can see it with a "DESCRIBE FORMATTED table_name;" That's what external tables are for: they don't delete underlying data when you drop them.
Re: ALTER TABLE DROP PARTITION not working on S3
No, definitely internal. It's loaded from an external table.
Re: ALTER TABLE DROP PARTITION not working on S3
I think that someone put a file in there manually. Would that prevent Hive from dropping the partition? I also did a "drop table" and the S3 object keys persisted.
ALTER TABLE DROP PARTITION not working on S3
Hi,

I have a Hive managed table on S3, "api_measurements". I've tried dropping a partition like so:

hive> alter table api_measurements drop if exists partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3. Because object keys on S3 are always strings, I also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there. I assume that I'm missing something really simple. Can anyone shed some light on this?

Thanks
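One related knob worth checking when a managed table's data survives a drop: by default Hive moves dropped data to the trash rather than deleting it, and a PURGE modifier skips the trash. This does not by itself explain data persisting on S3, but it helps rule trash-related retention in or out. A sketch using the table from this thread:

```
-- Drop the partition and skip the trash; files are removed right away
ALTER TABLE api_measurements
  DROP IF EXISTS PARTITION (daydate='2018-04-04', epoch=1522876500) PURGE;

-- Alternatively, mark the table so every drop skips the trash
ALTER TABLE api_measurements SET TBLPROPERTIES ('auto.purge'='true');
```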
Re: Hive, Tez, clustering, buckets, and Presto
This is great information, Gopal, thank you. I wish I had the time to create a comparison for our use case between Hive buckets in ORC files on S3 and ORC files without buckets. Unfortunately it's a chicken-and-egg issue, since I won't have enough data volume until we are in production, which should be shortly.

Thanks again.

- Original Message -
From: "Gopal Vijayaraghavan"
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so they're asking "where is the Hive bucketing spec". Is it just to read the
> code for that function?

This worked the other way around in time, rather than a spec being written first - ACIDv1 implemented streaming ingest via Storm, and it used an explicit naming "bucket_" for the filename. Since until the compaction runs the actual base files don't exist, the ACID bucketing implementation has to handle missing buckets as 0 rows in the base file + possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work similarly, for the ability to do bucket map-joins between ACID & non-ACID bucketed tables - particularly regarding the modulus for -ve numbers, which was broken in Hive 1.0.

https://issues.apache.org/jira/browse/HIVE-12025

That's the place where this all got refactored so that joins & filters for bucketed tables work the same way for ACID & non-ACID tables. Because of that, the spec lives in the comments now as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered, Presto should be expecting >1 files per bucket. I saw a JIRA comment which said "sort in file order and assign buckets" - you can see that is only applicable for the 1st insert into the table (& the regex will remove the copy numbering).

And oddly enough, this week I saw an academic paper with a negative analysis of Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal
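As a rough illustration of the bucketing contract described above (file name -> hash % buckets, with a non-negative modulus), here is a small sketch. The hash for an int key is the identity (as in Java's Integer.hashCode), and the filename regex is a simplified stand-in for the one referenced in Utilities.java, not the actual Hive code:

```python
import re

def java_int_hash(x: int) -> int:
    """Identity hash for Hive int keys (matches Java's Integer.hashCode)."""
    return x

def bucket_for_key(key_hash: int, num_buckets: int) -> int:
    # Non-negative modulus: (hash & Integer.MAX_VALUE) % buckets,
    # sidestepping the negative-modulus bug mentioned for Hive 1.0.
    return (key_hash & 0x7FFFFFFF) % num_buckets

# Simplified stand-in for the bucket-filename pattern: it accepts names
# like 000000_0, 000001_0_copy_1, and ACID-style bucket_00001.
BUCKET_FILE = re.compile(r'^(?:bucket_)?(\d+)(?:_\d+)?(?:_copy_\d+)?$')

def bucket_id_from_file(name: str) -> int:
    m = BUCKET_FILE.match(name)
    if not m:
        raise ValueError(f"not a bucket file: {name}")
    return int(m.group(1))

# The 4-bucket example table above: rows 1..4 land in buckets 1, 2, 3, 0
assignments = {x: bucket_for_key(java_int_hash(x), 4) for x in (1, 2, 3, 4)}
print(assignments)
# The "_copy_N" suffix from the second insert maps to the same bucket id
print(bucket_id_from_file("000001_0_copy_1"))
```

Note how the copy numbering from the second insert is stripped by the regex, so both files for a bucket report the same bucket id, which is the behaviour Gopal says Presto should rely on.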
Re: Building Data Warehouse Application in Spark
Mahender, To really address your question I think that you'd have to supply a bit more information, such as the kind of data that you want to save; RBDMS type look ups, key/value/index type look ups, insert velocity, etc. These wide choices of technologies are suited to different use cases, although they overlap in some areas. In a previous position that I held we used Spark on Cassandra to solve a similar problem. The Datastax distribution puts Spark worker nodes directly on Cassandra nodes. Because Cassandra partitions the data across nodes based on a row key, it's a nice match. If the key is chosen properly, the Spark nodes are typically accessing local data on the Cassandra nodes, meaning that there are typically very few shuffles for direct queries and also that inserts go directly to the proper Cassandra nodes. We had time series data based on unique row keys. So our row keys were unique and our column keys were the time stamps. In that case our queries were done directly with the Cassandra clients for the most part, with SparkQL primarily used for ad-hoc queries. At my current position, we directly load raw data into Hive (using HiveQL) and then use Presto for queries. That's our OLAP data store. You can use any number of other tools to query Hive created data stores as well. Then we have another pipeline that takes the same raw data, uses Spark for the ETL, and then inserts the results into Aurora (MySQL). The schema is designed for specific queries, so the Spark ETL is designed to transform the data to optimize for the schema so as to allow efficient updates to those tables. That's our OLTP data store and we use standard SQL for queries. Rick - Original Message - From: "Furcy Pin" To: user@hive.apache.org Sent: Wednesday, April 4, 2018 6:58:58 AM Subject: Re: Building Datwarehouse Application in Spark Hi Mahender, Did you look at this? 
https://www.snappydata.io/blog/the-spark-database

But I believe that most people handle this use case by using either:
- their favorite regular RDBMS (MySQL, Postgres, Oracle, SQL Server, ...) if the data is not too big, or
- their favorite NewSQL storage (Cassandra, HBase) if the data is too big and needs to be distributed.

Spark generally makes it easy enough to query these other databases to allow you to perform analytics. Hive and Spark have been designed as OLAP tools, not OLTP. I'm not sure what features you are seeking for your SCD, but they probably won't be part of Spark's core design.

Hope this helps,
Furcy

On 4 April 2018 at 11:29, Mahender Sarangam < mahender.bigd...@outlook.com > wrote:

Hi,

Does anyone have a good architecture document/design principle for building a warehouse application using Spark? Is it a better way to have a Hive context created with HQL and perform transformations, or to load files directly into a DataFrame and perform the data transformations? We need to implement SCD Type 2 in Spark. Is there any better document/reference for building Type 2 warehouse objects?

Thanks in advance
/Mahender
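For the SCD Type 2 question above, the core merge logic is the same regardless of engine. Below is a minimal plain-Python sketch of it, under assumed column names (`key`, `attrs`, `start_date`, `end_date` are hypothetical, not from the thread); in Spark the same idea becomes a join between the current dimension and the incoming snapshot, followed by a union of closed, unchanged, and new rows:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # open-ended "current row" marker

def scd2_merge(current_rows, incoming, as_of):
    """Apply Slowly Changing Dimension Type 2 logic.

    current_rows: list of dicts with keys: key, attrs, start_date, end_date
    incoming:     dict of key -> attrs (the latest snapshot)
    as_of:        effective date of the incoming snapshot
    Returns the new dimension table with full history preserved.
    """
    result = []
    seen = set()
    for row in current_rows:
        if row["end_date"] != HIGH_DATE:
            result.append(row)          # already-closed history: keep as-is
            continue
        key = row["key"]
        seen.add(key)
        new_attrs = incoming.get(key)
        if new_attrs is None or new_attrs == row["attrs"]:
            result.append(row)          # unchanged: keep the open row
        else:
            result.append(dict(row, end_date=as_of))   # close the old version
            result.append({"key": key, "attrs": new_attrs,
                           "start_date": as_of, "end_date": HIGH_DATE})
    for key, attrs in incoming.items():
        if key not in seen:             # brand-new keys: open a first version
            result.append({"key": key, "attrs": attrs,
                           "start_date": as_of, "end_date": HIGH_DATE})
    return result
```

This is only the row-versioning rule itself; a production Spark job would also need to handle late-arriving data and partition overwrites, which Hive/Spark do not provide as a built-in SCD feature.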
Re: Hive, Tez, clustering, buckets, and Presto
Gopal,

The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec. I quoted "Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)," so they're asking "where is the Hive bucketing spec?" Is it just to read the code for that function? They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan"
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.
> My understanding was that this was related to the number of buckets, but you
> are relating it to ORC stripes. It is odd that no examples that I've seen
> include the SORTED BY statement other than in relation to ORC indexes (that I
> understand). So the question is: regardless of whether efficient ORC stripes
> are created (wouldn't I have to also specify 'orc.create.index'='true' for
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e. the same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,), which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself. An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2), producing more predictable order during shuffle. Then key1 can be RLE encoded so that the ORC vector impl will pass it on as key1 x 1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5 was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up with bucketing as their main problem. Most of what I have written above was discussed a few years back and in general, bucketing on a high-cardinality column + sorting on a low-cardinality one together has given good results to my customers.

> I hadn't thought of the even number issue, not having looked at the
> function; I had assumed that it was a hash, not a modulus; shame on me.
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question,
> which is: can I force Tez to create all bucket files so Presto queries can
> succeed? Anyway, I will be testing today and the solution will either be to
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support). Presto could fix their fail-safe bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile). The file count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.

I wrote a few UDFs which were used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract, where uncovered buckets are assumed to have 0 rows when no file is present, rather than erroring out. If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal
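Gopal's point above - that SORTED BY turns the shuffle key into (modulus, key1, key2), so repeated key values arrive as long runs that RLE can collapse - can be illustrated with a toy run-length count. This is only a sketch of the effect, not the ORC implementation; the hypothetical data is a low-cardinality key repeated across many rows:

```python
from itertools import groupby

def run_lengths(values):
    """Length of each run of repeated values (what RLE would encode)."""
    return [len(list(g)) for _, g in groupby(values)]

# Hypothetical low-cardinality sort key, interleaved as it would arrive
# without SORTED BY: every run has length 1, so RLE buys nothing.
rows = [x % 5 for x in range(1000)]
print(len(run_lengths(rows)))          # 1000 runs of length 1

# With an explicit sort, the same rows collapse to 5 long runs, and a
# filter comparison can be done once per run instead of once per row.
print(run_lengths(sorted(rows)))       # [200, 200, 200, 200, 200]
```

The 1000-to-5 drop in comparison points is the same mechanism behind the "1000x fewer comparisons" figure in the message above.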
Re: Hive, Tez, clustering, buckets, and Presto
Gopal,

Thanks for this. Great information and something to look at more closely to better understand the internals.

Rick

- Original Message -
From: "Gopal Vijayaraghavan"
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.
> My understanding was that this was related to the number of buckets, but you
> are relating it to ORC stripes. It is odd that no examples that I've seen
> include the SORTED BY statement other than in relation to ORC indexes (that I
> understand). So the question is: regardless of whether efficient ORC stripes
> are created (wouldn't I have to also specify 'orc.create.index'='true' for
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e. the same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,), which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself. An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2), producing more predictable order during shuffle. Then key1 can be RLE encoded so that the ORC vector impl will pass it on as key1 x 1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5 was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up with bucketing as their main problem. Most of what I have written above was discussed a few years back and in general, bucketing on a high-cardinality column + sorting on a low-cardinality one together has given good results to my customers.

> I hadn't thought of the even number issue, not having looked at the
> function; I had assumed that it was a hash, not a modulus; shame on me.
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question,
> which is: can I force Tez to create all bucket files so Presto queries can
> succeed? Anyway, I will be testing today and the solution will either be to
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support). Presto could fix their fail-safe bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile). The file count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.

I wrote a few UDFs which were used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract, where uncovered buckets are assumed to have 0 rows when no file is present, rather than erroring out. If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal
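The bucketing contract described above - bucket id comes from the file name, not the file count - can be sketched as a small parser. This assumes the conventional Hive bucket file names of the form `000003_0` (bucket 3, task attempt 0), optionally with a copy suffix for duplicate files; the authoritative logic lives in Hive's Utilities::getBucketIdFromFile, and this regex is only an approximation of it:

```python
import re

# Assumed naming convention: zero-padded bucket id, "_", attempt id,
# optionally "_copy_N" for extra files with the same bucket prefix.
BUCKET_FILE = re.compile(r"^(\d+)_\d+(?:_copy_\d+)?$")

def bucket_id_from_file(name):
    """Return the bucket id encoded in a Hive bucket file name, or -1."""
    m = BUCKET_FILE.match(name)
    return int(m.group(1)) if m else -1
```

Under this contract a reader maps each existing file to its bucket and treats missing buckets as empty, instead of failing when the file count does not equal the declared bucket count.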
Re: Hive, Tez, clustering, buckets, and Presto
I'm really confused and could use help understanding. The Hive documentation here https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables says:

"Bucketed tables are fantastic in that they allow much more efficient sampling than do non-bucketed tables, and they may later allow for time saving operations such as mapside joins. However, the bucketing specified at table creation is not enforced when the table is written to, and so it is possible for the table's metadata to advertise properties which are not upheld by the table's actual layout. This should obviously be avoided. Here's how to do it right."

"This should obviously be avoided." Yet with the Tez engine it seems to be unavoidable when the data does not allow for all the buckets to be populated.

- Original Message -
From: "Gopal Vijayaraghavan"
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

There's more here than bucketing or Tez.

> PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

> If I perform the inserts with the mr engine, which is now deprecated, when
> the partition is created there are always 64 bucket files, even if some are 0
> length. In this case I can query with Presto just fine.

The files should be at least 3 bytes long, containing the 3 letters 'O','R','C'. If you have a bucketing impl which is producing empty files when hashing 20k values % 64 produces 0 rows for some buckets, the skew is unnatural.

> However, when the Tez engine is used, only buckets that contain data are
> created, and Presto fails since the table properties in the metastore (64
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3-byte files, which actually cost real money to store (because storage is usually allocated at the 4k block level, this uses 1000x more storage). You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _copy suffix, but the bucket prefix is consistent). Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero-sized buckets you might want to run your hash values through this:

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is
> always set to "true", but I can't seem to find a description of its intent.
> Anyway, if it's supposed to force the creation of buckets it's not working
> with the Tez engine.

Previously you could set "enforce.bucketing=false" so that inserts wouldn't use the fixed number of buckets & could play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just know that it is possible to use, though not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within
> the ORC files.

SORTED BY in the table DDL should do the trick - I like to use a multi-dim sort in some of these scenarios.

CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS

If you want to know why I like primes, consider what happens when % 64 is applied to even numbers:

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0-sized files. FYI, any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal
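Gopal's two `xrange` one-liners above can be run as-is in Python 3 (with `range` instead of `xrange`) to verify the bucket-count claim. Note that the identity hash on integers is exactly what the text says Hive/Java use, so the modulus acts directly on the key values:

```python
# Even-valued integer keys, hashed with the identity function (as the
# Java Integer hash does), then bucketed by modulus.
even_keys = [x * 2 for x in range(1000)]

buckets_64 = {k % 64 for k in even_keys}   # power-of-two bucket count
buckets_61 = {k % 61 for k in even_keys}   # prime bucket count

print(len(buckets_64))   # 32: half of the 64 buckets stay empty
print(len(buckets_61))   # 61: every bucket receives data
```

The power-of-two modulus preserves the keys' even/odd structure (an even key mod 64 is always even), while a prime modulus coprime to the stride cycles through every residue, which is why the prime bucket count avoids zero-sized files.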
Re: Hive, Tez, clustering, buckets, and Presto
Gopal,

Thanks for taking the time to try and help. A few things in relation to your response:

* Yes, the 'epoch' column is an hourly timestamp. Clustering by a column with high cardinality would make little sense.

* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes. It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand). So the question is: regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index'='true' for this to have much of an effect)

* Thanks for the hint about using primes for specifying the buckets. Given that r_crs_id spans a wide range I think that it would have worked OK, but any optimizations are appreciated, and I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me. Reading the docs I see that hash is only used on string columns.

Finally, I'm not sure that I got a specific answer to my original question, which is: can I force Tez to create all bucket files so Presto queries can succeed? Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

Thanks.
Rick

- Original Message -
From: "Gopal Vijayaraghavan"
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

There's more here than bucketing or Tez.

> PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

> If I perform the inserts with the mr engine, which is now deprecated, when
> the partition is created there are always 64 bucket files, even if some are 0
> length. In this case I can query with Presto just fine.

The files should be at least 3 bytes long, containing the 3 letters 'O','R','C'. If you have a bucketing impl which is producing empty files when hashing 20k values % 64 produces 0 rows for some buckets, the skew is unnatural.

> However, when the Tez engine is used, only buckets that contain data are
> created, and Presto fails since the table properties in the metastore (64
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3-byte files, which actually cost real money to store (because storage is usually allocated at the 4k block level, this uses 1000x more storage). You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _copy suffix, but the bucket prefix is consistent). Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero-sized buckets you might want to run your hash values through this:

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is
> always set to "true", but I can't seem to find a description of its intent.
> Anyway, if it's supposed to force the creation of buckets it's not working
> with the Tez engine.

Previously you could set "enforce.bucketing=false" so that inserts wouldn't use the fixed number of buckets & could play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just know that it is possible to use, though not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within
> the ORC files.

SORTED BY in the table DDL should do the trick - I like to use a multi-dim sort in some of these scenarios.

CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS

If you want to know why I like primes, consider what happens when % 64 is applied to even numbers:

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0-sized files. FYI, any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal
Hive, Tez, clustering, buckets, and Presto
Hello there,

I've done a ton of reading and testing between Hive, Presto, Tez, etc. In our test environment I have 8 tables being loaded with Hive that work fine under certain circumstances. Most of our data is taken from JSON files in S3: we create a temporary external table and then perform an INSERT OVERWRITE to transform the data to ORC format in Hive managed tables. A simple example of the CREATE statement for the internal table looks like this (I've left out a number of irrelevant columns):

CREATE TABLE IF NOT EXISTS measurements (
  id STRING,
  session STRING,
  tt_ts INT,
  r_crs_id INT,
  user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
tblproperties ("orc.compress"="ZLIB");

If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length. In this case I can query with Presto just fine. However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count. I've seen some other postings in StackExchange and via Google where others have run into this. We really need to use Presto to query Hive, and we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values. So a few questions (thanks for any help):

The easiest fix for us would be a way to force Tez to create all buckets like mr did, so the behavior didn't change. Is that possible? "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent. Anyway, if it's supposed to force the creation of buckets, it's not working with the Tez engine.

Alternately, I was thinking that maybe someone could help me with the ORC file configuration so that r_crs_id was at least row optimized and sorted within the ORC files. I've been reading up on ORC and the Hive statements related to ORC files but I seem to have hit my limit of understanding. It appears that I should at least have the rows sorted by r_crs_id when written, but I'm having problems figuring out how to apply that with the INSERT statement. It appears that the ORC table should be created with 'orc.create.index'='true' for a start. Currently our INSERT looks like this (we never UPDATE or DELETE):

FROM raw_measurements_external
INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3})
SELECT *;

Something like this?

FROM raw_measurements_external SORT BY r_crs_id
INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3})
SELECT *;

Thanks for any help from more experienced Hive developers (most of you).