Unsubscribe
Unsubscribe

----- Original Message -----
From: "Kevin Risden"
To: user@hive.apache.org
Sent: Tuesday, October 16, 2018 9:44:17 AM
Subject: Re: HiveServer2 performance references?

Thanks for the responses. I'll wait to test with HIVE-20621 to see how much that helps. For HIVE-17194, it doesn't look like that is documented outside of JIRA currently. It looks like I may have picked up over-the-wire compression by default. I had assumed it wasn't compressed.

Kevin Risden

On Tue, Oct 16, 2018 at 12:44 AM Gopal Vijayaraghavan <gop...@apache.org> wrote:

Hi,

> I was looking at HiveServer2 performance going through Knox in KNOX-1524 and
> found that HTTP mode is significantly slower.

The HTTP mode did a re-auth for every row before HIVE-20621 was fixed - Knox should be doing cookie auth to prevent ActiveDirectory/LDAP from throttling this. I fixed it for the open-source JDBC drivers, but I think the proprietary ODBC drivers are still broken (the C++ code doesn't save cookies).

> I also didn't see any information about reducing amount of data over the wire
> with compression.

It's possible Knox doesn't generate/forward the Accept-Encoding headers.

https://issues.apache.org/jira/browse/HIVE-17194

Cheers,
Gopal

From: Prasanth Jayachandran <pjayachand...@hortonworks.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Monday, October 15, 2018 at 12:34 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: HiveServer2 performance references?

One performance fix that went in recently and is related is HIVE-20621.

Thanks
Prasanth

On Mon, Oct 15, 2018 at 12:11 PM -0700, "Kevin Risden" <kris...@apache.org> wrote:

Are there any resources on HiveServer2 performance - specifically binary vs HTTP mode? I would also be interested in any information about enabling compression between HiveServer2 and a client like beeline.
I was looking at HiveServer2 performance going through Knox in KNOX-1524 and found that HTTP mode is significantly slower. I searched and didn't find any references to the out-of-the-box performance of HiveServer2. I also didn't see any information about reducing the amount of data over the wire with compression. Any pointers would be very helpful. Thanks!

Kevin Risden
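For readers who want to reproduce the binary-vs-HTTP comparison from this thread, an HTTP-mode Beeline connection can be sketched as the configuration fragment below. This is a hedged illustration: the host name, port 10001, and httpPath=cliservice are common defaults assumed here, not values taken from the thread.

```shell
# Hedged sketch of connecting Beeline to HiveServer2 in HTTP transport mode.
# hs2-host, port 10001, and httpPath=cliservice are deployment-specific
# assumptions; binary mode would omit transportMode/httpPath.
beeline -u "jdbc:hive2://hs2-host:10001/default;transportMode=http;httpPath=cliservice"
```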
Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3
To be honest, my advice was just based on your post - we only use Hive in AWS EMR, so I couldn't tell you. Glad that at least you're back up though.

----- Original Message -----
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:50:23 AM
Subject: Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

Hi Richard,

After looking at the Hive 2.3.3 logs, I found that when all configuration parameters are loaded, the below message is printed:

Values omitted for security reason if present: [fs.s3n.awsAccessKeyId, fs.s3a.access.key, fs.s3.awsAccessKeyId, hive.server2.keystore.password, fs.s3a.proxy.password, javax.jdo.option.ConnectionPassword, fs.s3.awsSecretAccessKey, fs.s3n.awsSecretAccessKey, fs.s3a.secret.key]

while in the Hive 2.1.1 logs I found the below message:

Values omitted for security reason if present: [hive.server2.keystore.password, javax.jdo.option.ConnectionPassword]

Can this be the reason why Hive 2.3.3 is not able to read the s3-related params from hive-site.xml? I found one JIRA: https://issues.apache.org/jira/browse/HIVE-14588

Also, I set the below property in hive-site.xml, excluding the above s3 variables (fs.s3a.access.key, fs.s3a.secret.key) from the list, and it worked:

hive.conf.hidden.list = javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password
(description: comma-separated list of configuration options which should not be readable by a normal user, like passwords; the two excluded variables were fs.s3a.access.key and fs.s3a.secret.key)

Let me know if there is any other solution, because if these variables are part of hidden.list by default, then there should be some other proper workaround for this.

Regards,
Anup Tiwari

On Thu, Apr 12, 2018 at 7:44 PM, Richard A.
Bross <r...@oaktreepeak.com> wrote:

I hear you, but given the exception log, it does seem that it can't authenticate you. You can try using the AWS environment variables. If that resolves the issue, then you'll have some more to go on. According to Hortonworks here: https://hortonworks.github.io/hdp-aws/s3-security/

"AWS CLI supports authentication through environment variables. These same environment variables will be used by Hadoop if no configuration properties are set."

----- Original Message -----
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:06:33 AM
Subject: Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

We are not using EMR. Also, we have set the below params for accessing the s3 bucket in hive-site.xml, which are the same as what we had set in Hive 2.1.1:

* fs.s3a.access.key
* fs.s3a.secret.key
* fs.s3a.connection.maximum
* fs.s3a.impl

Regards,
Anup Tiwari

On Thu, Apr 12, 2018 at 7:19 PM, Richard A. Bross <r...@oaktreepeak.com> wrote:

Based on the exception, it looks more like an AWS credentials issue than a Hive issue. Are you running in AWS EMR, or on-prem? In AWS, the resource accessing the S3 bucket would have to have an IAM role that gave permission. If you are running somewhere else, whatever AWS login you use would have to have the correct permissions in its IAM policy.

----- Original Message -----
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 9:11:37 AM
Subject: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

Hi All,

When I am trying to read an s3-linked external table in 2.3.3, I am getting errors. It was working properly in 2.1.1.
Please find below the details and let me know if I am missing something:

Hadoop Version: 2.8.0

Query: select log_date,count(1) as cnt from test.tt1 group by log_date;

Error:

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, vertex=vertex_1523502631429_0029_3_00 [Map 1], org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not foun
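The workaround Anup describes - overriding hive.conf.hidden.list in hive-site.xml so that fs.s3a.access.key and fs.s3a.secret.key are no longer hidden - can be written as the configuration fragment below. This is a sketch based on the value quoted in the thread, not an official recommendation; note that it makes those credentials readable in the configuration.

```xml
<!-- hive-site.xml: override the default hidden list, leaving out
     fs.s3a.access.key and fs.s3a.secret.key as described in the thread. -->
<property>
  <name>hive.conf.hidden.list</name>
  <value>javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.proxy.password</value>
  <description>Comma separated list of configuration options which should not be read by normal user like passwords</description>
</property>
```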
Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3
I hear you, but given the exception log, it does seem that it can't authenticate you. You can try using the AWS environment variables. If that resolves the issue, then you'll have some more to go on. According to Hortonworks here: https://hortonworks.github.io/hdp-aws/s3-security/

"AWS CLI supports authentication through environment variables. These same environment variables will be used by Hadoop if no configuration properties are set."

----- Original Message -----
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 10:06:33 AM
Subject: Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

We are not using EMR. Also, we have set the below params for accessing the s3 bucket in hive-site.xml, which are the same as what we had set in Hive 2.1.1:

* fs.s3a.access.key
* fs.s3a.secret.key
* fs.s3a.connection.maximum
* fs.s3a.impl

Regards,
Anup Tiwari

On Thu, Apr 12, 2018 at 7:19 PM, Richard A. Bross <r...@oaktreepeak.com> wrote:

Based on the exception, it looks more like an AWS credentials issue than a Hive issue. Are you running in AWS EMR, or on-prem? In AWS, the resource accessing the S3 bucket would have to have an IAM role that gave permission. If you are running somewhere else, whatever AWS login you use would have to have the correct permissions in its IAM policy.

----- Original Message -----
From: "Anup Tiwari" <anupsdtiw...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 9:11:37 AM
Subject: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

Hi All,

When I am trying to read an s3-linked external table in 2.3.3, I am getting errors. It was working properly in 2.1.1.
Please find below the details and let me know if I am missing something:

Hadoop Version: 2.8.0

Query: select log_date,count(1) as cnt from test.tt1 group by log_date;

Error:

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, vertex=vertex_1523502631429_0029_3_00 [Map 1], org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/
 at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
 at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
 at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
 at java.ut
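Richard's suggestion to test with the AWS environment variables can be sketched as below. The key values are placeholders only; EnvironmentVariableCredentialsProvider, visible in the stack trace above, is the provider that picks these up.

```shell
# Placeholders only - substitute real credentials. The AWS SDK's
# EnvironmentVariableCredentialsProvider reads these two variables
# when no fs.s3a.* configuration properties are set.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="exampleSecretKey"
# Processes launched from this shell (e.g. hive, beeline) inherit them.
env | grep '^AWS_ACCESS_KEY_ID='
```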
Re: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3
Based on the exception, it looks more like an AWS credentials issue than a Hive issue. Are you running in AWS EMR, or on-prem? In AWS, the resource accessing the S3 bucket would have to have an IAM role that gave permission. If you are running somewhere else, whatever AWS login you use would have to have the correct permissions in its IAM policy.

----- Original Message -----
From: "Anup Tiwari"
To: user@hive.apache.org
Sent: Thursday, April 12, 2018 9:11:37 AM
Subject: Unable to read hive external table data which is linked to s3 after upgradation from 2.1.1 to 2.3.3

Hi All,

When I am trying to read an s3-linked external table in 2.3.3, I am getting errors. It was working properly in 2.1.1. Please find below the details and let me know if I am missing something:

Hadoop Version: 2.8.0

Query: select log_date,count(1) as cnt from test.tt1 group by log_date;

Error:

Vertex failed, vertexName=Map 1, vertexId=vertex_1523502631429_0029_3_00, diagnostics=[Vertex vertex_1523502631429_0029_3_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tt1 initializer failed, vertex=vertex_1523502631429_0029_3_00 [Map 1], org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on g24x7.new-analytics: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/ : No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/
 at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:128)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:288)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2811)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
 at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265)
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
 at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
 at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: The requested metadata is not found at http://16x.xxx.xxx.xx4/latest/meta-data/iam/security-credentials/
 at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:151)
 at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
 at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
 at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
 ... 24 more
Caused by: com.amazonaws.AmazonClientException: The requested metadata is not found at
Re: ALTER TABLE DROP PARTITION not working on S3
I can't duplicate the issue now. Works like it always has.

----- Original Message -----
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:57:39 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3

Indeed. If I remember correctly, s3 does not really have the concept of a "folder" like HDFS has, and Hive sort of makes up for it by creating a descriptor file where the partition "folder" is supposed to be. Maybe this is what is missing here. Perhaps you could try doing a "MSCK REPAIR TABLE tablename" to make sure that the partitions are correctly loaded, and then try dropping that particular partition again? Or look at your s3 folder to see whether any such "partition folder file" exists, and check if it is missing for this particular partition.

On 5 April 2018 at 15:40, Richard A. Bross <r...@oaktreepeak.com> wrote:

Leaving the column list out, here you go:

# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Thu Apr 05 13:24:33 UTC 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: s3://zoomi-proto-warehouse-measurements/
Table Type: MANAGED_TABLE
Table Parameters:
 COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
 numFiles 10
 numPartitions 7
 numRows 153
 orc.compress ZLIB
 orc.create.index true
 rawDataSize 113563
 totalSize 37801
 transient_lastDdlTime 1522934673

# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 61
Bucket Columns: [crs_id]
Sort Columns: []
Storage Desc Params:
 serialization.format 1
Time taken: 0.467 seconds, Fetched: 98 row(s)

----- Original Message -----
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3

Hi Richard,

could you please check if your table is EXTERNAL?
You can see it with a "DESCRIBE FORMATTED table_name;"

That's what external tables are for: they don't delete the underlying data when you drop them.

On 5 April 2018 at 15:18, Richard A. Bross <r...@oaktreepeak.com> wrote:

I think that someone put a file in there manually. Would that prevent Hive from dropping the partition? I also did a "drop table" and the s3 object keys persisted.

----- Original Message -----
From: "Richard A. Bross" <r...@oaktreepeak.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:14:52 AM
Subject: ALTER TABLE DROP PARTITION not working on S3

Hi,

I have a Hive managed table on S3, "api_measurements". I've tried dropping a partition like so:

hive> alter table api_measurements drop if exists partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3. Because object keys on S3 are always strings, I also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there. I assume that I'm missing something really simple. Can anyone shed some light on this?

Thanks
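Putting the suggestions from this thread together, the diagnostic steps look roughly like this in HiveQL. This is a sketch using the table name from the thread; it assumes a session with access to the same metastore.

```sql
-- Check whether the table is managed or external:
-- look for "Table Type: MANAGED_TABLE" vs "EXTERNAL_TABLE" in the output.
DESCRIBE FORMATTED api_measurements;

-- Re-sync partition metadata with what actually exists on S3.
MSCK REPAIR TABLE api_measurements;

-- Then retry the drop.
ALTER TABLE api_measurements DROP IF EXISTS PARTITION (daydate='2018-04-04', epoch=1522876500);
```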
Re: ALTER TABLE DROP PARTITION not working on S3
Leaving the column list out, here you go:

# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Thu Apr 05 13:24:33 UTC 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: s3://zoomi-proto-warehouse-measurements/
Table Type: MANAGED_TABLE
Table Parameters:
 COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
 numFiles 10
 numPartitions 7
 numRows 153
 orc.compress ZLIB
 orc.create.index true
 rawDataSize 113563
 totalSize 37801
 transient_lastDdlTime 1522934673

# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 61
Bucket Columns: [crs_id]
Sort Columns: []
Storage Desc Params:
 serialization.format 1
Time taken: 0.467 seconds, Fetched: 98 row(s)

----- Original Message -----
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3

Hi Richard,

could you please check if your table is EXTERNAL? You can see it with a "DESCRIBE FORMATTED table_name;"

That's what external tables are for: they don't delete the underlying data when you drop them.

On 5 April 2018 at 15:18, Richard A. Bross <r...@oaktreepeak.com> wrote:

I think that someone put a file in there manually. Would that prevent Hive from dropping the partition? I also did a "drop table" and the s3 object keys persisted.

----- Original Message -----
From: "Richard A. Bross" <r...@oaktreepeak.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:14:52 AM
Subject: ALTER TABLE DROP PARTITION not working on S3

Hi,

I have a Hive managed table on S3, "api_measurements". I've tried dropping a partition like so:

hive> alter table api_measurements drop if exists partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3. Because object keys on S3 are always strings, I also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there. I assume that I'm missing something really simple. Can anyone shed some light on this?

Thanks
Re: ALTER TABLE DROP PARTITION not working on S3
No, definitely internal. It's loaded from an external table.

----- Original Message -----
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:21:06 AM
Subject: Re: ALTER TABLE DROP PARTITION not working on S3

Hi Richard,

could you please check if your table is EXTERNAL? You can see it with a "DESCRIBE FORMATTED table_name;"

That's what external tables are for: they don't delete the underlying data when you drop them.

On 5 April 2018 at 15:18, Richard A. Bross <r...@oaktreepeak.com> wrote:

I think that someone put a file in there manually. Would that prevent Hive from dropping the partition? I also did a "drop table" and the s3 object keys persisted.

----- Original Message -----
From: "Richard A. Bross" <r...@oaktreepeak.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:14:52 AM
Subject: ALTER TABLE DROP PARTITION not working on S3

Hi,

I have a Hive managed table on S3, "api_measurements". I've tried dropping a partition like so:

hive> alter table api_measurements drop if exists partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3. Because object keys on S3 are always strings, I also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there. I assume that I'm missing something really simple. Can anyone shed some light on this?

Thanks
Re: ALTER TABLE DROP PARTITION not working on S3
I think that someone put a file in there manually. Would that prevent Hive from dropping the partition? I also did a "drop table" and the s3 object keys persisted.

----- Original Message -----
From: "Richard A. Bross" <r...@oaktreepeak.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2018 9:14:52 AM
Subject: ALTER TABLE DROP PARTITION not working on S3

Hi,

I have a Hive managed table on S3, "api_measurements". I've tried dropping a partition like so:

hive> alter table api_measurements drop if exists partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3. Because object keys on S3 are always strings, I also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there. I assume that I'm missing something really simple. Can anyone shed some light on this?

Thanks
ALTER TABLE DROP PARTITION not working on S3
Hi,

I have a Hive managed table on S3, "api_measurements". I've tried dropping a partition like so:

hive> alter table api_measurements drop if exists partition(daydate='2018-04-04', epoch=1522876500);
Dropped the partition daydate=2018-04-04/epoch=1522876500
OK
Time taken: 2.109 seconds

Yet the data is still on S3. Because object keys on S3 are always strings, I also tried this:

hive> alter table api_measurements drop partition(daydate='2018-04-04', epoch='1522876500');
OK
Time taken: 0.135 seconds

Yet the object keys and data are still there. I assume that I'm missing something really simple. Can anyone shed some light on this?

Thanks
Re: Hive, Tez, clustering, buckets, and Presto
This is great information, Gopal, thank you. I wish I had the time to create a comparison for our use case between bucketed ORC files and non-bucketed ORC files on S3. Unfortunately it's a chicken-and-egg issue, since I won't have enough data volume until we are in production, which should be shortly. Thanks again.

----- Original Message -----
From: "Gopal Vijayaraghavan" <gop...@apache.org>
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so they're asking "where is the Hive bucketing spec". Is it just to read the
> code for that function?

This worked the other way around in time than writing a spec first - ACIDv1 implemented streaming ingest via Storm, and it used an explicit "bucket_" naming for the filename. Since the actual base files don't exist until the compaction runs, the ACID bucketing implementation has to handle missing buckets as 0 rows in the base file + possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work similarly, for the ability to do bucket map-joins between ACID and non-ACID bucketed tables - particularly regarding the modulus for negative numbers, which was broken in Hive 1.0.

https://issues.apache.org/jira/browse/HIVE-12025

That's the place where this all got refactored so that joins and filters for bucketed tables work the same way for ACID and non-ACID tables. Because of that, the spec lives in the comments now as a regex:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.
create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered, Presto should be expecting more than one file per bucket. I saw a JIRA comment which said "sort in file order and assign buckets"; you can see that is only applicable for the 1st insert into the table (and the regex will remove the copy numbering).

And oddly enough, this week I saw an academic paper with a negative analysis of Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <r...@oaktreepeak.com> wrote:

Gopal,

The Presto devs say they are willing to make the changes to adhere to the Hive bucketing spec. I quoted:

"Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec". Is it just to read the code for that function? They were looking for something more explicit, I think.
Thanks

----- Original Message -----
From: "Gopal Vijayaraghavan" <gop...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. My
> understanding was that this was related to the number of buckets, but you are
> relating it to ORC stripes. It is odd that no examples that I've seen include
> the SORTED BY statement other than in relation to ORC indexes (that I
> understand). So the question is; regardless of whether efficient ORC stripes
> are created (wouldn't I have to also specify 'orc.create.index'='true' for
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e. the same device has natural progressions for metrics), w
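The "file name -> bucket id" contract and the copy-number regex that Gopal describes can be illustrated with the small sketch below. This mimics the behavior of Utilities::getBucketIdFromFile as discussed in the thread; it is not Hive's actual code, and the exact regex is an assumption for illustration.

```python
import re

# Hedged sketch: bucket files are named like 000000_0, and re-inserts create
# copies like 000000_0_copy_1 that still belong to the same bucket.
BUCKET_FILE = re.compile(r"^(\d+)_\d+(_copy_\d+)?$")

def bucket_id_from_file(name: str) -> int:
    """Return the bucket number encoded in a bucket file name, or -1."""
    m = BUCKET_FILE.match(name)
    return int(m.group(1)) if m else -1

print(bucket_id_from_file("000000_0"))         # 0
print(bucket_id_from_file("000003_0_copy_1"))  # 3 - copy maps to same bucket
print(bucket_id_from_file("part-00000"))       # -1 - not a bucket file
```

This is why, as noted above, a consumer like Presto has to expect more than one file per bucket: the `_copy_N` suffix is stripped when mapping files to buckets.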
Re: Building Datwarehouse Application in Spark
Mahender,

To really address your question I think that you'd have to supply a bit more information, such as the kind of data that you want to save: RDBMS-type lookups, key/value/index-type lookups, insert velocity, etc. These wide choices of technologies are suited to different use cases, although they overlap in some areas.

In a previous position that I held, we used Spark on Cassandra to solve a similar problem. The Datastax distribution puts Spark worker nodes directly on Cassandra nodes. Because Cassandra partitions the data across nodes based on a row key, it's a nice match. If the key is chosen properly, the Spark nodes are typically accessing local data on the Cassandra nodes, meaning that there are typically very few shuffles for direct queries, and also that inserts go directly to the proper Cassandra nodes. We had time series data based on unique row keys, with the timestamps as column keys. In that case our queries were done directly with the Cassandra clients for the most part, with SparkQL primarily used for ad-hoc queries.

At my current position, we directly load raw data into Hive (using HiveQL) and then use Presto for queries. That's our OLAP data store. You can use any number of other tools to query Hive-created data stores as well. Then we have another pipeline that takes the same raw data, uses Spark for the ETL, and inserts the results into Aurora (MySQL). The schema is designed for specific queries, so the Spark ETL transforms the data to optimize for that schema and allow efficient updates to those tables. That's our OLTP data store, and we use standard SQL for queries.

Rick

----- Original Message -----
From: "Furcy Pin"
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 6:58:58 AM
Subject: Re: Building Datwarehouse Application in Spark

Hi Mahender,

Did you look at this?
https://www.snappydata.io/blog/the-spark-database But I believe that most people handle this use case by either using: - Their favorite regular RDBMS (MySQL, Postgres, Oracle, SQL Server, ...) if the data is not too big - Their favorite NewSQL storage (Cassandra, HBase) if the data is too big and needs to be distributed Spark generally makes it easy enough to query these other databases to allow you to perform analytics. Hive and Spark have been designed as OLAP tools, not OLTP. I'm not sure what features you are seeking for your SCD but they probably won't be part of Spark's core design. Hope this helps, Furcy On 4 April 2018 at 11:29, Mahender Sarangam < mahender.bigd...@outlook.com > wrote: Hi, Does anyone have a good architecture document/design principles for building a warehouse application using Spark? Is it better to have a Hive context created with HQL and perform transformations, or to directly load files into a dataframe and perform the data transformations there? We need to implement SCD Type 2 in Spark. Is there any good document/reference for building a Type 2 warehouse object? Thanks in advance /Mahender
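For the SCD Type 2 part of the question, the core bookkeeping is small enough to sketch in plain Python. This is only an illustration of the logic, not a Spark job — in Spark it would be a join/union over DataFrames, and every name below (`scd2_apply`, the column names) is made up for the example:

```python
from datetime import date

def scd2_apply(dim_rows, incoming, today):
    """Close the current row for each changed key and append a new
    current row (SCD Type 2). Plain-Python sketch, not Spark code."""
    changed = {r["key"] for r in incoming}
    out = []
    for row in dim_rows:
        if row["is_current"] and row["key"] in changed:
            # Expire the old version instead of updating it in place.
            out.append(dict(row, is_current=False, end_date=today))
        else:
            out.append(row)
    for r in incoming:
        out.append({"key": r["key"], "value": r["value"],
                    "start_date": today, "end_date": None, "is_current": True})
    return out

dim = [{"key": 1, "value": "a", "start_date": date(2018, 1, 1),
        "end_date": None, "is_current": True}]
result = scd2_apply(dim, [{"key": 1, "value": "b"}], date(2018, 4, 4))
```

The same close-and-append pattern maps onto a Spark pipeline as a join between the incoming batch and the current dimension rows, followed by a union of expired and new rows.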
Re: Hive, Tez, clustering, buckets, and Presto
Gopal, The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec. I quoted "Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)." so they're asking "where is the Hive bucketing spec". Is it just to read the code for that function? They were looking for something more explicit, I think. Thanks - Original Message - From: "Gopal Vijayaraghavan" To: user@hive.apache.org Sent: Tuesday, April 3, 2018 3:15:46 AM Subject: Re: Hive, Tez, clustering, buckets, and Presto >* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. > My understanding was that this was related to the number of buckets, but you > are relating it to ORC stripes. It is odd that no examples that I've seen > include the SORTED BY statement other than in relation to ORC indexes (that I > understand). So the question is; regardless of whether efficient ORC stripes > are created (wouldn't I have to also specify 'orc.create.index'='true' for > this to have much of an effect) ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e. same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off). Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself. An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.
This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle. Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers. https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5 was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem. Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers. >I hadn't thought of the even number issue, not having looked at the > function; I had assumed that it was a hash, not a modulus; shame on me. > Reading the docs I see that hash is only used on string columns Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers. scala> 42.hashCode res1: Int = 42 scala> 42L.hashCode res2: Int = 42 > Finally, I'm not sure that I got a specific answer to my original question, > which is can I force Tez to create all bucket files so Presto queries can > succeed? Anyway, I will be testing today and the solution will either be to > forgo buckets completely or to simply rely on ORC indexes. There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support). Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile). The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks. 
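The RLE point above can be demonstrated outside of ORC with a toy run counter (my own sketch, not Hive code): sorting a low-cardinality key collapses the column into one run per distinct value, which is what lets a run-length encoder, and the ORC vector reader built on it, do far fewer comparisons.

```python
import random

def run_count(values):
    """Count runs of equal adjacent values -- a rough proxy for how well
    run-length encoding would compress the column."""
    if not values:
        return 0
    return 1 + sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)

random.seed(0)
col = [random.randrange(20) for _ in range(10000)]  # low-cardinality sort key
unsorted_runs = run_count(col)          # thousands of short runs
sorted_runs = run_count(sorted(col))    # at most one run per distinct value
```

With 20 distinct values, the sorted column has at most 20 runs regardless of row count, while the shuffled column has a run break at nearly every row.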
I wrote a few UDFs which were used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945. https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27 LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD). Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236 Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281 That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead. If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248 Cheers, Gopal
Re: Hive, Tez, clustering, buckets, and Presto
Gopal, Thanks for this. Great information and something to look at more closely to better understand the internals. Rick - Original Message - From: "Gopal Vijayaraghavan"To: user@hive.apache.org Sent: Tuesday, April 3, 2018 3:15:46 AM Subject: Re: Hive, Tez, clustering, buckets, and Presto >* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. > My understanding was that this was related to the number of buckets, but you > are relating it to ORC stripes. It is odd that no examples that I've seen > include the SORTED BY statement other than in relation to ORC indexes (that I > understand). So the question is; regardless of whether efficient ORC stripes > are created (wouldn't I have to also specify 'orc.create.index’=’true’ for > this to have much of an effect) ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off). Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself. An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC. This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle. Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers. 
https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5 was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem. Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers. >I hadn't thought of the even number issue, not having looked at the > function; I had assumed that it was a hash, not a modulus; shame on me. > Reading the docs I see that hash is only used on string columns Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers. scala> 42.hashCode res1: Int = 42 scala> 42L.hashCode res2: Int = 42 > Finally, I'm not sure that I got a specific answer to my original question, > which is can I force Tez to create all bucket files so Presto queries can > succeed? Anyway, I will be testing today and the solution will either be to > forgo buckets completely or to simply rely on ORC indexes. There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support). Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile). The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks. I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945. 
https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27 LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD). Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236 Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281 That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead. If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248 Cheers, Gopal
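As a rough illustration of the contract described above — bucket id parsed from the file-name prefix, rows placed by hash % buckets — here is my own sketch of that convention in Python. It is not the actual Utilities::getBucketIdFromFile code, and the `_copy_N` suffix pattern is my reading of the "_Copy suffix" mentioned earlier in the thread:

```python
import re

# Bucket files are named like 000000_0, optionally with a _copy_N suffix
# (sketch of the convention; Utilities::getBucketIdFromFile is authoritative).
BUCKET_FILE = re.compile(r"^(\d+)_\d+(_copy_\d+)?$")

def bucket_id_from_file(name):
    """Bucket id encoded in the file name, or -1 if it isn't a bucket file."""
    m = BUCKET_FILE.match(name)
    return int(m.group(1)) if m else -1

def bucket_for_key(int_key, num_buckets):
    # Java's hashCode is the identity for integers, so for an int
    # bucketing column the bucket is simply key % num_buckets.
    return int_key % num_buckets
```

Under this contract a missing file for bucket N just means bucket N has zero rows — which is why counting files is a flaky way to validate a bucketed table.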
Re: Hive, Tez, clustering, buckets, and Presto
I'm really confused and could use help understanding. The Hive documentation here https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables says: "Bucketed tables are fantastic in that they allow much more efficient sampling than do non-bucketed tables, and they may later allow for time saving operations such as mapside joins. However, the bucketing specified at table creation is not enforced when the table is written to, and so it is possible for the table's metadata to advertise properties which are not upheld by the table's actual layout. This should obviously be avoided. Here's how to do it right." "This should obviously be avoided". Yet with the Tez engine it seems to be unavoidable when the data does not allow for all the buckets to be populated. - Original Message - From: "Gopal Vijayaraghavan"To: user@hive.apache.org Sent: Monday, April 2, 2018 2:16:46 AM Subject: Re: Hive, Tez, clustering, buckets, and Presto There's more here than Bucketing or Tez. > PARTITIONED BY(daydate STRING, epoch BIGINT) > CLUSTERED BY(r_crs_id) INTO 64 BUCKETS I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here. > If I perform the inserts with the mr engine, which is now deprecated, when > the partition is created there are always 64 bucket files, even if some are 0 > length. In this case I can query with Presto just fine. The files should be at least 3 bytes long containing the 3 letters 'O','R','C'. If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural. > However, when the Tez engine is used, only buckets that contain data are > created, and Presto fails since the table properties in the metastore (64 > buckets) are a mismatch with the actual bucket count. 
That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage). You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent). Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat). > we'd really like to use buckets, since the the r_crs_id in production can > contain 20,000 values. With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this. https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6 > "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is > always set to "true", but I can't seem to find a description of its intent. > Anyway, if it's supposed to force the creation of buckets it's not working > with the Tez engine. Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. That's not the config you're looking for. There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily). > configuration so that r_crs_id was at least row optimized and sorted within > the ORC files. SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios. CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS If you want to know why I like primes, when % 64 is done on even numbers. 
len(set([(x*2) % 64 for x in xrange(1000)])) fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data. len(set([(x*2) % 61 for x in xrange(1000)])) fills all 61 buckets - producing better bucketing & no 0 sized files. FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074 Cheers, Gopal
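The `xrange` snippets above are Python 2; the same check in Python 3, with the bucket count as a parameter:

```python
def buckets_filled(num_buckets, keys):
    """Distinct buckets hit under the Hive contract hash % num_buckets
    (hash == key for integers, since Java's hashCode is the identity)."""
    return len({k % num_buckets for k in keys})

even_keys = [x * 2 for x in range(1000)]
filled_64 = buckets_filled(64, even_keys)  # only the even residues: 32 buckets
filled_61 = buckets_filled(61, even_keys)  # 61 is prime: all 61 buckets fill
```

Because every key is even, modulo 64 can only produce the 32 even residues, while a prime modulus like 61 is coprime with 2 and cycles through every bucket.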
Re: Hive, Tez, clustering, buckets, and Presto
Gopal, Thanks for taking the time to try and help. A few things in relation to your response: * Yes, the 'epoch' column is an hourly timestamp. Clustering by a column with high cardinality would make little sense. * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes. It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand). So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect) * Thanks for the hint about using primes for specifying the buckets. Given that r_crs_id is a wide range I think that it would have worked Ok, but any optimizations are appreciated and I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me. Reading the docs I see that hash is only used on string columns Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed? Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes. Thanks. Rick - Original Message - From: "Gopal Vijayaraghavan"To: user@hive.apache.org Sent: Monday, April 2, 2018 2:16:46 AM Subject: Re: Hive, Tez, clustering, buckets, and Presto There's more here than Bucketing or Tez. > PARTITIONED BY(daydate STRING, epoch BIGINT) > CLUSTERED BY(r_crs_id) INTO 64 BUCKETS I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here. 
> If I perform the inserts with the mr engine, which is now deprecated, when > the partition is created there are always 64 bucket files, even if some are 0 > length. In this case I can query with Presto just fine. The files should be at least 3 bytes long containing the 3 letters 'O','R','C'. If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural. > However, when the Tez engine is used, only buckets that contain data are > created, and Presto fails since the table properties in the metastore (64 > buckets) are a mismatch with the actual bucket count. That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage). You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent). Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat). > we'd really like to use buckets, since the the r_crs_id in production can > contain 20,000 values. With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this. https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6 > "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is > always set to "true", but I can't seem to find a description of its intent. > Anyway, if it's supposed to force the creation of buckets it's not working > with the Tez engine. Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. 
That's not the config you're looking for. There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily). > configuration so that r_crs_id was at least row optimized and sorted within > the ORC files. SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios. CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS If you want to know why I like primes, when % 64 is done on even numbers. len(set([(x*2) % 64 for x in xrange(1000)])) fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data. len(set([(x*2) % 61 for x in xrange(1000)])) fills all 61 buckets - producing better bucketing & no 0 sized files. FYI any prime number other than 31 works nicer than a 2^n -
Hive, Tez, clustering, buckets, and Presto
Hello there, I've done a ton of reading and testing between Hive, Presto, Tez, etc. In our test environment I have 8 tables being loaded with Hive that works fine under certain circumstances. Most of our data is taken from JSON files in S3; we create a temporary external table and then perform an INSERT OVERWRITE to transform the data to ORC format in Hive managed tables. A simple example of the CREATE statement for the internal table looks like this (I've left out a number of irrelevant columns): CREATE TABLE IF NOT EXISTS measurements ( id STRING, session STRING, tt_ts INT, r_crs_id INT, user_agent STRING ) PARTITIONED BY(daydate STRING, epoch BIGINT) CLUSTERED BY(r_crs_id) INTO 64 BUCKETS STORED AS ORC LOCATION 's3://warehouse/' tblproperties ("orc.compress"="ZLIB"); If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length. In this case I can query with Presto just fine. However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count. I've seen some other postings in StackExchange and via Google where others have run into this. We really need to use Presto to query Hive, and we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values. So a few questions (thanks for any help): The easiest fix for us would be a way to force Tez to create all buckets like mr so the behavior didn't change. Is that possible? "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent. Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.
Alternately, I was thinking that maybe someone could help me with the ORC file configuration so that r_crs_id was at least row optimized and sorted within the ORC files. I've been reading up on ORC and the Hive statements related to ORC files but I seem to have hit my limit of understanding. It appears that I should at the least have the rows sorted by r_crs_id when written, but I'm having problems figuring out how to apply that with the INSERT statement. It appears that the ORC table should be created with 'orc.create.index'='true' for a start. Currently our INSERT looks like this (we never UPDATE or DELETE): FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT *; Something like this? FROM raw_measurements_external SORT BY r_crs_id INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT *; Thanks for any help from more experienced Hive developers (most of you).
Re: RE: Re: For Apache Hive HS2 , what is the largest heap size setting that works well?
Use JDK8 and try G1 first; we saw lower GC pauses under G1, especially in our large-memory environment. On Wednesday, November 29, 2017, 8:14:03 AM PST, eric wong wrote: -- Forwarded message -- From: eric wong To: user@hive.apache.org Cc: Bcc: Date: Thu, 30 Nov 2017 00:07:47 +0800 Subject: Re: For Apache Hive HS2 , what is the largest heap size setting that works well? 1) A long GC pause should have a detailed reason, like InitialMark/Remark, or a Full GC due to Concurrent Mode failure/Promotion failure. Please check it. 2) CMS is nice and steady for our production. Please troubleshoot case by case. G1 is just difficult to perceive. 3) You should update to JDK8; for one thing, the concurrent InitialMark is disabled by default, or can only run sequentially, in JDK7. 2017-11-29 15:14 GMT+08:00 Jörn Franke : I also recommend it; you will also have performance improvements with JDK8 in general (use the latest version). Keep also in mind that more and more big data libraries etc. will drop JDK7 support soon (aside from that, JDK7 is not maintained anymore anyway). On 29. Nov 2017, at 01:31, Johannes Alberti wrote: Yes, I would recommend going to Java 8 and giving it a shot with G1, and report back :) Sent from my iPhone On Nov 28, 2017, at 3:30 PM, Sharanya Santhanam wrote: Hi Johannes, We are running on Java version jdk1.7.0_67. We are using ConcurrentMarkAndSweep. Would you recommend using G1GC?
These are our current settings: -XX:NewRatio=8 -XX:+UseParNewGC -XX:-UseGCOverheadLimit -XX:PermSize=256m -Xloggc:<> -XX:HeapDumpPath=oom -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:ErrorFile=/oom/hs2jvm error%p.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=128M -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=1024m -Xmx69427m -Xms128m -XX:MaxHeapFreeRatio=30 -XX:MinHeapFreeRatio=10 -XX:+UseParNewGC -XX:-UseGCOverheadLimit -XX:PermSize=256m Thanks, Sharanya On Tue, Nov 28, 2017 at 2:19 PM, Johannes Alberti wrote: Hi Sharanya, Can you share your current GC settings and Java version? Are you using Java 8/9 with G1 already? Regards, Johannes Sent from my iPhone On Nov 28, 2017, at 12:57 PM, Sharanya Santhanam wrote: Hello, I am currently trying to upgrade the Hive version on our prod clusters from v1.2 to v2.1. We also want to adopt HS2 on the new upgraded cluster. Earlier all queries were submitted via the Hive CLI. I would like to understand how large a single HS2 heap can be, and whether there is any formula to figure out how many concurrent sessions I can support with a particular heap setting. We currently have an upper limit of supporting 300 concurrent sessions (hive.server2.thrift.max.worker.threads=300). Based on this we set the max heap size to 70 GB, but are seeing many long GC pauses. I would like to understand what the industry standard is for a max HS2 heap size. Are there any recommendations on which JVM GC settings work best for supporting such a high number of concurrent sessions? Thanks, Sharanya -- 王海华
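There is no official sizing formula that I know of, but a back-of-envelope check on the numbers in this thread is straightforward (a sketch, not a recommendation):

```python
def heap_per_session_mb(heap_gb, max_sessions):
    """Rough per-session headroom for a given HS2 heap size and
    thrift worker-thread limit."""
    return heap_gb * 1024 / max_sessions

# 70 GB heap / 300 worker threads, as described in the thread:
per_session = heap_per_session_mb(70, 300)  # roughly 239 MB per session
```

Each concurrent session's share has to cover result-set buffering and query state, which is one reason a very large heap can still see long GC pauses when all 300 sessions are active.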
Re: metastore (mysql) connection problems - hive v0.12
Hi Stephen, Try increasing the max connections limit in MySQL. I ran into a similar problem and this resolved things after a Hive 0.12 upgrade. You can increase the connection limit without a restart, but pay attention to other settings as well so that you don't over-allocate resources on your db server. Regards, Rick On Dec 17, 2013 6:12 PM, Stephen Sprague sprag...@gmail.com wrote: hi guys, I'm pretty much at my wits end on this one. i'm using hive v0.12 against a mysql metastore. the error manifests itself (mostly randomly, which is the problem) as this error in the client: 2013-12-17 16:54:03,158 ERROR metastore.RetryingRawStore (RetryingRawStore.java:invoke(146)) - JDO datastore error. Retrying metastore command after 1000 ms (attempt 1 of 1) 2013-12-17 16:54:04,345 ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(141)) - MetaException(message:java.lang.RuntimeException: commitTransaction was called but openTransactionCalls = 0. This probably indicates that there are unbalanced calls to openTransaction/commitTransaction) but i believe this to be just a symptom of something happening further upstream. examining the hive.log file for this user (under /tmp/user/hive.log) i find this hideous traceback with bonecp being the focus of attention. There's nothing wrong with the mysql server metastore, especially given it works sometimes. to me i think it has something to do with the connection pooling logic but that's only speculation. before i surrender and rollback to a previous version i ask: Has anyone encountered this before? Or are there any nuggets in the traceback worthy of insight? thanks, Stephen. 2013-12-17 16:54:03,142 ERROR bonecp.ConnectionHandle (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08S01 2013-12-17 16:54:03,144 ERROR bonecp.ConnectionHandle (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem.
Killing off all remaining connections in the connection pool. SQL State = 08007 2013-12-17 16:54:03,146 ERROR bonecp.ConnectionHandle (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08007 2013-12-17 16:54:03,147 ERROR bonecp.ConnectionHandle (ConnectionHandle.java:markPossiblyBroken(297)) - Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08007 2013-12-17 16:54:03,148 ERROR bonecp.BoneCP (BoneCP.java:terminateAllConnections(199)) - Error in attempting to close connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) at com.mysql.jdbc.Util.getInstance(Util.java:386) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4730) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4325) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1557) at com.jolbox.bonecp.ConnectionHandle.internalClose(ConnectionHandle.java:396) at com.jolbox.bonecp.BoneCP.terminateAllConnections(BoneCP.java:197) at com.jolbox.bonecp.ConnectionHandle.markPossiblyBroken(ConnectionHandle.java:298) at com.jolbox.bonecp.ConnectionHandle.internalClose(ConnectionHandle.java:404) at 
com.jolbox.bonecp.BoneCP.terminateAllConnections(BoneCP.java:197) at com.jolbox.bonecp.ConnectionHandle.markPossiblyBroken(ConnectionHandle.java:298) at com.jolbox.bonecp.ConnectionHandle.internalClose(ConnectionHandle.java:404) at com.jolbox.bonecp.BoneCP.terminateAllConnections(BoneCP.java:197) at com.jolbox.bonecp.ConnectionHandle.markPossiblyBroken(ConnectionHandle.java:298) at com.jolbox.bonecp.ConnectionHandle.getTransactionIsolation(ConnectionHandle.java:635) at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:422) at org.datanucleus.store.rdbms.SQLController.getStatementForQuery(SQLController.java:316) at
Re: How to prevent user drop table in Hive metadata?
You can use: ALTER TABLE {table_name} ENABLE NO_DROP; And it will keep a user from dropping the table - but it can be overridden. Rick On Nov 22, 2013 12:36 PM, Echo Li echo...@gmail.com wrote: Good Friday! I was trying to apply a certain level of security in our hive data warehouse by modifying the access mode of directories and files on hdfs to 755. I think that's good enough to stop a new user from removing data; however, the user can still drop the table definition in the hive cli, and revoke doesn't seem to help much. Is there any way to prevent this? Thanks, Echo
histogram_numeric find the most frequent value
I want to find the most frequent value of a column. I noticed histogram_numeric, but I cannot specify the bin boundaries, so the result is not what I want. For example, I want something like: select gid, most_frequent(category) from mytable group by gid. where category is a column with discretized values. thanks. Richard
Re:Re: histogram_numeric find the most frequent value
good idea, I will try. thanks At 2013-10-16 19:12:30, Ed Soniat eson...@liveperson.com wrote: You could use modular math to transform the data into single-value representations of each range you intend your boundaries to represent, using a sub-select. On Wed, Oct 16, 2013 at 7:09 AM, Richard codemon...@163.com wrote: I want to find the most frequent value of a column. I noticed histogram_numeric, but I cannot specify the bin boundaries, so the result is not what I want. As an example, I want something like: select gid, most_frequent(category) from mytable group by gid; where category is a column with discretized values. thanks. Richard
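Hive has no built-in most_frequent (mode) aggregate, but the per-group mode can be computed with a nested GROUP BY plus a rank; a sketch, assuming a table mytable(gid, category) and Hive 0.11+ for the windowing function:

```sql
-- Count each (gid, category) pair, then keep the highest-count category
-- per gid. Ties all receive rank 1 and are all returned.
SELECT gid, category
FROM (
  SELECT gid, category,
         rank() OVER (PARTITION BY gid ORDER BY cnt DESC) AS r
  FROM (
    SELECT gid, category, count(*) AS cnt
    FROM mytable
    GROUP BY gid, category
  ) counted
) ranked
WHERE r = 1;
```

On versions without windowing functions, the same result can be had by joining the counted subquery against a per-gid max(cnt).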
Hive external table to HBase multiple versions
Hi Is it possible to create a Hive external table pointing to an HBase column family such that all versions of the data for the keys are returned, rather than (I presume) just the latest version of each key/column? Regards
Re: Upgrade from Hive 0.9 to Hive 0.10 Heap Error on show tables;
Hi John, Do you have a copy of the MySQL JDBC driver in your Hive library path? Rick On Apr 3, 2013 3:57 PM, John Omernik j...@omernik.com wrote: Not sure what the issue is; the conf is good, I validated that I can log in to MySQL with the username in hive-site, and I ran the metastore update scripts.

show tables;
java.lang.OutOfMemoryError: Java heap space
    at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:353)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:215)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:412)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:399)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:736)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:74)
    at $Proxy6.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1110)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1099)
    at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:2206)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:334)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.DDLTask
Thoughts?
Re: The dreaded Heap Space Issue on a Transform
What do you have set in core-site.xml for io.sort.mb, io.sort.factor, and io.file.buffer.size? You should be able to adjust these and get past the heap issue. Be careful about how much RAM you have, though, and don't set them too high. Rick On Jan 30, 2013 8:55 AM, John Omernik j...@omernik.com wrote: So it's filling up at the emitting stage, so I need to look at the task logs and/or my script that's printing to stdout as the likely culprits, I am guessing. On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans philip.j.trom...@gmail.com wrote: That particular OutOfMemoryError is happening on one of your hadoop nodes. It's the heap within the process forked by the hadoop tasktracker, I think. Phil. On 30 January 2013 14:28, John Omernik j...@omernik.com wrote: So just a follow-up. I am less looking for specific troubleshooting of my problem and more looking for a general understanding of heap space usage with Hive. When I get an error like this, is it heap space on a node, or heap space on my Hive server? Is it the heap space of the tasktracker? The heap of the job kicked off on the node? Which heap is being affected? If it's not clear in my output, where can I better understand this? I am sorely out of my league here when it comes to understanding the JVM interactions of Hive and Hadoop, i.e. where Hive runs vs. where the task trackers run, etc. Thanks in advance! On Tue, Jan 29, 2013 at 7:43 AM, John Omernik j...@omernik.com wrote: I am running a transform script that parses through a bunch of binary data. In 99% of the cases it runs fine, but on certain files I get a failure (as seen below). Funny thing is, I can run a job with only the problem source file and it will work fine, but when run as a group of files, I get these warnings. I guess what I am asking here is this: Where is the heap error?
Is this occurring on the nodes themselves or, since this is where the script is emitting records (and potentially large ones at that) and in this case my hive server running the job may be memory light, could the issue actually be due to heap on the hive server itself? My setup is 1 Hive node (that is woefully underpowered, under memoried, and under disk I/Oed) and 4 beefy hadoop nodes. I guess, my question is the heap issue on the sender or the receiver :)

2013-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got brand-new compressor
2013-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
2013-01-29 08:20:24,410 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
2013-01-29 08:20:24,410 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
2013-01-29 08:20:24,412 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
2013-01-29 08:20:27,170 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
2013-01-29 08:20:27,170 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
2013-01-29 08:20:27,170 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
2013-01-29 08:25:47,532 INFO
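The knobs Rick mentions are per-job MapReduce settings, so they can be overridden from the Hive session rather than cluster-wide; a sketch with hypothetical values (size them against the task JVM heap, which is where this OutOfMemoryError is actually thrown):

```sql
SET io.sort.mb=256;                     -- map-side sort buffer, in MB
SET io.sort.factor=64;                  -- number of streams merged at once
SET io.file.buffer.size=131072;         -- I/O buffer size, in bytes
SET mapred.child.java.opts=-Xmx1024m;   -- heap of the forked task JVM
```

io.sort.mb must fit comfortably inside the -Xmx of the child task, or the larger buffer will itself trigger the heap error.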
how many map-reduce jobs are needed in a hive query
I am wondering how to determine the number of map-reduce jobs for a hive query. For example, the following query: select sum(c1), sum(c2), k1 from (select transform(*) using 'mymapper' as c1, c2, k1 from t1) a group by k1; takes two map-reduce jobs when I run it, but I expect it to take only one: in the map stage use 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform the sum in the reducer. So why does hive take two map-reduce jobs?
Re: how many map-reduce jobs are needed in a hive query
thanks. I used the explain command and got the plan, but I am still confused. Below is the description of the two map-reduce stages: it seems that in Stage-1 the aggregation has already been done, so why does Stage-2 aggregate again?
==
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:t1
          TableScan
            alias: t1
            Select Operator
              expressions: expr: f type: string
              outputColumnNames: _col0
              Transform Operator
                command: mymapper
                output info:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                Select Operator
                  expressions: expr: _col0 type: string, expr: _col1 type: string, expr: _col2 type: string
                  outputColumnNames: _col0, _col1, _col2
                  Group By Operator
                    aggregations: expr: sum(_col0), expr: sum(_col1)
                    bucketGroup: false
                    keys: expr: _col2 type: string
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Reduce Output Operator
                      key expressions: expr: _col0 type: string
                      sort order: +
                      Map-reduce partition columns: expr: rand() type: double
                      tag: -1
                      value expressions: expr: _col1 type: double, expr: _col2 type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations: expr: sum(VALUE._col0), expr: sum(VALUE._col1)
          bucketGroup: false
          keys: expr: KEY._col0 type: string
          mode: partials
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
          Reduce Output Operator
            key expressions: expr: _col0 type: string
            sort order: +
            Map-reduce partition columns: expr: _col0 type: string
            tag: -1
            value expressions: expr: _col1 type: double, expr: _col2 type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations: expr: sum(VALUE._col0), expr: sum(VALUE._col1)
          bucketGroup: false
          keys: expr: KEY._col0 type: string
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions: expr: _col1 type: double, expr: _col2 type: double, expr: _col0 type: string
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

At 2013-01-23 11:45:13, Richard codemon...@163.com wrote: I am wondering how to determine the number of map-reduce jobs for a hive query. For example, the following query: select sum(c1), sum(c2), k1 from (select transform(*) using 'mymapper' as c1, c2, k1 from t1) a group by k1; takes two map-reduce jobs when I run it, but I expect it to take only one: in the map stage use 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform the sum in the reducer. So why does hive take two map-reduce jobs?
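The rand() partition column and the partials/final group-by modes in the plan are characteristic of Hive's skew-tolerant two-stage aggregation: the first job spreads the hash-side partial sums across random reducers, and a second job is then needed to combine the partials per key. A sketch of the setting that controls this (verify the default against your Hive version):

```sql
-- With skew handling on, Stage-1 partitions by rand() and produces
-- partial aggregates; Stage-2 merges them per k1 (mode: final).
-- Turning it off should collapse the query to a single MR job.
SET hive.groupby.skewindata=false;

SELECT sum(c1), sum(c2), k1
FROM (SELECT TRANSFORM(*) USING 'mymapper' AS c1, c2, k1 FROM t1) a
GROUP BY k1;
```

The trade-off is that a single job puts all rows for a hot key on one reducer, which is exactly the skew case the two-job plan protects against.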
Re:Re: create a hive table: always a tab space before each line
thanks. it seems that as long as I use sequencefile as the storage format, there will be a \t before the first column. If this output is only consumed by Hive, that is fine. The problem is that I may use a self-defined map-reduce job to read these files. Does that mean I have to take care of this \t myself? Is there any option to disable this \t in Hive? At 2013-01-09 22:38:11, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: To add to what Nitin said, there is no key output by Hive in front of the tab. On Wed, Jan 9, 2013 at 3:07 AM, Nitin Pawar nitinpawar...@gmail.com wrote: you may want to look at the sequencefile format http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432 that tab is to separate key from values in the record (I may be wrong but this is how I interpreted it) On Wed, Jan 9, 2013 at 12:49 AM, Richard codemon...@163.com wrote: more information: if I set the format as textfile, there is no tab space. if I set the format as sequencefile and view the content via hadoop fs -text, I see a tab space at the head of each line. At 2013-01-09 15:44:00, Richard codemon...@163.com wrote: hi there I have a problem with creating a hive table. No matter what field delimiter I use, I always get a tab space at the head of each line (a line is a record), something like this: \t f1 \001 f2 \001 f3 ... where f1, f2, f3 denote the field values and \001 is the field separator. Here is the clause I used:

create external table if not exists ${HIVETBL_my_table}
(
  nid string,
  userid string,
  spv bigint,
  sipv bigint,
  pay bigint,
  spay bigint,
  ipv bigint,
  sellerid string,
  cate string
)
partitioned by (ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';

thanks for help. Richard -- Nitin Pawar -- Dean Wampler, Ph.D. thinkbiganalytics.com +1-312-339-1330
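Tying the replies above together: the tab is most likely not in the data at all. hadoop fs -text prints a SequenceFile record as key, then a tab, then the value, and Hive writes an empty key, so the value itself starts after the tab. A custom MapReduce job reading with SequenceFileInputFormat receives the key and value as separate objects and should not need to strip anything. If the files must instead be readable as raw delimited bytes, the same definition stored as a text file sidesteps the question, as noted earlier in the thread:

```sql
-- Same table stored as plain text: the files then contain only the
-- '\001'-delimited fields, with no key and no leading tab.
create external table if not exists ${HIVETBL_my_table}
(
  nid string, userid string, spv bigint, sipv bigint, pay bigint,
  spay bigint, ipv bigint, sellerid string, cate string
)
partitioned by (ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as textfile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';
```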
Re:create a hive table: always a tab space before each line
more information: if I set the format as textfile, there is no tab space. if I set the format as sequencefile and view the content via hadoop fs -text, I see a tab space at the head of each line. At 2013-01-09 15:44:00, Richard codemon...@163.com wrote: hi there I have a problem with creating a hive table. No matter what field delimiter I use, I always get a tab space at the head of each line (a line is a record), something like this: \t f1 \001 f2 \001 f3 ... where f1, f2, f3 denote the field values and \001 is the field separator. Here is the clause I used:

create external table if not exists ${HIVETBL_my_table}
(
  nid string,
  userid string,
  spv bigint,
  sipv bigint,
  pay bigint,
  spay bigint,
  ipv bigint,
  sellerid string,
  cate string
)
partitioned by (ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';

thanks for help. Richard
Re:Re: create a hive table: always a tab space before each line
I am trying to create the table and insert overwrite into it, so the data is generated by Hive itself. At 2013-01-09 17:17:06, Anurag Tangri tangri.anu...@gmail.com wrote: Hi Richard, You should set the format in the create external table command based on the format of your data on HDFS. Is your data a text file or a seq file on HDFS? Thanks, Anurag Tangri Sent from my iPhone On Jan 9, 2013, at 12:49 AM, Richard codemon...@163.com wrote: more information: if I set the format as textfile, there is no tab space. if I set the format as sequencefile and view the content via hadoop fs -text, I see a tab space at the head of each line. At 2013-01-09 15:44:00, Richard codemon...@163.com wrote: hi there I have a problem with creating a hive table. No matter what field delimiter I use, I always get a tab space at the head of each line (a line is a record), something like this: \t f1 \001 f2 \001 f3 ... where f1, f2, f3 denote the field values and \001 is the field separator. Here is the clause I used:

create external table if not exists ${HIVETBL_my_table}
(
  nid string,
  userid string,
  spv bigint,
  sipv bigint,
  pay bigint,
  spay bigint,
  ipv bigint,
  sellerid string,
  cate string
)
partitioned by (ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';

thanks for help. Richard
create a hive table: always a tab space before each line
hi there I have a problem with creating a hive table. No matter what field delimiter I use, I always get a tab space at the head of each line (a line is a record), something like this: \t f1 \001 f2 \001 f3 ... where f1, f2, f3 denote the field values and \001 is the field separator. Here is the clause I used:

create external table if not exists ${HIVETBL_my_table}
(
  nid string,
  userid string,
  spv bigint,
  sipv bigint,
  pay bigint,
  spay bigint,
  ipv bigint,
  sellerid string,
  cate string
)
partitioned by (ds string)
row format delimited fields terminated by '\001' lines terminated by '\n'
stored as sequencefile
location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';

thanks for help. Richard
user-defined data format
Hi, I want to use Hive on some data in the following format: doc\0x01 field1=val1\0x01 field2=val2\0x01 ... /doc\0x01 where the lines between doc and /doc form one record. How should I define the table? thanks. Richard
Re:Re: update a hive table
I see, thanks. At 2012-03-16 20:30:24, Bejoy Ks bejoy...@yahoo.com wrote: Hi Robert, Sorry I missed answering your question If mytable has many columns but I only need to update one of them, how can I write the statement short? It is not possible in hive. At a bare minimum hive would overwrite a partition (not even a record/file) even if you frame an optimized query. Hope it helps!... Regards Bejoy.K.S From: Bejoy Ks bejoy...@yahoo.com To: user@hive.apache.org user@hive.apache.org Sent: Friday, March 16, 2012 5:56 PM Subject: Re: update a hive table Hey Richard First, don't treat hive as a database. Hive is just an interface above map reduce that helps you write SQL-like queries that in turn are converted to MR jobs. That said, hive doesn't support update. If you are just experimenting, the query that you wrote would overwrite a whole record (in a broader context, a whole partition/table) Regards Bejoy.K.S From: Richard codemon...@163.com To:user@hive.apache.org Sent: Friday, March 16, 2012 4:26 PM Subject: update a hive table if I want to update a table, e.g., insert overwrite table mytable select lower(col1), col2, col3 from mytable; if mytable has many columns but I only need to update one of them, how can I write the statement short?
update a hive table
if I want to update a table, e.g., insert overwrite table mytable select lower(col1), col2, col3 from mytable; if mytable has many columns but I only need to update one of them, how can I write the statement short?
Re: Multi character delimiter for Hive Columns and Rows
A custom SerDe would be your best bet. We're using one to do exactly that. Regards, Rick On Apr 28, 2011 11:29 AM, Shantian Purkad shantian_pur...@yahoo.com wrote: Any suggestions? From: Shantian Purkad shantian_pur...@yahoo.com To: user@hive.apache.org Sent: Tue, April 26, 2011 11:05:46 PM Subject: Multi character delimiter for Hive Columns and Rows Hello, We have a situation where the data coming from source systems to hive may contain common characters and delimiters such as |, tabs, newline characters, etc. We may have to use multi-character delimiters such as |# for columns and ||# for rows. How can we achieve this? In this case a single row may look like below (|# is the column delimiter and ||# is the row delimiter): row 1 col1 |# row 1 col2 |# row 1 col 3 has two new line characters |# and this is the last column of row 1 ||# row 2 col1 |# row 2 col2 |# row 2 col 3 has one tab and one new line character |# and this is the last column of row 2 ||# Would a custom SerDe help us handle this situation? Thanks and Regards, Shantian
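At the time of this thread a custom SerDe was the only option, but later Hive releases (0.14+, HIVE-5871) ship a MultiDelimitSerDe that handles multi-character field delimiters out of the box; a sketch, with hypothetical table and column names:

```sql
-- Fields split on the two-character delimiter '|#'. Note this only
-- solves the column side: rows are still split on newlines by the
-- text input format, so a multi-character row delimiter like '||#'
-- still needs a custom InputFormat or a preprocessing step.
CREATE TABLE multi_delim_example (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='|#')
STORED AS TEXTFILE;
```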