[jira] [Created] (HIVE-6308) COLUMNS_V2 Metastore table not populated for tables created without an explicit column list.
Alexander Behm created HIVE-6308:

Summary: COLUMNS_V2 Metastore table not populated for tables created without an explicit column list.
Key: HIVE-6308
URL: https://issues.apache.org/jira/browse/HIVE-6308
Project: Hive
Issue Type: Bug
Components: Database/Schema
Affects Versions: 0.10.0
Reporter: Alexander Behm

Consider this example table:

{code}
CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
{code}

When I try to run an ANALYZE TABLE for computing column stats on any of the columns, I get:

{code}
org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:Column o_orderpriority for which stats gathering is requested doesn't exist.)
	at org.apache.hadoop.hive.ql.metadata.Hive.updateTableColumnStatistics(Hive.java:2280)
	at org.apache.hadoop.hive.ql.exec.ColumnStatsTask.persistTableStats(ColumnStatsTask.java:331)
	at org.apache.hadoop.hive.ql.exec.ColumnStatsTask.execute(ColumnStatsTask.java:343)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1383)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1169)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:982)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
{code}

The root cause appears to be that the COLUMNS_V2 table in the Metastore isn't populated properly during table creation.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
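A possible workaround sketch (untested; the column list is hypothetical and would have to mirror the Avro schema exactly) is to declare the columns explicitly at CREATE TABLE time so the Metastore records them in COLUMNS_V2:

{code}
-- Hypothetical explicit column list; names/types must match test_serializer.avsc.
CREATE TABLE avro_test_explicit (o_orderpriority STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
{code}

With the columns stored in the Metastore, ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS should then be able to resolve them.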
[jira] [Created] (HIVE-5264) SQL generated by MetaStoreDirectSql.java not compliant with Postgres.
Alexander Behm created HIVE-5264:

Summary: SQL generated by MetaStoreDirectSql.java not compliant with Postgres.
Key: HIVE-5264
URL: https://issues.apache.org/jira/browse/HIVE-5264
Project: Hive
Issue Type: Bug
Components: Metastore
Affects Versions: 0.11.0
Environment: Ubuntu 12.04, PostgreSQL 9.1.8
Reporter: Alexander Behm

Some operations against the Hive Metastore seem broken against Postgres. For example, when using HiveMetastoreClient.listPartitions() the Postgres logs show queries such as:

{code}
2013-09-09 19:10:01 PDT STATEMENT: select PARTITIONS.PART_ID from PARTITIONS
  inner join TBLS on PARTITIONS.TBL_ID = TBLS.TBL_ID
  inner join DBS on TBLS.DB_ID = DBS.DB_ID
  where TBLS.TBL_NAME = $1 and DBS.NAME = $2 order by PART_NAME asc
{code}

with a somewhat cryptic (but correct) error:

{code}
ERROR: relation "partitions" does not exist at character 32
{code}

Postgres identifiers are somewhat unusual: unquoted identifiers are folded to lower case (there is no Postgres option to change this). Since the Metastore table schema uses upper-case table names, the correct SQL requires quoted identifiers for those tables, i.e., select "PARTITIONS"."PART_ID" from "PARTITIONS"... Hive sets metastore.try.direct.sql=true by default, so the above SQL is generated by hive/metastore/MetaStoreDirectSql.java, i.e., this is not a DataNucleus problem. When I set metastore.try.direct.sql=false, the Metastore backed by Postgres works.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
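For reference, a fully quoted form of the generated statement (a sketch extrapolated from the unquoted query above; not tested against the Metastore schema) would be:

{code}
select "PARTITIONS"."PART_ID" from "PARTITIONS"
  inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"
  inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"
  where "TBLS"."TBL_NAME" = $1 and "DBS"."NAME" = $2 order by "PART_NAME" asc
{code}

Quoting every identifier preserves the upper-case names that the Metastore schema scripts create, so Postgres resolves the relations correctly.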
[jira] [Commented] (HIVE-5264) SQL generated by MetaStoreDirectSql.java not compliant with Postgres.
[ https://issues.apache.org/jira/browse/HIVE-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763767#comment-13763767 ] Alexander Behm commented on HIVE-5264:

Thanks, Sergey. I don't know when the change went in, so you are probably correct about the version. Afaik, Postgres does not have an ANSI-compliance flag or any other means to change that specific behavior with identifiers. Just trying to be a good citizen, so I reported the issue. I understand the difficulties and I'm not suggesting any particular fix. Setting metastore.try.direct.sql=false is an acceptable workaround for me (but others may disagree).

The fallback didn't work for me. I think the reason may be that the original (failed) SQL statement and the fallback are executed under the same transaction (see ObjectStore.java getPartitionsByNamesInternal()), so the fallback is also aborted; see this snippet from the Postgres log immediately after the error message:

{code}
2013-09-09 19:10:01 PDT ERROR: current transaction is aborted, commands ignored until end of transaction block
{code}

I didn't dig too deep into the code, so don't take my word for it.
[jira] [Commented] (HIVE-5264) SQL generated by MetaStoreDirectSql.java not compliant with Postgres.
[ https://issues.apache.org/jira/browse/HIVE-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763791#comment-13763791 ] Alexander Behm commented on HIVE-5264:

The logs seem to suggest so.

> SQL generated by MetaStoreDirectSql.java not compliant with Postgres.
> ---------------------------------------------------------------------
>
> Key: HIVE-5264
> URL: https://issues.apache.org/jira/browse/HIVE-5264
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Affects Versions: 0.12.0
> Environment: Ubuntu 12.04
> PostgreSQL 9.1.8
> Reporter: Alexander Behm
> Assignee: Sergey Shelukhin
[jira] [Commented] (HIVE-4022) Structs and struct fields cannot be NULL in INSERT statements
[ https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268625#comment-14268625 ] Alexander Behm commented on HIVE-4022:

Easier workaround: IF(false, named_struct("a", 1), NULL)

> Structs and struct fields cannot be NULL in INSERT statements
> -------------------------------------------------------------
>
> Key: HIVE-4022
> URL: https://issues.apache.org/jira/browse/HIVE-4022
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Reporter: Michael Malak
>
> Originally thought to be Avro-specific, and first noted with respect to
> HIVE-3528 "Avro SerDe doesn't handle serializing Nullable types that require
> access to a Schema", it turns out even native Hive tables cannot store NULL
> in a STRUCT field or for the entire STRUCT itself, at least when the NULL is
> specified directly in the INSERT statement.
> Again, this affects both Avro-backed tables and native Hive tables.
>
> ***For native Hive tables:
> The following:
> echo 1,2 >twovalues.csv
> hive
> CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE tc;
> CREATE TABLE oc (z STRUCT);
> INSERT INTO TABLE oc SELECT null FROM tc;
> produces the error
> FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
> table because column number/types are different 'oc': Cannot convert column 0
> from void to struct.
> The following:
> INSERT INTO TABLE oc SELECT named_struct('a', null, 'b', null) FROM tc;
> produces the error:
> FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
> table because column number/types are different 'oc': Cannot convert column 0
> from struct to struct.
>
> ***For Avro:
> In HIVE-3528, there is in fact a null-struct test case in line 14 of
> https://github.com/apache/hive/blob/15cc604bf10f4c2502cb88fb8bb3dcd45647cf2c/data/files/csv.txt
> The test script at
> https://github.com/apache/hive/blob/12d6f3e7d21f94e8b8490b7c6d291c9f4cac8a4f/ql/src/test/queries/clientpositive/avro_nullable_fields.q
> does indeed work. But in that test, the query gets all of its data from a
> test table verbatim:
> INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;
> If instead we stick in a hard-coded null for the struct directly into the
> query, it fails:
> INSERT OVERWRITE TABLE as_avro SELECT string1, int1, tinyint1, smallint1,
> bigint1, boolean1, float1, double1, list1, map1, null, enum1, nullableint,
> bytes1, fixed1 FROM test_serializer;
> with the following error:
> FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target
> table because column number/types are different 'as_avro': Cannot convert
> column 10 from void to struct.
> Note, though, that substituting a hard-coded null for string1 (and restoring
> struct1 into the query) does work:
> INSERT OVERWRITE TABLE as_avro SELECT null, int1, tinyint1, smallint1,
> bigint1, boolean1, float1, double1, list1, map1, struct1, enum1, nullableint,
> bytes1, fixed1 FROM test_serializer;
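The IF(false, ...) workaround from the comment above can be applied directly to the failing native-table INSERT; a sketch (the struct field names and types are assumed for illustration, since the column definition is not fully shown in the report):

{code}
-- Hypothetical: assumes oc's struct column has int fields a and b.
INSERT INTO TABLE oc SELECT IF(false, named_struct('a', 1, 'b', 1), NULL) FROM tc;
{code}

The IF expression gives the SELECT item a concrete struct type instead of void, so the semantic check should pass while still inserting NULL.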
[jira] [Created] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
Alexander Behm created HIVE-18743:

Summary: CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
Key: HIVE-18743
URL: https://issues.apache.org/jira/browse/HIVE-18743
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 1.1.0, 1.2.0
Reporter: Alexander Behm

When hive.stats.autogather=true, the Metastore lists all files under the table directory to populate basic stats like file counts and sizes. This file-listing operation can be very expensive, particularly on filesystems like S3. One way to address this issue is to reconfigure hive.stats.autogather=false.

*Here's the bug:* It is my understanding that the DO_NOT_UPDATE_STATS table property is intended to selectively prevent this stats collection. Unfortunately, the table property is checked *after* the expensive file-listing operation, so DO_NOT_UPDATE_STATS does not seem to work as intended. See: https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633

Relevant code snippet:

{code}
public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
    boolean madeDir, boolean forceRecompute, EnvironmentContext environmentContext)
    throws MetaException {
  if (tbl.getPartitionKeysSize() == 0) {
    // Update stats only when unpartitioned
    FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
    // DO_NOT_UPDATE_STATS is checked in here, after
    // wh.getFileStatusesForUnpartitionedTable() has already been called.
    return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, environmentContext);
  } else {
    return false;
  }
}
{code}
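A sketch of the two mitigations discussed above (the table name is hypothetical, and neither has been verified here against a particular Hive version):

{code}
-- Global workaround: disable automatic stats gathering entirely.
SET hive.stats.autogather=false;

-- Per-table property; per this report it is checked only after the
-- expensive file listing, so it does not currently avoid the S3 cost.
ALTER TABLE my_s3_table SET TBLPROPERTIES ('DO_NOT_UPDATE_STATS'='true');
{code}

The per-table property would be the preferable knob once the check is moved before wh.getFileStatusesForUnpartitionedTable().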
[jira] [Created] (HIVE-9720) Metastore does not properly migrate column stats when renaming a table across databases.
Alexander Behm created HIVE-9720:

Summary: Metastore does not properly migrate column stats when renaming a table across databases.
Key: HIVE-9720
URL: https://issues.apache.org/jira/browse/HIVE-9720
Project: Hive
Issue Type: Bug
Components: Metastore
Affects Versions: 0.13.1
Reporter: Alexander Behm

It appears that the Hive Metastore does not properly migrate column statistics when renaming a table across databases. While renaming across databases is not supported in HiveQL, it can be done via the Metastore Thrift API. The problem is that such a newly renamed table cannot be dropped (unless renamed back to its original database/name).

Here are steps for reproducing the issue.

1. From the Hive shell/beeline:

{code}
create database db1;
create database db2;
create table db1.mv (i int);
use db1;
analyze table mv compute statistics for columns i;
{code}

2. From a Java program:

{code}
public static void main(String[] args) throws Exception {
  HiveConf conf = new HiveConf(MetaStoreClientPool.class);
  HiveMetaStoreClient hiveClient = new HiveMetaStoreClient(conf);
  Table t = hiveClient.getTable("db1", "mv");
  t.setDbName("db2");
  t.setTableName("mv2");
  hiveClient.alter_table("db1", "mv", t);
}
{code}

3. From the Hive shell/beeline:

{code}
drop table db2.mv2;
{code}

Stack shown when running 3:

{code}
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
	at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
	at org.datanucleus.api.jdo.JDOTransaction.commit(JDOTransaction.java:165)
	at org.apache.hadoop.hive.metastore.ObjectStore.commitTransaction(ObjectStore.java:411)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108)
	at com.sun.proxy.$Proxy0.commitTransaction(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_core(HiveMetaStore.java:1389)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_with_environment_context(HiveMetaStore.java:1525)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:106)
	at com.sun.proxy.$Proxy1.drop_table_with_environment_context(Unknown Source)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_table_with_environment_context.getResult(ThriftHiveMetastore.java:8072)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_table_with_environment_context.getResult(ThriftHiveMetastore.java:8056)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hadoop.hive.metastore.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:48)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:244)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
NestedThrowablesStackTrace:
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM "TBLS" WHERE "TBL_ID"='1621' was aborted. Call getNextException to see the cause.
	at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2598)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:407)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2737)
	at com.jolbox.bonecp.StatementHandle.executeBatch(StatementHandle.java:424)
	at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeBatch(ParamLoggingPreparedStatement.java:372)
	at org.datanucleus.store.rdbms.SQLController.processConnectionStatement(SQLController.java:628)
{code}
[jira] [Created] (HIVE-10427) collect_list() and collect_set() should accept struct types as argument
Alexander Behm created HIVE-10427:

Summary: collect_list() and collect_set() should accept struct types as argument
Key: HIVE-10427
URL: https://issues.apache.org/jira/browse/HIVE-10427
Project: Hive
Issue Type: Wish
Components: UDF
Reporter: Alexander Behm

The collect_list() and collect_set() functions currently only accept scalar argument types. It would be very useful if these functions could also accept struct argument types for creating nested data from flat data. For example, suppose I wanted to create a nested customers/orders table from two flat tables, customers and orders. Then it'd be very convenient to write something like this:

{code}
insert into table nested_customers_orders
select c.*, collect_list(named_struct("oid", o.oid, "order_date", o.date, ...))
from customers c
inner join orders o on (c.cid = o.cid)
group by c.cid
{code}

Thank you for your consideration.
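Until struct arguments are supported, one workaround sketch (using the same hypothetical customers/orders schema as above, with a differently shaped target table; untested) is to collect each struct field into parallel arrays:

{code}
-- Parallel-array workaround: one collect_list() per scalar field; the i-th
-- entries across the arrays together describe the i-th order of a customer.
insert into table nested_customers_orders_arrays
select c.*, collect_list(o.oid), collect_list(o.date)
from customers c
inner join orders o on (c.cid = o.cid)
group by c.cid
{code}

This loses the grouping of fields into a single struct, which is exactly what the requested feature would provide.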
[jira] [Created] (HIVE-11622) Creating an Avro table with a complex map-typed column leads to incorrect column type.
Alexander Behm created HIVE-11622:

Summary: Creating an Avro table with a complex map-typed column leads to incorrect column type.
Key: HIVE-11622
URL: https://issues.apache.org/jira/browse/HIVE-11622
Project: Hive
Issue Type: Bug
Components: Database/Schema
Affects Versions: 1.1.0
Reporter: Alexander Behm

In the CREATE TABLE below, the map-typed column ends up with the wrong type. I suspect some problem with inferring the Avro schema from the column definitions, but I am not sure.

Reproduction:

{code}
hive> create table t (c map>) stored as avro;
OK
Time taken: 0.101 seconds
hive> desc t;
OK
c	array>	from deserializer
Time taken: 0.135 seconds, Fetched: 1 row(s)
{code}

Note how the type shown in DESCRIBE is not the type originally passed in the CREATE TABLE.
[jira] [Created] (HIVE-15653) Some ALTER TABLE commands drop table stats
Alexander Behm created HIVE-15653:

Summary: Some ALTER TABLE commands drop table stats
Key: HIVE-15653
URL: https://issues.apache.org/jira/browse/HIVE-15653
Project: Hive
Issue Type: Bug
Components: Metastore
Affects Versions: 1.1.0
Reporter: Alexander Behm
Priority: Critical

Some ALTER TABLE commands drop the table stats. That may make sense for some ALTER TABLE operations, but certainly not for others. Personally, I think ALTER TABLE should only change what was requested by the user, without any side effects that may be unclear to users. In particular, collecting stats can be an expensive operation, so it's rather inconvenient for users if they get wiped accidentally.

Repro:

{code}
create table t (i int);
insert into t values(1);
analyze table t compute statistics;
alter table t set tblproperties('test'='test');

hive> describe formatted t;
OK
# col_name	data_type	comment
i	int

# Detailed Table Information
Database:	default
Owner:	abehm
CreateTime:	Tue Jan 17 18:13:34 PST 2017
LastAccessTime:	UNKNOWN
Protect Mode:	None
Retention:	0
Location:	hdfs://localhost:20500/test-warehouse/t
Table Type:	MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	false
	last_modified_by	abehm
	last_modified_time	1484705748
	numFiles	1
	numRows	-1
	rawDataSize	-1
	test	test
	totalSize	2
	transient_lastDdlTime	1484705748

# Storage Information
SerDe Library:	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:	org.apache.hadoop.mapred.TextInputFormat
OutputFormat:	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:	No
Num Buckets:	-1
Bucket Columns:	[]
Sort Columns:	[]
Storage Desc Params:
	serialization.format	1
Time taken: 0.169 seconds, Fetched: 34 row(s)
{code}

The same behavior can be observed with several other ALTER TABLE commands.
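Until this is fixed, a minimal mitigation sketch (assumed, not verified; it merely recomputes what the ALTER TABLE wiped rather than preventing the wipe):

{code}
-- Recompute the basic stats dropped by the ALTER TABLE in the repro above,
-- then inspect Table Parameters to check numRows/totalSize are populated again.
analyze table t compute statistics;
describe formatted t;
{code}

For large tables this recompute is itself expensive, which is why losing the stats as a side effect is so costly.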