Using Hive metastore as general purpose RDBMS
Hi, I have a case where I would like to extend Hive to use information from a regular RDBMS. To limit the complexity of the installation, I thought I could piggyback on the already existing metastore. As I understand it, HCatalog is not built for this purpose. Is there someone out there who has a similar use case, or who has any input on how this is done or whether it should be avoided? The use case is to look up which partitions contain certain data. Thanks, Petter
HiveServer2 Logging
Hey, We are considering moving to HiveServer2, and it seems that its logging capabilities are really limited because the server runs on a single port. I'll explain the problem more precisely.

Today we are using HiveServer, where each user gets a specific port (or a set of ports). We then start X hiveservers, one per port, in the following manner:

nohup hive --service hiveserver -p $2 -v --hiveconf hive.root.logger=INFO,console > ~/hive.$2 2>&1 &

For example:

nohup hive --service hiveserver -p 12354 -v --hiveconf hive.root.logger=INFO,console > ~/hive.12354 2>&1 &

This way, hiveserver runs in the background, and the logging is redirected to the console and from there to a file. So each port has its own log file and, assuming there are no concurrent executions on the same port, each file is a sequential execution log. This is really convenient for debugging executions on the HiveServer: the full stack trace isn't passed through JDBC and mostly we just get a MapredTask error, but the full errors are found in the logs, so debugging is easy.

Now, in HiveServer2, I couldn't find any way to separate the logging per session. In fact, I see there is a Jira open for this issue: https://issues.apache.org/jira/browse/HIVE-5924. Is there any workaround for this? I want each session/connection in its own log file; higher resolution is also acceptable (like a log per query).

Looking at hive-exec-log4j.properties, it seems that logging there is per query: hive.log.file=${hive.query.id}.log. But this is a different log, full of counter data and task progress, while I'm looking for the standard logging as seen in the CLI. I think that is controlled by hive.log.file in hive-log4j.properties, which is set to hive.log.file=hive.log. Trying to set it the same way as the previous one doesn't help, as ${hive.query.id} resolves to nothing. Any advice appreciated.

Dima Machlin, Big Data Architect
Pursway.com (http://www.pursway.com/)
Re: Formatting hive queries
I use vim and https://github.com/vim-scripts/SQLUtilities to do it. It's not Hive specific; any SQL formatting tool will work. On Tue, Jan 21, 2014 at 11:23 PM, pandees waran pande...@gmail.com wrote: Hi, I would like to come up with code which automatically formats your hql files, because formatting is a tedious task and I would like to build a utility for it. Please let me know whether any utilities for formatting hive queries already exist. -- Thanks, Pandeeswaran
Re: Using Hive metastore as general purpose RDBMS
HCatalog is definitely not designed for this purpose. Could you explain your use case more fully? Is this indexing for better query planning or faster file access? If so, you might look at some of the work going on in ORC, which stores indices of its data in the format itself for these purposes.

Also, how much data do you need to store? Even the index for Hadoop-scale data can quickly overwhelm MySQL or Postgres (which is what most people use for their metastores) if you are keeping per-row information.

If you truly want to access an RDBMS as if it were an external data store, you could implement a HiveStorageHandler for your RDBMS.

Alan.

On Jan 22, 2014, at 2:02 AM, Petter von Dolwitz (Hem) petter.von.dolw...@gmail.com wrote: Hi, I have a case where I would like to extend Hive to use information from a regular RDBMS. [...]

-- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
HIVE SUB QUERY: How to implement this case
Hello all, I have a case where I need logic like this:

select as_of_dt as as_of_dt, max_feed_key as max_feed_key, min_feed_key as min_feed_key
from feed_key_temp
where max_feed_key ( select max(feed_key) from summ_table )
group by as_of_dt;

Here, max_feed_key and min_feed_key are fields in the table feed_key_temp. As Hive (version 0.9) does not support subqueries in the WHERE clause, please suggest a workaround and how to implement it. Thanks Regards Yogesh Kumar
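The usual workaround on Hive 0.9 is to rewrite the WHERE-clause subquery as a join against a single-row derived table. Here is a minimal sketch of that rewrite, using SQLite from Python purely to illustrate the semantics; the table and column names follow the question, and the comparison operator (lost in the original mail) is assumed to be >=:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy versions of the two tables from the question.
cur.execute("CREATE TABLE feed_key_temp (as_of_dt TEXT, max_feed_key INT, min_feed_key INT)")
cur.execute("CREATE TABLE summ_table (feed_key INT)")
cur.executemany("INSERT INTO feed_key_temp VALUES (?, ?, ?)",
                [("2014-01-01", 10, 1), ("2014-01-02", 5, 2)])
cur.executemany("INSERT INTO summ_table VALUES (?)", [(7,), (3,)])

# Hive 0.9 rejects a subquery in WHERE, but joining against a one-row
# derived table expresses the same filter. The equivalent HQL would be
# roughly (operator assumed):
#
#   SELECT t.as_of_dt, t.max_feed_key, t.min_feed_key
#   FROM feed_key_temp t
#   JOIN (SELECT max(feed_key) AS mk FROM summ_table) m ON (1 = 1)
#   WHERE t.max_feed_key >= m.mk;
rows = cur.execute("""
    SELECT t.as_of_dt, t.max_feed_key, t.min_feed_key
    FROM feed_key_temp t
    JOIN (SELECT max(feed_key) AS mk FROM summ_table) m ON 1 = 1
    WHERE t.max_feed_key >= m.mk
""").fetchall()
print(rows)  # -> [('2014-01-01', 10, 1)]
```

The derived table has exactly one row, so the ON (1 = 1) join duplicates nothing; it just makes the aggregate available to every row of feed_key_temp.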
Re: Using Hive metastore as general purpose RDBMS
Hi Alan, thank you for your reply. The loose idea I had was to store one row in the RDBMS per Hive partition, so I don't think size will be an issue (I'm expecting 3000 partitions or so). The end goal is to help decide which partitions are relevant for a query, something like adding partition info to the WHERE clause behind the scenes. The way the data is structured, we currently need to look up which partitions to use elsewhere.

I'll look into ORC for sure. Currently we do not use any of the provided file formats but have implemented our own InputFormat that reads gzipped protobufs. I suspect that later on we should investigate a possible performance gain from moving to another file format.

Petter

2014/1/22 Alan Gates ga...@hortonworks.com wrote: HCatalog is definitely not designed for this purpose. Could you explain your use case more fully? [...]
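Petter's idea (one row per partition in a side RDBMS, used to rewrite the WHERE clause behind the scenes) can be sketched roughly as below. This is only an illustration of the approach using SQLite and invented table/column names (partition_index, dt, data_key), not any actual Hive API:

```python
import sqlite3

# Hypothetical side table: one row per Hive partition, recording which
# data keys that partition contains (Petter expects ~3000 partitions).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE partition_index (dt TEXT, data_key TEXT)")
db.executemany("INSERT INTO partition_index VALUES (?, ?)",
               [("2014-01-20", "a"), ("2014-01-21", "a"), ("2014-01-21", "b")])

def partition_predicate(data_key: str) -> str:
    """Look up which partitions hold `data_key` and build a WHERE fragment
    that a wrapper could append to the Hive query before submitting it."""
    dts = [r[0] for r in db.execute(
        "SELECT DISTINCT dt FROM partition_index WHERE data_key = ? ORDER BY dt",
        (data_key,))]
    return "dt IN (%s)" % ", ".join("'%s'" % d for d in dts)

query = "SELECT * FROM events WHERE " + partition_predicate("a")
print(query)  # -> SELECT * FROM events WHERE dt IN ('2014-01-20', '2014-01-21')
```

Because the predicate names only the partition column, Hive's ordinary partition pruning then restricts the scan to those partitions, without Hive itself ever talking to the side RDBMS.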
DESCRIBE EXTENDED shows numRows=0
Hi, I have observed that DESCRIBE EXTENDED always shows the number of rows as zero despite the fact that the table has data. Is this a bug? Is it known? Has anyone else come across the same? Thanks,
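One likely explanation: numRows and the other basic table statistics are only populated when statistics are actually gathered (either via hive.stats.autogather on inserts, or explicitly); on a table whose data was loaded without stats collection they stay at their zero defaults. If that is the situation here, something along these lines (table and partition names are placeholders) should fill them in:

```sql
-- Gather basic stats so DESCRIBE EXTENDED shows a real numRows.
-- For a partitioned table, per partition:
ANALYZE TABLE my_table PARTITION(dt='2014-01-01') COMPUTE STATISTICS;

-- For an unpartitioned table:
ANALYZE TABLE my_table COMPUTE STATISTICS;
```

After the ANALYZE completes, DESCRIBE EXTENDED should report the updated numRows in the table (or partition) parameters.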
Hive 0.12.0 mysql metastore exception
Hi guys, I upgraded Hive 0.7.0 to Hive 0.12.0 recently. We resolved some problems, but the one below puzzles me. We run thousands of jobs every day and get about 40 of these errors. I increased the MySQL max connections from 2000 to 4000, and I found the historical max connection count was less than 2000. Any suggestions will be greatly appreciated. Thanks in advance.

2014-01-23 06:35:47,724 WARN bonecp.BoneCPConfig (BoneCPConfig.java:sanitize(1537)) - Max Connections < 1. Setting to 20
2014-01-23 06:35:48,433 ERROR metastore.RetryingRawStore (RetryingRawStore.java:invoke(146)) - JDO datastore error. Retrying metastore command after 1000 ms (attempt 1 of 1)
2014-01-23 06:35:49,467 ERROR exec.DDLTask (DDLTask.java:execute(435)) - org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1143)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.hadoop.hive.cli.NewCliDriver.processCmd(NewCliDriver.java:166)
    at org.apache.hadoop.hive.cli.NewCliDriver.processLine(NewCliDriver.java:243)
    at org.apache.hadoop.hive.cli.NewCliDriver.main(NewCliDriver.java:427)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:286)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:121)
    ... 27 more
Caused by: java.sql.BatchUpdateException: Duplicate entry 'default' for key 2
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2020)
    at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
    at com.jolbox.bonecp.StatementHandle.executeBatch(StatementHandle.java:469)
    at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeBatch(ParamLoggingPreparedStatement.java:372)
    at org.datanucleus.store.rdbms.SQLController.processConnectionStatement(SQLController.java:628)
    at org.datanucleus.store.rdbms.SQLController.processStatementsForConnection(SQLController.java:596)
    at org.datanucleus.store.rdbms.SQLController$1.transactionFlushed(SQLController.java:683)
    at org.datanucleus.store.connection.AbstractManagedConnection.transactionFlushed(AbstractManagedConnection.java:86)
    at org.datanucleus.store.connection.ConnectionManagerImpl$2.transactionFlushed(ConnectionManagerImpl.java:454)
    at org.datanucleus.TransactionImpl.flush(TransactionImpl.java:199)
    at org.datanucleus.TransactionImpl.commit(TransactionImpl.java:263)
    at org.datanucleus.api.jdo.JDOTransaction.commit(JDOTransaction.java:98)
    ... 43 more
2014-01-23 06:35:49,468 ERROR ql.Driver (SessionState.java:printError(436)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

-- Thanks, Jander
Re: Hive 0.12.0 mysql metastore exception
2014/1/23 Jander g jande...@gmail.com: Caused by: java.sql.BatchUpdateException: Duplicate entry 'default' for key 2

What is the HQL you ran? It looks like Hive tried to insert 'default' into a metastore table, which violated a unique key.

Jov
blog: http://amutu.com/blog
Re: Hive 0.12.0 mysql metastore exception
Hi Jov, thanks for your attention. We use Hive like this: hive -e "use abc; insert overwrite ..." Here, abc is a Hive schema which already exists in the Hive metastore. But from the log we can see that Hive's DDL switch-database step failed. That is to say, it doesn't read the existing schema?

2014-01-23 06:35:49,467 ERROR exec.DDLTask (DDLTask.java:execute(435)) - org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1143)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)

Best Regards,

On Thu, Jan 23, 2014 at 1:27 PM, Jov am...@amutu.com wrote: What is the HQL you ran? It looks like Hive tried to insert 'default' into a metastore table, which violated a unique key. [...]

-- Thanks, Jander
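One possible reading of this trace: the "Duplicate entry 'default'" comes from DataNucleus schema/row auto-creation racing between many concurrent embedded metastore clients, each trying to create the default database at startup. A hedged mitigation, assuming the metastore schema has already been created once (for example with the schematool that ships with Hive 0.12), is to switch auto-creation off in hive-site.xml so clients never attempt that insert:

```xml
<!-- hive-site.xml: stop DataNucleus from auto-creating metastore
     schema objects on every client startup; rely on schematool instead. -->
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.schema.verification</name>
  <value>true</value>
</property>
```

With verification on, a client that finds a missing or mismatched schema fails fast with a clear message instead of racing to create rows; running a standalone metastore service (rather than thousands of embedded clients hitting MySQL directly) would reduce the contention further.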