Re: Run hive queries, and collect job information
Every Hive query has a history file, and you can get this information from the Hive history file.

The following Java code can serve as an example:
https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java

Regards,
Qiang

2013/1/30 Mathieu Despriee <mdespr...@octo.com>

Hi folks,

I would like to run a list of generated Hive queries. For each one, I would like to retrieve the MR job_id (or ids, in the case of multiple stages), and then use this job_id to collect statistics from the job tracker (cumulative CPU, read bytes...).

How can I send Hive queries from a bash or Python script and retrieve the job_id(s)?

For the second part (collecting stats for the job), we're using an MRv1 Hadoop cluster, so I don't have the AppMaster REST API (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html). I'm about to collect the data from the jobtracker web UI. Any better ideas?

Mathieu
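For the job_id part of the question, here is a minimal sketch of one possible approach. It is not taken from the linked QueryUtil.java. It only assumes that the text you have captured (for example the console output of a query, or the contents of a history file) contains MRv1 job IDs of the form job_<digits>_<digits>. The class name JobIdExtractor and the sample text are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JobIdExtractor {
    // MRv1 job IDs look like job_<jobtracker start time>_<sequence number>.
    private static final Pattern JOB_ID = Pattern.compile("job_\\d+_\\d+");

    /** Returns the distinct job IDs found in the text, in order of first appearance. */
    static Set<String> extractJobIds(String text) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = JOB_ID.matcher(text);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical sample text; a real run would pass the captured output instead.
        String sample = "Starting Job = job_201301300001_0007\n"
                + "Ended Job = job_201301300001_0007\n"
                + "Starting Job = job_201301300001_0008\n";
        System.out.println(extractJobIds(sample));
    }
}
```

A set is used so that a job mentioned several times (start, progress, end) is reported once, and the insertion order keeps multi-stage queries in stage order.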
classloader in org.apache.hadoop.hive.metastore.ObjectStore
The class ObjectStore has a private member named 'classLoader':

    private ClassLoader classLoader;
    {
      classLoader = Thread.currentThread().getContextClassLoader();
      if (classLoader == null) {
        classLoader = QueryStore.class.getClassLoader();
      }
    }

But I can't find the place where it is used. Does anyone have an idea about this?

Regards,
Qiang
Re: Best practice for automating jobs
I believe the HWI (Hive Web Interface) can give you a hand.
https://github.com/anjuke/hwi

You can use the HWI to submit and run queries concurrently. Partition management can be achieved by creating crontabs using the HWI. It's simple and easy to use. Hope it helps.

Regards,
Qiang

2013/1/11 Tom Brown <tombrow...@gmail.com>

All,

I want to automate jobs against Hive (using an external table with ever-growing partitions), and I'm running into a few challenges:

Concurrency - If I run Hive as a thrift server, I can only safely run one job at a time. As such, it seems like my best bet will be to run it from the command line and set up a brand new instance for each job. That's quite a bit of hassle to solve a seemingly common problem, so I want to know if there are any accepted patterns or best practices for this.

Partition management - New partitions will be added regularly. If I have to set up multiple instances of Hive for each (potentially) overlapping job, it will be difficult to keep track of the partitions that have been added. In the context of the preceding question, what is the best way to add metadata about new partitions?

Thanks in advance!

--Tom
Re: Best practice for automating jobs
The HWI creates a CLI session for each query through the Hive libraries, so several queries can run concurrently.

2013/1/11 Tom Brown <tombrow...@gmail.com>

How is concurrency achieved with this solution?
Re: Best practice for automating jobs
Are you using an embedded metastore? Only one process can connect to an embedded metastore at a time.

2013/1/11 Tom Brown <tombrow...@gmail.com>

When I tried to create concurrent CLI sessions, I thought the second one got an error about not being able to lock the metadata store. Is that error a real thing, or have I been mistaken this whole time?

--Tom
Re: Hive HWI ... request for your experience to be used Production
I know little about Oozie, but we will study it and review the issue. Thanks for your advice!

As for query compilation before scheduling, we plan to implement it later.

As we want to contribute this HWI back to the Hive community, the Apache License will probably be used. So feel free to help implement these useful features.

Regards,
Qiang

2013/1/7 Manish Malhotra <manish.hadoop.w...@gmail.com>

https://github.com/anjuke/hwi/issues/2

I added an issue for integrating HWI with Oozie. It's not an issue but a feature request. Please review it and see if it makes sense, at least in the long term.

Plus, I want to check one more use case:

1. Query compilation: when a user writes a query in the UI, does it report compilation problems synchronously, instead of scheduling a query that is not correct?

Regards,
Manish

On Sun, Jan 6, 2013 at 10:45 AM, Manish Malhotra <manish.hadoop.w...@gmail.com> wrote:

Thanks, Edward, for explaining. I'm also very much interested in building a robust tool for bringing Hive more into the enterprise world, where any data analyst or ETL developer can use it.

Regards,
Manish

On Sun, Jan 6, 2013 at 9:10 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

The Hive code is Apache licensed. If you want to add your work to Hive, simply open a JIRA on http://issues.apache.org/jira/hive and produce a patch that will apply to Hive trunk. That will start the process.

On Saturday, January 5, 2013, Qiang Wang <wsxy...@gmail.com> wrote:

Hi Manish:

Glad to talk with you.

1. We are willing and trying to make it an open source project, although we're not familiar with how to do so. We will study the licenses and choose a proper one.
2. If Apache Hive agrees to add this to the codebase, we will definitely do this. I think this HWI is not good enough yet.
3. A ThreadPool is used to run queries, so synchronous mode can be achieved by setting its size to 1.

Regards,
Qiang

2013/1/6 Manish Malhotra <manish.hadoop.w...@gmail.com>

Thanks, Qiang,

And glad that somebody is already doing the improvement. Sure, let me try it out.

Quick questions:

1. What is the license of this HWI version? Can somebody else contribute to it as true open source software, use it, and make it available to the community as well?
2. Any plans to merge this or add it to the Apache Hive codebase?
3. Does it have a synchronous mode of running queries, or only a scheduling / async way?

Thanks for your reply and time,

Regards,
Manish
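The thread above says that a thread pool runs the queries and that a pool size of 1 gives synchronous behaviour. A minimal sketch of that idea with the standard java.util.concurrent API follows. It is not the HWI code; the class name QueryPoolSketch is hypothetical, and each "query" is just a string recorded on completion where a real task would call into Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class QueryPoolSketch {
    /** Submits each query to a pool of the given size and returns the completion order. */
    static List<String> runAll(List<String> queries, int poolSize) {
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (String q : queries) {
            // Stand-in for running a query; a real task would execute it against Hive.
            pool.execute(() -> finished.add(q));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("queries did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(finished);
    }

    public static void main(String[] args) {
        // With one worker thread, tasks run one at a time in submission order.
        System.out.println(runAll(Arrays.asList("q1", "q2", "q3"), 1));
    }
}
```

With a pool size of 1 the single worker takes tasks from a FIFO queue, so queries run strictly one after another. A larger pool lets several run at once, and the completion order is then no longer guaranteed.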
Re: Hive HWI ... request for your experience to be used Production
Before a query runs, the compile method in the class org.apache.hadoop.hive.ql.Driver is called. I think we can take some useful code out of this method.

Qiang

2013/1/7 Edward Capriolo <edlinuxg...@gmail.com>

I think a simple way would be to try tacking an EXPLAIN in front of it. Explains do not run MapReduce jobs, so they should be synchronous.
Re: HiveHistoryViewer concurrency problem
Does anybody have an idea about this?
https://issues.apache.org/jira/browse/HIVE-3857

2013/1/4 Qiang Wang <wsxy...@gmail.com>

new HiveHistoryViewer() throws a ConcurrentModificationException when called concurrently by several threads.

According to the stack trace, HiveHistory.parseLine uses private static Map<String, String> parseBuffer to store parsed data, and this caused the exception.

I don't know why a static buffer is used rather than a local one. Does anybody have an idea about this?
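To illustrate the local-buffer alternative raised above, here is a minimal sketch of a line parser that builds a fresh map on every call, so concurrent callers never touch shared mutable state. This is not the Hive implementation. The class name HistoryLineParser, the KEY="value" record layout, and the sample line are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HistoryLineParser {
    // Matches KEY="value" pairs; assumes values contain no double quotes.
    // A compiled Pattern is immutable, so sharing it between threads is safe.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Parses one line into a new map owned by the caller. */
    static Map<String, String> parseLine(String line) {
        Map<String, String> buffer = new LinkedHashMap<>(); // local, not static
        Matcher m = PAIR.matcher(line); // a Matcher is per call, never shared
        while (m.find()) {
            buffer.put(m.group(1), m.group(2));
        }
        return buffer;
    }

    public static void main(String[] args) {
        // Hypothetical record; the real history file layout may differ.
        String line = "QueryStart QUERY_ID=\"q1\" TIME=\"1357300000000\"";
        System.out.println(parseLine(line));
    }
}
```

The cost is one small map allocation per line. In exchange, the method needs no locking and can be called from any number of threads.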
Re: HiveHistoryViewer concurrency problem
Hi Jie:

As far as I know, the Hive history log is structured, and the class HiveHistory is used to write and read it.

HiveHistoryViewer serves as a listener that receives and stores the parsed log data. It has two members:

HashMap<String, QueryInfo> jobInfoMap, which stores the QueryInfo related to each Hive query, and
HashMap<String, TaskInfo> taskInfoMap, which stores the TaskInfo related to each Hadoop map/reduce job.

You can dump the two maps and find what you want.

Hope this information helps.

Qiang

2013/1/5 Jie Li <ji...@cs.duke.edu>

Hi Qiang,

Could you describe how HiveHistoryViewer is used? I'm also looking for a tool to understand the Hive log.

Thanks,
Jie
Re: HiveHistoryViewer concurrency problem
Maybe it's not. But this exception happens when I create a HiveHistoryViewer instance. In that case only reading and parsing a file is involved, and the instance is not intended to be shared between threads.

So the exception surprised me, and I wonder why a static buffer was used instead of a local buffer, which has no concurrency issue.

2013/1/5 Edward Capriolo <edlinuxg...@gmail.com>

It is likely an oversight. The majority of Hive code was not written to be multi-threaded.
Re: Hive HWI ... request for your experience to be used Production
Hi Manish:

Glad to receive your email, because we are putting effort into HWI. We have improved the original, added some features, and put it on GitHub:
https://github.com/anjuke/hwi

It's far from mature and standard, but it's improving and has already been deployed for use in our company. Give it a try, and give us some advice if you're interested in it.

Thanks,
Qiang

2013/1/5 Manish Malhotra <manish.hadoop.w...@gmail.com>

Hi All,

We are exploring HWI for use in a PROD environment for ad hoc queries etc. I want to check whether somebody in the Hive community can share their experience of using the HWI in prod or any other environment, in terms of its stability and performance. We are also evaluating enhancing it to make it more useful with different features.

Thanks for your time and help!!

Regards,
Manish
Re: HiveHistory and HiveHistoryViewer
Does anybody have an idea about this?
https://issues.apache.org/jira/browse/HIVE-3810

2012/12/16 Qiang Wang <wsxy...@gmail.com>

Glad to receive your reply! Here is my point:

Firstly, I think HiveHistoryViewer is inconsistent with HiveHistory.

Secondly, the Hive server may be deployed on Linux, but the client can be anywhere. HQL from the client will be logged into the history file, and the HQL may contain '\r'.
Re: HiveHistory and HiveHistoryViewer
HiveHistory.parseHiveHistory uses BufferedReader.readLine, which treats '\n', '\r', and '\r\n' as line delimiters, to parse the history file. And clients may be on a Mac, which takes '\r' as the line delimiter.

So I think '\r' should also be replaced with a space in HiveHistory.log, so that HiveHistory.parseHiveHistory is consistent with HiveHistory.log and clients on a Mac are supported.

Thanks!

2012/12/18 Mark Grover <grover.markgro...@gmail.com>

Looks like a bug to me. This is the original JIRA that introduced this change:
https://issues.apache.org/jira/browse/HIVE-176

I don't think that, back in the day, we really cared about clients being on Windows.

In any case, thanks for filing the JIRA. I have uploaded a patch which I think doesn't break anything for Linux clients and fixes things up for Windows clients. Take a look; feedback is welcome. The intent is the same as your suggestions, but the approach is a little more conservative. If you feel strongly that it should be done according to one of your suggestions, let me know and I will take another look.

Thanks!
Mark
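The mismatch described above can be reproduced with plain JDK classes. The sketch below is not the Hive code or the uploaded patch. The class name LineBreakSketch and the sample query are hypothetical. It shows that BufferedReader.readLine splits a value containing a bare '\r' into two lines, and that replacing every terminator with a space before writing keeps the record on one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineBreakSketch {
    /** Replaces each line terminator that readLine recognizes with a single space. */
    static String sanitize(String val) {
        // Handle "\r\n" first so a Windows line break becomes one space, not two.
        return val.replace("\r\n", " ").replace('\r', ' ').replace('\n', ' ');
    }

    /** Counts how many lines BufferedReader.readLine sees in the text. */
    static int countLines(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            int n = 0;
            while (reader.readLine() != null) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String hql = "select *\rfrom t"; // query text containing a bare carriage return
        System.out.println(countLines(hql));           // the reader sees two lines
        System.out.println(countLines(sanitize(hql))); // after sanitizing, one line
    }
}
```

Replacing only '\n' on the write side, as discussed in this thread, leaves bare '\r' characters in the file, and the reader then treats each one as a record boundary.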
Re: HiveHistory and HiveHistoryViewer
Glad to receive your reply! Here is my point:

Firstly, I think HiveHistoryViewer is inconsistent with HiveHistory.

Secondly, the Hive server may be deployed on Linux, but the client can be anywhere. HQL from the client will be logged into the history file, and the HQL may contain '\r'.

2012/12/16 afancy <grou...@gmail.com>

I don't think it is a bug. If the program in Hive writes logs to HiveHistory.log using '\n' to indicate the end of a line, then it is OK to use val = val.replace('\n', ' ');.

Anyway, the newline depends on your OS:
https://ccrma.stanford.edu/~craig/utility/flip/

Hive is typically deployed on Linux.

DOS/Windows: \r\n, 0D0A (hex), 13,10 (decimal)
Unix and Mac OS X: \n, 0A (hex), 10 (decimal)
Macintosh (OS 9): \r, 0D (hex), 13 (decimal)

On Sun, Dec 16, 2012 at 11:23 AM, Qiang Wang <wsxy...@gmail.com> wrote:

'\n', '\r',