Re: Run hive queries, and collect job information

2013-01-30 Thread Qiang Wang
Every Hive query has a history file, and you can get this information from
that file.

The following Java code can serve as an example:
https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java
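
As a rough illustration of the idea (this is not the code from the linked
QueryUtil.java), the sketch below pulls MapReduce job IDs out of text with a
regular expression. It assumes the IDs have the MRv1 form
job_<digits>_<digits> and that they appear somewhere in the text being
scanned, for example in the contents of a history file. The class name
JobIdExtractor and the sample text are made up for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: collects MRv1-style job IDs from text.
class JobIdExtractor {
    // Assumption: job IDs look like job_<digits>_<digits>.
    private static final Pattern JOB_ID = Pattern.compile("job_\\d+_\\d+");

    // Returns the distinct job IDs in order of first appearance.
    static Set<String> extract(String text) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = JOB_ID.matcher(text);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample text; a real caller would pass the history file contents.
        String sample = "stage-1 ran as job_201301300001_0042; "
                + "stage-2 ran as job_201301300001_0043";
        System.out.println(extract(sample));
    }
}
```

A query with several stages yields several IDs, which is why the helper
returns a set rather than a single string.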

Regards,
Qiang


2013/1/30 Mathieu Despriee mdespr...@octo.com

 Hi folks,

 I would like to run a list of generated HIVE queries. For each, I would
 like to retrieve the MR job_id (or ids, in case of multiple stages). And
 then, with this job_id, collect statistics from job tracker (cumulative
 CPU, read bytes...)

 How can I send HIVE queries from a bash or python script, and retrieve the
 job_id(s) ?

 For the 2nd part (collecting stats for the job), we're using an MRv1 Hadoop
 cluster, so I don't have the AppMaster REST API
 (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html).
 I'm about to collect data from the JobTracker web UI. Any better ideas?

 Mathieu





classloader in org.apache.hadoop.hive.metastore.ObjectStore

2013-01-20 Thread Qiang Wang
Class ObjectStore has a private member named 'classLoader':

    private ClassLoader classLoader;
    {
        classLoader = Thread.currentThread().getContextClassLoader();
        if (classLoader == null) {
            classLoader = QueryStore.class.getClassLoader();
        }
    }

But I can't find where it is used.

Does anyone have an idea about this?
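
For readers unfamiliar with the construct: the bare braces after the field
form an instance initializer block, which Java runs on every construction,
before the constructor body. A minimal standalone sketch of the same pattern
follows. The class name LoaderHolder is hypothetical, and the fallback uses
the holder's own class, so this only illustrates the idiom and says nothing
about where ObjectStore reads the field.

```java
// Hypothetical class illustrating the instance-initializer idiom.
class LoaderHolder {
    private ClassLoader classLoader;

    // Instance initializer: runs for every new LoaderHolder,
    // before the constructor body.
    {
        classLoader = Thread.currentThread().getContextClassLoader();
        if (classLoader == null) {
            // Fall back to the loader that loaded this class.
            classLoader = LoaderHolder.class.getClassLoader();
        }
    }

    ClassLoader getClassLoader() {
        return classLoader;
    }

    public static void main(String[] args) {
        System.out.println(new LoaderHolder().getClassLoader());
    }
}
```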

Regards,
Qiang


Re: Best practice for automating jobs

2013-01-10 Thread Qiang Wang
I believe the HWI (Hive Web Interface) can give you a hand.

https://github.com/anjuke/hwi

You can use the HWI to submit and run queries concurrently.
Partition management can be achieved by creating crontabs using the HWI.

It's simple and easy to use. Hope it helps.

Regards,
Qiang


2013/1/11 Tom Brown tombrow...@gmail.com

 All,

 I want to automate jobs against Hive (using an external table with
 ever growing partitions), and I'm running into a few challenges:

 Concurrency - If I run Hive as a thrift server, I can only safely run
 one job at a time. As such, it seems like my best bet will be to run
 it from the command line and set up a brand new instance for each job.
 That's quite a bit of a hassle to solve a seemingly common problem, so
 I want to know if there are any accepted patterns or best practices
 for this?

 Partition management - New partitions will be added regularly. If I
 have to set up multiple instances of Hive for each (potentially)
 overlapping job, it will be difficult to keep track of the partitions
 that have been added. In the context of the preceding question, what
 is the best way to add metadata about new partitions?

 Thanks in advance!

 --Tom



Re: Best practice for automating jobs

2013-01-10 Thread Qiang Wang
HWI creates a CLI session for each query through the Hive libraries, so
several queries can run concurrently.
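
The idea of one session per query can be sketched with a plain thread pool.
This only illustrates the concurrency pattern and is not HWI's code:
QuerySession is a hypothetical stand-in for a Hive CLI session, and its run()
method just returns a string instead of executing anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a per-query CLI session; it shares no state.
class QuerySession {
    private final String query;

    QuerySession(String query) {
        this.query = query;
    }

    String run() {
        // A real session would hand the query to Hive; here we only echo it.
        return "done: " + query;
    }
}

class ConcurrentQueries {
    // Runs every query in its own session on a pool of the given size
    // and returns the results in submission order.
    static List<String> runAll(List<String> queries, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String q : queries) {
                Callable<String> task = () -> new QuerySession(q).run();
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that query has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("query failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("select 1", "select 2"), 2));
    }
}
```

Because each task builds its own session object, the tasks have nothing to
contend over except whatever the sessions themselves share underneath.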


2013/1/11 Tom Brown tombrow...@gmail.com

 How is concurrency achieved with this solution?







Re: Best practice for automating jobs

2013-01-10 Thread Qiang Wang
Are you using the embedded metastore?
Only one process can connect to it at a time.


2013/1/11 Tom Brown tombrow...@gmail.com

 When I've tried to create concurrent CLI sessions, I thought the 2nd
 one got an error about not being able to lock the metadata store.

 Is that error a real thing, or have I been mistaken this whole time?

 --Tom








Re: Hive HWI ... request for your experience to be used Production

2013-01-06 Thread Qiang Wang
I know little about Oozie, but we will study it and review the issue.
Thanks for your advice!

As for query compilation before scheduling, we plan to implement it later.

As we want to contribute this HWI back to the Hive community, the Apache
License will probably be used. So feel free to help implement these useful
features.

Regards,
Qiang


2013/1/7 Manish Malhotra manish.hadoop.w...@gmail.com

 https://github.com/anjuke/hwi/issues/2

 Added an issue for integrating HWI with Oozie.
 It's not an issue but a feature request. Please review it and see if it
 makes sense, at least in the long term.

 Plus, I want to check one more use case:

 1. Query compilation: when a user writes a query in the UI, are compilation
 problems reported synchronously, instead of scheduling a query that is not
 correct?

 Regards,
 Manish


 On Sun, Jan 6, 2013 at 10:45 AM, Manish Malhotra 
 manish.hadoop.w...@gmail.com wrote:

 Thanks, Edward, for explaining.
 I'm also very much interested in building a robust tool for bringing Hive
 further into the enterprise world, where any data analyst or ETL developer
 can use it.

 Regards,
 Manish


 On Sun, Jan 6, 2013 at 9:10 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 The Hive code is Apache licensed. If you want to add your work to Hive,
 simply open a JIRA at http://issues.apache.org/jira/hive and produce a
 patch that applies to Hive trunk. That will start the process.




 On Saturday, January 5, 2013, Qiang Wang wsxy...@gmail.com wrote:
  Hi Manish:
  Glad to talk with you.
  1. We are willing and trying to make it an open source project,
  although we're not familiar with how to do so. We will study the licenses
  and choose a proper one.
  2. If Apache Hive agrees to add this to the codebase, we
  will definitely do this. I think this HWI is not good enough yet.
  3. A thread pool is used to run queries, so synchronous mode can be
  achieved by setting its size to 1.
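
The effect of a pool of size 1 can be checked with a small standalone sketch.
It is not HWI's code: the class name SerialPoolDemo is hypothetical and a
short sleep stands in for query execution. The method reports the highest
number of tasks seen running at the same moment.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SerialPoolDemo {
    // Submits `tasks` short tasks to a pool of `poolSize` threads and
    // returns the highest number of tasks observed running at once.
    static int maxConcurrent(int poolSize, int tasks) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20); // stand-in for query execution time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // With a single worker thread the tasks never overlap.
        System.out.println(maxConcurrent(1, 5));
    }
}
```

With one worker thread, queued tasks run one after another in submission
order, which is what makes the pool behave synchronously.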
  Regards,
  Qiang
 
  2013/1/6 Manish Malhotra manish.hadoop.w...@gmail.com
 
  Thanks Qiang,

  Glad that somebody is already working on improvements. Sure, let me try
  it out.
  Quick questions:

  1. What is the license of this HWI version? Can somebody else
  contribute to it as true open source software, use it, and make it
  available to the community as well?
  2. Are there any plans to merge this into, or add it to, the Apache Hive
  codebase?
  3. Does it have a synchronous mode of running queries, or only the
  scheduling / async way?
  Thanks for your reply and time,
  Regards,
  Manish
 






Re: Hive HWI ... request for your experience to be used Production

2013-01-06 Thread Qiang Wang
Before a query runs, the compile method of class
org.apache.hadoop.hive.ql.Driver is called. I think we can reuse some of the
code in this method.

Qiang


2013/1/7 Edward Capriolo edlinuxg...@gmail.com

 I think a simple way would be to try tacking an EXPLAIN in front of it.
 EXPLAIN statements do not run MapReduce jobs, so they should be synchronous.
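
A tiny sketch of that suggestion, under the assumption that the query arrives
as a single statement in a string. The class and method names are made up,
and how the resulting statement is submitted to Hive is deliberately left
out.

```java
// Hypothetical helper: turns a query into an EXPLAIN statement
// that can be run first as a cheap syntax and semantics check.
class ExplainCheck {
    // Builds the EXPLAIN statement, dropping any trailing semicolon.
    static String toExplain(String query) {
        String trimmed = query.trim();
        if (trimmed.endsWith(";")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        return "EXPLAIN " + trimmed;
    }

    public static void main(String[] args) {
        System.out.println(toExplain("select count(*) from t;"));
    }
}
```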







Re: HiveHistoryViewer concurrency problem

2013-01-04 Thread Qiang Wang
Does anybody have an idea about this?

https://issues.apache.org/jira/browse/HIVE-3857


2013/1/4 Qiang Wang wsxy...@gmail.com

 new HiveHistoryViewer() throws ConcurrentModificationException when called
 concurrently by several threads.

 According to the stack trace, HiveHistory.parseLine uses *private static
 Map<String, String> parseBuffer* to store parsed data, and this causes the
 exception.

 I don't know why a static buffer rather than a local buffer is used!
 Does anybody have an idea about this?
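
To make the contrast concrete, here is a sketch of a line parser that keeps
its buffer local to each call. It is not the Hive implementation: the class
name LineParser and the assumed line format (space-separated KEY="value"
pairs) are chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineParser {
    // Assumed format for this sketch: space-separated KEY="value" pairs.
    // A compiled Pattern is immutable, so sharing it between threads is safe.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Each call fills its own map, so concurrent callers never touch
    // the same mutable state.
    static Map<String, String> parseLine(String line) {
        Map<String, String> buffer = new HashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            buffer.put(m.group(1), m.group(2));
        }
        return buffer;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("QUERY_ID=\"q1\" TIME=\"123\""));
    }
}
```

A shared static HashMap, by contrast, can be cleared or modified by one
thread while another is still iterating over it, which is the usual source
of a ConcurrentModificationException.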



Re: HiveHistoryViewer concurrency problem

2013-01-04 Thread Qiang Wang
Hi Jie:

As far as I know, the Hive history log is structured, and class *HiveHistory*
is used to write and read it.

*HiveHistoryViewer* serves as a listener that stores the parsed log data. It
has two members:

HashMap<String, QueryInfo> *jobInfoMap*, which stores the QueryInfo for each
Hive query

and

HashMap<String, TaskInfo> *taskInfoMap*, which stores the TaskInfo for each
Hadoop map/reduce job

You can dump the two maps and find what you want.

Hope this helps.

Qiang


2013/1/5 Jie Li ji...@cs.duke.edu

 Hi Qiang,

 Could you describe how HiveHistoryViewer is used? I'm also looking for
 a tool to understand the Hive log.

 Thanks,
 Jie




Re: HiveHistoryViewer concurrency problem

2013-01-04 Thread Qiang Wang
Maybe it's not.

But this exception happens when I create a *HiveHistoryViewer* instance,
in which case only reading and parsing a file is involved, and the instance
is not intended to be shared between threads.

So the exception surprised me, and I wonder why a static buffer was used
instead of a local buffer, which has no concurrency issues.


2013/1/5 Edward Capriolo edlinuxg...@gmail.com

 It is likely an oversight. The majority of Hive code was not written to be
 multi-threaded.








Re: Hive HWI ... request for your experience to be used Production

2013-01-04 Thread Qiang Wang
Hi Manish:

Glad to receive your email, because we are working on HWI.

We have improved the original, added some features, and put it on
GitHub:

https://github.com/anjuke/hwi

It's far from mature and standard, but it's improving and has already
been deployed for use in our company.

Anyway, give it a try and share some advice if you're interested in it.

Thanks

Qiang


2013/1/5 Manish Malhotra manish.hadoop.w...@gmail.com


 Hi All,

 We are exploring using HWI in a PROD environment for ad hoc queries etc.
 I want to check with the Hive community whether somebody can share their
 experience using HWI in prod or any other environment, in terms of its
 stability and performance.
 We are also evaluating enhancements to make it more useful with different
 features.

 Thanks for your time and help !!

 Regards,
 Manish






Re: HiveHistory and HiveHistoryViewer

2012-12-17 Thread Qiang Wang
Does anybody have an idea about this?

https://issues.apache.org/jira/browse/HIVE-3810











Re: HiveHistory and HiveHistoryViewer

2012-12-17 Thread Qiang Wang
HiveHistory.parseHiveHistory uses BufferedReader.readLine, which treats '\n',
'\r', and '\r\n' as line delimiters, to parse the history file.

Clients may also be on a Mac, which takes '\r' as its line delimiter.

So I think '\r' should also be replaced with a space in HiveHistory.log, so
that HiveHistory.parseHiveHistory is consistent with HiveHistory.log
and clients on a Mac are supported.
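
The mismatch can be reproduced with the JDK alone. In the sketch below the
class and method names are hypothetical and the logged line is a made-up
example: a value containing '\r' is read back as two lines unless both kinds
of line break are replaced before writing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineDelimiterDemo {
    // Splits text the way BufferedReader.readLine does:
    // on "\n", "\r" or "\r\n".
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Replaces both kinds of line break with spaces so a value stays on one line.
    static String flatten(String value) {
        return value.replace('\n', ' ').replace('\r', ' ');
    }

    public static void main(String[] args) {
        String hql = "select a\rfrom t";
        // Unflattened: the reader sees two lines.
        System.out.println(readLines("QUERY=\"" + hql + "\"").size());
        // Flattened: the reader sees one line.
        System.out.println(readLines("QUERY=\"" + flatten(hql) + "\"").size());
    }
}
```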

Thanks!


2012/12/18 Mark Grover grover.markgro...@gmail.com

 Looks like a bug to me. This is the original JIRA that introduced this
 change:
 https://issues.apache.org/jira/browse/HIVE-176

 I don't think that, back in the day, we really cared about clients being
 on Windows.

 In any case, thanks for filing the JIRA. I have uploaded a patch which
 I think doesn't break anything for Linux clients and fixes things up
 same as your suggestions but the approach is a little more
 conservative. If you feel strongly that it should be done according to
 one of your suggestions, let me know, I will take another look.

 Thanks!
 Mark




Re: HiveHistory and HiveHistoryViewer

2012-12-16 Thread Qiang Wang
Glad to receive your reply!

Here is my point:
Firstly, I think HiveHistoryViewer is inconsistent with HiveHistory.
Secondly, the Hive server may be deployed on Linux, but the client can be
anywhere. HQL from the client is logged into the history file, and that HQL
may contain '\r'.


2012/12/16 afancy grou...@gmail.com

 I don't think it is a bug. If the program in Hive writes logs to
 HiveHistory.log using '\n' to indicate the end of a line, then it is OK
 to use val = val.replace('\n', ' ');. Anyway, which newline is used depends
 on your OS (https://ccrma.stanford.edu/~craig/utility/flip/), and Hive
 is typically deployed on Linux.

 DOS & Windows: \r\n, 0D0A (hex), 13,10 (decimal)
 Unix & Mac OS X: \n, 0A, 10
 Macintosh (OS 9): \r, 0D, 13

 On Sun, Dec 16, 2012 at 11:23 AM, Qiang Wang wsxy...@gmail.com wrote:

 '\n', '\r',