Re: Is there a way to resolve Fair Scheduler Problem

2016-04-25 Thread Khaja Hussain
Hi

If you need other applications in a queue to start, you can use pre-emption.
The link below has the details. What pre-emption does is guarantee capacity
to a queue, in which case your job will start. If there is no load on a
given queue, those resources will be used by other queues. Hope this helps.

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/preemption.html
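For what it's worth, once pre-emption is enabled and each queue has a guaranteed
share, you can also route a Hive job to a specific queue from the session. A rough
sketch (the queue name "etl" is only an example):

set mapreduce.job.queuename=etl;   -- when Hive runs on the MapReduce engine
set tez.queue.name=etl;            -- when Hive runs on the Tez engine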

Regards
Khaja Hussain

On Mon, Apr 25, 2016 at 7:16 PM, mahender bigdata <
mahender.bigd...@outlook.com> wrote:

> Hi Team,
>
> Is there a way to resolve the Fair Scheduler problem? Currently I see that if an
> application requires more resources, it fully consumes the available resources,
> leaving other submitted applications in *pending* or *accepted* state. Do
> I need to modify
>  set yarn.scheduler.maximum-allocation-mb=5120;
>  set yarn.scheduler.minimum-allocation-mb=512;
>
> or
>
> does the YARN scheduler queue capacity need to be modified?
>
> Thanks.
>
>


Is there a way to resolve Fair Scheduler Problem

2016-04-25 Thread mahender bigdata

Hi Team,

Is there a way to resolve the Fair Scheduler problem? Currently I see that
if an application requires more resources, it fully consumes the available
resources, leaving other submitted applications in *pending* or *accepted*
state. Do I need to modify

set yarn.scheduler.maximum-allocation-mb=5120;
set yarn.scheduler.minimum-allocation-mb=512;

or

does the YARN scheduler queue capacity need to be modified?

Thanks.



Re: Hive footprint

2016-04-25 Thread Mich Talebzadeh
Hi Naveen,





Thank you for your detailed explanation.



Please allow me to explain my points if I may



I think a viable solution for the big data stack will encompass (again, this is
my view) Spark with Hive, HDFS and YARN as the winning combination. Hadoop
encompasses HDFS, and it is almost impossible to sidestep it without
finding a viable alternative for persistent storage. YARN is the resource
backbone, Spark is a great query tool (including Spark Streaming), and Hive is the
real data warehouse in the big data space, providing the metadata for all
the tools.



You will forgive me for setting aside Impala, as I don't hear much about it
anymore (please feel free to disagree). So my prime interest is to
see Hive improved as it should be, i.e. as a proper data warehouse with a
proper indexing strategy. I don't really subscribe to the ORC storage index, as
in my experience it has not delivered the contribution to the Hive CBO
that I expected. My personal experience has been that it provides some
improvement over what is already available (stats-wise), but otherwise,
unless you bucket your table (i.e. have an effective numeric column with
high cardinality that can be used to hash-partition the table), one
cannot make effective use of the storage index.
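To make that point concrete, a rough sketch of the kind of layout I have in mind
(table and column names are only illustrative):

CREATE TABLE sales_staging_orc (
  trans_id BIGINT,
  amount   DECIMAL(10,2),
  ts       TIMESTAMP
)
CLUSTERED BY (trans_id) SORTED BY (trans_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.create.index'='true', 'orc.bloom.filter.columns'='trans_id');

-- With the data clustered and sorted on a high-cardinality key, the per-stripe
-- min/max statistics (the storage index) can skip stripes for predicates such as
-- WHERE trans_id = 123456789.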



Now back to Hive and its external indexes. Currently the infrastructure is
there but not the functionality. I don't know what it takes to make indexes
in Hive usable by the CBO. We should aim to consolidate the Hadoop
ecosystem by investing in the existing tools rather than trying to fragment
it further. There seems to be little effort in this area, for reasons of which I
may not be aware. However, I am more than happy to contribute to this cause.
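For illustration, the DDL side of that infrastructure is already there (the names
below are only examples); what is missing is the CBO making use of the result
automatically:

CREATE INDEX sales_trans_idx ON TABLE sales_staging_orc (trans_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX sales_trans_idx ON sales_staging_orc REBUILD;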



Kind regards,



Mich





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 25 April 2016 at 19:28, Naveen Gangam  wrote:

> Hi Mich,
> I am a developer at Cloudera and contribute to Apache Hive.
>
> Hive and MPP query engine projects like Impala have settled into their
> respective positions so there is less confusion between these projects.
>
> For example, across Cloudera's customer base the majority of customers use
> Impala to enable them to perform BI and SQL analytics directly on Hadoop.
> Most Impala users are using Hive for the data preparation of the data sets
> they're serving up via Impala. As such Impala typically competes with
> traditional analytic databases where customers decide between:
> * Using Hadoop and Hive for data processing that feeds into another
> database or BI layer for the analytics
> * Unified architecture where they directly serve some sets of BI and
> analytics from Hadoop via Impala while typically using Hive, Spark,
> MapReduce, etc for their data preparation
> You can see nearly all Hadoop distributions provide users with Hive for
> core data processing plus an MPP query engine for BI and SQL analytics like
> Impala, Drill, BigSQL, etc. Even Facebook who created and still heavily
> uses Hive, also uses Presto internally as their MPP query engine for BI.
>
> For more details you can see Cloudera's SQL-on-Hadoop webinar that talks
> about when to use Hive, Impala, and Spark (SQL)
> 
>
>
> Support for local variables and stored procedures in Hive is included in
> HPL/SQL module of Hive 2.0. However, this is an experimental feature. We
> will evaluate it for production-readiness before including it in CDH Hive.
>
> Finally, HBase is typically not the best storage manager for migrations
> from commercial DWs to Big Data. Most commercial DW migrations use HDFS
> rather than HBase as the storage manager.
>
> Hope this helps.
>
> Thank you
> Naveen
>
> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I notice that Impala is rarely mentioned these days.  I may be missing
>> something. However, I gather it is coming to an end now, as I don't recall many
>> use cases for it (or customers asking for it). In contrast, Hive has held
>> its ground with the new additions of Spark and Tez as execution engines,
>> support for ACID and ORC, and new features in Hive 2. In addition, provided a
>> good choice of metastore, it scales well.
>>
>> If Hive had organic support for local variables and stored
>> procedures, it would be a top-notch data warehouse. Given its
>> metastore, I don't see any technical reason why it cannot support these
>> constructs.
>>
>> I was recently asked to comment on migration from commercial DWs to Big
>> Data (primarily for TCO reasons) and really could not recall any better
>> candidate than Hive. Is HBase a viable alternative? Obviously whatever one
>> decides there is still HDFS, a good engine for Hive (so

Re: [VOTE] Bylaws change to allow some commits without review

2016-04-25 Thread Lars Francke
Thanks for the further votes. If I'm not mistaken, three more are still
needed for a successful vote.

@Carl, thanks for your vote. I'd be happy to hear any concerns you might
have.

@Ashutosh: Sounds like a very sensible idea. I've never actually gotten
around to using Travis CI, so I hope there'll be some documentation :)


On Sun, Apr 24, 2016 at 9:19 PM, Ashutosh Chauhan 
wrote:

> +1
>
> PS: Some of us are trying to enable travis-ci for Hive over on
> https://issues.apache.org/jira/browse/HIVE-10293 Since travis-ci limits
> build time to ~30 mins, we can't run our full test suite on it. But, it
> might make sense to use that infra for commits discussed here. In that
> scenario a sanity build of compile on travis-ci will give confidence that
> 'minor' commits like these are not resulting in compilation failure.
>
> On Sat, Apr 23, 2016 at 8:41 PM, Carl Steinbach  wrote:
>
>> -0
>> On Apr 22, 2016 4:04 PM, "Chao Sun"  wrote:
>>
>>> +1
>>>
>>> On Fri, Apr 22, 2016 at 3:45 PM, Edward Capriolo 
>>> wrote:
>>>
 +1


 On Friday, April 22, 2016, Lars Francke  wrote:

> Yet another update. I went through the PMC list.
>
> These seven have not been active (still the same list as Vikram posted
> during the last vote):
> Ashish Thusoo
> Kevin Wilfong
> He Yongqiang
> Namit Jain
> Joydeep Sensarma
> Ning Zhang
> Raghotham Murthy
>
> There are 29 PMCs in total - 7 = 22 active * 2/3 = 15 votes required
>
> So far the following PMCs have voted:
>
> Alan Gates
> Jason Dere
> Sushanth Sowmyan
> Lefty Leverenz
> Navis Ryu
> Owen O'Malley
> Prasanth J
> Sergey Shelukhin
> Thejas Nair
>
> = 9 +1s
>
> So I'm hoping for six more. I've contacted a bunch of PMCs (sorry for
> the spam!) and hope to get a few more.
>
> In addition there have been six non-binding +1s. Thank you everyone
> for voting.
>
>
>
>
>
> On Fri, Apr 22, 2016 at 10:42 PM, Lars Francke  > wrote:
>
>> Hi everyone, thanks for the votes. I've been held up by personal
>> stuff this week but as there have been no -1s or other objections I'd 
>> like
>> to keep this vote open a bit longer until I've had time to go through the
>> PMCs and contact those that have not yet voted.
>>
>> On Thu, Apr 21, 2016 at 9:12 PM, Denise Rogers 
>> wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Denise
>>> Cell - (860)989-3431
>>>
>>> Sent from mi iPhone
>>>
>>> On Apr 21, 2016, at 2:56 PM, Sergey Shelukhin <
>>> ser...@hortonworks.com> wrote:
>>>
>>> +1
>>>
>>> From: Tim Robertson 
>>> Reply-To: "user@hive.apache.org" 
>>> Date: Wednesday, April 20, 2016 at 06:17
>>> To: "user@hive.apache.org" 
>>> Subject: Re: [VOTE] Bylaws change to allow some commits without
>>> review
>>>
>>> +1
>>>
>>> On Wed, Apr 20, 2016 at 1:24 AM, Jimmy Xiang 
>>> wrote:
>>>
 +1

 On Tue, Apr 19, 2016 at 2:58 PM, Alpesh Patel <
 alpeshrpa...@gmail.com> wrote:
 > +1
 >
 > On Tue, Apr 19, 2016 at 1:29 PM, Lars Francke <
 lars.fran...@gmail.com>
 > wrote:
 >>
 >> Thanks everyone! Vote runs for at least one more day. I'd
 appreciate it if
 >> you could ping/bump your colleagues to chime in here.
 >>
 >> I'm not entirely sure how many PMC members are active and how
 many votes
 >> we need but I think a few more are probably needed.
 >>
 >> On Mon, Apr 18, 2016 at 8:02 PM, Thejas Nair <
 the...@hortonworks.com>
 >> wrote:
 >>>
 >>> +1
 >>>
 >>> 
 >>> From: Wei Zheng 
 >>> Sent: Monday, April 18, 2016 10:51 AM
 >>> To: user@hive.apache.org
 >>> Subject: Re: [VOTE] Bylaws change to allow some commits without
 review
 >>>
 >>> +1
 >>>
 >>> Thanks,
 >>> Wei
 >>>
 >>> From: Siddharth Seth 
 >>> Reply-To: "user@hive.apache.org" 
 >>> Date: Monday, April 18, 2016 at 10:29
 >>> To: "user@hive.apache.org" 
 >>> Subject: Re: [VOTE] Bylaws change to allow some commits without
 review
 >>>
 >>> +1
 >>>
 >>> On Wed, Apr 13, 2016 at 3:58 PM, Lars Francke <
 lars.fran...@gmail.com>
 >>> wrote:
 
  Hi everyone,
 
  we had a discussion on the dev@ list about allowing some
 forms of
  contributions to be committed without a review.
 
  The exact sentence I propose to add is: "Minor issues (e.g.
 typos, code
  style issues, JavaDoc changes. At committer's discretion) can
 be committed
  aft

Remove unnecessary joins from view at runtime.

2016-04-25 Thread Grant Overby (groverby)
Suppose I have a view that joins 3 tables together. If I execute a query
against this view that is answerable by joining only 2 of these 3 tables
together, can Hive perform this optimization automatically?

Example:

For the select and view below, I'd like Hive to avoid the join on iede_xff6.

SELECT access_control_policy_name, xff_v4 FROM intrusion_view;


CREATE VIEW intrusion_view AS SELECT
ie.access_control_policy_name,
iede_xff4.data as xff_v4,
iede_xff6.data as xff_v6
FROM intrusion_events as ie
LEFT OUTER JOIN intrusion_extra_data_events iede_xff4 ON 1=1
AND ie.dt = iede_xff4.dt
AND ie.event_id = iede_xff4.event_id
AND ie.event_second = iede_xff4.event_second
AND ie.sensor_id = iede_xff4.sensor_id
AND iede_xff4.type = 1
LEFT OUTER JOIN intrusion_extra_data_events iede_xff6 ON 1=1
AND ie.dt = iede_xff6.dt
AND ie.event_id = iede_xff6.event_id
AND ie.event_second = iede_xff6.event_second
AND ie.sensor_id = iede_xff6.sensor_id
AND iede_xff6.type = 2
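In case it helps frame the question, the workaround I have in mind (just a sketch,
assuming I maintain a second, narrower view by hand) is a v4-only view, plus EXPLAIN
to check whether the extra join shows up in the plan:

CREATE VIEW intrusion_view_v4 AS SELECT
ie.access_control_policy_name,
iede_xff4.data as xff_v4
FROM intrusion_events as ie
LEFT OUTER JOIN intrusion_extra_data_events iede_xff4 ON 1=1
AND ie.dt = iede_xff4.dt
AND ie.event_id = iede_xff4.event_id
AND ie.event_second = iede_xff4.event_second
AND ie.sensor_id = iede_xff4.sensor_id
AND iede_xff4.type = 1;

EXPLAIN SELECT access_control_policy_name, xff_v4 FROM intrusion_view;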




Grant Overby
Software Engineer
Cisco.com
grove...@cisco.com
Mobile: 865 724 4910






Re: Hive footprint

2016-04-25 Thread Naveen Gangam
Hi Mich,
I am a developer at Cloudera and contribute to Apache Hive.

Hive and MPP query engine projects like Impala have settled into their
respective positions so there is less confusion between these projects.

For example, across Cloudera's customer base the majority of customers use
Impala to enable them to perform BI and SQL analytics directly on Hadoop.
Most Impala users are using Hive for the data preparation of the data sets
they're serving up via Impala. As such Impala typically competes with
traditional analytic databases where customers decide between:
* Using Hadoop and Hive for data processing that feeds into another
database or BI layer for the analytics
* Unified architecture where they directly serve some sets of BI and
analytics from Hadoop via Impala while typically using Hive, Spark,
MapReduce, etc for their data preparation
You can see nearly all Hadoop distributions provide users with Hive for
core data processing plus an MPP query engine for BI and SQL analytics like
Impala, Drill, BigSQL, etc. Even Facebook, which created and still heavily
uses Hive, also uses Presto internally as its MPP query engine for BI.

For more details you can see Cloudera's SQL-on-Hadoop webinar that talks
about when to use Hive, Impala, and Spark (SQL)



Support for local variables and stored procedures in Hive is included in
HPL/SQL module of Hive 2.0. However, this is an experimental feature. We
will evaluate it for production-readiness before including it in CDH Hive.
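To give a flavour of it, a small HPL/SQL-style sketch (since the module is
experimental, treat the exact syntax as indicative only; the table name is an
example):

DECLARE cnt INT DEFAULT 0;
SELECT COUNT(*) INTO cnt FROM sales;
IF cnt > 0 THEN
  PRINT 'row count: ' || cnt;
END IF;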

Finally, HBase is typically not the best storage manager for migrations
from commercial DWs to Big Data. Most commercial DW migrations use HDFS
rather than HBase as the storage manager.

Hope this helps.

Thank you
Naveen

On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> I notice that Impala is rarely mentioned these days.  I may be missing
> something. However, I gather it is coming to an end now, as I don't recall many
> use cases for it (or customers asking for it). In contrast, Hive has held
> its ground with the new additions of Spark and Tez as execution engines,
> support for ACID and ORC, and new features in Hive 2. In addition, provided a
> good choice of metastore, it scales well.
>
> If Hive had organic support for local variables and stored
> procedures, it would be a top-notch data warehouse. Given its
> metastore, I don't see any technical reason why it cannot support these
> constructs.
>
> I was recently asked to comment on migration from commercial DWs to Big
> Data (primarily for TCO reasons) and really could not recall any better
> candidate than Hive. Is HBase a viable alternative? Obviously whatever one
> decides there is still HDFS, a good engine for Hive (sounds like many
> prefer TEZ although I am a Spark fan) and the ubiquitous YARN.
>
> Let me know your thoughts.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Using a different FileSystem for Staging

2016-04-25 Thread Blake Martin
Hi Hive Folks,

We're writing to S3-backed tables, but hoping to use HDFS for staging and
merging.  When we set hive.exec.stagingdir to an HDFS location, we get:

16/04/19 22:59:54 [main]: ERROR ql.Driver: FAILED: IllegalArgumentException
Wrong FS:
hdfs://:8020/tmp/hivestaging_hive_2016-04-19_22-59-52_470_66381161375278005-1,
expected: s3a://
java.lang.IllegalArgumentException: Wrong FS:
hdfs://ip-172-20-109-129.node.dc1.consul:8020/tmp/hivestaging_hive_2016-04-19_22-59-52_470_66381161375278005-1,
expected: s3a://dwh-dev
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:646)
at
org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:466)
at org.apache.hadoop.hive.ql.Context.getStagingDir(Context.java:230)
at
org.apache.hadoop.hive.ql.Context.getExtTmpPathRelTo(Context.java:426)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:6271)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:9007)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8898)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9743)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9636)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10109)
at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:329)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10120)
at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:211)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:456)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:316)
at
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1181)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1229)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1118)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1108)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:216)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:168)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:379)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:739)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:624)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
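For context, the configuration that triggers this looks roughly like the following
(a sketch; the paths and names are examples rather than our real ones):

set hive.exec.stagingdir=hdfs://namenode:8020/tmp/hive-staging;

CREATE EXTERNAL TABLE dwh.events (id BIGINT, payload STRING)
STORED AS ORC
LOCATION 's3a://dwh-dev/events';

INSERT OVERWRITE TABLE dwh.events
SELECT id, payload FROM staging_db.events;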

Is there an easy way to do this, without patching Hive or Hadoop?

Thanks!


java.lang.ArrayIndexOutOfBoundsException in getSplitHosts

2016-04-25 Thread Saumitra Shahapure
Hello,

I am using Hive 0.13.1 on EMR and trying to create a Hive table on top
of our custom file system (which is a thin wrapper on top of S3), and I am
getting an error while accessing the data in the table. Stack trace and
command history are below.

I suspected that CombineFileInputFormat was accessing the splits in an
incorrect way, but HiveInputFormat causes the same problem. Has
anyone seen such a problem before? Note that the SerDe and the FileSystem are both
custom. Could either of those be causing this problem?
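One isolation test I am considering (just a sketch; the location is an example) is
to point the same SerDe at plain S3 instead of our cda:// wrapper, to see whether
the SerDe alone is fine and the custom FileSystem's block/host reporting is the
real culprit:

Create external table nulf_s3
(
tm STRING
)
ROW FORMAT SERDE 'logprocessing.nulf.basic.BasicHiveSerDe'
location 's3://path/to/logs/';

select count(*) from nulf_s3;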


hive> add jar /home/hadoop/logprocessing-pig-combined.jar;
> Added /home/hadoop/logprocessing-pig-combined.jar to class path
> Added resource: /home/hadoop/logprocessing-pig-combined.jar
> hive> Create external table nulf
> > (
> > tm STRING
> > )
> > ROW FORMAT SERDE 'logprocessing.nulf.basic.BasicHiveSerDe'
> > location 'cda://path/to/logs/';
> OK
> Time taken: 6.706 seconds
> hive> set hive.input.format= org.apache.hadoop.hive.ql.io.HiveInputFormat;
> hive> select count(*) from nulf;
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> java.lang.ArrayIndexOutOfBoundsException: 1
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:529)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:290)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:371)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
> at
> org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:420)
> at
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:275)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:227)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:803)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:697)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:636)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Job Submission failed with exception
> 'java.lang.ArrayIndexOutOfBoundsException(1)'
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask


-- Saumitra S. Shahapure


Varying vcores/ram for hive queries running Tez engine

2016-04-25 Thread Nitin Kumar
I was trying to benchmark some Hive queries. I am using the Tez execution
engine. I varied the values of the following properties:

   1. hive.tez.container.size
   2. tez.task.resource.memory.mb
   3. tez.task.resource.cpu.vcores

Changes in the value of property 1 are reflected properly. However, it seems
that Hive does not respect changes in the value of property 3; it always
allocates one vcore per requested container (the RM is configured to use the
DominantResourceCalculator). This got me thinking about the precedence of
property values in Hive and Tez.
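For reference, the values were varied from the .hql session roughly like this (a
sketch; the numbers are only examples):

set hive.execution.engine=tez;
set hive.tez.container.size=2048;        -- property 1 (MB)
set tez.task.resource.memory.mb=1024;    -- property 2 (MB)
set tez.task.resource.cpu.vcores=2;      -- property 3
set tez.am.resource.cpu.vcores=2;        -- AM vcores; changes here are respected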

I have the following questions with respect to these configurations:

   1. Does Hive respect the set values for properties 2 and 3 at all?
   2. If I set property 1 to, say, 2048 MB and property 2 to, say, 1024 MB, does
      this mean that I am wasting about a GB of memory for each spawned container?
   3. Is there a property in Hive, similar to property 1, that allows me to use the
      'set' command in the .hql file to specify the number of vcores to use per
      container?
   4. Changes in the value of the property tez.am.resource.cpu.vcores are reflected
      at runtime. However, I do not observe the same behaviour with property 3. Are
      there other configurations that take precedence over it?

Your inputs and suggestions would be highly appreciated.

Thanks!


PS: Tests were conducted on a 5-node cluster running HDP 2.3.0.