sporadic `Unable to find class` with anonymous functions in udf

2016-01-12 Thread Ophir Etzion
using cdh5.4.3 (hive1.1) via HiveServer.

Does anyone have a suggestion about what to do / look for?

the error:

org.apache.hadoop.hive.ql.parse.SemanticException: Generate Map Join Task
Error: Unable to find class:
com.foursquare.hadoop.hive.udf.IsDefinedUDF$$anonfun$initialize$6
Serialization trace:
isDefinedFunc (com.foursquare.hadoop.hive.udf.IsDefinedUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
predicate (org.apache.hadoop.hive.ql.plan.FilterDesc)
conf (org.apache.hadoop.hive.ql.exec.FilterOperator)
opParseCtxMap (org.apache.hadoop.hive.ql.plan.MapWork)
mapWork (org.apache.hadoop.hive.ql.plan.MapredWork)
at
org.apache.hadoop.hive.ql.optimizer.physical.CommonJoinTaskDispatcher.processCurrentTask(CommonJoinTaskDispatcher.java:517)
at
org.apache.hadoop.hive.ql.optimizer.physical.AbstractJoinTaskDispatcher.dispatch(AbstractJoinTaskDispatcher.java:180)
at
org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
at
org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)


the udf:

import org.apache.hadoop.hive.ql.exec.{Description, UDFArgumentLengthException}
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.lazy.LazyString
import org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyStringObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.{ListObjectInspector, MapObjectInspector, ObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.{PrimitiveObjectInspectorFactory, StringObjectInspector}
import org.apache.hadoop.io.Text

import scala.collection.JavaConverters._

@Description(name = "isDefined",
  value = "returns true if the object is not null and not empty and not \"\"",
  extended = "Example:\n" +
    "SELECT isDefined(col)\n")
class IsDefinedUDF extends GenericUDF with Serializable {
  // Chosen in initialize() based on the argument's ObjectInspector. Each
  // branch below compiles to its own class (IsDefinedUDF$$anonfun$initialize$N),
  // which is the class named in the Kryo error above.
  var isDefinedFunc: Option[Object] => Boolean = null

  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    val arg = arguments.toVector

    if (arg.length != 1) {
      throw new UDFArgumentLengthException("isDefined only takes one argument.")
    }

    Option(arg.head) match {
      case Some(a: ListObjectInspector) =>
        isDefinedFunc = { obj => obj.map(o => !a.getList(o).asScala.toList.isEmpty).getOrElse(false) }
      case Some(a: MapObjectInspector) =>
        isDefinedFunc = { obj => obj.map(o => !a.getMap(o).asScala.toMap.isEmpty).getOrElse(false) }
      case Some(a: LazyStringObjectInspector) =>
        isDefinedFunc = { obj => a.getPrimitiveJavaObject(obj.getOrElse(new LazyString(a))) != "" }
      case Some(a: StringObjectInspector) =>
        isDefinedFunc = { obj => a.getPrimitiveJavaObject(obj.getOrElse(new Text(""))) != "" }
      case None =>
        isDefinedFunc = { _ => false }
      case _ =>
        isDefinedFunc = { obj => obj.isDefined }
    }

    PrimitiveObjectInspectorFactory.javaBooleanObjectInspector
  }

  override def evaluate(arguments: Array[DeferredObject]): Object = {
    val arg = arguments.toVector.head

    isDefinedFunc(Option(arg.get())): java.lang.Boolean
  }

  override def getDisplayString(children: Array[String]) = {
    "isDefined(" + children(0) + ")"
  }
}
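
The class Kryo cannot find, IsDefinedUDF$$anonfun$initialize$6, is one of
the anonymous functions above: Hive serializes the initialized UDF instance,
including the isDefinedFunc field, into the query plan, so deserializing the
plan needs that function's compiled class on the classpath. A sketch of a
possible workaround (my suggestion, not something verified in this thread):
keep no function value in a serialized field at all, only the
ObjectInspector, marked @transient (Kryo's FieldSerializer skips transient
fields by default, and Hive re-runs initialize on the task side), and
dispatch in evaluate:

// Sketch only: same semantics as IsDefinedUDF, but no captured anonymous
// function is stored in a field the plan serializer can see.
class IsDefinedUDF extends GenericUDF {
  @transient private var inspector: ObjectInspector = _

  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1) {
      throw new UDFArgumentLengthException("isDefined only takes one argument.")
    }
    inspector = arguments(0)
    PrimitiveObjectInspectorFactory.javaBooleanObjectInspector
  }

  override def evaluate(arguments: Array[DeferredObject]): Object = {
    val o = arguments(0).get()
    val defined: Boolean =
      if (o == null) false
      else inspector match {
        case a: ListObjectInspector   => a.getListLength(o) > 0 // non-empty list
        case a: MapObjectInspector    => a.getMapSize(o) > 0    // non-empty map
        case a: StringObjectInspector => a.getPrimitiveJavaObject(o) != "" // also matches LazyStringObjectInspector
        case _                        => true                   // non-null anything else
      }
    Boolean.box(defined)
  }

  override def getDisplayString(children: Array[String]): String =
    "isDefined(" + children(0) + ")"
}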


Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Ophir Etzion
It didn't work, assuming I did the right thing.
In the properties you can see

{"key":"hive.aux.jars.path","value":"file:///data/loko/foursquare.web-hiverc/current/hadoop-hive-serde.jar,file:///data/loko/foursquare.web-hiverc/current/hadoop-hive-udf.jar","isFinal":false,"resource":"programatically"}
which includes the jar that has the class I need, but I still get:

org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to
find class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
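
For later readers: the "resource":"programatically" on that property suggests
it was set at session time. A plausible explanation for the failure (my
reading, not confirmed in the thread) is that the plan (MapWork /
PartitionDesc) is deserialized before session-added jars reach the relevant
classpath, so input format and serde classes have to be on HiveServer2's and
the executors' classpath at startup. That is the auxlib point Edward makes
below.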



On Fri, Jan 8, 2016 at 12:24 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> You cannot 'add jar' input formats and serdes. They need to be part of
> your auxlib.
>
> On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion <op...@foursquare.com>
> wrote:
>
>> I tried it now. Still getting:
>>
>> 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
>> hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
>>  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>> Serialization trace:
>> inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
>> aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
>> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>
>>
>> HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.
>>
>>
>> On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure <premsure...@gmail.com> wrote:
>>
>>> Did you try the --jars property in spark-submit? If your jar is of huge
>>> size, you can pre-load the jar on all executors in a commonly available
>>> directory to avoid network IO.
>>>
>>> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion <op...@foursquare.com>
>>> wrote:
>>>
>>>> I'm trying to add jars before running a query using hive on spark on cdh
>>>> 5.4.3.
>>>> I've tried applying the patch in
>>>> https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the
>>>> patch is against a different hive version) but still haven't succeeded.
>>>>
>>>> Did anyone manage to do ADD JAR successfully with CDH?
>>>>
>>>> Thanks,
>>>> Ophir
>>>>
>>>
>>>
>>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Ophir Etzion
Thanks!
In certain use cases you can, but I forgot about the aux thing, that's
probably it.

On Fri, Jan 8, 2016 at 12:24 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> You cannot 'add jar' input formats and serdes. They need to be part of
> your auxlib.
>
> On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion <op...@foursquare.com>
> wrote:
>
>> I tried it now. Still getting:
>>
>> 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
>> hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
>>  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>> Serialization trace:
>> inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
>> aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
>> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>
>>
>> HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.
>>
>>
>> On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure <premsure...@gmail.com> wrote:
>>
>>> Did you try the --jars property in spark-submit? If your jar is of huge
>>> size, you can pre-load the jar on all executors in a commonly available
>>> directory to avoid network IO.
>>>
>>> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion <op...@foursquare.com>
>>> wrote:
>>>
>>>> I'm trying to add jars before running a query using hive on spark on cdh
>>>> 5.4.3.
>>>> I've tried applying the patch in
>>>> https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the
>>>> patch is against a different hive version) but still haven't succeeded.
>>>>
>>>> Did anyone manage to do ADD JAR successfully with CDH?
>>>>
>>>> Thanks,
>>>> Ophir
>>>>
>>>
>>>
>>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Ophir Etzion
I tried it now. Still getting:

16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan:
hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to
find class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
Serialization trace:
inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to
find class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat


HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.


On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure <premsure...@gmail.com> wrote:

> Did you try the --jars property in spark-submit? If your jar is of huge size,
> you can pre-load the jar on all executors in a commonly available directory
> to avoid network IO.
>
> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion <op...@foursquare.com> wrote:
>
>> I'm trying to add jars before running a query using hive on spark on cdh
>> 5.4.3.
>> I've tried applying the patch in
>> https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the patch
>> is against a different hive version) but still haven't succeeded.
>>
>> Did anyone manage to do ADD JAR successfully with CDH?
>>
>> Thanks,
>> Ophir
>>
>
>


adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Ophir Etzion
I'm trying to add jars before running a query using hive on spark on cdh
5.4.3.
I've tried applying the patch in
https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the patch is
against a different hive version) but still haven't succeeded.

Did anyone manage to do ADD JAR successfully with CDH?

Thanks,
Ophir


last_modified_time and transient_lastDdlTime - what is transient_lastDdlTime for.

2016-01-06 Thread Ophir Etzion
I want to know, for each of my tables, the last time it was modified. Some of
my tables don't have last_modified_time in the table parameters, but all
have transient_lastDdlTime.
transient_lastDdlTime seems to be the same as last_modified_time in some of
the tables I randomly checked.

What is the time in transient_lastDdlTime? If it is also the modified time,
why is there also last_modified_time?

Thanks,
Ophir
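
For what it's worth, a quick way to inspect the value (a sketch: mytable is
a placeholder, and I'm assuming the parameter holds Unix epoch seconds,
which is what it looks like):

SHOW TBLPROPERTIES mytable('transient_lastDdlTime');
-- paste the returned number into from_unixtime to make it readable:
SELECT from_unixtime(1452101234);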


hive on spark

2015-12-18 Thread Ophir Etzion
During spark-submit when running hive on spark I get:

Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated


Caused by: java.lang.IllegalAccessError: tried to access method
org.apache.hadoop.fs.DelegationTokenRenewer.<init>(Ljava/lang/Class;)V
from class org.apache.hadoop.hdfs.HftpFileSystem

I managed to make hive on spark work on a staging cluster I have and now
I'm trying to do the same on a production cluster and this happened. Both
are cdh5.4.3.

I read that this is due to something not being compiled against the
correct hadoop version.
My main question: what is the binary/jar/file that can cause this?

I tried replacing the binaries and jars to the ones used by the
staging cluster (that hive on spark worked on) and it didn't help.

Thank you to anyone reading this, and thank you for any direction on
where to look.

Ophir
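
One way to chase down "what is the binary/jar/file": ask the classloader
where the two classes in the IllegalAccessError come from, and compare the
output between the staging and production clusters. A sketch (my
suggestion; run it with the same classpath as the failing process, e.g.
from a Scala console started with those jars):

// Prints the jar each class was loaded from; a mismatch between Hadoop
// artifacts here would explain the IllegalAccessError.
for (name <- Seq("org.apache.hadoop.hdfs.HftpFileSystem",
                 "org.apache.hadoop.fs.DelegationTokenRenewer")) {
  val cls = Class.forName(name)
  println(s"$name -> ${cls.getProtectionDomain.getCodeSource.getLocation}")
}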


Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Ophir Etzion
Hi,

the versions are spark 1.3.0 and hive 1.1.0 as part of cloudera 5.4.3.

I find it weird that it would work only on the version you mentioned, as
there is documentation (not good documentation, but still...) on how to do it
with cloudera, which packages different versions.

Thanks for the answer though.

Why would spark 1.5.2 specifically not work with hive?

Ophir

On Tue, Dec 15, 2015 at 5:33 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Hi,
>
>
>
> The only combination I have managed to run Hive using the Spark engine
> with is Spark 1.3.1 on Hive 1.2.1
>
>
>
> Can you confirm the version of Spark you are running?
>
>
>
> FYI, Spark 1.5.2 will not work with Hive.
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
> *From:* Ophir Etzion [mailto:op...@foursquare.com]
> *Sent:* 15 December 2015 22:27
> *To:* u...@spark.apache.org; user@hive.apache.org
> *Subject:* Hive on Spark - Error: Child process exited before connecting
> back
>
>
>
> Hi,
>
>
>
> When trying to do Hive on Spark on CDH5.4.3, I get the following error
> when running a simple query using spark.
>
> I've tried setting everything written here (
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> as well as what the cdh recommends.
>
> Has anyone encountered this as well? (Searching for it didn't help much.)
>
> the error:
>
> ERROR : Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
>
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
>
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
>
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
>
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
>
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
>
> at
> org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
>
> at
> org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)
>
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)
>
> at
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
>
>   

Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Ophir Etzion
Hi,

When trying to do Hive on Spark on CDH5.4.3, I get the following error when
running a simple query using spark.

I've tried setting everything written here (
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
as well as what the cdh recommends.

Has anyone encountered this as well? (Searching for it didn't help much.)

the error:

ERROR : Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
client.
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)
at
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
at
org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
at
org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at
org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancel
client '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited
before connecting back
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at
org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:109)
at
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:91)
at
org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:55)
... 22 more
Caused by: java.util.concurrent.ExecutionException:
java.lang.RuntimeException: Cancel client
'2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before
connecting back
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at
org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:99)
... 26 more
Caused by: java.lang.RuntimeException: Cancel client
'2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before
connecting back
at
org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:179)
at
org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:427)
... 1 more

ERROR : Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
client.
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)
at
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
at
org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
at
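
For anyone hitting this later: "Child process exited before connecting back"
means the spark-submit process HiveServer2 launched died before completing
the RPC handshake; SparkClientImpl echoes the child's stdout/stderr into
HiveServer2's own log, which usually contains the real failure. For
reference, the session settings the Getting Started page linked above asks
for (values illustrative, adjust to your cluster):

set hive.execution.engine=spark;
set spark.master=yarn-client;
set spark.eventLog.enabled=true;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;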

trying to figure out number of MR jobs from explain output

2015-12-11 Thread Ophir Etzion
Hi,

I've been trying to figure out how to know the number of MR jobs that will
be run for a hive query using the EXPLAIN output.

I haven't found a consistent method of determining that.

for example (one of my queries, a CTAS query):
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
  Stage-8 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-8
  Stage-3
  Stage-5
  Stage-6 depends on stages: Stage-5

Stage-1, Stage-3, Stage-5 are listed as map reduce steps.

eventually 2 MR jobs ran.

in other cases only 1 job runs.

I couldn't find a consistent rule on how to figure this out.

can anyone help??

Thank you!!

below is full output

explain CREATE TABLE beekeeper_results.test3 ROW FORMAT SERDE
"com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde" WITH
SERDEPROPERTIES ('escape.delim'='\\', 'mapkey.delim'='\;',
'colelction.delim'='|') AS SELECT * FROM beekeeper_results.test2;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
  Stage-8 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-8
  Stage-3
  Stage-5
  Stage-6 depends on stages: Stage-5

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: test2
Statistics: Num rows: 112 Data size: 11690 Basic stats:
COMPLETE Column stats: NONE
Select Operator
  expressions: blasttag (type: string), actioncounts (type:
array>), detailedclicks (type:
array>), countsbyclient
(type: array>),
totalactioncounts (type: array>),
actionsbydate (type:
array>)
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
  Statistics: Num rows: 112 Data size: 11690 Basic stats:
COMPLETE Column stats: NONE
  File Output Operator
compressed: false
Statistics: Num rows: 112 Data size: 11690 Basic stats:
COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
name: beekeeper_results.test3

  Stage: Stage-7
Conditional Operator

  Stage: Stage-4
Move Operator
  files:
  hdfs directory: true
  destination:
hdfs://hadoop-alidoro-nn-vip/user/hive/warehouse/.hive-staging_hive_2015-12-11_21-52-35_063_8498858370292854265-1/-ext-10001

  Stage: Stage-0
Move Operator
  files:
  hdfs directory: true
  destination: ***

  Stage: Stage-8
  Create Table Operator:
Create Table
  columns: blasttag string, actioncounts
array>, detailedclicks
array>, countsbyclient
array>, totalactioncounts
array>, actionsbydate
array>
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
  serde name:
com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
  serde properties:
colelction.delim |
escape.delim \
mapkey.delim ;
  name: beekeeper_results.test3

  Stage: Stage-2
Stats-Aggr Operator

  Stage: Stage-3
Map Reduce
  Map Operator Tree:
  TableScan
File Output Operator
  compressed: false
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde:
com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
  name: beekeeper_results.test3

  Stage: Stage-5
Map Reduce
  Map Operator Tree:
  TableScan
File Output Operator
  compressed: false
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde:
com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
  name: beekeeper_results.test3

  Stage: Stage-6
Move Operator
  files:
  hdfs directory: true
  destination: ***
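
A reading of this plan that matches the observed counts (my interpretation,
not confirmed on the list): only stages whose plan section says "Map Reduce"
launch MR jobs, and Stage-7 is a Conditional Operator that resolves to
exactly one of Stage-4, Stage-3, or Stage-5 at runtime, typically depending
on whether the output files need merging. Stage-4 is just a move, so when
the condition picks it the query runs a single MR job (Stage-1); when it
picks Stage-3 or Stage-5 a second, map-only merge job runs. That would
explain seeing sometimes 1 and sometimes 2 jobs for plans like this one.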