Re: Hbase-Hive table integration Error

2016-03-01 Thread Amrit Jangid
Anyone ?

On Fri, Feb 26, 2016 at 3:50 PM, Amrit Jangid 
wrote:

> Hi ,
>
> I wanted to access an HBase table from Hive; while creating the table, I had an
> incorrect column mapping in the query.
>
> Now I want to DROP the table so I can create it again, but every query (DESCRIBE,
> DROP, SELECT) on the table gives this error:
>
> Error while compiling statement: FAILED: RuntimeException
> MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
> org.apache.hadoop.hive.hbase.HBaseSerDe: columns has 227 elements while
> hbase.columns.mapping has 155 elements (counting the key if implicit))
>
> Please Help.
>
>
> --
>
> Regards,
> Amrit
>
>


-- 

Regards,
Amrit
DataPlatform Team
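For reference, the count mismatch means the Hive column list and the entries in
hbase.columns.mapping must line up one-to-one (with :key counted for the row key).
A minimal sketch of a definition where the counts match, with hypothetical table
and column names:

CREATE EXTERNAL TABLE hbase_t (
  rowkey STRING,
  col1   STRING,
  col2   STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- exactly one mapping entry per Hive column: :key plus two qualifiers = 3 entries for 3 columns
  'hbase.columns.mapping' = ':key,cf:col1,cf:col2')
TBLPROPERTIES ('hbase.table.name' = 'hbase_t');

Recreating the definition with matching counts (227 mapping entries for 227
columns, in this case) should avoid the SerDe error on subsequent statements.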

RE: having problem while querying out select statement in TEZ

2016-03-01 Thread Mahender Sarangam
Any update ?
 
To: user@hive.apache.org
From: mahender.bigd...@outlook.com
Subject: having problem while querying out select statement in TEZ
Date: Tue, 1 Mar 2016 12:55:20 -0800


  


  
  
Hi,

We have created an ORC partitioned, bucketed table in Hive with ~ as the
delimiter. Whenever I fire a select statement on the ORC partitioned/bucketed
table, I keep getting the error:

org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to
org.apache.hadoop.hive.ql.exec.vector.LongColumnVector

The query is of the form: select ID, ..columnname.. from hiveOrcpb table. This
looks like the issue reproduced in
https://issues.apache.org/jira/browse/HIVE-6349

The data got inserted properly previously using Hive 0.13; when we read it with
Hive 1.2, we see this issue. Has anyone faced the same issue? Please let me know
the reason for it; I couldn't figure out the root cause of this error.
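One thing that may be worth trying, purely as a diagnostic and not a confirmed
fix, is turning vectorized execution off for the query, since the failing cast
happens inside the vectorized ORC reader:

set hive.vectorized.execution.enabled=false;
select ID, ... from hiveOrcpb;

If the query then succeeds, the problem is likely a mismatch between the table
schema and the column types actually written into the ORC files by Hive 0.13
(the vectorized reader expects a LongColumnVector for the declared type but the
file holds string data), rather than a Tez issue as such.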

  

 

  


Vertex failed, vertexName=Map 1, vertexId=vertex_1456489763556_0037_1_00, diagnostics=[Task failed, taskId=task_1456489763556_0037_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:310)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
    ... 14 more
Caused by: java.io.IOException: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at

Re:Re: How the actual "sample data" are implemented when using tez reduce auto-parallelism

2016-03-01 Thread Maria
Thank you very much for your patient answers. I got it; this is very helpful for
understanding the auto-parallelism optimization.

At 2016-02-29 12:50:04, "Rajesh Balamohan"  wrote:
 


"tez.shuffle-vertex-manager.desired-task-input-size" - Determines the amount of 
desired task input size per reduce task. Default is around 100 MB.



"tez.shuffle-vertex-manager.min-task-parallelism" - Min task parallelism that 
ShuffleVertexManager should honor. I.e, if the client has set it as 100, 
ShuffleVertexManager would not try auto-reduce less than 100 tasks.


"tez.shuffle-vertex-manager.min-src-fraction", 
"tez.shuffle-vertex-manager.max-src-fraction" determine the slow-start behavior.


Hive mainly sets "tez.shuffle-vertex-manager.desired-task-input-size" and
"tez.shuffle-vertex-manager.min-task-parallelism" at the time of creating the
DAG. Min-task-parallelism is determined internally in Hive by a couple of other
parameters, like "hive.tez.max.partition.factor / hive.tez.min.partition.factor",
along with the data size per reduce task. For instance, assume the initial reduce
task number is 100, hive.tez.max.partition.factor=2.0 and
hive.tez.min.partition.factor=0.25. In this case, Hive would set the reducers
to 200, and the hint to Tez for its min-task-parallelism would be 25, so that
Tez would not try to auto-reduce below 25 tasks. This serves as a safety net.
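As a rough sketch of the knobs involved on the Hive side (the values here are
purely illustrative, not recommendations):

set hive.tez.auto.reducer.parallelism=true;          -- let Tez adjust the reducer count at runtime
set hive.exec.reducers.bytes.per.reducer=268435456;  -- data size per reducer used for the initial estimate
set hive.tez.max.partition.factor=2.0;               -- initial reducers are scaled up by this factor
set hive.tez.min.partition.factor=0.25;              -- floor hint: 100 initial reducers * 0.25 = 25, as in the example above

Hive then translates these into tez.shuffle-vertex-manager.desired-task-input-size
and tez.shuffle-vertex-manager.min-task-parallelism when it builds the DAG, as
described above.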


In Tez, when a source task generates output, a DataMovementEvent is sent out (via
RPC) and its payload carries details like the output size. ShuffleVertexManager
keeps aggregating these values from different source tasks and periodically checks
whether it can compute the value for auto-reduce parallelism. If the aggregated
data size is less than the configured "desired-task-input-size", it waits for
output stats from more source tasks. It is possible that by this time the
min-src-fraction has reached its limit, but the min-src-fraction config is
dynamically overridden, as it is better to wait for data from more tasks to
determine a more accurate value for auto-parallelism.


There can be scenarios where the computed auto-reduce value is greater than the
currently configured parallelism, depending on the amount of data emitted by the
source tasks. In such cases, the existing parallelism is used.


The following method contains the details of how parallelism is determined at runtime:
https://github.com/apache/tez/blob/fd75e640396da8d5e1c67ef554d5db1846e08c69/tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java#L669


It is also possible for the source to send per-partition stats along with the
DataMovementEvent payload. Retaining all details in the same payload can be
fairly expensive, so currently the per-partition details are bucketed into one of
a few data ranges (0, 1, 10, 100, 1000 MB) and stored in a RoaringBitmap in the
payload. This can be a little noisy, but it at least provides better hints to
ShuffleVertexManager. Based on this info, ShuffleVertexManager can schedule the
reducer tasks that would get the maximum amount of data. This can be enabled
via "tez.runtime.report.partition.stats" (not enabled by default).


~Rajesh.B


On Sat, Feb 27, 2016 at 11:45 AM, LLBian  wrote:

Oh, I saw some useful messages about statistics on data from TEZ-1167.

Now, my main confusions are:

(1) How does the reduce ShuffleVertexManager know how much sample data is enough
to estimate the whole vertex's parallelism?

(2) What is the relationship between edge and event?



I am eager to get your guidance.

Any reply would be greatly appreciated.





At 2016-02-27 11:13:48, "LLBian"  wrote:

>
>Hello, respected experts:
>
>Recently, I have been studying Tez reduce auto-parallelism. I read the article
>"Apache Tez: Dynamic Graph Reconfiguration", TEZ-398 and HIVE-7158.
>
>I found that HIVE-7158 says "Tez can optionally sample data from a fraction of
>the tasks of a vertex and use that information to choose the number of
>downstream tasks for any given scatter gather edge".
>
>I know how to use this optimization, but I was confused by this:
>
>"Tez defines a VertexManager event that can be used to send an arbitrary user
>payload to the vertex manager of a given vertex. The partitioning tasks (say
>the Map tasks) use this event to send statistics such as the size of the
>output partitions produced to the ShuffleVertexManager for the reduce vertex.
>The manager receives these events and tries to model the final output
>statistics that would be produced by all the tasks."
>
>(1) How is the actual "sample data" mechanism implemented? I mean, how does the
>reduce ShuffleVertexManager know how much sample data is enough to estimate the
>whole vertex's parallelism? Is that related to reduce slow-start? I studied the
>source code of Apache Tez 0.7.0, but it is still not very clear to me. Maybe I
>was too stupid to understand it.
>
>(2) Is the partitioning

Re: Hive and Impala

2016-03-01 Thread Edward Capriolo
My knocks on Impala (not intended to be a post knocking Impala):

Impala really has not delivered on the complex types that Hive has (after
promising them for quite a while), and it only works with the 'blessed' input
formats: Parquet, Avro, text.

It is very annoying to work with Impala. In my version, if you create a
partition in Hive, Impala does not see it; you have to run "refresh".
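For example (table name hypothetical), after adding a partition from the Hive
side one typically has to tell Impala about it from impala-shell:

REFRESH mydb.clicks;                 -- pick up new partitions/files for an existing table
-- or, after heavier metadata changes:
INVALIDATE METADATA mydb.clicks;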

In Impala I do not have all the UDFs that Hive has, like percentile, etc.

Impala is fast. Many data-analyst / data-scientist types can't wait 10 seconds
for a query, so when I need to produce something for them I make sure the data
has no complex types and uses a table type that Impala understands.

But for my own work I still work primarily in Hive, because I do not want to
deal with all the things that Impala does not have (or might have), and when I
need something special like my own UDFs it is easier to whip up the solution in
Hive.

Having worked with M$ SQL Server and Vertica, Impala is on par with them, but I
don't think of it the way I think of Hive. To me it just feels like a Vertica
whose loading I can sometimes cheat because it is backed by HDFS.

Hive is something different: I am making pipelines, transforming data, doing
streaming, writing custom UDFs, querying JSON directly. It is simply not the
same thing as Impala.

::random message of the day::




On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar  wrote:

>
> Dr Mitch,
>
> My two cents here.
>
> I don't have direct experience of Impala but in my humble opinion I share
> your views that Hive provides the best metastore of all Big Data systems.
> Looking around, almost every product in one form or shape uses Hive code
> somewhere. My colleagues inform me that Hive is one of the most stable Big
> Data products.
>
> With the capabilities of Spark on Hive and Hive on Spark or Tez plus of
> course MR, there is really little need for many other products in the same
> space. It is good to keep things simple.
>
> Warmest
>
>
> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> I have not heard much about Impala lately. I saw an article on LinkedIn titled
>
> "Apache Hive Or Cloudera Impala? What is Best for me?"
>
> "We can access all objects from Hive data warehouse with HiveQL which
> leverages the map-reduce architecture in background for data retrieval and
> transformation and this results in latency."
>
> My response was
>
> This statement is no longer valid, as you now have a choice of three engines:
> MR, Spark and Tez. I have not used Impala myself as I don't think there is a
> need for it with Hive on Spark, or Spark using the Hive metastore, providing
> whatever is needed. Hive is for data warehousing and does what it says on the
> tin. Please also bear in mind that Hive offers ORC storage files that provide
> storage index capabilities, further optimizing queries with additional stats
> at file, stripe and row-group level.
>
> Anyway, the question is: with Hive on Spark, or Spark using the Hive
> metastore, what can we not achieve that we can achieve with Impala?
>
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: Hive and Impala

2016-03-01 Thread Ashok Kumar

Dr Mitch,
My two cents here.
I don't have direct experience of Impala, but in my humble opinion I share your
view that Hive provides the best metastore of all Big Data systems. Looking
around, almost every product in one form or shape uses Hive code somewhere. My
colleagues inform me that Hive is one of the most stable Big Data products.
With the capabilities of Spark on Hive and Hive on Spark or Tez plus of course 
MR, there is really little need for many other products in the same space. It 
is good to keep things simple.
Warmest 

On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh 
 wrote:
 

I have not heard much about Impala lately. I saw an article on LinkedIn titled
"Apache Hive Or Cloudera Impala? What is Best for me?"
"We can access all objects from Hive data warehouse with HiveQL which leverages
the map-reduce architecture in background for data retrieval and transformation
and this results in latency."
My response was:
This statement is no longer valid, as you now have a choice of three engines: MR,
Spark and Tez. I have not used Impala myself as I don't think there is a need for
it with Hive on Spark, or Spark using the Hive metastore, providing whatever is
needed. Hive is for data warehousing and does what it says on the tin. Please
also bear in mind that Hive offers ORC storage files that provide storage index
capabilities, further optimizing queries with additional stats at file, stripe
and row-group level.
Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore,
what can we not achieve that we can achieve with Impala?

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

  

having problem while querying out select statement in TEZ

2016-03-01 Thread mahender bigdata

Hi,
We have created an ORC partitioned, bucketed table in Hive with ~ as the
delimiter. Whenever I fire a select statement on the ORC partitioned/bucketed
table, I keep getting the error:

org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to
org.apache.hadoop.hive.ql.exec.vector.LongColumnVector

The query is of the form: select ID, ..columnname.. from hiveOrcpb table. This
looks like the issue reproduced in
https://issues.apache.org/jira/browse/HIVE-6349

The data got inserted properly previously using Hive 0.13; when we read it with
Hive 1.2, we see this issue. Has anyone faced the same issue? Please let me know
the reason for it; I couldn't figure out the root cause of this error.

Vertex failed, vertexName=Map 1, vertexId=vertex_1456489763556_0037_1_00, diagnostics=[Task failed, taskId=task_1456489763556_0037_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:310)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
    ... 14 more
Caused by: java.io.IOException: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
    ... 16 more
Caused by: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector cannot be cast to org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:98)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:52)
    at

Re: queues, beeline/hs2 and tez

2016-03-01 Thread Gopal Vijayaraghavan

> tez.queue.name via the --hiveconf switch on beeline and it doesn't look
>to me it works.  the question is... should it?

Nope, it shouldn't, because with Tez sessions the conf param is not a per-job setting.

The tez.queue.name can be changed while a JDBC connection is up, so it is not
picked up from the conf; instead it is acted upon when a user sends a set command
(i.e. kill the old session, start a new session in the new queue).
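In other words, the switch happens from inside the session, e.g. (queue name
illustrative):

set tez.queue.name=dwr.low;

which tears down the old Tez session and starts a new one on the requested queue.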

>-u "jdbc:hive2://dwrdevnn1.sv2.trulia.com:10001/default

This is what I do; it might need adapting for the noSasl setup.

beeline   -u "jdbc:hive2://localhost:1/default?tez.queue.name=adhoc"
-n gopalv --color=true   --outputformat=tsv2


Cheers,
Gopal




Re: queues, beeline/hs2 and tez

2016-03-01 Thread Siddharth Seth
+ user@hive.
This is specific to the way hive handles queues, and user@hive would be the
correct list to get an answer.

On Fri, Feb 26, 2016 at 3:21 PM, Stephen Sprague  wrote:

> hey guys (its me again!)
>
> this is a simple one i think.   i'm trying to set the tez.queue.name via
> the --hiveconf switch on beeline and it doesn't look to me it works.  the
> question is... should it?
>
>
> so this submits to the queue 'root.default' (which i don't want)
>
> {code}
> beeline \
> --hiveconf tez.queue.name=dwr.low \
> --hiveconf hive.execution.engine=tez \
> -u "jdbc:hive2://dwrdevnn1.sv2.trulia.com:10001/default;auth=noSasl
> 'spragues' nopwd org.apache.hive.jdbc.HiveDriver"  \
> --fastConnect=true  \
> -e "select count(*), date_key from omniture.hit_data_aws group by
> date_key order by date_key"
> {code}
>
> while this submits to the queue 'dwr.low' (which i do want), but i don't
> want to have to add that extra set clause. :(
>
> {code}
> beeline \
> --hiveconf tez.queue.name=dwr.low \
> --hiveconf hive.execution.engine=tez \
> -u "jdbc:hive2://dwrdevnn1.sv2.trulia.com:10001/default;auth=noSasl
> 'spragues' nopwd org.apache.hive.jdbc.HiveDriver"  \
> --fastConnect=true  \
> -e "*set tez.queue.name =dwr.low;* select
> count(*), date_key from omniture.hit_data_aws group by date_key order by
> date_key"
> {code}
>
> given i'm trying to write a general purpose sql wrapper i'd kinda like the
> first one to work b/c the sql string used in the -e switch is coming from
> the caller and i really don't want to inject my code into that if i can
> help it.
>
> anybody else run across this before? is there a trick?
>
> thanks,
> Stephen.
> PS  when hive.execution.engine=mr, --hiveconf mapred.job.queue.name=dwr.low
> works as advertised.
>


Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join

2016-03-01 Thread Gopal Vijayaraghavan

On 3/1/16, 10:41 AM, "Sergey Shelukhin"  wrote:

>Can you please open a Hive JIRA? It is a bug.
 

https://issues.apache.org/jira/browse/HIVE-13191


https://issues.apache.org/jira/browse/HIVE-13190


Cheers,
Gopal




Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join

2016-03-01 Thread Sergey Shelukhin
Can you please open a Hive JIRA? It is a bug.

On 16/3/1, 10:28, "Gopal Vijayaraghavan"  wrote:

>(Bcc: Tez, Cross-post to hive)
>
>> I added "set hive.execution.engine=mr;" at the top of the script; it seems the
>> result is correct…
>
>Pretty sure it's due to the same table aliases for both dummy tables
>(they're both called _dummy_table) auto join conversion.
>
>hive> set hive.auto.convert.join=false;
>
>
>Should go back to using slower tagged joins even in Tez, which will add a
>table-tag, i.e. the first table will be (, 0) and the 2nd table will be
>(, 1).
>
>I suspect the difference between the MR and Tez runs is the lookup between
>the table-name + expr (both equal for _dummy_table.11).
>
>> per Jeff Zhang's thinking if you were to set the exec engine to 'mr'
>>would it still fail?   if so, then its not Tez . :)
>
>Hive has a whole set of join algorithms which can only work on Tez, so
>it's not always that easy.
>
>Considering this is on hive-2.0.0, I recommend filing a JIRA on 2.0.0 and
>marking it with 2.0.1 as a target version.
>
>Cheers,
>Gopal
>
>
>
>



Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join

2016-03-01 Thread Gopal Vijayaraghavan
(Bcc: Tez, Cross-post to hive)

> I added "set hive.execution.engine=mr;" at the top of the script; it seems the
> result is correct…

Pretty sure it's due to the same table aliases for both dummy tables
(they're both called _dummy_table) auto join conversion.

hive> set hive.auto.convert.join=false;


Should go back to using slower tagged joins even in Tez, which will add a
table-tag, i.e. the first table will be (, 0) and the 2nd table will be
(, 1).

I suspect the difference between the MR and Tez runs is the lookup between
the table-name + expr (both equal for _dummy_table.11).

> per Jeff Zhang's thinking if you were to set the exec engine to 'mr'
>would it still fail?   if so, then its not Tez . :)

Hive has a whole set of join algorithms which can only work on Tez, so
it's not always that easy.

Considering this is on hive-2.0.0, I recommend filing a JIRA on 2.0.0 and
marking it with 2.0.1 as a target version.

Cheers,
Gopal




 








 




Reg: Hive Authorization Issue

2016-03-01 Thread Bharat Viswanadham
Hi,
A release was made to fix an authorization issue in Hive wherein parent tables of
partitions are not authenticated against for some partition-level operations; it
was released on January 28, 2016.
The release notes mentioned that this fix was released to all branches.
Can someone please provide information about this patch and the JIRA for it?
Has the fix gone into Hive 2.0.0?


Regards,
Bharat


Re: How does Hive do authentication on UDF

2016-03-01 Thread Alan Gates
There are several Hive authorization schemes, but at the moment none of them 
restrict function use.  At some point we’d like to add that feature to SQL 
standard authorization (see 
https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
 ) but no one has done it yet.

Alan.

> On Mar 1, 2016, at 02:01, Todd  wrote:
> 
> Hi ,
> 
> We are allowing users to write/use their own UDFs in our Hive environment. When
> they create a function against a db, all the users that can use that db can see
> (and use) the UDF.
> I would like to ask how UDF authorization is done: can a UDF be granted to some
> specific users, so that other users can't see it?
>  
> Thanks a lot!
> 



Re: ORC file split calculation problems

2016-03-01 Thread Patrick Duin
Hi Prasanth,

Thanks for this. I tried out the configuration and I wanted to share some
numbers with you.

My test setup is a cascading job that reads in 240 files (ranging from
1.5GB to 2.5GB).
In the job log I get the duration from these lines:
INFO log.PerfLogger: 

Running this without any of the configuration takes: 116501 ms
Setting both flags as per your email: 27233 ms
A nice improvement.
But doing the same test on data where the files are smaller than 256 MB (the ORC
block size), orcGetSplits takes 2741 ms, with or without the configuration
(the results are the same).

This is still a fairly big gap. Knowing we can tune the performance with your
suggested configuration is great, as we might not always have the option to
repartition our data. Still, avoiding spanning files over multiple blocks seems
to have much more of an impact, even though that is counter-intuitive.
It would be good to know if other users have similar experiences.

Again thanks for your help.

Kind regards,
 Patrick.



2016-02-29 6:38 GMT+00:00 Prasanth Jayachandran <
pjayachand...@hortonworks.com>:

> Hi Patrick
>
> Please find answers inline
>
> On Feb 26, 2016, at 9:36 AM, Patrick Duin  wrote:
>
> Hi Prasanth.
>
> Thanks for the quick reply!
>
> The logs don't show much more of the stacktrace I'm afraid:
> java.lang.NullPointerException
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
>
> The stacktrace isn't really the issue though. The NullPointerException is a
> symptom of not being able to return any stripes; if you look at that line in
> the code, it fails because the 'stripes' field is null, which should never
> happen. This, we think, is caused by failing namenode network traffic. We
> would see lots of IO warnings in the logs saying blocks cannot be found, e.g.:
> 16/02/01 13:20:34 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
> java.io.IOException: java.lang.InterruptedException
>     at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>     at com.sun.proxy.$Proxy32.getServerDefaults(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:268)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at com.sun.proxy.$Proxy33.getServerDefaults(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1007)
>     at org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:2062)
>     at org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:2068)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:208)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:159)
>     at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
>     at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3123)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
>     at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
>     at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
>     at java.io.DataInputStream.readFully(DataInputStream.java:195)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:407)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:311)
>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>     at

Re: Hive and Impala

2016-03-01 Thread Mich Talebzadeh
Just to clarify: the statement in quotes was made by the author of the
article.

"We can access all objects from Hive data warehouse with HiveQL which
leverages the map-reduce architecture in background for data retrieval and
transformation and this results in latency."

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 1 March 2016 at 11:33, Mich Talebzadeh  wrote:

> I have not heard much about Impala lately. I saw an article on LinkedIn titled
>
> "Apache Hive Or Cloudera Impala? What is Best for me?"
>
> "We can access all objects from Hive data warehouse with HiveQL which
> leverages the map-reduce architecture in background for data retrieval and
> transformation and this results in latency."
>
> My response was:
>
> This statement is no longer valid, as you now have a choice of three engines:
> MR, Spark and Tez. I have not used Impala myself as I don't think there is a
> need for it with Hive on Spark, or Spark using the Hive metastore, providing
> whatever is needed. Hive is for data warehousing and does what it says on the
> tin. Please also bear in mind that Hive offers ORC storage files that provide
> storage index capabilities, further optimizing queries with additional stats
> at file, stripe and row-group level.
>
> Anyway, the question is: with Hive on Spark, or Spark using the Hive
> metastore, what can we not achieve that we can achieve with Impala?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Hive and Impala

2016-03-01 Thread Mich Talebzadeh
I have not heard much about Impala lately. I saw an article on LinkedIn titled

"Apache Hive Or Cloudera Impala? What is Best for me?"

"We can access all objects from Hive data warehouse with HiveQL which
leverages the map-reduce architecture in background for data retrieval and
transformation and this results in latency."

My response was:

This statement is no longer valid, as you now have a choice of three engines:
MR, Spark and Tez. I have not used Impala myself as I don't think there is a
need for it with Hive on Spark, or Spark using the Hive metastore, providing
whatever is needed. Hive is for data warehousing and does what it says on the
tin. Please also bear in mind that Hive offers ORC storage files that provide
storage index capabilities, further optimizing queries with additional stats
at file, stripe and row-group level.

Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore,
what can we not achieve that we can achieve with Impala?


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Re: Hive Cli ORC table read error with limit option

2016-03-01 Thread Biswajit Nayak
Hi,

It works for MR engine, while in TEZ it fails.
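For reference, the combination that does work here (per the above) is simply the
MR engine with the same fetch setting:

set hive.execution.engine=mr;
set hive.fetch.task.conversion=none;
select h from testdb.table_orc where year = 2016 and month = 1 and day > 29 limit 10;

With Tez, the same query fails during ORC split generation, as shown below.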

hive> set hive.execution.engine=tez;

hive> set hive.fetch.task.conversion=none;

hive> select h from testdb.table_orc where year = 2016 and month = 1 and day > 29 limit 10;

Query ID = 26f9a510-c10c-475c-9988-081998b66b0c
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1456379707708_1135)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                 FAILED     -1          0        0       -1       0       0
--------------------------------------------------------------------------------
VERTICES: 00/01  [>>--] 0%  ELAPSED TIME: 0.37 s
--------------------------------------------------------------------------------
Status: Failed

Vertex failed, vertexName=Map 1, vertexId=vertex_1456379707708_1135_1_00, diagnostics=[Vertex vertex_1456379707708_1135_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: table_orc initializer failed, vertex=vertex_1456379707708_1135_1_00 [Map 1], java.lang.RuntimeException: serious problem

    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:131)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 0
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:188)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1016)
    ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0
    at java.util.Collections$EmptyList.get(Collections.java:3212)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
    ... 4 more
]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1456379707708_1135_1_00, diagnostics=[Vertex vertex_1456379707708_1135_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: table_orc initializer failed, vertex=vertex_1456379707708_1135_1_00 [Map 1], java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:131)
    at

Re: Hive-2.0.1 Release date

2016-03-01 Thread Oleksiy MapR
>> Hi. It will be released when some critical mass of bugfixes is
accumulated. We already found
>> some issues that would be nice to fix, so it may be some time in March.
Is there a particular fix
>> that interests you?

Hi Sergey!

Thanks for the information. There is no particular fix I want to see. It is a
common case to have an additional release with small fixes after a major
release, so I wanted to know the timeline for it.

Oleksiy.

On Tue, Mar 1, 2016 at 9:12 AM, Dmitry Tolpeko  wrote:

> Mich,
>
> >What is best way of using with Hive now?
>
> If we are talking about the command line: Hive/beeline CLI to execute
> standalone SQL statements; Hplsql if you need to surround them with
> procedural SQL (flow of control, loops, exception handlers, dynamic SQL
> etc.)
>
> Dmitry
>
>
> On Tue, Mar 1, 2016 at 9:50 AM, Mich Talebzadeh  > wrote:
>
>> Thanks Dmitry
>>
>> I can see
>>
>> cd $HIVE_HOME
>> find ./ -name '*hplsql*'
>> ./bin/ext/hplsql.sh
>> ./bin/hplsql.cmd
>> ./bin/hplsql
>> ./lib/hive-hplsql-2.0.0.jar
>>
>>  hplsql
>> usage: hplsql
>>  -d,--define 
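To illustrate the split Dmitry describes (the table name is hypothetical and the
syntax follows the HPL/SQL reference, so treat it as a sketch):

-- beeline / hive CLI: a standalone SQL statement
SELECT count(*) FROM sales;

-- hplsql: the same statement wrapped in procedural logic
FOR i IN 1..3 LOOP
  PRINT 'pass ' || i;
  SELECT count(*) FROM sales;
END LOOP;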

How does Hive do authentication on UDF

2016-03-01 Thread Todd
Hi ,


We are allowing users to write/use their own UDFs in our Hive environment. When
they create a function against a db, all the users that can use that db can see
(and use) the UDF.
I would like to ask how UDF authorization is done: can a UDF be granted to some
specific users, so that other users can't see it?
 
Thanks a lot!



About the hive python client pyhs2

2016-03-01 Thread Tale Firefly
Hello!

I am contacting you because I have a question about interacting with Hive from
Python.

The Python client pyhs2 is recommended on the Apache site:
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

I was wondering if I can use it, or is it better to drive the Hive CLI (or
Beeline CLI) from Python?

BR.

Tale.
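For what it is worth, typical pyhs2 usage looks roughly like the following (host,
credentials and table name are placeholders; treat it as a sketch rather than a
recommendation):

import pyhs2  # the client referenced above

# Placeholder connection details for a HiveServer2 instance.
with pyhs2.connect(host='hs2.example.com',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hive',
                   password='secret',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Run a query over the Thrift connection and iterate the result rows.
        cur.execute("SELECT * FROM some_table LIMIT 10")
        for row in cur.fetch():
            print(row)

Whether to use it versus driving beeline from Python mostly comes down to whether
a direct Thrift connection (no subprocess, real result sets) or a plain CLI
wrapper fits the workflow better.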