Re: Mechanism when doing a select *

2016-03-21 Thread Gopal Vijayaraghavan

>> Or does all the data go directly from the datanodes to my client ?

Not yet.

https://issues.apache.org/jira/browse/HIVE-11527


Cheers,
Gopal





Re: Error selecting from a Hive ORC table in Spark-sql

2016-03-21 Thread Mich Talebzadeh
It sounds like this happens with ORC transactional tables.

When I create the same table as ORC but non-transactional, it works!

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 March 2016 at 17:53, Eugene Koifman  wrote:

> The system thinks t2 is an Acid table but the files on disk don’t follow
> the convention acid system would expect.
> Perhaps Xuefu Zhang would know more about Spark/Acid integration.
>
> From: Mich Talebzadeh 
> Reply-To: "user@hive.apache.org" 
> Date: Monday, March 21, 2016 at 9:39 AM
> To: "user @spark" , user 
> Subject: Error selecting from a Hive ORC table in Spark-sql
>
> Hi,
>
> Do we know the cause of this error when selecting from a Hive ORC table?
>
> spark-sql> select * from t2;
> 16/03/21 16:38:33 ERROR SparkSQLDriver: Failed in [select * from t2]
> java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>         ...
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.NumberFormatException: For input string: "039_"
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
>         ...

Re: Error selecting from a Hive ORC table in Spark-sql

2016-03-21 Thread Eugene Koifman
The system thinks t2 is an Acid table, but the files on disk don't follow the
convention the acid system would expect.
Perhaps Xuefu Zhang would know more about Spark/Acid integration.
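
To make the naming point concrete, here is a minimal sketch (this is not Hive's actual AcidUtils code; delta_<minTxnId>_<maxTxnId> is just the usual ACID layout convention) of why a stray name fragment such as "039_" produces exactly the NumberFormatException seen in the stack traces in this thread, which run through AcidUtils.parseDelta into Long.parseLong:

public class AcidNamingSketch {
    public static void main(String[] args) {
        // An ACID table stores data under delta directories such as
        // delta_0000012_0000012; the numeric parts are parsed as transaction ids.
        System.out.println(Long.parseLong("0000012"));   // fine: prints 12

        // A file or directory whose name yields a fragment like "039_" cannot be
        // parsed as a transaction id:
        System.out.println(Long.parseLong("039_"));
        // -> java.lang.NumberFormatException: For input string: "039_"
    }
}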

From: Mich Talebzadeh
Reply-To: "user@hive.apache.org"
Date: Monday, March 21, 2016 at 9:39 AM
To: "user @spark", user
Subject: Error selecting from a Hive ORC table in Spark-sql

Hi,

Do we know the cause of this error when selecting from a Hive ORC table?

spark-sql> select * from t2;
16/03/21 16:38:33 ERROR SparkSQLDriver: Failed in [select * from t2]
java.lang.RuntimeException: serious problem
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
        ...
Caused by: java.lang.NumberFormatException: For input string: "039_"
        at java.lang.Long.parseLong(Long.java:441)
        ...

Error selecting from a Hive ORC table in Spark-sql

2016-03-21 Thread Mich Talebzadeh
Hi,

Do we know the cause of this error when selecting from a Hive ORC table?

spark-sql> select * from t2;
16/03/21 16:38:33 ERROR SparkSQLDriver: Failed in [select * from t2]
java.lang.RuntimeException: serious problem
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "039_"
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
        at java.util.concurrent.FutureTask.get(FutureTask.java:111)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
        ... 43 more
Caused by: java.lang.NumberFormatException: For input string: "039_"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:441)
        at java.lang.Long.parseLong(Long.java:483)
        at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310)
        at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        ...

Re: Mechanism when doing a select *

2016-03-21 Thread Mich Talebzadeh
You are correct, it should not. There is nothing to optimise here.

0: jdbc:hive2://rhes564:10010/default> select * from countries;
OK
INFO  : Compiling
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
select * from countries
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema:
Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double,
comment:null), FieldSchema(name:countries.country_iso_code, type:string,
comment:null), FieldSchema(name:countries.country_name, type:string,
comment:null), FieldSchema(name:countries.country_subregion, type:string,
comment:null), FieldSchema(name:countries.country_subregion_id,
type:double, comment:null), FieldSchema(name:countries.country_region,
type:string, comment:null), FieldSchema(name:countries.country_region_id,
type:double, comment:null), FieldSchema(name:countries.country_total,
type:string, comment:null), FieldSchema(name:countries.country_total_id,
type:double, comment:null), FieldSchema(name:countries.country_name_hist,
type:string, comment:null)], properties:null)
INFO  : Completed compiling
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
Time taken: 0.047 seconds
INFO  : Executing
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
select * from countries
INFO  : Completed executing
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
Time taken: 0.001 seconds
INFO  : OK

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 March 2016 at 15:56, Tale Firefly  wrote:

> Hm, I need to check if statistics are enabled for this table and
> up-to-date.
> I'm going to check this.
>
> I don't know if I was clear in my previous statement, but I am surprised
> that a job is launched just by doing a select * from my_table.
> I thought a select * from my_table was not running any MR jobs.
>
> Best regards.
>
> Tale.
>
> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well I use Spark as engine.
>>
>> Now the question is have you updated statistics on ORC table?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 15:32, Tale Firefly  wrote:
>>
>>> Re.
>>>
>>> Ty ty for your answer.
>>>
>>> I'm using Tez as execution engine for this query.
>>> And it launches a job to yarn.
>>>
>>> Do you know why it launches a job just for a select when I use Tez as
>>> execution engine ?
>>>
>>> BR.
>>>
>>> Tale
>>>
>>>
>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Your query is a table level query  that covers all rows in the table.

 Using ODBC you are connecting to Hive server 2 that runs on a given
 port.

 Depending on the version of Hive you are running Hive under the
 bonnet is most likely using Map-Reduce as the execution engine.

 Data has to be collected from all blocks that hold data for this table.
 The underlying ORC stats can only act at table level as there is no
 predicate push down and data has to be sent to ODBC driver through the
 network.

 The ODBC driver can only communicate with Hive server 2 so there is no
 connectivity to individual nodes from your client.

 So in summary Hive server 2 collects data from all blocks and forwards
 it to the client. The actual collection and filtering of result set in SQL
 query will depend on many factors.

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 21 March 2016 at 14:26, Tale Firefly  wrote:

> Hello guys !
>
> I'm trying to understand the mechanism for a simple query select *
> from my_table when using HiveServer2.
>
> I'm using the Hortonworks ODBC Driver for HiveServer2.
> I just do a select * from my_table.
> my_table is an ORC table based on files divided into blocks located on
> all my datanodes.
> I have 50 datanodes.
>
> My question is the following :
> Does all the data go from the datanodes to the node hosting the
> hiveserver2 before coming back to my client ?
> Or does all the data go directly from the datanodes to my client ?
>
> Hope you 

Re: Mechanism when doing a select *

2016-03-21 Thread Tale Firefly
Oh my bad, even with the execution engine set to MR, my query turns into an
MR job.

I'm going to run more tests with the Hive CLI, Beeline, and Excel to check
whether this behaviour is linked to the ODBC driver.

BR.

Tale.

On Mon, Mar 21, 2016 at 4:56 PM, Tale Firefly  wrote:

> Hm, I need to check if statistics are enabled for this table and
> up-to-date.
> I'm going to check this.
>
> I don't know if I was clear in my previous statement, but I am surprised
> that a job is launched just by doing a select * from my_table.
> I thought a select * from my_table was not running any MR jobs.
>
> Best regards.
>
> Tale.
>
> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well I use Spark as engine.
>>
>> Now the question is have you updated statistics on ORC table?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 15:32, Tale Firefly  wrote:
>>
>>> Re.
>>>
>>> Ty ty for your answer.
>>>
>>> I'm using Tez as execution engine for this query.
>>> And it launches a job to yarn.
>>>
>>> Do you know why it launches a job just for a select when I use Tez as
>>> execution engine ?
>>>
>>> BR.
>>>
>>> Tale
>>>
>>>
>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Your query is a table level query  that covers all rows in the table.

 Using ODBC you are connecting to Hive server 2 that runs on a given
 port.

 Depending on the version of Hive you are running Hive under the
 bonnet is most likely using Map-Reduce as the execution engine.

 Data has to be collected from all blocks that hold data for this table.
 The underlying ORC stats can only act at table level as there is no
 predicate push down and data has to be sent to ODBC driver through the
 network.

 The ODBC driver can only communicate with Hive server 2 so there is no
 connectivity to individual nodes from your client.

 So in summary Hive server 2 collects data from all blocks and forwards
 it to the client. The actual collection and filtering of result set in SQL
 query will depend on many factors.

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 21 March 2016 at 14:26, Tale Firefly  wrote:

> Hello guys !
>
> I'm trying to understand the mechanism for a simple query select *
> from my_table when using HiveServer2.
>
> I'm using the Hortonworks ODBC Driver for HiveServer2.
> I just do a select * from my_table.
> my_table is an ORC table based on files divided into blocks located on
> all my datanodes.
> I have 50 datanodes.
>
> My question is the following :
> Does all the data go from the datanodes to the node hosting the
> hiveserver2 before coming back to my client ?
> Or does all the data go directly from the datanodes to my client ?
>
> Hope you can help me o/
>
> Thank you
>
> Tale
>


>>>
>>
>


Re: Mechanism when doing a select *

2016-03-21 Thread Tale Firefly
Hm, I need to check if statistics are enabled for this table and up-to-date.
I'm going to check this.

I don't know if I was clear in my previous statement, but I am surprised
that a job is launched just by doing a select * from my_table.
I thought a select * from my_table was not running any MR jobs.
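
For what it's worth, whether a bare select * is answered by a simple fetch or planned as an MR/Tez job is usually governed by hive.fetch.task.conversion (none / minimal / more); the defaults differ between Hive versions, so the sketch below (connection details and table name are placeholders) is only something to experiment with, not a definitive answer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchTaskSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // Hive JDBC driver on the classpath
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default", "hduser", "");
             Statement stmt = conn.createStatement()) {
            // Ask Hive to answer simple selects as a fetch task instead of launching a job:
            stmt.execute("SET hive.fetch.task.conversion=more");
            try (ResultSet rs = stmt.executeQuery("select * from my_table")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}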

Best regards.

Tale.

On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh 
wrote:

> Well I use Spark as engine.
>
> Now the question is have you updated statistics on ORC table?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 March 2016 at 15:32, Tale Firefly  wrote:
>
>> Re.
>>
>> Ty ty for your answer.
>>
>> I'm using Tez as execution engine for this query.
>> And it launches a job to yarn.
>>
>> Do you know why it launches a job just for a select when I use Tez as
>> execution engine ?
>>
>> BR.
>>
>> Tale
>>
>>
>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Your query is a table level query  that covers all rows in the table.
>>>
>>> Using ODBC you are connecting to Hive server 2 that runs on a given port.
>>>
>>> Depending on the version of Hive you are running Hive under the
>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>
>>> Data has to be collected from all blocks that hold data for this table.
>>> The underlying ORC stats can only act at table level as there is no
>>> predicate push down and data has to be sent to ODBC driver through the
>>> network.
>>>
>>> The ODBC driver can only communicate with Hive server 2 so there is no
>>> connectivity to individual nodes from your client.
>>>
>>> So in summary Hive server 2 collects data from all blocks and forwards
>>> it to the client. The actual collection and filtering of result set in SQL
>>> query will depend on many factors.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 March 2016 at 14:26, Tale Firefly  wrote:
>>>
 Hello guys !

 I'm trying to understand the mechanism for a simple query select * from
 my_table when using HiveServer2.

 I'm using the Hortonworks ODBC Driver for HiveServer2.
 I just do a select * from my_table.
 my_table is an ORC table based on files divided into blocks located on
 all my datanodes.
 I have 50 datanodes.

 My question is the following :
 Does all the data go from the datanodes to the node hosting the
 hiveserver2 before coming back to my client ?
 Or does all the data go directly from the datanodes to my client ?

 Hope you can help me o/

 Thank you

 Tale

>>>
>>>
>>
>


Re: Mechanism when doing a select *

2016-03-21 Thread Mich Talebzadeh
Well, I use Spark as the execution engine.

Now the question is: have you updated the statistics on the ORC table?

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 March 2016 at 15:32, Tale Firefly  wrote:

> Re.
>
> Ty ty for your answer.
>
> I'm using Tez as execution engine for this query.
> And it launches a job to yarn.
>
> Do you know why it launches a job just for a select when I use Tez as
> execution engine ?
>
> BR.
>
> Tale
>
>
> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Your query is a table level query  that covers all rows in the table.
>>
>> Using ODBC you are connecting to Hive server 2 that runs on a given port.
>>
>> Depending on the version of Hive you are running Hive under the bonnet is
>> most likely using Map-Reduce as the execution engine.
>>
>> Data has to be collected from all blocks that hold data for this table.
>> The underlying ORC stats can only act at table level as there is no
>> predicate push down and data has to be sent to ODBC driver through the
>> network.
>>
>> The ODBC driver can only communicate with Hive server 2 so there is no
>> connectivity to individual nodes from your client.
>>
>> So in summary Hive server 2 collects data from all blocks and forwards it
>> to the client. The actual collection and filtering of result set in SQL
>> query will depend on many factors.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 14:26, Tale Firefly  wrote:
>>
>>> Hello guys !
>>>
>>> I'm trying to understand the mechanism for a simple query select * from
>>> my_table when using HiveServer2.
>>>
>>> I'm using the Hortonworks ODBC Driver for HiveServer2.
>>> I just do a select * from my_table.
>>> my_table is an ORC table based on files divided into blocks located on
>>> all my datanodes.
>>> I have 50 datanodes.
>>>
>>> My question is the following :
>>> Does all the data go from the datanodes to the node hosting the
>>> hiveserver2 before coming back to my client ?
>>> Or does all the data go directly from the datanodes to my client ?
>>>
>>> Hope you can help me o/
>>>
>>> Thank you
>>>
>>> Tale
>>>
>>
>>
>


Re: Mechanism when doing a select *

2016-03-21 Thread Tale Firefly
Re.

Ty ty for your answer.

I'm using Tez as the execution engine for this query, and it launches a job
to YARN.

Do you know why it launches a job just for a select when I use Tez as the
execution engine?

BR.

Tale


On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> Your query is a table level query  that covers all rows in the table.
>
> Using ODBC you are connecting to Hive server 2 that runs on a given port.
>
> Depending on the version of Hive you are running Hive under the bonnet is
> most likely using Map-Reduce as the execution engine.
>
> Data has to be collected from all blocks that hold data for this table.
> The underlying ORC stats can only act at table level as there is no
> predicate push down and data has to be sent to ODBC driver through the
> network.
>
> The ODBC driver can only communicate with Hive server 2 so there is no
> connectivity to individual nodes from your client.
>
> So in summary Hive server 2 collects data from all blocks and forwards it
> to the client. The actual collection and filtering of result set in SQL
> query will depend on many factors.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 March 2016 at 14:26, Tale Firefly  wrote:
>
>> Hello guys !
>>
>> I'm trying to understand the mechanism for a simple query select * from
>> my_table when using HiveServer2.
>>
>> I'm using the Hortonworks ODBC Driver for HiveServer2.
>> I just do a select * from my_table.
>> my_table is an ORC table based on files divided into blocks located on
>> all my datanodes.
>> I have 50 datanodes.
>>
>> My question is the following :
>> Does all the data go from the datanodes to the node hosting the
>> hiveserver2 before coming back to my client ?
>> Or does all the data go directly from the datanodes to my client ?
>>
>> Hope you can help me o/
>>
>> Thank you
>>
>> Tale
>>
>
>


Re: How to work around non-executive /tmp with Hive in Parquet+Snappy compression?

2016-03-21 Thread Tale Firefly
Hey !

Are you talking about the HDFS /tmp or the local FS /tmp?

For the HDFS one, I think it should be the property:
hive.exec.scratchdir

For the local one, I think it should be the property:
hive.exec.local.scratchdir
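
If it helps, a hive-site.xml sketch with both properties; the directory paths are only placeholders, and whatever you pick must be writable (and, for the local one, executable) by the user running Hive:

  <property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>            <!-- scratch directory on HDFS -->
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/var/hive/scratch</value>         <!-- scratch directory on the local FS -->
  </property>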

BR

Tale

On Sat, Mar 19, 2016 at 8:46 PM, Rex X  wrote:

> The local /tmp is mounted non-executable (noexec) by the admin.
>
> When we do a "select ... limit 10" query in Hive, it copies some file to
> /tmp and tries to execute it. But since /tmp is non-executable, I always
> get bumped out of the Hive shell with some binding error.
>
> What is the setting to change this /tmp work folder to another directory?
>
>
>
>


Re: Mechanism when doing a select *

2016-03-21 Thread Mich Talebzadeh
Hi,

Your query is a table-level query that covers all rows in the table.

Using ODBC you are connecting to HiveServer2, which runs on a given port.

Depending on the version of Hive you are running, Hive under the bonnet is
most likely using Map-Reduce as the execution engine.

Data has to be collected from all blocks that hold data for this table. The
underlying ORC stats can only act at table level, as there is no predicate
push-down, and the data has to be sent to the ODBC driver over the network.

The ODBC driver can only communicate with HiveServer2, so there is no
connectivity to individual nodes from your client.

So in summary, HiveServer2 collects data from all blocks and forwards it to
the client. The actual collection and filtering of the result set in a SQL
query will depend on many factors.
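
As a minimal sketch of that flow (host, port, user and table name are placeholders, and the Hive JDBC driver has to be on the classpath), a JDBC client, like an ODBC one, only ever opens a connection to the HiveServer2 endpoint; every result row is streamed back over that single connection rather than being fetched from the datanodes directly:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SelectStarViaHS2 {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // register the Hive JDBC driver
        // The only network endpoint the client knows about is HiveServer2:
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default", "hduser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from my_table")) {
            while (rs.next()) {
                // Rows arrive over this one connection; the client never contacts a datanode.
                System.out.println(rs.getString(1));
            }
        }
    }
}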

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 March 2016 at 14:26, Tale Firefly  wrote:

> Hello guys !
>
> I'm trying to understand the mechanism for a simple query select * from
> my_table when using HiveServer2.
>
> I'm using the Hortonworks ODBC Driver for HiveServer2.
> I just do a select * from my_table.
> my_table is an ORC table based on files divided into blocks located on all
> my datanodes.
> I have 50 datanodes.
>
> My question is the following :
> Does all the data go from the datanodes to the node hosting the
> hiveserver2 before coming back to my client ?
> Or does all the data go directly from the datanodes to my client ?
>
> Hope you can help me o/
>
> Thank you
>
> Tale
>


Mechanism when doing a select *

2016-03-21 Thread Tale Firefly
Hello guys !

I'm trying to understand the mechanism for a simple query select * from
my_table when using HiveServer2.

I'm using the Hortonworks ODBC Driver for HiveServer2.
I just do a select * from my_table.
my_table is an ORC table based on files divided into blocks located on all
my datanodes.
I have 50 datanodes.

My question is the following:
Does all the data go from the datanodes to the node hosting the HiveServer2
before coming back to my client?
Or does all the data go directly from the datanodes to my client?

Hope you can help me o/

Thank you

Tale


Re: Column type conversion in Hive

2016-03-21 Thread Edward Capriolo
Explicit conversion is done using cast (x as bigint)

You said: As a matter of interest what is the underlying storage for
Integer?

This is dictated on disk by the InputFormat; the "temporal in-memory format"
is dictated by the SerDe. An integer could be stored as the text "1" or in
some other encoding, as dictated by the InputFormat, SerDe and storage
handler in use.
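
A short sketch of both paths, reusing the s and o tables from Mich's example (connection details are placeholders; the usual Hive behaviour is that a string which cannot be converted comes back as NULL rather than raising an error, but verify that on your version):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CastSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default", "hduser", "");
             Statement stmt = conn.createStatement()) {
            // Explicit conversion, as described above:
            try (ResultSet rs = stmt.executeQuery("select cast(col1 as bigint) from s")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
            // Implicit conversion happens during the insert-select into the int column;
            // a non-numeric string would typically end up as NULL in o.col1:
            stmt.execute("insert into o select * from s");
        }
    }
}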

On Sun, Mar 20, 2016 at 6:27 PM, Mich Talebzadeh 
wrote:

>
> As a matter of interest, how does Hive cast columns from, say, String to
> Integer implicitly?
>
> For example the following shows this
>
> create table s(col1 String);
> insert into s values("1");
> insert into s values("2");
>
> Now create a target table with col1 being integer
>
>
> create table o (col1 Int);
> insert into o select * from s;
>
> select * from o;
> +-+--+
> | o.col1  |
> +-+--+
> | 1   |
> | 2   |
> +-+--+
>
> So this implicit column conversion from String to Integer happens without
> intervention in the code. As a matter of interest what is the underlying
> storage for Integer. In a conventional RDBMS this needs to be done through
> cast (CHAR AS INT) etc?
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Error in Hive on Spark

2016-03-21 Thread Stana
Does anyone have suggestions on setting the hive-exec-2.0.0.jar path as a
property in the application?
Something like
'hiveConf.set("hive.remote.driver.jar","hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.
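
As background for the quoted analysis below: resolving "the jar that contains this class" always yields a path on the machine where the code is running, which is why the generated spark-submit command points at a client-local hive-exec-2.0.0.jar that the remote YARN cluster cannot see. A hedged illustration of the idea (this is not what SparkContext.jarOfClass does internally, just the same effect):

public class JarOfClassSketch {
    public static void main(String[] args) {
        // Where did the JVM load this class from? Always a path local to this machine.
        String jar = JarOfClassSketch.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation()
                .getPath();
        System.out.println("jar of this class (a client-local path): " + jar);
    }
}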



2016-03-11 10:53 GMT+08:00 Stana :

> Thanks for the reply
>
> I have set the property spark.home in my application. Otherwise the
> application threw 'SPARK_HOME not found exception'.
>
> I found hive source code in SparkClientImpl.java:
>
> private Thread startDriver(final RpcServer rpcServer, final String
> clientId, final String secret)
>   throws IOException {
> ...
>
> List<String> argv = Lists.newArrayList();
>
> ...
>
> argv.add("--class");
> argv.add(RemoteDriver.class.getName());
>
> String jar = "spark-internal";
> if (SparkContext.jarOfClass(this.getClass()).isDefined()) {
> jar = SparkContext.jarOfClass(this.getClass()).get();
> }
> argv.add(jar);
>
> ...
>
> }
>
> When Hive executes spark-submit, it generates the shell command with
> --class org.apache.hive.spark.client.RemoteDriver and sets the jar path with
> SparkContext.jarOfClass(this.getClass()).get(). That resolves to the local
> path of hive-exec-2.0.0.jar.
>
> In my situation, the application and the YARN cluster are in different
> clusters. When the application runs spark-submit with the local path of
> hive-exec-2.0.0.jar against the YARN cluster, there is no hive-exec-2.0.0.jar
> on the YARN cluster, so the application throws the exception:
> "hive-exec-2.0.0.jar does not exist ...".
>
> Can the hive-exec-2.0.0.jar path be set as a property in the application?
> Something like 'hiveConf.set("hive.remote.driver.jar",
> "hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.
> If not, is it possible to achieve this in a future version?
>
>
>
>
> 2016-03-10 23:51 GMT+08:00 Xuefu Zhang :
>
>> You can probably avoid the problem by set environment variable SPARK_HOME
>> or JVM property spark.home that points to your spark installation.
>>
>> --Xuefu
>>
>> On Thu, Mar 10, 2016 at 3:11 AM, Stana  wrote:
>>
>> >  I am trying out Hive on Spark with hive 2.0.0 and spark 1.4.1, and
>> > executing org.apache.hadoop.hive.ql.Driver with java application.
>> >
>> > Following are my situations:
>> > 1.Building spark 1.4.1 assembly jar without Hive .
>> > 2.Uploading the spark assembly jar to the hadoop cluster.
>> > 3.Executing the java application with eclipse IDE in my client computer.
>> >
>> > The application went well and it submitted mr job to the yarn cluster
>> > successfully when using " hiveConf.set("hive.execution.engine", "mr")
>> > ",but it threw exceptions in spark-engine.
>> >
>> > Finally, i traced Hive source code and came to the conclusion:
>> >
>> > In my situation, SparkClientImpl class will generate the spark-submit
>> > shell and executed it.
>> > The shell command allocated  --class with RemoteDriver.class.getName()
>> > and jar with SparkContext.jarOfClass(this.getClass()).get(), so that
>> > my application threw the exception.
>> >
>> > Is it right? And how can I do to execute the application with
>> > spark-engine successfully in my client computer ? Thanks a lot!
>> >
>> >
>> > Java application code:
>> >
>> > public class TestHiveDriver {
>> >
>> > private static HiveConf hiveConf;
>> > private static Driver driver;
>> > private static CliSessionState ss;
>> > public static void main(String[] args){
>> >
>> > String sql = "select * from hadoop0263_0 as a join
>> > hadoop0263_0 as b
>> > on (a.key = b.key)";
>> > ss = new CliSessionState(new
>> HiveConf(SessionState.class));
>> > hiveConf = new HiveConf(Driver.class);
>> > hiveConf.set("fs.default.name", "hdfs://storm0:9000");
>> > hiveConf.set("yarn.resourcemanager.address",
>> > "storm0:8032");
>> > hiveConf.set("yarn.resourcemanager.scheduler.address",
>> > "storm0:8030");
>> >
>> >
>> hiveConf.set("yarn.resourcemanager.resource-tracker.address","storm0:8031");
>> > hiveConf.set("yarn.resourcemanager.admin.address",
>> > "storm0:8033");
>> > hiveConf.set("mapreduce.framework.name", "yarn");
>> > hiveConf.set("mapreduce.johistory.address",
>> > "storm0:10020");
>> >
>> >
>> hiveConf.set("javax.jdo.option.ConnectionURL","jdbc:mysql://storm0:3306/stana_metastore");
>> >
>> >
>> hiveConf.set("javax.jdo.option.ConnectionDriverName","com.mysql.jdbc.Driver");
>> > hiveConf.set("javax.jdo.option.ConnectionUserName",
>> > "root");
>> > hiveConf.set("javax.jdo.option.ConnectionPassword",
>> > "123456");
>> > hiveConf.setBoolean("hive.auto.convert.join",false);
>> > hiveConf.set("spark.yarn.jar",
>> > "hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar");
>> > hiveConf.set("spark.home","target/spark");
>> > hiveConf.set("hive.execution.engine", "spark");
>> >