Re: Mechanism when doing a select *

Prasanth Jayachandran Mon, 21 Mar 2016 14:07:32 -0700

Hi

Simple select * query launches a job when the input size is >1Gb by default. 
Two configs that determines if a job has to be launched


hive.fetch.task.conversion
hive.fetch.task.conversion.threshold

Is your table size >1GB (hive.fetch.task.conversion.threshold)? You can see 
that from “describe formatted tablename”.

Thanks
Prasanth

On Mar 21, 2016, at 11:16 AM, Mich Talebzadeh 
<[email protected]<mailto:[email protected]>> wrote:

You are correct. it  should not. There is nothing to optimise here.

0: jdbc:hive2://rhes564:10010/default> select * from countries;
OK
INFO  : Compiling 
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318): 
select * from countries
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: 
Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double, 
comment:null), FieldSchema(name:countries.country_iso_code, type:string, 
comment:null), FieldSchema(name:countries.country_name, type:string, 
comment:null), FieldSchema(name:countries.country_subregion, type:string, 
comment:null), FieldSchema(name:countries.country_subregion_id, type:double, 
comment:null), FieldSchema(name:countries.country_region, type:string, 
comment:null), FieldSchema(name:countries.country_region_id, type:double, 
comment:null), FieldSchema(name:countries.country_total, type:string, 
comment:null), FieldSchema(name:countries.country_total_id, type:double, 
comment:null), FieldSchema(name:countries.country_name_hist, type:string, 
comment:null)], properties:null)
INFO  : Completed compiling 
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318); 
Time taken: 0.047 seconds
INFO  : Executing 
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318): 
select * from countries
INFO  : Completed executing 
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318); 
Time taken: 0.001 seconds
INFO  : OK

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 21 March 2016 at 15:56, Tale Firefly 
<[email protected]<mailto:[email protected]>> wrote:
Hm, I need to check if statistics are enabled for this table and up-to-date.
I'm going to check this.

I don't know if I was clear in my previous statement, but I am surprised that a 
job is launched just by doing a select * from my_table.
I thought a select * from my_table was not running any MR jobs.

Best regards.

Tale.

On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh 
<[email protected]<mailto:[email protected]>> wrote:
Well I use Spark as engine.

Now the question is have you updated statistics on ORC table?

HTH



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 21 March 2016 at 15:32, Tale Firefly 
<[email protected]<mailto:[email protected]>> wrote:
Re.

Ty ty for your answer.

I'm using Tez as execution engine for this query.
And it launches a job to yarn.

Do you know why it launches a job just for a select when I use Tez as execution 
engine ?

BR.

Tale


On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

Your query is a table level query  that covers all rows in the table.

Using ODBC you are connecting to Hive server 2 that runs on a given port.

Depending on the version of Hive you are running Hive under the bonnet is most 
likely using Map-Reduce as the execution engine.

Data has to be collected from all blocks that hold data for this table. The 
underlying ORC stats can only act at table level as there is no predicate push 
down and data has to be sent to ODBC driver through the network.

The ODBC driver can only communicate with Hive server 2 so there is no 
connectivity to individual nodes from your client.

So in summary Hive server 2 collects data from all blocks and forwards it to 
the client. The actual collection and filtering of result set in SQL query will 
depend on many factors.

HTH

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 21 March 2016 at 14:26, Tale Firefly 
<[email protected]<mailto:[email protected]>> wrote:
Hello guys !

I'm trying to understand the mechanism for a simple query select * from 
my_table when using HiveServer2.

I'm using the hortonworks ODBC Driver for HiveServer2.
I just do a select * from my_table.
my_table is an ORC table based on files divised into blocks located on all my 
datanodes.
I have 50 datanodes.

My question is the following :
Does all the data go from the datanodes to the node hosting the hiveserver2 
before coming back to my client ?
Or does all the data go directly from the datanodes to my client ?

Hope you can help me o/

Thank you

Tale

Re: Mechanism when doing a select *

Reply via email to