Re: Mechanism when doing a select *

Tale Firefly Wed, 30 Mar 2016 01:08:00 -0700

Hello guys !

Just a quick thank you again for your answer on this topic !


I noticed that when a job is performed (if the table is bigger than
hive.fetch.task.conversion.threshold), then it seems that temporary files
are created in HDFS (in /tmp). If I understood well, the select * is
divided into subsets and each of this subset is performed inside a mapper,
right ?

And then, once all these subsets are completed successfully, the global
result is sent to the client through the HiveServer2 but where it is stored
? In the RAM of HS2 ? If I perform a select * on a table far bigger than
the treshhold, is there a risk my HS2 crash because the result is too big ?

Best regards.

Tale

On Tue, Mar 22, 2016 at 10:21 AM, Tale Firefly <[email protected]> wrote:

> Hello everyone.
>
> Thanks for your answers.
>
> I'm gonna test this.
>
> Best regards.
>
> Tale
>
>
>
> On Mon, Mar 21, 2016 at 10:06 PM, Prasanth Jayachandran <
> [email protected]> wrote:
>
>> Hi
>>
>> Simple select * query launches a job when the input size is >1Gb by
>> default. Two configs that determines if a job has to be launched
>>
>> hive.fetch.task.conversion
>> hive.fetch.task.conversion.threshold
>>
>> Is your table size >1GB (hive.fetch.task.conversion.threshold)? You can
>> see that from “describe formatted tablename”.
>>
>> Thanks
>> Prasanth
>>
>> On Mar 21, 2016, at 11:16 AM, Mich Talebzadeh <[email protected]>
>> wrote:
>>
>> You are correct. it  should not. There is nothing to optimise here.
>>
>> 0: jdbc:hive2://rhes564:10010/default>
>> *select * from countries; *OK
>> INFO  : Compiling
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
>> select * from countries
>> INFO  : Semantic Analysis Completed
>> INFO  : Returning Hive schema:
>> Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double,
>> comment:null), FieldSchema(name:countries.country_iso_code, type:string,
>> comment:null), FieldSchema(name:countries.country_name, type:string,
>> comment:null), FieldSchema(name:countries.country_subregion, type:string,
>> comment:null), FieldSchema(name:countries.country_subregion_id,
>> type:double, comment:null), FieldSchema(name:countries.country_region,
>> type:string, comment:null), FieldSchema(name:countries.country_region_id,
>> type:double, comment:null), FieldSchema(name:countries.country_total,
>> type:string, comment:null), FieldSchema(name:countries.country_total_id,
>> type:double, comment:null), FieldSchema(name:countries.country_name_hist,
>> type:string, comment:null)], properties:null)
>> INFO  : Completed compiling
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
>> Time taken: 0.047 seconds
>> INFO  : Executing
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
>> select * from countries
>> INFO  : Completed executing
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
>> Time taken: 0.001 seconds
>> INFO  : OK
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 15:56, Tale Firefly <[email protected]> wrote:
>>
>>> Hm, I need to check if statistics are enabled for this table and
>>> up-to-date.
>>> I'm going to check this.
>>>
>>> I don't know if I was clear in my previous statement, but I am surprised
>>> that a job is launched just by doing a select * from my_table.
>>> I thought a select * from my_table was not running any MR jobs.
>>>
>>> Best regards.
>>>
>>> Tale.
>>>
>>> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>> Well I use Spark as engine.
>>>>
>>>> Now the question is have you updated statistics on ORC table?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 March 2016 at 15:32, Tale Firefly <[email protected]> wrote:
>>>>
>>>>> Re.
>>>>>
>>>>> Ty ty for your answer.
>>>>>
>>>>> I'm using Tez as execution engine for this query.
>>>>> And it launches a job to yarn.
>>>>>
>>>>> Do you know why it launches a job just for a select when I use Tez as
>>>>> execution engine ?
>>>>>
>>>>> BR.
>>>>>
>>>>> Tale
>>>>>
>>>>>
>>>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Your query is a table level query  that covers all rows in the table.
>>>>>>
>>>>>> Using ODBC you are connecting to Hive server 2 that runs on a given
>>>>>> port.
>>>>>>
>>>>>> Depending on the version of Hive you are running Hive under the
>>>>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>>>>
>>>>>> Data has to be collected from all blocks that hold data for this
>>>>>> table. The underlying ORC stats can only act at table level as there is 
>>>>>> no
>>>>>> predicate push down and data has to be sent to ODBC driver through the
>>>>>> network.
>>>>>>
>>>>>> The ODBC driver can only communicate with Hive server 2 so there is
>>>>>> no connectivity to individual nodes from your client.
>>>>>>
>>>>>> So in summary Hive server 2 collects data from all blocks and
>>>>>> forwards it to the client. The actual collection and filtering of result
>>>>>> set in SQL query will depend on many factors.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 21 March 2016 at 14:26, Tale Firefly <[email protected]> wrote:
>>>>>>
>>>>>>> Hello guys !
>>>>>>>
>>>>>>> I'm trying to understand the mechanism for a simple query select *
>>>>>>> from my_table when using HiveServer2.
>>>>>>>
>>>>>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>>>>>> I just do a select * from my_table.
>>>>>>> my_table is an ORC table based on files divised into blocks located
>>>>>>> on all my datanodes.
>>>>>>> I have 50 datanodes.
>>>>>>>
>>>>>>> My question is the following :
>>>>>>> Does all the data go from the datanodes to the node hosting the
>>>>>>> hiveserver2 before coming back to my client ?
>>>>>>> Or does all the data go directly from the datanodes to my client ?
>>>>>>>
>>>>>>> Hope you can help me o/
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Tale
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: Mechanism when doing a select *

Reply via email to