Hello guys ! Just a quick thank you again for your answer on this topic !
I noticed that when a job is performed (if the table is bigger than hive.fetch.task.conversion.threshold), then it seems that temporary files are created in HDFS (in /tmp). If I understood well, the select * is divided into subsets and each of this subset is performed inside a mapper, right ? And then, once all these subsets are completed successfully, the global result is sent to the client through the HiveServer2 but where it is stored ? In the RAM of HS2 ? If I perform a select * on a table far bigger than the treshhold, is there a risk my HS2 crash because the result is too big ? Best regards. Tale On Tue, Mar 22, 2016 at 10:21 AM, Tale Firefly <tale.h...@gmail.com> wrote: > Hello everyone. > > Thanks for your answers. > > I'm gonna test this. > > Best regards. > > Tale > > > > On Mon, Mar 21, 2016 at 10:06 PM, Prasanth Jayachandran < > pjayachand...@hortonworks.com> wrote: > >> Hi >> >> Simple select * query launches a job when the input size is >1Gb by >> default. Two configs that determines if a job has to be launched >> >> hive.fetch.task.conversion >> hive.fetch.task.conversion.threshold >> >> Is your table size >1GB (hive.fetch.task.conversion.threshold)? You can >> see that from “describe formatted tablename”. >> >> Thanks >> Prasanth >> >> On Mar 21, 2016, at 11:16 AM, Mich Talebzadeh <mich.talebza...@gmail.com> >> wrote: >> >> You are correct. it should not. There is nothing to optimise here. >> >> 0: jdbc:hive2://rhes564:10010/default> >> *select * from countries; *OK >> INFO : Compiling >> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318): >> select * from countries >> INFO : Semantic Analysis Completed >> INFO : Returning Hive schema: >> Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double, >> comment:null), FieldSchema(name:countries.country_iso_code, type:string, >> comment:null), FieldSchema(name:countries.country_name, type:string, >> comment:null), FieldSchema(name:countries.country_subregion, type:string, >> comment:null), FieldSchema(name:countries.country_subregion_id, >> type:double, comment:null), FieldSchema(name:countries.country_region, >> type:string, comment:null), FieldSchema(name:countries.country_region_id, >> type:double, comment:null), FieldSchema(name:countries.country_total, >> type:string, comment:null), FieldSchema(name:countries.country_total_id, >> type:double, comment:null), FieldSchema(name:countries.country_name_hist, >> type:string, comment:null)], properties:null) >> INFO : Completed compiling >> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318); >> Time taken: 0.047 seconds >> INFO : Executing >> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318): >> select * from countries >> INFO : Completed executing >> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318); >> Time taken: 0.001 seconds >> INFO : OK >> >> Dr Mich Talebzadeh >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 21 March 2016 at 15:56, Tale Firefly <tale.h...@gmail.com> wrote: >> >>> Hm, I need to check if statistics are enabled for this table and >>> up-to-date. >>> I'm going to check this. >>> >>> I don't know if I was clear in my previous statement, but I am surprised >>> that a job is launched just by doing a select * from my_table. >>> I thought a select * from my_table was not running any MR jobs. >>> >>> Best regards. >>> >>> Tale. >>> >>> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh < >>> mich.talebza...@gmail.com> wrote: >>> >>>> Well I use Spark as engine. >>>> >>>> Now the question is have you updated statistics on ORC table? >>>> >>>> HTH >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> >>>> LinkedIn * >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>> >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> >>>> On 21 March 2016 at 15:32, Tale Firefly <tale.h...@gmail.com> wrote: >>>> >>>>> Re. >>>>> >>>>> Ty ty for your answer. >>>>> >>>>> I'm using Tez as execution engine for this query. >>>>> And it launches a job to yarn. >>>>> >>>>> Do you know why it launches a job just for a select when I use Tez as >>>>> execution engine ? >>>>> >>>>> BR. >>>>> >>>>> Tale >>>>> >>>>> >>>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh < >>>>> mich.talebza...@gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Your query is a table level query that covers all rows in the table. >>>>>> >>>>>> Using ODBC you are connecting to Hive server 2 that runs on a given >>>>>> port. >>>>>> >>>>>> Depending on the version of Hive you are running Hive under the >>>>>> bonnet is most likely using Map-Reduce as the execution engine. >>>>>> >>>>>> Data has to be collected from all blocks that hold data for this >>>>>> table. The underlying ORC stats can only act at table level as there is >>>>>> no >>>>>> predicate push down and data has to be sent to ODBC driver through the >>>>>> network. >>>>>> >>>>>> The ODBC driver can only communicate with Hive server 2 so there is >>>>>> no connectivity to individual nodes from your client. >>>>>> >>>>>> So in summary Hive server 2 collects data from all blocks and >>>>>> forwards it to the client. The actual collection and filtering of result >>>>>> set in SQL query will depend on many factors. >>>>>> >>>>>> HTH >>>>>> >>>>>> Dr Mich Talebzadeh >>>>>> >>>>>> >>>>>> LinkedIn * >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>>> >>>>>> >>>>>> http://talebzadehmich.wordpress.com >>>>>> >>>>>> >>>>>> >>>>>> On 21 March 2016 at 14:26, Tale Firefly <tale.h...@gmail.com> wrote: >>>>>> >>>>>>> Hello guys ! >>>>>>> >>>>>>> I'm trying to understand the mechanism for a simple query select * >>>>>>> from my_table when using HiveServer2. >>>>>>> >>>>>>> I'm using the hortonworks ODBC Driver for HiveServer2. >>>>>>> I just do a select * from my_table. >>>>>>> my_table is an ORC table based on files divised into blocks located >>>>>>> on all my datanodes. >>>>>>> I have 50 datanodes. >>>>>>> >>>>>>> My question is the following : >>>>>>> Does all the data go from the datanodes to the node hosting the >>>>>>> hiveserver2 before coming back to my client ? >>>>>>> Or does all the data go directly from the datanodes to my client ? >>>>>>> >>>>>>> Hope you can help me o/ >>>>>>> >>>>>>> Thank you >>>>>>> >>>>>>> Tale >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >> >