Re: Reg:Column Statistics with Parquet
Hi, I tried the same with compute statistics for columns a, b, c as above and am still seeing the same results in the explain plan. How do I confirm whether it is generating all the column stats for a given column? If that is confirmed, we can debug why Hive is still not using them. Thanks, Suma

On Thu, Jul 24, 2014 at 11:49 PM, Prasanth Jayachandran pjayachand...@hortonworks.com wrote: You have to explicitly specify the column list in the analyze command to gather column stats. This command will only collect basic stats such as number of rows, total file size, raw data size, and number of files:

analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics;

To collect column statistics, add the column list like below:

analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics for columns a, b, c;

Thanks, Prasanth Jayachandran

On Jul 24, 2014, at 5:13 AM, Sandeep Samudrala sandeep.samudr...@inmobi.com wrote: I am trying to enable column statistics usage with Parquet tables. This is the query I am executing. However, in the explain output I see that even though Basic stats: COMPLETE is shown, Column stats is shown as NONE. Can someone please explain what else I need to do to debug/fix this?
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics;

hive> explain select min(a), max(b), min(c) from user_table;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_table
            Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: double), b (type: double), c (type: int)
              outputColumnNames: a, b, c
              Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: min(a), max(b), min(c)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: double), _col1 (type: double), _col2 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: min(VALUE._col0), max(VALUE._col1), min(VALUE._col2)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: double), _col1 (type: double), _col2 (type: int)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch
Operator
        limit: -1

Thanks,
-sandeep
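A quick way to check whether the column statistics actually landed in the metastore is to describe a single column; in newer Hive releases this prints the stored min/max/distinct-count fields. This is a sketch against the thread's example table; support for the partition-level form varies by Hive version:

```sql
-- Table-level column stats; all-empty fields mean no column stats were stored.
DESCRIBE FORMATTED user_table a;

-- Partition-level column stats, where the Hive version supports it:
DESCRIBE FORMATTED user_table PARTITION (dt='2014-06-01', hour='00') a;
```

If these fields are empty after running analyze ... for columns, the stats were never persisted, which matches the Column stats: NONE seen in the explain plan.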
RE: Reg:Column Statistics with Parquet
Well, that is not quite the correct way. You can check the statistics in the PART_COL_STATS and similar tables in the metastore database, if you are using MySQL as the stats database. The other way is to call max, min, or distinct on int columns, largest length on string columns, etc.; if these operations run a whole MapReduce job, then the statistics are not getting created.

From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
Sent: Friday, July 25, 2014 12:43 PM
To: user@hive.apache.org
Subject: Re: Reg:Column Statistics with Parquet
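The metastore-side check mentioned above can be sketched as follows, assuming a MySQL-backed metastore with the standard schema; the table, partition, and column names mirror the thread's example and are illustrative:

```sql
-- Run against the Hive metastore database (MySQL), not inside Hive itself.
-- A row per column in PART_COL_STATS means ANALYZE ... FOR COLUMNS
-- actually persisted partition-level column statistics.
SELECT PARTITION_NAME,
       COLUMN_NAME,
       COLUMN_TYPE,
       LONG_LOW_VALUE,    -- min/max for integer columns
       LONG_HIGH_VALUE,
       DOUBLE_LOW_VALUE,  -- min/max for double columns
       DOUBLE_HIGH_VALUE,
       NUM_NULLS,
       NUM_DISTINCTS
FROM   PART_COL_STATS
WHERE  TABLE_NAME = 'user_table'
  AND  PARTITION_NAME = 'dt=2014-06-01/hour=00'
  AND  COLUMN_NAME IN ('a', 'b', 'c');
```

An empty result set means the column statistics were never written, which would explain why the explain plan still reports Column stats: NONE.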
Re: Hive shell code exception, urgent help needed
Hi, I have 4 instances on EC2: 1 master with the namenode and YARN running on it, 1 for the secondary namenode, and 2 slaves. I used a t2.medium instance for the master node only and left the rest as they were, and I still got the same exception. t2.medium is a decent instance with 4 GB RAM and 2 CPUs, so I don't think this exception is related to memory. Any other suggestions, please? Regards, Sarfraz Rasheed Ramay (DIT) Dublin, Ireland.

On Fri, Jul 25, 2014 at 1:51 AM, Juan Martin Pampliega jpampli...@gmail.com wrote: Hi, the actually useful part of the error is: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. If you do a search for this plus EC2 in Google you will find a couple of results that point to memory exhaustion issues. You should try increasing the configured memory size. Since you are using a t2.micro, you should really try using a bigger Amazon instance size. That would probably be a lot more useful than trying different configurations.

On Jul 24, 2014 7:08 AM, Sarfraz Ramay sarfraz.ra...@gmail.com wrote: Can anyone please help with this? I followed the advice here http://stackoverflow.com/questions/20390217/mapreduce-job-in-headless-environment-fails-n-times-due-to-am-container-exceptio and added the following properties to mapred-site.xml, but I am still getting the same error:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Djava.awt.headless=true</value>
</property>
<!-- add headless to default -Xmx1024m -->
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Djava.awt.headless=true -Xmx1024m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.admin-command-opts</name>
  <value>-Djava.awt.headless=true</value>
</property>

Regards, Sarfraz Rasheed Ramay (DIT) Dublin, Ireland.

On Tue, Jul 22, 2014 at 8:19 AM, Sarfraz Ramay sarfraz.ra...@gmail.com wrote: Hi, I am using Hive 0.13.1 and Hadoop 2.2.0 on Amazon EC2 t2.micro instances.
I have 4 instances: the master has the namenode and YARN, the secondary namenode is a separate instance, and the two slaves are on separate instances each. It was working fine till now, but it started to break when I tried to run the following query on TPC-H generated 3 GB data; the same worked fine on 1 GB:

SELECT l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate,
       o_shippriority
FROM customer c
JOIN orders o ON (c.c_custkey = o.o_custkey)
JOIN lineitem l ON (l.l_orderkey = o.o_orderkey)
WHERE o_orderdate < '1995-03-15'
  AND l_shipdate > '1995-03-15'
  AND c.c_mktsegment = 'AUTOMOBILE'
GROUP BY l_orderkey, o_orderdate, o_shippriority
HAVING sum(l_extendedprice * (1 - l_discount)) > 38500 -- average revenue
-- LIMIT 10;
;

I have tried many things but nothing seems to work. I am also attaching my mapred-site.xml and yarn-site.xml files for reference, plus the error log. I have also tried to limit the memory settings in mapred-site.xml and yarn-site.xml, but nothing seems to be working. For full log details please find attached the hive.log file. Please help!
Hadoop job information for Stage-7: number of mappers: 9; number of reducers: 0
2014-07-22 06:39:31,643 Stage-7 map = 0%,  reduce = 0%
2014-07-22 06:39:43,940 Stage-7 map = 6%,  reduce = 0%, Cumulative CPU 5.34 sec
2014-07-22 06:39:45,002 Stage-7 map = 11%,  reduce = 0%, Cumulative CPU 6.94 sec
2014-07-22 06:40:08,373 Stage-7 map = 17%,  reduce = 0%, Cumulative CPU 12.6 sec
2014-07-22 06:40:10,417 Stage-7 map = 22%,  reduce = 0%, Cumulative CPU 14.06 sec
2014-07-22 06:40:22,732 Stage-7 map = 28%,  reduce = 0%, Cumulative CPU 24.46 sec
2014-07-22 06:40:25,843 Stage-7 map = 33%,  reduce = 0%, Cumulative CPU 25.74 sec
2014-07-22 06:40:33,039 Stage-7 map = 44%,  reduce = 0%, Cumulative CPU 33.32 sec
2014-07-22 06:40:38,709 Stage-7 map = 56%,  reduce = 0%, Cumulative CPU 37.19 sec
2014-07-22 06:41:07,648 Stage-7 map = 61%,  reduce = 0%, Cumulative CPU 42.83 sec
2014-07-22 06:41:15,900 Stage-7 map = 56%,  reduce = 0%, Cumulative CPU 39.49 sec
2014-07-22 06:41:27,299 Stage-7 map = 67%,  reduce = 0%, Cumulative CPU 46.07 sec
2014-07-22 06:41:28,342 Stage-7 map = 56%,  reduce = 0%, Cumulative CPU 40.9 sec
2014-07-22 06:41:43,753 Stage-7 map = 61%,  reduce = 0%, Cumulative CPU 42.84 sec
2014-07-22 06:41:45,801 Stage-7 map = 100%,  reduce = 0%, Cumulative CPU 37.19 sec
MapReduce Total cumulative CPU time: 37 seconds 190 msec
Ended Job = job_1406011031680_0002 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://ec2-54-77-76-145.eu-west-1.compute.amazonaws.com:8088/proxy/application_1406011031680_0002/
Examining task ID: task_1406011031680_0002_m_01 (and more) from job job_1406011031680_0002
Examining task ID: task_1406011031680_0002_m_05 (and more) from job job_1406011031680_0002
Task with the most failures(4):
- Task ID: task_1406011031680_0002_m_08
URL:
Re: Using Parquet and Thrift in Hive
+ Re-sending as delivery of earlier mail failed.

On Fri, Jul 25, 2014 at 5:14 PM, Abhishek Agarwal abhishc...@gmail.com wrote: Hi all, is it possible to create a table with Parquet as the storage mechanism, with the schema being supplied from a Thrift IDL rather than the metastore? Something like below:

hive> CREATE EXTERNAL TABLE <table name>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES (
  "serialization.class" = "<IDL class>",
  "serialization.format" = "org.apache.thrift.protocol.TBinaryProtocol"
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

I want the schema to be generated through the ThriftSerDe and also use Parquet as storage. Would I need to write custom wrappers around the serde and input/output format?

--
Regards,
Abhishek Agarwal
Re: Hive User Group Meeting
Dear Hive users and developers, as an update, the Hive user group meeting during Hadoop World will be held on Oct. 15th, from 6:30 pm to 9:00 pm at about.com's office at 1500 Broadway, 6th floor, New York, NY 10036. Here is the schedule: 6:30 pm: doors open; 6:30 pm-7:00 pm: networking and refreshments; 7:00 pm-9:00 pm: talks (15 minutes each). Currently the event is open for talks/presentations, so please let me know if you have a proposal. The event will be announced at http://www.meetup.com/Hive-User-Group-Meeting/ soon. Please do sign up/RSVP if you plan to attend so that we can plan accordingly. Many thanks to about.com for hosting the event. Thanks also go to Edward Capriolo and his company, HuffPost, who also offered to host. Once we have a list of talks, I shall update you again. Thanks, and have a nice weekend! --Xuefu

On Mon, Jul 7, 2014 at 6:01 PM, Xuefu Zhang xzh...@cloudera.com wrote: Dear Hive users, the Hive community is considering a user group meeting during Hadoop World, which will be held in New York October 15-17th. To make this happen, your support is essential. First, I'm wondering if any user, especially those in the New York area, would be willing to host the meetup. Secondly, I'm soliciting talks from users as well as developers, so please propose or share your thoughts on the contents of the meetup. I will soon set up a meetup event to formally announce this. In the meantime, your suggestions, comments, and kind assistance are greatly appreciated. Sincerely, Xuefu
HiveServer2 Availability
Has anyone had any experience with a multiple-machine HiveServer2 setup? Hive needs to be available at all times for our use case, so if for some reason one of our HiveServer2 machines goes down or the connection drops, the user should be able to just re-connect to another machine. In the end, we'd like a system where someone just connects to, for example, hiveserver2.company.com and is automatically routed to a server round-robin style. Anyone have any thoughts on this subject? Thanks in advance.

--
Raymond Lau
Software Engineer - Intern | r...@ooyala.com | (925) 395-3806
Re: [parquet-dev] Re: Using Parquet and Thrift in Hive
This is not possible today. Can you file an enhancement and describe the motivation? Also, Parquet has moved to Apache: http://parquet.incubator.apache.org/ All questions and discussions should now be sent to d...@parquet.incubator.apache.org. Please subscribe by emailing dev-subscr...@parquet.incubator.apache.org.

On Fri, Jul 25, 2014 at 4:46 AM, Abhishek Agarwal abhishc...@gmail.com wrote: + Re-sending as delivery of earlier mail failed.
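Since the schema cannot yet come from the Thrift IDL, the usual workaround is to declare the columns in the metastore and keep Parquet as the storage format. A minimal sketch, with an illustrative table name and columns (not taken from the thread):

```sql
-- Columns are declared in the metastore; Parquet supplies the storage.
CREATE EXTERNAL TABLE user_events (
  id   BIGINT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
```

This loses the single-source-of-truth property of the IDL, which is exactly the gap the requested enhancement would close.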
Re: HiveServer2 Availability
I have put standard load balancers in front before, and wrote a Thrift "show tables" call as a test.

On Friday, July 25, 2014, Raymond Lau r...@ooyala.com wrote: Has anyone had any experience with a multiple-machine HiveServer2 setup?

--
Sorry, this was sent from mobile. Will do less grammar and spell checking than usual.