Re: Reg:Column Statistics with Parquet

2014-07-25 Thread Suma Shivaprasad
Hi ,

I tried the same with compute statistics for columns a, b, c as above and am
still seeing the same results in the explain plan.

How do I confirm whether it is generating all the column stats for a given
column? If this is confirmed, we can debug why Hive is still not using them.
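
(One possible check, as a sketch: DESCRIBE FORMATTED on the partition shows
the basic stats recorded in its parameters, and on recent-enough Hive
releases DESCRIBE FORMATTED on a single column reports the stored column
statistics; the column form and the version it first appears in are
assumptions here, so treat this as illustrative.)

describe formatted user_table partition(dt='2014-06-01',hour='00');
describe formatted user_table a;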

Thanks
Suma


On Thu, Jul 24, 2014 at 11:49 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 You have to explicitly specify the column list in the analyze command to
 gather column stats.

 This command will only collect basic stats like the number of rows, total
 file size, raw data size, and number of files:
 analyze table user_table partition(dt='2014-06-01',hour='00') compute
 statistics;

 To collect column statistics, add the column list like below:
 analyze table user_table partition(dt='2014-06-01',hour='00') compute
 statistics for columns a, b, c;

 Thanks
 Prasanth Jayachandran

 On Jul 24, 2014, at 5:13 AM, Sandeep Samudrala 
 sandeep.samudr...@inmobi.com wrote:

 I am trying to enable column statistics usage with Parquet tables. This is
 the query I am executing. However, on explain, I see that even though Basic
 stats: COMPLETE is shown, Column stats is shown as NONE.
 Can someone please explain what else I need to do to debug/fix this?

 set hive.compute.query.using.stats=true;
 set hive.stats.reliable=true;
 set hive.stats.fetch.column.stats=true;
 set hive.stats.fetch.partition.stats=true;
 set hive.cbo.enable=true;

 analyze table user_table partition(dt='2014-06-01',hour='00') compute
 statistics;

 explain select min(a), max(b), min(c) from user_table;

 hive> explain select min(a), max(b), min(c) from user_table;
 OK
 STAGE DEPENDENCIES:
   Stage-1 is a root stage
   Stage-0 is a root stage

 STAGE PLANS:
   Stage: Stage-1
 Map Reduce
   Map Operator Tree:
   TableScan
 alias: user_table
  Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
 Select Operator
   expressions: a (type: double), b (type: double), c (type:
 int)
   outputColumnNames: a, b, c
    Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
   Group By Operator
 aggregations: min(a), max(b), min(c)
 mode: hash
 outputColumnNames: _col0, _col1, _col2
  Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
 Reduce Output Operator
   sort order:
    Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
   value expressions: _col0 (type: double), _col1 (type:
 double), _col2 (type: int)
   Reduce Operator Tree:
 Group By Operator
   aggregations: min(VALUE._col0), max(VALUE._col1),
 min(VALUE._col2)
   mode: mergepartial
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE
 Column stats: NONE
   Select Operator
 expressions: _col0 (type: double), _col1 (type: double), _col2
 (type: int)
 outputColumnNames: _col0, _col1, _col2
 Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE
 Column stats: NONE
 File Output Operator
   compressed: false
   Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE
 Column stats: NONE
   table:
   input format: org.apache.hadoop.mapred.TextInputFormat
   output format:
 org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

   Stage: Stage-0
 Fetch Operator
   limit: -1


 Thanks,
 -sandeep





RE: Reg:Column Statistics with Parquet

2014-07-25 Thread Navdeep Agrawal
Well, not the correct way, but you can check the statistics in tables like
PART_COL_STATS in the metastore database, if you are using a MySQL-backed
stats database.
The other way is to call max, min, or distinct on int columns, largest
length on string columns, etc.; if a whole map-reduce job still runs for
these operations, then the statistics are not getting created.
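
(A sketch of that metastore check, assuming a MySQL-backed metastore; the
table and column names below match the Hive 0.13 metastore schema as I
recall it, but they may differ across releases:)

select TABLE_NAME, PARTITION_NAME, COLUMN_NAME, NUM_NULLS, NUM_DISTINCTS,
       LAST_ANALYZED
from PART_COL_STATS
where DB_NAME = 'default' and TABLE_NAME = 'user_table';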


Re: Hive shell code exception, urgent help needed

2014-07-25 Thread Sarfraz Ramay
Hi,

I have 4 instances on EC2:
1 master with the namenode and YARN running on it
1 for the secondary namenode, and 2 slaves

I used a t2.medium instance for the master node only and left the rest as
they were, and I still got the same exception. t2.medium is a decent
instance with 4 GB RAM and 2 CPUs, so I don't think this exception is
related to memory. Any other suggestions, please?

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.


On Fri, Jul 25, 2014 at 1:51 AM, Juan Martin Pampliega jpampli...@gmail.com
 wrote:

 Hi,
 The actual useful part of the error is:

 Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.mr.MapRedTask

 If you do a search for this plus EC2 on Google, you will find a couple of
 results that point to memory-exhaustion issues. You should try increasing
 the configured memory settings.
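
 (A sketch of the kind of settings to raise, assuming YARN container limits
 are what is being exhausted; the values are illustrative for a 4 GB node,
 not prescriptive:)

 <!-- yarn-site.xml: total memory YARN may hand out on each node -->
 <property>
   <name>yarn.nodemanager.resource.memory-mb</name>
   <value>3072</value>
 </property>
 <!-- mapred-site.xml: per-task container sizes -->
 <property>
   <name>mapreduce.map.memory.mb</name>
   <value>1024</value>
 </property>
 <property>
   <name>mapreduce.reduce.memory.mb</name>
   <value>2048</value>
 </property>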

 Since you are using a t2.micro, you should really try a bigger Amazon
 instance size. That will probably be a lot more useful than trying
 different configurations.
  On Jul 24, 2014 7:08 AM, Sarfraz Ramay sarfraz.ra...@gmail.com wrote:

 Can anyone please help with this ?

 [image: Inline image 1]


 I followed the advice here:
 http://stackoverflow.com/questions/20390217/mapreduce-job-in-headless-environment-fails-n-times-due-to-am-container-exceptio

 and added the following properties to mapred-site.xml, but I am still
 getting the same error.

 <property>
   <name>mapred.child.java.opts</name>
   <value>-Djava.awt.headless=true</value>
 </property>
 <!-- add headless to default -Xmx1024m -->
 <property>
   <name>yarn.app.mapreduce.am.command-opts</name>
   <value>-Djava.awt.headless=true -Xmx1024m</value>
 </property>
 <property>
   <name>yarn.app.mapreduce.am.admin-command-opts</name>
   <value>-Djava.awt.headless=true</value>
 </property>



 Regards,
 Sarfraz Rasheed Ramay (DIT)
 Dublin, Ireland.


 On Tue, Jul 22, 2014 at 8:19 AM, Sarfraz Ramay sarfraz.ra...@gmail.com
 wrote:

 Hi,

 I am using Hive 0.13.1 and Hadoop 2.2.0 on Amazon EC2 t2.micro
 instances. I have 4 instances: the master has the namenode and YARN, the
 secondary namenode is a separate instance, and the two slaves are on
 separate instances each.

 It was working fine till now, but it started to break when I tried to run
 the following query on 3 GB of TPC-H-generated data. The same query worked
 OK on 1 GB.

 SELECT
   l_orderkey
   , sum(l_extendedprice*(1-l_discount)) as revenue
   , o_orderdate
   , o_shippriority
 FROM
 customer c JOIN orders o
 ON (c.c_custkey = o.o_custkey)
 JOIN lineitem l
 on (l.l_orderkey = o.o_orderkey)
 WHERE
  o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15'
 AND c.c_mktsegment = 'AUTOMOBILE'
 GROUP BY
 l_orderkey, o_orderdate, o_shippriority
 HAVING
 sum(l_extendedprice*(1-l_discount)) > 38500 --average revenue
 --LIMIT 10;

 I have tried many things but nothing seems to work. I am also attaching
 my mapred-site.xml and yarn-site.xml files for reference, plus the error
 log. I have also tried to limit the memory settings in mapred-site.xml and
 yarn-site.xml, but nothing seems to be working. For full log details
 please see the attached hive.log file. Please help!

 Hadoop job information for Stage-7: number of mappers: 9; number of
 reducers: 0
 2014-07-22 06:39:31,643 Stage-7 map = 0%,  reduce = 0%
 2014-07-22 06:39:43,940 Stage-7 map = 6%,  reduce = 0%, Cumulative CPU
 5.34 sec
 2014-07-22 06:39:45,002 Stage-7 map = 11%,  reduce = 0%, Cumulative CPU
 6.94 sec
 2014-07-22 06:40:08,373 Stage-7 map = 17%,  reduce = 0%, Cumulative CPU
 12.6 sec
 2014-07-22 06:40:10,417 Stage-7 map = 22%,  reduce = 0%, Cumulative CPU
 14.06 sec
 2014-07-22 06:40:22,732 Stage-7 map = 28%,  reduce = 0%, Cumulative CPU
 24.46 sec
 2014-07-22 06:40:25,843 Stage-7 map = 33%,  reduce = 0%, Cumulative CPU
 25.74 sec
 2014-07-22 06:40:33,039 Stage-7 map = 44%,  reduce = 0%, Cumulative CPU
 33.32 sec
 2014-07-22 06:40:38,709 Stage-7 map = 56%,  reduce = 0%, Cumulative CPU
 37.19 sec
 2014-07-22 06:41:07,648 Stage-7 map = 61%,  reduce = 0%, Cumulative CPU
 42.83 sec
 2014-07-22 06:41:15,900 Stage-7 map = 56%,  reduce = 0%, Cumulative CPU
 39.49 sec
 2014-07-22 06:41:27,299 Stage-7 map = 67%,  reduce = 0%, Cumulative CPU
 46.07 sec
 2014-07-22 06:41:28,342 Stage-7 map = 56%,  reduce = 0%, Cumulative CPU
 40.9 sec
 2014-07-22 06:41:43,753 Stage-7 map = 61%,  reduce = 0%, Cumulative CPU
 42.84 sec
 2014-07-22 06:41:45,801 Stage-7 map = 100%,  reduce = 0%, Cumulative CPU
 37.19 sec
 MapReduce Total cumulative CPU time: 37 seconds 190 msec
 Ended Job = job_1406011031680_0002 with errors
 Error during job, obtaining debugging information...
 Job Tracking URL:
 http://ec2-54-77-76-145.eu-west-1.compute.amazonaws.com:8088/proxy/application_1406011031680_0002/
 Examining task ID: task_1406011031680_0002_m_01 (and more) from job
 job_1406011031680_0002
 Examining task ID: task_1406011031680_0002_m_05 (and more) from job
 job_1406011031680_0002

 Task with the most failures(4):
 -
 Task ID:
   task_1406011031680_0002_m_08

 URL:

 

Re: Using Parquet and Thrift in Hive

2014-07-25 Thread Abhishek Agarwal
+ Re-sending as delivery of earlier mail failed.


On Fri, Jul 25, 2014 at 5:14 PM, Abhishek Agarwal abhishc...@gmail.com
wrote:

 Hi All,
 Is it possible to create a table with Parquet as the storage mechanism,
 with the schema being supplied from a Thrift IDL rather than the metastore?

 Something like below,


 hive> CREATE EXTERNAL TABLE <Table Name>
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
 WITH serdeproperties (
   'serialization.class' = '<IDL Class>',
   'serialization.format' = 'org.apache.thrift.protocol.TBinaryProtocol'
 ) STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';



 I want the schema to be generated through the ThriftSerDe and also use
 Parquet as storage. Would I need to write custom wrappers around the serde
 and input/output format?


 --
 Regards,
 Abhishek Agarwal




-- 
Regards,
Abhishek Agarwal


Re: Hive User Group Meeting

2014-07-25 Thread Xuefu Zhang
Dear Hive users and developers,

As an update, the Hive user group meeting during Hadoop World will be held
on Oct. 15th, from 6:30 pm to 9:00 pm at about.com's office at 1500
Broadway, 6th floor, New York, NY 10036. Here is the schedule:

6:30 pm: Doors open
6:30 pm - 7:00 pm: Networking and refreshments
7:00 pm - 9:00 pm: Talks (15 minutes each)

The event is currently open to talk/presentation proposals. Please let me
know if you have one.

The event will be announced at
http://www.meetup.com/Hive-User-Group-Meeting/ soon. Please do RSVP if you
plan to attend so that we can plan accordingly.

Many thanks to about.com for hosting the event. Also thanks go to Edward
Capriolo and his company, HuffPost, as they also offered to host.

Once we have a list of talks, I shall update you again.

Thanks and have a nice weekend!

--Xuefu



On Mon, Jul 7, 2014 at 6:01 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 Dear Hive users,

 The Hive community is considering a user group meeting during Hadoop
 World, which will be held in New York October 15-17th. To make this
 happen, your support is essential. First, I'm wondering if any user,
 especially those in the New York area, would be willing to host the
 meetup. Secondly, I'm soliciting talks from users as well as developers,
 so please propose or share your thoughts on the contents of the meetup.

 I will soon set up a meetup event to formally announce this. In the
 meantime, your suggestions, comments, and kind assistance are greatly
 appreciated.

 Sincerely,

 Xuefu



HiveServer2 Availability

2014-07-25 Thread Raymond Lau
Has anyone had any experience with a multiple-machine HiveServer2 setup?
Hive needs to be available at all times for our use case, so if for some
reason one of our HiveServer2 machines goes down or the connection drops,
the user should be able to just re-connect to another machine.

In the end, we'd like a system where someone just connects to, for example,
hiveserver2.company.com and the user is automatically routed to a server
round-robin style.

Anyone have any thoughts on this subject?

Thanks in advance.

-- 
*Raymond Lau*
Software Engineer - Intern |
r...@ooyala.com | (925) 395-3806


Re: [parquet-dev] Re: Using Parquet and Thrift in Hive

2014-07-25 Thread Brock Noland
This is not possible today. Can you file an enhancement and describe
the motivation?
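
(What does work today, as a sketch: declare the schema in the metastore
rather than deriving it from the Thrift IDL. This assumes Hive 0.13+ for the
STORED AS PARQUET shorthand; the table, column, and path names are
illustrative.)

hive> CREATE EXTERNAL TABLE my_table (a BIGINT, b STRING)
      STORED AS PARQUET
      LOCATION '/path/to/parquet/data';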

Also Parquet has moved to apache: http://parquet.incubator.apache.org/

All questions and discussions should now be sent to
d...@parquet.incubator.apache.org please subscribe by emailing
dev-subscr...@parquet.incubator.apache.org



Re: HiveServer2 Availability

2014-07-25 Thread Edward Capriolo
I have put standard load balancers in front before and wrote a Thrift
'show tables' as a health-check test.
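
(A sketch of such a check, assuming beeline is on the PATH and HiveServer2
is listening on its default port 10000; the host name is illustrative:)

#!/bin/sh
# Probe a HiveServer2 backend by running SHOW TABLES through beeline;
# a non-zero exit can mark the backend unhealthy for the load balancer.
beeline -u 'jdbc:hive2://hs2-1.company.com:10000' -e 'SHOW TABLES;' \
  >/dev/null 2>&1 || exit 1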



-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.