Re: different outer join plan between hive 0.9 and hive 0.10

2013-03-27 Thread Navis류승우
It's a bug (https://issues.apache.org/jira/browse/HIVE-4206).

Thanks for reporting it.

2013/3/24 wzc :
> Recently we tried to upgrade our hive from 0.9 to 0.10, but found that some of
> our hive queries ran almost 7 times slower.  One such query consists of multiple
> table outer joins on the same key.  By looking into the query, we found that the
> query plans generated by hive 0.9 and hive 0.10 are different.  Here is the
> example:
>
> testcase:
>
> use default;
> create table test_join (
>   `key` string,
>   `value` string
> );
>
> explain select
> sum(a.value) val
> from default.test_join a
> left outer join default.test_join b on a.key = b.key
> left outer join default.test_join c on a.key = c.key
> left outer join default.test_join d on a.key = d.key
> left outer join default.test_join e on a.key = e.key
> left outer join default.test_join f on a.key = f.key
> left outer join default.test_join g on a.key = g.key
>
>
> the explain of hive 0.9:
>
>  STAGE DEPENDENCIES:
>
> Stage-1 is a root stage
>
> Stage-2 depends on stages: Stage-1
>
> Stage-0 is a root stage
>
> ...
>
> Reduce Operator Tree:
>
>   Join Operator
>
> condition map:
>
>  Left Outer Join0 to 1
>
>  Left Outer Join0 to 2
>
>  Left Outer Join0 to 3
>
>  Left Outer Join0 to 4
>
>  Left Outer Join0 to 5
>
>  Left Outer Join0 to 6
>
> condition expressions:
>
>   0 {VALUE._col1}
>
>   1
>
>   2
>
>   3
>
>   4
>
>   5
>
>   6
>
> ..
>
>
> while the explain of hive 0.10:
>
>   STAGE DEPENDENCIES:
>
> Stage-6 is a root stage
>
> Stage-1 depends on stages: Stage-6
>
> Stage-2 depends on stages: Stage-1
>
> Stage-0 is a root stage
>
> ...
>
>  Reduce Operator Tree:
>
>   Join Operator
>
> condition map:
>
>  Left Outer Join0 to 1
>
>  Left Outer Join0 to 2
>
> condition expressions:
>
>   0 {VALUE._col0} {VALUE._col1}
>
>   1
>
>   2
>
> ...
>
>  Reduce Operator Tree:
>
>   Join Operator
>
> condition map:
>
>  Left Outer Join0 to 1
>
>  Left Outer Join0 to 2
>
>  Left Outer Join0 to 3
>
>  Left Outer Join0 to 4
>
> condition expressions:
>
>   0 {VALUE._col9}
>
>   1
>
>   2
>
>   3
>
>   4
>
> 
>
>
> It seems like hive 0.9 uses only one stage/job to process all the outer joins, but
> hive 0.10 splits them into two stages.  When running this kind of query on
> hive 0.10 in production, some reducers get stuck in the second stage of the
> outer join processing.
>
> I can't find any parameter to change the query plan.  Can anyone give me some
> hint?
>
> Thanks!
>
>
>
>


Re: Setting to make all tables external by default

2013-03-27 Thread Nitin Pawar
Somebody answered this before on another mail thread:

set the external property to default to true in the metastore, and that should
solve it.
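
If no metastore-wide default turns out to be available in your version, the
per-table workaround is a sketch along these lines (the table name and path
are placeholders):

-- create the table as external up front
create external table my_table (id string, val string)
location '/user/hive/external/my_table';

-- or flip an already-created managed table to external later
alter table my_table set tblproperties ('EXTERNAL'='TRUE');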


On Thu, Mar 28, 2013 at 1:43 AM, Babe Ruth wrote:

>
> Hello,
>
> Is there a setting where all tables created will default to external? I
> know this isn't ideal long term but I am setting up the environment and
> would rather start with all tables being external then switch them to
> managed if necessary.
>
> Thanks, George
>
>


-- 
Nitin Pawar


Re: hive hbase storage handler fail

2013-03-27 Thread Jason Shih
Hi Sanjay,


thanks for the info.  Indeed, I did raise the log level earlier to avoid
dumping too much info into the app log, and I got much the same info and exception
after lowering it back down to INFO.  Also, I am trying this with CDH 4.1.2, except
that I am running MapReduce (MR1) rather than YARN.  The exception is observed only
with MapReduce (MR1); with YARN I can successfully insert the data into HBase, with
the mapper tasks launched following the insert statement.


I also forgot to paste the log observed on the datanodes; hereafter is the
java.io.IOException stack trace read from a datanode log file:

---
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:522)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:373)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.Child.main



Cheers,
Jason

On Mar 28, 2013, at 6:18 AM, Sanjay Subramanian wrote:

> If you can run your hive insert data script with debug option u may get
> some clues
> 
> /usr/lib/hive/bin/hive -hiveconf hive.root.logger=INFO,console -e "insert
> into dest select * from some_table_same_structure_as_dest limit 10;"
> 
> I created a small demo usecase and this is failing for me as well
> The error I get is
> Unable to retrieve URL for Hadoop Task logs. port out of range:99
> 13/03/27 14:02:32 ERROR exec.Task: Unable to retrieve URL for Hadoop Task
> logs. port out of range:99
> 
> I am using ClouderaManager 4.1.2 Hadoop, Yarn, HDFS, Oozie, Hive and Hue
> 
> sanjay
> 
> 
> On 3/27/13 1:25 PM, "Jason Shih"  wrote:
> 
>> 
>> Hi all,
>> 
>> 
>> I try inserting data into hive table storing by hbase storage handler but
>> fail with the following exception end of mapreduce stage info.
>> 
>> however, we didn't have problem accessing HbaseStorageHandler if force
>> accessing YARN rather than mapreduce. (hive: 0.9.0, hbase: 0.92.1).
>> 
>> could expert shed some light how we tweak the settings if would like to
>> initiate the mapper tasks to insert data leveraging HBaseStorageHandler.
>> Thanks.
>> 
>> 
>> hereafter simple hive script tried before:
>> 
>> CREATE TABLE dest(num string, name string)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name")
>> TBLPROPERTIES ("hbase.table.name" = "dest");
>> 
>> 
>> and the exception:
>> 
>> ...
>> 2013-03-28 03:49:41,107 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU
>> 0.67 sec
>> 2013-03-28 03:49:42,125 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU
>> 0.67 sec
>> 2013-03-28 03:49:43,143 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU
>> 0.67 sec
>> 2013-03-28 03:49:44,153 Stage-0 map = 0%,  reduce = 0%
>> 2013-03-28 03:49:47,184 Stage-0 map = 100%,  reduce = 100%
>> MapReduce Total cumulative CPU time: 670 msec
>> Ended Job = job_201303251402_0005 with errors
>> Error during job, obtaining debugging information...
>> Examining task ID: task_201303251402_0005_m_02 (and more) from job
>> job_201303251402_0005
>> Exception in thread "Thread-35" java.lang.NullPointerException
>>  at
>> org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Sh
>> ims.java:44)
>>  at
>> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(Jo
>> bDebugger.java:186)
>>  at
>> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger
>> .java:142)
>>  at java.lang.Thread.run(Thread.java:662)
>>  No encryption was performed by peer.
>> FAILED: Execution Error, return code 2 from
>> org.apache.hadoop.hive.ql.exec.MapRedTask
>> MapReduce Jobs Launched:
>> Job 0: Map: 1   Cumulative CPU: 0.67 sec   HDFS Read: 0 HDFS Write: 0 FAIL
>> 
>> 
>> 
>> Cheers,
>> Jason



hive.limit.optimize.fetch.max

2013-03-27 Thread Sanjay Subramanian
Hi
I have the following settings in hive-site.xml:

<property>
  <name>hive.limit.row.max.size</name>
  <value>10</value>
</property>

<property>
  <name>hive.limit.optimize.enable</name>
  <value>true</value>
</property>

<property>
  <name>hive.limit.optimize.fetch.max</name>
  <value>11</value>
</property>


When I do a SELECT query with a WHERE clause, it does not limit the results
to 10.
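
For reference, the same settings expressed as session-level commands (a sketch;
the property names mirror the hive-site.xml above, and my_table/some_col are
placeholders):

set hive.limit.row.max.size=10;
set hive.limit.optimize.enable=true;
set hive.limit.optimize.fetch.max=11;

-- as far as I know these properties only tune Hive's optimization of queries
-- that already contain a LIMIT clause; they do not add a LIMIT on their own
select * from my_table where some_col = 'x' limit 10;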

How do you limit the SELECT query results to 10 rows?

My end goal is to create a group in Hive in production that allows only a
maximum of 10 rows to be returned… because I don't want business analysts
going SQL-happy on Beeswax and bringing my server down with heavy-duty
queries.

Thanks
sanjay





Re: hive hbase storage handler fail

2013-03-27 Thread Sanjay Subramanian
If you can run your hive insert-data script with the debug option, you may get
some clues:

/usr/lib/hive/bin/hive -hiveconf hive.root.logger=INFO,console -e "insert
into dest select * from some_table_same_structure_as_dest limit 10;"

I created a small demo use case and it is failing for me as well.
The error I get is:
Unable to retrieve URL for Hadoop Task logs. port out of range:99
13/03/27 14:02:32 ERROR exec.Task: Unable to retrieve URL for Hadoop Task
logs. port out of range:99

I am using Cloudera Manager 4.1.2 with Hadoop, YARN, HDFS, Oozie, Hive and Hue.

sanjay


On 3/27/13 1:25 PM, "Jason Shih"  wrote:

>
>Hi all,
>
>
>I try inserting data into hive table storing by hbase storage handler but
>fail with the following exception end of mapreduce stage info.
>
>however, we didn't have problem accessing HbaseStorageHandler if force
>accessing YARN rather than mapreduce. (hive: 0.9.0, hbase: 0.92.1).
>
>could expert shed some light how we tweak the settings if would like to
>initiate the mapper tasks to insert data leveraging HBaseStorageHandler.
>Thanks.
>
>
>hereafter simple hive script tried before:
>
>CREATE TABLE dest(num string, name string)
>STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name")
>TBLPROPERTIES ("hbase.table.name" = "dest");
>
>
>and the exception:
>
>...
>2013-03-28 03:49:41,107 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU
>0.67 sec
>2013-03-28 03:49:42,125 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU
>0.67 sec
>2013-03-28 03:49:43,143 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU
>0.67 sec
>2013-03-28 03:49:44,153 Stage-0 map = 0%,  reduce = 0%
>2013-03-28 03:49:47,184 Stage-0 map = 100%,  reduce = 100%
>MapReduce Total cumulative CPU time: 670 msec
>Ended Job = job_201303251402_0005 with errors
>Error during job, obtaining debugging information...
>Examining task ID: task_201303251402_0005_m_02 (and more) from job
>job_201303251402_0005
>Exception in thread "Thread-35" java.lang.NullPointerException
>   at
>org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Sh
>ims.java:44)
>   at
>org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(Jo
>bDebugger.java:186)
>   at
>org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger
>.java:142)
>   at java.lang.Thread.run(Thread.java:662)
>   No encryption was performed by peer.
>FAILED: Execution Error, return code 2 from
>org.apache.hadoop.hive.ql.exec.MapRedTask
>MapReduce Jobs Launched:
>Job 0: Map: 1   Cumulative CPU: 0.67 sec   HDFS Read: 0 HDFS Write: 0 FAIL
>
>
>
>Cheers,
>Jason
>
>





hive hbase storage handler fail

2013-03-27 Thread Jason Shih

Hi all,


I am trying to insert data into a Hive table stored by the HBase storage handler,
but it fails with the exception below at the end of the MapReduce stage info.

However, we didn't have any problem using the HBaseStorageHandler when forcing the
job onto YARN rather than MapReduce. (hive: 0.9.0, hbase: 0.92.1)

Could an expert shed some light on how to tweak the settings so that the mapper
tasks are launched to insert the data via the HBaseStorageHandler?  Thanks.


Here is the simple Hive script tried beforehand:

CREATE TABLE dest(num string, name string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name")
TBLPROPERTIES ("hbase.table.name" = "dest");
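
The insert that triggers the failure is along these lines (a sketch; the source
table name is a placeholder, the real one is omitted here):

insert overwrite table dest
select num, name from some_source_table limit 10;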


and the exception: 

...
2013-03-28 03:49:41,107 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU 0.67 sec
2013-03-28 03:49:42,125 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU 0.67 sec
2013-03-28 03:49:43,143 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU 0.67 sec
2013-03-28 03:49:44,153 Stage-0 map = 0%,  reduce = 0%
2013-03-28 03:49:47,184 Stage-0 map = 100%,  reduce = 100%
MapReduce Total cumulative CPU time: 670 msec
Ended Job = job_201303251402_0005 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201303251402_0005_m_02 (and more) from job 
job_201303251402_0005
Exception in thread "Thread-35" java.lang.NullPointerException
   at 
org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
   at 
org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
   at 
org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
   at java.lang.Thread.run(Thread.java:662)
   No encryption was performed by peer.
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.67 sec   HDFS Read: 0 HDFS Write: 0 FAIL



Cheers,
Jason



Setting to make all tables external by default

2013-03-27 Thread Babe Ruth


Hello,



Is there a setting where all tables created will default to external?  I know
this isn't ideal long term, but I am setting up the environment and would rather
start with all tables being external and then switch them to managed if necessary.



Thanks, George



  

Re: MySQL instance on hadoop name node server in production ?

2013-03-27 Thread Sanjay Subramanian
Thanks Nitin…the mysql replication is a good point
Regards
sanjay

From: Nitin Pawar <nitinpawar...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, March 27, 2013 11:36 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: MySQL instance on hadoop name node server in production ?

few writes understood but many reads will depends on how big your metastore 
will grow.

also what size of hdfs you are looking at that will play a huge factor in this 
cause namenode will contain the FSImage in memory so you need a larger memory 
given to namenode. Potentially if its a live cluster then I guess your namenode 
will be under heavy loads if you are setting up a production system.

i would still suggest for metastore you setup a couple of VMs with mysql 
replication enabled. so the reads can be distributed with load balancers etc


On Thu, Mar 28, 2013 at 12:00 AM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:
Thanks Nitin.
The mysql instance is for hive metastore only so few writes, many reads.
Regards
sanjay

From: Nitin Pawar <nitinpawar...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, March 27, 2013 11:13 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: MySQL instance on hadoop name node server in production ?

ideally when you setup namenode and jobtracker in any hadoop cluster on 
production those are normally dedicated to that purpose only.

Depending on how frequently you are going to run queries and how much data the 
hdfs is going to hold is key factor in deciding this.


On Wed, Mar 27, 2013 at 11:32 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:
Hi all

I am planning to install mysql server (as hive metastore) on the same box
as my name node.
My name node has 16GB RAM and hopefully I can get 2TB
Any problems with mysql on the same node as the name node?

Thanks
sanjay






--
Nitin Pawar




--
Nitin Pawar



Re: MySQL instance on hadoop name node server in production ?

2013-03-27 Thread Nitin Pawar
Few writes understood, but how many reads you get will depend on how big your
metastore grows.

Also, what size of HDFS you are looking at will play a huge factor in this,
because the namenode keeps the FSImage in memory, so you need to give the
namenode more memory.  If it's a live cluster, then I guess your namenode will
be under heavy load if you are setting up a production system.

I would still suggest that for the metastore you set up a couple of VMs with
MySQL replication enabled, so the reads can be distributed with load balancers
etc.


On Thu, Mar 28, 2013 at 12:00 AM, Sanjay Subramanian <
sanjay.subraman...@wizecommerce.com> wrote:

>  Thanks Nitin.
> The mysql instance is for hive metastore only so few writes, many reads.
> Regards
> sanjay
>
>   From: Nitin Pawar 
> Reply-To: "user@hive.apache.org" 
> Date: Wednesday, March 27, 2013 11:13 AM
> To: "user@hive.apache.org" 
> Subject: Re: MySQL instance on hadoop name node server in production ?
>
>   ideally when you setup namenode and jobtracker in any hadoop cluster on
> production those are normally dedicated to that purpose only.
>
>  Depending on how frequently you are going to run queries and how much
> data the hdfs is going to hold is key factor in deciding this.
>
>
> On Wed, Mar 27, 2013 at 11:32 PM, Sanjay Subramanian <
> sanjay.subraman...@wizecommerce.com> wrote:
>
>> Hi all
>>
>> I am planning to install mysql server (as hive metastore) on the same box
>> as my name node.
>> My name node has 16GB RAM and hopefully I can get 2TB
>> Any problems with mysql on the same node as the name node?
>>
>> Thanks
>> sanjay
>>
>>
>>
>>
>
>
>  --
> Nitin Pawar
>
>



-- 
Nitin Pawar


Re: MySQL instance on hadoop name node server in production ?

2013-03-27 Thread Sanjay Subramanian
Thanks Nitin.
The mysql instance is for hive metastore only so few writes, many reads.
Regards
sanjay

From: Nitin Pawar <nitinpawar...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, March 27, 2013 11:13 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: MySQL instance on hadoop name node server in production ?

ideally when you setup namenode and jobtracker in any hadoop cluster on 
production those are normally dedicated to that purpose only.

Depending on how frequently you are going to run queries and how much data the 
hdfs is going to hold is key factor in deciding this.


On Wed, Mar 27, 2013 at 11:32 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:
Hi all

I am planning to install mysql server (as hive metastore) on the same box
as my name node.
My name node has 16GB RAM and hopefully I can get 2TB
Any problems with mysql on the same node as the name node?

Thanks
sanjay






--
Nitin Pawar



Re: MySQL instance on hadoop name node server in production ?

2013-03-27 Thread Nitin Pawar
Ideally, when you set up the namenode and jobtracker in any production Hadoop
cluster, those machines are dedicated to that purpose only.

How frequently you are going to run queries and how much data HDFS is going to
hold are the key factors in deciding this.


On Wed, Mar 27, 2013 at 11:32 PM, Sanjay Subramanian <
sanjay.subraman...@wizecommerce.com> wrote:

> Hi all
>
> I am planning to install mysql server (as hive metastore) on the same box
> as my name node.
> My name node has 16GB RAM and hopefully I can get 2TB
> Any problems with mysql on the same node as the name node?
>
> Thanks
> sanjay
>
>
>
>


-- 
Nitin Pawar


MySQL instance on hadoop name node server in production ?

2013-03-27 Thread Sanjay Subramanian
Hi all

I am planning to install mysql server (as hive metastore) on the same box
as my name node.
My name node has 16GB RAM and hopefully I can get 2TB
Any problems with mysql on the same node as the name node?

Thanks
sanjay





RE: S3/EMR Hive: Load contents of a single file

2013-03-27 Thread Tony Burton

No problem Keith - it was a worthwhile exercise for me to go back and double-check
that everything was working as expected.




-Original Message-
From: Keith Wiley [mailto:kwi...@keithwiley.com] 
Sent: 27 March 2013 17:03
To: user@hive.apache.org
Subject: Re: S3/EMR Hive: Load contents of a single file

Okay, I also saw your previous response which analyzed queries into two tables 
built around two files in the same directory.  I guess I was simply wrong in my 
understanding that a Hive table is fundamentally associated with a directory 
instead of a file.  Turns out, it be can either one.  A directory table uses 
all files in the directory while a file table uses one specific file and 
properly avoids sibling files.  My bad.

Thanks for the careful analysis and clarification.  TIL!

Cheers!

On Mar 27, 2013, at 02:58 , Tony Burton wrote:

> A bit more info - do an extended description of the table:
>  
> $ desc extended gsrc1;
>  
> And the "location" field is "location:s3://mybucket/path/to/data/src1.txt"
>  
> Do the same on a table created with a location pointing at the directory and 
> the same info gives (not surprisingly) "location:s3://mybucket/path/to/data/"
> 


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with 
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson 





Re: S3/EMR Hive: Load contents of a single file

2013-03-27 Thread Keith Wiley
Okay, I also saw your previous response which analyzed queries into two tables 
built around two files in the same directory.  I guess I was simply wrong in my 
understanding that a Hive table is fundamentally associated with a directory 
instead of a file.  Turns out, it can be either one.  A directory table uses
all files in the directory while a file table uses one specific file and 
properly avoids sibling files.  My bad.

Thanks for the careful analysis and clarification.  TIL!

Cheers!

On Mar 27, 2013, at 02:58 , Tony Burton wrote:

> A bit more info - do an extended description of the table:
>  
> $ desc extended gsrc1;
>  
> And the “location” field is “location:s3://mybucket/path/to/data/src1.txt”
>  
> Do the same on a table created with a location pointing at the directory and 
> the same info gives (not surprisingly) “location:s3://mybucket/path/to/data/”
> 


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson




RE: S3/EMR Hive: Load contents of a single file

2013-03-27 Thread Tony Burton
A bit more info - do an extended description of the table:

$ desc extended gsrc1;

And the "location" field is "location:s3://mybucket/path/to/data/src1.txt"

Do the same on a table created with a location pointing at the directory and 
the same info gives (not surprisingly) "location:s3://mybucket/path/to/data/"
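
A side note, assuming the Hive build on EMR supports it: "describe formatted"
prints the same metadata, including the location, one field per line, which can
be a bit easier to read than "desc extended":

$ describe formatted gsrc1;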





From: Tony Burton [mailto:tbur...@sportingindex.com]
Sent: 27 March 2013 08:46
To: 'user@hive.apache.org'
Subject: RE: S3/EMR Hive: Load contents of a single file

Thanks for the reply Keith.

> you could have dispensed with the additional "alter table" business and 
> simply created the original table around the directory in the first place

Yep, but I have multiple files in that directory and wanted to create a table 
based upon one file per table.

> Do you know for certain that it isn't using other files also in that 
> directory as part of the same table
> or if it is currently empty, that if you add a new file to the directory 
> after creating the table in your
> described fashion, it doesn't immediately become visible as part of the table?

I've got two files in my s3://mybucket/path/to/data/ directory, 
s3://mybucket/path/to/data/src1.txt and s3://mybucket/path/to/data/src2.txt - 
both contain lists of ~-separated date/count pairs, eg 20130101~12345. Both 
contain data for just the month of February this year.

Create two tables:

$ create external table gsrc1 (gdate string, c int) row format delimited fields 
terminated by '~' stored as textfile;
$ alter table gsrc1 set location 's3://spinmetrics/global/src1.txt';
$ create external table gsrc2 (gdate string, c int) row format delimited fields 
terminated by '~' stored as textfile;
$ alter table gsrc2 set location 's3://spinmetrics/global/src2.txt';

Count(*) on each table:

$ select count(*) from gsrc1:
28
$ select count(*) from gsrc2:
28

Ok, but both tables could be pointing at the same data. Check max, min and 
first/last entry from both tables:

$ select min(c), max(c) from gsrc1;
2935 23130
$ select min(c), max(c) from gsrc2;
865953 2768868

$ select * from gsrc1 where gdate="20130201"
20130201 5153
$ select * from gsrc1 where gdate="20130228"
20130228 7051
$ select * from gsrc2 where gdate="20130201"
20130201 1472017
$ select * from gsrc2 where gdate="20130228"
20130228 1323241

And without copying in the whole data set I am 100% confident that these values 
match the contents of the individual files in s3. Maybe other readers could try 
a similar exercise and present their results? Are there other tests I could try 
to further verify my findings?

Tony





-Original Message-
From: Keith Wiley [mailto:kwi...@keithwiley.com]
Sent: 26 March 2013 19:40
To: user@hive.apache.org
Subject: Re: S3/EMR Hive: Load contents of a single file

Are you sure this is doing what you think it's doing? Since Hive associates 
tables with directories (well external tables at least, I'm not very familiar 
with internal tables), my suspicion is that even if your approach described 
below works, what Hive actually did was use s3://mybucket/path/to/data/ as the 
table location...in which case you could have dispensed with the additional 
"alter table" business and simply created the original table around the 
directory in the first place...or I could be completely wrong. Do you know for 
certain that it isn't using other files also in that directory as part of the 
same table...or if it is currently empty, that if you add a new file to the 
directory after creating the table in your described fashion, it doesn't 
immediately become visible as part of the table? I eagerly await clarification.

On Mar 26, 2013, at 10:39 , Tony Burton wrote:

>
> Thanks for the quick reply Sanjay.
>
> ALTER TABLE is the key, but slightly different to your suggestion. I create 
> the table as before, but don't specify location:
>
> $ create external table myData (str1 string, str2 string, count1 int)
> partitioned by  row format  stored as textfile;
>
> Then use ALTER TABLE like this:
>
> $ ALTER TABLE myData SET LOCATION '
> s3://mybucket/path/to/data/src1.txt ';
>
> Bingo, I can now run queries with myData in the same way I can when the 
> LOCATION is a directory. Cool!
>
> Tony
>
>
>
>
>
>
>
> From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
> Sent: 26 March 2013 17:22
> To: user@hive.apache.org
> Subject: Re: S3/EMR Hive: Load contents of a single file
>
> Hi Tony
>
> Can u create the table without any location.
>
> After that you could do an ALTER TABLE add location and partition
>
> ALTER TABLE myData ADD PARTITION (partitionColumn1='$value1' , 
> partitionColumn2='$value2') LOCATION '/path/to/your/directory/in/hdfs';"
>
>
> An example Without Partitions
> -
> ALTER TABLE myData SET LOCATION 
> 'hdfs://10.48.97.97:9000/path/to/your/data/directory/in/hdfs';"
>
>
> While specifying location, you have to point to a directory. You cannot point 
> to a fil

Re: how to make data statistics efficiency in hive?

2013-03-27 Thread 周梦想
Hi Jan,

You are so great! This expression is just what I'm looking for.

Thank you very much!

Best Regards,
Andy Zhou
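
A side note, sketched here for the archive: the count(a=1) form from the
original question cannot work because count() counts non-NULL values, and the
comparison a=1 is non-NULL whether it is true or false, so it would count every
row where a is not NULL rather than just the rows where a equals 1.  The
sum(if(...)) form counts only the rows where the condition is true:

select
  count(a = 1)         as rows_with_non_null_a,   -- true and false both counted
  sum(if(a = 1, 1, 0)) as rows_where_a_equals_1   -- only matching rows counted
from mytable;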

2013/3/27 Jan Dolinár 

> Hi Andy,
>
> I'm not sure if I entirely understood your question, but I think you're
> looking for something like this:
>
> select
> concat(date,':',uid),
> sum(1) as total,
> sum(if(a=1,1,0)) AS ca,
> sum(if(b=1,1,0)) AS cb,
> sum(if(c=1,1,0)) AS cc
> from mytable
> group by uid, date;
>
> Query like this allows you to efficiently compute sums of multiple columns
> in a single map-reduce.
>
> Best regards,
> Jan
>
>
> On Wed, Mar 27, 2013 at 4:34 AM, 周梦想  wrote:
>
>> hello,
>> about hsql statistics.
>>
>> table mytable
>> date,uid,a,b,c
>> 
>> 03/13/13   185690475  0   1   1
>> 03/13/13   187270278  0   1   0
>> 03/13/13   185690475  1   1   0
>> 03/13/13   186012530  1   0   1
>> 03/13/13   180286243  0   1   0
>> 03/13/13   185690475  1   1   0
>> 03/13/13   186012530  0   1   0
>> 03/13/13   183256782  1   0   0
>> 03/14/13   185690475  0   0   1
>>
>> I want to get one day,each user total count,count a=1 ,count b=1, count
>> c=1
>> the out put should like:
>>
>> key,total, counta, countb, countc
>> ---
>> 03/13/13:185690475  3  2   3   1
>> 03/13/13:187270278  1  0   1   0
>> 03/13/13:186012530  2  1   1   1
>> 03/13/13:180286243  1  0   1   0
>> 03/13/13:183256782  1  1   0   0
>> 03/14/13:185690475  1  0   0   1
>>
>> the hsql i want is:
>> select concat(date:uid),count(1),count(a=1),count(b=1),count(c=1) from
>> mytable group by uid,date;
>>
>> but I have to write ugly and inefficiency hsql like :
>> select  concat(s1.date:s1.uid),s1.total,s2.ca,s3.cb,s4.cc from
>> (
>> select date,uid,count(1) total total from mytable group by uid,date) s1
>> inner outer join
>> (select date,uid,count(1) ca total from mytable where a=1 group by
>> uid,date)s2
>> inner outer join
>> (select date,uid,count(1) cb total from mytable where b=1 group by
>> uid,date)s3
>> inner outer join
>> (select date,uid,count(1) cc total from mytable where c=1 group by
>> uid,date)s4
>> );
>>
>> each select sub-clause should run a map-reduce.
>>
>> if I have to count a very big number of  columns table, this should be a
>> very long task.
>> some one have any good ideals?
>>
>> Thank you!
>>
>> Best Regards,
>> Andy Zhou
>>
>
>


RE: S3/EMR Hive: Load contents of a single file

2013-03-27 Thread Tony Burton
Thanks for the reply Keith.

> you could have dispensed with the additional "alter table" business and 
> simply created the original table around the directory in the first place

Yep, but I have multiple files in that directory and wanted to create a table 
based upon one file per table.

> Do you know for certain that it isn't using other files also in that 
> directory as part of the same table
> or if it is currently empty, that if you add a new file to the directory 
> after creating the table in your
> described fashion, it doesn't immediately become visible as part of the table?

I've got two files in my s3://mybucket/path/to/data/ directory, 
s3://mybucket/path/to/data/src1.txt and s3://mybucket/path/to/data/src2.txt - 
both contain lists of ~-separated date/count pairs, eg 20130101~12345. Both 
contain data for just the month of February this year.

Create two tables: 

$ create external table gsrc1 (gdate string, c int) row format delimited fields 
terminated by '~' stored as textfile;
$ alter table gsrc1 set location 's3://spinmetrics/global/src1.txt';
$ create external table gsrc2 (gdate string, c int) row format delimited fields 
terminated by '~' stored as textfile;
$ alter table gsrc2 set location 's3://spinmetrics/global/src2.txt';

Count(*) on each table:

$ select count(*) from gsrc1:
28
$ select count(*) from gsrc2:
28

Ok, but both tables could be pointing at the same data. Check max, min and 
first/last entry from both tables:

$ select min(c), max(c) from gsrc1;
2935	23130
$ select min(c), max(c) from gsrc2;
865953	2768868

$ select * from gsrc1 where gdate="20130201"
20130201	5153
$ select * from gsrc1 where gdate="20130228"
20130228	7051
$ select * from gsrc2 where gdate="20130201"
20130201	1472017
$ select * from gsrc2 where gdate="20130228"
20130228	1323241

And without copying in the whole data set I am 100% confident that these values 
match the contents of the individual files in s3. Maybe other readers could try 
a similar exercise and present their results? Are there other tests I could try 
to further verify my findings?
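
One further check I could try (src3.txt is a hypothetical extra file): upload a
third file into the same directory and re-run the counts; if the tables really
are bound to individual files, the numbers should not change.

$ select count(*) from gsrc1;
-- still expected to return 28 if gsrc1 only reads src1.txt
$ select count(*) from gsrc2;
-- likewise expected to stay at 28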

Tony





-Original Message-
From: Keith Wiley [mailto:kwi...@keithwiley.com] 
Sent: 26 March 2013 19:40
To: user@hive.apache.org
Subject: Re: S3/EMR Hive: Load contents of a single file

Are you sure this is doing what you think it's doing?  Since Hive associates 
tables with directories (well external tables at least, I'm not very familiar 
with internal tables), my suspicion is that even if your approach described 
below works, what Hive actually did was use s3://mybucket/path/to/data/ as the 
table location...in which case you could have dispensed with the additional 
"alter table" business and simply created the original table around the 
directory in the first place...or I could be completely wrong.  Do you know for 
certain that it isn't using other files also in that directory as part of the 
same table...or if it is currently empty, that if you add a new file to the 
directory after creating the table in your described fashion, it doesn't 
immediately become visible as part of the table?  I eagerly await clarification.

On Mar 26, 2013, at 10:39 , Tony Burton wrote:

>  
> Thanks for the quick reply Sanjay.
>  
> ALTER TABLE is the key, but slightly different to your suggestion. I create 
> the table as before, but don't specify location:
>  
> $ create external table myData (str1 string, str2 string, count1 int) 
> partitioned by  row format  stored as textfile;
>  
> Then use ALTER TABLE like this:
>  
> $ ALTER TABLE myData SET LOCATION ' 
> s3://mybucket/path/to/data/src1.txt ';
>  
> Bingo, I can now run queries with myData in the same way I can when the 
> LOCATION is a directory. Cool!
>  
> Tony
>  
>  
>  
>  
>  
>  
>  
> From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
> Sent: 26 March 2013 17:22
> To: user@hive.apache.org
> Subject: Re: S3/EMR Hive: Load contents of a single file
>  
> Hi Tony
>  
> Can u create the table without any location. 
>  
> After that you could do an ALTER TABLE add location and partition
>  
> ALTER TABLE myData ADD PARTITION (partitionColumn1='$value1' , 
> partitionColumn2='$value2') LOCATION '/path/to/your/directory/in/hdfs';"
> 
> 
> An example Without Partitions
> -
> ALTER TABLE myData SET LOCATION 
> 'hdfs://10.48.97.97:9000/path/to/your/data/directory/in/hdfs';"
> 
> 
> While specifying location, you have to point to a directory. You cannot point 
> to a file (IMHO).
>  
> Hope that helps
>  
> sanjay
>  
> From: Tony Burton 
> Reply-To: "user@hive.apache.org" 
> Date: Tuesday, March 26, 2013 10:11 AM
> To: "user@hive.apache.org" 
> Subject: S3/EMR Hive: Load contents of a single file
>  
> Hi list,
>  
> I've been using hive to perform queries on data hosted on AWS S3, and 
> my tables point at data by specifying the directory in which the data 
> is stored, eg
>  
> $ create external table myData (str1 string,