Re: Multi-GroupBy-Insert optimization

2012-06-04 Thread Jan Dolinár
On 6/4/12, shan s  wrote:
> Thanks for the explanation, Jan.
> If I understand correctly, the input will be read a single time and
> preprocessed in some form, and this intermediate data is used for the
> subsequent group-bys.
> I'm not sure my scenario will benefit from this single step, since the
> group-bys vary across many different entities.

Yes, that is correct. The very simplest use case is when you only scan
part of the table. But if you are interested in all the data, it is not
going to help you much.

> If I were to implement the group-bys manually, we could generally club them
> together in a single program. Can I do better with Hive, with some
> hints/optimizations?
> Or is there a possibility that Pig might perform better in this case
> (assuming Pig would probably handle this in a single job)?

In some cases you might be able to outsmart the Hive optimizer and
write the MapReduce job directly in Java in such a way that it performs
better. In most cases, though, it is probably not worth the
trouble. You might easily end up in a situation where buying more
machines is cheaper than developing the low-level solutions that might
or might not be slightly faster... I'm not familiar with Pig or any
other tools that might be of use in your situation.

Jan


Re: Multi-group-by select always scans entire table

2012-06-04 Thread Jan Dolinár
On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover  wrote:

> Hi Jan,
> Glad you found something workable.
>
> What version of Hive are you using? Could you also please check what the
> value of the property hive.optimize.ppd is for you?
>
> Thanks,
> Mark
>
>
Hi Mark,

Thanks for the reply. I'm using Hive 0.7.1 as distributed by Cloudera in cdh3u4.
The property hive.optimize.ppd is set to true, but I have tried turning it off
and it doesn't affect the behavior of the problematic query at all. Any
other ideas? :-)

Also, could some of you check this on Hive 0.8 or newer?
It would be nice to know whether it is worth going through all the hassle of
upgrading or whether it won't help. Also, if it is not fixed already, it might
be a good idea to report it as a bug.

Jan


RE: hive test failed, Permission denied

2012-06-04 Thread Chinna Rao Lalam
Hi,

Try giving execute permission to that file.
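A sketch of what that usually means for this error (the path is the one from
the error message below; adjust it to your workspace):

    chmod +x hive-0.9-test/workspace/testutils/hadoop

error=13 is EACCES, i.e. the build is trying to execute a script that is not
marked executable.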


From: Hezhiqiang (Ransom) [ransom.hezhiqi...@huawei.com]
Sent: Tuesday, June 05, 2012 8:56 AM
To: user@hive.apache.org
Cc: Zhaojun (Terry)
Subject: hive test failed, Permission denied

Hi all.
When I run :
ant -Dhadoop.version=1.0.0 clean package test tar -logfile ant.log
The test failed. How can I solve it?
This is the log.
[junit] Number of reduce tasks is set to 0 since there's no reduce operator
[junit] java.io.IOException: Cannot run program 
"hive-0.9-test/workspace/testutils/hadoop" (in directory 
"hive-0.9-test/workspace/service"): java.io.IOException: error=13, Permission 
denied

[junit] at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
[junit] at java.lang.Runtime.exec(Runtime.java:593)
[junit] at java.lang.Runtime.exec(Runtime.java:431)
[junit] at 
org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:268)
[junit] at 
org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
[junit] at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
[junit] at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
[junit] at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
[junit] at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
[junit] at 
org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:191)
[junit] at 
org.apache.hadoop.hive.service.TestHiveServer.testDynamicSerde(TestHiveServer.java:300)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at junit.framework.TestCase.runTest(TestCase.java:154)
[junit] at junit.framework.TestCase.runBare(TestCase.java:127)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:118)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:208)
[junit] at junit.framework.TestSuite.run(TestSuite.java:203)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:518)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1052)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:906)
[junit] Caused by: java.io.IOException: java.io.IOException: error=13, 
Permission denied
[junit] at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
[junit] at java.lang.ProcessImpl.start(ProcessImpl.java:65)
[junit] at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
[junit] ... 25 more
[junit] FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MapRedTask


RE: Filtering on TIMESTAMP data type

2012-06-04 Thread Ladda, Anand
Can anyone help out with the TIMESTAMP literals piece? So far, I've gotten

Select day_timestamp from lu_day where day_timestamp > 
to_utc_timestamp('2012-06-04 00:00:00', 'GMT')

to work and give me back timestamps greater than the one in the literal. Is this 
the best function to get this to work, or is there something else I should be using?
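One alternative worth trying (a sketch only, not verified on every release; 
CAST from STRING to TIMESTAMP should be available since the TIMESTAMP type was 
added in Hive 0.8):

    Select day_timestamp from lu_day
    where day_timestamp > cast('2012-06-04 00:00:00' as timestamp);

to_utc_timestamp() additionally shifts the literal from the named time zone to 
UTC, so with 'GMT' the two forms should describe the same instant.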

From: Ladda, Anand
Sent: Monday, May 28, 2012 11:00 AM
To: user@hive.apache.org
Subject: RE: FW: Filtering on TIMESTAMP data type

Debarshi
Didn't quite follow your first comment. I get the write-your-own UDF part but 
was wondering how others have been transitioning from STRING dates to TIMESTAMP 
dates and getting filtering, partition pruning, etc to work with constants
-Anand

From: Debarshi Basak [mailto:debarshi.ba...@tcs.com]
Sent: Saturday, May 26, 2012 11:54 AM
To: user@hive.apache.org
Subject: Re: FW: Filtering on TIMESTAMP data type

I guess it exists; I'd have to check.
By the way, you can always write a UDF.


Debarshi Basak
Tata Consultancy Services
Mailto: debarshi.ba...@tcs.com
Website: http://www.tcs.com

Experience certainty. IT Services
Business Solutions
Outsourcing


-"Ladda, Anand" wrote: -
To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>, 
"d...@hive.apache.org" 
mailto:d...@hive.apache.org>>
From: "Ladda, Anand" mailto:lan...@microstrategy.com>>
Date: 05/26/2012 06:58PM
Subject: FW: Filtering on TIMESTAMP data type
How do I set up a filter constant for the TIMESTAMP datatype? In Hive 0.7, since 
timestamps were represented as strings, a query like this would return data:

select * from LU_day where day_date ='2010-01-01 00:00:00';

But now, with day_date as a TIMESTAMP column, it doesn't. Is there some type of a 
TO_TIMESTAMP function in Hive to convert the string constant into a TIMESTAMP 
one?

As a workaround I can do

select * from LU_DAY where TO_DATE(day_date) = '2010-01-01', but that would be a 
problem for partition pruning, etc.





Re: Need help running query over "yesterday". getting "No partition predicate found" errors

2012-06-04 Thread Mark Grover
Hi Matthew,
unix_timestamp() is being executed on the Hive server in your case. Therefore, 
as your query progresses, the timestamp returned by unix_timestamp() can change. 
Based on your usage, this will lead to unexpected problems when the query runs 
across UTC midnight. In any case, I'd recommend that your Hive client code 
obtain the current unix_timestamp and generate your partition pruning 
constant(s) on the client before you send your query to the server. When you do 
so, partition pruning will occur as expected.
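A minimal sketch of that approach (assuming a POSIX shell with GNU date; 
test_where_expr is the table from the test script quoted below):

    YESTERDAY=$(date -u -d "yesterday" +%Y-%m-%d)
    hive -e "SELECT * FROM test_where_expr WHERE dt = '${YESTERDAY}';"

Because dt is now compared against a plain string literal, the partition pruner 
can evaluate the predicate at compile time and strict mode is satisfied.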

Mark

- Original Message -
From: "Matthew Hooker" 
To: user@hive.apache.org
Sent: Monday, June 4, 2012 8:01:26 PM
Subject: Need help running query over "yesterday". getting "No partition 
predicate found" errors

Hello, 


I'm trying to construct a query which will do some simple counts over a table 
of events, and insert them into a summary table. 


The query I came up with looks something like 



INSERT OVERWRITE TABLE activity 
PARTITION(date_utc) 
select count(1) views, 
from impressions 

where dt >= date_sub(to_date(from_unixtime(unix_timestamp())),1) 
and dt < to_date(from_unixtime(unix_timestamp())) 
group by d_theme; 


This seemed to work fine, but I realized it was doing a full table scan, and 
not using the partitions in "views" appropriately. 


I set hive.mapred.mode=strict and the above query fails to run. 
If I change the expressions in the where clause to literals, it works as I 
would expect. 


Can anyone help me figure out how to do what I want? 




I put together a test script to illustrate my problem: 



set hive.mapred.mode=strict; 
set hive.optimize.ppd=true; 
DROP TABLE IF EXISTS test_where_expr; 
CREATE TABLE test_where_expr ( 
ts int 
) 
PARTITIONED BY ( dt STRING ); 
explain SELECT * from test_where_expr WHERE dt = 
to_date(date_sub(from_unixtime(unix_timestamp()),1)); 
explain SELECT * from test_where_expr WHERE dt = '2012-06-01'; 




This is the output: 



hive> set hive.mapred.mode=strict; 
hive> set hive.optimize.ppd=true; 
hive> 
> DROP TABLE IF EXISTS test_where_expr; 
OK 
Time taken: 3.405 seconds 
hive> 
> CREATE TABLE test_where_expr ( 
> ts int 
> ) 
> PARTITIONED BY ( dt STRING ); 
OK 
Time taken: 0.189 seconds 
hive> 
> explain SELECT * from test_where_expr 
> WHERE dt = to_date(date_sub(from_unixtime(unix_timestamp()),1)); 
FAILED: Error in semantic analysis: No partition predicate found for Alias 
"test_where_expr" Table "test_where_expr" 
hive> 
> explain SELECT * from test_where_expr 
> WHERE dt = '2012-06-01'; 
OK 
ABSTRACT SYNTAX TREE: 
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME test_where_expr))) (TOK_INSERT 
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt) '2012-06-01' 


STAGE DEPENDENCIES: 
Stage-0 is a root stage 


STAGE PLANS: 
Stage: Stage-0 
Fetch Operator 
limit: -1 




Time taken: 0.166 seconds 
hive> hadoop@ip-10-68-190-136:~$ hive --version 
Hive version 0.8.1. 
hadoop@ip-10-68-190-136:~$ hadoop -version 
java version "1.6.0_26" 
Java(TM) SE Runtime Environment (build 1.6.0_26-b03) 
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode) 





Thanks, 
--Matthew Hooker 


hive test failed, Permission denied

2012-06-04 Thread Hezhiqiang (Ransom)
Hi all.
When I run :
ant -Dhadoop.version=1.0.0 clean package test tar -logfile ant.log
The test failed. How can I solve it?
This is the log.
[junit] Number of reduce tasks is set to 0 since there's no reduce operator
[junit] java.io.IOException: Cannot run program 
"hive-0.9-test/workspace/testutils/hadoop" (in directory 
"hive-0.9-test/workspace/service"): java.io.IOException: error=13, Permission 
denied

[junit] at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
[junit] at java.lang.Runtime.exec(Runtime.java:593)
[junit] at java.lang.Runtime.exec(Runtime.java:431)
[junit] at 
org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:268)
[junit] at 
org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
[junit] at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
[junit] at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
[junit] at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
[junit] at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
[junit] at 
org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:191)
[junit] at 
org.apache.hadoop.hive.service.TestHiveServer.testDynamicSerde(TestHiveServer.java:300)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at junit.framework.TestCase.runTest(TestCase.java:154)
[junit] at junit.framework.TestCase.runBare(TestCase.java:127)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:118)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:208)
[junit] at junit.framework.TestSuite.run(TestSuite.java:203)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:518)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1052)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:906)
[junit] Caused by: java.io.IOException: java.io.IOException: error=13, 
Permission denied
[junit] at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
[junit] at java.lang.ProcessImpl.start(ProcessImpl.java:65)
[junit] at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
[junit] ... 25 more
[junit] FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MapRedTask


Need help running query over "yesterday". getting "No partition predicate found" errors

2012-06-04 Thread Matthew Hooker
Hello,

I'm trying to construct a query which will do some simple counts over a
table of events, and insert them into a summary table.

The query I came up with looks something like

INSERT OVERWRITE TABLE activity
PARTITION(date_utc)
select count(1) views,
from impressions
where dt >= date_sub(to_date(from_unixtime(unix_timestamp())),1)
and dt < to_date(from_unixtime(unix_timestamp()))
group by d_theme;

This seemed to work fine, but I realized it was doing a full table scan,
and not using the partitions in "views" appropriately.

I set hive.mapred.mode=strict and the above query fails to run.
If I change the expressions in the where clause to literals, it works as I
would expect.

Can anyone help me figure out how to do what I want?


I put together a test script to illustrate my problem:

set hive.mapred.mode=strict;
set hive.optimize.ppd=true;
DROP TABLE IF EXISTS test_where_expr;
CREATE TABLE test_where_expr (
ts int
  )
  PARTITIONED BY ( dt STRING );
explain SELECT * from test_where_expr WHERE dt =
to_date(date_sub(from_unixtime(unix_timestamp()),1));
explain SELECT * from test_where_expr WHERE dt = '2012-06-01';


This is the output:

hive> set hive.mapred.mode=strict;
hive> set hive.optimize.ppd=true;
hive>
> DROP TABLE IF EXISTS test_where_expr;
OK
Time taken: 3.405 seconds
hive>
> CREATE TABLE test_where_expr (
> ts int
>   )
>   PARTITIONED BY ( dt STRING );
OK
Time taken: 0.189 seconds
hive>
> explain SELECT * from test_where_expr
> WHERE dt = to_date(date_sub(from_unixtime(unix_timestamp()),1));
FAILED: Error in semantic analysis: No partition predicate found for Alias
"test_where_expr" Table "test_where_expr"
hive>
> explain SELECT * from test_where_expr
> WHERE dt = '2012-06-01';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME test_where_expr)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt)
'2012-06-01'

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1


Time taken: 0.166 seconds
hive> hadoop@ip-10-68-190-136:~$ hive --version
Hive version 0.8.1.
hadoop@ip-10-68-190-136:~$ hadoop -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)


Thanks,
--Matthew Hooker


Re: Problem: LINES TERMINATED BY only supports newline '\n' right now.

2012-06-04 Thread Mark Grover
Hi Tabraiz,
The 10 in the source code is what '\n' is in ASCII (base 10). That's why you 
see it. It still represents a linefeed.

Mark

- Original Message -
From: "tabraiz anwer" 
To: "Mark Grover" , "hive group" 
Sent: Monday, June 4, 2012 4:42:02 PM
Subject: Re: Problem: LINES TERMINATED BY only supports newline '\n' right now.



Hello Mark, 
Instead of '\n' we can also terminate records by '10'; I have seen an example 
in the Hive wiki where tables are created with records terminated 
by '\001'. 
I have checked the source of the Hive syntax analyzer. There are only two 
options for LINES termination: one is '\n' and the other is '10'. The question 
now is how I can add another line termination value, '\001'. 
Regards. 





From: Mark Grover  
To: user@hive.apache.org; tabraiz anwer  
Sent: Monday, 4 June 2012 9:26 PM 
Subject: Re: Problem: LINES TERMINATED BY only supports newline '\n' right now. 

Hi Tabriz, 
As far as I know, newlines are the only supported way to separate records right 
now. As a corollary, if a single logical record exists across multiple lines, 
you will have to get rid of the extra newlines for all of it to be in the same 
record. 

So, to get around it, you can do one of two things: 
1) Pre-process your files to break records apart on newlines. 
2) As Ed Capriolo suggested in a previous email thread, you could try to use 
streaming, parse out your XML there and emit out multiple records. 

Mark 

- Original Message - 
From: "tabraiz anwer" < tabraizan...@yahoo.com > 
To: "hive group" < user@hive.apache.org > 
Sent: Monday, June 4, 2012 12:08:12 PM 
Subject: Problem: LINES TERMINATED BY only supports newline '\n' right now. 



Hi, 
I tried to create the table with "LINES terminated by '\001'" 
and it is giving me the error 


Error in semantic analysis: 3:66 LINES TERMINATED BY only supports newline '\n' 
right now. Error encountered near token ''\001'' 


CREATE TABLE xmlgw4 ( transactionid string, typeid string, 
sentxml string,receivedxml string ) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\001' 
STORED AS TEXTFILE; 



Instead of '\n' I am using '\001' because I have an XML value, which I want to 
store in Hive, that includes \n values. 
Using Hive version: hive-0.8.1 

any suggestion? 
Regards. 




Re: Problem: LINES TERMINATED BY only supports newline '\n' right now.

2012-06-04 Thread tabraiz anwer
Hello Mark,
Instead of '\n' we can also terminate records by '10'; I have seen an example 
in the Hive wiki where tables are created with records terminated 
by '\001'.
I have checked the source of the Hive syntax analyzer. There are only two 
options for LINES termination: one is '\n' and the other is '10'. The question 
now is how I can add another line termination value, '\001'.
Regards.




 From: Mark Grover 
To: user@hive.apache.org; tabraiz anwer  
Sent: Monday, 4 June 2012 9:26 PM
Subject: Re: Problem: LINES TERMINATED BY only supports newline '\n' right now.
 
Hi Tabriz,
As far as I know, newlines are the only supported way to separate records right 
now. As a corollary, if a single logical record exists across multiple lines, 
you will have to get rid of the extra newlines for all of it to be in the same 
record.

So, to get around it, you can do one of two things:
1) Pre-process your files to break records apart on newlines.
2) As Ed Capriolo suggested in a previous email thread, you could try to use 
streaming, parse out your XML there and emit out multiple records.

Mark

- Original Message -
From: "tabraiz anwer" 
To: "hive group" 
Sent: Monday, June 4, 2012 12:08:12 PM
Subject: Problem: LINES TERMINATED BY only supports newline '\n' right now.



Hi, 
I tried to create the table with "LINES terminated by '\001'" 
and it is giving me the error 


Error in semantic analysis: 3:66 LINES TERMINATED BY only supports newline '\n' 
right now. Error encountered near token ''\001'' 


CREATE TABLE xmlgw4 ( transactionid string, typeid string, 
sentxml string,receivedxml string ) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\001' 
STORED AS TEXTFILE; 



Instead of '\n' I am using '\001' because I have an XML value, which I want to 
store in Hive, that includes \n values. 
Using Hive version: hive-0.8.1 

any suggestion? 
Regards. 

Re: Multi-group-by select always scans entire table

2012-06-04 Thread Mark Grover
Hi Jan,
Glad you found something workable.

What version of Hive are you using? Could you also please check what the value 
of the property hive.optimize.ppd is for you?

Thanks,
Mark

- Original Message -
From: "Jan Dolinár" 
To: user@hive.apache.org
Sent: Tuesday, May 29, 2012 1:57:25 AM
Subject: Re: Multi-group-by select always scans entire table


On Fri, May 25, 2012 at 12:03 PM, Jan Dolinár < dolik@gmail.com > wrote: 


-- see what happens when you try to perform multi-group-by query on one of the 
partitions 
EXPLAIN EXTENDED 
FROM partition_test 
LATERAL VIEW explode(col1) tmp AS exp_col1 
INSERT OVERWRITE DIRECTORY '/test/1' 
SELECT exp_col1 
WHERE (part_col=2) 
INSERT OVERWRITE DIRECTORY '/test/2' 
SELECT exp_col1 
WHERE (part_col=2); 
-- result: it wants to scan all partitions :-( 


Since nobody else did, let me answer myself... In the end I found out that the 
correct partition pruning can be achieved using a subquery. Continuing the 
example from my last post, the query would be: 


FROM ( 
SELECT * FROM partition_test 
LATERAL VIEW explode(col1) tmp AS exp_col1 
WHERE part_col=2 
) t 
INSERT OVERWRITE DIRECTORY '/test/1' 
SELECT exp_col1 
INSERT OVERWRITE DIRECTORY '/test/2' 
SELECT exp_col1; 


I still think the pruning should work correctly no matter how the query is 
written, but for now I'm happy with this solution. 


J. Dolinar


Re: How to execute query with timestamp type (Hbase/Hive integeration)

2012-06-04 Thread Mark Grover
Hi Peyton,
I've been using Integer to store Unix timestamps in Hive and that seems to work 
fine.
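
One hedged sketch along those lines: expose event_time as a plain string and 
parse it at query time. This assumes the HBase cell really holds the 
java.util.Date.toString() text shown later in the thread, that the underlying 
HBase table is also named http_access, and that the table name http_access_str 
is just a placeholder; unix_timestamp(str, pattern) takes a 
java.text.SimpleDateFormat pattern.

CREATE EXTERNAL TABLE http_access_str(key string, client_ip string,
    client_port int, request_method string, event_time string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
    ":key,client:ip,client:port,request:method,request:event_time")
TBLPROPERTIES ("hbase.table.name" = "http_access");

SELECT client_ip, client_port,
       unix_timestamp(event_time, 'EEE MMM dd HH:mm:ss zzz yyyy') AS event_ts
FROM http_access_str;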

Let me know if I misunderstood your question:-)

Mark
- Original Message -
From: "Peyton Peng" 
To: user@hive.apache.org
Sent: Thursday, May 31, 2012 2:41:15 AM
Subject: Re: How to execute query with timestamp type (Hbase/Hive integeration)

Hi Mark,

Thanks for your response. I tried other data types, and it seems the issue 
occurs only while querying the timestamp field; I'm not sure how the timestamp 
mapping works...

From HBase, I can see the data, and the value of the timestamp (event_time) is: 
Wed May 30 16:15:06 CST 2012. Should I store the value as a long type?

Regards,
Peyton

- Original Message - 
From: Mark Grover
Sent: Thursday, May 31, 2012 11:14 AM
To: user@hive.apache.org
Subject: Re: How to execute query with timestamp type (Hbase/Hive 
integeration)

Hi Peyton,
It seems like something to do with timestamp mapping.
What happens if you change your Hive table definition to have the event_time 
as int or string?

Mark

- Original Message -
From: "Peyton Peng" 
To: user@hive.apache.org
Sent: Wednesday, May 30, 2012 5:54:20 AM
Subject: Re: How to execute query with timestamp type (Hbase/Hive 
integeration)




Actually, I can execute the first SQL and it works well; all the libs you 
specified are under the Hive lib folder.

I suspect the issue is caused by the timestamp mapping between HBase and 
Hive.

Regards,
Peyton





From: shashwat shriparv
Sent: Wednesday, May 30, 2012 5:26 PM
To: user@hive.apache.org
Subject: Re: How to execute query with timestamp type (Hbase/Hive 
integeration)

Add these file to hive lib folder

>>> hadoop-0.20-core.jar
>>> hive/lib/hive-exec-0.7.1.jar
>>> hive/lib/hive-jdbc-0.7.1.jar
>>> hive/lib/hive-metastore-0.7.1.jar
>>> hive/lib/hive-service-0.7.1.jar
>>> hive/lib/libfb303.jar
>>> lib/commons-logging-1.0.4.jar
>>> slf4j-api-1.6.1.jar
>>> slf4j-log4j12-1.6.1.jar

and then try

On Wed, May 30, 2012 at 2:23 PM, Peyton Peng < pengp...@yunyou.tv > wrote:






Hi,

I build the hive table mapped with hbase table,

CREATE TABLE http_access(key string, client_ip string, client_port int, 
request_method string, event_time timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = 
":key,client:ip,client:port,request:method,request:event_time"
);

The data is store with hbase client.

I get an issue while querying with Hive (HBase/Hive integration). While I 
execute the SQL "select ip, port, request_method from http_access", it works 
well with no problem,

but while I execute the SQL "select ip, port, event_time from http_access", 
I get the exception below.

The only difference between the two SQLs is that event_time is of timestamp 
type. I can scan the corresponding HBase table and see that the value of 
event_time is:
1338365792142 column=request:event_time, timestamp=1338365739818, value=Wed 
May 30 16:15:06 CST 2012

Does anyone know what the issue is? (I'm not sure if I made a wrong mapping, or 
should I just store the timestamp value as a long in HBase? Currently I store 
the value as java.util.Date.)

Thank you very much

Regards,
Peyton

Exception tracking:

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
hdfs://Server:9000/user/hive/warehouse/http_access
Starting Job = job_201205291421_0008, Tracking URL = 
http://Server:50030/jobdetails.jsp?jobid=job_201205291421_0008
Kill Command = //libexec/../bin/hadoop 
job -Dmapred.job.tracker=Server:9001 -kill job_201205291421_0008
Hadoop job information for Stage-1: number of mappers: 1; number of 
reducers: 0
2012-05-30 16:28:01,572 Stage-1 map = 0%, reduce = 0%
2012-05-30 16:28:34,707 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201205291421_0008 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201205291421_0008_m_02 (and more) from job 
job_201205291421_0008
Exception in thread "Thread-211" java.lang.RuntimeException: Error while 
reading from task log url
at 
org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at 
org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for 
URL: 
http://Server:50060/tasklog?taskid=attempt_201205291421_0008_m_00_1&start=-8193
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1305)
at java.net.URL.openStream(URL.java:1009)
at 
org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
.. 3 more
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec





-- 


∞ Shashw

Re: Problem: LINES TERMINATED BY only supports newline '\n' right now.

2012-06-04 Thread Mark Grover
Hi Tabriz,
As far as I know, newlines are the only supported way to separate records right 
now. As a corollary, if a single logical record exists across multiple lines, 
you will have to get rid of the extra newlines for all of it to be in the same 
record.

So, to get around it, you can do one of two things:
1) Pre-process your files to break records apart on newlines.
2) As Ed Capriolo suggested in a previous email thread, you could try to use 
streaming, parse out your XML there and emit out multiple records.
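
For option 1, a minimal pre-processing sketch (the file names are placeholders; 
this assumes the raw file uses \001 as the record separator and that the 
embedded newlines inside the XML can be replaced with spaces without losing 
anything):

    tr '\n' ' ' < raw_data | tr '\001' '\n' > one_record_per_line.txt

The result has one record per line, which loads with the default LINES 
TERMINATED BY '\n'; the '|' field delimiter is untouched.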

Mark

- Original Message -
From: "tabraiz anwer" 
To: "hive group" 
Sent: Monday, June 4, 2012 12:08:12 PM
Subject: Problem: LINES TERMINATED BY only supports newline '\n' right now.



Hi, 
I tried to create the table with "LINES terminated by '\001'" 
and it is giving me the error 


Error in semantic analysis: 3:66 LINES TERMINATED BY only supports newline '\n' 
right now. Error encountered near token ''\001'' 


CREATE TABLE xmlgw4 ( transactionid string, typeid string, 
sentxml string,receivedxml string ) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\001' 
STORED AS TEXTFILE; 



Instead of '\n' I am using '\001' because I have an XML value, which I want to 
store in Hive, that includes \n values. 
Using Hive version: hive-0.8.1 

any suggestion? 
Regards. 


Problem: LINES TERMINATED BY only supports newline '\n' right now.

2012-06-04 Thread tabraiz anwer
Hi,
I tried to create the table with "LINES terminated by '\001'"
and it is giving me the error

Error in semantic analysis: 3:66 LINES TERMINATED BY only supports newline '\n' 
right now. Error encountered near token ''\001''


CREATE TABLE xmlgw4 ( transactionid string, typeid string,
  sentxml string, receivedxml string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\001'
STORED AS TEXTFILE;


Instead of '\n' I am using '\001' because I have an XML value, which I want to 
store in Hive, that includes \n values.
Using Hive version: hive-0.8.1

any suggestion?
Regards.


Re: HADOOP_HOME

2012-06-04 Thread Mark Grover
Babak,
For what it's worth, HADOOP_HOME has been deprecated starting Hadoop 0.23

Mark

- Original Message -
From: "Bejoy Ks" 
To: user@hive.apache.org
Sent: Monday, June 4, 2012 6:50:33 AM
Subject: Re: HADOOP_HOME



Hi 


If you are looking to set HADOOP_HOME just for Hive, you can do that in 
hive-env.sh. 


To answer your question: yes, it is the location (the Hadoop parent dir) where 
you have extracted your tarball. 


Regards 
Bejoy KS 





From: shashwat shriparv  
To: user@hive.apache.org 
Sent: Saturday, June 2, 2012 10:29 PM 
Subject: Re: HADOOP_HOME 


One more way. 


add this line to your /etc/environment file 


export HADOOP_HOME= 


On Sat, Jun 2, 2012 at 6:38 PM, Siddharth Tiwari < siddharth.tiw...@live.com > 
wrote: 




DO this:- 

nano ~/.bashrc 
add following lines:- 
export HADOOP_HOME= (path to your hadoop bin directory) 

#done 


** 
Cheers !!! 
Siddharth Tiwari 
Have a refreshing day !!! 
"Every duty is holy, and devotion to duty is the highest form of worship of 
God.” 




Date: Sat, 2 Jun 2012 11:20:10 +0200 
Subject: HADOOP_HOME 
From: babak...@gmail.com 
To: user@hive.apache.org 


Hello experts 


How can I set HADOOP_HOME? In which file can I find this variable? Should I 
set this variable to the path where I downloaded Hadoop? 


thank you so much 



-- 

∞ Shashwat Shriparv 






Re: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Mark Grover
Not sure if this falls in the line of discussion, but I thought I'd mention it 
anyway.

A lot of people who run Hive as a warehouse load their aggregations 
and results into a traditional RDBMS (using something like Sqoop) and then 
use one of the visualization tools to visualize those aggregations without 
having to wait for Hive-like query times.
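
A hedged sketch of that hand-off (the connection string, credentials and table 
names are all hypothetical placeholders; the field delimiter must match what 
Hive wrote, '\001' by default):

    sqoop export \
      --connect jdbc:mysql://reporting-db/warehouse \
      --username report --password '***' \
      --table daily_aggregates \
      --export-dir /user/hive/warehouse/daily_aggregates \
      --input-fields-terminated-by '\001'

The visualization tool then queries the RDBMS table directly instead of Hive.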

Mark
- Original Message -
From: "Anand Ladda" 
To: user@hive.apache.org, "Bejoy Ks" 
Sent: Monday, June 4, 2012 9:32:02 AM
Subject: RE: Front end visualization tool with Hive (when using as a warehouse)




I agree with Bejoy’s assessment – Hive is good for processing large volumes of 
data in a batch manner. But for real-time or any complex SQL based analysis you 
would typically want to have some type of a RDBMS in the mix along with 
Hadoop/Hive. In terms of what’s missing in Hive today - On the query side Hive 
doesn’t yet support all flavors of subqueries (correlated subqueries to be 
specific. There are potential workarounds for the non-correlated ones), it also 
doesn’t support inserting data from a stream i.e, INSERT INTO TABLE VALUES (…) 
type syntax, Hive’s query optimizer is mostly rule-based at this time although 
there’s push to move towards a cost-based one. On the administration side 
there’s no workload management/job prioritization scheme like a typical RDBMS, 
Hive Server isn’t thread-safe and also doesn’t yet have any kind of 
security/authentication scheme. 









From: Bejoy Ks [mailto:bejoy...@yahoo.com] 
Sent: Monday, June 04, 2012 7:20 AM 
To: user@hive.apache.org 
Subject: Re: Front end visualization tool with Hive (when using as a warehouse) 





Hi Sreenath 





First of all, don't treat Hive like an RDBMS when designing your 
solution. It is an awesome tool when it comes to processing huge volumes of 
data in non-real-time mode. If any of your use cases involves 'updates' on 
rows, that is not supported in Hive. It is also pretty expensive to work around 
updates (you can implement it by overwriting at the most granular partition 
level, but it is still expensive). 

By the way, I'm not a DWH guy; maybe others can add their experience on 
these points. 





Regards 


Bejoy KS 









From: Sreenath Menon < sreenathmen...@gmail.com > 
To: user@hive.apache.org ; Bejoy Ks < bejoy...@yahoo.com > 
Sent: Monday, June 4, 2012 4:25 PM 
Subject: Re: Front end visualization tool with Hive (when using as a warehouse) 






Hi Bejoy 

I am not looking for just a UI for queries (even though at first, when working 
on Twitter data, that was my interest). But now I am planning on using Hive 
as a warehouse with a front-end in-memory processing engine. MicroStrategy or 
Tableau would be a good choice. 

Now, further refining the problem, I would ask: what is the warehousing power of 
Hive when compared to a traditional warehouse? Can Hive perform all operations 
performed/required in a warehouse? If not, where are the shortcomings which I 
need to deal with? 

Always thankful for your apt assistance. 


On Mon, Jun 4, 2012 at 3:49 PM, Bejoy Ks < bejoy...@yahoo.com > wrote: 




Hi Sreenath 





If you are looking at a UI for queries then Cloudera's hue is the best choice. 
Also you do have odbc connectors that integrates BI tools like microstrategy, 
tableau etc with hive. 





Regards 


Bejoy KS 









From: Sreenath Menon < sreenathmen...@gmail.com > 
To: user@hive.apache.org 
Sent: Monday, June 4, 2012 2:42 PM 
Subject: Front end visualization tool with Hive (when using as a warehouse) 






Hi all 

I am new to hive and am working on analysis of twitter data with Hive and 
Hadoop in a 27node cluster. 
At present am using Microsoft powerpivot as the visualization tool for visual 
representation of analysis done using Hive and have got some really good 
results and I am stunned by the scalability power of the Hadoop system. 

As a next step, I would like to evaluate the warehousing capabilities of Hive 
for business data. 
Any insights into this is welcome. And am facing problem of delegating job to 
Hive/Powerpivot as Powerpivot itself has capabilities of being a warehouse 
tool. Any other good visualization tools for usage with Hive is also welcome. 

For analyzing twitter data, I just ran complex Hive queries for each of 
analysis done. But for a warehouse, this does not sound like a good solution. 

Any help is greatly appreciated. 

Thanks. 







Re: Multi-GroupBy-Insert optimization

2012-06-04 Thread shan s
Thanks for the explanation, Jan.
If I understand correctly, the input will be read a single time and
preprocessed in some form, and this intermediate data is used for the
subsequent group-bys.
I'm not sure my scenario will benefit from this single step, since the
group-bys vary across many different entities.

If I were to implement the group-bys manually, we could generally club them
together in a single program. Can I do better with Hive, with some
hints/optimizations?
Or is there a possibility that Pig might perform better in this case
(assuming Pig would probably handle this in a single job)?

Thank You,
Prashant

p.s.
Just in case the data below helps:
In my scenario, my data has # of entity1 = 500,000, # of entity2 = 500, and #
of entity3 = 5.
The fact table has 250M rows (entity1 * entity2).
The current job has 22 group-bys based on various combinations of the 3 entities
and fact table record types, and it produces 22M rows. It takes 3 hours on a
4-machine cluster with good configuration.


On Mon, Jun 4, 2012 at 6:52 PM, Jan Dolinár  wrote:

>
>On Fri, Jun 1, 2012 at 5:25 PM, shan s  wrote:
>>
>>> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
>>> which would club the group-bys together.
>>> However it is scheduling n jobs where n = number of group bys..
>>> Could you please explain this behaviour.
>>>
>>>
>>
> No, it will result in at least as many jobs as there are group-bys. The
> efficiency comes not from lowering the number of jobs, but from the fact that
> the first job usually reduces the amount of data that the rest need to go
> through. E.g. if the FROM clause includes a subquery, or when the group-bys
> have similar WHERE clauses, then this "pre-selection" is executed first
> and the subsequent jobs operate on the results of the first instead of the
> entire table/partition and are therefore much faster.
>
>
> J. Dolinar
>


Re: Bucketing broken in hive 0.9.0?

2012-06-04 Thread Edward Capriolo
I confirmed this on Hive 0.7.0 and Hive 0.9.0: in non-local mode the query
creates three output files.

I opened:
https://issues.apache.org/jira/browse/HIVE-3083

Because the unit tests use local mode, they are likely
returning false positives.

Edward


On 6/4/12, Edward Capriolo  wrote:
> How come only a single output file is being produced here? Shouldn't
> this bucketing produce 3 files? (Local mode, BTW.)
>
> [edward@tablitha hive-0.9.0-bin]$ bin/hive
> hive> create table numbersflat(number int);
> hive> load data local inpath '/home/edward/numbers' into table numbersflat;
> Copying data from file:/home/edward/numbers
> Copying file: file:/home/edward/numbers
> Loading data to table default.numbersflat
> OK
> Time taken: 0.288 seconds
> hive> select * from numbersflat;
> OK
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> 10
> Time taken: 0.274 seconds
> hive> CREATE TABLE numbers_bucketed(number int,number1 int) CLUSTERED
> BY (number) INTO 3 BUCKETS;
> OK
> Time taken: 0.082 seconds
> hive> set hive.enforce.bucketing = true;
> hive> set hive.exec.reducers.max = 200;
> hive> set hive.merge.mapfiles=false;
> hive>
> > insert OVERWRITE table numbers_bucketed select number,number+1
> as number1 from numbersflat;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 3
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=
> 12/06/04 00:50:35 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
> Execution log at:
> /tmp/edward/edward_20120604005050_e17eb952-af76-4cf3-aee1-93bd59e74517.log
> Job running in-process (local Hadoop)
> Hadoop job information for null: number of mappers: 0; number of reducers:
> 0
> 2012-06-04 00:50:47,938 null map = 0%,  reduce = 0%
> 2012-06-04 00:50:48,940 null map = 100%,  reduce = 0%
> 2012-06-04 00:50:49,942 null map = 100%,  reduce = 100%
> Ended Job = job_local_0001
> Execution completed successfully
> Mapred Local Task Succeeded . Convert the Join into MapJoin
> Loading data to table default.numbers_bucketed
> Deleted file:/user/hive/warehouse/numbers_bucketed
> Table default.numbers_bucketed stats: [num_partitions: 0, num_files:
> 1, num_rows: 10, total_size: 43, raw_data_size: 33]
> OK
> Time taken: 16.722 seconds
> hive> dfs -ls /user/hive/warehouse/numbers_bucketed;
> Found 1 items
> -rwxrwxrwx   1 edward edward 43 2012-06-04 00:50
> /user/hive/warehouse/numbers_bucketed/00_0
> hive> dfs -ls /user/hive/warehouse/numbers_bucketed/00_0;
> Found 1 items
> -rwxrwxrwx   1 edward edward 43 2012-06-04 00:50
> /user/hive/warehouse/numbers_bucketed/00_0
> hive> cat /user/hive/warehouse/numbers_bucketed/00_0;
> FAILED: Parse Error: line 1:0 cannot recognize input near 'cat' '/' 'user'
>
> hive> dfs -cat /user/hive/warehouse/numbers_bucketed/00_0;
> 1 2
> 2 3
> 3 4
> 4 5
> 5 6
> 6 7
> 7 8
> 8 9
> 9 10
> 10 11
> hive>
>


RE: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Ladda, Anand
I agree with Bejoy's assessment - Hive is good for processing large volumes of 
data in a batch manner. But for real-time or any complex SQL based analysis you 
would typically want to have some type of a RDBMS in the mix along with 
Hadoop/Hive. In terms of what's missing in Hive today - On the query side Hive 
doesn't yet support all flavors of subqueries (correlated subqueries to be 
specific. There are potential workarounds for the non-correlated ones), it also 
doesn't support inserting data from a stream i.e, INSERT INTO TABLE VALUES 
(...) type syntax, Hive's query optimizer is mostly rule-based at this time 
although there's push to move towards a cost-based one. On the administration 
side there's no workload management/job prioritization scheme like a typical 
RDBMS, Hive Server isn't thread-safe and also doesn't yet have any kind of 
security/authentication scheme.



From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Monday, June 04, 2012 7:20 AM
To: user@hive.apache.org
Subject: Re: Front end visualization tool with Hive (when using as a warehouse)

Hi Sreenath

First of all, don't treat Hive like an RDBMS when designing your 
solution. It is an awesome tool when it comes to processing huge volumes of 
data in non-real-time mode. If any of your use cases involves 'updates' on 
rows, that is not supported in Hive. It is also pretty expensive to work around 
updates (you can implement it by overwriting at the most granular partition 
level, but it is still expensive).

By the way, I'm not a DWH guy; maybe others can add their experience on 
these points.

Regards
Bejoy KS


From: Sreenath Menon <sreenathmen...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
Sent: Monday, June 4, 2012 4:25 PM
Subject: Re: Front end visualization tool with Hive (when using as a warehouse)


Hi Bejoy

I am not looking for just a UI for queries (even though at first, when working 
on Twitter data, that was my interest). But now I am planning on using Hive 
as a warehouse with a front-end in-memory processing engine. MicroStrategy or 
Tableau would be a good choice.

Now, further refining the problem, I would ask: what is the warehousing power of 
Hive when compared to a traditional warehouse? Can Hive perform all operations 
performed/required in a warehouse? If not, where are the shortcomings which I 
need to deal with?

Always thankful for your apt assistance.
On Mon, Jun 4, 2012 at 3:49 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
Hi Sreenath

 If you are looking at a UI for queries then Cloudera's hue is the best 
choice. Also you do have odbc connectors that integrates BI tools like 
microstrategy, tableau etc with hive.

Regards
Bejoy KS


From: Sreenath Menon <sreenathmen...@gmail.com>
To: user@hive.apache.org
Sent: Monday, June 4, 2012 2:42 PM
Subject: Front end visualization tool with Hive (when using as a warehouse)

Hi all

I am new to hive and am working on analysis of twitter data with Hive and 
Hadoop in a 27node cluster.
At present am using Microsoft powerpivot as the visualization tool for visual 
representation of analysis done using Hive and have got some really good 
results and I am stunned by the scalability power of the Hadoop system.

As a next step, I would like to evaluate the warehousing capabilities of Hive 
for business data.
Any insights into this is welcome. And am facing problem of delegating job to 
Hive/Powerpivot as Powerpivot itself has capabilities of being a warehouse 
tool. Any other good visualization tools for usage with Hive is also welcome.

For analyzing twitter data, I just ran complex Hive queries for each of 
analysis done. But for a warehouse, this does not sound like a good solution.

Any help is greatly appreciated.

Thanks.





Re: Multi-GroupBy-Insert optimization

2012-06-04 Thread Jan Dolinár
> On Fri, Jun 1, 2012 at 5:25 PM, shan s  wrote:
>
>> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
>> which would club the group-bys together.
>> However it is scheduling n jobs where n = number of group bys..
>> Could you please explain this behaviour.
>>
>>
>
No, it will result in at least as many jobs as there are group-bys. The
efficiency comes not from lowering the number of jobs, but from the fact that
the first job usually reduces the amount of data that the rest need to go
through. E.g. if the FROM clause includes a subquery, or when the group-bys
have similar WHERE clauses, then this "pre-selection" is executed first
and the subsequent jobs operate on the results of the first instead of the
entire table/partition and are therefore much faster.
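
As a hedged illustration of that "pre-selection" (adapted from the query
earlier in this thread; the dt filter is a hypothetical shared WHERE clause),
pushing the common filter into a FROM subquery lets the first job scan X once
and the group-by jobs read only its output:

FROM (
  SELECT a, b, c
  FROM X
  WHERE dt = '2012-06-01'
) src
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
  SELECT a, b, c, count(*)
  GROUP BY a, b, c
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
  SELECT a, count(*)
  GROUP BY a;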


J. Dolinar


Re: Need help with simple subquery

2012-06-04 Thread shan s
Thanks Nanda, Igor,
This worked..

On Tue, May 29, 2012 at 11:55 PM, Nanda Vijaydev
wrote:

> Use a join as given below, I have used the table name temp.
>
> select a.Category, a.count, b.tot_count, (100*a.count)/tot_count as
> percent_count
> from   ( select Category, count(*) as count from temp group by Category )
> a
> join (select count(*) as tot_count from temp) b
>  on (true)
>
> Hope this helps
> Nanda Vijaydev
>
>
> On Mon, May 28, 2012 at 3:28 PM, Igor Tatarinov  wrote:
>
>> Try replacing the comma with JOIN
>>
>> igor
>> decide.com
>>
>> On Mon, May 28, 2012 at 6:48 AM, shan s  wrote:
>>
>>> I need help with a simple subquery. Given the data below, I need counts and
>>> percentage counts per category. (Re-phrasing my earlier question.)
>>> With the code below I get an error: FAILED: Parse Error: line 6:50 mismatched
>>> input ',' expecting EOF near 'a'
>>>
>>> Looking at the documentation, the syntax is not clear to me. In the
>>> FROM clause, can I have multiple selects and just alias them and use them
>>> in the top query?
>>> If so, what am I missing in the code below?
>>>
>>> select a.category, a.count, b.totalCount
>>> from
>>> (select category, count(*) as count from gt group by category) a,
>>> (select count(*) as totalCount from gt) b ;
>>>
>>> Many Thanks,
>>> Prashant.
>>>
>>> Id | Category     Count Output     Percentage Output
>>>  1 | A            A  4             A  40  ((4/10)*100)
>>>  2 | A            B  3             B  30
>>>  3 | B            C  3             C  30
>>>  4 | A
>>>  5 | C
>>>  6 | C
>>>  7 | B
>>>  8 | B
>>>  9 | A
>>> 10 | C
>>>
>>
>>
>


Re: Multi-GroupBy-Insert optimization

2012-06-04 Thread shan s
Anyone?
Thanks..

On Fri, Jun 1, 2012 at 5:25 PM, shan s  wrote:

> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
> which would club the group-bys together.
> However it is scheduling n jobs where n = number of group bys..
> Could you please explain this behaviour.
>
> From X
> INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
> SELECT a, b , c, count(*)
> group by a,b,c
> INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
> SELECT  a ,  count(*)
> group by a
> INSERT OVERWRITE LOCAL DIRECTORY 'output/y3'
> SELECT b,  count(*)
> group by b
> …..
> …..
> ……
>


Re: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Bejoy Ks
Hi Sreenath

First of all, don't treat Hive like an RDBMS when designing your 
solution. It is an awesome tool when it comes to processing huge volumes of 
data in non-real-time mode. If any of your use cases involves 'updates' on 
rows, that is not supported in Hive. It is also pretty expensive to work around 
updates (you can implement it by overwriting at the most granular partition 
level, but it is still expensive).

By the way, I'm not a DWH guy; maybe others can add their experience on 
these points.

Regards
Bejoy KS



 From: Sreenath Menon 
To: user@hive.apache.org; Bejoy Ks  
Sent: Monday, June 4, 2012 4:25 PM
Subject: Re: Front end visualization tool with Hive (when using as a warehouse)
 

Hi Bejoy

I am not looking for just an UI for queries (even though at first, when working 
on twitter data, that was of my interest). But, now I am planning on using Hive 
as a warehouse with a front end in-memory processing engine. Microstrategy or 
tableau would be a good choice.

Now further refining the problem, I would ask what is the warehousing power of 
Hive when compared to a traditional warehouse. Can Hive perform all operations 
performed/required in a warehouse. If not, where are the short comings which I 
need to deal with.
  
Always thankful for your apt assistance.


On Mon, Jun 4, 2012 at 3:49 PM, Bejoy Ks  wrote:

Hi Sreenath
>
>
>     If you are looking at a UI for queries then Cloudera's hue is the best 
>choice. Also you do have odbc connectors that integrates BI tools like 
>microstrategy, tableau etc with hive.
>
>
>Regards
>Bejoy KS
>
>
>
>
> From: Sreenath Menon 
>To: user@hive.apache.org 
>Sent: Monday, June 4, 2012 2:42 PM
>Subject: Front end visualization tool with Hive (when using as a warehouse)
> 
>
>
>Hi all
>
>I am new to hive and am working on analysis of twitter data with Hive and 
>Hadoop in a 27node cluster.
>At present am using Microsoft powerpivot as the visualization tool for visual 
>representation of analysis done using Hive and have got some really good 
>results and I am stunned by the scalability power of the Hadoop system.
>
>As a next step, I would like to evaluate the warehousing capabilities of Hive 
>for business data.
>Any insights into this is welcome. And am facing problem of delegating job to 
>Hive/Powerpivot as Powerpivot itself has capabilities of being a warehouse 
>tool. Any other good visualization tools for usage with Hive is also welcome.
>
>For analyzing twitter data, I just ran complex Hive queries for each of 
>analysis done. But for a warehouse, this does not sound like a good solution.
>
>Any help is greatly appreciated.
>
>Thanks.
>
>
>

Re: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Sreenath Menon
Hi Bejoy

I am not looking for just a UI for queries (even though at first, when
working on Twitter data, that was my interest). But now I am planning
on using Hive as a warehouse with a front-end in-memory processing engine.
MicroStrategy or Tableau would be a good choice.

Now, further refining the problem, I would ask: what is the warehousing power
of Hive when compared to a traditional warehouse? Can Hive perform all
operations performed/required in a warehouse? If not, where are the
shortcomings which I need to deal with?

Always thankful for your apt assistance.

On Mon, Jun 4, 2012 at 3:49 PM, Bejoy Ks  wrote:

> Hi Sreenath
>
>  If you are looking at a UI for queries then Cloudera's hue is the
> best choice. Also you do have odbc connectors that integrates BI tools like
> microstrategy, tableau etc with hive.
>
> Regards
> Bejoy KS
>
>   --
> *From:* Sreenath Menon 
> *To:* user@hive.apache.org
> *Sent:* Monday, June 4, 2012 2:42 PM
> *Subject:* Front end visualization tool with Hive (when using as a
> warehouse)
>
> Hi all
>
> I am new to hive and am working on analysis of twitter data with Hive and
> Hadoop in a 27node cluster.
> At present am using Microsoft powerpivot as the visualization tool for
> visual representation of analysis done using Hive and have got some really
> good results and I am stunned by the scalability power of the Hadoop system.
>
> As a next step, I would like to evaluate the warehousing capabilities of
> Hive for business data.
> Any insights into this is welcome. And am facing problem of delegating job
> to Hive/Powerpivot as Powerpivot itself has capabilities of being a
> warehouse tool. Any other good visualization tools for usage with Hive is
> also welcome.
>
> For analyzing twitter data, I just ran complex Hive queries for each of
> analysis done. But for a warehouse, this does not sound like a good
> solution.
>
> Any help is greatly appreciated.
>
> Thanks.
>
>
>


Re: HADOOP_HOME

2012-06-04 Thread Bejoy Ks
Hi

If you are looking to set HADOOP_HOME just for Hive, you can do that in 
hive-env.sh.

To answer your question: yes, it is the location (the Hadoop parent dir) where 
you have extracted your tarball.
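
A minimal sketch of the hive-env.sh route (the path is a hypothetical
extraction directory; copy conf/hive-env.sh.template to conf/hive-env.sh if
the file does not exist yet):

    # in $HIVE_HOME/conf/hive-env.sh
    export HADOOP_HOME=/home/hadoop/hadoop-1.0.0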

Regards
Bejoy KS



 From: shashwat shriparv 
To: user@hive.apache.org 
Sent: Saturday, June 2, 2012 10:29 PM
Subject: Re: HADOOP_HOME
 

One more way.

add this line to your /etc/environment file

export HADOOP_HOME=


On Sat, Jun 2, 2012 at 6:38 PM, Siddharth Tiwari  
wrote:

DO this:-
>
>nano ~/.bashrc
>add following lines:-
>export HADOOP_HOME= (path to your hadoop bin directory)
>
>#done
>
>
>**
>Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
>"Every duty is holy, and devotion to duty is the highest form of worship of 
>God.” 
>
>
>
>
>
>Date: Sat, 2 Jun 2012 11:20:10 +0200
>Subject: HADOOP_HOME
>From: babak...@gmail.com
>To: user@hive.apache.org
>
>
>Hello experts
>
>
How can I set HADOOP_HOME? In which file can I find this variable? Should I 
set this variable to the path where I downloaded Hadoop?
>
>
>thank you so much


-- 
    
∞
Shashwat Shriparv

Re: FileNotFoundException

2012-06-04 Thread Bejoy Ks
Hi Pavel
    
The issue is that somehow your Hive client is pointing to the local file system 
rather than HDFS. There are two possible reasons for this. The first and most 
likely one is that 'fs.default.name' points to file:/// instead of hdfs://. 
By default the value of this property is 'file:///'; you need to override this 
property in core-site.xml.

The second possibility is less likely: it can happen if you have overridden the 
value of hive.metastore.warehouse.dir in your hive-site.xml and the value is 
prefixed with 'file:///'.
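
For the first case, a minimal core-site.xml override could look like this (the
host and port are placeholders for your NameNode address):

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8020</value>
    </property>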

Regards
Bejoy KS



 From: "kulkarni.swar...@gmail.com" 
To: "user@hive.apache.org"  
Cc: "user@hive.apache.org"  
Sent: Sunday, June 3, 2012 8:20 PM
Subject: Re: FileNotFoundException
 
Did you set up the /user/hive/warehouse directory in HDFS and do a chmod g+w on 
it? That would be required before you run any queries.

On Jun 3, 2012, at 1:53 AM, Павел Мезенцев  wrote:

> Hello all!
> 
> I tried to create a simple table in Hive:
> > create table test (a string); 
> 
> But it failed:
> FAILED: Error in metadata: MetaException(message:Got exception: 
> java.io.FileNotFoundException File file:/user/hive/warehouse/test does not 
> exist.)
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> 
> I set up Hadoop and Hive locally from the Cloudera distribution CDH3 on Ubuntu 
> 12.04,
> and use the default settings.
> 
> What am I doing wrong? How can I fix it?
> 
> Thank you!
> Best regards 
> Mezentsev Pavel
> Russia

Re: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Jagat
Hello Sreenath,

Beside the tools mentioned by Bejoy you can also refer to Pentaho and Hive
both play well.

Regards,

Jagat Singh

On Mon, Jun 4, 2012 at 3:49 PM, Bejoy Ks  wrote:

> Hi Sreenath
>
>  If you are looking at a UI for queries then Cloudera's hue is the
> best choice. Also you do have odbc connectors that integrates BI tools like
> microstrategy, tableau etc with hive.
>
> Regards
> Bejoy KS
>
>   --
> *From:* Sreenath Menon 
> *To:* user@hive.apache.org
> *Sent:* Monday, June 4, 2012 2:42 PM
> *Subject:* Front end visualization tool with Hive (when using as a
> warehouse)
>
> Hi all
>
> I am new to hive and am working on analysis of twitter data with Hive and
> Hadoop in a 27node cluster.
> At present am using Microsoft powerpivot as the visualization tool for
> visual representation of analysis done using Hive and have got some really
> good results and I am stunned by the scalability power of the Hadoop system.
>
> As a next step, I would like to evaluate the warehousing capabilities of
> Hive for business data.
> Any insights into this is welcome. And am facing problem of delegating job
> to Hive/Powerpivot as Powerpivot itself has capabilities of being a
> warehouse tool. Any other good visualization tools for usage with Hive is
> also welcome.
>
> For analyzing twitter data, I just ran complex Hive queries for each of
> analysis done. But for a warehouse, this does not sound like a good
> solution.
>
> Any help is greatly appreciated.
>
> Thanks.
>
>
>


Re: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Bejoy Ks
Hi Sreenath

     If you are looking for a UI for queries, then Cloudera's Hue is the best
choice. There are also ODBC connectors that integrate BI tools like
MicroStrategy, Tableau, etc. with Hive.

Regards
Bejoy KS



 From: Sreenath Menon 
To: user@hive.apache.org 
Sent: Monday, June 4, 2012 2:42 PM
Subject: Front end visualization tool with Hive (when using as a warehouse)
 

Hi all

I am new to Hive and am working on analysis of Twitter data with Hive and
Hadoop in a 27-node cluster.
At present I am using Microsoft PowerPivot as the visualization tool for visual
representation of the analysis done using Hive. I have got some really good
results, and I am stunned by the scalability of the Hadoop system.

As a next step, I would like to evaluate the warehousing capabilities of Hive 
for business data.
Any insights into this are welcome. I am also facing the problem of deciding how
to split the work between Hive and PowerPivot, since PowerPivot itself has
warehousing capabilities of its own. Suggestions for other good visualization
tools to use with Hive are also welcome.

For analyzing the Twitter data, I just ran a complex Hive query for each
analysis. But for a warehouse, this does not sound like a good solution.

Any help is greatly appreciated.

Thanks.

RE: Re[2]: table design and performance questions

2012-06-04 Thread ramon.pin
Hi Avdeev,

  in response to question 2: bucketing is a second level of data division on
the table. You can tell Hive that your data should be distributed over several
files, inside your table or partition, using some fields as the distribution key.
So you can have partitions, which means your files are inside a directory
structure, and buckets, which split your data into distinct files.

See an example. You have the following data:


   |id |country
   | 1 | US
   | 2 | ES
   | 3 | ES
   | 4 | ES
   | 5 | US
   | 6 | US

You create the table as:

CREATE TABLE data (id int)
PARTITIONED BY (country string)
CLUSTERED BY (id) INTO 2 BUCKETS;

Then the structure of the data will be (the bucket filenames don't have any
meaning, this is just a sample):

   data/country=ES/bucket.0
   data/country=ES/bucket.1
   data/country=US/bucket.0
   data/country=US/bucket.1

Then bucket.0 should contain the rows with id 1,3,5... while bucket.1 should
contain 2,4,6... This can be helpful for sampling (you don't want to read all
the rows, just 50% of them in this example). In that scenario you only need to
read bucket.0 in each partition. It can also be useful for joins: if both tables
have the same number of buckets (or an exact multiple) then the join can be done
in a bucket-by-bucket fashion instead of a full table join. Looking for a given
id can also be optimized in this situation because you know which buckets to
discard.
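
As an illustration, here is a rough sketch against the sample data table above
(untested, and staging_data is just a hypothetical source table). Note that the
buckets are only laid out this way if the table is populated with bucketing
enforced:

-- make Hive actually split the output into the declared number of buckets
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE data PARTITION (country='ES')
SELECT id FROM staging_data WHERE country = 'ES';

-- read only one of the two buckets, i.e. roughly 50% of the rows
SELECT * FROM data TABLESAMPLE(BUCKET 1 OUT OF 2 ON id)
WHERE country = 'ES';

-- let Hive join two tables bucketed on the join key bucket by bucket
-- (older Hive versions also need an explicit /*+ MAPJOIN(small_table) */ hint)
SET hive.optimize.bucketmapjoin = true;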

There is another optional clause that says each bucket should be ordered by the
cluster key: CLUSTERED BY (id) SORTED BY (id ASC) INTO 2 BUCKETS. This is an
extra optimization for joins because now the join can not only be done bucket by
bucket, it can also be implemented as a sort-merge join.
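
A hedged sketch of how the sort-merge variant is typically switched on (the exact
property names and whether the MAPJOIN hint is still needed depend on your Hive
version, so treat this as something to verify; data_sorted and other_sorted are
hypothetical tables bucketed and sorted on id into the same number of buckets):

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT /*+ MAPJOIN(b) */ a.id, b.id
FROM data_sorted a JOIN other_sorted b ON (a.id = b.id);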

Hope this was useful,
   Ramón Pin


-Original Message-
From: Avdeev V. M. [mailto:ls...@list.ru]
Sent: sábado, 02 de junio de 2012 18:43
To: ruben.devr...@hyves.nl
Cc: user@hive.apache.org
Subject: Re[2]: table design and performance questions

Thanks for the information, Ruben.

1. I found the issue https://issues.apache.org/jira/browse/HIVE-1642.
Does it mean that the MAPJOIN hint has been obsolete since 2010 and that I can
avoid this hint completely?

2. Sorry for the stupid questions, but I still can't understand bucketing.
Partitioning is OK: it is HDFS folders, and I am able to understand how it
improves query execution. But what is bucketing in terms of how data is stored?

3. I am embarrassed to ask such stupid questions, but is there a 'how Hive
works' manual or something like that?

And again  - sorry for bad English.

Vyacheslav

Tue, 29 May 2012 10:02:14 +0200 от Ruben de Vries :
> Partitioning can greatly increase performance for WHERE clauses since Hive
> can omit parsing the data in the partitions which do not meet the requirement.
>
> For example, if you partition by date (I do it with an INT dateint, in which
> case I set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101, then
> it won't even have to touch any of the data from before 2012-01-01. In my
> case that means I don't parse the last 2 years of data, reducing the time the
> query takes by about 70% :-)
>
>
>
> Buckets are the second awesome way of getting a big optimization in,
> specifically for joins! If you have 2 tables you're joining to each other
> and they're both bucketed on their join column, it will also greatly
> increase speed.
>
> Another good join optimization is MAPJOIN. If one of the tables you're
> joining is rather small (below 30MB) then you can force it to MAPJOIN, or you
> can enable automatic mapjoin. I personally prefer explicit behavior instead
> of automagic, so I use a hint:
>
> SELECT /* +MAPJOIN(the_small_table) */ fields FROM table JOIN 
> the_small_table, etc.
>
> Sorted by is for sorting within buckets, only relevant if you're doing a lot 
> of ordering I think.
>
>
>
> I'm assuming sequencefiles are faster, but I wouldn't really know :( need 
> someone else to tell us more about that ;)
>
>
>
>
>
> -Original Message-
>
> From: Avdeev V. M. [mailto:ls...@list.ru]
>
> Sent: Monday, May 28, 2012 7:17 AM
>
> To: user@hive.apache.org
>
> Subject: table design and performance questions
>
>
>
> Question from novice.
>
>
>
> Where can I read about table design best practices? I have a measure table
> with millions of rows and many dimension tables with fewer than 1000 rows each.
> I can't find a way to get an optimal design for both kinds of tables. Are there
> performance tuning guides or a performance FAQ?
>
>
>
> Specifically
>
> 1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases does
> using these statements make sense?
>
> 2) The DDL language manual says 'This can improve performance on certain kinds
> of queries.' about the CLUSTERED BY statement. What kinds of queries can be improved?
>
> 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of 
> performance? What aspects should be taken into account when choosing a file 
> format?
>
> 4) Compressed storage article says 'Keeping dat

Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Sreenath Menon
Hi all

I am new to Hive and am working on analysis of Twitter data with Hive and
Hadoop in a 27-node cluster.
At present I am using Microsoft PowerPivot as the visualization tool for
visual representation of the analysis done using Hive. I have got some really
good results, and I am stunned by the scalability of the Hadoop system.

As a next step, I would like to evaluate the warehousing capabilities of
Hive for business data.
Any insights into this are welcome. I am also facing the problem of deciding
how to split the work between Hive and PowerPivot, since PowerPivot itself has
warehousing capabilities of its own. Suggestions for other good visualization
tools to use with Hive are also welcome.

For analyzing the Twitter data, I just ran a complex Hive query for each
analysis. But for a warehouse, this does not sound like a good
solution.

Any help is greatly appreciated.

Thanks.


RE: Hive 'rest' column

2012-06-04 Thread ramon.pin
I'm reviewing that LazySerDe option right now and it seems to be what I want.
Do you know of any good tutorial or documentation of all the LazySerDe options I
can use from Hive?

Thx,
   Ramón Pin

From: shrikanth shankar [mailto:sshan...@qubole.com]
Sent: miércoles, 30 de mayo de 2012 22:47
To: user@hive.apache.org
Subject: Re: Hive 'rest' column

I believe the default LazySerDe takes a parameter called
'serialization.last.column.takes.rest'. Setting this to true might solve your
issue (restoMsg would then become a string, and you might have to parse it into
an array in the query).
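
A rough sketch of how that property could be attached to the table definition
from the earlier mail (assuming the table really uses LazySimpleSerDe, which is
what ROW FORMAT DELIMITED maps to; the table name log_rest is made up and this
is untested):

CREATE EXTERNAL TABLE IF NOT EXISTS log_rest (
  dt string, txOperOpciResto string, idRegPerf string, oper string,
  opcion string, accion string, servc string, canal string,
  platf string, codIdioma string, pais string, lacre string,
  dirIP string,
  restoMsg string  -- collects the remainder of the line after the fixed fields
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = '|',
  'serialization.last.column.takes.rest' = 'true'
)
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop-user/uc3/seq/';

-- restoMsg is a plain string here; split it into an array in the query if needed
SELECT split(restoMsg, '\\|') FROM log_rest LIMIT 10;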

thanks,
Shrikanth
On May 30, 2012, at 9:27 AM, ramon@accenture.com wrote:


Hi,

   I'm trying to define a table over an external file. My file has 12 fixed
columns followed by a varying number of columns that depends on some of the
fixed ones. I tried to define the table as:

CREATE EXTERNAL TABLE IF NOT EXISTS log_array (
dt  string,
txOperOpciResto string,
idRegPerf   string,
operstring,
opcion  string,
accion  string,
servc   string,
canal   string,
platf   string,
codIdioma   string,
paisstring,
lacre   string,
dirIP   string,
restoMsg        array<string>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY '|'
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop-user/uc3/seq/';

So what I tried was to get all of the varying part into an array field (restoMsg).
The trick is not working because both delimiters, fields and collection items,
are the same. My restoMsg field only gets one column and the rest are omitted.

Is there any way to get that last part without custom code? If not, what
classes should I create to do this, and how should I define the table then?

Thx,
   Ramón Pin


