Multiple inserts on the same table with INSERT OVERWRITE + INSERT INTO?

2014-06-30 Thread Furcy Pin
Hi all,


Does anyone know what the expected behavior is for a query like this:

FROM source
INSERT OVERWRITE TABLE dest
SELECT *
INSERT INTO TABLE dest
SELECT *

Is it the same as

FROM source
INSERT INTO TABLE dest
SELECT *
INSERT OVERWRITE TABLE dest
SELECT *

?

I'm asking because I ran a query like the first one on Hive 0.12,
and it seems to me that the OVERWRITE was completely ignored:
the old data (which was there before the query) was still there.
And I checked: it was not an access-rights problem; the table was 777, with no
sticky bit set
(I've already had issues with ghost data on OVERWRITE because of that).

One last thing: my table was partitioned and I used nonstrict dynamic
partitioning;
perhaps that had an impact on the behavior too.
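
In case it helps, here is the minimal repro I have in mind (a sketch only;
the table names and seed file are hypothetical, and I left dynamic
partitioning out to isolate the OVERWRITE/INTO question):

-- two throwaway tables
CREATE TABLE src (x INT);
CREATE TABLE dest (x INT);

-- seed both with a few rows (Hive 0.12 has no INSERT ... VALUES)
LOAD DATA LOCAL INPATH '/tmp/rows.txt' INTO TABLE src;
LOAD DATA LOCAL INPATH '/tmp/rows.txt' INTO TABLE dest;

-- the multi-insert in question: if the OVERWRITE branch is honored,
-- the seeded rows in dest should be gone afterwards; in my case they
-- were still there
FROM src
INSERT OVERWRITE TABLE dest SELECT x
INSERT INTO TABLE dest SELECT x;

SELECT count(*) FROM dest;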

Thanks,

Furcy


Column selection in Hive

2014-06-30 Thread Chhaya Vishwakarma
Hi


I have a Customer table with 200 columns, and I want to create another table
from the Customer table that should have only 190 of them.
How can I skip a few columns of the base table in Hive?

Regards,
Chhaya Vishwakarma





ERROR 1115: Unsupported type: 5 in Pig's schema

2014-06-30 Thread Carlotta Hicks
Hi all! I created an HBase table and an HCatalog table like the following:


hive> describe formatted client;
OK
# col_name              data_type            comment

a                       string               from deserializer
b                       string               from deserializer
c                       int                  from deserializer
d                       string               from deserializer
e                       string               from deserializer
f                       string               from deserializer
g                       string               from deserializer
h                       string               from deserializer
i                       string               from deserializer
j                       double               from deserializer
k                       int                  from deserializer
l                       double               from deserializer
m                       double               from deserializer
n                       boolean              from deserializer

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Wed Jun 25 17:29:55 EDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://localhost:8020/user/hive/warehouse/client
Table Type:             MANAGED_TABLE
Table Parameters:
    hbase.columns.mapping    :key, gender:b, age:c, maritalStatus:d, city:e, state:f, zip:g, BirthDate:h, startDate:i, indInsValue:j, totalInsPolicies:k, totalInsValue:l, maxInsValue:m, deceased:n
    hbase.table.name         client
    storage_handler          org.apache.hcatalog.hbase.HBaseHCatStorageHandler
    transient_lastDdlTime    1403731795

I am trying to store data into the HBase table via Pig, but I am getting the
following error:

clientData = LOAD 'hdfs:///user/Data/cleint.csv' USING PigStorage(',') as (
a:chararray, b:chararray, c:int, d:chararray, e:chararray, f:chararray, 
g:chararray,
h:chararray, i:chararray, j:double, k:int, l:double, m:double, n:boolean);

STORE clientData INTO 'client' USING org.apache.hcatalog.pig.HCatStorer();

2014-06-30 09:44:53,999 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1115: Unsupported type: 5  in Pig's schema

Does anyone understand what the problem could be?
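
One more data point, in case it is relevant: Pig's DataType.BOOLEAN is 5, so
I suspect the boolean column n. If older HCatalog releases simply cannot map
boolean (an assumption on my part), a workaround sketch might be to declare
the column as string on the Hive side and cast it to chararray in the Pig
script, e.g.:

-- Workaround sketch, assuming the failure is the boolean column n
-- (Pig's DataType.BOOLEAN = 5). The column mapping and table name follow
-- the describe output above; the boolean is stored as 'true'/'false' text.
CREATE TABLE client (
  a string, b string, c int, d string, e string, f string, g string,
  h string, i string, j double, k int, l double, m double,
  n string  -- was boolean
)
STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
TBLPROPERTIES (
  'hbase.columns.mapping' = ':key,gender:b,age:c,maritalStatus:d,city:e,state:f,zip:g,BirthDate:h,startDate:i,indInsValue:j,totalInsPolicies:k,totalInsValue:l,maxInsValue:m,deceased:n',
  'hbase.table.name' = 'client'
);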


Carlotta Hicks




Alter location of database in Hive

2014-06-30 Thread Jon Bender
Hey all,

I'm on Hive 0.10.0 on one of my clusters.  We had a namenode hostname
change, so I'm trying to point all of our tables, partitions and databases
to the new locations.

When I describe database mydb, the location shows up as
hdfs://old_hostname/user/hive/warehouse/mydb.db, and I want to set it
to hdfs://new_hostname/user/hive/warehouse/mydb.db.

Is there a way to do this? Or do I need to go poking around in the MySQL
metadata to actually carry this out?

Regards,
Jon


Re: Alter location of database in Hive

2014-06-30 Thread Jon Bender
Answered my own question: no, there is not. The way to do it is to modify
the DB_LOCATION_URI field in metastore.DBS (at least if you're using MySQL).
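
For example, something like this against the metastore database (a sketch
only; back up the metastore first, and note that table and partition
locations live in SDS and need the same rename):

-- point the database at the new namenode
UPDATE DBS
   SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI,
       'hdfs://old_hostname', 'hdfs://new_hostname');

-- tables and partitions keep their own LOCATION in SDS
UPDATE SDS
   SET LOCATION = REPLACE(LOCATION,
       'hdfs://old_hostname', 'hdfs://new_hostname');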


On Mon, Jun 30, 2014 at 5:14 PM, Jon Bender jonathan.ben...@gmail.com
wrote:

 Hey all,

 I'm on Hive 0.10.0 on one of my clusters.  We had a namenode hostname
 change, so I'm trying to point all of our tables, partitions and databases
 to the new locations.

 When I describe database mydb, the location shows up as
 hdfs://old_hostname/user/hive/warehouse/mydb.db, and I want to set it
 to hdfs://new_hostname/user/hive/warehouse/mydb.db.

 Is there a way to do this? Or do I need to go poking around in the MySQL
 metadata to actually carry this out?

 Regards,
 Jon



Re: Alter location of database in Hive

2014-06-30 Thread Prem Yadav
I think you should be able to copy the data to a different location,
then drop the old db and create a new one pointing at the new location.
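
Something like this, for the record (a sketch; it assumes the data has
already been copied, e.g. with distcp, and that the tables will be
recreated afterwards, since CASCADE drops their metadata):

-- drop the metadata that points at the old namenode
DROP DATABASE mydb CASCADE;

-- recreate the database at the new location
CREATE DATABASE mydb
LOCATION 'hdfs://new_hostname/user/hive/warehouse/mydb.db';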


On Tue, Jul 1, 2014 at 1:54 AM, Jon Bender jonathan.ben...@gmail.com
wrote:

 Answered my own question: no, there is not. The way to do it is to modify
 the DB_LOCATION_URI field in metastore.DBS (at least if you're using MySQL).


 On Mon, Jun 30, 2014 at 5:14 PM, Jon Bender jonathan.ben...@gmail.com
 wrote:

 Hey all,

 I'm on Hive 0.10.0 on one of my clusters.  We had a namenode hostname
 change, so I'm trying to point all of our tables, partitions and databases
 to the new locations.

 When I describe database mydb, the location shows up as
 hdfs://old_hostname/user/hive/warehouse/mydb.db, and I want to set it
 to hdfs://new_hostname/user/hive/warehouse/mydb.db.

 Is there a way to do this? Or do I need to go poking around in the MySQL
 metadata to actually carry this out?

 Regards,
 Jon





Re: Column selection in Hive

2014-06-30 Thread Szehon Ho
Why don't you just use CTAS (CREATE TABLE AS SELECT), putting the 190
columns in the SELECT part? I guess you're maybe asking for a way to avoid
typing all the columns, but I'm not aware of one.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
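
Something along these lines, where col1 ... col190 stand in for whichever
190 columns you want to keep:

-- CTAS sketch: only the columns named in the SELECT list end up in the
-- new table (column and table names here are placeholders)
CREATE TABLE customer_190 AS
SELECT col1, col2, col3,
       -- ... the rest of the 190 columns ...
       col190
FROM Customer;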



On Mon, Jun 30, 2014 at 2:54 AM, Chhaya Vishwakarma 
chhaya.vishwaka...@lntinfotech.com wrote:

  Hi





 I have a Customer table with 200 columns, and I want to create another table
 from the Customer table that should have only 190 of them.

 How can I skip a few columns of the base table in Hive?



 Regards,

 Chhaya Vishwakarma






Calculation goes wrong when hive.groupby.skewindata is true and count(*), count(distinct), and group by are used simultaneously

2014-06-30 Thread Chris Chen
 

【Phenomenon】
The query returns different results depending on whether hive.groupby.skewindata
is set to true or false.

【My question】
I want to calculate count(*) and count(distinct) simultaneously;
otherwise it costs two MR jobs to calculate. But when I set
hive.groupby.skewindata to true, the count(*) result should not be the same as
the count(distinct), but the real result is the same, so it's wrong. I also found
a difference in the query plans: the Reduce Operator Tree's Group By
Operator mode is mergepartial when skew is set to false, and
complete when skew is set to true. So I'm confused about the root cause of the
error.

【SQL】
select ds,appid,eventname,active,count(distinct(guid)), count(*) from
eventinfo_tmp where ds='20140612' and length(eventname) < 1000 and eventname
like '%alibaba%' group by ds,appid,eventname,active;

【Other Hive configuration, excluding hive.groupby.skewindata】
hive.exec.compress.output=true
hive.exec.compress.intermediate=true
io.seqfile.compression.type=BLOCK
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hive.map.aggr=true
hive.stats.autogather=false
hive.exec.scratchdir=/user/complat/tmp
mapred.job.queue.name=complat
hive.exec.mode.local.auto=false
hive.exec.mode.local.auto.inputbytes.max=500
hive.exec.mode.local.auto.tasks.max=10
hive.exec.mode.local.auto.input.files.max=1000
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
mapred.max.split.size=1
mapred.min.split.size.per.node=1
mapred.min.split.size.per.rack=1

【Result】
When hive.groupby.skewindata=true the result is:
20140612 8 alibaba 1 87 147

When it is false the result is:
20140612 8 alibaba 1 87 87
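
A workaround I am considering is to avoid mixing count(*) and count(distinct)
in a single aggregation, so the skew rewrite has nothing to merge: pre-aggregate
by guid in a subquery. This gives up the single-job goal, but it should return
consistent results regardless of the skew setting:

-- sketch: the inner query yields one row per (group, guid); the outer
-- query then gets the distinct count as count(*) and the row count as
-- sum(cnt)
select ds, appid, eventname, active,
       count(*) as distinct_guids,
       sum(cnt) as total_rows
from (
  select ds, appid, eventname, active, guid, count(*) as cnt
  from eventinfo_tmp
  where ds='20140612' and length(eventname) < 1000
    and eventname like '%alibaba%'
  group by ds, appid, eventname, active, guid
) t
group by ds, appid, eventname, active;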

【Query plan】
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR
(TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR
(TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds)
'20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000))
(like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY
(TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname)
(TOK_TABLE_OR_COL active

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias - Map Operator Tree:
eventinfo_tmp 
TableScan
alias: eventinfo_tmp
Filter Operator
predicate:
expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
type: boolean
Select Operator
expressions:
expr: ds
type: string
expr: appid
type: string
expr: eventname
type: string
expr: active
type: int
expr: guid
type: string
outputColumnNames: ds, appid, eventname, active, guid
Group By Operator
aggregations:
expr: count(DISTINCT guid)
expr: count()
bucketGroup: false
keys:
expr: ds
type: string
expr: appid
type: string
expr: eventname
type: string
expr: active
type: int
expr: guid
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
expr: _col4
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
tag: -1
value expressions:
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col4:0._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
expr: KEY._col2
type: string
expr: KEY._col3
type: int
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
expr: _col4
type: bigint
expr: _col5
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
Fetch Operator
limit: -1

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR
(TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR
(TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds)
'20140612') (< (TOK_FUNCTION