Multiple insert on same table with INSERT OVERWRITE + INSERT INTO ?
Hi all,

Does anyone know what the expected behavior is for a query like this:

  FROM source
  INSERT OVERWRITE TABLE dest SELECT *
  INSERT INTO TABLE dest SELECT *

Is it the same as:

  FROM source
  INSERT INTO TABLE dest SELECT *
  INSERT OVERWRITE TABLE dest SELECT *

I'm asking because I ran a query like the first one on Hive 0.12, and it seems to me that the OVERWRITE was completely ignored: the old data (which was there before the query) was still there. And I checked, it was not an access-rights problem; the table was 777, with no sticky bit (I have already had issues with ghost data on OVERWRITE because of that).

One last thing: my table was partitioned and I used nonstrict dynamic partitioning, so perhaps that had an impact on the behavior too.

Thanks,
Furcy
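For what it's worth, if the intent of the first query is "end up with exactly two copies of source in dest", one way to avoid depending on how the OVERWRITE and INTO clauses interact is to fold both into a single OVERWRITE. This is only a sketch using the table names from the question, not a statement about what Hive guarantees for the mixed form:

```sql
-- Sketch: deterministic alternative to mixing OVERWRITE and INTO
-- on the same target table in one multi-insert. Replaces dest with
-- two copies of source in a single, well-defined write.
INSERT OVERWRITE TABLE dest
SELECT * FROM (
  SELECT * FROM source
  UNION ALL
  SELECT * FROM source
) u;
```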
Column selection in Hive
Hi,

I have a Customer table with 200 columns, and I want to create another table from the Customer table that should have only 190 columns. How can I skip a few columns of the base table in Hive?

Regards,
Chhaya Vishwakarma

The contents of this e-mail and any attachment(s) may contain confidential or privileged information for the intended recipient(s). Unintended recipients are prohibited from taking action on the basis of information in this e-mail and using or disseminating the information, and must notify the sender and delete it from their system. LT Infotech will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail.
ERROR 1115: Unsupported type: 5 in Pig's schema
Hi all!

I created an HBase table and an HCatalog table like the following:

  hive> describe formatted client;
  OK
  # col_name    data_type    comment
  a             string       from deserializer
  b             string       from deserializer
  c             int          from deserializer
  d             string       from deserializer
  e             string       from deserializer
  f             string       from deserializer
  g             string       from deserializer
  h             string       from deserializer
  i             string       from deserializer
  j             double       from deserializer
  k             int          from deserializer
  l             double       from deserializer
  m             double       from deserializer
  n             boolean      from deserializer

  # Detailed Table Information
  Database:        default
  Owner:           root
  CreateTime:      Wed Jun 25 17:29:55 EDT 2014
  LastAccessTime:  UNKNOWN
  Protect Mode:    None
  Retention:       0
  Location:        hdfs://localhost:8020/user/hive/warehouse/client
  Table Type:      MANAGED_TABLE
  Table Parameters:
    hbase.columns.mapping  :key, gender:b, age:c, maritalStatus:d, city:e, state:f, zip:g, BirthDate:h, startDate:i, indInsValue:j, totalInsPolicies:k, totalInsValue:l, maxInsValue:m, deceased:n
    hbase.table.name       client
    storage_handler        org.apache.hcatalog.hbase.HBaseHCatStorageHandler
    transient_lastDdlTime  1403731795

I am trying to store data into the HBase table via Pig, but I am getting the following error:

  clientData = LOAD 'hdfs:///user/Data/cleint.csv' USING PigStorage(',') as (a:chararray, b:chararray, c:int, d:chararray, e:chararray, f:chararray, g:chararray, h:chararray, i:chararray, j:double, k:int, l:double, m:double, n:boolean);
  STORE clientData INTO 'client' USING org.apache.hcatalog.pig.HCatStorer();

  2014-06-30 09:44:53,999 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1115: Unsupported type: 5 in Pig's schema

Does anyone understand what the problem could be?

Carlotta Hicks
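Type 5 in Pig's DataType enum is BOOLEAN, and the older org.apache.hcatalog HCatStorer shown here may not map Pig booleans to Hive types, which would make the n:boolean field the trigger. That reading of the error code is an assumption, not confirmed in the thread. One workaround sketch is to carry the flag as a string on the Hive side (and as chararray in Pig); note that ALTER may not be permitted for every storage handler, in which case the table would need to be recreated:

```sql
-- Hedged workaround sketch: if this HCatalog release cannot map
-- Pig's BOOLEAN (DataType 5), declare column n (deceased) as string
-- instead of boolean. Table/column names are from the thread.
ALTER TABLE client CHANGE n n string;
```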
Alter location of database in Hive
Hey all,

I'm on Hive 0.10.0 on one of my clusters. We had a namenode hostname change, so I'm trying to point all of our tables, partitions and databases to the new locations. When I describe database mydb, the location shows up as hdfs://old_hostname/user/hive/warehouse/mydb.db, and I want to set it to hdfs://new_hostname/user/hive/warehouse/mydb.db. Is there a way to do this, or do I need to go poking around in the MySQL metadata to actually carry this out?

Regards,
Jon
Re: Alter location of database in Hive
Answered my own question: no, there is not. The way to do it is to modify the DB_LOCATION_URI field in metastore.DBS (at least if you're using MySQL).
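The metastore edit described above can be sketched as MySQL statements. This is a sketch assuming a MySQL-backed metastore with the default schema; back up the metastore database before running anything like it:

```sql
-- Sketch: rewrite the old namenode URI in the metastore (MySQL backend
-- assumed). Back up the metastore database first.
UPDATE DBS
SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI,
                              'hdfs://old_hostname', 'hdfs://new_hostname');

-- Table and partition locations live in SDS.LOCATION and typically
-- need the same rewrite:
UPDATE SDS
SET LOCATION = REPLACE(LOCATION,
                       'hdfs://old_hostname', 'hdfs://new_hostname');
```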
Re: Alter location of database in Hive
I think you should be able to copy the data to a different location, then drop the old DB and create a new one with the new location.
Re: Column selection in Hive
Why don't you just use CTAS (CREATE TABLE AS SELECT), putting the 190 columns in the SELECT part? I guess you may be asking for a way to avoid typing all the columns, but I'm not aware of any.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
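A minimal CTAS sketch of the suggestion above; the column and table names here are hypothetical placeholders, with the 190 kept columns listed explicitly:

```sql
-- Sketch: create a narrower copy of the Customer table via CTAS.
-- col1 ... col190 are hypothetical placeholders for the kept columns.
CREATE TABLE customer_subset AS
SELECT col1, col2, col3   -- ... continue through col190
FROM Customer;
```

Depending on the Hive version, a regex column specification such as SELECT `(skip_col1|skip_col2)?+.+` can exclude columns without typing the remaining ones (in newer releases this requires hive.support.quoted.identifiers=none); whether it applies here depends on your version, so treat it as something to verify.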
Calculation goes wrong when hive.groupby.skewindata is true and count(*), count(distinct), and group by are used simultaneously
【Phenomenon】
The query results are not the same when hive.groupby.skewindata is set to true as when it is set to false.

【My question】
I want to calculate count(*) and count(distinct) simultaneously; otherwise it costs 2 MR jobs to calculate. When I set hive.groupby.skewindata to true, the count(*) result should not be the same as the count(distinct), but in the real result they are the same, so it's wrong. I also found a difference in the query plans: the Reduce Operator Tree's Group By Operator has mode mergepartial when skew is set to false, and mode complete when skew is set to true. So I'm confused about the root cause of the error.

【SQL】
select ds, appid, eventname, active, count(distinct(guid)), count(*)
from eventinfo_tmp
where ds='20140612' and length(eventname)1000 and eventname like '%alibaba%'
group by ds, appid, eventname, active;

【Other Hive configuration, excluding hive.groupby.skewindata】
hive.exec.compress.output=true
hive.exec.compress.intermediate=true
io.seqfile.compression.type=BLOCK
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hive.map.aggr=true
hive.stats.autogather=false
hive.exec.scratchdir=/user/complat/tmp
mapred.job.queue.name=complat
hive.exec.mode.local.auto=false
hive.exec.mode.local.auto.inputbytes.max=500
hive.exec.mode.local.auto.tasks.max=10
hive.exec.mode.local.auto.input.files.max=1000
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
mapred.max.split.size=1
mapred.min.split.size.per.node=1
mapred.min.split.size.per.rack=1

【Result】
When hive.groupby.skewindata=true the result is:
20140612 8 alibaba 1 87 147
When it is false the result is:
20140612 8 alibaba 1 87 87

【Query plan】
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') ( (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL active

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        eventinfo_tmp
          TableScan
            alias: eventinfo_tmp
            Filter Operator
              predicate: expr: ((length(eventname) 1000) and (eventname like '%tvvideo_setting%')) type: boolean
              Select Operator
                expressions: expr: ds type: string, expr: appid type: string, expr: eventname type: string, expr: active type: int, expr: guid type: string
                outputColumnNames: ds, appid, eventname, active, guid
                Group By Operator
                  aggregations: expr: count(DISTINCT guid), expr: count()
                  bucketGroup: false
                  keys: expr: ds type: string, expr: appid type: string, expr: eventname type: string, expr: active type: int, expr: guid type: string
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                  Reduce Output Operator
                    key expressions: expr: _col0 type: string, expr: _col1 type: string, expr: _col2 type: string, expr: _col3 type: int, expr: _col4 type: string
                    sort order: +
                    Map-reduce partition columns: expr: _col0 type: string, expr: _col1 type: string, expr: _col2 type: string, expr: _col3 type: int
                    tag: -1
                    value expressions: expr: _col5 type: bigint, expr: _col6 type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations: expr: count(DISTINCT KEY._col4:0._col0), expr: count(VALUE._col1)
          bucketGroup: false
          keys: expr: KEY._col0 type: string, expr: KEY._col1 type: string, expr: KEY._col2 type: string, expr: KEY._col3 type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Select Operator
            expressions: expr: _col0 type: string, expr: _col1 type: string, expr: _col2 type: string, expr: _col3 type: int, expr: _col4 type: bigint, expr: _col5 type: bigint
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') ( (TOK_FUNCTION
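Whatever the root cause turns out to be, one skew-robust way to get count(distinct) alongside count(*) is the usual two-level group-by rewrite: an inner GROUP BY on guid, then an outer aggregation. This is a sketch based on the query from the thread; the "< 1000" comparison operator is an assumption, since the operator did not survive in the original message:

```sql
-- Sketch: skew-robust equivalent of count(DISTINCT guid) + count(*).
-- COUNT(1) over the inner rows gives one per distinct guid; SUM(cnt)
-- restores the total row count behind those guids.
SELECT ds, appid, eventname, active,
       COUNT(1) AS distinct_guids,
       SUM(cnt) AS total_rows
FROM (
  SELECT ds, appid, eventname, active, guid, COUNT(*) AS cnt
  FROM eventinfo_tmp
  WHERE ds = '20140612'
    AND length(eventname) < 1000   -- operator assumed; unclear in original
    AND eventname LIKE '%alibaba%'
  GROUP BY ds, appid, eventname, active, guid
) per_guid
GROUP BY ds, appid, eventname, active;
```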