I just want to note that, conceptually, using a transform like this
is a series of 3 steps:
1) dump
2) some stuff in the middle the semantic analyzer shouldn't care about
3) load
I'm not sure how what's actually happening differs from those steps,
but it feels wrong to have to specify my delimiter information in
multiple places and in every script I want to run that inserts data
into a table containing a column with a MAP or a LIST type.
Maybe Hive isn't set up like this right now, but all the information
needed to do any of those steps (even repeatedly) is already available
somewhere long before this query is ever run, so I should be able to
use that information rather than specify it again every time I want
to run a query like this.
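Ideally, the query I originally tried would just work as-is, with Hive
pulling the delimiters and serialization format from the table
definition:

INSERT OVERWRITE TABLE distinct_users
SELECT TRANSFORM(occurred_at, id, properties) USING '/bin/cat'
AS (occurred_at, id, properties)
FROM users;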
Is there something I'm missing? What was the use case for the
"multiple processing steps" that you mentioned in your last email?
Josh F.
On Jan 10, 2009, at 11:38 PM, Zheng Shao wrote:
I don't think it's a good idea to rely on the information from the
table. The data might go through multiple processing steps before it
reaches the final destination table, and the destination table may
store the data in any format (it may not be delimited at all).
What about allowing some syntax like this:
SELECT TRANSFORM(myint, mymap)
  ROW FORMAT DELIMITED KEY TERMINATED BY '3' COLLECTION ITEM TERMINATED BY '2'
USING '/bin/cat'
AS (myint INT, mymap MAP<STRING,STRING>)
  ROW FORMAT DELIMITED KEY TERMINATED BY '3' COLLECTION ITEM TERMINATED BY '2'
The first ROW FORMAT describes the input format for the script, and
the second describes the output format of the script.
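For example, applied to your users/distinct_users tables it might look
something like this (just a sketch of the proposed syntax, nothing that
works today; the ':' and ',' are the delimiters 58 and 44 from your
table definitions):

INSERT OVERWRITE TABLE distinct_users
SELECT TRANSFORM(occurred_at, id, properties)
  ROW FORMAT DELIMITED KEY TERMINATED BY ':' COLLECTION ITEM TERMINATED BY ','
  USING '/bin/cat'
  AS (occurred_at INT, id STRING, properties MAP<STRING,STRING>)
  ROW FORMAT DELIMITED KEY TERMINATED BY ':' COLLECTION ITEM TERMINATED BY ','
FROM users;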
Zheng
On Sat, Jan 10, 2009 at 11:22 PM, Josh Ferguson <[email protected]>
wrote:
My initial assumption when I tried to write the query was that it
would use the same delimiters I defined in the schema definition of
the target table. That led to my confusion, because I thought Hive
had enough information (in the schema) to do a proper string ->
map<x,x> conversion.
Maybe something like that could work?
Josh F.
On Jan 10, 2009, at 10:46 PM, Zheng Shao wrote:
Hi Josh,
Yes, the transform assumes every output column will be a string.
And if the input of the transform is not a string, it will be
converted to one. But we don't have a mechanism in the language to
convert a string back to map<string,string>. What do you think we
should do to support that?
Zheng
On Sat, Jan 10, 2009 at 10:25 PM, Josh Ferguson
<[email protected]> wrote:
One more small update: it seems that transform doesn't work at all
for inserting into columns of type MAP<X,Y>. I suspect this is
because the semantic analyzer treats all columns out of a custom
map phase as type 'STRING' and then complains when it can't convert
the assumed type into the type necessary, which is MAP<STRING,
STRING> in this case. Is this correct? Is anyone else using a MAP
type with custom map or reduce scripts? What queries have you
gotten to work?
Josh
On Sat, Jan 10, 2009 at 12:16 PM, Josh Ferguson
<[email protected]> wrote:
I want to follow up on this a little. Here are the schemas for the
source and destination tables, and the query I am trying to run.
Source table:
hive> DESCRIBE EXTENDED users;
OK
occurred_at int
id string
properties map<string,string>
account string
application string
dataset string
hour int
Detailed Table Information:
Table(tableName:users, dbName:default, owner:Josh, createTime:1231485489,
  lastAccessTime:0, retention:0,
  sd:StorageDescriptor(
    cols:[FieldSchema(name:occurred_at,type:int,comment:null),
          FieldSchema(name:id,type:string,comment:null),
          FieldSchema(name:properties,type:map<string,string>,comment:null)],
    location:/user/hive/warehouse/users,
    inputFormat:org.apache.hadoop.mapred.TextInputFormat,
    outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
    compressed:false, numBuckets:32,
    serdeInfo:SerDeInfo(name:null,
      serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,
      parameters:{colelction.delim=44, mapkey.delim=58,
        serialization.format=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),
    bucketCols:[id], sortCols:[], parameters:{}),
  partitionKeys:[FieldSchema(name:account,type:string,comment:null),
                 FieldSchema(name:application,type:string,comment:null),
                 FieldSchema(name:dataset,type:string,comment:null),
                 FieldSchema(name:hour,type:int,comment:null)],
  parameters:{})
Destination table:
hive> DESCRIBE EXTENDED distinct_users;
OK
occurred_at int
id string
properties map<string,string>
account string
application string
dataset string
hour int
Detailed Table Information:
Table(tableName:distinct_users, dbName:default, owner:Josh, createTime:1231488500,
  lastAccessTime:0, retention:0,
  sd:StorageDescriptor(
    cols:[FieldSchema(name:occurred_at,type:int,comment:null),
          FieldSchema(name:id,type:string,comment:null),
          FieldSchema(name:properties,type:map<string,string>,comment:null)],
    location:/user/hive/warehouse/distinct_users,
    inputFormat:org.apache.hadoop.mapred.TextInputFormat,
    outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
    compressed:false, numBuckets:32,
    serdeInfo:SerDeInfo(name:null,
      serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,
      parameters:{colelction.delim=44, mapkey.delim=58,
        serialization.format=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),
    bucketCols:[id], sortCols:[], parameters:{}),
  partitionKeys:[FieldSchema(name:account,type:string,comment:null),
                 FieldSchema(name:application,type:string,comment:null),
                 FieldSchema(name:dataset,type:string,comment:null),
                 FieldSchema(name:hour,type:int,comment:null)],
  parameters:{})
The query:
hive> INSERT OVERWRITE TABLE distinct_users SELECT
TRANSFORM(users.occurred_at, users.id, users.properties)
USING '/bin/cat' AS (occurred_at, id, properties) FROM users;
FAILED: Error in semantic analysis: line 1:23 Cannot insert into
target table because column number/types are different
distinct_users: Cannot convert column 2 from string to
map<string,string>.
I'm really confused, because the two tables are exactly the same
except for their names, and I'm just trying to do an insert from one
of them into the other using a script.
For reference, this appears to work:
hive> INSERT OVERWRITE TABLE distinct_users SELECT occurred_at, id,
properties FROM users;
What is it about transforming that is messing up the semantic
analysis?
Josh Ferguson
On Fri, Jan 9, 2009 at 11:52 AM, Josh Ferguson <[email protected]>
wrote:
Is it possible to do a query like the following:
INSERT OVERWRITE TABLE table1 PARTITION(...)
FROM table2
SELECT TRANSFORM(table2.col1, table2.col2, ...) USING '/my/script'
AS (col1, col2, ...)
WHERE (...)
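Concretely, mine is shaped roughly like this (the partition values here
are just made up for the example):

INSERT OVERWRITE TABLE distinct_users
  PARTITION (account='a1', application='app1', dataset='users', hour=0)
SELECT TRANSFORM(users.occurred_at, users.id, users.properties)
  USING '/my/script'
  AS (occurred_at, id, properties)
FROM users
WHERE users.hour = 0;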
I can run the SELECT TRANSFORM part of the query by itself fine, and
I get the results I expect. When I try to do the insert as well, I get
column type mismatch errors, even though my script outputs 3 columns
with the same types, in the same order, as they appear in table1. I
tried doing this with both a mapper and a reducer, similar to what was
shown in the ApacheCon slides, and it still didn't work. Am I doing
something wrong query-wise?
I'm using the 0.19 release.
Josh Ferguson
--
Yours,
Zheng
--
Yours,
Zheng