My initial assumption when I wrote the query was that it would use the
same delimiters I defined in the schema of the target table. That's
where my confusion came from: I thought Hive had enough information
(in the schema) to do a proper string -> map<string,string>
conversion.
Maybe something like that could work?
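Something along these lines is what I'm imagining; this is purely a
hypothetical sketch (no str_to_map function exists today), with ','
and ':' standing in for the ASCII 44 and 58 delimiters from my schema
below:
hive> SELECT str_to_map(props_string, ',', ':') FROM some_table;
Here some_table and props_string are made-up names; props_string
would hold text like 'color:red,size:large', and str_to_map would
split it first on the collection delimiter and then split each entry
on the map-key delimiter.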
Josh F.
On Jan 10, 2009, at 10:46 PM, Zheng Shao wrote:
Hi Josh,
Yes, the transform assumes every output column will be a string. And
if the input of the transform is not a string, it will be converted
to one. But we don't have a mechanism in the language to convert a
string back to map<string,string>. What do you think we should do to
support that?
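To make the gap concrete (illustrating with the comma and colon
delimiters from your table): a map value like {color=red, size=large}
is flattened to a single string on its way into the script, e.g.
color:red,size:large
but there is nothing on the way back out that turns that string into
a map<string,string> again.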
Zheng
On Sat, Jan 10, 2009 at 10:25 PM, Josh Ferguson <[email protected]>
wrote:
One more small update: it seems that TRANSFORM doesn't work at all
for inserting into columns of type MAP<X,Y>. I suspect this is
because the semantic analyzer treats every column coming out of a
custom map phase as type STRING and then complains when it can't
convert that assumed type into the required one, which is
MAP<STRING,STRING> in this case. Is that correct? Is anyone else
using a MAP type with custom map or reduce scripts? What queries have
you gotten to work?
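For what it's worth, the STRING typing shows up even without the
INSERT; using /bin/cat as a pass-through:
hive> SELECT TRANSFORM(properties) USING '/bin/cat' AS (props) FROM users;
Here props comes back typed as STRING even though users.properties is
declared map<string,string>.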
Josh
On Sat, Jan 10, 2009 at 12:16 PM, Josh Ferguson <[email protected]>
wrote:
I want to follow up on this a little; here are the schemas for the
source and destination tables and the query I'm trying to run.
Source table:
hive> DESCRIBE EXTENDED users;
OK
occurred_at int
id string
properties map<string,string>
account string
application string
dataset string
hour int
Detailed Table Information:
Table(tableName:users, dbName:default, owner:Josh, createTime:1231485489,
  lastAccessTime:0, retention:0,
  sd:StorageDescriptor(
    cols:[FieldSchema(name:occurred_at, type:int, comment:null),
          FieldSchema(name:id, type:string, comment:null),
          FieldSchema(name:properties, type:map<string,string>, comment:null)],
    location:/user/hive/warehouse/users,
    inputFormat:org.apache.hadoop.mapred.TextInputFormat,
    outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
    compressed:false, numBuckets:32,
    serdeInfo:SerDeInfo(name:null,
      serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,
      parameters:{colelction.delim=44, mapkey.delim=58,
        serialization.format=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),
    bucketCols:[id], sortCols:[], parameters:{}),
  partitionKeys:[FieldSchema(name:account, type:string, comment:null),
                 FieldSchema(name:application, type:string, comment:null),
                 FieldSchema(name:dataset, type:string, comment:null),
                 FieldSchema(name:hour, type:int, comment:null)],
  parameters:{})
Destination table:
hive> DESCRIBE EXTENDED distinct_users;
OK
occurred_at int
id string
properties map<string,string>
account string
application string
dataset string
hour int
Detailed Table Information:
Table(tableName:distinct_users, dbName:default, owner:Josh, createTime:1231488500,
  lastAccessTime:0, retention:0,
  sd:StorageDescriptor(
    cols:[FieldSchema(name:occurred_at, type:int, comment:null),
          FieldSchema(name:id, type:string, comment:null),
          FieldSchema(name:properties, type:map<string,string>, comment:null)],
    location:/user/hive/warehouse/distinct_users,
    inputFormat:org.apache.hadoop.mapred.TextInputFormat,
    outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
    compressed:false, numBuckets:32,
    serdeInfo:SerDeInfo(name:null,
      serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,
      parameters:{colelction.delim=44, mapkey.delim=58,
        serialization.format=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),
    bucketCols:[id], sortCols:[], parameters:{}),
  partitionKeys:[FieldSchema(name:account, type:string, comment:null),
                 FieldSchema(name:application, type:string, comment:null),
                 FieldSchema(name:dataset, type:string, comment:null),
                 FieldSchema(name:hour, type:int, comment:null)],
  parameters:{})
The query:
hive> INSERT OVERWRITE TABLE distinct_users
SELECT TRANSFORM(users.occurred_at, users.id, users.properties)
USING '/bin/cat' AS (occurred_at, id, properties)
FROM users;
FAILED: Error in semantic analysis: line 1:23 Cannot insert into target table because column number/types are different distinct_users: Cannot convert column 2 from string to map<string,string>.
I'm really confused because the two tables are identical except for
their names, and all I'm trying to do is insert from one into the
other through a script.
For reference, this appears to work:
hive> INSERT OVERWRITE TABLE distinct_users
SELECT occurred_at, id, properties FROM users;
What is it about transforming that is messing up the semantic
analysis?
Josh Ferguson
On Fri, Jan 9, 2009 at 11:52 AM, Josh Ferguson <[email protected]>
wrote:
Is it possible to do a query like the following:
INSERT OVERWRITE TABLE table1 PARTITION(...)
FROM table2
SELECT TRANSFORM(table2.col1, table2.col2, ...) USING '/my/script'
AS (col1, col2, ...)
WHERE (...)
I can run the SELECT TRANSFORM segment of the query by itself just
fine, and I get the results I expect.
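Concretely, a standalone query along these lines runs and prints what
I expect (the table and column names here are placeholders, same as
in the template above):
hive> SELECT TRANSFORM(table2.col1, table2.col2, table2.col3)
USING '/my/script' AS (col1, col2, col3)
FROM table2;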
When I try to do the insert as well, I get column type mismatch
errors, even though my script outputs three columns with exactly the
types, in exactly the order, that they appear in table1. I tried this
with both a mapper and a reducer, similar to what was shown in the
ApacheCon slides, and it still didn't work. Am I doing something
wrong query-wise?
I'm using the 0.19 release.
Josh Ferguson
--
Yours,
Zheng