[
https://issues.apache.org/jira/browse/HCATALOG-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403504#comment-13403504
]
Sushanth Sowmyan commented on HCATALOG-436:
-------------------------------------------
Here's what's going on:
In the case of CTAS, the table is written out before the metadata is created.
Also, the call to the SerDe's initialize() is made with a column schema that
uses internal column names, such as _col0, _col1, etc.
Looking through the Hive code reveals that this is by design: the write path is
position-dependent and name-independent, so the column names are obtained from
HiveConf.getColumnInternalName(int pos) (which has an equivalent
getPositionFromInternalName(String internalName) for reverse mapping) during
write time.
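To make the position-to-name mapping concrete, here is a minimal Python sketch
of the convention. It is not Hive's code: the function names are hypothetical
Python counterparts of the two HiveConf methods named above, the "_col<N>"
pattern is taken from the data shown in this issue, and returning -1 for a
non-internal name is an assumption of the sketch.

```python
import re

# Assumed pattern, based on the "_col0"/"_col1" keys seen in the written data.
_INTERNAL_NAME = re.compile(r"_col([0-9]+)")


def get_column_internal_name(pos: int) -> str:
    """Build the positional name used at write time, e.g. 0 -> "_col0"."""
    return f"_col{pos}"


def get_position_from_internal_name(internal_name: str) -> int:
    """Reverse mapping: "_col1" -> 1. Returns -1 (an assumption of this
    sketch) when the name does not look like an internal column name."""
    match = _INTERNAL_NAME.fullmatch(internal_name)
    return int(match.group(1)) if match else -1
```

With this mapping, a two-column CTAS result is written with the keys "_col0"
and "_col1" regardless of what the source columns were called.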
So, this problem cannot be fixed at write time from within the SerDe itself.
When writing from HCatalog (HCatOF-based), we do not have this problem.
An idea I had, and am implementing, then, is to make the reading more robust,
so that if it sees column names such as _col0, etc., it realizes it's reading
internal column names, and treats them as aliases for the appropriate position
in the record it's returning. The drawback of this approach is that people can
no longer use column names like "_col0" for columns that are not the 0th
column, and so on. That could be fixed by introducing escaping semantics, but
I'm loath to do so, as it adds unnecessary complexity that I think most people
won't care about.
If this gets to be a problem later on, we can introduce a conf parameter that
determines whether or not this translation behaviour is to be used.
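The read-side aliasing described above could look roughly like the following
Python sketch. It is an illustration under assumptions, not the actual JsonSerDe
change: resolve_position, read_record, and the translate_internal flag (standing
in for the possible conf parameter) are hypothetical names, and checking the
internal-name pattern before the schema names is an assumed ordering.

```python
import json
import re

# Assumed pattern for internal column names such as "_col0".
_INTERNAL_NAME = re.compile(r"_col([0-9]+)")


def resolve_position(field_name, schema_columns, translate_internal=True):
    """Return the record position for a JSON key, or -1 if it is unknown."""
    if translate_internal:
        match = _INTERNAL_NAME.fullmatch(field_name)
        if match and int(match.group(1)) < len(schema_columns):
            # Internal names are treated as aliases for their position.
            return int(match.group(1))
    if field_name in schema_columns:
        return schema_columns.index(field_name)
    return -1


def read_record(line, schema_columns, translate_internal=True):
    """Deserialize one JSON line into a positional record."""
    record = [None] * len(schema_columns)
    for name, value in json.loads(line).items():
        pos = resolve_position(name, schema_columns, translate_internal)
        if pos >= 0:
            record[pos] = value
    return record
```

Because the internal-name check comes first in this sketch, a schema such as
["a", "_col0"] would resolve the key "_col0" to position 0 rather than 1, which
is exactly the naming restriction mentioned above. Passing
translate_internal=False turns the aliasing off.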
One issue that this fix will still not "fix" is that the underlying JSON file
containing the data will still have "_col0"/etc. as keys and will not be
readable as-is by third-party consumers in a way that makes sense to them. They
will want to read it using HCatIF, which loads the metadata for them. Either
that, or the data needs to be written out using HCatOF instead.
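To illustrate why the raw file is opaque without the metadata, here is a small
Python sketch of what a consumer would have to do by hand if it knew the
table's column order. This is not something proposed in this issue;
rename_internal_keys is a hypothetical helper, and the column list has to be
supplied by the caller because the file itself does not carry it.

```python
import json
import re

# Assumed pattern for internal column names such as "_col0".
_INTERNAL_NAME = re.compile(r"_col([0-9]+)")


def rename_internal_keys(line, schema_columns):
    """Rewrite one JSON line, replacing "_colN" keys with real column names."""
    renamed = {}
    for key, value in json.loads(line).items():
        match = _INTERNAL_NAME.fullmatch(key)
        if match and int(match.group(1)) < len(schema_columns):
            key = schema_columns[int(match.group(1))]
        renamed[key] = value
    return json.dumps(renamed)
```

For the ttf_json table in this issue, the caller would pass
["sterm", "count"] as schema_columns.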
> JSON SerDe column misnaming on CTAS
> -----------------------------------
>
> Key: HCATALOG-436
> URL: https://issues.apache.org/jira/browse/HCATALOG-436
> Project: HCatalog
> Issue Type: Bug
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
> Labels: json, serde
>
> Given an origin table as follows:
> --
> hive -e 'describe extended ttf'
> OK
> sterm string
> count bigint
>
> Detailed Table Information Table(tableName:ttf, dbName:default,
> owner:hive, createTime:1339518715, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:sterm, type:string,
> comment:null), FieldSchema(name:count, type:bigint, comment:null)],
> location:hdfs://localhost:54310/user/hive/warehouse/ttf,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
> parameters:{}), partitionKeys:[], parameters:{numPartitions=0, numFiles=1,
> transient_lastDdlTime=1339518715, totalSize=2155, numRows=0, rawDataSize=0},
> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
> --
> On doing a CTAS, such as:
> --
> hive -e "create table ttf_json row format serde
> 'org.apache.hcatalog.data.JsonSerDe' as select * from ttf;"
> --
> We get a resultant table ttf_json with schema similar to ttf, but on looking
> at the data present in the json file itself, we'd notice data like this:
> --
> {"_col0":"S8.66045288732867","_col1":103}
> {"_col0":"S8.66322678828148","_col1":95}
> --
> This will then result in this table not being readable.
> This is behaviour similar to the one fixed in HCATALOG-275, but we've
> obviously not fixed all the possibilities of that problem.