So I'm having the same case sensitivity issue mentioned in a previous
thread: https://groups.google.com/forum/#!topic/parquet-dev/ko-TM2lLpxE

The solution that Christos posted works great in general, but it didn't work
for me with *partitioned* external tables: either I couldn't read or I
couldn't write.  All of the data I'm working with is already partitioned in
HDFS, so all I need to do is run an ALTER TABLE table ADD PARTITION
(partitionkey = blah) LOCATION '/path/';

My workaround was to edit the init function in the
DataWritableReadSupport class (original -
https://github.com/Parquet/parquet-mr/blob/7b0778c490e6782a83663bd5b1ec9d8a7dd7c2ae/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java)
so that lower-cased field names are used for the Hive table schema, and so
that typeListWanted is built with the case-sensitive field names when the
Parquet files are read.  With this change I'm able to insert all of my data
and run queries on it in Hive.

    if (columns != null) {
      final List<String> listColumns = getColumns(columns);

      /* EDIT - create a map from lowercase field name -> original
         field name in the Parquet file schema */
      final Map<String, String> lowerCaseFileSchemaColumns = new HashMap<String, String>();
      for (ColumnDescriptor c : fileSchema.getColumns()) {
        lowerCaseFileSchemaColumns.put(c.getPath()[0].toLowerCase(), c.getPath()[0]);
      }

      final List<Type> typeListTable = new ArrayList<Type>();
      for (final String col : listColumns) {
        /* EDIT - check whether the Hive column exists in the map, instead of
           whether it exists in the Parquet file schema directly; this is
           where case sensitivity would normally cause a problem.  If it
           exists, get the type information from the file schema (getType
           needs the case-sensitive field name). */
        if (lowerCaseFileSchemaColumns.containsKey(col)) {
          typeListTable.add(fileSchema.getType(lowerCaseFileSchemaColumns.get(col)));
        } else {
          typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
        }
      }

      MessageType tableSchema = new MessageType(TABLE_SCHEMA, typeListTable);
      contextMetadata.put(HIVE_SCHEMA_KEY, tableSchema.toString());

      MessageType requestedSchemaByUser = tableSchema;
      final List<Integer> indexColumnsWanted = getReadColumnIDs(configuration);

      final List<Type> typeListWanted = new ArrayList<Type>();

      /* EDIT - again, getType needs the case-sensitive field name */
      for (final Integer idx : indexColumnsWanted) {
        typeListWanted.add(tableSchema.getType(lowerCaseFileSchemaColumns.get(listColumns.get(idx))));
      }

    ....
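For anyone reading outside the parquet-mr tree, the core trick is just a
lowercase-to-original-spelling lookup table.  Here is a minimal standalone
sketch of that idea; the class, method, and field names below are my own
illustration, not anything from parquet-mr or Hive:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CaseInsensitiveLookup {

    // Map each Parquet field name, lowercased, back to its original spelling.
    static Map<String, String> buildLowerCaseMap(List<String> parquetFields) {
        Map<String, String> map = new HashMap<String, String>();
        for (String field : parquetFields) {
            map.put(field.toLowerCase(), field);
        }
        return map;
    }

    public static void main(String[] args) {
        // Mixed-case field names as they might appear in a Parquet file schema.
        List<String> parquetFields = Arrays.asList("UserId", "eventTime", "PageURL");
        Map<String, String> lookup = buildLowerCaseMap(parquetFields);

        // Hive hands us lowercased column names; recover the original spelling
        // before calling anything that needs the case-sensitive name.
        for (String hiveCol : Arrays.asList("userid", "eventtime", "pageurl", "missing")) {
            System.out.println(hiveCol + " -> " + lookup.getOrDefault(hiveCol, "(not in file)"));
        }
    }
}
```

The patched init above does the same thing, with the map built from
fileSchema.getColumns() and consulted before every getType call.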

I was wondering if there are any consequences of doing it this way that I
missed, and whether this fix or something similar could someday become a
patch.

-- 
*Raymond Lau*
Software Engineer - Intern |
[email protected] | (925) 395-3806
