Re: Hive-Parquet case sensitivity

Raymond Lau Tue, 29 Jul 2014 18:49:28 -0700

I'm pretty sure my distribution is Cloudera; all my Hadoop/Hive folders
have CDH in them.  Hive version: 0.12


Raymond


On Tue, Jul 29, 2014 at 6:03 PM, Brock Noland <[email protected]> wrote:

> Hi,
>
> Thanks for the message. I am looking at this issue myself.
>
> Which version of Hive are you using, and from which distribution?
>
> Brock
> On Jul 29, 2014 1:09 PM, "Raymond Lau" <[email protected]> wrote:
>
> > So I'm having the same case sensitivity issue mentioned in a previous
> > thread: https://groups.google.com/forum/#!topic/parquet-dev/ko-TM2lLpxE
> >
> > The solution that Christos posted works great, but it didn't work for me
> > when it comes to *partitioned* external tables: either I couldn't read
> > or I couldn't write.  All of the data I'm working with is already
> > partitioned in HDFS, so all I need to do is run an
> > "ALTER TABLE table ADD PARTITION (partitionkey = blah) LOCATION '/path/'".
> >
> > My workaround was to edit the init function in the
> > DataWritableReadSupport class (original:
> > https://github.com/Parquet/parquet-mr/blob/7b0778c490e6782a83663bd5b1ec9d8a7dd7c2ae/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
> > ) so that lower-cased field names are used for the Hive table, and, when
> > the Parquet files are read, typeListWanted is built so that it properly
> > reads the data that I need.  I'm able to insert all of my data and run
> > queries on it in Hive.
> >
> >     if (columns != null) {
> >         final List<String> listColumns = getColumns(columns);
> >
> >         /* EDIT - create a map that maps lowercase field name -> normal
> >            field name from the parquet files */
> >         final Map<String, String> lowerCaseFileSchemaColumns =
> >             new HashMap<String, String>();
> >         for (ColumnDescriptor c : fileSchema.getColumns()) {
> >             lowerCaseFileSchemaColumns.put(c.getPath()[0].toLowerCase(),
> >                                            c.getPath()[0]);
> >         }
> >
> >         final List<Type> typeListTable = new ArrayList<Type>();
> >         for (final String col : listColumns) {
> >             /* EDIT - check if a Hive column field exists in the map,
> >                instead of whether it exists in the parquet file schema.
> >                this is where the case sensitivity would normally cause a
> >                problem.  if it exists, get the type information from the
> >                parquet file schema (we need the case sensitive field name
> >                to get it) */
> >             if (lowerCaseFileSchemaColumns.containsKey(col)) {
> >                 typeListTable.add(
> >                     fileSchema.getType(lowerCaseFileSchemaColumns.get(col)));
> >             } else {
> >                 typeListTable.add(new PrimitiveType(Repetition.OPTIONAL,
> >                     PrimitiveTypeName.BINARY, col));
> >             }
> >         }
> >
> >         MessageType tableSchema = new MessageType(TABLE_SCHEMA,
> >             typeListTable);
> >         contextMetadata.put(HIVE_SCHEMA_KEY, tableSchema.toString());
> >
> >         MessageType requestedSchemaByUser = tableSchema;
> >         final List<Integer> indexColumnsWanted =
> >             getReadColumnIDs(configuration);
> >
> >         final List<Type> typeListWanted = new ArrayList<Type>();
> >
> >         /* EDIT - again we need the case sensitive field name for
> >            getType */
> >         for (final Integer idx : indexColumnsWanted) {
> >             typeListWanted.add(tableSchema.getType(
> >                 lowerCaseFileSchemaColumns.get(listColumns.get(idx))));
> >         }
> >
> >     ....
> >
> > I was wondering if there were any consequences of doing it this way that
> > I missed and whether this fix or something similar could someday become a
> > patch.
> >
> > --
> > *Raymond Lau*
> > Software Engineer - Intern |
> > [email protected] | (925) 395-3806
> >
>
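
For anyone following along, here is a minimal, self-contained sketch of the
lowercase-lookup idea from the quoted workaround, with the Parquet and Hive
types taken out. Plain strings stand in for Parquet column paths and Hive
column names, and the class and method names (CaseInsensitiveColumns,
lowerCaseIndex, resolve) are made up for illustration; none of this is
parquet-hive API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of parquet-hive. Strings stand in for the
// Parquet file's column names and the (lower-cased) Hive column names.
class CaseInsensitiveColumns {

    // Map each lower-cased column name to the spelling used in the file.
    static Map<String, String> lowerCaseIndex(List<String> fileColumns) {
        Map<String, String> index = new HashMap<String, String>();
        for (String name : fileColumns) {
            index.put(name.toLowerCase(Locale.ROOT), name);
        }
        return index;
    }

    // Resolve each Hive column to the file's spelling. A column that is
    // absent from the file keeps its Hive name instead of becoming null.
    static List<String> resolve(List<String> hiveColumns,
                                List<String> fileColumns) {
        Map<String, String> index = lowerCaseIndex(fileColumns);
        List<String> resolved = new ArrayList<String>();
        for (String col : hiveColumns) {
            String fileName = index.get(col.toLowerCase(Locale.ROOT));
            resolved.add(fileName != null ? fileName : col);
        }
        return resolved;
    }
}
```

Two things this makes visible. First, if the file has two columns whose
names differ only by case, the later one silently replaces the earlier one in
the map. Second, the sketch falls back to the Hive name for a column missing
from the file, whereas the second loop in the quoted code hands the result of
the map lookup straight to getType, so a requested column that is not in the
file would pass null there.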



-- 
*Raymond Lau*
Software Engineer - Intern |
[email protected] | (925) 395-3806
