Re: Hive-Parquet case sensitivity

Brock Noland Wed, 30 Jul 2014 08:45:08 -0700

Hi Raymond,

Would you be able to upload a small (1-10 rows) Parquet file that
demonstrates the bug without the fix?


https://issues.apache.org/jira/browse/HIVE-7554
https://issues.apache.org/jira/browse/PARQUET-54

Cheers,
Brock


On Tue, Jul 29, 2014 at 6:47 PM, Raymond Lau <[email protected]> wrote:

> I'm pretty sure my distribution is Cloudera, all my Hadoop/Hive folders
> have CDH in them.  Hive Version: 0.12
>
> Raymond
>
>
> On Tue, Jul 29, 2014 at 6:03 PM, Brock Noland <[email protected]> wrote:
>
> > Hi,
> >
> > Thanks for the message. I am looking at this issue myself.
> >
> > Which version of Hive are you using, and from which distribution?
> >
> > Brock
> > On Jul 29, 2014 1:09 PM, "Raymond Lau" <[email protected]> wrote:
> >
> > > So I'm having the same case sensitivity issue mentioned in a previous
> > > thread: https://groups.google.com/forum/#!topic/parquet-dev/ko-TM2lLpxE
> > >
> > > The solution that Christos posted works great, but it didn't work for me
> > > when it comes to *partitioned* external tables: either I couldn't read or
> > > I couldn't write.  All of the data I'm working with is already partitioned
> > > in HDFS, so all I need to do is run an "ALTER TABLE table ADD PARTITION
> > > (partitionkey = blah) LOCATION '/path/'".
> > >
> > > The workaround I made for this was to edit the init function in the
> > > DataWritableReadSupport class (original:
> > > https://github.com/Parquet/parquet-mr/blob/7b0778c490e6782a83663bd5b1ec9d8a7dd7c2ae/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
> > > ) so that lower-cased field names are used for the Hive table, and when
> > > the Parquet files are being read, the typeListWanted is edited so that it
> > > properly reads the data that I need.  I'm able to insert all of my data
> > > and run queries on it in Hive.
> > >
> > >     if (columns != null) {
> > >         final List<String> listColumns = getColumns(columns);
> > >
> > >         /* EDIT - create a map that maps lowercase field name -> normal
> > >            field name from the parquet files */
> > >         final Map<String, String> lowerCaseFileSchemaColumns =
> > >             new HashMap<String, String>();
> > >         for (ColumnDescriptor c : fileSchema.getColumns()) {
> > >             lowerCaseFileSchemaColumns.put(c.getPath()[0].toLowerCase(),
> > >                 c.getPath()[0]);
> > >         }
> > >
> > >         final List<Type> typeListTable = new ArrayList<Type>();
> > >         for (final String col : listColumns) {
> > >             /* EDIT - check if a Hive column field exists in the map,
> > >                instead of whether it exists in the parquet file schema.
> > >                this is where the case sensitivity would normally cause a
> > >                problem.  if it exists, get the type information from the
> > >                parquet file schema (we need the case sensitive field name
> > >                to get it) */
> > >             if (lowerCaseFileSchemaColumns.containsKey(col)) {
> > >                 typeListTable.add(
> > >                     fileSchema.getType(lowerCaseFileSchemaColumns.get(col)));
> > >             } else {
> > >                 typeListTable.add(new PrimitiveType(Repetition.OPTIONAL,
> > >                     PrimitiveTypeName.BINARY, col));
> > >             }
> > >         }
> > >
> > >         MessageType tableSchema = new MessageType(TABLE_SCHEMA,
> > >             typeListTable);
> > >         contextMetadata.put(HIVE_SCHEMA_KEY, tableSchema.toString());
> > >
> > >         MessageType requestedSchemaByUser = tableSchema;
> > >         final List<Integer> indexColumnsWanted =
> > >             getReadColumnIDs(configuration);
> > >
> > >         final List<Type> typeListWanted = new ArrayList<Type>();
> > >
> > >         /* EDIT - again we need the case sensitive field name for
> > >            getType */
> > >         for (final Integer idx : indexColumnsWanted) {
> > >             typeListWanted.add(tableSchema.getType(
> > >                 lowerCaseFileSchemaColumns.get(listColumns.get(idx))));
> > >         }
> > >
> > >     ....
> > >
> > > I was wondering if there were any consequences of doing it this way that
> > > I missed and whether this fix or something similar could someday become a
> > > patch.
> > >
> > > --
> > > *Raymond Lau*
> > > Software Engineer - Intern |
> > > [email protected] | (925) 395-3806
> > >
> >
>
>
>
> --
> *Raymond Lau*
> Software Engineer - Intern |
> [email protected] | (925) 395-3806
>
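
For readers skimming the thread, the lookup idea in Raymond's workaround can
be tried in isolation, without Hive or Parquet on the classpath. The sketch
below is only an illustration under stated assumptions: the class
CaseInsensitiveColumnLookup, its methods, and the sample column names are all
hypothetical, and plain strings stand in for the Parquet schema types. It also
makes two choices that the quoted code does not: a Hive column with no match
in the file keeps its Hive name, and two file columns that differ only by case
are rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone illustration; not part of Hive or parquet-mr.
class CaseInsensitiveColumnLookup {

    // Build a map from lower-cased column name to the name exactly as it
    // is spelled in the file schema.
    static Map<String, String> lowerCaseIndex(List<String> fileColumns) {
        Map<String, String> index = new HashMap<>();
        for (String name : fileColumns) {
            String previous = index.put(name.toLowerCase(Locale.ROOT), name);
            if (previous != null) {
                // Assumption: ambiguous names are treated as an error here.
                throw new IllegalArgumentException(
                    "Columns differ only by case: " + previous + ", " + name);
            }
        }
        return index;
    }

    // Resolve each Hive column name to the file's spelling. A column that
    // is missing from the file keeps its Hive name.
    static List<String> resolve(List<String> hiveColumns,
                                List<String> fileColumns) {
        Map<String, String> index = lowerCaseIndex(fileColumns);
        List<String> resolved = new ArrayList<>();
        for (String col : hiveColumns) {
            resolved.add(index.getOrDefault(col.toLowerCase(Locale.ROOT), col));
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("userId", "eventTime");
        List<String> hiveColumns = Arrays.asList("userid", "eventtime", "newcol");
        // prints [userId, eventTime, newcol]
        System.out.println(resolve(hiveColumns, fileColumns));
    }
}
```

The fallback matters for the last loop in the quoted code: a Hive column that
is absent from the file has no entry in the lower-case map, so a plain get()
on the map returns null for it.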
