Thanks Brock.  I uploaded it as an attachment to:

https://issues.apache.org/jira/browse/HIVE-7554

-Raymond


On Wed, Jul 30, 2014 at 8:42 AM, Brock Noland <[email protected]> wrote:

> Hi Raymond,
>
> Would you be able to upload a small (1-10 row) Parquet file which would
> demonstrate the bug without the fix?
>
> https://issues.apache.org/jira/browse/HIVE-7554
> https://issues.apache.org/jira/browse/PARQUET-54
>
> Cheers,
> Brock
>
>
> On Tue, Jul 29, 2014 at 6:47 PM, Raymond Lau <[email protected]> wrote:
>
> > I'm pretty sure my distribution is Cloudera; all my Hadoop/Hive folders
> > have CDH in them.  Hive version: 0.12
> >
> > Raymond
> >
> >
> > On Tue, Jul 29, 2014 at 6:03 PM, Brock Noland <[email protected]>
> wrote:
> >
> > > Hi,
> > >
> > > Thanks for the message. I am looking at this issue myself.
> > >
> > > Which version of Hive are you using, and from which distribution?
> > >
> > > Brock
> > > On Jul 29, 2014 1:09 PM, "Raymond Lau" <[email protected]> wrote:
> > >
> > > > So I'm having the same case sensitivity issue mentioned in a previous
> > > > thread: https://groups.google.com/forum/#!topic/parquet-dev/ko-TM2lLpxE
> > > >
> > > > The solution that Christos posted works great, but it didn't work for
> > > > me when it comes to *partitioned* external tables: either I couldn't
> > > > read or I couldn't write.  All of the data I'm working with is already
> > > > partitioned in HDFS, so all I need to do is run an 'ALTER TABLE table
> > > > ADD PARTITION (partitionkey = blah) LOCATION '/path/'.
> > > >
> > > > The workaround I made was to edit the init function in the
> > > > DataWritableReadSupport class (original:
> > > > https://github.com/Parquet/parquet-mr/blob/7b0778c490e6782a83663bd5b1ec9d8a7dd7c2ae/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java)
> > > > so that lower-cased field names are used for the Hive table, and when
> > > > the Parquet files are being read, typeListWanted is built so that it
> > > > properly reads the data I need.  I'm able to insert all of my data and
> > > > run queries on it in Hive.
> > > >
> > > >     if (columns != null) {
> > > >       final List<String> listColumns = getColumns(columns);
> > > >
> > > >       /* EDIT - create a map that maps lowercase field name ->
> > > >          normal field name from the parquet files */
> > > >       final Map<String, String> lowerCaseFileSchemaColumns =
> > > >           new HashMap<String, String>();
> > > >       for (ColumnDescriptor c : fileSchema.getColumns()) {
> > > >         lowerCaseFileSchemaColumns.put(c.getPath()[0].toLowerCase(),
> > > >             c.getPath()[0]);
> > > >       }
> > > >
> > > >       final List<Type> typeListTable = new ArrayList<Type>();
> > > >       for (final String col : listColumns) {
> > > >         /* EDIT - check if a Hive column field exists in the map,
> > > >            instead of whether it exists in the parquet file schema;
> > > >            this is where the case sensitivity would normally cause a
> > > >            problem.  If it exists, get the type information from the
> > > >            parquet file schema (we need the case-sensitive field name
> > > >            to get it). */
> > > >         if (lowerCaseFileSchemaColumns.containsKey(col)) {
> > > >           typeListTable.add(
> > > >               fileSchema.getType(lowerCaseFileSchemaColumns.get(col)));
> > > >         } else {
> > > >           typeListTable.add(new PrimitiveType(Repetition.OPTIONAL,
> > > >               PrimitiveTypeName.BINARY, col));
> > > >         }
> > > >       }
> > > >
> > > >       MessageType tableSchema = new MessageType(TABLE_SCHEMA,
> > > >           typeListTable);
> > > >       contextMetadata.put(HIVE_SCHEMA_KEY, tableSchema.toString());
> > > >
> > > >       MessageType requestedSchemaByUser = tableSchema;
> > > >       final List<Integer> indexColumnsWanted =
> > > >           getReadColumnIDs(configuration);
> > > >
> > > >       final List<Type> typeListWanted = new ArrayList<Type>();
> > > >
> > > >       /* EDIT - again we need the case-sensitive field name for
> > > >          getType */
> > > >       for (final Integer idx : indexColumnsWanted) {
> > > >         typeListWanted.add(tableSchema.getType(
> > > >             lowerCaseFileSchemaColumns.get(listColumns.get(idx))));
> > > >       }
> > > >
> > > >     ....
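> > > >
> > > > For reference, here is a minimal, self-contained sketch of the same
> > > > lookup idea outside of Hive (the schema, class, and field names here
> > > > are made up for illustration, and the imports assume the pre-Apache
> > > > parquet.schema packages from the parquet-mr tree linked above):
> > > >
> > > >     import java.util.HashMap;
> > > >     import java.util.Map;
> > > >
> > > >     import parquet.schema.MessageType;
> > > >     import parquet.schema.PrimitiveType;
> > > >     import parquet.schema.PrimitiveType.PrimitiveTypeName;
> > > >     import parquet.schema.Type;
> > > >     import parquet.schema.Type.Repetition;
> > > >
> > > >     public class LowerCaseLookupSketch {
> > > >       public static void main(String[] args) {
> > > >         // A file schema with a mixed-case field name, as written by a
> > > >         // non-Hive Parquet producer.
> > > >         MessageType fileSchema = new MessageType("file_schema",
> > > >             new PrimitiveType(Repetition.OPTIONAL,
> > > >                 PrimitiveTypeName.BINARY, "UserName"));
> > > >
> > > >         // Hive hands us the lower-cased column name.
> > > >         String hiveColumn = "username";
> > > >
> > > >         // Same mapping trick as above: lower-cased name -> actual
> > > >         // field name from the file schema.
> > > >         Map<String, String> lowerCaseFileSchemaColumns =
> > > >             new HashMap<String, String>();
> > > >         for (Type field : fileSchema.getFields()) {
> > > >           lowerCaseFileSchemaColumns.put(field.getName().toLowerCase(),
> > > >               field.getName());
> > > >         }
> > > >
> > > >         // fileSchema.getType("username") would fail because getType is
> > > >         // case sensitive; going through the map resolves the actual
> > > >         // field name first.
> > > >         Type resolved =
> > > >             fileSchema.getType(lowerCaseFileSchemaColumns.get(hiveColumn));
> > > >         System.out.println(resolved);  // optional binary UserName
> > > >       }
> > > >     }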
> > > >
> > > > I was wondering if there were any consequences of doing it this way
> > > > that I missed, and whether this fix or something similar could someday
> > > > become a patch.
> > > >
> > > > --
> > > > *Raymond Lau*
> > > > Software Engineer - Intern | [email protected] | (925) 395-3806
> > > >
> > >
> >
> >
> >
> > --
> > *Raymond Lau*
> > Software Engineer - Intern | [email protected] | (925) 395-3806
> >
>



-- 
*Raymond Lau*
Software Engineer - Intern | [email protected] | (925) 395-3806
