[jira] [Commented] (HIVE-4329) HCatalog should use getHiveRecordWriter rather than getRecordWriter

David Chen (JIRA) Thu, 28 Aug 2014 15:51:27 -0700

    [ https://issues.apache.org/jira/browse/HIVE-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114530#comment-14114530 ]


David Chen commented on HIVE-4329:
----------------------------------

Hi Sushanth,

I really appreciate you taking the time to look at this patch, and I
appreciate your tips. However, I am still a bit unclear about some of the
concerns you mentioned.

bq. Unfortunately, this will not work, because that simply fetches a substitute 
HiveOutputFormat from a map of substitutes, which contain substitutes for only 
IgnoreKeyTextOutputFormat and SequenceFileOutputFormat.

From my understanding, {{HivePassThroughOutputFormat}} was introduced in order
to support generic OutputFormats and not just {{HiveOutputFormat}}. According
to
{{[HiveFileFormatUtils.getOutputFormatSubstitute|https://github.com/apache/hive/blob/b8250ac2f30539f6b23ce80a20a9e338d3d31458/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java]}},
{{HivePassThroughOutputFormat}} is returned if the {{OutputFormat}} does not
exist in the map but only if it is called with {{storageHandlerFlag = true}}.
From
[searching the codebase|https://github.com/apache/hive/search?utf8=%E2%9C%93&q=getOutputFormatSubstitute&type=Code],
the only place where {{getOutputFormatSubstitute}} could be called with
{{storageHandlerFlag}} set to true is in {{Table.getOutputFormatClass}} and if
the {{storage_handler}} property is set.

As a result, I changed my patch to retrieve the {{OutputFormat}} class using 
{{Table.getOutputFormatClass}} so that HCatalog would follow the same codepath 
as Hive proper for getting the {{OutputFormat}}. Does this address your concern?
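
To make the lookup behavior described above concrete, here is a minimal sketch.
It is not the Hive source: the class name {{SubstituteLookupSketch}}, the use of
strings instead of classes, the values stored in the map, and the empty result
for an unmapped format when the flag is false are all assumptions made for
illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hypothetical stand-in for the lookup described in this comment,
// not the real HiveFileFormatUtils. Class names are modeled as plain strings.
final class SubstituteLookupSketch {
    static final String PASS_THROUGH = "HivePassThroughOutputFormat";

    // Assumed map contents: keys are the two formats named in the quoted remark;
    // the substitute names on the right are placeholders.
    private static final Map<String, String> SUBSTITUTES = new HashMap<>();
    static {
        SUBSTITUTES.put("IgnoreKeyTextOutputFormat", "HiveIgnoreKeyTextOutputFormat");
        SUBSTITUTES.put("SequenceFileOutputFormat", "HiveSequenceFileOutputFormat");
    }

    static Optional<String> lookup(String outputFormat, boolean storageHandlerFlag) {
        String substitute = SUBSTITUTES.get(outputFormat);
        if (substitute != null) {
            return Optional.of(substitute);   // a registered substitute wins
        }
        if (storageHandlerFlag) {
            return Optional.of(PASS_THROUGH); // unmapped format, storage handler set
        }
        return Optional.empty();              // assumption: no substitute available
    }

    public static void main(String[] args) {
        // "SomeCustomOutputFormat" is a made-up name for an unmapped MR OutputFormat.
        System.out.println(lookup("SequenceFileOutputFormat", false));
        System.out.println(lookup("SomeCustomOutputFormat", true));
        System.out.println(lookup("SomeCustomOutputFormat", false));
    }
}
```

The point of the sketch is only that an unmapped format resolves to the
pass-through wrapper solely when the flag is set, which is why the caller
matters.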

bq. If your patch were so that it fetches an underlying HiveOutputFormat, and 
if it were a HiveOutputFormat, using getHiveRecordWriter, and if it were not, 
using getRecordWriter, that solution would not break runtime backward 
compatibility, and would be acceptable

I tried this approach, but I think that it is cleaner to change 
{{OutputFormatContainer}} and {{RecordWriterContainer}} to wrap the Hive 
implementations ({{HiveOutputFormat}} and {{FileSinkOperator.RecordWriter}}) 
rather than introduce yet another set of wrappers. After all, Hive already has 
a mechanism for supporting both Hive OFs and MR OFs by wrapping MR OFs with 
{{HivePassThroughOutputFormat}}, and I think that HCatalog should evolve to 
share more common infrastructure with Hive.
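
The shape of that design can be sketched roughly as follows. This is not the
patch itself: every type below is a hypothetical stand-in (no Hive or Hadoop
classes are used), and passing a null key through the adapter is an assumption
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: every type here is a hypothetical stand-in, not a Hive or Hadoop class.
final class WriterContainerSketch {
    /** Stand-in for an MR-style writer, which takes a key and a value. */
    interface MrStyleWriter { void write(Object key, Object value); }

    /** Stand-in for a Hive-style writer, which takes only a value. */
    interface HiveStyleWriter { void write(Object value); }

    /** Adapter in the pass-through role: exposes an MR-style writer as a Hive-style one. */
    static final class PassThroughWriter implements HiveStyleWriter {
        private final MrStyleWriter delegate;
        PassThroughWriter(MrStyleWriter delegate) { this.delegate = delegate; }
        @Override public void write(Object value) {
            delegate.write(null, value); // assumption: the key is unused, so null is passed
        }
    }

    // The container only ever sees the Hive-style interface.
    private final HiveStyleWriter writer;

    WriterContainerSketch(HiveStyleWriter writer) { this.writer = writer; }

    void write(Object record) { writer.write(record); }

    public static void main(String[] args) {
        List<Object> sink = new ArrayList<>();
        // A Hive-style writer is used directly...
        new WriterContainerSketch(v -> sink.add(v)).write("row-1");
        // ...while an MR-style writer is wrapped once, when the container is built.
        MrStyleWriter mr = (key, value) -> sink.add(value);
        new WriterContainerSketch(new PassThroughWriter(mr)).write("row-2");
        System.out.println(sink);
    }
}
```

With this shape the container needs no per-record check of which kind of
writer it holds; the choice is made once, at wrapping time.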

I have attached a new revision of my patch that fixes the problem this ticket
was originally opened for: writing to an Avro table via HCatalog now works.
A few issues remain, though:

 * Tables with static partitioning are not handled completely correctly. I
have opened HIVE-7855 to address that issue.
 * Writing to a Parquet table does not work, but more investigation is needed
to determine whether this is caused by a bug in HCatalog or in the Parquet
SerDe.

> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>
>                 Key: HIVE-4329
>                 URL: https://issues.apache.org/jira/browse/HIVE-4329
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>         Environment: discovered in Pig, but it looks like the root cause 
> impacts all non-Hive users
>            Reporter: Sean Busbey
>            Assignee: David Chen
>         Attachments: HIVE-4329.0.patch, HIVE-4329.1.patch, HIVE-4329.2.patch
>
>
> Attempting to write to a HCatalog defined table backed by the AvroSerde fails 
> with the following stacktrace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.io.LongWritable
>       at org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(AvroContainerOutputFormat.java:84)
>       at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:253)
>       at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:53)
>       at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:242)
>       at org.apache.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:52)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>       at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:559)
>       at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's 
> signature mandates a LongWritable key and HCat's FileRecordWriterContainer 
> forces a NullWritable. I'm not sure of a general fix, other than redefining 
> HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what's done in the other Hive 
> OutputFormats, and there's no reason AvroContainerOutputFormat couldn't also 
> be changed, since it's ignoring the key. That way fixing things so 
> FileRecordWriterContainer can always use NullWritable could get spun into a 
> different issue?
> The underlying cause for failure to write to AvroSerde tables is that 
> AvroContainerOutputFormat doesn't meaningfully implement getRecordWriter, so 
> fixing the above will just push the failure into the placeholder RecordWriter.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
