Re: HCatalog and Crunch named outputs

Josh Wills Thu, 26 Oct 2017 21:43:39 -0700

Hey Stephen,

I looked at this more closely, and I think each output of a job still needs
a unique JobID so that the various output committers work properly (and
actually commit the data for each output). You could disable the ID change
when the job has only a single output (or, if you would feel more
comfortable with me taking a pass at it, let me know and I'll try to get it
done this weekend), but that would mean the HCat output would only work when
it is the only output for a given MR job in Crunch. Would that work for your
use case?


The other option would be to roll your own HCat output format, but I don't
know how much appetite you have for that. Which version of HCat are you
running against? I got this to work against 0.4.0, of course, but that was a
long time ago.

J

On Wed, Oct 25, 2017 at 9:59 PM, Josh Wills <[email protected]> wrote:

> I'll need to look more closely at the code tomorrow to confirm, but IIRC,
> that comment was wrong, and it's safe to proceed with this change.
>
> Josh
>
> On Wed, Oct 25, 2017 at 1:57 PM Stephen Durfey <[email protected]> wrote:
>
>> I've recently taken up the work on CRUNCH-340 [1] to get a functioning
>> source and target for HCatalog. One of the issues I've run into is the
>> named output being added to the JobID, which then makes its way into the
>> TaskAttemptID. The stack trace is below.
>>
>> The issue is that the named output (e.g. 'out0') becomes part of the
>> TaskAttemptID. When the HCat output committer tries to map between
>> o.a.h.mapreduce.TaskAttemptID and o.a.h.mapred.TaskAttemptID [2], it fails
>> because TaskAttemptID.forName expects the id to have exactly 6 parts,
>> separated by underscores, and with the named output it has 7. If I stop
>> setting the named output on the JobID, then everything works fine [3].
>>
>> However, I am hesitant to make that change. In the version of the code I
>> am working against (0.11.x at the moment) there is a comment stating that
>> certain output formats rely upon this change. In the latest version of the
>> code in master, though, that comment has been removed. I'm curious whether
>> the comment was removed because it is no longer true, which would make it
>> safe to remove the named output from the job id, or whether there is a
>> better or preferred way to handle the exception below.
>>
>>
>> Error: java.lang.IllegalArgumentException: TaskAttemptId string :
>> attempt_1508401628996_out0_16350_m_000000_0 is not properly formed
>>   at org.apache.hadoop.mapreduce.TaskAttemptID.forName(TaskAttemptID.java:201)
>>   at org.apache.hadoop.mapred.TaskAttemptID.forName(TaskAttemptID.java:129)
>>   at org.apache.hive.hcatalog.mapreduce.HCatMapRedUtil.createTaskAttemptContext(HCatMapRedUtil.java:35)
>>   at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.setupTask(FileOutputCommitterContainer.java:172)
>>   at org.apache.crunch.io.CrunchOutputs$CompositeOutputCommitter.setupTask(CrunchOutputs.java:334)
>>   at org.apache.hadoop.mapred.Task.initialize(Task.java:582)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>>   at
>>
>>
>>
>> [1] https://issues.apache.org/jira/browse/CRUNCH-340
>> [2] https://github.com/cloudera/hive/blob/cdh5.13.0-release/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/HCatMapRedUtil.java#L34
>> [3] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/CrunchOutputs.java#L230
>>
>
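
For reference, the part-count failure described in the quoted message can be illustrated without a cluster. The sketch below is a simplified stand-in for the check, not Hadoop's actual TaskAttemptID.forName: the class AttemptIdCheck and its methods are hypothetical, and the only rule it encodes is the one stated in the thread (an attempt id must split into exactly 6 underscore-separated parts). The second id is the id from the stack trace with 'out0' removed by hand, which is an assumption about what the id would look like without the named output.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the part-count check described in the thread.
// This is NOT Hadoop's TaskAttemptID; class and method names are hypothetical.
final class AttemptIdCheck {
    private static final String SEPARATOR = "_";
    private static final int EXPECTED_PARTS = 6;

    // Split an attempt id string on underscores.
    static List<String> parts(String attemptId) {
        return Arrays.asList(attemptId.split(SEPARATOR));
    }

    // True when the id has the "attempt" prefix and exactly 6 parts.
    static boolean isWellFormed(String attemptId) {
        List<String> p = parts(attemptId);
        return p.size() == EXPECTED_PARTS && "attempt".equals(p.get(0));
    }

    public static void main(String[] args) {
        // Id taken verbatim from the stack trace: the named output adds a part.
        String withNamedOutput = "attempt_1508401628996_out0_16350_m_000000_0";
        // Assumed shape of the same id once 'out0' is no longer set on the JobID.
        String withoutNamedOutput = "attempt_1508401628996_16350_m_000000_0";

        for (String id : new String[] {withNamedOutput, withoutNamedOutput}) {
            System.out.println(id + " -> " + parts(id).size()
                + " parts, well formed: " + isWellFormed(id));
        }
    }
}
```

The id from the stack trace splits into 7 parts and is rejected, while the id without 'out0' splits into 6 and passes, which matches the behavior reported above.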
