[
https://issues.apache.org/jira/browse/HIVE-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushanth Sowmyan updated HIVE-8704:
-----------------------------------
Status: Patch Available (was: Open)
> HivePassThroughOutputFormat cannot proxy more than one kind of OF (in one process)
> ----------------------------------------------------------------------------------
>
> Key: HIVE-8704
> URL: https://issues.apache.org/jira/browse/HIVE-8704
> Project: Hive
> Issue Type: Bug
> Components: StorageHandler
> Affects Versions: 0.14.0
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
> Attachments: HIVE-8704.patch
>
>
> HivePassThroughOutputFormat is a wrapper HiveOutputFormat that Hive uses to
> support StorageHandlers whose primary implementation is a mapred
> OutputFormat and which do not implement HiveOutputFormat themselves.
> However, HivePassThroughOutputFormat (henceforth called PTOF) has one major
> bug: it tracks the underlying OutputFormat it is proxying by means of a
> static string in HiveFileFormatUtils. There are a few problems with this.
> a) For starters, it means that a given process can only handle one
> PTOF-based output format at a time. So in an HS2 instance, if one thread is
> starting a job against HBase while another is starting one against
> Accumulo, the two threads will overwrite each other's "real" output format.
> This leads to bugs where a user writing to an HBase table gets stack traces
> from Accumulo like the following:
> {noformat}
> ERROR exec.Task: Job Submission failed with exception 'java.lang.NullPointerException(Expected Accumulo table name to be provided in job configuration)'
> java.lang.NullPointerException: Expected Accumulo table name to be provided in job configuration
>   at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
>   at org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat.configureAccumuloOutputFormat(HiveAccumuloTableOutputFormat.java:61)
>   at org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat.checkOutputSpecs(HiveAccumuloTableOutputFormat.java:43)
>   at org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat.checkOutputSpecs(HivePassThroughOutputFormat.java:87)
>   at org.apache.hadoop.hive.ql.exec.FileSinkOperator.checkOutputSpecs(FileSinkOperator.java:1071)
>   at org.apache.hadoop.hive.ql.io.HiveOutputFormatImpl.checkOutputSpecs(HiveOutputFormatImpl.java:67)
>   at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:465)
>   at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1294)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1291)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1291)
>   at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
>   at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:420)
>   at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161)
>   at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1603)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1363)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1176)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1003)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:998)
>   at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
>   at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
>   at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:508)
>   at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
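> To make the race concrete, here is a minimal, self-contained Java sketch
> (class and field names are illustrative, not Hive's actual source) of a
> process-wide static slot being clobbered by two concurrent queries, which
> is essentially what the static string in HiveFileFormatUtils amounts to
> under HS2:
> {code:java}
> // Sketch only: one static slot per JVM, shared by every query thread.
> public class StaticRealOfSketch {
>   static volatile String realOutputFormatName;
>
>   public static void main(String[] args) throws InterruptedException {
>     Thread hbaseQuery = new Thread(() -> realOutputFormatName =
>         "org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat");
>     Thread accumuloQuery = new Thread(() -> realOutputFormatName =
>         "org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat");
>     hbaseQuery.start();
>     accumuloQuery.start();
>     hbaseQuery.join();
>     accumuloQuery.join();
>     // Whichever write landed last wins; the losing query later drives the
>     // wrong "real" OF, e.g. the HBase query hitting the Accumulo NPE above.
>     System.out.println("real OF both queries now see: " + realOutputFormatName);
>   }
> }
> {code}
> One plausible direction for a fix is to scope the real OF to each job, e.g.
> by carrying its class name in the job's Configuration rather than in
> process-wide static state.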
> b) There is a bug in HiveFileFormatUtils.getOutputFormatSubstitute: after
> it determines that PTOF should act as the substitute, it registers PTOF
> itself in the substitute map. This seems innocuous, but it means the next
> lookup for that output format short-circuits on the cached entry and skips
> the path that sets the real OF. This is a problem because, if the same job
> were to prepare writing to an HBase table, then to an Accumulo table, then
> to an HBase table again, the second time HBase comes along the underlying
> "real" OF is still set to Accumulo, and the HBase map lookup short-circuits
> past the path that would reset the real OF back to HBase, as sketched
> below.
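> Here is a simplified model of that short-circuit (method and field names
> only approximate Hive's; the logic is reduced to the caching behavior
> described above):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Sketch only: models the substitute-map caching, not the real method body.
> public class SubstituteMapSketch {
>   static class PTOF {}       // stand-in for HivePassThroughOutputFormat
>   static class HBaseOF {}    // stand-in for the HBase output format
>   static class AccumuloOF {} // stand-in for the Accumulo output format
>
>   static final Map<Class<?>, Class<?>> substituteMap = new HashMap<>();
>   static String realOutputFormatName; // the shared static slot from (a)
>
>   static Class<?> getOutputFormatSubstitute(Class<?> origin) {
>     Class<?> cached = substituteMap.get(origin);
>     if (cached != null) {
>       return cached; // BUG: returns PTOF without refreshing realOutputFormatName
>     }
>     realOutputFormatName = origin.getName(); // only reached on a first lookup
>     substituteMap.put(origin, PTOF.class);   // caching PTOF is what bites later
>     return PTOF.class;
>   }
>
>   public static void main(String[] args) {
>     getOutputFormatSubstitute(HBaseOF.class);    // real OF -> HBaseOF
>     getOutputFormatSubstitute(AccumuloOF.class); // real OF -> AccumuloOF
>     getOutputFormatSubstitute(HBaseOF.class);    // cache hit; real OF unchanged
>     // Prints the AccumuloOF name: the reset-back-to-HBase path was skipped.
>     System.out.println("real OF after second HBase lookup: " + realOutputFormatName);
>   }
> }
> {code}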
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)