[
https://issues.apache.org/jira/browse/PIG-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xuzhou Yin updated PIG-5360:
----------------------------
Description:
In the getSplits() method of PigInputFormat, Pig sets the working directory of
the input file system to jobContext.getWorkingDirectory(). That value is always
the default working directory of the default file system (e.g.
hdfs://host:port/user/userId for HDFS) unless "mapreduce.job.working.dir" is
explicitly set to a non-default value. If the input path uses a non-default
file system (e.g. EmrFS), the call fails because it tries to set the working
directory of EmrFS to an HDFS path.
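To illustrate the failure mode, here is a minimal sketch in plain Java. It does not use Hadoop or Pig classes: ToyFileSystem, trySetWorkingDirectory, the host name, the bucket name, and the exception message are all hypothetical stand-ins, and the real check performed by EmrFS or HDFS may differ in detail.

```java
import java.net.URI;
import java.util.Objects;

public class WorkingDirSketch {

    /** Toy stand-in for a file system bound to one scheme and authority. */
    static final class ToyFileSystem {
        private final URI root;
        private URI workingDir;

        ToyFileSystem(String rootUri) {
            this.root = URI.create(rootUri);
            this.workingDir = this.root;
        }

        /** Rejects a working directory that belongs to a different file system. */
        void setWorkingDirectory(String dir) {
            URI candidate = URI.create(dir);
            boolean sameScheme = root.getScheme().equalsIgnoreCase(candidate.getScheme());
            boolean sameAuthority = Objects.equals(root.getAuthority(), candidate.getAuthority());
            if (!sameScheme || !sameAuthority) {
                throw new IllegalArgumentException(
                        "Wrong FS: " + dir + ", expected: " + root);
            }
            this.workingDir = candidate;
        }

        URI getWorkingDirectory() {
            return workingDir;
        }
    }

    /** Returns true if the job working directory is accepted by the given file system. */
    public static boolean trySetWorkingDirectory(String fsRoot, String jobWorkingDir) {
        try {
            new ToyFileSystem(fsRoot).setWorkingDirectory(jobWorkingDir);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed default: the job working directory lives on the default (HDFS) file system.
        String jobWorkingDir = "hdfs://namenode:8020/user/alice";

        // Input on the default file system: the call is accepted.
        System.out.println(trySetWorkingDirectory("hdfs://namenode:8020/", jobWorkingDir));

        // Input on a different file system: the HDFS path is rejected.
        System.out.println(trySetWorkingDirectory("s3://my-bucket/", jobWorkingDir));
    }
}
```

In this toy model the first call succeeds and the second is rejected, which mirrors the situation described above: the job working directory is an HDFS path, but it is applied to a file system that does not own that path.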
The proposed change is to remove this working-directory logic entirely, for
the following reasons.
First, getSplits() is only supposed to return a list of input splits. It should
not have side effects, especially since changing the working directory can
change the output path.
Second, the working directories of the input and output file systems are
inconsistent. If "mapreduce.job.working.dir" is set to a non-default value, it
affects only the output path (and only if that path is relative), because the
input path is already fully qualified before this logic runs.
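The asymmetry can be sketched with java.net.URI, used here only as a stand-in for Hadoop's path qualification. The qualify helper, the host name, and the directory names are hypothetical, and the sketch assumes relative paths are resolved against whichever working directory is in effect at the time.

```java
import java.net.URI;

public class RelativePathSketch {

    /** Resolves a path against a working directory; absolute URIs are returned unchanged. */
    public static String qualify(String workingDir, String path) {
        // A trailing slash makes URI.resolve append to the directory instead of replacing its last segment.
        String base = workingDir.endsWith("/") ? workingDir : workingDir + "/";
        return URI.create(base).resolve(path).toString();
    }

    public static void main(String[] args) {
        String defaultDir = "hdfs://namenode:8020/user/alice";
        String overriddenDir = "hdfs://namenode:8020/tmp/job"; // assumed non-default working dir

        // The input path is qualified early, against the default working directory.
        String input = qualify(defaultDir, "in/data.txt");

        // Re-qualifying an already absolute input path has no effect.
        System.out.println(qualify(overriddenDir, input));

        // A relative output path is qualified later and picks up the overridden directory.
        System.out.println(qualify(overriddenDir, "out"));
    }
}
```

Under these assumptions the input stays under /user/alice while the relative output lands under /tmp/job, so the two paths end up resolved against different working directories.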
Third, there is already a "CD" functionality that lets users change the working
directory. This logic overwrites the effect of "CD" when the input and output
paths both use the default file system.
was:
In getSplits() method in PigInputFormat, Pig is trying to set
the working directory of input File System to jobContext.getWorkingDirectory(),
which is always the default working directory of default file system (eg.
hdfs://host:port/user/userId in case of HDFS) unless
"mapreduce.job.working.dir" is explicitly set to non-default value. So if the
input path uses non-default file system (eg. EmrFS), then it will fail since it
is trying to set the working directory of EmrFS to a HDFS path.
The proposed change it to completely remove this logic of
setting working directory. There are several reasons for doing so.
Firstly, getSplits() is only supposed to return a list of input
splits. It should not have side effects (especially doing so can potentially
change the output path).
Secondly, there is inconsistency between the working directories
of input and output file systems. if "mapreduce.job.working.dir" is set to
non-default value, it will affect the output path only (if it is a relative
path) because input path will be made qualified even before this logic.
Thirdly, there is already a "CD" functionality that allows
customers to change the working directory. However, this logic will overwrite
the "CD" functionality if input and output paths both use default file
system.
> Pig sets working directory of input file systems causes exception thrown
> ------------------------------------------------------------------------
>
> Key: PIG-5360
> URL: https://issues.apache.org/jira/browse/PIG-5360
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.17.0
> Reporter: Xuzhou Yin
> Priority: Minor
> Labels: patch
> Fix For: 0.18.0
>
> Original Estimate: 504h
> Remaining Estimate: 504h
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)