New load store design does not allow Pig to validate inputs and outputs up front
--------------------------------------------------------------------------------

                 Key: PIG-1216
                 URL: https://issues.apache.org/jira/browse/PIG-1216
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Alan Gates


In Pig 0.6 and earlier, Pig attempted to verify the existence of inputs and the 
non-existence of outputs during parsing, to avoid run time failures when inputs 
don't exist or outputs can't be overwritten.  The downside to this was that Pig 
assumed all inputs and outputs were HDFS files, which made implementation 
harder for non-HDFS based load and store functions.  In the load store redesign 
(PIG-966) this was delegated to InputFormats and OutputFormats to avoid this 
problem and to make use of the checks already being done in those 
implementations.  Unfortunately, for Pig Latin scripts that run more than one 
MR job, this does not work well.  MR does not do input/output verification on 
all the jobs at once.  It does them one at a time.  So if a Pig Latin script 
results in 10 MR jobs and the file to store to at the end already exists, the 
first 9 jobs will be run before the 10th job discovers that the whole thing was 
doomed from the beginning.  

To avoid this, a validate call needs to be added to the new LoadFunc and 
StoreFunc interfaces.  Pig needs to pass this method enough information that 
the load function implementer can delegate to InputFormat.getSplits() and the 
store function implementer to OutputFormat.checkOutputSpecs() if they choose 
to.  Since 90% of all load and store functions use HDFS, and PigStorage will 
need the same check, the Pig team should implement a default file existence 
check on HDFS and make it available as a static method to other Load/Store 
function implementers.  
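
A rough sketch of the shape this could take is below.  It is only an 
illustration under stated assumptions, not a proposed patch: the names 
UpFrontValidation, ValidatingLoader, ValidatingStorer, validateInput, 
validateOutput, checkInputExists, checkOutputAbsent and validatePlan are all 
hypothetical, and the local file system (java.nio.file) stands in for HDFS so 
that the example runs without Hadoop on the classpath.  A real implementation 
would go through the Hadoop FileSystem API, or delegate to the 
InputFormat/OutputFormat checks mentioned above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are part of the Pig API. */
final class UpFrontValidation {

    /** Hypothetical hook a load function would implement. */
    interface ValidatingLoader {
        /** Returns null if the input is usable, otherwise an error message. */
        String validateInput(String location);
    }

    /** Hypothetical hook a store function would implement. */
    interface ValidatingStorer {
        /** Returns null if the output can be written, otherwise an error message. */
        String validateOutput(String location);
    }

    /** Default input check. Assumption: local files stand in for HDFS paths. */
    static String checkInputExists(String location) {
        return Files.exists(Paths.get(location))
                ? null
                : "Input does not exist: " + location;
    }

    /** Default output check. Assumption: local files stand in for HDFS paths. */
    static String checkOutputAbsent(String location) {
        return Files.exists(Paths.get(location))
                ? "Output already exists: " + location
                : null;
    }

    /**
     * Validates every load and store location of a script before any job is
     * launched, and collects all problems instead of stopping at the first.
     */
    static List<String> validatePlan(List<String> inputs, ValidatingLoader loader,
                                     List<String> outputs, ValidatingStorer storer) {
        List<String> errors = new ArrayList<>();
        for (String in : inputs) {
            String err = loader.validateInput(in);
            if (err != null) {
                errors.add(err);
            }
        }
        for (String out : outputs) {
            String err = storer.validateOutput(out);
            if (err != null) {
                errors.add(err);
            }
        }
        return errors;
    }
}
```

A load or store function that is happy with the default check could pass the 
static helpers directly, for example 
validatePlan(inputs, UpFrontValidation::checkInputExists, outputs, 
UpFrontValidation::checkOutputAbsent), and a non-HDFS function could supply its 
own check instead.  If the returned list is non-empty, the script can be 
rejected before the first MR job is submitted.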
