[ 
https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834608#action_12834608
 ] 

Ashutosh Chauhan commented on PIG-1216:
---------------------------------------

Thinking more about this: we don't do validation on the input side because
the input location (or files) may be created during execution of the Pig
script, which makes such validation not only useless but incorrect. A
similar situation can exist for output validation as well. Take the simple
case of HDFS as storage, where the output location already exists in HDFS.
The user may have rmf statements in the script, so the output location is
actually deleted before the job that writes to it is executed. But if we
validate up front, Pig will fail and refuse to run the script, because
OutputFormat.checkOutputSpecs() reports at compile time that the output
location exists.
More generally, invariants that hold at compile time of a Pig script may no
longer hold at runtime, which makes this kind of validation at compile time
dangerous.
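
To make the rmf case concrete, here is a minimal sketch in Python. It is a
toy model only: a set of paths stands in for HDFS, and the step tuples and
the function names validate_upfront and run are hypothetical, not Pig or
Hadoop APIs.

```python
# Toy model only: a set of paths stands in for HDFS, and the (op, path)
# step tuples are hypothetical. None of this is Pig's or Hadoop's real API.

def validate_upfront(script, fs):
    """Compile-time style check: fail if any store target exists right now."""
    for op, path in script:
        if op == "store" and path in fs:
            raise RuntimeError(f"output location already exists: {path}")

def run(script, fs):
    """Execute steps in order, checking each store target only when it runs."""
    fs = set(fs)  # work on a copy so the caller's state is untouched
    for op, path in script:
        if op == "rmf":
            fs.discard(path)  # like rmf: no error if the path is absent
        elif op == "store":
            if path in fs:
                raise RuntimeError(f"output location already exists: {path}")
            fs.add(path)
    return fs

fs = {"/out/result"}                 # output location exists before the script starts
script = [("rmf", "/out/result"),    # the script removes it first...
          ("store", "/out/result")]  # ...so the store is valid by the time it runs

try:
    validate_upfront(script, fs)
    upfront_ok = True
except RuntimeError:
    upfront_ok = False

final_fs = run(script, fs)  # succeeds: the path is gone when the store executes
```

In this model the upfront check rejects a script that would have run
cleanly, which is the false failure described above.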

> New load store design does not allow Pig to validate inputs and outputs up 
> front
> --------------------------------------------------------------------------------
>
>                 Key: PIG-1216
>                 URL: https://issues.apache.org/jira/browse/PIG-1216
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>         Attachments: pig-1216.patch, pig-1216_1.patch
>
>
> In Pig 0.6 and before, Pig attempts to verify existence of inputs and 
> non-existence of outputs during parsing to avoid run time failures when 
> inputs don't exist or outputs can't be overwritten.  The downside to this was 
> that Pig assumed all inputs and outputs were HDFS files, which made 
> implementation harder for non-HDFS based load and store functions.  In the 
> load store redesign (PIG-966) this was delegated to InputFormats and 
> OutputFormats to avoid this problem and to make use of the checks already 
> being done in those implementations.  Unfortunately, for Pig Latin scripts 
> that run more than one MR job, this does not work well.  MR does not do 
> input/output verification on all the jobs at once.  It does them one at a 
> time.  So if a Pig Latin script results in 10 MR jobs and the file to store 
> to at the end already exists, the first 9 jobs will be run before the 10th 
> job discovers that the whole thing was doomed from the beginning.  
> To avoid this a validate call needs to be added to the new LoadFunc and 
> StoreFunc interfaces.  Pig needs to pass this method enough information that 
> the load function implementer can delegate to InputFormat.getSplits() and the 
> store function implementer to OutputFormat.checkOutputSpecs() if s/he decides 
> to.  Since 90% of all load and store functions use HDFS and PigStorage will 
> also need to, the Pig team should implement a default file existence check on 
> HDFS and make it available as a static method to other Load/Store function 
> implementers.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
