[ https://issues.apache.org/jira/browse/PIG-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921908#action_12921908 ]
Mridul Muralidharan commented on PIG-1684:
------------------------------------------

I am not sure I understand the comments properly.

To elaborate, the use case is a fairly straightforward use of the interfaces.
Our store func does task-specific initialization (such as creating the "$OUTPUT/_temporary/attempt_id" directory), creates side files based on the data to be stored (also task specific), and so on.
The output committer flushes the various streams, ensures graceful close of resources, reconciles the final output, and moves the appropriate files/directories once the task is done, or cleans up if there is an abort.

This information is task specific for a store func and tightly coupled to the instance of the store func used (since only the store func has visibility into this data).

Please note that, in the bug description, different instances of the StoreFunc are being called on the SAME task - not across tasks or across frontend/backend.

We are OK with Pig serializing/deserializing the objects, managing its split lifecycle, etc. But if, within a given task, Pig consistently used a single instance of StoreFunc for all invocations (to get the output format, to get the record writer, to initialize, to write tuples and to commit), it would be consistent with the way we expect the invocations to happen. (IIRC, Hadoop does it this way, but I will need to recheck if there are concerns there.)

Hopefully this clarifies things. Please let me know if there are issues with the way we are using the interfaces or with our expectation of the behavior.

> Inconsistent usage of store func.
> ---------------------------------
>
>                 Key: PIG-1684
>                 URL: https://issues.apache.org/jira/browse/PIG-1684
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: A custom StoreFuncInterface used to store data at the reducer.
> (Output of a group)
>            Reporter: Mridul Muralidharan
>
> Pig seems to be using multiple instances of StoreFuncInterface in the reducer
> inconsistently.
> Some Hadoop API calls are made to one instance and others to another,
> which makes state management very inconsistent and requires hacks on our
> part to deal with it.
> The call snippet below should hopefully indicate the issue.
> The format is:
> Instance.toString() method_call.
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 getOutputFormat()
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 getOutputCommitter
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 setupTask
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 init
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 getOutputFormat()
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 getRecordWriter
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 init
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 putNext()
> ...
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 needsTaskCommit
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 commitTask
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 finish()
> As the trace shows, two instances are used for different purposes - one to get the
> record writer and do the actual write, and another to call the
> OutputCommitter and its methods.
> Since these calls go to different StoreFuncInterface instances, the output
> committer is unable to gracefully commit and clean up.
> I am not attaching the StoreFunc, but any user-defined StoreFunc will exhibit
> this behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
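
For readers following the thread, the instance mismatch shown in the call trace can be illustrated with a small stand-alone sketch. It deliberately does not use the Pig or Hadoop APIs: `SketchStoreFunc`, its `putNext`/`commitTask` methods, and the two driver methods are hypothetical names made up for illustration, and the "side files" are just strings in a list. It only shows why state written through one instance is invisible to a committer obtained from another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StoreFuncInstanceSketch {

    /** Hypothetical stand-in for a user StoreFunc that keeps task-specific state. */
    static class SketchStoreFunc {
        // Side files created while writing; known only to this instance.
        private final List<String> sideFiles = new ArrayList<>();

        void putNext(String tuple) {
            sideFiles.add("side-file-for-" + tuple);
        }

        /** Returns how many side files this instance was able to commit. */
        int commitTask() {
            int committed = sideFiles.size();
            sideFiles.clear();
            return committed;
        }
    }

    /** Mirrors the reported trace: one instance writes, a different one commits. */
    static int runWithTwoInstances(List<String> tuples) {
        SketchStoreFunc committerSide = new SketchStoreFunc();
        SketchStoreFunc writerSide = new SketchStoreFunc();
        for (String t : tuples) {
            writerSide.putNext(t);
        }
        // The committer instance never saw the writes, so it has nothing to commit.
        return committerSide.commitTask();
    }

    /** The behavior requested in the comment: one instance for the whole task. */
    static int runWithOneInstance(List<String> tuples) {
        SketchStoreFunc storeFunc = new SketchStoreFunc();
        for (String t : tuples) {
            storeFunc.putNext(t);
        }
        return storeFunc.commitTask();
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("a", "b", "c");
        System.out.println("two instances committed: " + runWithTwoInstances(tuples)); // 0
        System.out.println("one instance committed: " + runWithOneInstance(tuples));   // 3
    }
}
```

In the two-instance case the commit step sees no side files at all, which corresponds to the output committer being unable to commit or clean up what the writing instance created.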