[ 
https://issues.apache.org/jira/browse/PIG-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921908#action_12921908
 ] 

Mridul Muralidharan commented on PIG-1684:
------------------------------------------

I am not sure I understand the comments properly. To elaborate, the use case 
is a fairly straightforward use of the interfaces.


Our store func does task-specific initialization (such as creating the 
"$OUTPUT/_temporary/attempt_id" directory) and creates side files based on 
the data to be stored, which is also task specific.
The output committer flushes the various streams, ensures graceful close of 
resources, reconciles the final output, and moves the appropriate 
files/directories once the task is done, or cleans up if there is an 
abort.


This information is task specific for a store func and tightly coupled to 
the instance of the store func used, since only that instance has visibility 
into this data.


Please note that, in the bug description above, different instances of the 
store func are being called on the SAME task, not across tasks or between 
frontend and backend.


We are OK with Pig serializing/deserializing the objects, managing its split 
lifecycle, etc. But if, within a given task, it used a single instance of 
StoreFunc consistently for all invocations (to get the output format, to get 
the record writer, to initialize, to write tuples, and to commit), that would 
match the way we expect the invocations to happen (IIRC, Hadoop does it this 
way, but I will need to recheck if there are concerns there).
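
To make the expectation concrete, here is a minimal, self-contained sketch. 
It does not use the real Pig or Hadoop classes: SketchStoreFunc, Writer, 
Committer and StoreFuncInstanceDemo are hypothetical stand-ins invented for 
illustration, and the "side files" are just strings. It only models the 
coupling described above, where the committer can see nothing but the state 
of the instance it was obtained from.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a store func with task-specific state (not Pig's API). */
class SketchStoreFunc {
    // Side files recorded while writing; visible only through this instance.
    private final List<String> sideFiles = new ArrayList<>();

    /** Stand-in for the record writer obtained from this instance. */
    class Writer {
        void putNext(String tuple) {
            sideFiles.add("side-file-for-" + tuple);
        }
    }

    /** Stand-in for the output committer obtained from this instance. */
    class Committer {
        /** Returns the side files this committer can see and would reconcile. */
        List<String> commitTask() {
            return new ArrayList<>(sideFiles);
        }
    }

    Writer getRecordWriter() {
        return new Writer();
    }

    Committer getOutputCommitter() {
        return new Committer();
    }
}

class StoreFuncInstanceDemo {
    /** Mirrors the reported trace: committer from one instance, writer from another. */
    static List<String> runWithTwoInstances() {
        SketchStoreFunc first = new SketchStoreFunc();
        SketchStoreFunc second = new SketchStoreFunc();
        SketchStoreFunc.Committer committer = first.getOutputCommitter();
        SketchStoreFunc.Writer writer = second.getRecordWriter();
        writer.putNext("t1");
        writer.putNext("t2");
        return committer.commitTask();
    }

    /** The expected behavior: one instance serves every call within the task. */
    static List<String> runWithOneInstance() {
        SketchStoreFunc only = new SketchStoreFunc();
        SketchStoreFunc.Committer committer = only.getOutputCommitter();
        SketchStoreFunc.Writer writer = only.getRecordWriter();
        writer.putNext("t1");
        writer.putNext("t2");
        return committer.commitTask();
    }

    public static void main(String[] args) {
        System.out.println("two instances commit: " + runWithTwoInstances());
        System.out.println("one instance commits: " + runWithOneInstance());
    }
}
```

In this model the two-instance run hands the committer an empty list, because 
the writes went to an instance it never sees, while the single-instance run 
lets the committer see both side files.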

Hopefully this clarifies things. Please let me know in case there are issues 
with the way we are making use of the interfaces or with our expectation of 
the behavior.

> Inconsistent usage of store func.
> ---------------------------------
>
>                 Key: PIG-1684
>                 URL: https://issues.apache.org/jira/browse/PIG-1684
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: A custom StoreFuncInterface used to store data at the 
> reducer.
> (Output of a group )
>            Reporter: Mridul Muralidharan
>
> Pig seems to be using multiple instances of StoreFuncInterface in the reducer 
> inconsistently.
> Some Hadoop API calls are made to one instance and others to another, 
> which makes state management very inconsistent and requires hacks on our 
> part to deal with it.
> The call snippet below should hopefully indicate the issue.
> The format is :
> Instance.toString()   method_call.
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 getOutputFormat()
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 getOutputCommitter
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 setupTask
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 init
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 getOutputFormat()
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 getRecordWriter
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 init
> com.yahoo.psox.fish.pig.indexjoinst...@1429cb2 putNext()
> ... 
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 needsTaskCommit
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 commitTask
> com.yahoo.psox.fish.pig.indexjoinst...@1be4777 finish()
> As is obvious, two instances are used for different purposes - one to get the 
> record writer and do the actual write, and another to call the 
> OutputCommitter and its methods.
> Since they are from different instances (StoreFuncInterface), the output 
> committer is unable to gracefully commit and cleanup.
> I am not attaching the StoreFunc, but any user defined StoreFunc will exhibit 
> this behavior.

