[jira] [Commented] (PIG-3014) CurrentTime() UDF has undesirable characteristics

Jonathan Coveney (JIRA) Tue, 06 Nov 2012 00:00:33 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491288#comment-13491288
 ]


Jonathan Coveney commented on PIG-3014:
---------------------------------------

Zhijie,

I think that the semantics in my patch are sufficient. You are correct that in 
some cases, they might be "closer together" than we might want, but what does 
that even mean? The semantics are not well specified. What if the optimizer in 
fact put C before B? What if the optimizer had them run at the same time? What 
if my cluster happens to be tuned to a certain workload...and so on and so on. 
I think the only guarantees we can make are that "now" falls some time after 
the script starts running, and that it is the same for every value in a given 
relation that uses it. We can document this limitation (i.e. that "now" is a 
more or less arbitrary value in between the beginning of your script and when 
it is finished being parsed).
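
To make that guarantee concrete, here is a minimal sketch in Python. It is not 
Pig's Java UDF API and it is not the attached patch; the class names PerCallNow 
and FixedNow and the captured_ms parameter are made up for illustration. It 
contrasts reading the clock on every exec call with capturing one value up 
front and returning it for every row.

```python
import time


class PerCallNow:
    """Reads the clock on every call, so each row may see a different value."""

    def exec(self):
        return int(time.time() * 1000)


class FixedNow:
    """Holds one timestamp (milliseconds) and returns it for every row.

    captured_ms is a hypothetical stand-in for a value fixed once, before
    any rows are processed, and handed to every instance of the function.
    """

    def __init__(self, captured_ms=None):
        if captured_ms is None:
            captured_ms = int(time.time() * 1000)
        self.captured_ms = captured_ms

    def exec(self):
        return self.captured_ms


if __name__ == "__main__":
    fixed = FixedNow()
    rows = [fixed.exec() for _ in range(5)]
    # Every row carries the same value, however long the loop takes.
    print(len(set(rows)) == 1)
```

Nothing here says where in the script's lifetime the captured value falls, 
only that it is shared, which is the limitation that would be documented.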

I suppose there would be some utility in a CurrentTime() where the time is with 
respect to the beginning of execution, but it could easily suffer from the same 
issue if it were in a foreach with a really time-consuming value, where the 
"now" value quickly becomes stale. I think the incremental gain is minimal, and 
the incremental complexity is quite high. If you deeply disagree, though, we 
can discuss how to do it. I think the following would work: for each 
instantiation of the UDF, we create two unique files and put them in HDFS (I do 
not think the distributed cache will work in this specific case, but it may). 
Those files will be the constructor argument. On first execution, each mapper 
tries to delete the file. Since delete is atomic, only one should succeed. This 
is the leader. It will record the current time and serialize it to the second 
file. We would have to coordinate atomicity...perhaps it could write a magic 
value at the end of the serialized date time, so all of the mappers would read 
the file until they read the magic number, and then they'd know it was done.
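
The steps above can be sketched as follows. This is a rough simulation in 
Python on the local filesystem, with threads standing in for mappers. It does 
not use the HDFS API, and every name in it (agree_on_now, demo, MAGIC, the 
file names) is hypothetical. It relies on os.remove succeeding for exactly one 
caller, which plays the role of the atomic delete.

```python
import os
import tempfile
import threading
import time

# Hypothetical end-of-record marker appended after the serialized time.
MAGIC = b"#DONE"


def agree_on_now(token_path, result_path, poll_seconds=0.01, timeout_seconds=5.0):
    """Return a timestamp (ms) that every caller sharing these paths agrees on."""
    try:
        # Only one caller can delete the token file; that caller is the leader.
        os.remove(token_path)
        is_leader = True
    except FileNotFoundError:
        is_leader = False

    if is_leader:
        now_ms = int(time.time() * 1000)
        with open(result_path, "wb") as f:
            f.write(str(now_ms).encode("ascii") + MAGIC)
        return now_ms

    # Followers poll until the file exists and ends with the marker, so a
    # missing or partially written file is never parsed.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            with open(result_path, "rb") as f:
                data = f.read()
        except FileNotFoundError:
            data = b""
        if data.endswith(MAGIC):
            return int(data[: -len(MAGIC)].decode("ascii"))
        time.sleep(poll_seconds)
    raise TimeoutError("leader never published a timestamp")


def demo(workers=8):
    """Run several simulated mappers and collect the value each one sees."""
    with tempfile.TemporaryDirectory() as d:
        token = os.path.join(d, "token")
        result = os.path.join(d, "now")
        open(token, "wb").close()  # created once, before any worker starts
        seen = []

        def worker():
            seen.append(agree_on_now(token, result))

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return seen


if __name__ == "__main__":
    print(demo())
```

The timeout is an addition of the sketch: without it, followers would wait 
forever if the leader died between deleting the first file and writing the 
second.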

This would be pretty complicated for what I see as a minimal gain, but it would 
probably be a "more correct" now() implementation. I do not know if Hadoop has 
a more convenient coordination mechanism between mappers (this sort of goes 
against the whole point).

I welcome more thoughts.
                
> CurrentTime() UDF has undesirable characteristics
> -------------------------------------------------
>
>                 Key: PIG-3014
>                 URL: https://issues.apache.org/jira/browse/PIG-3014
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.12
>
>         Attachments: PIG-3014-0.patch
>
>
> As part of the explanation of the new DateTime datatype I noticed that we had 
> added a CurrentTime() UDF. The issue with this UDF is that it returns the 
> current time _of every exec invocation_, which can lead to confusing results. 
> In PIG-1431 I proposed a way such that every instance of the same NOW() will 
> return the same time, which I think is better. Would enjoy thoughts.

