[ https://issues.apache.org/jira/browse/PIG-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491288#comment-13491288 ]
Jonathan Coveney commented on PIG-3014:
---------------------------------------
Zhijie,
I think that the semantics in my patch are sufficient. You are correct that in
some cases, they might be "closer together" than we might want, but what does
that even mean? The semantics are not well specified. What if the optimizer in
fact put C before B? What if the optimizer had them run at the same time? What
if my cluster happens to be tuned to a certain workload...and so on and so on.
I think the only guarantee we can make is that "now" falls after the script
starts running, and that it is the same for every value in a given relation
that uses it. We can document this limitation (i.e. that "now" is a more or
less arbitrary value in between the beginning of your script and when it is
finished being parsed).
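To make the difference concrete, here is a minimal sketch of the two behaviours. It uses plain Java rather than Pig's UDF classes, and every name in it (PerCallNow, FixedNow, NowSemanticsDemo) is hypothetical. It also assumes the fixed value reaches each instance as a constructor argument, which may not be how the patch actually wires it up.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Pig classes: they only illustrate the two semantics.
class PerCallNow {
    // Reads the clock on every invocation, so rows in one relation can differ.
    Instant exec() {
        return Instant.now();
    }
}

class FixedNow {
    private final Instant fixed;

    // Assumption: the timestamp is chosen once up front and passed in as a string.
    FixedNow(String isoTimestamp) {
        this.fixed = Instant.parse(isoTimestamp);
    }

    // Every invocation returns the same value, regardless of when it runs.
    Instant exec() {
        return fixed;
    }
}

class NowSemanticsDemo {
    // Applies one FixedNow instance to `rows` rows and collects the results.
    static List<Instant> applyFixed(String isoTimestamp, int rows) {
        FixedNow udf = new FixedNow(isoTimestamp);
        List<Instant> out = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            out.add(udf.exec());
        }
        return out;
    }
}
```

With FixedNow, how stale the value is depends on how long the job runs, but every row in the relation agrees, which is the guarantee described above.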
I suppose there would be some utility in a CurrentTime() where the time is with
respect to the beginning of execution, but it could easily suffer from the same
issue if it were in a foreach with a really time-consuming value, where the
"now" value quickly becomes stale. I think the incremental gain is minimal, and
the incremental complexity is quite high. If you deeply disagree, though, we
can discuss how to do it. I think the following would work: for each
instantiation of the UDF, we create two unique files and put them in HDFS (I do
not think the distributed cache will work in this specific case, but it may).
Those files will be the constructor argument. On first execution, each mapper
tries to delete the file. Since delete is atomic, only one should succeed. This
is the leader. It will record the current time and serialize it to the second
file. We would have to coordinate atomicity...perhaps it could write a magic
value at the end of the serialized date time, so all of the mappers would read
the file until they read the magic number, and then they'd know it was done.
This would be pretty complicated for what I see as a minimal gain, but it would
probably be a "more correct" now() implementation. I do not know if Hadoop has
a more convenient coordination mechanism between mappers (this sort of goes
against the whole point).
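A rough sketch of that delete-to-elect-a-leader scheme follows. To stay self-contained it uses threads in place of mappers and java.nio.file on the local filesystem in place of HDFS, so it says nothing about how HDFS delete actually behaves under contention. The class name, the "#DONE" marker and the five-second timeout are all made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: local files and threads stand in for HDFS and mappers.
class SharedNowSketch {
    // Made-up marker appended after the timestamp to signal a complete write.
    static final String MAGIC = "#DONE";

    // Each "mapper" calls this once with the two pre-created files.
    static String resolveNow(Path token, Path value)
            throws IOException, InterruptedException {
        // Only one caller should see the delete succeed; that caller is the leader.
        if (Files.deleteIfExists(token)) {
            String payload = Instant.now().toString() + MAGIC;
            Files.write(value, payload.getBytes(StandardCharsets.UTF_8));
        }
        // Everyone, leader included, reads until the marker shows up.
        long deadline = System.currentTimeMillis() + 5_000; // arbitrary timeout
        while (System.currentTimeMillis() < deadline) {
            String content =
                new String(Files.readAllBytes(value), StandardCharsets.UTF_8);
            if (content.endsWith(MAGIC)) {
                return content.substring(0, content.length() - MAGIC.length());
            }
            Thread.sleep(10);
        }
        throw new IllegalStateException("leader never finished writing");
    }

    // Runs `mappers` concurrent callers and returns the timestamp each one saw.
    static List<String> simulate(int mappers) throws Exception {
        Path dir = Files.createTempDirectory("shared-now");
        Path token = Files.createFile(dir.resolve("token"));
        Path value = Files.createFile(dir.resolve("value"));
        ExecutorService pool = Executors.newFixedThreadPool(mappers);
        try {
            Callable<String> task = () -> resolveNow(token, value);
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < mappers; i++) {
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling SharedNowSketch.simulate(8) should return eight identical timestamp strings. The timeout matters in practice: if the leader dies after deleting the first file, the other callers would otherwise wait forever, which is part of the complexity mentioned above.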
I welcome more thoughts.
> CurrentTime() UDF has undesirable characteristics
> -------------------------------------------------
>
> Key: PIG-3014
> URL: https://issues.apache.org/jira/browse/PIG-3014
> Project: Pig
> Issue Type: Bug
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.12
>
> Attachments: PIG-3014-0.patch
>
>
> As part of the explanation of the new DateTime datatype I noticed that we had
> added a CurrentTime() UDF. The issue with this UDF is that it returns the
> current time _of every exec invocation_, which can lead to confusing results.
> In PIG-1431 I proposed a way such that every instance of the same NOW() will
> return the same time, which I think is better. Would enjoy thoughts.