[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672719#comment-13672719
 ] 

pat chan commented on PIG-3341:
-------------------------------

Before making the fix, I think we need more clarity about exactly which 
formats are supported. For example, Pig 0.11.1 currently accepts datetime 
strings with no date: "T00:00:00" produces a date in 1970. Is this 
intentional? 

The general issue is that the actual implementation supports many more formats 
than are specified in the ISO 8601 profile at http://www.w3.org/TR/NOTE-datetime. 
What is the policy regarding these extra formats? I see three choices:

1. Specify precisely the entire range of supported formats. In the 
implementation I submitted above, the spec (taken from the Joda-Time docs) is:

 date-opt-time     = date-element ['T' [time-element] [offset]]
 date-element      = std-date-element | ord-date-element | week-date-element
 std-date-element  = yyyy ['-' MM ['-' dd]]
 ord-date-element  = yyyy ['-' DDD]
 week-date-element = xxxx '-W' ww ['-' e]
 time-element      = HH [minute-element] | [fraction]
 minute-element    = ':' mm [second-element] | [fraction]
 second-element    = ':' ss [fraction]
 fraction          = ('.' | ',') digit+

2. State that the implementation may parse formats beyond the W3C profile, but 
such formats may not be supported in future releases.

3. Run all dates through a regex that matches exactly the W3C profile; dates 
that don't conform to the format are turned into null.
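
To make choice 3 concrete, here is a rough sketch of what such a check might 
look like. The class and method names are hypothetical and nothing here is 
taken from ToDate.java. The regex is my reading of the six forms listed in the 
W3C note (a time always carries a zone designator, and an offset is always 
hh:mm). It checks shape only, so an out-of-range month such as 99 still gets 
through and would have to be rejected by the parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: a syntactic check against the
// W3C NOTE-datetime profile.
final class W3cProfileCheck {
    // Accepted shapes: YYYY, YYYY-MM, YYYY-MM-DD, and YYYY-MM-DDThh:mm[:ss[.s+]]TZD
    // where TZD is "Z" or a signed hh:mm offset. Compiled once and reused.
    private static final Pattern W3C_PROFILE = Pattern.compile(
        "\\d{4}(-\\d{2}(-\\d{2}(T\\d{2}:\\d{2}(:\\d{2}(\\.\\d+)?)?(Z|[+-]\\d{2}:\\d{2}))?)?)?");

    private W3cProfileCheck() {}

    // True if the whole string has one of the profile's shapes.
    // Field ranges (month 1-12 and so on) are NOT checked here.
    static boolean conforms(String s) {
        return s != null && W3C_PROFILE.matcher(s).matches();
    }

    // Returns the input unchanged if it conforms, otherwise null,
    // mirroring "dates that don't conform are turned into null".
    static String orNull(String s) {
        return conforms(s) ? s : null;
    }
}
```

Under this sketch "T00:00:00" and "2000-01-01T00:00:00+23" would both become 
null, the first because it has no date and the second because the offset has 
no minutes. That is stricter than what the current implementation accepts.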


                
> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by 
> moving a single line in ToDate.java:
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>       Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> should become:
>     static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not 
> sure whether this function is ever called concurrently, but Pattern objects 
> are thread-safe anyway.
> As a test, I created a file of 10M timestamps:
>   for i in 1..10000000
>     puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> ----------------
> Another performance improvement can be made for invalid datetime values. If a 
> datetime value is invalid, an exception is created and thrown, which is a 
> costly way to fail a validity check. To test the performance impact, I 
> created 10M invalid datetime values:
>   for i in 1..10000000
>     puts '2000-99-01T00:00:00+23'
>   end
> In this test, the regex pattern was always recompiled. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth 
> changing. However, if there are use cases where invalid dates are part of 
> normal processing, then you might consider fixing this.
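
On the invalid-value case quoted above: one possible way to avoid paying for 
an exception on every bad value is a cheap range check before the string 
reaches the parser. The sketch below is only an illustration under my own 
assumptions. The class and method names are hypothetical, it only looks at 
the plain yyyy-MM-dd form, and I have not measured whether it is actually 
faster than catching the exception.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-check, not part of Pig: flags values whose month or day
// is plainly out of range so they never reach the exception-throwing parser.
final class DateTimePrecheck {
    // yyyy-MM-dd, optionally followed by 'T' and anything else. Compiled once.
    private static final Pattern STD_DATE =
        Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})(?:T.*)?");

    private DateTimePrecheck() {}

    // True only when the string has the yyyy-MM-dd shape AND its month or day
    // is out of range. Any other shape returns false, leaving the decision
    // to the real parser.
    static boolean obviouslyInvalid(String s) {
        Matcher m = STD_DATE.matcher(s);
        if (!m.matches()) {
            return false;
        }
        int month = Integer.parseInt(m.group(2));
        int day = Integer.parseInt(m.group(3));
        return month < 1 || month > 12 || day < 1 || day > 31;
    }
}
```

With this, '2000-99-01T00:00:00+23' would be flagged without constructing an 
exception, while something like '2000-02-31' would still fall through to the 
parser and fail there as it does today.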

