[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3341:
------------------------------------

    Attachment: PIG-3341-1.patch

Went with ISODateTimeFormat only, since it is a strict ISO parser. As for 
the additional forms Pat mentioned:

ord-date-element  = yyyy ['-' DDD]
week-date-element = xxxx '-W' ww ['-' e]

Both are part of the ISO 8601 standard; see 
http://en.wikipedia.org/wiki/ISO_8601#Week_dates and 
http://en.wikipedia.org/wiki/ISO_8601#Ordinal_dates. They are also described in 
http://www.cl.cam.ac.uk/~mgk25/iso-time.html, which 
http://www.w3.org/TR/NOTE-datetime refers to. 
http://www.w3.org/TR/NOTE-datetime defines only a profile of ISO 8601: a few 
date/time formats from the standard that are likely to satisfy most 
requirements. It is not the full set. 

Since ISODateTimeFormat is fully ISO 8601 compliant, went with it rather than 
spending time cutting the scope down to http://www.w3.org/TR/NOTE-datetime.
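
To make the two extra forms concrete, here is a minimal sketch of an ordinal 
date and a week date that name the same day. It is an illustration only: it 
uses the JDK's java.time formatters as a stand-in, not the Joda-Time 
ISODateTimeFormat that the patch uses, and the class and method names are 
hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration class; not part of Pig or of the patch.
// Assumption: java.time is used here only as a stand-in for Joda-Time.
final class IsoDateForms {
    // Ordinal date: year plus day-of-year, e.g. "2012-337".
    static LocalDate parseOrdinal(String s) {
        return LocalDate.parse(s, DateTimeFormatter.ISO_ORDINAL_DATE);
    }

    // Week date: week-based year, week number, day-of-week, e.g. "2012-W48-7".
    static LocalDate parseWeekDate(String s) {
        return LocalDate.parse(s, DateTimeFormatter.ISO_WEEK_DATE);
    }

    public static void main(String[] args) {
        // Both strings denote the same calendar day.
        System.out.println(parseOrdinal("2012-337"));
        System.out.println(parseWeekDate("2012-W48-7"));
    }
}
```

Neither form is in the http://www.w3.org/TR/NOTE-datetime profile, so a parser 
restricted to that profile would reject both strings.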

Also added some missing documentation for DateTime.
                
> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Assignee: Rohini Palaniswamy
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>         Attachments: PIG-3341-1.patch
>
>
> The performance of loading datetime values can be improved by about 25% by 
> moving a single line in ToDate.java:
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>       Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> should become:
>     static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not 
> sure if this function is ever called concurrently, but Pattern objects are 
> thread-safe anyways.
> As a test, I created a file of 10M timestamps:
>   for i in 1..10000000
>     puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> ----------------
> Another performance improvement can be made for invalid datetime values. If a 
> datetime value is invalid, an exception is created and thrown, which is a 
> costly way to fail a validity check. To test the performance impact, I 
> created 10M invalid datetime values:
>   for i in 1..10000000
>     puts '2000-99-01T00:00:00+23'
>   end
> In this test, the regex pattern was always recompiled. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth 
> changing. However, if there are use cases where invalid dates are part of 
> normal processing, then you might consider fixing this.
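
For reference, the first suggestion in the description above (compile the 
regular expression once instead of once per value) can be sketched as a 
self-contained class. This is a sketch under assumptions, not the actual 
ToDate.java: the class name and the extract method are hypothetical, and it 
returns the matched suffix as a String rather than a DateTimeZone so that it 
needs nothing beyond the JDK. The regular expression is the one quoted in the 
description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; stands in for the relevant part of ToDate.java.
final class TimeZoneSuffix {
    // Compiled once at class initialization and reused by every call.
    // Pattern instances are immutable and safe to share across threads.
    private static final Pattern TZ_SUFFIX = Pattern.compile(
        "(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    // Returns the trailing zone designator ("Z", "+23", "+05:30", ...),
    // or null when the string has none.
    static String extract(String dtStr) {
        // A Matcher is NOT thread-safe, so a new one is created per call.
        Matcher m = TZ_SUFFIX.matcher(dtStr);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("2000-01-01T00:00:00+23"));
        System.out.println(extract("2000-01-01"));
    }
}
```

Only the Pattern is shared; each call still creates its own Matcher, which is 
cheap compared with recompiling the expression.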

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
