[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672719#comment-13672719 ]
pat chan commented on PIG-3341:
-------------------------------

Before making the fix, I think we need more clarity about exactly which formats are supported. For example, Pig 0.11.1 currently accepts datetime strings with no date: "T00:00:00" produces a date in 1970. Is this intentional?

The general issue is that the actual implementation supports many more formats than are specified in the ISO 8601 profile at http://www.w3.org/TR/NOTE-datetime. What is the policy regarding these extra formats? I see three choices:

1. Specify precisely the entire range of supported formats. In the implementation I submitted above, the spec (taken from the Joda docs) is:

    date-opt-time     = date-element ['T' [time-element] [offset]]
    date-element      = std-date-element | ord-date-element | week-date-element
    std-date-element  = yyyy ['-' MM ['-' dd]]
    ord-date-element  = yyyy ['-' DDD]
    week-date-element = xxxx '-W' ww ['-' e]
    time-element      = HH [minute-element] | [fraction]
    minute-element    = ':' mm [second-element] | [fraction]
    second-element    = ':' ss [fraction]
    fraction          = ('.' | ',') digit+

2. State that the implementation may parse formats beyond the W3C profile, but that such formats may not be supported in future releases.

3. Run all dates through a regex that matches exactly the W3C profile, and turn dates that don't conform to the format into null.
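To make choice 3 concrete, here is a rough sketch of what such a check might look like. The class name `W3cDateTimeProfile` and the method `conforms` are hypothetical and not part of Pig; the regex is my own reading of the six formats listed in the W3C note (year, year-month, full date, and full date with hh:mm, hh:mm:ss, or hh:mm:ss.s plus a mandatory `Z` or `+hh:mm`/`-hh:mm` designator). It checks field ranges but not calendar validity, so a value like February 30 would still have to be rejected by the parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: checks a string against the
// W3C NOTE-datetime profile before it is handed to the real parser.
final class W3cDateTimeProfile {

    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern W3C_PROFILE = Pattern.compile(
        "\\d{4}"                              // YYYY
        + "(-(0[1-9]|1[0-2])"                 // -MM
        + "(-(0[1-9]|[12]\\d|3[01])"          // -DD
        + "(T([01]\\d|2[0-3]):[0-5]\\d"       // Thh:mm
        + "(:[0-5]\\d(\\.\\d+)?)?"            // optional :ss and .s fraction
        + "(Z|[+-]\\d{2}:\\d{2})"             // TZD, required once a time is given
        + ")?)?)?");

    // Returns true only if the whole string matches the profile.
    // A caller could map a false result to null without throwing.
    static boolean conforms(String s) {
        return s != null && W3C_PROFILE.matcher(s).matches();
    }
}
```

Note that under this reading both "T00:00:00" and the "+23" offset used in the test data of the issue would be rejected, since the profile requires a date and the full `+hh:mm` form.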
> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by
> moving a single line in ToDate.java:
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>         Pattern pattern =
>             Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");;
> should become:
>     static Pattern pattern =
>         Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not
> sure if this function is ever called concurrently, but Pattern objects are
> thread-safe anyways.
> As a test, I created a file of 10M timestamps:
>     for i in 0..10000000
>         puts '2000-01-01T00:00:00+23'
>     end
> I then ran this script:
>     grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> ----------------
> Another performance improvement can be made for invalid datetime values. If a
> datetime value is invalid, an exception is created and thrown, which is a
> costly way to fail a validity check. To test the performance impact, I
> created 10M invalid datetime values:
>     for i in 0..10000000
>         puts '2000-99-01T00:00:00+23'
>     end
> In this test, the regex pattern was always recompiled. I then ran this script:
>     grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth
> changing.
> However, if there are use cases where invalid dates are part of
> normal processing, then you might consider fixing this.
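For reference, a minimal self-contained sketch of the hoisting described in the issue. `OffsetExtractor` and `extractOffset` are hypothetical names used only for illustration; the real `extractDateTimeZone` in ToDate.java returns a Joda `DateTimeZone`, which is left out here to avoid the dependency. Only the regular expression is taken from the issue text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the relevant part of ToDate.java.
final class OffsetExtractor {

    // Compiled once when the class is loaded, not once per value.
    // Pattern is immutable, so sharing it across threads is safe;
    // each call still creates its own Matcher, which is not shareable.
    private static final Pattern OFFSET_PATTERN = Pattern.compile(
        "(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    // Returns the trailing zone text ("Z", "+23", "-08:00", ...) or
    // null when the string carries no zone designator.
    static String extractOffset(String dtStr) {
        Matcher m = OFFSET_PATTERN.matcher(dtStr);
        return m.find() ? m.group() : null;
    }
}
```

The lookbehind is what keeps a plain date such as "2000-01-01" from having its trailing "-01" mistaken for an offset: a signed offset is only accepted when it follows a `T` and a short run of time characters.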