[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673742#comment-13673742 ]
pat chan commented on PIG-3341:
-------------------------------

Hi, you bring up two good design points.

1. Are more formats better for this use case? Some possible cons:
   a) The spec becomes more complicated for formats that will probably go unused. The simplest spec would be to conform to the W3C profile.
   b) You will have to support all of these formats forever.
   c) There could be a performance overhead to supporting the possibly unused formats.
   d) ToDate(s,f) and UDFs already give users the ability to handle any format that's needed.
   e) Asymmetry: it seems cleaner if the default parseable format is exactly the default printed format.

2. What is the design philosophy for invalid conversions? Quietly turning invalid values into null seems like a dangerous default, since it would be very hard to know whether a query over terabytes of data is hitting problems that are being silently ignored. A safer philosophy would make the default as strict as possible and then, if a user finds a legitimate case for null conversions, provide a way to enable them explicitly in the script.

cheers

> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by
> moving a single line in ToDate.java:
>
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>         Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>
> should become:
>
>     static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>
> There is no need to recompile the regular expression for every value.
> I'm not sure if this function is ever called concurrently, but Pattern
> objects are thread-safe anyway.
>
> As a test, I created a file of 10M timestamps:
>
>     for i in 0..10000000
>       puts '2000-01-01T00:00:00+23'
>     end
>
> I then ran this script:
>
>     grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
>
> Before the change the script took 160s; after the change, it took 120s.
>
> ----------------
>
> Another performance improvement can be made for invalid datetime values. If a
> datetime value is invalid, an exception is created and thrown, which is a
> costly way to fail a validity check. To test the performance impact, I
> created 10M invalid datetime values:
>
>     for i in 0..10000000
>       puts '2000-99-01T00:00:00+23'
>     end
>
> In this test, the regex pattern was always recompiled. I then ran this script:
>
>     grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
>
> The script took 190s.
>
> I understand this could be considered an edge case and might not be worth
> changing. However, if there are use cases where invalid dates are part of
> normal processing, then you might consider fixing this.
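The hoisting fix described in the issue can be sanity-checked outside Pig with a small standalone micro-benchmark. This is only a sketch: the class name, iteration count, and timing code are mine; the regex is copied verbatim from the patch.

```java
import java.util.regex.Pattern;

public class PatternReuseBench {
    // Timezone-suffix regex from ToDate.extractDateTimeZone (PIG-3341)
    static final String TZ_REGEX =
        "(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$";

    // Compiled once; Pattern instances are immutable and thread-safe
    static final Pattern STATIC_PATTERN = Pattern.compile(TZ_REGEX);

    public static void main(String[] args) {
        String ts = "2000-01-01T00:00:00+23";
        int n = 1_000_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // Recompiles the regex on every call, as in the unpatched code
            Pattern.compile(TZ_REGEX).matcher(ts).find();
        }
        long recompileMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // Reuses the single pre-compiled Pattern, as in the patched code
            STATIC_PATTERN.matcher(ts).find();
        }
        long reuseMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("recompile: " + recompileMs
                + " ms, reuse: " + reuseMs + " ms");
    }
}
```

On the reporter's 10M-row data the same one-line change cut load time from 160s to 120s; a loop like this would be expected to show a comparable gap, since Pattern.compile dominates the per-call cost while Matcher creation is cheap.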