Hey Alex,
On 06/17/2013 10:00 PM, Cameron Goodale wrote:
On June 17, 2013, 6:48 p.m., Alex Goodman wrote:
Kyo and I went over this together and there are a few issues that need to be
considered before committing this:
Thanks very much for bringing all of this to the list!
1) Although my version of calcTemporalCorrelation is faster, it is also longer
and harder to understand. It also has the disadvantage of not relying directly
on scipy for the calculation, so we may lose consistency with scipy after
future updates.
2) Jinwon mentioned in an e-mail that several of these functions do not work
for daily data, and this is true. It is not a huge concern now because our
current workflow primarily consists of monthly data, but it is a disadvantage
nonetheless. Luckily, a while back I wrote a function that does something
similar to reshapeMonthlyData() but for daily data, though I will still need
time to integrate and test it.
Thus, until these are fully taken care of, we think that we should either
define separate versions of these functions inside metrics.py (e.g.
calcClimSeasonOptimized()) or add a new module in rcmes/toolkit called
something like 'metrics2.py' containing all of my changes, while keeping the
older functions in metrics.py until support for daily data is added. I
understand that this would add some clutter to the existing codebase, so we
would like to hear what everyone else thinks.
What are your thoughts?
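Before getting to the metrics2.py question: for anyone on the list who hasn't
read the diff, the tradeoff in point 1 is roughly the difference between the
two approaches sketched below. The names, shapes, and details are made up for
illustration; this is not the actual calcTemporalCorrelation code.

import numpy as np
from scipy import stats

def temporal_corr_scipy(model, obs):
    # model, obs: arrays of shape (time, lat, lon).
    # Short and easy to verify, but the explicit Python loops over every
    # grid point make it slow on large grids.
    _, nlat, nlon = model.shape
    corr = np.zeros((nlat, nlon))
    for i in range(nlat):
        for j in range(nlon):
            corr[i, j] = stats.pearsonr(model[:, i, j], obs[:, i, j])[0]
    return corr

def temporal_corr_vectorized(model, obs):
    # The same Pearson correlation computed with whole-array operations
    # along the time axis: faster, but longer to read and no longer
    # deferring to scipy's implementation.
    m_anom = model - model.mean(axis=0)
    o_anom = obs - obs.mean(axis=0)
    cov = (m_anom * o_anom).sum(axis=0)
    denom = np.sqrt((m_anom ** 2).sum(axis=0) * (o_anom ** 2).sum(axis=0))
    return cov / denom

Both give the same answer on full (unmasked) data; the question is just which
cost, readability or speed, matters more here.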
Alex,
I don't like the idea of fragmenting the codebase by appending the number 2 to
file names, so I vote against creating metrics2.py.
I strongly agree with Cameron on this. I understand your thinking, but
my worry is that implementing arbitrary version numbering for individual
files will sow unnecessary confusion down the road.
It sounds like this review has grown too large and has started to shift away
from its original intent. Since the task has grown in scope, it makes sense to
break it down into smaller, more manageable tasks. You can open one or more
JIRA issues to track the additional changes that are not yet ready and still
need time to engineer and test. That would let you scale this review and its
JIRA issue back to a scope that you can commit now.
This sounds like a very sensible approach. Incrementally tracking
progress on tasks is likely easier for you as the developer, and it also
lets the community provide more focused feedback. One thing to consider here
is that JIRA allows sub-tasks (under the "more actions" tab for an issue). If
you'd like to tie more focused issues to
CLIMATE-88's broader thrust of "performance improvements for
metrics.py", sub-tasks would be an excellent way to show the
relationship explicitly. If you're not sure how to do it, go ahead and
open up the individual issues anyway, and we can group and classify them
under the umbrella of CLIMATE-88 after the fact.
Thanks again for taking the time to explain things on-list, and for your
excellent work on the metrics!
Best,
Andrew.
For a less abstract example: if you have 3 functions out of, say, 20 that can
be committed, then scale the JIRA scope down to just those 3 functions, update
the review, and commit them. Then open more JIRA issues to track the changes
needed to support the remaining 17 functions you want to change. Whether you
need to open just 1 more JIRA issue, or 4, or 17, is really up to you.
The CLIMATE-88 JIRA issue is to improve metrics performance, which is a massive
and rather ambiguous undertaking. So my thought is to scale that issue back,
focus it down to a patch/change you can commit now, and push the rest of the
code out as future work.
My $0.02
- Cameron
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11873/#review22008
-----------------------------------------------------------
On June 15, 2013, 3:38 a.m., Alex Goodman wrote:
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11873/
-----------------------------------------------------------
(Updated June 15, 2013, 3:38 a.m.)
Review request for Apache Open Climate, Cameron Goodale and Kyo Lee.
Description
-------
This is a major update to metrics as per CLIMATE-88. There are many changes,
most of them related to vectorizing many of the functions using various
indexing tricks. If you are unfamiliar with this terminology, it basically
means that operations are performed on an entire array chunk at once,
eliminating the need for explicit for loops and improving performance and code
readability in most cases. I'll summarize some of the resulting changes here
(a small sketch illustrating the idea follows the list below):
-For the most part, absolute performance increases were not that large, but the
code became significantly more concise. A few of the functions (particularly
the correlation calculation) showed very large gains. You can run the attached,
updated benchmark_metrics.py script to see for yourself (see the end of this
post for more details).
-Thanks to the addition of the reshapeMonthlyData() helper function in my
previous patch to misc.py, the explicit use of datetimes as a parameter for
many of the functions is no longer needed.
-The names of some variables were changed to adhere to our current coding
conventions.
-Functions that were commented out have been removed, in compliance with our
deprecation policy.
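For illustration, here is a minimal sketch of the reshaping-plus-vectorization
idea described above. The function names and the assumption that the record
starts in January and spans whole years are placeholders; this is not the
actual misc.py/metrics.py code.

import numpy as np

def reshape_monthly_sketch(data):
    # Reorganize a (time, lat, lon) monthly series into (years, 12, lat, lon).
    # Assumes the record starts in January and contains only whole years;
    # the real reshapeMonthlyData() may be more general than this.
    nt, nlat, nlon = data.shape
    return data.reshape(nt // 12, 12, nlat, nlon)

def calc_annual_cycle_means_sketch(data):
    # Mean annual cycle (12 maps) computed without passing any datetimes:
    # the month is implicit in the second axis of the reshaped array, and
    # averaging over years is a single whole-array operation, no for loop.
    return reshape_monthly_sketch(data).mean(axis=0)

The same pattern (reshape once, then operate on whole axes) is what removes
both the datetime parameters and most of the explicit loops.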
To run the benchmarking script, assuming you have the rcmes directory in your
PYTHONPATH, do:
python benchmark_metrics.py
This will benchmark the functions that were changed between revisions for 10
years of randomly generated data on a 100 x 100 grid. To change the number of
years of data generated for the test, do:
python benchmark_metrics.py nYR
Where 'nYR' is the number of years of data you wish to use for the benchmark.
Finally, you can test to see if the revised functions are consistent with their
previous versions by running it in test mode:
python benchmark_metrics.py -t
This does not cover every possible test case, but in testing so far everything
seems to work fine. Keep in mind, though, that the revised functions are tested
against the versions currently in the repository and not against Jinwon's
upcoming revisions, so if a previously used function was wrong, then so is the
revised one.
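For anyone curious how a script like this can be organized, here is a generic
sketch of the pattern, not the attached benchmark_metrics.py itself;
bias_loop/bias_vectorized are stand-ins for the real old/new metrics functions.

import time
import numpy as np

def bias_loop(model, obs):
    # Old-style mean bias: an explicit loop over time steps.
    total = np.zeros(model.shape[1:])
    for t in range(model.shape[0]):
        total += model[t] - obs[t]
    return total / model.shape[0]

def bias_vectorized(model, obs):
    # Revised mean bias: one whole-array operation.
    return (model - obs).mean(axis=0)

def benchmark(label, func, model, obs, repeats=3):
    # Time a function over a few runs and report the fastest wall-clock time.
    times = []
    for _ in range(repeats):
        start = time.time()
        func(model, obs)
        times.append(time.time() - start)
    print("%s: %.4f s" % (label, min(times)))

def check_consistency(old_func, new_func, model, obs):
    # The '-t' idea: do the old and revised versions agree numerically?
    return np.allclose(old_func(model, obs), new_func(model, obs))

if __name__ == "__main__":
    n_years = 10
    shape = (12 * n_years, 100, 100)   # monthly data on a 100 x 100 grid
    model = np.random.rand(*shape)
    obs = np.random.rand(*shape)

    benchmark("bias (loop)", bias_loop, model, obs)
    benchmark("bias (vectorized)", bias_vectorized, model, obs)
    print("consistent: %s" % check_consistency(bias_loop, bias_vectorized,
                                               model, obs))

The attached script applies the same timing and consistency-checking pattern to
masked arrays and the actual metrics.py functions.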
Diffs
-----
http://svn.apache.org/repos/asf/incubator/climate/trunk/rcmet/src/main/python/rcmes/toolkit/metrics.py 1492816
Diff: https://reviews.apache.org/r/11873/diff/
Testing
-------
-Randomly generated masked arrays via the attached benchmark_metrics.py
-Some of the TRMM data
Thanks,
Alex Goodman