Hey Alex,
On 06/17/2013 10:00 PM, Cameron Goodale wrote:
On June 17, 2013, 6:48 p.m., Alex Goodman wrote:
Kyo and I went over this together and there are a few issues that need to be
considered before committing this:
Thanks very much for bringing all of this to the list!
1) Although my version of calcTemporalCorrelation is faster, it is also longer
and harder to understand. It also has the disadvantage of not relying directly
on scipy for the calculation, so we may lose consistency with scipy after
future updates.
2) Jinwon mentioned in an e-mail that several of these functions do not work
for daily data, and this is true. It is not a huge concern now because our
current workflow primarily consists of monthly data, but it is a disadvantage
nonetheless. Luckily, a while back I wrote a function that does something
similar to reshapeMonthlyData() but for daily data, though I will still need
time to integrate and test it.
Thus, until these are fully taken care of, we think that we should either
define separate versions of these functions inside metrics.py (e.g.
calcClimSeasonOptimized()) or add a new module in rcmes/toolkit called
something like 'metrics2.py' containing all of my changes, while keeping the
older functions in metrics.py until support for daily data is added. I
understand that this would add some clutter to the existing codebase, so we
would like to hear what everyone else thinks.
What are your thoughts?
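Before getting to the metrics2.py question: for anyone on the list who hasn't
read the diff, the tradeoff in point 1 is roughly the difference between the
two approaches sketched below. The names, shapes, and details are made up for
illustration; this is not the actual calcTemporalCorrelation code.

import numpy as np
from scipy import stats

def temporal_corr_scipy(model, obs):
    # model, obs: arrays of shape (time, lat, lon).
    # Short and easy to verify, but the explicit Python loops over every
    # grid point make it slow on large grids.
    _, nlat, nlon = model.shape
    corr = np.zeros((nlat, nlon))
    for i in range(nlat):
        for j in range(nlon):
            corr[i, j] = stats.pearsonr(model[:, i, j], obs[:, i, j])[0]
    return corr

def temporal_corr_vectorized(model, obs):
    # The same Pearson correlation computed with whole-array operations
    # along the time axis: faster, but longer to read and no longer
    # deferring to scipy's implementation.
    m_anom = model - model.mean(axis=0)
    o_anom = obs - obs.mean(axis=0)
    cov = (m_anom * o_anom).sum(axis=0)
    denom = np.sqrt((m_anom ** 2).sum(axis=0) * (o_anom ** 2).sum(axis=0))
    return cov / denom

Both give the same answer on full (unmasked) data; the question is just which
cost, readability or speed, matters more here.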
Alex,
I don't like the idea of fragmenting the codebase by appending the number 2 to
file names, so I vote against creating metrics2.py.
I strongly agree with Cameron on this. I understand your thinking, but
my worry is that implementing arbitrary version numbering for individual
files will sow unnecessary confusion down the road.
It sounds like this review has grown too large and has started to shift away
from its original intent. Since the task has grown in scope, it makes sense to
break it down into smaller, more manageable tasks. You can open one or more
JIRA issues to track the additional changes that are not yet ready and still
need time to engineer and test. That would let you scale this review and its
JIRA issue back to a scope that you can commit now.
This sounds like a very sensible approach. Incrementally tracking
progress on tasks is likely easier for you as the developer, and it also
lets the community provide more focused feedback. One thing to consider here
is that JIRA allows sub-tasks (under the "more actions" tab for an issue). If
you'd like to tie more focused issues to
CLIMATE-88's broader thrust of "performance improvements for
metrics.py", sub-tasks would be an excellent way to show the
relationship explicitly. If you're not sure how to do it, go ahead and
open up the individual issues anyway, and we can group and classify them
under the umbrella of CLIMATE-88 after the fact.
Thanks again for taking the time to explain things on-list, and for your
excellent work on the metrics!
Best,
Andrew.
For a less abstract example: if you have 3 functions out of, say, 20 that can
be committed, then scale the JIRA scope down to just those 3 functions, update
the review, and commit them. Then open more JIRA issues to track the changes
needed to support the remaining 17 functions you want to change. Whether you
need to open just 1 more JIRA issue, or 4, or 17, is really up to you.
The CLIMATE-88 JIRA issue is to improve metrics performance, which is a massive
and rather ambiguous undertaking. So my thought is to scale that issue back,
focus it down to a patch/change you can commit now, and push the rest of the
code out as future work.
My $0.02
- Cameron
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11873/#review22008
-----------------------------------------------------------
On June 15, 2013, 3:38 a.m., Alex Goodman wrote:
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11873/
-----------------------------------------------------------
(Updated June 15, 2013, 3:38 a.m.)
Review request for Apache Open Climate, Cameron Goodale and Kyo Lee.
Description
-------
This is a major update to metrics as per CLIMATE-88. There are many changes,
most of them related to vectorizing many of the functions using various
indexing tricks. If you are unfamiliar with this terminology, it basically
means that operations are performed on an entire array chunk at once,
eliminating the need for explicit for loops and improving performance and code
readability in most cases. I'll summarize some of the resulting changes here
(a small sketch illustrating the idea follows the list below):
-For the most part, absolute performance increases were not that large, but the
code became significantly more concise. A few of the functions (particularly
the correlation calculation) showed very large gains. You can run the attached,
updated benchmark_metrics.py script to see for yourself (see the end of this
post for more details).
-Thanks to the addition of the reshapeMonthlyData() helper function in my
previous patch to misc.py, the explicit use of datetimes as a parameter for
many of the functions is no longer needed.
-The names of some variables were changed to adhere to our current coding
conventions.
-Functions that were commented out have been removed, in compliance with our
deprecation policy.
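For illustration, here is a minimal sketch of the reshaping-plus-vectorization
idea described above. The function names and the assumption that the record
starts in January and spans whole years are placeholders; this is not the
actual misc.py/metrics.py code.

import numpy as np

def reshape_monthly_sketch(data):
    # Reorganize a (time, lat, lon) monthly series into (years, 12, lat, lon).
    # Assumes the record starts in January and contains only whole years;
    # the real reshapeMonthlyData() may be more general than this.
    nt, nlat, nlon = data.shape
    return data.reshape(nt // 12, 12, nlat, nlon)

def calc_annual_cycle_means_sketch(data):
    # Mean annual cycle (12 maps) computed without passing any datetimes:
    # the month is implicit in the second axis of the reshaped array, and
    # averaging over years is a single whole-array operation, no for loop.
    return reshape_monthly_sketch(data).mean(axis=0)

The same pattern (reshape once, then operate on whole axes) is what removes
both the datetime parameters and most of the explicit loops.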
To run the benchmarking script, assuming you have the rcmes directory in your
PYTHONPATH, do:
python benchmark_metrics.py
This will benchmark the functions that were changed between revisions for 10
years of randomly generated data on a 100 x 100 grid. To change the number of
years of data generated for the test, do:
python benchmark_metrics.py nYR
Where 'nYR' is the number of years of data you wish to use for the benchmark.
Finally, you can test to see if the revised functions are consistent with their
previous versions by running it in test mode:
python benchmark_metrics.py -t
This does not cover every possible test case, but in testing so far everything
seems to work fine. Keep in mind, though, that the revised functions are tested
against the versions currently in the repository and not against Jinwon's
upcoming revisions, so if a previously used function was wrong, then so is the
revised one.
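For anyone curious how a script like this can be organized, here is a generic
sketch of the pattern, not the attached benchmark_metrics.py itself;
bias_loop/bias_vectorized are stand-ins for the real old/new metrics functions.

import time
import numpy as np

def bias_loop(model, obs):
    # Old-style mean bias: an explicit loop over time steps.
    total = np.zeros(model.shape[1:])
    for t in range(model.shape[0]):
        total += model[t] - obs[t]
    return total / model.shape[0]

def bias_vectorized(model, obs):
    # Revised mean bias: one whole-array operation.
    return (model - obs).mean(axis=0)

def benchmark(label, func, model, obs, repeats=3):
    # Time a function over a few runs and report the fastest wall-clock time.
    times = []
    for _ in range(repeats):
        start = time.time()
        func(model, obs)
        times.append(time.time() - start)
    print("%s: %.4f s" % (label, min(times)))

def check_consistency(old_func, new_func, model, obs):
    # The '-t' idea: do the old and revised versions agree numerically?
    return np.allclose(old_func(model, obs), new_func(model, obs))

if __name__ == "__main__":
    n_years = 10
    shape = (12 * n_years, 100, 100)   # monthly data on a 100 x 100 grid
    model = np.random.rand(*shape)
    obs = np.random.rand(*shape)

    benchmark("bias (loop)", bias_loop, model, obs)
    benchmark("bias (vectorized)", bias_vectorized, model, obs)
    print("consistent: %s" % check_consistency(bias_loop, bias_vectorized,
                                               model, obs))

The attached script applies the same timing and consistency-checking pattern to
masked arrays and the actual metrics.py functions.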
Diffs
-----
http://svn.apache.org/repos/asf/incubator/climate/trunk/rcmet/src/main/python/rcmes/toolkit/metrics.py 1492816
Diff: https://reviews.apache.org/r/11873/diff/
Testing
-------
-Randomly generated masked arrays via the attached benchmark_metrics.py
-Some of the TRMM data
Thanks,
Alex Goodman