[Wikimedia-l] Multivariate Fundraising Tests (Re: compromise?)

2012-12-28 Thread Matthew Walker
James,

On Fri, Dec 28, 2012 at 2:11 PM, James Salsman jsals...@gmail.com wrote:

 I mean as in the tests done May 16, September 20, and October 9
 reported at
 http://meta.wikimedia.org/wiki/Fundraising_2012/We_Need_A_Breakthrough
 without adjusting the best performing pull-down delivery combined
 banner/landing page from the beginning of this month


I obviously cannot speak for what Zack will end up doing but let's talk
shop for a moment on how this would be implemented.

The tests you indicated pit banner, landing page impressions, and donation
amount against each other. It appears that everyone saw a collection of
random banners (i.e., the test was not bucketed). Are these the same
variables you want to test?
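
(For anyone following along, by "bucketed" I mean roughly the following --
a hypothetical Python sketch, not the actual CentralNotice code -- where a
reader is hashed into one stable group instead of getting a fresh random
banner on every pageview:)

    import hashlib
    import random

    BANNERS = ["banner_A", "banner_B", "banner_C"]   # placeholder names

    def unbucketed_banner():
        # What the earlier tests appear to have done: a fresh random
        # banner on every impression, independent of who the reader is.
        return random.choice(BANNERS)

    def bucketed_banner(reader_token):
        # Hash a stable per-reader token (e.g. a cookie value) so the
        # same reader always lands in the same banner group for the
        # whole test.
        digest = hashlib.sha1(reader_token.encode("utf-8")).hexdigest()
        return BANNERS[int(digest, 16) % len(BANNERS)]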

Regardless of the answer to the above: how do you propose we normalize our
tests across time-of-day, day-of-week, and day-of-month factors? We've
seen evidence that these all play a role. I don't know how many banner
variations we actually have to test, but it's likely we won't be able to
test them all at the same time (in fact, with the current weighting setup
we can only test 30 banners at a time). Do we just take each group as it
stands -- find the best performers in the group and then test the winners
against each other?
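
Concretely, that group-then-finals approach might look something like the
rough Python sketch below (invented numbers; the real inputs would come
from the impression and donation logs):

    def donation_rate(stats):
        return stats["donations"] / float(stats["impressions"])

    def best_of_group(group_stats, top_k=2):
        # Rank one group of banners by donations per impression and
        # keep the top performers for a final head-to-head round.
        ranked = sorted(group_stats, reverse=True,
                        key=lambda b: donation_rate(group_stats[b]))
        return ranked[:top_k]

    groups = [
        {"A1": {"impressions": 3000, "donations": 12},
         "A2": {"impressions": 3000, "donations": 19}},
        {"B1": {"impressions": 3000, "donations": 15},
         "B2": {"impressions": 3000, "donations": 9}},
    ]
    finalists = [b for g in groups for b in best_of_group(g, top_k=1)]
    # finalists ("A2" and "B1" here) would then be re-run against each
    # other in a single final test.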

An additional consideration is that we have four buckets to play with;
buckets are independent, so we could potentially test 120 banners at a
time across four different groups. Presumably, if we did this, we would
want a couple of control banners in each bucket to normalize against?
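
If we did use control banners, the normalization I would picture is
roughly: express each test banner as a lift over the controls measured in
the same bucket over the same time window, so bucket and time-of-day
effects cancel out. Another toy sketch with made-up numbers:

    def normalized_lift(banner, controls):
        # banner and each control are (donations, impressions) pairs
        # collected in the *same* bucket over the *same* time window.
        b_donations, b_impressions = banner
        c_donations = sum(d for d, i in controls)
        c_impressions = sum(i for d, i in controls)
        banner_rate = b_donations / float(b_impressions)
        control_rate = c_donations / float(c_impressions)
        return banner_rate / control_rate

    # A test banner at 0.50% vs. pooled controls at 0.40% in the same
    # bucket and window:
    print(normalized_lift((15, 3000), [(10, 2500), (12, 3000)]))  # ~1.25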

Another thing to consider is how long we would have to run these tests to
reach statistical significance -- at least a day, I'm guessing. Are we
going to account for banner fatigue at all? I.e., show banners only during
a reader's first 10 visits, as we just did with this most recent campaign?
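
On the how-long question, a back-of-the-envelope answer falls out of the
standard two-proportion sample-size formula. The baseline donation rate,
detectable lift, and daily traffic below are assumptions for illustration,
not real figures:

    from math import ceil, sqrt

    def impressions_needed(p_base, rel_lift, z_alpha=1.96, z_beta=0.84):
        # Impressions per banner to detect a relative lift over p_base
        # at ~95% confidence and ~80% power (normal approximation).
        p_test = p_base * (1 + rel_lift)
        p_bar = (p_base + p_test) / 2
        a = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        b = z_beta * sqrt(p_base * (1 - p_base) + p_test * (1 - p_test))
        return ceil((a + b) ** 2 / (p_test - p_base) ** 2)

    n = impressions_needed(p_base=0.004, rel_lift=0.20)  # spot a 20% lift
    print(n)            # roughly 107,000 impressions per banner
    print(n / 50000.0)  # ~2 days if a banner gets ~50,000 impressions/day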

-- 
~Matt Walker
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] Multivariate Fundraising Tests (Re: compromise?)

2012-12-28 Thread James Salsman
Matt, I have specific answers to most of your questions, but I don't
know whether others on wikimedia-l would be interested in them, and
I'm not sure about the specifics of a couple terms you used relative
to what I remember of the testing harness, so I'll reply in more
detail off-list with some questions about the terms over the weekend.

For now, I think the banner text message has always been the most
important part of any appeal. If you were to take all 300 of the existing
volunteer submissions (and accept more -- e.g. "How much you donate may
help determine how much we pay our programmers" would be incredibly
effective, and I hope you will measure it) and include all of them,
without any javascript, pull-down, landing page, or other changes, over a
one-week period with about 3,000 impressions each at random times of day
and days of the week, you would have plenty to work with. That's about a
million impressions, or a 0.3% impressions test, which I believe will give
you well over 95% confidence in the results.
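
For scale, the arithmetic behind those numbers (the 0.4% donation rate
below is purely an assumed illustration, not a measured figure):

    banners = 300
    impressions_each = 3000
    total = banners * impressions_each      # 900,000 -- about a million
    implied_weekly_traffic = total / 0.003  # ~300 million impressions

    # Per-banner margin of error at an assumed 0.4% donation rate:
    p, n = 0.004, impressions_each
    se = (p * (1 - p) / n) ** 0.5
    print(total, implied_weekly_traffic)
    print(p - 1.96 * se, p + 1.96 * se)     # rough 95% interval per banner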

That would not account for banner fatigue, which may be significant at
every scale from timezone to timezone up to year to year, but I have no
idea how to account for that other than to run a multivariate test
shortly before beginning fundraising in earnest.

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l