Thanks. That makes sense. Using map/reduce was as much a curiosity as a practical requirement. Another way to monitor accuracy is to watch my progress indicator and see how close it comes to the real conversion time.
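[A minimal sketch of that accuracy check, in plain JavaScript: log the predicted versus actual duration of each finished conversion and keep a running mean relative error. Every name here is hypothetical.]

    // Hypothetical accuracy monitor: compare the predicted conversion
    // time against the measured one for each finished job.
    var relErrors = [];

    function recordFinishedJob(predictedSecs, actualSecs) {
      relErrors.push(Math.abs(predictedSecs - actualSecs) / actualSecs);
      var mean = relErrors.reduce(function (a, e) { return a + e; }, 0)
                 / relErrors.length;
      console.log('mean relative error over ' + relErrors.length +
                  ' jobs: ' + (mean * 100).toFixed(1) + '%');
    }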
On Thu, Nov 28, 2013 at 10:50 AM, Nitin Borwankar <ni...@borwankar.com> wrote:

> Hi Mark,
>
> It may not be worth it to have a real-time estimate of the coefficients A,
> B if the variance is very small.
>
> In other words, if your collection of past videos covers most of the
> different kinds of videos you are likely to encounter, then estimates of
> A, B are likely to be pretty robust and not change much with future new
> samples. Using older A, B is then not likely to throw your conversion-time
> predictions off by much.
>
> If the next sample that comes along is not likely to change the values of
> A, B much, you might as well update much less frequently - daily, weekly,
> whatever - via cron or batch updates.
>
> How does one determine this? Here's a "back of the envelope", "seat of
> the pants" experiment.
>
> First, after calculating A, B using all the video conversion times, I
> would do a second series of calculations.
>
> Here I would start with, say, the first 20 videos, or some number equal
> to roughly 30-50% of your videos, and calculate A, B. Then keep adding
> the next 5% and repeat the calculation of A, B.
>
> Do this until you use up all the samples, but at the last step just add
> one video at a time for the last 10 videos while doing the calcs.
>
> Now look at A, B for each calculation. Do they settle down to be close to
> a "mean" A and "mean" B? What is the variance around the mean A, B? If
> this is small or very small, then re-computing every time is "really cool
> and all" but not worth it computationally.
>
> What is meant by "small" here? Well, take two successive estimates of A,
> B. Do a prediction using A1, B1 and then A2, B2: how much are you off by
> if you use the older estimate? If A, B don't vary much then your
> prediction won't vary much, and you could use a stale estimate without
> noticeable impact on your prediction. Noticeable = say, off by more than
> 10% in prediction accuracy.
>
> Then just update A, B every day or week.
>
> Bottom line: before you do a "real time" update of parameters, do a "back
> of the envelope" experiment to see if it's worth it for the complexity
> and point-of-failure it adds.
>
> Happy to chat offlist and/or offline if you want - am nborwankar on the
> google email system.
>
> Nitin
>
> ------------------------------------------------------------------
> Nitin Borwankar
> nborwan...@gmail.com
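[A minimal sketch of the experiment Nitin describes, as a standalone Node.js script. It assumes the history has been exported from CouchDB as an array of { runLen, fileSize, convTime } objects; those field names are assumptions, not the actual schema.]

    // Least-squares fit of convTime ~ A*runLen + B*fileSize (no intercept).
    // Normal equations:  A*Srr + B*Srs = Srt
    //                    A*Srs + B*Sss = Sst
    function fitAB(samples) {
      var srr = 0, sss = 0, srs = 0, srt = 0, sst = 0;
      samples.forEach(function (v) {
        srr += v.runLen * v.runLen;
        sss += v.fileSize * v.fileSize;
        srs += v.runLen * v.fileSize;
        srt += v.runLen * v.convTime;
        sst += v.fileSize * v.convTime;
      });
      var det = srr * sss - srs * srs;
      return { A: (srt * sss - srs * sst) / det,
               B: (srr * sst - srs * srt) / det };
    }

    // Refit on a growing prefix of the history: start at ~40% of the
    // samples, grow in 5% steps, then one at a time for the last 10.
    function stabilityTrace(samples) {
      var n = samples.length, step = Math.max(1, Math.round(n * 0.05));
      var trace = [], k = Math.max(2, Math.round(n * 0.4));
      for (; k < n - 10; k += step) trace.push(fitAB(samples.slice(0, k)));
      for (k = Math.max(k, n - 10); k <= n; k++)
        trace.push(fitAB(samples.slice(0, k)));
      return trace;
    }

    // Worst-case prediction drift between successive estimates, for a
    // typical "probe" video: { runLen: ..., fileSize: ... }.
    function maxDrift(trace, probe) {
      var worst = 0;
      for (var i = 1; i < trace.length; i++) {
        var p0 = trace[i - 1].A * probe.runLen + trace[i - 1].B * probe.fileSize;
        var p1 = trace[i].A * probe.runLen + trace[i].B * probe.fileSize;
        worst = Math.max(worst, Math.abs(p1 - p0) / p1);
      }
      return worst;
    }

[If maxDrift comes back comfortably under the 10% threshold Nitin mentions, a daily or weekly cron refit is enough and the real-time update buys nothing.]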
> On Wed, Nov 27, 2013 at 1:13 PM, Mark Hahn <m...@reevuit.com> wrote:
>
> > I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my
> > problem here. Consider it a holiday mind exercise while avoiding
> > relatives.
> >
> > I send customer-uploaded videos to Amazon Elastic Transcoder to
> > generate a video for HTML5 consumption. A conversion takes anywhere
> > from a few seconds up to tens of minutes. I have no way to track
> > progress, so I estimate the time to complete and show a fake progress
> > indicator. I have been using the run-time of the video and this is not
> > working well at all. High bit-rate (big file) videos fare the worst.
> >
> > I'm guessing there are two main parameters for estimating the
> > conversion time: the file size and the run-time. The file size is a
> > good predictor of input processing and the run-time is a good
> > predictor of output processing. Amazon has been pretty consistent in
> > their conversion times in the short run.
> >
> > I have tons of data in my couchdb from previous conversions. I want to
> > do regression analysis of these past runs to calculate parameters for
> > estimation. I know the file size, run-time, and conversion time for
> > each.
> >
> > I will use runLen * A + fileSize * B as the estimation formula. A and
> > B will be calculated by fitting runLen * A + fileSize * B = convTime
> > to the samples. It would be nice to use a map/reduce to always have
> > the latest estimate of A and B, if possible.
> >
> > My first thought was to just find the average of each of the three
> > input vars and solve for A and B using those averages. However, I'm
> > pretty sure this would yield the wrong result, because each
> > three-value sample needs to be used independently (not sure).
> >
> > So I would like to have each map take one conversion sample and do the
> > regression in the reduce. Can someone give me pointers on how to do
> > this?
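[For what it's worth, here is one way that map/reduce could look in CouchDB, sketched under assumptions: doc.type and the field names are guesses at the schema. The trick is that the regression itself doesn't have to run in the reduce; the reduce only has to maintain the five running sums of the normal equations, which is a component-wise addition and therefore commutative and associative, so rereduce works unchanged. The client then solves the 2x2 system.]

    // Map: emit the five cross-product terms for one conversion sample.
    function (doc) {
      if (doc.type === 'conversion' && doc.convTime) {
        var r = doc.runLen, s = doc.fileSize, t = doc.convTime;
        emit(null, [r * r, s * s, r * s, r * t, s * t]);
      }
    }

    // Reduce: component-wise sum. Sums of sums are still sums, so the
    // same body handles the rereduce case.
    function (keys, values, rereduce) {
      var acc = [0, 0, 0, 0, 0];
      for (var i = 0; i < values.length; i++)
        for (var j = 0; j < 5; j++) acc[j] += values[i][j];
      return acc;
    }

    // Client side, after GETting the view with reduce=true; "row" is
    // the single reduced row, S = [Srr, Sss, Srs, Srt, Sst].
    //   A*Srr + B*Srs = Srt
    //   A*Srs + B*Sss = Sst
    var S = row.value;
    var det = S[0] * S[1] - S[2] * S[2];
    var A = (S[3] * S[1] - S[2] * S[4]) / det;
    var B = (S[0] * S[4] - S[2] * S[3]) / det;

[This also answers the worry about averaging: averaging the three inputs collapses everything into one equation with two unknowns, which is underdetermined, whereas the five sums above preserve each sample's contribution to the full least-squares system.]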