Hello Doug, others,

Gordon pointed me to the design document. This is very interesting. A few comments below.

1. Although the design document talks about passing training data to the LoadBalancer object, glancing briefly at the pseudocode for the class definition, it's not clear to me what the API for passing the training data is, or what the format would be.

2. I'm assuming that the LoadBalancer has full access to the items in the data table itself during the balance operation, in case it wants to collect data about them for use in the prediction. Not clear if this would be useful because of the cost involved in gathering this data, but interesting to explore.

3. I would expect that practical implementations of LoadBalancer would want a way to serialise their nontrivial state (presumably using HT itself), but I'm not sure if there's any special API support required for that. (Maybe a reserved table for LB data?)

4. It may be worth providing a convenience implementation of LoadBalancer that works in the batch setting like the basic algorithm, i.e., a superclass for load balancers that want to operate once a day based on data that has been collected in the last 24 hours.

5. A LoadBalancer might want to use different strategies for the cases of adding a new range server versus high variance among servers. Is there a way for the master to signal which of these situations is the case?

6. Of course an effective challenge problem would also require a test workload that is challenging enough to be representative of real usages of the load balancer. As close to real usage as possible would be best, to try to forestall the danger of designing ML algorithms that are strong enough to learn the features of the synthetic problem generator but not those of real data.

7. It is unclear what the optimal granularity for aggregating the range counts would be (could be less than 30 sec, or more). Might want to have this be a settable parameter of the master.
Note that this is orthogonal to how often the master decides to send data to the load balancer, e.g., the master could send data every 30 sec consisting of 6 bins of counts, each recorded over 5 sec.

8. Wrt the objective functions, there are different performance metrics that are of interest to the user, and the user might want to have knobs to say (e.g.) exactly what SLA they would like satisfied. But it's not clear to me whether this is part of load balancing (i.e., deciding which ranges are served by which server) or auto-scaling (deciding how many servers to have). It may be too early to lock down an API on this without having more experience with practical SML/Optimization load balancers.

Best wishes
Charles

--
Charles Sutton * [email protected] * http://homepages.inf.ed.ac.uk/csutton
Lecturer * School of Informatics * University of Edinburgh
