Hi, Sean. I don't think sampling from a grid is a good idea as the min/max may lie between grid points. Unconstrained random sampling avoids this problem. To this end, I have an implementation at:
https://github.com/apache/spark/compare/master...PhillHenry:master

It is unit tested and does not change any already existing code.

Totally get what you mean about Hyperopt but this is a pure JVM solution that's fairly straightforward.

Is it worth contributing?

Thanks,

Phillip

On Sat, Jan 30, 2021 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:

> I was thinking ParamGridBuilder would have to change to accommodate a
> continuous range of values, and that's not hard, though other code wouldn't
> understand that type of value, like the existing simple grid builder.
> It's all possible just wondering if simply randomly sampling the grid is
> enough. That would be a simpler change, just a new method or argument.
>
> Yes part of it is that if you really want to search continuous spaces,
> hyperopt is probably even better, so how much do you want to put into
> Pyspark - something really simple sure.
> Not out of the question to do something more complex if it turns out to
> also be pretty simple.
>
> On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> Hi, Sean.
>>
>> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
>> Array[ParamMap]. What I am proposing is a new class that also builds an
>> Array[ParamMap] via its build() method, so there would be no "change in the
>> APIs". This new class would, of course, have methods that defined the
>> search space (log, linear, etc) over which random values were chosen.
>>
>> Now, if this is too trivial to warrant the work and people prefer
>> Hyperopt, then so be it. It might be useful for people not using Python but
>> they can just roll-their-own, I guess.
>>
>> Anyway, looking forward to hearing what you think.
>>
>> Regards,
>>
>> Phillip
>>
>> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I think that's a bit orthogonal - right now you can't specify continuous
>>> spaces. The straightforward thing is to allow random sampling from a big
>>> grid. You can create a geometric series of values to try, of course -
>>> 0.001, 0.01, 0.1, etc.
>>> Yes I get that if you're randomly choosing, you can randomly choose from
>>> a continuous space of many kinds. I don't know if it helps a lot vs the
>>> change in APIs (and continuous spaces don't make as much sense for grid
>>> search)
>>> Of course it helps a lot if you're doing a smarter search over the
>>> space, like what hyperopt does. For that, I mean, one can just use
>>> hyperopt + Spark ML already if desired.
>>>
>>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry <londonjava...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, Sean! I hope to offer a PR next week.
>>>>
>>>> Not sure about a dependency on the grid search, though - but happy to
>>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>>> evenly. For example, something like "please search 1e-7 to 1e-4" leads to
>>>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>>>> evenly spaced in logarithmic space but not in linear space. So, saying what
>>>> fraction of a grid search to sample wouldn't make sense (unless the grid
>>>> was warped, of course).
>>>>
>>>> Does that make sense? It might be better for me to just write the code
>>>> as I don't think it would be very complicated.
>>>>
>>>> Happy to hear your thoughts.
>>>>
>>>> Phillip
>>>>
>>>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I don't know of anyone working on that. Yes I think it could be
>>>>> useful. I think it might be easiest to implement by simply having some
>>>>> parameter to the grid search process that says what fraction of all
>>>>> possible combinations you want to randomly test.
>>>>>
>>>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry <londonjava...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have no work at the moment so I was wondering if anybody would be
>>>>>> interested in me contributing code that generates an Array[ParamMap] for
>>>>>> random hyperparameters?
>>>>>>
>>>>>> Apparently, this technique can find a hyperparameter in the top 5% of
>>>>>> parameter space in fewer than 60 iterations with 95% confidence [1].
>>>>>>
>>>>>> I notice that the Spark code base has only the brute force
>>>>>> ParamGridBuilder unless I am missing something.
>>>>>>
>>>>>> Hyperparameter optimization is an area of interest to me but I don't
>>>>>> want to re-invent the wheel. So, if this work is already underway or there
>>>>>> are libraries out there to do it please let me know and I'll shut up :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>> [1]
>>>>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
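
For readers following the thread, here is a rough sketch of the two ideas discussed above: the "fewer than 60 iterations" figure, and sampling evenly in logarithmic rather than linear space. It is not the implementation linked at the top of the thread and it does not use any Spark API. Plain Python dicts stand in for ParamMap, and the function names (min_trials, log_uniform, random_param_maps) and the (low, high, scale) tuple format are made up for illustration. The iteration count assumes independent, uniformly random draws, in which case the chance that at least one of n draws lands in the top 5% is 1 - 0.95^n, and that first reaches 95% at n = 59.

```python
import math
import random


def min_trials(top_fraction=0.05, confidence=0.95):
    """Smallest n with 1 - (1 - top_fraction)**n >= confidence.

    Assumes each trial is an independent uniform draw from the search space.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_param_maps(space, n, seed=None):
    """Build n random parameter maps (dicts standing in for ParamMap).

    `space` maps a parameter name to (low, high, scale), where scale is
    "log" or "linear". This tuple format is hypothetical, not a Spark API.
    """
    rng = random.Random(seed)
    maps = []
    for _ in range(n):
        params = {}
        for name, (low, high, scale) in space.items():
            if scale == "log":
                params[name] = log_uniform(rng, low, high)
            elif scale == "linear":
                params[name] = rng.uniform(low, high)
            else:
                raise ValueError(f"unknown scale for {name}: {scale}")
        maps.append(params)
    return maps


if __name__ == "__main__":
    # Parameter names are only examples of the kind of values one might tune.
    space = {
        "regParam": (1e-7, 1e-4, "log"),
        "elasticNetParam": (0.0, 1.0, "linear"),
    }
    for params in random_param_maps(space, n=min_trials(), seed=42)[:3]:
        print(params)
```

Because the log-scaled values are drawn from the continuous range, they are not tied to any predefined grid points, which is the concern raised about sampling a fraction of an existing grid.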