Hi, Sean.

I don't think sampling from a grid is a good idea, as the min/max may lie
between grid points. Unconstrained random sampling avoids this problem. To
this end, I have an implementation at:

https://github.com/apache/spark/compare/master...PhillHenry:master

It is unit tested and does not change any existing code.

Totally get what you mean about Hyperopt, but this is a pure JVM solution
that's fairly straightforward.
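
To give a flavour, here is a minimal sketch of how such an Array[ParamMap]
plugs into the existing tuning machinery (names and exact API here are
illustrative and may differ from what's in the branch):

import scala.util.Random
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidator

val lr  = new LogisticRegression()
val rng = new Random(42L)

// Sample uniformly in log10 space between lo and hi.
def logUniform(lo: Double, hi: Double): Double =
  math.pow(10, math.log10(lo) + rng.nextDouble() * (math.log10(hi) - math.log10(lo)))

// Draw 60 random ParamMaps: regParam log-uniform in [1e-7, 1e-1],
// elasticNetParam uniform in [0, 1].
val paramMaps: Array[ParamMap] = Array.fill(60) {
  ParamMap(lr.regParam -> logUniform(1e-7, 1e-1),
           lr.elasticNetParam -> rng.nextDouble())
}

// An Array[ParamMap] is all CrossValidator needs, so nothing else changes:
new CrossValidator().setEstimatorParamMaps(paramMaps)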

Is it worth contributing?

Thanks,

Phillip

On Sat, Jan 30, 2021 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:

> I was thinking ParamGridBuilder would have to change to accommodate a
> continuous range of values, and that's not hard, though other code wouldn't
> understand that type of value, like the existing simple grid builder.
> It's all possible; I'm just wondering if simply randomly sampling the grid is
> enough. That would be a simpler change, just a new method or argument.
>
> Yes, part of it is that if you really want to search continuous spaces,
> hyperopt is probably even better, so how much do you want to put into
> PySpark? Something really simple, sure.
> It's not out of the question to do something more complex if it turns out to
> also be pretty simple.
>
> On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> Hi, Sean.
>>
>> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
>> Array[ParamMap]. What I am proposing is a new class that also builds an
>> Array[ParamMap] via its build() method, so there would be no "change in the
>> APIs". This new class would, of course, have methods that defined the
>> search space (log, linear, etc) over which random values were chosen.
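>>
>> Roughly this shape (purely a sketch; the names are illustrative):
>>
>> import scala.util.Random
>> import org.apache.spark.ml.param.{DoubleParam, ParamMap, ParamPair}
>>
>> class RandomParamBuilder(seed: Long = 0L) {
>>   private val rng = new Random(seed)
>>   private var draws: List[() => ParamPair[_]] = Nil
>>
>>   // Sample uniformly in [lo, hi].
>>   def linear(p: DoubleParam, lo: Double, hi: Double): this.type = {
>>     draws ::= (() => p -> (lo + rng.nextDouble() * (hi - lo)))
>>     this
>>   }
>>
>>   // Sample uniformly in log10 space, e.g. log(p, 1e-7, 1e-4).
>>   def log(p: DoubleParam, lo: Double, hi: Double): this.type = {
>>     val (l, h) = (math.log10(lo), math.log10(hi))
>>     draws ::= (() => p -> math.pow(10, l + rng.nextDouble() * (h - l)))
>>     this
>>   }
>>
>>   // Same return type as ParamGridBuilder.build(), so no API changes.
>>   def build(n: Int): Array[ParamMap] =
>>     Array.fill(n)(ParamMap(draws.map(_.apply()): _*))
>> }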
>>
>> Now, if this is too trivial to warrant the work and people prefer
>> Hyperopt, then so be it. It might be useful for people not using Python, but
>> they can just roll their own, I guess.
>>
>> Anyway, looking forward to hearing what you think.
>>
>> Regards,
>>
>> Phillip
>>
>>
>>
>> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I think that's a bit orthogonal - right now you can't specify continuous
>>> spaces. The straightforward thing is to allow random sampling from a big
>>> grid. You can create a geometric series of values to try, of course -
>>> 0.001, 0.01, 0.1, etc.
>>> Yes, I get that if you're randomly choosing, you can randomly choose from
>>> a continuous space of many kinds. I don't know if it helps a lot vs. the
>>> change in APIs (and continuous spaces don't make as much sense for grid
>>> search).
>>> Of course it helps a lot if you're doing a smarter search over the
>>> space, like what hyperopt does. For that, I mean, one can just use
>>> hyperopt + Spark ML already if desired.
>>>
>>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry <londonjava...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, Sean! I hope to offer a PR next week.
>>>>
>>>> Not sure about a dependency on the grid search, though; happy to
>>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>>> evenly. For example, something like "please search 1e-7 to 1e-4" leads to
>>>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>>>> evenly spaced in logarithmic space but not in linear space. So, saying what
>>>> fraction of a grid search to sample wouldn't make sense (unless the grid
>>>> was warped, of course).
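>>>>
>>>> Concretely, the sampling I have in mind is just "draw the exponent
>>>> uniformly, then exponentiate" - a rough sketch:
>>>>
>>>> val rng = new scala.util.Random()
>>>> // a uniform exponent in [-7, -4] gives a log-uniform value in [1e-7, 1e-4]
>>>> val sample = math.pow(10, -7 + rng.nextDouble() * 3)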
>>>>
>>>> Does that make sense? It might be better for me to just write the code
>>>> as I don't think it would be very complicated.
>>>>
>>>> Happy to hear your thoughts.
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I don't know of anyone working on that. Yes, I think it could be
>>>>> useful. It might be easiest to implement by simply having some
>>>>> parameter to the grid search process that says what fraction of all
>>>>> possible combinations you want to randomly test.
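>>>>>
>>>>> Something like this, roughly (just a sketch, not an existing API):
>>>>>
>>>>> import scala.util.Random
>>>>> import org.apache.spark.ml.classification.LogisticRegression
>>>>> import org.apache.spark.ml.tuning.ParamGridBuilder
>>>>>
>>>>> val lr = new LogisticRegression()
>>>>> val grid = new ParamGridBuilder()
>>>>>   .addGrid(lr.regParam, Array(1e-4, 1e-3, 1e-2, 1e-1))
>>>>>   .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
>>>>>   .build()
>>>>>
>>>>> // Randomly test 25% of the 12 combinations above.
>>>>> val sampled = Random.shuffle(grid.toSeq).take((grid.length * 0.25).ceil.toInt)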
>>>>>
>>>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry <londonjava...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have no work at the moment, so I was wondering if anybody would be
>>>>>> interested in me contributing code that generates an Array[ParamMap] for
>>>>>> random hyperparameters.
>>>>>>
>>>>>> Apparently, this technique can find a hyperparameter combination in the
>>>>>> top 5% of the parameter space in fewer than 60 iterations with 95%
>>>>>> confidence (the chance that 60 independent draws all miss the top 5% is
>>>>>> 0.95^60 ≈ 5%) [1].
>>>>>>
>>>>>> I notice that the Spark code base has only the brute-force
>>>>>> ParamGridBuilder, unless I am missing something.
>>>>>>
>>>>>> Hyperparameter optimization is an area of interest to me, but I don't
>>>>>> want to re-invent the wheel. So, if this work is already underway or there
>>>>>> are libraries out there to do it, please let me know and I'll shut up :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>> [1]
>>>>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>>>>>
>>>>>
