Re: Hyperparameter Optimization via Randomization

2021-01-29 Thread Sean Owen
I think that's a bit orthogonal - right now you can't specify continuous
spaces. The straightforward thing is to allow random sampling from a big
grid. You can create a geometric series of values to try, of course -
0.001, 0.01, 0.1, etc.
Yes, I get that if you're choosing randomly, you can sample from many kinds
of continuous space. I don't know if it helps enough to justify the
change in APIs (and continuous spaces don't make as much sense for grid
search).
Of course it helps a lot if you're doing a smarter search over the space,
like what hyperopt does. For that, I mean, one can just use hyperopt +
Spark ML already if desired.

On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry 
wrote:

> Thanks, Sean! I hope to offer a PR next week.
>
> Not sure about a dependency on the grid search, though - but happy to hear
> your thoughts. I mean, you might want to explore logarithmic space evenly.
> For example, something like "please search 1e-7 to 1e-4" might yield a
> reasonable random sample such as {3e-7, 2e-6, 9e-5}. These are (roughly)
> evenly spaced in logarithmic space but not in linear space. So, saying what
> fraction of a grid search to sample wouldn't make sense (unless the grid
> was warped, of course).
>
> Does that make sense? It might be better for me to just write the code as
> I don't think it would be very complicated.
>
> Happy to hear your thoughts.
>
> Phillip
>
>
>
> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:
>
>> I don't know of anyone working on that. Yes I think it could be useful. I
>> think it might be easiest to implement by simply having some parameter to
>> the grid search process that says what fraction of all possible
>> combinations you want to randomly test.
>>
>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
>> wrote:
>>
>>> Hi,
>>>
>>> I have no work at the moment so I was wondering if anybody would be
>>> interested in me contributing code that generates an Array[ParamMap] for
>>> random hyperparameters?
>>>
>>> Apparently, this technique can find a hyperparameter in the top 5% of
>>> parameter space in fewer than 60 iterations with 95% confidence [1].
>>>
>>> I notice that the Spark code base has only the brute force
>>> ParamGridBuilder unless I am missing something.
>>>
>>> Hyperparameter optimization is an area of interest to me but I don't
>>> want to re-invent the wheel. So, if this work is already underway or there
>>> are libraries out there to do it please let me know and I'll shut up :)
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>> [1]
>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>>
>>


Re: Public API access to UDTs

2021-01-29 Thread Fitch, Simeon
On Fri, Jan 29, 2021 at 9:46 AM Sean Owen  wrote:

> Are there implications for storing UDTs in particular engines or formats?
>

I've found that UDTs read from and write to Parquet without problems.

They work fine with PySpark given an implementation of mirror classes. Without
properly constructed mirror classes they show up as structs, which isn't a
bad fallback.

However, they do *not* work with Spark's use of Arrow, as they get rejected
here:
https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L75-L76




> Just making it public for developers, even with a 'use at your own risk'
> warning, seems pretty small as a change?
>
> On Thu, Jan 28, 2021 at 5:10 PM Fitch, Simeon  wrote:
>
>> Hi,
>>
>> First time posting here, so apologies if I need to be directing this
>> topic elsewhere.
>>
>> I'm the author of RasterFrames, and a contributor to GeoMesa's Spark SQL
>> module. Both make use of decently low-level Catalyst constructs, including
>> custom UDTs; RasterFrames introduces a geospatial raster type, and GeoMesa
>> a geometry type.
>>
>> In order to make this work we've circumvented the [`package private`](
>> https://bit.ly/3pr0fVv)  restriction on `UDTRegistration` by inserting
>> sibling classes into the package namespace. It's a hack, and works fine
>> with JVM 8, but violates the [much more restrictive](
>> https://bit.ly/3aadO5g) module constructs in JVM 9+.
>>
>> We've been monitoring [SPARK-7768](
>> https://issues.apache.org/jira/browse/SPARK-7768) (filed in 2015) and
>> its [associated PR](https://github.com/apache/spark/pull/16478) for
>> years now, but it keeps getting kicked down the road(map).
>>
>> As authors of open source systems we completely understand how and why
>> this happens, but we are at a critical juncture in our projects' lifecycle,
>> anchored to JVM 8 while other systems have moved on to later versions. We'd
>> also like to enjoy the benefits of later JVMs.
>>
>> So... I'm here to find out how I and others critically needing public
>> access to `UDTRegistration` might better advocate for it?
>>
>> I think (but not 100% sure) the PR linked above is more extensive than
>> what we need, also addressing usability around Encoders, for which we have
>> our own type class solution. My assumption to date has been all we need is
>> line 32 of `UDTRegistration` deleted (if there's folly therein, please say
>> so!). While I understand a reluctance to promote `UDTRegistration` to
>> `public`, I note that it has not been changed since 2016, perhaps a good
>> indicator that the API is stable enough. Marking it as `@Experimental`
>> could be a compromise option.
>>
>> Thanks for reading this far and giving this consideration. Any and all
>> advice is appreciated.
>>
>> Simeon (@metasim)
>>
>>
>> --
>> Simeon Fitch
>> Co-founder & VP of R
>> Astraea, Inc.
>>
>>

-- 
Simeon Fitch
Co-founder & VP of R
Astraea, Inc.


Re: Hyperparameter Optimization via Randomization

2021-01-29 Thread Phillip Henry
Thanks, Sean! I hope to offer a PR next week.

Not sure about a dependency on the grid search, though - but happy to hear
your thoughts. I mean, you might want to explore logarithmic space evenly.
For example, something like "please search 1e-7 to 1e-4" might yield a
reasonable random sample such as {3e-7, 2e-6, 9e-5}. These are (roughly)
evenly spaced in logarithmic space but not in linear space. So, saying what
fraction of a grid search to sample wouldn't make sense (unless the grid
was warped, of course).

Does that make sense? It might be better for me to just write the code as I
don't think it would be very complicated.
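
As a rough illustration of sampling evenly in logarithmic space, here is a minimal sketch in plain Python. It is not Spark code and not the proposed PR; the function name `sample_log_uniform` and its parameters are hypothetical. It draws the exponent uniformly and then exponentiates, so each decade of the range is equally likely.

```python
import math
import random


def sample_log_uniform(low, high, n, seed=None):
    """Draw n values whose base-10 logarithms are uniform between log10(low) and log10(high)."""
    if not 0 < low < high:
        raise ValueError("need 0 < low < high")
    rng = random.Random(seed)  # local generator so a seed gives repeatable draws
    log_low, log_high = math.log10(low), math.log10(high)
    # Sample the exponent uniformly, then map back to the original scale.
    return [10 ** rng.uniform(log_low, log_high) for _ in range(n)]


# Hypothetical usage: three candidate values between 1e-7 and 1e-4.
samples = sample_log_uniform(1e-7, 1e-4, 3, seed=42)
```

A uniform draw over [1e-7, 1e-4] in linear space would instead put about 90% of the samples in the top decade (1e-5 to 1e-4).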

Happy to hear your thoughts.

Phillip



On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:

> I don't know of anyone working on that. Yes I think it could be useful. I
> think it might be easiest to implement by simply having some parameter to
> the grid search process that says what fraction of all possible
> combinations you want to randomly test.
>
> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
> wrote:
>
>> Hi,
>>
>> I have no work at the moment so I was wondering if anybody would be
>> interested in me contributing code that generates an Array[ParamMap] for
>> random hyperparameters?
>>
>> Apparently, this technique can find a hyperparameter in the top 5% of
>> parameter space in fewer than 60 iterations with 95% confidence [1].
>>
>> I notice that the Spark code base has only the brute force
>> ParamGridBuilder unless I am missing something.
>>
>> Hyperparameter optimization is an area of interest to me but I don't want
>> to re-invent the wheel. So, if this work is already underway or there are
>> libraries out there to do it please let me know and I'll shut up :)
>>
>> Regards,
>>
>> Phillip
>>
>> [1]
>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>
>


Re: Public API access to UDTs

2021-01-29 Thread Sean Owen
I'm also interested: are there problems with opening up this API beyond
needing to freeze it and keep it stable? It's pretty stable.
Could it be exposed as @DeveloperApi at least?
Are there implications for storing UDTs in particular engines or formats?
Just making it public for developers, even with a 'use at your own risk'
warning, seems pretty small as a change?

On Thu, Jan 28, 2021 at 5:10 PM Fitch, Simeon  wrote:

> Hi,
>
> First time posting here, so apologies if I need to be directing this topic
> elsewhere.
>
> I'm the author of RasterFrames, and a contributor to GeoMesa's Spark SQL
> module. Both make use of decently low-level Catalyst constructs, including
> custom UDTs; RasterFrames introduces a geospatial raster type, and GeoMesa
> a geometry type.
>
> In order to make this work we've circumvented the [`package private`](
> https://bit.ly/3pr0fVv)  restriction on `UDTRegistration` by inserting
> sibling classes into the package namespace. It's a hack, and works fine
> with JVM 8, but violates the [much more restrictive](
> https://bit.ly/3aadO5g) module constructs in JVM 9+.
>
> We've been monitoring [SPARK-7768](
> https://issues.apache.org/jira/browse/SPARK-7768) (filed in 2015) and
> its [associated PR](https://github.com/apache/spark/pull/16478) for
> years now, but it keeps getting kicked down the road(map).
>
> As authors of open source systems we completely understand how and why
> this happens, but we are at a critical juncture in our projects' lifecycle,
> anchored to JVM 8 while other systems have moved on to later versions. We'd
> also like to enjoy the benefits of later JVMs.
>
> So... I'm here to find out how I and others critically needing public
> access to `UDTRegistration` might better advocate for it?
>
> I think (but not 100% sure) the PR linked above is more extensive than
> what we need, also addressing usability around Encoders, for which we have
> our own type class solution. My assumption to date has been all we need is
> line 32 of `UDTRegistration` deleted (if there's folly therein, please say
> so!). While I understand a reluctance to promote `UDTRegistration` to
> `public`, I note that it has not been changed since 2016, perhaps a good
> indicator that the API is stable enough. Marking it as `@Experimental`
> could be a compromise option.
>
> Thanks for reading this far and giving this consideration. Any and all
> advice is appreciated.
>
> Simeon (@metasim)
>
>
> --
> Simeon Fitch
> Co-founder & VP of R
> Astraea, Inc.
>
>


Re: Hyperparameter Optimization via Randomization

2021-01-29 Thread Sean Owen
I don't know of anyone working on that. Yes I think it could be useful. I
think it might be easiest to implement by simply having some parameter to
the grid search process that says what fraction of all possible
combinations you want to randomly test.
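
A minimal sketch of that idea in plain Python, only to make the suggestion concrete. It does not use Spark's API; `sample_grid` and its `fraction` parameter are hypothetical names, and the parameter names and values in the example grid are illustrative. It enumerates every combination in the grid and keeps a random subset of the requested size.

```python
import itertools
import random


def sample_grid(grid, fraction, seed=None):
    """Return a random subset of the Cartesian product of the grid's value lists."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    names = sorted(grid)  # fixed order so results are repeatable for a given seed
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    k = max(1, round(fraction * len(combos)))  # always test at least one combination
    return random.Random(seed).sample(combos, k)  # sampling without replacement


# Hypothetical usage: 3 x 4 = 12 combinations, of which a quarter are tested.
grid = {"regParam": [0.001, 0.01, 0.1], "maxIter": [10, 50, 100, 200]}
subset = sample_grid(grid, 0.25, seed=0)
```

Enumerating the full product first keeps the sketch simple, but it would not scale to very large grids.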

On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
wrote:

> Hi,
>
> I have no work at the moment so I was wondering if anybody would be
> interested in me contributing code that generates an Array[ParamMap] for
> random hyperparameters?
>
> Apparently, this technique can find a hyperparameter in the top 5% of
> parameter space in fewer than 60 iterations with 95% confidence [1].
>
> I notice that the Spark code base has only the brute force
> ParamGridBuilder unless I am missing something.
>
> Hyperparameter optimization is an area of interest to me but I don't want
> to re-invent the wheel. So, if this work is already underway or there are
> libraries out there to do it please let me know and I'll shut up :)
>
> Regards,
>
> Phillip
>
> [1]
> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>


Hyperparameter Optimization via Randomization

2021-01-29 Thread Phillip Henry
Hi,

I have no work at the moment so I was wondering if anybody would be
interested in me contributing code that generates an Array[ParamMap] for
random hyperparameters?

Apparently, this technique can find a hyperparameter setting in the top 5% of
the parameter space in fewer than 60 iterations with 95% confidence [1].
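
The arithmetic behind that figure can be checked with a short sketch. This assumes each trial is an independent random draw that lands in the top 5% with probability 0.05, so the chance that at least one of n trials does is 1 - 0.95^n. The function name `trials_needed` is hypothetical.

```python
import math


def trials_needed(top_fraction, confidence):
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence."""
    # Rearranged: (1 - top_fraction)**n <= 1 - confidence, then take logarithms.
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))


n = trials_needed(0.05, 0.95)
hit_probability = 1 - 0.95 ** n  # chance that at least one trial is in the top 5%
```

Under these assumptions n comes out at 59, consistent with "fewer than 60".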

I notice that the Spark code base has only the brute force ParamGridBuilder
unless I am missing something.

Hyperparameter optimization is an area of interest to me but I don't want
to re-invent the wheel. So, if this work is already underway or there are
libraries out there to do it please let me know and I'll shut up :)

Regards,

Phillip

[1]
https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html