Re: Hyperparameter Optimization via Randomization

2021-01-30 Thread Phillip Henry
Hi, Sean.

Perhaps I don't understand. As I see it, ParamGridBuilder builds an
Array[ParamMap]. What I am proposing is a new class that also builds an
Array[ParamMap] via its build() method, so there would be no "change in the
APIs". This new class would, of course, have methods that defined the
search space (log, linear, etc) over which random values were chosen.
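To make that concrete, here is a minimal sketch of what such a class could
look like (the class and method names are hypothetical, not existing Spark
API, and only Double params are handled):

    import scala.util.Random
    import org.apache.spark.ml.param.{DoubleParam, ParamMap}

    // Hypothetical builder: like ParamGridBuilder, build() returns an
    // Array[ParamMap], but values are drawn at random from user-defined
    // ranges instead of being enumerated exhaustively.
    class RandomParamGridBuilder(seed: Long = 42L) {
      private val rng = new Random(seed)
      private var draws: List[(DoubleParam, () => Double)] = Nil

      // Sample uniformly on a linear scale.
      def addLinear(p: DoubleParam, lo: Double, hi: Double): this.type = {
        draws = (p, () => lo + rng.nextDouble() * (hi - lo)) :: draws
        this
      }

      // Sample uniformly on a log scale, e.g. a regularization parameter
      // searched between 1e-7 and 1e-4.
      def addLog(p: DoubleParam, lo: Double, hi: Double): this.type = {
        val (l, h) = (math.log10(lo), math.log10(hi))
        draws = (p, () => math.pow(10, l + rng.nextDouble() * (h - l))) :: draws
        this
      }

      // Same contract as ParamGridBuilder.build(): one ParamMap per candidate.
      def build(numModels: Int): Array[ParamMap] =
        Array.fill(numModels) {
          val pm = ParamMap.empty
          draws.foreach { case (p, draw) => pm.put(p, draw()) }
          pm
        }
    }

Because build() still returns an Array[ParamMap], the output plugs into
CrossValidator or TrainValidationSplit exactly where ParamGridBuilder's
output goes today.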

Now, if this is too trivial to warrant the work and people prefer Hyperopt,
then so be it. It might be useful for people not using Python, but they can
just roll their own, I guess.

Anyway, looking forward to hearing what you think.

Regards,

Phillip



On Fri, Jan 29, 2021 at 4:18 PM Sean Owen  wrote:

> I think that's a bit orthogonal - right now you can't specify continuous
> spaces. The straightforward thing is to allow random sampling from a big
> grid. You can create a geometric series of values to try, of course -
> 0.001, 0.01, 0.1, etc.
> Yes I get that if you're randomly choosing, you can randomly choose from a
> continuous space of many kinds. I don't know if it helps a lot vs the
> change in APIs (and continuous spaces don't make as much sense for grid
> search)
> Of course it helps a lot if you're doing a smarter search over the space,
> like what hyperopt does. For that, I mean, one can just use hyperopt +
> Spark ML already if desired.
>
> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry 
> wrote:
>
>> Thanks, Sean! I hope to offer a PR next week.
>>
>> Not sure about a dependency on the grid search, though - happy to
>> hear your thoughts. I mean, you might want to explore logarithmic space
>> evenly. For example, something like "please search 1e-7 to 1e-4" leads to
>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>> evenly spaced in logarithmic space but not in linear space. So, saying what
>> fraction of a grid search to sample wouldn't make sense (unless the grid
>> was warped, of course).
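For illustration, a minimal sketch of the two sampling schemes (assuming a
plain uniform random source):

    import scala.util.Random
    val rng = new Random(0)

    // Log-uniform on [1e-7, 1e-4]: the exponent is drawn uniformly, so each
    // decade (1e-7..1e-6, 1e-6..1e-5, 1e-5..1e-4) is sampled about equally often.
    def logUniform(lo: Double, hi: Double): Double =
      math.pow(10, math.log10(lo) + rng.nextDouble() * (math.log10(hi) - math.log10(lo)))

    // Plain uniform on the same interval: roughly 90% of draws land in
    // [1e-5, 1e-4], so the smaller decades are almost never explored.
    def linUniform(lo: Double, hi: Double): Double =
      lo + rng.nextDouble() * (hi - lo)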
>>
>> Does that make sense? It might be better for me to just write the code as
>> I don't think it would be very complicated.
>>
>> Happy to hear your thoughts.
>>
>> Phillip
>>
>>
>>
>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:
>>
>>> I don't know of anyone working on that. Yes I think it could be useful.
>>> I think it might be easiest to implement by simply having some parameter to
>>> the grid search process that says what fraction of all possible
>>> combinations you want to randomly test.
>>>
>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
>>> wrote:
>>>
 Hi,

 I have no work at the moment so I was wondering if anybody would be
 interested in me contributing code that generates an Array[ParamMap] for
 random hyperparameters?

 Apparently, this technique can find a hyperparameter in the top 5% of
 parameter space in fewer than 60 iterations with 95% confidence [1].
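 (For reference, the arithmetic behind that claim: a single random draw misses
 the top 5% with probability 0.95, so n independent draws all miss it with
 probability 0.95^n. Requiring 0.95^n <= 0.05 gives n >= log(0.05)/log(0.95),
 roughly 59, hence "fewer than 60 iterations with 95% confidence".)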

 I notice that the Spark code base has only the brute force
 ParamGridBuilder unless I am missing something.

 Hyperparameter optimization is an area of interest to me but I don't
 want to re-invent the wheel. So, if this work is already underway or there
 are libraries out there to do it please let me know and I'll shut up :)

 Regards,

 Phillip

 [1]
 https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

>>>


Re: Hyperparameter Optimization via Randomization

2021-01-30 Thread Sean Owen
I was thinking ParamGridBuilder would have to change to accommodate a
continuous range of values, and that's not hard, though other code wouldn't
understand that type of value, like the existing simple grid builder.
It's all possible; I'm just wondering if simply randomly sampling the grid is
enough. That would be a simpler change, just a new method or argument.
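A minimal sketch of that simpler variant (the helper name is hypothetical,
and it assumes the full grid has already been built):

    import scala.util.Random
    import org.apache.spark.ml.param.ParamMap

    // Hypothetical helper: evaluate only a random fraction of the full
    // cartesian grid produced by ParamGridBuilder.build().
    def sampleGrid(fullGrid: Array[ParamMap], fraction: Double,
                   seed: Long = 42L): Array[ParamMap] = {
      val n = math.max(1, math.round(fullGrid.length * fraction).toInt)
      new Random(seed).shuffle(fullGrid.toSeq).take(n).toArray
    }

    // e.g. sampleGrid(new ParamGridBuilder().addGrid(...).addGrid(...).build(), 0.3)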

Yes, part of it is that if you really want to search continuous spaces,
hyperopt is probably even better, so how much do you want to put into
PySpark - something really simple, sure.
It's not out of the question to do something more complex if it turns out to
also be pretty simple.

On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry 
wrote:

> Hi, Sean.
>
> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
> Array[ParamMap]. What I am proposing is a new class that also builds an
> Array[ParamMap] via its build() method, so there would be no "change in the
> APIs". This new class would, of course, have methods that defined the
> search space (log, linear, etc) over which random values were chosen.
>
> Now, if this is too trivial to warrant the work and people prefer
> Hyperopt, then so be it. It might be useful for people not using Python but
> they can just roll-their-own, I guess.
>
> Anyway, looking forward to hearing what you think.
>
> Regards,
>
> Phillip
>
>
>
> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen  wrote:
>
>> I think that's a bit orthogonal - right now you can't specify continuous
>> spaces. The straightforward thing is to allow random sampling from a big
>> grid. You can create a geometric series of values to try, of course -
>> 0.001, 0.01, 0.1, etc.
>> Yes I get that if you're randomly choosing, you can randomly choose from
>> a continuous space of many kinds. I don't know if it helps a lot vs the
>> change in APIs (and continuous spaces don't make as much sense for grid
>> search)
>> Of course it helps a lot if you're doing a smarter search over the space,
>> like what hyperopt does. For that, I mean, one can just use hyperopt +
>> Spark ML already if desired.
>>
>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry 
>> wrote:
>>
>>> Thanks, Sean! I hope to offer a PR next week.
>>>
>>> Not sure about a dependency on the grid search, though - but happy to
>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>> evenly. For example,  something like "please search 1e-7 to 1e-4" leads to
>>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>>> evenly spaced in logarithmic space but not in linear space. So, saying what
>>> fraction of a grid search to sample wouldn't make sense (unless the grid
>>> was warped, of course).
>>>
>>> Does that make sense? It might be better for me to just write the code
>>> as I don't think it would be very complicated.
>>>
>>> Happy to hear your thoughts.
>>>
>>> Phillip
>>>
>>>
>>>
>>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:
>>>
 I don't know of anyone working on that. Yes I think it could be useful.
 I think it might be easiest to implement by simply having some parameter to
 the grid search process that says what fraction of all possible
 combinations you want to randomly test.

 On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
 wrote:

> Hi,
>
> I have no work at the moment so I was wondering if anybody would be
> interested in me contributing code that generates an Array[ParamMap] for
> random hyperparameters?
>
> Apparently, this technique can find a hyperparameter in the top 5% of
> parameter space in fewer than 60 iterations with 95% confidence [1].
>
> I notice that the Spark code base has only the brute force
> ParamGridBuilder unless I am missing something.
>
> Hyperparameter optimization is an area of interest to me but I don't
> want to re-invent the wheel. So, if this work is already underway or there
> are libraries out there to do it please let me know and I'll shut up :)
>
> Regards,
>
> Phillip
>
> [1]
> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>



Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Maciej
Just thinking out loud ‒ if there is community need for providing
language bindings for less popular SQL functions, could these live
outside main project or even outside the ASF?  As long as expressions
are already implemented, bindings are trivial after all.

It could also allow a more scalable hierarchy (let's say with
modules / packages per function family).
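As a sketch of how thin such a binding can be, using regexp_extract_all
(whose Catalyst expression already exists on the SQL side) and ignoring
escaping and input validation:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.expr

    // Out-of-tree binding: delegate to the SQL function that is already
    // implemented, so only this thin wrapper lives outside Spark.
    object ExtraFunctions {
      def regexp_extract_all(colName: String, pattern: String, groupIdx: Int = 1): Column =
        expr(s"regexp_extract_all(`$colName`, '$pattern', $groupIdx)")
    }

A small external library could collect wrappers like this, grouped into
modules per function family as suggested above.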

On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
> FYI exposing methods with Column signature only is already documented
> on the top of functions.scala, and I believe that has been the current
> dev direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used
> expressions. It's best if it considers language-specific context. Many
> of the expressions are for SQL compliance. Many data science Python
> libraries don't support such features, for example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers,  wrote:
>
> Thanks for the thoughtful responses.  I now understand why adding
> all the functions across all the APIs isn't the default.
>
> To Nick's point, relying on heuristics to gauge user interest, in
> addition to personal experience, is a good idea.  The
> regexp_extract_all SO thread has 16,000 views, so I say we set the
> threshold to 10k, haha, just kidding!  Like
> Sean mentioned, we don't want to add niche functions.  Now we just
> need a way to figure out what's niche!
>
> To Reynolds point on overloading Scala functions, I think we
> should start trying to limit the number of overloaded functions. 
> Some functions have the columnName and column object function
> signatures.  e.g. approx_count_distinct(columnName: String, rsd:
> Double) and approx_count_distinct(e: Column, rsd: Double).  We can
> just expose the approx_count_distinct(e: Column, rsd: Double)
> variety going forward (not suggesting any backwards incompatible
> changes, just saying we don't need the columnName-type functions
> for new stuff).
>
> Other functions have one signature with the second object as a
> Scala object and another signature with the second object as a
> column object, e.g. date_add(start: Column, days: Column) and
> date_add(start: Column, days: Int).  We can just expose the
> date_add(start: Column, days: Column) variety cause it's general
> purpose.  Let me know if you think that avoiding Scala function
> overloading will help Reynold.
>
> Let's brainstorm Nick's idea of creating a framework that'd test
> Scala / Python / SQL / R implementations in one-fell-swoop.  Seems
> like that'd be a great way to reduce the maintenance burden. 
> Reynold's regexp_extract code from 5 years ago is largely still
> intact - getting the job done right the first time is another
> great way to avoid maintenance!
>
> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:
>
> There's another thing that's not mentioned … it's primarily a
> problem for Scala. Due to static typing, we need a very large
> number of function overloads for the Scala version of each
> function, whereas in SQL/Python they are just one. There's a
> limit on how many functions we can add, and it also makes it
> difficult to browse through the docs when there are a lot of
> functions.
>
>
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>
> Just my two cents on R side.
>
> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:
>>
>> It isn't that regexp_extract_all (for example) is
>> useless outside SQL, just, where do you draw the
>> line? Supporting 10s of random SQL functions across 3
>> other languages has a cost, which has to be weighed
>> against benefit, which we can never measure well
>> except anecdotally: one or two people say "I want
>> this" in a sea of hundreds of thousands of users.
>>
>>
>> +1 to this, but I will add that Jira and Stack Overflow
>> activity can sometimes give good signals about API gaps
>> that are frustrating users. If there is an SO question
>> with 30K views about how to do something that should have
>> been easier, then that's an important signal about the API.
>>
>> For this specific case, I think there is a fine
>> argument that regexp_extract_all should be added
>> simply for consistency with regexp_extract. I can
>> also see the argument that regexp_extract was a 

How to contribute the code

2021-01-30 Thread hammerCS
I have been learning Spark for one year, and now I want to get the hang of
Spark by contributing code. But the codebase is so large that I don't know
where to start.






Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Matthew Powers
Maciej - I like the idea of a separate library to provide easy access to
functions that the maintainers don't want to merge into Spark core.

I've seen this model work well in other open source communities.  The Rails
Active Support library provides the Ruby community with core functionality
like beginning_of_month.  The Ruby community has a good, well-supported
function, but it's not in the Ruby codebase so it's not a maintenance
burden - best of both worlds.

I'll start a proof-of-concept repo.  If the repo gets popular, I'll be
happy to donate it to a GitHub organization like Awesome Spark or the ASF.

On Sat, Jan 30, 2021 at 9:35 AM Maciej  wrote:

> Just thinking out loud ‒ if there is community need for providing language
> bindings for less popular SQL functions, could these live outside main
> project or even outside the ASF?  As long as expressions are already
> implemented, bindings are trivial after all.
>
> It could also allow a more scalable hierarchy (let's say with
> modules / packages per function family).
>
> On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
>
> FYI exposing methods with Column signature only is already documented on
> the top of functions.scala, and I believe that has been the current dev
> direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used expressions.
> It's best if it considers language-specific context. Many of the expressions
> are for SQL compliance. Many data science Python libraries don't support such
> features, for example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers, 
> wrote:
>
>> Thanks for the thoughtful responses.  I now understand why adding all the
>> functions across all the APIs isn't the default.
>>
>> To Nick's point, relying on heuristics to gauge user interest, in
>> addition to personal experience, is a good idea.  The regexp_extract_all
>> SO thread has 16,000 views, so I say we set the threshold to 10k, haha,
>> just kidding!  Like Sean
>> mentioned, we don't want to add niche functions.  Now we just need a way to
>> figure out what's niche!
>>
>> To Reynolds point on overloading Scala functions, I think we should start
>> trying to limit the number of overloaded functions.  Some functions have
>> the columnName and column object function signatures.  e.g.
>> approx_count_distinct(columnName: String, rsd: Double) and
>> approx_count_distinct(e: Column, rsd: Double).  We can just expose the
>> approx_count_distinct(e: Column, rsd: Double) variety going forward (not
>> suggesting any backwards incompatible changes, just saying we don't need
>> the columnName-type functions for new stuff).
>>
>> Other functions have one signature with the second object as a Scala
>> object and another signature with the second object as a column object,
>> e.g. date_add(start: Column, days: Column) and date_add(start: Column,
>> days: Int).  We can just expose the date_add(start: Column, days: Column)
>> variety cause it's general purpose.  Let me know if you think that avoiding
>> Scala function overloading will help Reynold.
>>
>> Let's brainstorm Nick's idea of creating a framework that'd test Scala /
>> Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a
>> great way to reduce the maintenance burden.  Reynold's regexp_extract code
>> from 5 years ago is largely still intact - getting the job done right the
>> first time is another great way to avoid maintenance!
>>
>> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:
>>
>>> There's another thing that's not mentioned … it's primarily a problem
>>> for Scala. Due to static typing, we need a very large number of function
>>> overloads for the Scala version of each function, whereas in SQL/Python
>>> they are just one. There's a limit on how many functions we can add, and it
>>> also makes it difficult to browse through the docs when there are a lot of
>>> functions.
>>>
>>>
>>>
>>> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>>>
 Just my two cents on R side.

 On 1/28/21 10:00 PM, Nicholas Chammas wrote:

 On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:

> It isn't that regexp_extract_all (for example) is useless outside SQL,
> just, where do you draw the line? Supporting 10s of random SQL functions
> across 3 other languages has a cost, which has to be weighed against
> benefit, which we can never measure well except anecdotally: one or two
> people say "I want this" in a sea of hundreds of thousands of users.
>

 +1 to this, but I will add that Jira and Stack Overflow activity can
 sometimes give good signals about API gaps that are frustrating users. If
 there is an SO question with 30K views about how to do something that
 should have been easier, then that's an important signal about the API.

 For this specific cas

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Kent Yao





Hi, MrPowers,

I'm also interested in this idea. I started
https://github.com/yaooqinn/spark-func-extras a few months ago.

On 2021/01/30 15:45:30, Matthew Powers  wrote:
> Maciej - I like the idea of a separate library to provide easy access to
> functions that the maintainers don't want to merge into Spark core.