Re: [discuss] DataFrame function namespacing

2015-05-04 Thread Reynold Xin
After talking with people on this thread and offline, I've decided to go
with option 1, i.e. putting everything in a single functions object.
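
For users, option 1 means a single flat import surface. A minimal sketch of what that looks like at the call site; it assumes a Spark 1.4-era SQLContext bound to sqlContext, and the column expressions are purely illustrative:

import org.apache.spark.sql.functions._    // option 1: the single flat functions object

// assumes an existing SQLContext named sqlContext
val df = sqlContext.range(10)              // DataFrame with a single "id" column
df.select(col("id"), abs(col("id") - 5)).show()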


 On Thu, Apr 30, 2015 at 10:04 AM, Ted Yu yuzhih...@gmail.com wrote:

  IMHO I would go with choice #1

  Cheers

  On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote:

   We definitely still have the name collision problem in SQL.

   On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal
   punya.bis...@gmail.com wrote:

    Do we still have to keep the names of the functions distinct to avoid
    collisions in SQL? Or is there a plan to allow importing a namespace
    into SQL somehow?

    I ask because if we have to keep worrying about name collisions then
    I'm not sure what the added complexity of #2 and #3 buys us.

    Punya

    On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:

     Scaladoc isn't much of a problem because scaladocs are grouped.
     Java/Python is the main problem ...

     See
     https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

     On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman
     shiva...@eecs.berkeley.edu wrote:

      My feeling is that we should have a handful of namespaces (say 4 or
      5). It becomes too cumbersome to import / remember more package
      names and having everything in one package makes it hard to read
      scaladoc etc.

      Thanks
      Shivaram

      On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com
      wrote:

       To add a little bit more context, some pros/cons I can think of are:

       Option 1: Very easy for users to find the function, since they are
       all in org.apache.spark.sql.functions. However, there will be quite
       a large number of them.

       Option 2: I can't tell why we would want this one over Option 3,
       since it has all the problems of Option 3, and not as nice of a
       hierarchy.

       Option 3: Opposite of Option 1. Each package or static class has a
       small number of functions that are relevant to each other, but for
       some functions it is unclear where they should go (e.g. should min
       go into basic or math?)

       On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com
       wrote:

        Before we make DataFrame non-alpha, it would be great to decide
        how we want to namespace all the functions. There are 3
        alternatives:

        1. Put all in org.apache.spark.sql.functions. This is how SQL
        does it, since SQL doesn't have namespaces. I estimate eventually
        we will have ~ 200 functions.

        2. Have explicit namespaces, which is what master branch
        currently looks like:

        - org.apache.spark.sql.functions
        - org.apache.spark.sql.mathfunctions
        - ...

        3. Have explicit namespaces, but restructure them slightly so
        everything is under functions.

        package object functions {
          // all the old functions here -- but deprecated so we keep
          // source compatibility
          def ...
        }

        package org.apache.spark.sql.functions

        object mathFunc {
          ...
        }

        object basicFuncs {
          ...
        }
 





Re: [discuss] DataFrame function namespacing

2015-04-30 Thread Ted Yu
IMHO I would go with choice #1

Cheers

 On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote:

  We definitely still have the name collision problem in SQL.

  On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal
  punya.bis...@gmail.com wrote:

   Do we still have to keep the names of the functions distinct to avoid
   collisions in SQL? Or is there a plan to allow importing a namespace
   into SQL somehow?

   I ask because if we have to keep worrying about name collisions then
   I'm not sure what the added complexity of #2 and #3 buys us.

   Punya

   On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:

    Scaladoc isn't much of a problem because scaladocs are grouped.
    Java/Python is the main problem ...

    See
    https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

    On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman
    shiva...@eecs.berkeley.edu wrote:

     My feeling is that we should have a handful of namespaces (say 4 or
     5). It becomes too cumbersome to import / remember more package names
     and having everything in one package makes it hard to read scaladoc
     etc.

     Thanks
     Shivaram

     On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com
     wrote:

      To add a little bit more context, some pros/cons I can think of are:

      Option 1: Very easy for users to find the function, since they are
      all in org.apache.spark.sql.functions. However, there will be quite
      a large number of them.

      Option 2: I can't tell why we would want this one over Option 3,
      since it has all the problems of Option 3, and not as nice of a
      hierarchy.

      Option 3: Opposite of Option 1. Each package or static class has a
      small number of functions that are relevant to each other, but for
      some functions it is unclear where they should go (e.g. should min
      go into basic or math?)

      On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com
      wrote:

       Before we make DataFrame non-alpha, it would be great to decide how
       we want to namespace all the functions. There are 3 alternatives:

       1. Put all in org.apache.spark.sql.functions. This is how SQL does
       it, since SQL doesn't have namespaces. I estimate eventually we
       will have ~ 200 functions.

       2. Have explicit namespaces, which is what master branch currently
       looks like:

       - org.apache.spark.sql.functions
       - org.apache.spark.sql.mathfunctions
       - ...

       3. Have explicit namespaces, but restructure them slightly so
       everything is under functions.

       package object functions {
         // all the old functions here -- but deprecated so we keep
         // source compatibility
         def ...
       }

       package org.apache.spark.sql.functions

       object mathFunc {
         ...
       }

       object basicFuncs {
         ...
       }



Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python
is the main problem ...

See
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 My feeling is that we should have a handful of namespaces (say 4 or 5). It
 becomes too cumbersome to import / remember more package names and having
 everything in one package makes it hard to read scaladoc etc.

 Thanks
 Shivaram

 On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote:

  To add a little bit more context, some pros/cons I can think of are:

  Option 1: Very easy for users to find the function, since they are all in
  org.apache.spark.sql.functions. However, there will be quite a large
  number of them.

  Option 2: I can't tell why we would want this one over Option 3, since it
  has all the problems of Option 3, and not as nice of a hierarchy.

  Option 3: Opposite of Option 1. Each package or static class has a small
  number of functions that are relevant to each other, but for some
  functions it is unclear where they should go (e.g. should min go into
  basic or math?)

  On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:

   Before we make DataFrame non-alpha, it would be great to decide how we
   want to namespace all the functions. There are 3 alternatives:

   1. Put all in org.apache.spark.sql.functions. This is how SQL does it,
   since SQL doesn't have namespaces. I estimate eventually we will have
   ~ 200 functions.

   2. Have explicit namespaces, which is what master branch currently
   looks like:

   - org.apache.spark.sql.functions
   - org.apache.spark.sql.mathfunctions
   - ...

   3. Have explicit namespaces, but restructure them slightly so
   everything is under functions.

   package object functions {
     // all the old functions here -- but deprecated so we keep source
     // compatibility
     def ...
   }

   package org.apache.spark.sql.functions

   object mathFunc {
     ...
   }

   object basicFuncs {
     ...
   }




Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
To add a little bit more context, some pros/cons I can think of are:

Option 1: Very easy for users to find the function, since they are all in
org.apache.spark.sql.functions. However, there will be quite a large number
of them.

Option 2: I can't tell why we would want this one over Option 3, since it
has all the problems of Option 3, and not as nice of a hierarchy.

Option 3: Opposite of Option 1. Each package or static class has a small
number of functions that are relevant to each other, but for some functions
it is unclear where they should go (e.g. should min go into basic or
math?)
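
To make the import surface concrete, here is a rough sketch of what each option implies for callers. Only the option 1 import is the existing API; the option 2 package is the one named in this thread as being on master at the time, and the option 3 object names (mathFunc, basicFuncs) are just the names proposed below, not a real API:

// Option 1: one flat object, one import covers all ~200 functions.
import org.apache.spark.sql.functions._

// Option 2: one import per top-level package (as on master at the time):
//   import org.apache.spark.sql.functions._
//   import org.apache.spark.sql.mathfunctions._

// Option 3: categories nested under functions; callers would have to know
// which category a function landed in (is min "basic" or "math"?):
//   import org.apache.spark.sql.functions.{basicFuncs, mathFunc}
//   df.select(basicFuncs.min(col("x")), mathFunc.abs(col("x")))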




On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:

 Before we make DataFrame non-alpha, it would be great to decide how we
 want to namespace all the functions. There are 3 alternatives:

 1. Put all in org.apache.spark.sql.functions. This is how SQL does it,
 since SQL doesn't have namespaces. I estimate eventually we will have ~ 200
 functions.

 2. Have explicit namespaces, which is what master branch currently looks
 like:

 - org.apache.spark.sql.functions
 - org.apache.spark.sql.mathfunctions
 - ...

 3. Have explicit namespaces, but restructure them slightly so everything
 is under functions.

 package object functions {

   // all the old functions here -- but deprecated so we keep source
   // compatibility
   def ...
 }

 package org.apache.spark.sql.functions

 object mathFunc {
   ...
 }

 object basicFuncs {
   ...
 }
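
Filled out slightly, the option 3 sketch quoted above could look like the following. This is a self-contained structural illustration with toy Double-based bodies rather than Spark's actual implementations; the point is that the old flat names stay on the package object, deprecated, and simply forward to the new category objects so existing imports keep compiling:

// Structural sketch of option 3 (toy implementations, not Spark's).
package object functions {
  // Old flat entry points stay here, deprecated, and forward to the
  // category objects below -- preserving source compatibility.
  @deprecated("use functions.mathFunc.abs instead", "1.4.0")
  def abs(x: Double): Double = mathFunc.abs(x)

  @deprecated("use functions.basicFuncs.min instead", "1.4.0")
  def min(a: Double, b: Double): Double = basicFuncs.min(a, b)
}

package functions {
  // New, smaller category objects that group related functions.
  object mathFunc {
    def abs(x: Double): Double = math.abs(x)
  }

  object basicFuncs {
    def min(a: Double, b: Double): Double = math.min(a, b)
  }
}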





Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Shivaram Venkataraman
My feeling is that we should have a handful of namespaces (say 4 or 5). It
becomes too cumbersome to import / remember more package names and having
everything in one package makes it hard to read scaladoc etc.

Thanks
Shivaram

On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote:

 To add a little bit more context, some pros/cons I can think of are:

 Option 1: Very easy for users to find the function, since they are all in
 org.apache.spark.sql.functions. However, there will be quite a large number
 of them.

 Option 2: I can't tell why we would want this one over Option 3, since it
 has all the problems of Option 3, and not as nice of a hierarchy.

 Option 3: Opposite of Option 1. Each package or static class has a small
 number of functions that are relevant to each other, but for some functions
 it is unclear where they should go (e.g. should min go into basic or
 math?)




 On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:

  Before we make DataFrame non-alpha, it would be great to decide how we
  want to namespace all the functions. There are 3 alternatives:
 
  1. Put all in org.apache.spark.sql.functions. This is how SQL does it,
  since SQL doesn't have namespaces. I estimate eventually we will have
  ~ 200 functions.
 
  2. Have explicit namespaces, which is what master branch currently looks
  like:
 
  - org.apache.spark.sql.functions
  - org.apache.spark.sql.mathfunctions
  - ...
 
  3. Have explicit namespaces, but restructure them slightly so everything
  is under functions.
 
  package object functions {
    // all the old functions here -- but deprecated so we keep source
    // compatibility
    def ...
  }
 
  package org.apache.spark.sql.functions
 
  object mathFunc {
...
  }
 
  object basicFuncs {
...
  }
 
 
 



Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
We definitely still have the name collision problem in SQL.
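
Concretely, SQL resolves functions by bare name, so two DataFrame functions could not share a name no matter how the Scala side is packaged. A small illustration, assuming a Spark 1.4-era sqlContext and that abs is callable from SQL in that version:

import org.apache.spark.sql.functions._

val df = sqlContext.range(10)
df.select(abs(col("id") - 5))     // Scala side: could in principle be namespaced
df.selectExpr("abs(id - 5)")      // SQL side: only the flat name abs exists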

On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote:

 Do we still have to keep the names of the functions distinct to avoid
 collisions in SQL? Or is there a plan to allow importing a namespace into
 SQL somehow?

 I ask because if we have to keep worrying about name collisions then I'm
 not sure what the added complexity of #2 and #3 buys us.

 Punya

 On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:

  Scaladoc isn't much of a problem because scaladocs are grouped.
  Java/Python is the main problem ...

  See
  https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

  On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:

   My feeling is that we should have a handful of namespaces (say 4 or 5).
   It becomes too cumbersome to import / remember more package names and
   having everything in one package makes it hard to read scaladoc etc.

   Thanks
   Shivaram

   On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote:

    To add a little bit more context, some pros/cons I can think of are:

    Option 1: Very easy for users to find the function, since they are all
    in org.apache.spark.sql.functions. However, there will be quite a
    large number of them.

    Option 2: I can't tell why we would want this one over Option 3, since
    it has all the problems of Option 3, and not as nice of a hierarchy.

    Option 3: Opposite of Option 1. Each package or static class has a
    small number of functions that are relevant to each other, but for
    some functions it is unclear where they should go (e.g. should min go
    into basic or math?)

    On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:

     Before we make DataFrame non-alpha, it would be great to decide how
     we want to namespace all the functions. There are 3 alternatives:

     1. Put all in org.apache.spark.sql.functions. This is how SQL does
     it, since SQL doesn't have namespaces. I estimate eventually we will
     have ~ 200 functions.

     2. Have explicit namespaces, which is what master branch currently
     looks like:

     - org.apache.spark.sql.functions
     - org.apache.spark.sql.mathfunctions
     - ...

     3. Have explicit namespaces, but restructure them slightly so
     everything is under functions.

     package object functions {
       // all the old functions here -- but deprecated so we keep
       // source compatibility
       def ...
     }

     package org.apache.spark.sql.functions

     object mathFunc {
       ...
     }

     object basicFuncs {
       ...
     }