Re: [discuss] DataFrame function namespacing
After talking with people on this thread and offline, I've decided to go with option 1, i.e. putting everything in a single functions object.

On Thu, Apr 30, 2015 at 10:04 AM, Ted Yu yuzhih...@gmail.com wrote:

IMHO I would go with choice #1

Cheers

On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote:

We definitely still have the name collision problem in SQL.

On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote:

Do we still have to keep the names of the functions distinct to avoid collisions in SQL? Or is there a plan to allow importing a namespace into SQL somehow?

I ask because if we have to keep worrying about name collisions then I'm not sure what the added complexity of #2 and #3 buys us.

Punya

On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:

Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python is the main problem ...

See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

My feeling is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names and having everything in one package makes it hard to read scaladoc etc.

Thanks
Shivaram

On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote:

To add a little bit more context, some pros/cons I can think of are:

Option 1: Very easy for users to find the function, since they are all in org.apache.spark.sql.functions. However, there will be quite a large number of them.

Option 2: I can't tell why we would want this one over Option 3, since it has all the problems of Option 3, and not as nice of a hierarchy.

Option 3: Opposite of Option 1. Each package or static class has a small number of functions that are relevant to each other, but for some functions it is unclear where they should go (e.g. should min go into basic or math?)

On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:

Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives:

1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL doesn't have namespaces. I estimate eventually we will have ~ 200 functions.

2. Have explicit namespaces, which is what master branch currently looks like:

   - org.apache.spark.sql.functions
   - org.apache.spark.sql.mathfunctions
   - ...

3. Have explicit namespaces, but restructure them slightly so everything is under functions.

   package object functions {
     // all the old functions here -- but deprecated so we keep source compatibility
     def ...
   }

   package org.apache.spark.sql.functions

   object mathFunc { ... }
   object basicFuncs { ... }
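
For readers comparing the alternatives, here is a minimal sketch of how Option 1 and Option 3 might look from the caller's side. It is an illustration only and does not use Spark: `Column` is a hypothetical stand-in type, `namespaced` is a made-up wrapper object, and the function bodies just build strings. Only the names `functions`, `mathFunc`, and `basicFuncs` come from the thread.

```scala
// Hypothetical stand-in for a column expression, so the sketch runs without Spark.
final case class Column(expr: String)

// Option 1: a single flat object holds every function.
object functions {
  def col(name: String): Column = Column(name)
  def min(c: Column): Column = Column(s"min(${c.expr})")
  def sqrt(c: Column): Column = Column(s"sqrt(${c.expr})")
}

// Option 3 (shape only): small objects grouped under one parent.
// Placing min in basicFuncs is an arbitrary choice made for this sketch.
object namespaced {
  object basicFuncs {
    def col(name: String): Column = Column(name)
    def min(c: Column): Column = Column(s"min(${c.expr})")
  }
  object mathFunc {
    def sqrt(c: Column): Column = Column(s"sqrt(${c.expr})")
  }
}

object Demo {
  // One wildcard import brings every function into scope.
  def flat: Column = {
    import functions._
    sqrt(min(col("x")))
  }

  // The caller has to know which group each function lives in.
  def nested: Column = {
    import namespaced.basicFuncs._
    import namespaced.mathFunc._
    sqrt(min(col("x")))
  }
}
```

Both call sites build the same expression. The difference is the import surface: Option 1 needs one import that exposes every name, while Option 3 needs one import per group and forces a placement decision for functions such as min.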
Re: [discuss] DataFrame function namespacing
IMHO I would go with choice #1

Cheers
Re: [discuss] DataFrame function namespacing
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python is the main problem ...

See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Re: [discuss] DataFrame function namespacing
To add a little bit more context, some pros/cons I can think of are:

Option 1: Very easy for users to find the function, since they are all in org.apache.spark.sql.functions. However, there will be quite a large number of them.

Option 2: I can't tell why we would want this one over Option 3, since it has all the problems of Option 3, and not as nice of a hierarchy.

Option 3: Opposite of Option 1. Each package or static class has a small number of functions that are relevant to each other, but for some functions it is unclear where they should go (e.g. should min go into basic or math?)
Re: [discuss] DataFrame function namespacing
My feeling is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names and having everything in one package makes it hard to read scaladoc etc.

Thanks
Shivaram
Re: [discuss] DataFrame function namespacing
We definitely still have the name collision problem in SQL.
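
The SQL name-collision concern can be illustrated with a small sketch. It assumes, for illustration only, that SQL looks functions up by bare name in one flat table; `FlatRegistry` and its `register` method are hypothetical and are not Spark's actual registry API.

```scala
import scala.collection.mutable

// Hypothetical flat registry: functions are looked up by bare name only,
// so the Scala-side namespace a function came from plays no part in lookup.
final class FlatRegistry {
  private val ownerByName = mutable.Map.empty[String, String]

  // Registers `name` for `owner`; returns Left with a message if the name is taken.
  def register(name: String, owner: String): Either[String, Unit] =
    ownerByName.get(name) match {
      case Some(existing) =>
        Left(s"'$name' is already registered by $existing")
      case None =>
        ownerByName(name) = owner
        Right(())
    }
}
```

Under this assumption, registering min from basicFuncs and then min from mathFunc fails on the second call, even though the two definitions sit in different Scala objects. That is the sense in which function names would still have to stay distinct under options 2 and 3.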