[jira] [Updated] (SPARK-17845) Improve window function frame boundary API in DataFrame

Reynold Xin (JIRA) Sun, 09 Oct 2016 21:46:08 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Reynold Xin updated SPARK-17845:
--------------------------------
    Description: 
ANSI SQL uses the following to specify the frame boundaries for window 
functions:

{code}
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
{code}

In Spark's DataFrame API, we use integer values to indicate relative position:
- 0 means "CURRENT ROW"
- -1 means "1 PRECEDING"
- Long.MinValue means "UNBOUNDED PRECEDING"
- Long.MaxValue to indicate "UNBOUNDED FOLLOWING"

{code}
// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(-3, +3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsBetween(Long.MinValue, -3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetween(Long.MinValue, 0)

// ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
Window.rowsBetween(0, Long.MaxValue)

// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetween(Long.MinValue, Long.MaxValue)
{code}

I think using numeric values to indicate relative positions is actually a good 
idea, but the reliance on Long.MinValue and Long.MaxValue to indicate unbounded 
ends is pretty confusing:

1. The API is not self-evident. There is no way for a new user to figure out 
how to indicate an unbounded frame by looking at just the API. The user has to 
read the doc to figure this out.
2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
3. Different languages have different min/max values, e.g. in Python we use 
-sys.maxsize and +sys.maxsize.


To make this API less confusing, we have a few options:


Option 1. Add the following (additional) methods:

{code}
// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(-3, +3)  // this one exists already

// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsBetweenUnboundedPrecedingAnd(-3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetweenUnboundedPrecedingAndCurrentRow()

// ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
Window.rowsBetweenCurrentRowAndUnboundedFollowing()

// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
{code}

This is obviously very verbose, but is very similar to how these functions are 
done in SQL, and is perhaps the most obvious to end users, especially if they 
come from SQL background.


Option 2. Decouple the specification for frame begin and frame end into two 
functions. Assume the boundary is unlimited unless specified.

{code}
// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsFrom(-3).rowsTo(3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsTo(-3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsToCurrent() or Window.rowsTo(0)

// ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
Window.rowsFromCurrent() or Window.rowsFrom(0)

// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
// no need to specify
{code}

If we go with option 2, we should throw exceptions if users specify multiple 
from's or to's. A variant of option 2 is to require explicitly specification of 
begin/end even in the case of unbounded boundary, e.g.:

{code}
Window.rowsFromBeginning().rowsTo(-3)
or
Window.rowsFromUnboundedPreceding().rowsTo(-3)
{code}



  was:
ANSI SQL uses the following to specify the frame boundaries for window 
functions:

{code}
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
{code}

In Spark's DataFrame API, we use integer values to indicate relative position:
- 0 means "CURRENT ROW"
- -1 means "1 PRECEDING"
- Long.MinValue means "UNBOUNDED PRECEDING"
- Long.MaxValue to indicate "UNBOUNDED FOLLOWING"

{code}
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetween(Long.MinValue, 0)

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(-3, 3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetween(Long.MinValue, Long.MaxValue)
{code}

I think using numeric values to indicate relative positions is actually a good 
idea, but the reliance on Long.MinValue and Long.MaxValue to indicate unbounded 
ends is pretty confusing:

1. The API is not self-evident. There is no way for a new user to figure out 
how to indicate an unbounded frame by looking at just the API. The user has to 
read the doc to figure this out.
2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
3. Different languages have different min/max values, e.g. in Python we use 
-sys.maxsize and +sys.maxsize.




> Improve window function frame boundary API in DataFrame
> -------------------------------------------------------
>
>                 Key: SPARK-17845
>                 URL: https://issues.apache.org/jira/browse/SPARK-17845
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicitly specification 
> of begin/end even in the case of unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17845) Improve window function frame boundary API in DataFrame

Reply via email to