[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

Cheng Lian (JIRA) Sun, 09 Oct 2016 22:54:49 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15561376#comment-15561376
 ]


Cheng Lian commented on SPARK-17845:
------------------------------------

One thing is that ANSI SQL also allows using arbitrary integral expressions to 
specify frame boundaries. For example, it's legal to do something like this:

{code:sql}
SELECT max(id) OVER (ROWS BETWEEN id +1 PRECEDING AND id % 3 FOLLOWING) FROM t
{code}

where id is a column of {{INT}}.

Although many other databases, including PostgreSQL, don't support this syntax, 
Presto does support it:

{noformat}
presto> select sum(id) over (rows between id + 1 preceding and id % 3 
following) from (values 1, 2, 3) as t(id);
 _col0
-------
     3
     6
     6
(3 rows)
{noformat}

If we want to have another API for specifying window frame boundaries, it would 
be nice to take this scenario into consideration, just in case we want to 
support it in the future.

With this in mind, I'd propose a more verbose but more flexible and readable 
API as following:

{code}
trait FrameBoundary extends Expression

case object CurrentRow extends FrameBoundary
case object UnboundedPreceding extends FrameBoundary
case object UnboundedFollowing extends FrameBoundary

case class Preceding(offset: Expression) extends FrameBoundary
case class Following(offset: Expression) extends FrameBoundary

Window.rowsBetween(Preceding($"id" + 1), Following($"id" % 3))
Window.rangeBetween(CurrentRow, UnboundedFollowing)
{code}

We can have shortcut constructors like `Following(1)`, which is equivalent to 
`Following(lit(1))`, for convenience. At the current stage, we can restrict the 
offset expressions to be constant integral expressions only.

The above version is just a skeleton and hasn't taken Java interoperability 
into account yet, though.


> Improve window function frame boundary API in DataFrame
> -------------------------------------------------------
>
>                 Key: SPARK-17845
>                 URL: https://issues.apache.org/jira/browse/SPARK-17845
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicitly specification 
> of begin/end even in the case of unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

Reply via email to