subject:"spark git commit\: \[SPARK\-17895\] Improve doc for rangeBetween and rowsBetween"

spark git commit: [SPARK-17895] Improve doc for rangeBetween and rowsBetween

2016-11-02 Thread rxin

Repository: spark
Updated Branches:
  refs/heads/master 4af0ce2d9 -> 742e0fea5


[SPARK-17895] Improve doc for rangeBetween and rowsBetween

## What changes were proposed in this pull request?

Copied description for row and range based frame boundary from 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56

Added examples to show different behavior of rangeBetween and rowsBetween when 
involving duplicate values.

Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a pull request.

Author: buzhihuojie 

Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/742e0fea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/742e0fea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/742e0fea

Branch: refs/heads/master
Commit: 742e0fea5391857964e90d396641ecf95cac4248
Parents: 4af0ce2
Author: buzhihuojie 
Authored: Wed Nov 2 11:36:20 2016 -0700
Committer: Reynold Xin 
Committed: Wed Nov 2 11:36:20 2016 -0700

--
 .../apache/spark/sql/expressions/Window.scala   | 55 
 .../spark/sql/expressions/WindowSpec.scala  | 55 
 2 files changed, 110 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/742e0fea/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
index 0b26d86..327bc37 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
@@ -121,6 +121,32 @@ object Window {
* and [[Window.currentRow]] to specify special boundary values, rather than 
using integral
* values directly.
*
+   * A row based boundary is based on the position of the row within the 
partition.
+   * An offset indicates the number of rows above or below the current row, 
the frame for the
+   * current row starts or ends. For instance, given a row based sliding frame 
with a lower bound
+   * offset of -1 and a upper bound offset of +2. The frame for row with index 
5 would range from
+   * index 4 to index 6.
+   *
+   * {{{
+   *   import org.apache.spark.sql.expressions.Window
+   *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+   * .toDF("id", "category")
+   *   df.withColumn("sum",
+   *   sum('id) over 
Window.partitionBy('category).orderBy('id).rowsBetween(0,1))
+   * .show()
+   *
+   *   +---++---+
+   *   | id|category|sum|
+   *   +---++---+
+   *   |  1|   b|  3|
+   *   |  2|   b|  5|
+   *   |  3|   b|  3|
+   *   |  1|   a|  2|
+   *   |  1|   a|  3|
+   *   |  2|   a|  2|
+   *   +---++---+
+   * }}}
+   *
* @param start boundary start, inclusive. The frame is unbounded if this is
*  the minimum long value ([[Window.unboundedPreceding]]).
* @param end boundary end, inclusive. The frame is unbounded if this is the
@@ -144,6 +170,35 @@ object Window {
* and [[Window.currentRow]] to specify special boundary values, rather than 
using integral
* values directly.
*
+   * A range based boundary is based on the actual value of the ORDER BY
+   * expression(s). An offset is used to alter the value of the ORDER BY 
expression, for
+   * instance if the current order by expression has a value of 10 and the 
lower bound offset
+   * is -3, the resulting lower bound for the current row will be 10 - 3 = 7. 
This however puts a
+   * number of constraints on the ORDER BY expressions: there can be only one 
expression and this
+   * expression must have a numerical data type. An exception can be made when 
the offset is 0,
+   * because no value modification is needed, in this case multiple and 
non-numeric ORDER BY
+   * expression are allowed.
+   *
+   * {{{
+   *   import org.apache.spark.sql.expressions.Window
+   *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+   * .toDF("id", "category")
+   *   df.withColumn("sum",
+   *   sum('id) over 
Window.partitionBy('category).orderBy('id).rangeBetween(0,1))
+   * .show()
+   *
+   *   +---++---+
+   *   | id|category|sum|
+   *   +---++---+
+   *   |  1|   b|  3|
+   *   |  2|   b|  5|
+   *   |  3|   b|  3|
+   *   |  1|   a|  4|
+   *   |  1|   a|  4|
+   *   |  2|   a|  2|
+   *   +---++---+
+   * }}}
+   *

spark git commit: [SPARK-17895] Improve doc for rangeBetween and rowsBetween

2016-11-02 Thread rxin

Repository: spark
Updated Branches:
  refs/heads/branch-2.1 9be069125 -> a885d5bbc


[SPARK-17895] Improve doc for rangeBetween and rowsBetween

## What changes were proposed in this pull request?

Copied description for row and range based frame boundary from 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56

Added examples to show different behavior of rangeBetween and rowsBetween when 
involving duplicate values.

Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a pull request.

Author: buzhihuojie 

Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.

(cherry picked from commit 742e0fea5391857964e90d396641ecf95cac4248)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a885d5bb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a885d5bb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a885d5bb

Branch: refs/heads/branch-2.1
Commit: a885d5bbce9dba66b394850b3aac51ae97cb18dd
Parents: 9be0691
Author: buzhihuojie 
Authored: Wed Nov 2 11:36:20 2016 -0700
Committer: Reynold Xin 
Committed: Wed Nov 2 11:36:26 2016 -0700

--
 .../apache/spark/sql/expressions/Window.scala   | 55 
 .../spark/sql/expressions/WindowSpec.scala  | 55 
 2 files changed, 110 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a885d5bb/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
index 0b26d86..327bc37 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
@@ -121,6 +121,32 @@ object Window {
* and [[Window.currentRow]] to specify special boundary values, rather than 
using integral
* values directly.
*
+   * A row based boundary is based on the position of the row within the 
partition.
+   * An offset indicates the number of rows above or below the current row, 
the frame for the
+   * current row starts or ends. For instance, given a row based sliding frame 
with a lower bound
+   * offset of -1 and a upper bound offset of +2. The frame for row with index 
5 would range from
+   * index 4 to index 6.
+   *
+   * {{{
+   *   import org.apache.spark.sql.expressions.Window
+   *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+   * .toDF("id", "category")
+   *   df.withColumn("sum",
+   *   sum('id) over 
Window.partitionBy('category).orderBy('id).rowsBetween(0,1))
+   * .show()
+   *
+   *   +---++---+
+   *   | id|category|sum|
+   *   +---++---+
+   *   |  1|   b|  3|
+   *   |  2|   b|  5|
+   *   |  3|   b|  3|
+   *   |  1|   a|  2|
+   *   |  1|   a|  3|
+   *   |  2|   a|  2|
+   *   +---++---+
+   * }}}
+   *
* @param start boundary start, inclusive. The frame is unbounded if this is
*  the minimum long value ([[Window.unboundedPreceding]]).
* @param end boundary end, inclusive. The frame is unbounded if this is the
@@ -144,6 +170,35 @@ object Window {
* and [[Window.currentRow]] to specify special boundary values, rather than 
using integral
* values directly.
*
+   * A range based boundary is based on the actual value of the ORDER BY
+   * expression(s). An offset is used to alter the value of the ORDER BY 
expression, for
+   * instance if the current order by expression has a value of 10 and the 
lower bound offset
+   * is -3, the resulting lower bound for the current row will be 10 - 3 = 7. 
This however puts a
+   * number of constraints on the ORDER BY expressions: there can be only one 
expression and this
+   * expression must have a numerical data type. An exception can be made when 
the offset is 0,
+   * because no value modification is needed, in this case multiple and 
non-numeric ORDER BY
+   * expression are allowed.
+   *
+   * {{{
+   *   import org.apache.spark.sql.expressions.Window
+   *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+   * .toDF("id", "category")
+   *   df.withColumn("sum",
+   *   sum('id) over 
Window.partitionBy('category).orderBy('id).rangeBetween(0,1))
+   * .show()
+   *
+   *   +---++---+
+   *   | id|category|sum|
+   *   +---++---+
+   *   |  1|   b|  3|
+   *   |  2|   b|  5|
+   *   |  3|   b|

spark git commit: [SPARK-17895] Improve doc for rangeBetween and rowsBetween

spark git commit: [SPARK-17895] Improve doc for rangeBetween and rowsBetween

2 matches

Site Navigation

Mail list logo

Footer information