[GitHub] [spark] HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark][SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
URL: https://github.com/apache/spark/pull/23946#discussion_r263692570

## File path: python/pyspark/sql/window.py

```diff
@@ -97,6 +97,33 @@ def rowsBetween(start, end):
     and ``Window.currentRow`` to specify special boundary values, rather than
     using integral values directly.
+
+    A row-based boundary is based on the position of the row within the partition.
+    An offset indicates the number of rows above or below the current row at which
+    the frame for the current row starts or ends. For instance, given a row-based
+    sliding frame with a lower bound offset of -1 and an upper bound offset of +2,
+    the frame for the row with index 5 would range from index 4 to index 7.
+
+    >>> from pyspark.sql import Window
+    >>> from pyspark.sql import functions as func
+    >>> from pyspark.sql import SQLContext
+    >>> sc = SparkContext.getOrCreate()
+    >>> sqlContext = SQLContext(sc)
+    >>> tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
+    >>> df = sqlContext.createDataFrame(tup, ["id", "category"])
+    >>> window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1)
+    >>> df.withColumn("sum", func.sum("id").over(window)).show()
+    +---+--------+---+
+    | id|category|sum|
+    +---+--------+---+
+    |  1|       b|  3|
+    |  2|       b|  5|
+    |  3|       b|  3|
+    |  1|       a|  2|
+    |  1|       a|  3|
+    |  2|       a|  2|
+    +---+--------+---+
```

Review comment:
Nope, it's not necessary. `optionflags=doctest.NORMALIZE_WHITESPACE` is needed just to make the doc prettier (by getting rid of ``).

```python
(failure_count, test_count) = doctest.testmod(
    pyspark.sql.window,
```

is just to make the module path pretty. In the console, the module path would otherwise show up like `__main__.bla.bla`; this way, it shows up like `pyspark.sql.window.bla.bla`. Not a big deal at all.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
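A minimal, Spark-free sketch of what `optionflags=doctest.NORMALIZE_WHITESPACE` buys you. The `show_pair` function and its docstring are illustrative (not from the PR): with the flag enabled, runs of whitespace in the expected output are collapsed before comparison, so padded table-like output still matches.

```python
import doctest

def show_pair():
    """With NORMALIZE_WHITESPACE, the extra padding in the expected
    output below still matches the actual single-spaced output.

    >>> print("id", "category")
    id        category
    """

# Run only this docstring's examples with the flag enabled.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(optionflags=doctest.NORMALIZE_WHITESPACE)
for test in finder.find(show_pair):
    runner.run(test)

print(runner.failures, runner.tries)  # no failures out of 1 example
```

Without the flag, the same docstring would fail, since `print("id", "category")` emits a single space between the words.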
HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark][SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
URL: https://github.com/apache/spark/pull/23946#discussion_r263240174

## File path: python/pyspark/sql/window.py

```diff
@@ -97,6 +97,33 @@ def rowsBetween(start, end):
     and ``Window.currentRow`` to specify special boundary values, rather than
     using integral values directly.
+
+    A row-based boundary is based on the position of the row within the partition.
+    An offset indicates the number of rows above or below the current row at which
+    the frame for the current row starts or ends. For instance, given a row-based
+    sliding frame with a lower bound offset of -1 and an upper bound offset of +2,
+    the frame for the row with index 5 would range from index 4 to index 7.
+
+    >>> from pyspark.sql import Window
+    >>> from pyspark.sql import functions as func
+    >>> from pyspark.sql import SQLContext
+    >>> sc = SparkContext.getOrCreate()
+    >>> sqlContext = SQLContext(sc)
+    >>> tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
+    >>> df = sqlContext.createDataFrame(tup, ["id", "category"])
+    >>> window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1)
+    >>> df.withColumn("sum", func.sum("id").over(window)).show()
+    +---+--------+---+
+    | id|category|sum|
+    +---+--------+---+
+    |  1|       b|  3|
+    |  2|       b|  5|
+    |  3|       b|  3|
+    |  1|       a|  2|
+    |  1|       a|  3|
+    |  2|       a|  2|
+    +---+--------+---+
```

Review comment:
You can change the doctest-running code from:

```python
import doctest

SparkContext('local[4]', 'PythonTest')
(failure_count, test_count) = doctest.testmod()
```

to:

```python
import doctest
import pyspark.sql.window

SparkContext('local[4]', 'PythonTest')
globs = pyspark.sql.window.__dict__.copy()
(failure_count, test_count) = doctest.testmod(
    pyspark.sql.window, globs=globs,
    optionflags=doctest.NORMALIZE_WHITESPACE)
```

so that:

1. it doesn't need to add ``
2. when the tests are skipped, it shows the correct fully qualified module names like `pyspark.sql.window...`, rather than `__main__. ...`.
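The fully-qualified-name effect described above can be demonstrated without Spark. This sketch builds a throwaway module (the name `mypkg.window` and function `rowsBetween` are hypothetical stand-ins for `pyspark.sql.window`) and shows how doctest names the tests it collects from a real module object, as opposed to the `__main__.` prefix you get when testing the script itself.

```python
import doctest
import textwrap
import types

# Build a throwaway module so doctest has a real module to attribute tests to.
mod = types.ModuleType("mypkg.window")
exec(textwrap.dedent('''
    def rowsBetween():
        """
        >>> 1 + 1
        2
        """
'''), mod.__dict__)

# When doctest walks a module object (as testmod(pyspark.sql.window) would),
# the collected test names are fully qualified with the module's name.
names = [t.name for t in doctest.DocTestFinder().find(mod)]
print(names)  # the single test is named "mypkg.window.rowsBetween"
```

Calling plain `doctest.testmod()` from a `_test()` function instead defaults to the `__main__` module, which is why skipped-test reports then read `__main__.rowsBetween`.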
HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark][SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
URL: https://github.com/apache/spark/pull/23946#discussion_r262740577

## File path: python/pyspark/sql/window.py

```diff
@@ -97,6 +97,33 @@ def rowsBetween(start, end):
     and ``Window.currentRow`` to specify special boundary values, rather than
     using integral values directly.
+
+    A row-based boundary is based on the position of the row within the partition.
+    An offset indicates the number of rows above or below the current row at which
+    the frame for the current row starts or ends. For instance, given a row-based
+    sliding frame with a lower bound offset of -1 and an upper bound offset of +2,
+    the frame for the row with index 5 would range from index 4 to index 7.
+    """
+    # from pyspark.sql import Window
```

Review comment:
@jagadesh-kiran why is it commented? It should start with `>>>` to make it a proper [doctest](https://docs.python.org/2/library/doctest.html).
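The distinction the reviewer is drawing can be shown with plain stdlib `doctest`: only lines starting with `>>>` inside a docstring are collected as examples, while a `#`-commented line is just inert text. The `example` function below is illustrative, not from the PR.

```python
import doctest

def example():
    """
    # from pyspark.sql import Window   <- a plain comment; doctest ignores it

    >>> 2 * 3
    6
    """

# Count how many doctest examples were actually collected: only the
# ">>> 2 * 3" line qualifies, the commented line contributes nothing.
tests = doctest.DocTestFinder().find(example)
n_examples = sum(len(t.examples) for t in tests)
print(n_examples)  # 1
```

So a commented-out `# from pyspark.sql import Window` line is never executed or checked; rewriting it as `>>> from pyspark.sql import Window` is what turns the snippet into a runnable, verified example.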