[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-10-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21858


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-08-16 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r210681673
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala ---
@@ -80,7 +80,5 @@ case class MonotonicallyIncreasingID() extends LeafExpression with Stateful {
 
   override def prettyName: String = "monotonically_increasing_id"
 
-  override def sql: String = s"$prettyName()"
--- End diff --

It's the default, so there is no need for the override, is there?
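
For context, here is a minimal sketch of why the override would be redundant. It assumes that the default `sql` of an expression renders `prettyName` followed by the SQL of its children in parentheses. The `Expr` and `LeafExpr` traits and both case classes are simplified, hypothetical stand-ins, not Spark's Catalyst classes.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's Catalyst classes.
trait Expr {
  def children: Seq[Expr]
  def prettyName: String

  // Assumed default rendering: the name, then the comma-separated SQL of the children.
  def sql: String = {
    val childrenSql = children.map(_.sql).mkString(", ")
    s"$prettyName($childrenSql)"
  }
}

// A leaf expression has no children, so the default renders as "name()".
trait LeafExpr extends Expr {
  final override def children: Seq[Expr] = Nil
}

// Relies on the default `sql`.
case class MonotonicId() extends LeafExpr {
  override def prettyName: String = "monotonically_increasing_id"
}

// Carries the explicit override that the diff removes.
case class MonotonicIdWithOverride() extends LeafExpr {
  override def prettyName: String = "monotonically_increasing_id"
  override def sql: String = s"$prettyName()"
}
```

Under that assumption both variants produce the same string, which is what the question above is getting at.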





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r206642272
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala ---
@@ -80,7 +80,5 @@ case class MonotonicallyIncreasingID() extends LeafExpression with Stateful {
 
   override def prettyName: String = "monotonically_increasing_id"
 
-  override def sql: String = s"$prettyName()"
--- End diff --

Why this change?





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r206643036
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
   /**
    * A column expression that generates monotonically increasing 64-bit integers.
    *
-   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+   * consecutive (unless all rows are in the same single partition which you rarely want due to
+   * the volume of the data).
    * The current implementation puts the partition ID in the upper 31 bits, and the record number
    * within each partition in the lower 33 bits. The assumption is that the data frame has
    * less than 1 billion partitions, and each partition has less than 8 billion records.
    *
-   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
-   * This expression would return the following IDs:
-   *
    * {{{
-   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+   * // Create a dataset with four partitions, each with two rows.
+   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+   *
+   * // Make sure that every partition has the same number of rows
+   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+   * q.select(monotonically_increasing_id).show
--- End diff --

I personally would simplify the example so that it doesn't focus on the 
particular shift. That behavior ought not to change, but it's not really 
something a caller would ever rely on. I also don't think you need a new 
variable to subtract 1 from the row number, etc. Something that simply shows 
the two properties (consecutive within a partition, not between partitions) is enough.
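
As a rough illustration of those two properties, here is a plain-Scala sketch that needs no Spark session. It assumes the layout described in the quoted scaladoc (partition ID in the upper bits, record number in the lower 33 bits); `IdProperties` and `idsFor` are hypothetical names introduced only for this sketch.

```scala
object IdProperties {
  // Simulates the IDs for partitions of the given sizes, assuming the layout from the
  // quoted scaladoc: (partition ID << 33) + record number within the partition.
  def idsFor(partitionSizes: Seq[Int]): Seq[Long] =
    for {
      (size, partitionId) <- partitionSizes.zipWithIndex
      record <- 0 until size
    } yield (partitionId.toLong << 33) + record
}
```

With that assumption, `IdProperties.idsFor(Seq(3, 3))` gives the values from the removed scaladoc line: 0, 1, 2, 8589934592, 8589934593, 8589934594. The IDs go up by one inside each partition and jump at the partition boundary.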





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r205243526
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
   /**
    * A column expression that generates monotonically increasing 64-bit integers.
    *
-   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+   * consecutive (unless all rows are in the same single partition which you rarely want due to
+   * the volume of the data).
    * The current implementation puts the partition ID in the upper 31 bits, and the record number
    * within each partition in the lower 33 bits. The assumption is that the data frame has
    * less than 1 billion partitions, and each partition has less than 8 billion records.
    *
-   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
-   * This expression would return the following IDs:
-   *
    * {{{
-   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+   * // Create a dataset with four partitions, each with two rows.
+   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+   *
+   * // Make sure that every partition has the same number of rows
+   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+   * q.select(monotonically_increasing_id).show
--- End diff --

IMHO it's enough to add that the IDs are consecutive within each partition but 
not between partitions, and that the partition ID is shifted left by 33 bits. 
Written in words rather than code, it will be much shorter and more concise.
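
For readers who do want to see the shift spelled out, a small sketch of how an ID could be split back into its parts, assuming the 31/33-bit layout from the quoted scaladoc. `IdLayout`, `partitionOf` and `recordOf` are hypothetical helpers for this sketch, not part of Spark's API.

```scala
object IdLayout {
  // Number of low bits assumed to hold the record number (per the quoted scaladoc).
  val RecordBits = 33
  // Mask that keeps only the low 33 bits.
  val RecordMask: Long = (1L << RecordBits) - 1

  // Upper bits: the partition ID.
  def partitionOf(id: Long): Int = (id >>> RecordBits).toInt

  // Lower 33 bits: the record number within the partition.
  def recordOf(id: Long): Long = id & RecordMask
}
```

For example, 8589934593 is `(1L << 33) + 1`, so under this layout it decodes to partition 1, record 1.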





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r205062686
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
   /**
    * A column expression that generates monotonically increasing 64-bit integers.
    *
-   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+   * consecutive (unless all rows are in the same single partition which you rarely want due to
+   * the volume of the data).
    * The current implementation puts the partition ID in the upper 31 bits, and the record number
    * within each partition in the lower 33 bits. The assumption is that the data frame has
    * less than 1 billion partitions, and each partition has less than 8 billion records.
    *
-   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
-   * This expression would return the following IDs:
-   *
    * {{{
-   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+   * // Create a dataset with four partitions, each with two rows.
+   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+   *
+   * // Make sure that every partition has the same number of rows
+   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+   * q.select(monotonically_increasing_id).show
--- End diff --

I know you're exploring the internals, but to be honest I wonder whether 
users are usually interested in such an in-depth explanation. I guess most 
of them wouldn't care about the details.





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-25 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r205058875
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
   /**
    * A column expression that generates monotonically increasing 64-bit integers.
    *
-   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+   * consecutive (unless all rows are in the same single partition which you rarely want due to
+   * the volume of the data).
    * The current implementation puts the partition ID in the upper 31 bits, and the record number
    * within each partition in the lower 33 bits. The assumption is that the data frame has
    * less than 1 billion partitions, and each partition has less than 8 billion records.
    *
-   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
-   * This expression would return the following IDs:
-   *
    * {{{
-   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+   * // Create a dataset with four partitions, each with two rows.
+   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+   *
+   * // Make sure that every partition has the same number of rows
+   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+   * q.select(monotonically_increasing_id).show
--- End diff --

I thought about explaining the "internals" of the operator through a more 
involved example, and actually thought about removing line 1166 (but 
forgot). I think the following lines make for a very in-depth explanation and 
show other operators in use.

In other words, I'm in favour of removing line 1166 and leaving the 
others unchanged. Possible?





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r204959753
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
   /**
    * A column expression that generates monotonically increasing 64-bit integers.
    *
-   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+   * consecutive (unless all rows are in the same single partition which you rarely want due to
+   * the volume of the data).
    * The current implementation puts the partition ID in the upper 31 bits, and the record number
    * within each partition in the lower 33 bits. The assumption is that the data frame has
    * less than 1 billion partitions, and each partition has less than 8 billion records.
    *
-   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
-   * This expression would return the following IDs:
-   *
    * {{{
-   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+   * // Create a dataset with four partitions, each with two rows.
+   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+   *
+   * // Make sure that every partition has the same number of rows
+   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+   * q.select(monotonically_increasing_id).show
--- End diff --

eh @jaceklaskowski, wouldn't this one be enough as an example?





[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-24 Thread jaceklaskowski
GitHub user jaceklaskowski opened a pull request:

https://github.com/apache/spark/pull/21858

[SPARK-24899][SQL][DOC] Add example of monotonically_increasing_id standard function to scaladoc

## What changes were proposed in this pull request?

Adds an example of the `monotonically_increasing_id` standard function 
(including how it works internally) to the scaladoc.

## How was this patch tested?

Local build. Waiting for Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jaceklaskowski/spark SPARK-24899-monotonically_increasing_id

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21858.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21858


commit 29def0069d96ca449204ad27e8c66ca2a218ce84
Author: Jacek Laskowski 
Date:   2018-07-24T09:34:49Z

[SPARK-24899][SQL][DOC] Add example of monotonically_increasing_id standard function to scaladoc



