[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21858

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r210681673

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala ---
    @@ -80,7 +80,5 @@ case class MonotonicallyIncreasingID() extends LeafExpression with Stateful {
       override def prettyName: String = "monotonically_increasing_id"
    -  override def sql: String = s"$prettyName()"
    --- End diff --

    It's the default, so there's no need for the override, is there?
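The point about the override being redundant can be illustrated with a small, self-contained sketch. It does not use Spark: `Expr` and `MonotonicId` are hypothetical stand-ins, and the default `sql` shown here (a function-style rendering built from `prettyName` and the children) is an assumption about how the base class behaves, not a copy of Catalyst's code.

```scala
// Hypothetical stand-in for an expression base type; NOT the real Catalyst class.
trait Expr {
  def children: Seq[Expr]
  def prettyName: String

  // Assumed default: render as name(child1, child2, ...).
  def sql: String = {
    val renderedChildren = children.map(_.sql).mkString(", ")
    s"$prettyName($renderedChildren)"
  }
}

// A leaf expression has no children, so the default already yields "name()".
case class MonotonicId() extends Expr {
  override def children: Seq[Expr] = Nil
  override def prettyName: String = "monotonically_increasing_id"
  // No `sql` override here: the inherited default gives the same string.
}

object SqlDefaultDemo {
  def main(args: Array[String]): Unit = {
    println(MonotonicId().sql)
  }
}
```

Under that assumption, an explicit `override def sql: String = s"$prettyName()"` on a leaf adds nothing, which is the argument made in the comment above.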
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r206642272

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala ---
    @@ -80,7 +80,5 @@ case class MonotonicallyIncreasingID() extends LeafExpression with Stateful {
       override def prettyName: String = "monotonically_increasing_id"
    -  override def sql: String = s"$prettyName()"
    --- End diff --

    Why this change?
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r206643036

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --

    I personally would simplify the example to not focus on the particular shift; yeah that behavior ought not change but it's not really something a caller would ever rely on. And I think you don't need to make a new variable to subtract 1 from row number, etc. Something simply showing the two properties -- increasing within partition, not between partitions -- is enough.
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user TomaszGaweda commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r205243526

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --

    IMHO it's enough to add that IDs are consecutive within each partition, but not between partitions, and that the partition ID is shifted left by 33 bits. Written in words rather than code, it will be much shorter and more concise.
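For readers who do want to see the arithmetic behind "consecutive within a partition, not between partitions", here is a minimal plain-Scala sketch that needs no Spark session. It assumes the layout described in the scaladoc (partition ID in the upper 31 bits, record number in the lower 33 bits); `IdLayout` and `idFor` are hypothetical names used only for illustration, and the layout is a detail of the current implementation rather than something callers should rely on.

```scala
object IdLayout {
  // Assumption taken from the scaladoc: the record number occupies the lower 33 bits.
  val RecordBits: Int = 33

  // Hypothetical helper: combine a partition index and a per-partition record number.
  def idFor(partitionIndex: Int, recordNumber: Long): Long =
    (partitionIndex.toLong << RecordBits) + recordNumber

  def main(args: Array[String]): Unit = {
    // Two partitions with three records each, as in the removed scaladoc example.
    val ids = for {
      partition <- 0 until 2
      record    <- 0 until 3
    } yield idFor(partition, record)

    // prints: 0, 1, 2, 8589934592, 8589934593, 8589934594
    println(ids.mkString(", "))
  }
}
```

Within a partition the values step by one; moving to the next partition jumps by `1L << 33`, which is why the IDs stay increasing and unique but are not consecutive overall.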
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r205062686

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --

    I know you're exploring the internals, but to be honest I was wondering whether users are usually interested in such an in-depth explanation, since I guess most of them wouldn't care about the details.
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r205058875

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --

    I thought about explaining the "internals" of the operator through a more involved example and actually thought about removing line 1166 (but forgot). I think the following lines make for a very in-depth explanation and show other operators in use. In other words, I'm in favour of removing line 1166 and leaving the others unchanged. Possible?
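As a rough illustration of the "internals" being discussed, the sketch below splits a generated ID back into its two parts. It is plain Scala with no Spark dependency; `IdDecode`, `partitionOf` and `recordOf` are hypothetical names, and the 31/33-bit split is assumed from the scaladoc text in the diff, so treat it as a description of the current implementation rather than a stable contract.

```scala
object IdDecode {
  // Assumed layout from the scaladoc: lower 33 bits hold the per-partition record number.
  val RecordBits: Int = 33
  val RecordMask: Long = (1L << RecordBits) - 1

  // Hypothetical helpers for taking an ID apart.
  def partitionOf(id: Long): Int = (id >>> RecordBits).toInt
  def recordOf(id: Long): Long = id & RecordMask

  def main(args: Array[String]): Unit = {
    // IDs one would expect for four partitions with two rows each under this layout.
    val ids = Seq(0L, 1L, 8589934592L, 8589934593L,
                  17179869184L, 17179869185L, 25769803776L, 25769803777L)

    ids.foreach { id =>
      println(s"id=$id partition=${partitionOf(id)} record=${recordOf(id)}")
    }
  }
}
```

Each pair of IDs decodes to the same partition index with record numbers 0 and 1, which matches the four-partition, two-rows-per-partition dataset built in the proposed scaladoc example.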
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r204959753

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --

    eh @jaceklaskowski, wouldn't this one be enough as an example?
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
GitHub user jaceklaskowski opened a pull request:

    https://github.com/apache/spark/pull/21858

    [SPARK-24899][SQL][DOC] Add example of monotonically_increasing_id standard function to scaladoc

    ## What changes were proposed in this pull request?

    Example of `monotonically_increasing_id` standard function (with how it works internally) in scaladoc

    ## How was this patch tested?

    Local build. Waiting for Jenkins

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaceklaskowski/spark SPARK-24899-monotonically_increasing_id

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21858.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21858

commit 29def0069d96ca449204ad27e8c66ca2a218ce84
Author: Jacek Laskowski
Date:   2018-07-24T09:34:49Z

    [SPARK-24899][SQL][DOC] Add example of monotonically_increasing_id standard function to scaladoc