GitHub user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r206643036
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --
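
The 31/33-bit split described in that doc text is easy to sanity-check with plain bit arithmetic. Here is a minimal sketch; the `MonotonicIdLayout` object and `makeId` helper are illustrative names, not a Spark API:

```scala
// Sketch of the documented layout: partition ID in the upper 31 bits,
// per-partition record number in the lower 33 bits.
// (Illustrative helper, not a Spark API.)
object MonotonicIdLayout {
  def makeId(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) | recordNumber

  def main(args: Array[String]): Unit = {
    // Partition 0 yields 0, 1, 2, ...; partition 1 starts at
    // 1L << 33 = 8589934592, matching the removed example above.
    assert(makeId(0, 0L) == 0L)
    assert(makeId(0, 2L) == 2L)
    assert(makeId(1, 0L) == 8589934592L)
    assert(makeId(1, 2L) == 8589934594L)
  }
}
```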
    
    I personally would simplify the example to not focus on the particular shift; yes, that behavior ought not to change, but it's not something a caller should ever rely on. And I don't think you need to make a new variable to subtract 1 from the row number, etc. Something that simply shows the two properties -- IDs increase within a partition, but are not consecutive across partitions -- is enough.
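
For what it's worth, a minimal sketch of the kind of simplified example this suggests, pairing the ID with `spark_partition_id` so both properties are visible without relying on the exact shift (the `local[4]` session setup and column aliases are mine, not from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{monotonically_increasing_id, spark_partition_id}

object MonotonicIdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("demo").getOrCreate()

    // 8 rows spread over 4 partitions: IDs increase within each partition
    // but jump (are not consecutive) from one partition to the next.
    spark.range(0, 8, 1, 4)
      .select(spark_partition_id().as("partition"), monotonically_increasing_id().as("id"))
      .show()

    spark.stop()
  }
}
```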

