Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21858#discussion_r205058875
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1150,16 +1150,48 @@ object functions {
       /**
        * A column expression that generates monotonically increasing 64-bit integers.
        *
    -   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
    +   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
    +   * consecutive (unless all rows are in the same single partition which you rarely want due to
    +   * the volume of the data).
        * The current implementation puts the partition ID in the upper 31 bits, and the record number
        * within each partition in the lower 33 bits. The assumption is that the data frame has
        * less than 1 billion partitions, and each partition has less than 8 billion records.
        *
    -   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
    -   * This expression would return the following IDs:
    -   *
        * {{{
    -   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    +   * // Create a dataset with four partitions, each with two rows.
    +   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
    +   *
    +   * // Make sure that every partition has the same number of rows
    +   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
    +   * q.select(monotonically_increasing_id).show
    --- End diff --
    
    I thought about explaining the "internals" of the operator through a more involved example, and I actually meant to remove line 1166 (but forgot). I think the following lines make for a very in-depth explanation and show other operators in use.
    
    In other words, I'm in favour of removing line 1166 and leaving the others unchanged. Possible?
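
    For reference, here is a minimal sketch (an editor's illustration, not Spark's actual source) of the bit layout the ScalaDoc describes: the partition ID goes in the upper 31 bits and the per-partition record number in the lower 33 bits. The `makeId` helper is hypothetical, named here only for illustration:

    ```scala
    // Sketch of the documented ID layout: partition ID in the upper 31 bits,
    // record number within the partition in the lower 33 bits.
    def makeId(partitionId: Int, recordNumber: Long): Long =
      (partitionId.toLong << 33) | recordNumber

    // For the four-partition, two-rows-per-partition example above, the IDs are:
    // partition 0: makeId(0, 0) == 0L,           makeId(0, 1) == 1L
    // partition 1: makeId(1, 0) == 8589934592L,  makeId(1, 1) == 8589934593L   (1L << 33)
    // partition 2: makeId(2, 0) == 17179869184L, makeId(2, 1) == 17179869185L  (2L << 33)
    // partition 3: makeId(3, 0) == 25769803776L, makeId(3, 1) == 25769803777L  (3L << 33)
    ```

    This also makes it clear why the IDs are monotonically increasing and unique but not consecutive: there is a gap of `(1L << 33) - rowsInPartition` between the last ID of one partition and the first ID of the next.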

