Hi there, I am trying to generate a unique ID for each record in a DataFrame so that I can save the DataFrame to a relational database table. My question is: when the DataFrame is regenerated due to an executor failure or eviction from cache, do the IDs stay the same as before?
According to the documentation:

> The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition.

I assume the partition ID stays the same after regeneration, but what about the record number within each partition? My code looks like this:

```scala
import org.apache.spark.sql.functions._

val df1 = (1 to 1000).toDF.repartition(4)
val df2 = df1.withColumn("id", monotonically_increasing_id).cache
df2.show
df2.show
```

I executed it several times and it seems to generate the same ID for each specific record, but I am not sure that proves it will generate the same ID in every scenario.

BTW, I am aware of the shortcoming of monotonically_increasing_id in Spark 1.6, explained in https://issues.apache.org/jira/browse/SPARK-14241, which was fixed in 2.0.

Lan
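To make the documented bit layout concrete, here is a small plain-Scala sketch of how such an ID would be assembled from a partition ID and a per-partition record number. The helper `makeId` is purely illustrative (it is not a Spark API), but it follows the split the documentation describes: partition ID in the upper 31 bits, record number in the lower 33 bits.

```scala
// Illustrative reconstruction of the documented bit layout of
// monotonically_increasing_id: upper 31 bits = partition ID,
// lower 33 bits = record number within the partition.
// makeId is a hypothetical helper, not part of Spark's API.
def makeId(partitionId: Int, recordNumber: Long): Long =
  (partitionId.toLong << 33) | recordNumber

// First record of partition 0 gets ID 0.
assert(makeId(0, 0L) == 0L)

// First record of partition 1 starts at 2^33 = 8589934592,
// which is why the IDs are unique but not consecutive.
assert(makeId(1, 0L) == 8589934592L)

// Third record of partition 2.
assert(makeId(2, 2L) == (2L << 33) + 2L)

println("bit-layout sketch holds")
```

This also shows why ID stability under regeneration hinges on two things: the record keeping the same partition ID, and the same position within its partition.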