Hi there,

I am trying to generate a unique ID for each record in a DataFrame so that I
can save the DataFrame to a relational database table. My question is: when
the DataFrame is regenerated due to an executor failure or being evicted
from the cache, do the IDs stay the same as before?

According to the documentation:

*The generated ID is guaranteed to be monotonically increasing and unique,
but not consecutive. The current implementation puts the partition ID in
the upper 31 bits, and the lower 33 bits represent the record number within
each partition.*

I assume the partition ID stays the same after regeneration, but what about
the record number within each partition?
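To make the quoted bit layout concrete, here is a small sketch in plain Scala (no Spark needed; `IdLayout` and `decode` are names I made up for illustration) that splits a generated ID back into its two fields:

```scala
// Sketch of the documented layout of monotonically_increasing_id:
// upper 31 bits = partition ID, lower 33 bits = record number
// within the partition. IDs are unique but not consecutive.
object IdLayout {
  def decode(id: Long): (Long, Long) = {
    val partitionId = id >>> 33             // upper 31 bits
    val recordNum   = id & ((1L << 33) - 1) // lower 33 bits
    (partitionId, recordNum)
  }

  def main(args: Array[String]): Unit = {
    // The first record of partition 1 would get ID 2^33 = 8589934592.
    val (p, r) = decode(8589934592L)
    println(s"partition=$p, record=$r")
  }
}
```

So the question comes down to whether, after recomputation, each partition assigns the same record numbers in the same order.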

My code is like below:

import org.apache.spark.sql.functions._
val df1 = (1 to 1000).toDF.repartition(4)
val df2 = df1.withColumn("id", monotonically_increasing_id).cache
df2.show
df2.show

I executed this several times and it seems to generate the same ID for each
specific record, but I am not sure that proves it will do so in every
scenario.

BTW, I am aware of the shortcoming of monotonically_increasing_id in Spark
1.6, explained in https://issues.apache.org/jira/browse/SPARK-14241, which
was fixed in 2.0.

Lan
