I want to maintain the order of the rows in a DataFrame in PySpark. Is
there any way to achieve this for the function below, where the row ID
gives a sequential number to each row? Currently, the function
rearranges the rows of the DataFrame.

from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

def createRowIdColumn(df, new_column, position, start_value):
    row_count = df.count()
    # One-column DataFrame holding the sequential IDs.
    row_ids = spark.range(int(start_value), int(start_value) + row_count, 1).toDF(new_column)
    # Window over a constant, so there is no real sort key.
    window = Window.orderBy(lit(1))
    df_row_ids = row_ids.withColumn("row_num", row_number().over(window) - 1)
    df_with_row_num = df.withColumn("row_num", row_number().over(window) - 1)

    if position == "Last Column":
        result = df_with_row_num.join(df_row_ids, on="row_num").drop("row_num")
    else:
        result = df_row_ids.join(df_with_row_num, on="row_num").drop("row_num")

    return result.orderBy(new_column)

Please let me know if there is a way to achieve this requirement.
