Hi all,

I am using monotonically_increasing_id() in PySpark to remove one field from a JSON column in a Delta table. Please refer to the code below:

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Read the table, then re-parse the JSON column into its own DataFrame.
df = spark.sql(f"SELECT * FROM {database}.{table}")
df1 = spark.read.json(df.rdd.map(lambda x: x.data), multiLine=True)

# Drop the unwanted key and serialize back to a JSON string.
df1 = df1.drop('fw_createdts')
df1 = df1.selectExpr('to_json(struct(*)) as data')

# Attach row indices to both DataFrames and join on them.
df = df.withColumn(
    'row_index',
    row_number().over(Window.orderBy(monotonically_increasing_id()))
).withColumnRenamed('data', 'data1')
df1 = df1.withColumn(
    'row_index',
    row_number().over(Window.orderBy(monotonically_increasing_id()))
)
df = df.join(df1, on=['row_index']).drop('row_index', 'data1')
df.createOrReplaceTempView('tempdf')

Business Requirement:
Remove one key/value pair from the JSON field stored in the Delta table.

Steps done:
1. Read the data.
2. Read the JSON column alone into a separate DataFrame and parse it using spark.read.json.
3. Drop the unwanted column from that DataFrame.
4. Convert it back into a JSON field.
5. Join the two DataFrames using monotonically_increasing_id(), since there is no unique column shared between the JSON column and the other primary columns to map on.
(Also, some of the fields inside the JSON field also exist at the top level, so I am not able to expand the JSON field within the same DataFrame.)

*Issue Faced:*
*1. For a small database/data volume it works as expected.*
*2. For a big database/data volume it does not work as expected: rows get mapped to different records in the same table.*

*When I referred to the documentation, I saw that the IDs are not guaranteed to be consecutive. Is there any limit?*
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html#pyspark-sql-functions-monotonically-increasing-id
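For reference, my understanding from that page is that the ID packs the partition ID into the upper 31 bits and a per-partition record counter into the lower 33 bits, which is why values jump between partitions. A pure-Python sketch of the layout:

```python
# Sketch of how monotonically_increasing_id() builds its values
# (upper 31 bits: partition ID; lower 33 bits: per-partition record index).
def mid(partition_id: int, record_index: int) -> int:
    return (partition_id << 33) | record_index

# Partition 0 counts 0, 1, 2, ...; partition 1 starts at 8589934592,
# so IDs are unique and increasing but not consecutive across partitions.
first_of_partition_1 = mid(1, 0)  # -> 8589934592
```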

*Could you explain whether there are any constraints on it?*
*How can we achieve this requirement using any alternate method?*


Thanks in advance🙂
