[ https://issues.apache.org/jira/browse/SPARK-43951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728736#comment-17728736 ]

Adam Binford commented on SPARK-43951:
--------------------------------------

Of course, as soon as I finished figuring all this out, I found
https://github.com/apache/spark/pull/41089

> RocksDB state store can become corrupt on task retries
> ------------------------------------------------------
>
>                 Key: SPARK-43951
>                 URL: https://issues.apache.org/jira/browse/SPARK-43951
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Adam Binford
>            Priority: Major
>
> A couple of our streaming jobs have failed since upgrading to Spark 3.4 with 
> an error such as:
> org.rocksdb.RocksDBException: Mismatch in unique ID on table file ###.
> Expected: \{###,###} Actual: \{###,###} in file ..../MANIFEST-####
> This is due to the change from
> [https://github.com/facebook/rocksdb/commit/6de7081cf37169989e289a4801187097f0c50fae]
> that enabled unique ID checks by default. I finally tracked down the exact
> sequence of steps that leads to this failure in the way the RocksDB state
> store is used:
>  # A task fails after uploading the checkpoint to HDFS. Let's say it
> successfully uploaded 11.zip for version 11 of the table, but the task failed
> before it could finish.
>  # The same task is retried and goes back to load version 10 of the table as 
> expected.
>  # Cleanup/maintenance is called for this partition, which looks in HDFS for
> persisted versions and sees up through version 11, since that zip file was
> successfully uploaded by the previous task attempt.
>  # As part of resolving what SST files are part of each table version, 
> versionToRocksDBFiles.put(version, newResolvedFiles) is called for version 11 
> with its SST files that were uploaded in the first failed task.
>  # The second attempt at the task commits and goes to sync its checkpoint to 
> HDFS.
>  # versionToRocksDBFiles contains the SST files to upload from step 4, and
> these files are considered "the same" as what's in the local working dir
> because the name and file size match (see the sketch after this list).
>  # No SST files are uploaded because they matched above, but in reality the
> unique ID inside the new SST files is different (presumably it is just
> randomly generated and embedded in each SST file?); it just doesn't affect
> the size.
>  # A new METADATA file is uploaded, which lists the new unique IDs.
>  # When version 11 of the table is read during the next batch, the unique IDs
> in the METADATA file don't match the unique IDs in the SST files, which
> causes the exception.
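>
> As an illustration of steps 6-7, here is a minimal Scala sketch of that
> name-plus-size heuristic. It is not the actual RocksDBFileManager code; the
> names are only meant to mirror the idea of the check:
> {code:scala}
> // Hypothetical sketch, not Spark's real implementation: a local SST file is
> // treated as "already uploaded" when some previously tracked file has the
> // same name and size. The rewritten SST from the retried task passes this
> // check even though its internal unique ID differs, so it is never
> // re-uploaded.
> case class TrackedSstFile(localFileName: String, dfsFileName: String, sizeBytes: Long)
>
> def isAlreadyUploaded(local: java.io.File, tracked: Seq[TrackedSstFile]): Boolean =
>   tracked.exists(f => f.localFileName == local.getName && f.sizeBytes == local.length())
> {code}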
>  
> This is basically a ticking time bomb for anyone using RocksDB. Some thoughts
> on possible fixes:
>  * Disable unique ID verification. I don't currently see a binding for this 
> in the RocksDB java wrapper, so that would probably have to be added first.
>  * Disable checking if files are already uploaded with the same size, and 
> just always upload SST files no matter what.
>  * Update the "same file" check to also be able to do some kind of CRC 
> comparison or something like that.
>  * Update the maintenance/cleanup to not update the versionToRocksDBFiles map.
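>
> For the CRC idea above, a rough sketch (the names and fields are hypothetical,
> not an existing Spark or RocksDB API): record a checksum when a file is
> uploaded and require it to match before skipping the upload, rather than
> trusting name + size alone.
> {code:scala}
> import java.io.{BufferedInputStream, File, FileInputStream}
> import java.util.zip.{CRC32, CheckedInputStream}
>
> // Hypothetical tracked-file record extended with a checksum captured at
> // upload time.
> case class UploadedSstFile(localFileName: String, sizeBytes: Long, crc32: Long)
>
> // Streaming CRC32 so large SST files are not read into memory at once.
> def crc32Of(file: File): Long = {
>   val in = new CheckedInputStream(
>     new BufferedInputStream(new FileInputStream(file)), new CRC32())
>   try {
>     val buf = new Array[Byte](8192)
>     while (in.read(buf) != -1) {}
>     in.getChecksum.getValue
>   } finally in.close()
> }
>
> // Skip the upload only if name, size, and contents all match.
> def canSkipUpload(local: File, tracked: UploadedSstFile): Boolean =
>   local.getName == tracked.localFileName &&
>     local.length() == tracked.sizeBytes &&
>     crc32Of(local) == tracked.crc32
> {code}
> The tradeoff is that computing a checksum reads every SST file once at commit
> time, but that is still cheaper than unconditionally re-uploading every file.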


