[I] [Bug] [Spark-MultiTableManager] Non-deterministic schema generation in multi-table Spark pipelines causes silent data corruption [seatunnel]

via GitHub Thu, 18 Sep 2025 05:27:09 -0700


dexty007 opened a new issue, #9877:
URL: https://github.com/apache/seatunnel/issues/9877


   ### Search before asking
   
   - [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.
   
   
   ### What happened
   
   ### Critical Bug: 
   Non-deterministic schema generation in multi-table Spark pipelines causes 
silent data corruption
   
   ### Root Cause:
   MultiTableManager.mergeSchema() uses table iteration order to assign global 
column indices, making schema generation non-deterministic when the same tables 
are processed in different orders.
   
   ### Algorithm Issue:
   `MultiTableManager.java:148-164`, `mergeSchema()` method:
   ```java
   // Abridged excerpt: j (a per-column index) is defined in code not shown here.
   for (int i = 0; i < catalogTables.length; i++) {
       // Processing order determines global field positions
       if (!indexQueue.hasNext()) {
           indexSize++;  // Sequential assignment based on iteration order
           fieldNames.add(editColumnName(indexSize));
           fieldTypes.add(seaTunnelDataTypes[j]);
       }
   }
   ```
   
   ### Failure Scenario:
   1. Execution 1: [TableA(INT,STRING), TableB(STRING,LONG)] → schema: [INT, 
STRING, LONG]
   2. Execution 2: [TableB(STRING,LONG), TableA(INT,STRING)] → schema: [STRING, 
LONG, INT]
   3. Result: Same data encoded with different schemas → silent data corruption
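
The scenario above can be reproduced with a small standalone model. This is only a sketch of the order dependence, not SeaTunnel's actual code: the class `SchemaOrderDemo`, the `merge` method, and the use of plain type names as strings are all hypothetical. It assumes each table reuses an existing global slot of the same type and appends a new slot otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of an order-dependent schema merge. */
final class SchemaOrderDemo {

    /**
     * Merges per-table column types into one global type list.
     * A column reuses an existing global slot of the same type that its table
     * has not claimed yet; otherwise a new slot is appended at the end.
     */
    static List<String> merge(List<List<String>> tables) {
        List<String> global = new ArrayList<>();
        for (List<String> table : tables) {
            // Tracks which global slots this table has already claimed.
            boolean[] used = new boolean[global.size() + table.size()];
            for (String type : table) {
                int slot = -1;
                for (int i = 0; i < global.size(); i++) {
                    if (!used[i] && global.get(i).equals(type)) {
                        slot = i;
                        break;
                    }
                }
                if (slot < 0) {
                    // No free slot of this type: position depends on iteration order.
                    global.add(type);
                    slot = global.size() - 1;
                }
                used[slot] = true;
            }
        }
        return global;
    }

    public static void main(String[] args) {
        List<String> tableA = Arrays.asList("INT", "STRING");
        List<String> tableB = Arrays.asList("STRING", "LONG");
        System.out.println(merge(Arrays.asList(tableA, tableB))); // [INT, STRING, LONG]
        System.out.println(merge(Arrays.asList(tableB, tableA))); // [STRING, LONG, INT]
    }
}
```

The same two tables yield two different global layouts, so a row encoded under one layout and decoded under the other lands in the wrong columns.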
   
   ### Impact:
   • **Silent data corruption**: Wrong data types in wrong positions
   • **Non-deterministic behavior**: Same pipeline produces different results
   • **Production risk**: Financial/business data corruption without error 
indication
   • **Debugging difficulty**: Appears as mysterious data inconsistencies
   
   ### Evidence:
   Current tests only cover identical schemas where order changes are invisible 
(MultiTableManagerTest.java:105-106), masking this critical bug for 
heterogeneous schemas.
   
   ### Expected Behavior:
   Same set of tables should always produce identical merged schema regardless 
of processing order.
   
   ### Suggested Fix:
   Make schema generation order-independent:
   ```java
   // Sort tables by deterministic identifier before processing
   Arrays.sort(catalogTables, Comparator.comparing(t -> t.getTablePath().toString()));
   ```
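
One design point to consider: `Arrays.sort` reorders the array it is given, which may matter if callers rely on the original order. A minimal sketch that sorts a copy instead is shown here. The `Table` class and `sortedByPath` helper are hypothetical stand-ins for illustration, not SeaTunnel types.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DeterministicOrder {

    /** Hypothetical stand-in for a catalog table; only the path matters here. */
    static final class Table {
        final String path;

        Table(String path) {
            this.path = path;
        }
    }

    /** Returns a copy sorted by path, leaving the caller's array untouched. */
    static Table[] sortedByPath(Table[] tables) {
        Table[] copy = Arrays.copyOf(tables, tables.length);
        Arrays.sort(copy, Comparator.comparing((Table t) -> t.path));
        return copy;
    }
}
```

With this approach, `sortedByPath` returns the tables in the same sequence for any input order, so a merge that iterates over the result would see a stable order.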
   
   Priority: P0 - Silent data corruption in core multi-table functionality
   
   
   ### SeaTunnel Version
   
   2.3.x (affects all versions with multi-table support)
   
   ### SeaTunnel Config
   
   ```conf
   env {
     execution.parallelism = 1
   }
   source {
     FakeSource {
       tables_configs = [
         {
           schema = {
             table = "users"
             fields {
               id = int
               name = string
             }
           }
         },
         {
           schema = {
             table = "orders" 
             fields {
               description = string
               amount = bigint
             }
           }
         }
       ]
     }
   }
   transform {
     FieldMapper {
       field_mapper = {
         id = user_id
       }
     }
   }
   sink {
     Console {}
   }
   ```
   
   ### Running Command
   
   ```shell
   ./bin/seatunnel.sh --config config/multi-table-heterogeneous.conf --engine spark
   ```
   
   ### Error Exception
   
   ```log
   No explicit exception - silent data corruption occurs.
   Data appears in wrong columns due to schema mismatch between encoding and 
decoding stages.
   ```
   
   ### Zeta or Flink or Spark Version
   
   Spark 3.x (affects all Spark versions)
   
   ### Java or Scala Version
   
   _No response_
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
