[I] Nexmark Source Behaves Incorrectly Under Multi-Parallelism [gluten]

via GitHub Mon, 15 Jun 2026 01:54:23 -0700


ggjh-159 opened a new issue, #12298:
URL: https://github.com/apache/gluten/issues/12298


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   Before the fix, when `parallelism.default > 1`, the Nexmark source shows 
two kinds of incorrect behavior:
   
   ### Symptom 1: Duplicate Event Generation
   
   `GlutenSourceFunction` used the single `ConnectorSplit` passed in directly, 
with no per-subtask splitting. Every subtask therefore ran with the same split 
config, with these consequences:
   1. Events were duplicated: each subtask generated all events, so the total 
was parallelism × totalEvents.
   2. `maxEvents` was not divided by the parallelism; every subtask generated 
the full set of events.
   3. Behavior diverged from native Flink nexmark's `GeneratorConfig.split()`.
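
The missing per-subtask split can be illustrated with a minimal sketch. It is not the Gluten fix and not the real `GeneratorConfig.split()` code. `SplitConfig` and `split_events` are hypothetical names, and the policy shown (contiguous ranges, remainder given to the last subtask) is an assumption about what a correct split should look like.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SplitConfig:
    """Hypothetical per-subtask slice of the generator's work."""
    first_event_number: int  # where this subtask starts in the global sequence
    max_events: int          # how many events this subtask should emit


def split_events(max_events: int, parallelism: int) -> List[SplitConfig]:
    """Divide max_events into `parallelism` disjoint, contiguous ranges."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    base = max_events // parallelism
    splits = []
    start = 0
    for i in range(parallelism):
        # The last subtask absorbs the remainder so no event is lost to rounding.
        is_last = i == parallelism - 1
        count = max_events - base * (parallelism - 1) if is_last else base
        splits.append(SplitConfig(start, count))
        start += count
    return splits


splits = split_events(10_000, 2)
print(splits)
print(sum(s.max_events for s in splits))  # total stays at 10000, not 20000
```

With such a split, the per-subtask counts sum to `maxEvents`, and no two subtasks start from the same event number.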
   
   ### Symptom 2: All Timestamps Identical
   
   The `dateTime` field of every bid/person/auction event had the same value. 
The `tps` setting had no effect: event generation was not throttled.
   
   ### Reproduction
   
   Set `parallelism.default: 2` and run any nexmark query (e.g. q0) with 
`events.num: 10000, tps: 2000`. Pre-fix: input data shows ~18400 bid rows 
(duplicated) all stamped with the same `dateTime`. Post-fix: 9200 unique bids, 
timestamps span ~5 seconds.
   
   ### Expected vs Actual Output (q0, parallelism = 2)
   
   #### Bid row count (input data fed into the query)
   
   | | Value |
   |---|---|
   | **Expected** | 9200 (= 10000 × bid ratio 46/50, no duplicates) |
   | **Actual (pre-fix)** | ~18400 (each of 2 subtasks independently produces 
9200 bids) |
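
The row counts in this table follow from simple arithmetic, reproduced here as a small check. The constants come from the reproduction settings in this report; the variable names are only illustrative.

```python
from fractions import Fraction

TOTAL_EVENTS = 10_000          # events.num from the reproduction
BID_RATIO = Fraction(46, 50)   # bid ratio quoted in the table
PARALLELISM = 2                # parallelism.default from the reproduction

# Exact rational arithmetic avoids any floating-point rounding.
expected_bids = int(TOTAL_EVENTS * BID_RATIO)
# Pre-fix: every subtask independently produces the full bid set.
prefix_bids = PARALLELISM * expected_bids

print(expected_bids, prefix_bids)  # 9200 18400
```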
   
   #### Timestamp span on `dateTime` field
   
   | | Value |
   |---|---|
   | **Expected** | ~5 seconds (10000 events / TPS 2000), ~4600 distinct ms 
values |
   | **Actual (pre-fix)** | collapses to a single timestamp (all subtasks share 
the same starting event number + same random state) |
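
The expected figures can be reproduced with a rough model. The sketch assumes that an event's timestamp offset is its event number divided by the TPS, and that each block of 50 consecutive event numbers holds 1 person, then 3 auctions, then 46 bids. Both are assumptions about the generator layout, not something this report states. The pre-fix case is modeled crudely as the event number never advancing.

```python
TOTAL_EVENTS = 10_000
TPS = 2_000
PROPORTION_DENOMINATOR = 50  # assumed: 1 person + 3 auctions + 46 bids
NON_BID_SLOTS = 4            # assumed: the first 4 slots of each block are not bids


def timestamp_ms(event_number: int, tps: int = TPS) -> int:
    """Millisecond offset from the base time for a given event number."""
    return event_number * 1000 // tps


def is_bid(event_number: int) -> bool:
    return event_number % PROPORTION_DENOMINATOR >= NON_BID_SLOTS


bid_numbers = [n for n in range(TOTAL_EVENTS) if is_bid(n)]

# Throttled: each bid's timestamp is derived from its own event number.
throttled = {timestamp_ms(n) for n in bid_numbers}
# Collapsed: every bid is stamped as if it were event number 0.
collapsed = {timestamp_ms(0) for _ in bid_numbers}

span_seconds = TOTAL_EVENTS / TPS
print(len(bid_numbers), len(throttled), len(collapsed), span_seconds)
```

Under these assumptions the model yields 9200 bids, 4600 distinct millisecond values, and a 5-second span, while the collapsed case yields exactly one distinct timestamp.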
   
   #### Sample expected output row (post-fix `query_output.txt`)
   
   ```
   +I[1000, 2003, 1304, 2026-06-12T10:56:51.404, zcpqyjL]XO^MIHWKWWZaI...]
   +I[1010, 2001, 6464, 2026-06-12T10:56:51.405, \OKKVSWVa_RUdbbnje`...]
   +I[1052, 2001, 9509, 2026-06-12T10:56:51.406, ...]
   ```
   
   Note that `dateTime` advances from row to row, by 1 ms between the sampled 
rows above (TPS = 2000 corresponds to ~2 events per ms), reflecting proper TPS 
throttling.
   
   #### Sample actual output (pre-fix, all timestamps identical)
   
   ```
   +I[1000, 2003, 1304, 2026-06-12T10:22:42.370, ...]
   +I[1010, 2001, 6464, 2026-06-12T10:22:42.370, ...]
   +I[1052, 2001, 9509, 2026-06-12T10:22:42.370, ...]
   ...all 18400 rows share the same dateTime value...
   +I[9948, 2003, 8742, 2026-06-12T10:22:42.370, ...]
   ```
   
   Every row is stamped with the **identical** `dateTime` value — TPS 
throttling is completely bypassed. This is the most visible signal of the bug: 
regardless of how many events are generated, the entire dataset collapses to a 
single instant.
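
A quick way to check an output file for this symptom is to count distinct `dateTime` values. The sketch below assumes rows look like the samples in this report (`+I[auction, bidder, price, dateTime, extra]`). `parse_row` and `distinct_timestamps` are hypothetical helpers, not part of Gluten or nexmark.

```python
def parse_row(line: str):
    """Return (auction, bidder, price, date_time) from a '+I[...]' row, or None."""
    line = line.strip()
    if not line.startswith("+I["):
        return None
    # Split off only the first four fields; the trailing free-text field may
    # itself contain commas or brackets.
    parts = line[len("+I["):].split(", ", 4)
    if len(parts) < 4:
        return None
    auction, bidder, price, date_time = parts[:4]
    return int(auction), int(bidder), int(price), date_time


def distinct_timestamps(lines):
    """Return (row_count, distinct_dateTime_count) for the parsable rows."""
    rows = [r for r in map(parse_row, lines) if r is not None]
    return len(rows), len({r[3] for r in rows})


post_fix = [
    "+I[1000, 2003, 1304, 2026-06-12T10:56:51.404, ...]",
    "+I[1010, 2001, 6464, 2026-06-12T10:56:51.405, ...]",
    "+I[1052, 2001, 9509, 2026-06-12T10:56:51.406, ...]",
]
pre_fix = [
    "+I[1000, 2003, 1304, 2026-06-12T10:22:42.370, ...]",
    "+I[1010, 2001, 6464, 2026-06-12T10:22:42.370, ...]",
    "...all 18400 rows share the same dateTime value...",
    "+I[1052, 2001, 9509, 2026-06-12T10:22:42.370, ...]",
]

print(distinct_timestamps(post_fix))  # (3, 3)
print(distinct_timestamps(pre_fix))   # (3, 1)
```

A distinct count of 1 over many rows matches the pre-fix behavior described above. To check a real run, pass the lines of the output file instead of the inline samples.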
   
   ### Gluten version
   
   main branch
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

