ggjh-159 opened a new issue, #12298: URL: https://github.com/apache/gluten/issues/12298
### Backend

VL (Velox)

### Bug description

Under the original code, when `parallelism.default > 1`, the Nexmark source exhibited two categories of abnormal behavior:

### Symptom 1: Duplicate Event Generation

`GlutenSourceFunction` used the single `ConnectorSplit` passed in directly, with no per-subtask splitting. Every subtask used the same split config, causing:

1. Duplicate event generation (each subtask generated all events; total = parallelism × totalEvents).
2. `maxEvents` was not split by parallelism: every subtask generated the full set of events.
3. Behavior diverged from native Flink nexmark's `GeneratorConfig.split()`.

### Symptom 2: All Timestamps Identical

The `dateTime` field of every bid/person/auction event shared the same value. The TPS config did not take effect and events were not throttled by TPS.

### Reproduction

Set `parallelism.default: 2` and run any nexmark query (e.g. q0) with `events.num: 10000, tps: 2000`.

Pre-fix: the input data shows ~18400 bid rows (duplicated), all stamped with the same `dateTime`.

Post-fix: 9200 unique bids, with timestamps spanning ~5 seconds.

### Expected vs Actual Output (q0, parallelism = 2)

#### Bid row count (input data fed into the query)

| | Value |
|---|---|
| **Expected** | 9200 (= 10000 × bid ratio 46/50, no duplicates) |
| **Actual (pre-fix)** | ~18400 (each of 2 subtasks independently produces 9200 bids) |

#### Timestamp span on `dateTime` field

| | Value |
|---|---|
| **Expected** | ~5 seconds (10000 events / TPS 2000), ~4600 distinct ms values |
| **Actual (pre-fix)** | collapses to a single timestamp (all subtasks share the same starting event number + same random state) |

#### Sample expected output row (post-fix `query_output.txt`)

```
+I[1000, 2003, 1304, 2026-06-12T10:56:51.404, zcpqyjL]XO^MIHWKWWZaI...]
+I[1010, 2001, 6464, 2026-06-12T10:56:51.405, \OKKVSWVa_RUdbbnje`...]
+I[1052, 2001, 9509, 2026-06-12T10:56:51.406, ...]
```

Note that the `dateTime` field advances from row to row (by 1 ms between the sampled rows shown here; TPS = 2000 corresponds to ~2 events per ms), reflecting proper TPS throttling.

#### Sample actual output (pre-fix, all timestamps identical)

```
+I[1000, 2003, 1304, 2026-06-12T10:22:42.370, ...]
+I[1010, 2001, 6464, 2026-06-12T10:22:42.370, ...]
+I[1052, 2001, 9509, 2026-06-12T10:22:42.370, ...]
...all 18400 rows share the same dateTime value...
+I[9948, 2003, 8742, 2026-06-12T10:22:42.370, ...]
```

Every row is stamped with the **identical** `dateTime` value, so TPS throttling is completely bypassed. This is the most visible signal of the bug: regardless of how many events are generated, the entire dataset collapses to a single instant.

### Gluten version

main branch

### Spark version

None

### Spark configurations

_No response_

### System information

_No response_

### Relevant logs

```bash
```
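To make the expected numbers above easier to check, here is a minimal Python sketch of the arithmetic behind a per-subtask split and TPS-based timestamps. It is not the Gluten fix and not the code of nexmark's `GeneratorConfig.split()`; the names `SubtaskRange`, `split_events`, and `event_timestamp_ms` are hypothetical, and the contiguous-range split and the `event_number * 1000 / tps` timestamp formula are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubtaskRange:
    """Slice of the global event sequence owned by one subtask (hypothetical type)."""
    first_event_number: int
    max_events: int


def split_events(total_events: int, parallelism: int) -> List[SubtaskRange]:
    """Divide total_events into contiguous, non-overlapping ranges, one per subtask."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    base, remainder = divmod(total_events, parallelism)
    ranges = []
    start = 0
    for subtask in range(parallelism):
        # The first `remainder` subtasks take one extra event each.
        count = base + (1 if subtask < remainder else 0)
        ranges.append(SubtaskRange(start, count))
        start += count
    return ranges


def event_timestamp_ms(base_time_ms: int, event_number: int, tps: int) -> int:
    """Assumed formula: event time derived from the global event number and TPS."""
    return base_time_ms + (event_number * 1000) // tps


# Values from the reproduction: events.num = 10000, tps = 2000, parallelism = 2.
ranges = split_events(10_000, 2)
total = sum(r.max_events for r in ranges)
span_ms = event_timestamp_ms(0, 9_999, 2_000) - event_timestamp_ms(0, 0, 2_000)
expected_bids = 10_000 * 46 // 50
```

Under these assumptions the two subtasks own event numbers 0 to 4999 and 5000 to 9999, so the total stays at 10000 events instead of doubling, `expected_bids` comes out to 9200, and `span_ms` is 4999, which matches the ~5 second span in the table. If every subtask instead started from event number 0 with the full `max_events`, the row count would scale with parallelism, as in Symptom 1.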
