> 1. The default behavior changes if no ON CONFLICT is defined. I am a
> little concerned that this may cause errors in a large number of
> existing cases.

I can be convinced to leave the default behaviour as it is now. I am
worried, though, that the current behaviour of SUM is very rarely what
people actually want. As mentioned in the FLIP, I wholeheartedly believe
there are very few, if any, real-world scenarios where you need the
deduplicate behaviour. I elaborate a bit more in 2).
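For illustration, a rough sketch of the two modes (the DDL below is
purely illustrative; the exact syntax is whatever the FLIP settles on):

-- The proposed default, equivalent to not specifying ON CONFLICT at all:
CREATE TABLE sink (
  name string PRIMARY KEY,
  value string
) ON CONFLICT ERROR

-- Opting back into today's SUM behaviour:
CREATE TABLE sink (
  name string PRIMARY KEY,
  value string
) ON CONFLICT DEDUPLICATE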
> 2. Regarding On Conflict Errors, in the context of CDC streams, it is
> expected that the vast majority of cases cannot generate only one record
> with one primary key. The only solutions I can think of are append-only
> top1, deduplication, or aggregating the first row.

I disagree with that statement. I don't think CDC streams change anything
in that regard. Maybe there is some misunderstanding about what "one
record" means in this context. I agree that almost certainly there will be
a sequence of UB/UA records for a single sink primary key. My claim is
that users almost never want a situation where they have more than one
"active" upsert key/record for one sink primary key. I tried to explain
that in the FLIP, but let me give one more example here. Imagine two
tables:

CREATE TABLE source (
  id bigint PRIMARY KEY,
  name string,
  value string
)

CREATE TABLE sink (
  name string PRIMARY KEY,
  value string
)

INSERT INTO sink SELECT name, value FROM source;

=== Input
(1, "Apple", "ABC")
(2, "Apple", "DEF")

In the scenario above a SUM is inserted into the plan, which will
deduplicate the rows and override the value for "Apple" with "DEF". In my
opinion this is entirely wrong; instead, an exception should be thrown,
because there is actually a constraint violation.

I am more than happy to be proved wrong. If you have a real-world scenario
where the deduplication logic is actually correct and expected, please do
share it. So far I have not seen one, nor was I able to come up with one.
Still, I am not suggesting to remove the deduplication logic entirely;
users can still opt into it with ON CONFLICT DEDUPLICATE.
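Spelling out the trace for the input above (illustrative only; the sink
key is "name"):

+I (Apple, ABC)  -- first active record for key "Apple"
+I (Apple, DEF)  -- second active record for the same key: today's SUM
                 -- silently overrides ABC with DEF, whereas ON CONFLICT
                 -- ERROR would raise a constraint violation here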
> 3. The special watermark generation interval affects the visibility of
> results. How can users configure this generation interval?

That's a fair question which I'll try to elaborate on in the FLIP. I can
see two options:
1. We piggyback on existing watermarks in the query; if there are no
watermarks (the tables don't have a watermark definition), we fail during
planning.
2. We add a new configuration option for a specialized, generalized
watermark.
Let me think on that some more and I'll come back with a more concrete
proposal.

> 4. I believe that resolving out-of-order issues and addressing internal
> consistency are two separate problems. As I understand the current
> solution, it does not really resolve the internal consistency issue. We
> could first resolve the out-of-order problem. For most scenarios that
> require real-time response, we can directly output intermediate results
> promptly.

Why doesn't it solve it? It does. Given a pair of UB/UA we won't emit the
temporary state after processing the UB (see the trace at the end of this
message).

> 5. How can we compact data with the same custom watermark? If detailed
> comparisons are necessary, I think we still need to preserve all key
> data; we would just be compressing this data further at time t.

Yes, we need to preserve all key data, but only between two watermarks.
Assuming frequent watermarks, that's only for a very short time.

> 6. If neither this proposed solution nor the rejected solution can
> resolve internal consistency, we need to reconsider the differences
> between the two approaches.

I'll copy from the FLIP the explanation of why the rejected alternative
should be rejected: the solution can help us solve the changelog disorder
problem, but it does not help with the *internal consistency* issue. If we
want to fix that as well, we still need the compaction on watermarks. At
the same time, it increases the size of all flowing records. Therefore it
was rejected in favour of simply compacting all records once on the
progression of watermarks.
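To illustrate 4) and 5), here is a purely illustrative trace for a single
sink key, where W(t) denotes a watermark:

W(10)
-U (Apple, ABC)  -- UB: buffered per key, nothing emitted downstream
+U (Apple, DEF)  -- UA: buffered as well
W(20)            -- compaction point: the buffered UB/UA pair collapses
                 -- into a single +U (Apple, DEF); the temporary state
                 -- after the UB is never emitted, and per-key data only
                 -- has to be kept between two consecutive watermarks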
Best,
Dawid