HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows URL: https://github.com/apache/spark/pull/23634#issuecomment-464907821 @tdas Now I'm also thinking about how to add matched flag to KeyWithIndexToValueStateStore, but fail to think how it is compatible with existing state, specifically, regarding optional matched flag. Does it mean you're suggesting not to version state format for streaming join? Situation would be pretty different which we choose. 1) Keep state format as 1 I'm not sure this is safe to guarantee accessibility of `matched` flag, given that state will be mixed up with existing rows and new rows which some don't have `matched` field (numFields also doesn't count `matched` field) and some have `matched` field. Even if this can be possible via specific trick (knowledge of internal), I'd like to address this without breaking any interface/expectation cause it could be a blocker when we want to change InternalRow. (Like allowing different schema of actual rows - this could occur crash on runtime.) Please shed a light (sketched idea) on how to do it if your would want to guide me here. 2) Have two state formats I think refactoring is still necessary if we version state format (we could flatten some interface and implementations into one though), cause we will end up having two StreamingJoinStateManagers which most of things are duplicated but implementation of KeyWithIndexToValueStore will be a bit different. If we don't add interface but only add boolean flag to classify state formats we would end up having code regarding state format being mixed up with logic, complicated and hard to debug. In overall I totally understand the importance about reducing the code diff, but I'd hope we'd not too concerned about large code diff if we find its worth on introducing more codes. (I meant I'd hope the amount of code change would not be the first thing to consider.) To evaluate the worth we might be ideal to put aside of length of code diff when deciding general direction.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org