More followup, and +dev as Guozhang replied to me directly previously. I am currently porting the code over to trunk. One of the major changes since 1.0 is the usage of GraphNodes. I have a question about this:
For a foreignKey joiner, should it have its own dedicated node type? Or would it be advisable to construct it from existing GraphNode components? For instance, I believe I could construct it from several OptimizableRepartitionNode, some SinkNode, some SourceNode, and several StatefulProcessorNode. That being said, there is some underlying complexity to each approach.

I will be switching the KIP-213 to use the RecordHeaders in Kafka Streams instead of the PropagationWrapper, but conceptually it should be the same.

Again, any feedback is welcomed...

On Mon, Jul 30, 2018 at 9:38 AM, Adam Bellemare <adam.bellem...@gmail.com> wrote:

> Hi Guozhang et al
>
> I was just reading the 2.0 release notes and noticed a section on Record
> Headers.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-244%3A+Add+Record+Header+support+to+Kafka+Streams+Processor+API
>
> I am not yet sure if the contents of a RecordHeader is propagated all the
> way through the Sinks and Sources, but if it is, and if it remains attached
> to the record (including null records) I may be able to ditch the
> propagationWrapper for an implementation using RecordHeader. I am not yet
> sure if this is doable, so if anyone understands RecordHeader impl better
> than I, I would be happy to hear from you.
>
> In the meantime, let me know of any questions. I believe this PR has a lot
> of potential to solve problems for other people, as I have encountered a
> number of other companies in the wild all home-brewing their own solutions
> to come up with a method of handling relational data in streams.
>
> Adam
>
> On Fri, Jul 27, 2018 at 1:45 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Hello Adam,
>>
>> Thanks for rebooting the discussion of this KIP ! Let me finish my pass
>> on the wiki and get back to you soon. Sorry for the delays..
>>
>> Guozhang
>>
>> On Tue, Jul 24, 2018 at 6:08 AM, Adam Bellemare <adam.bellem...@gmail.com> wrote:
>>
>>> Let me kick this off with a few starting points that I would like to
>>> generate some discussion on.
>>>
>>> 1) It seems to me that I will need to repartition the data twice - once
>>> on the foreign key, and once back to the primary key. Is there anything
>>> I am missing here?
>>>
>>> 2) I believe I will also need to materialize 3 state stores: the
>>> prefixScan SS, the highwater mark SS (for out-of-order resolution) and
>>> the final state store, due to the workflow I have laid out. I have not
>>> thought of a better way yet, but would appreciate any input on this
>>> matter. I have gone back through the mailing list for the previous
>>> discussions on this KIP, and I did not see anything relating to
>>> resolving out-of-order compute. I cannot see a way around the current
>>> three-SS structure that I have.
>>>
>>> 3) Caching is disabled on the prefixScan SS, as I do not know how to
>>> resolve the iterator obtained from rocksDB with that of the cache. In
>>> addition, I must ensure everything is flushed before scanning. Since the
>>> materialized prefixScan SS is under "control" of the function, I do not
>>> anticipate this to be a problem. Performance throughput will need to be
>>> tested, but as Jan observed in his initial overview of this issue, it is
>>> generally a surge of output events that affects performance more so than
>>> the flush or prefixScan itself.
>>>
>>> Thoughts on any of these are greatly appreciated, since these elements
>>> are really the cornerstone of the whole design. I can put up the code I
>>> have written against 1.0.2 if we so desire, but first I was hoping to
>>> just tackle some of the fundamental design proposals.
>>>
>>> Thanks,
>>> Adam
>>>
>>> On Mon, Jul 23, 2018 at 10:05 AM, Adam Bellemare <adam.bellem...@gmail.com> wrote:
>>>
>>> > Here is the new discussion thread for KIP-213. I picked back up on the
>>> > KIP as this is something that we too at Flipp are now running in
>>> > production. Jan started this last year, and I know that Trivago is also
>>> > using something similar in production, at least in terms of APIs and
>>> > functionality.
>>> >
>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable
>>> >
>>> > I do have an implementation of the code for Kafka 1.0.2 (our local
>>> > production version) but I won't post it yet as I would like to focus on
>>> > the workflow and design first. That being said, I also need to add some
>>> > clearer integration tests (I did a lot of testing using a non-Kafka
>>> > Streams framework) and clean up the code a bit more before putting it
>>> > in a PR against trunk (I can do so later this week likely).
>>> >
>>> > Please take a look,
>>> >
>>> > Thanks
>>> >
>>> > Adam Bellemare
>>
>> --
>> -- Guozhang
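
For readers following the thread, here is a rough sketch of two of the three state stores mentioned in point 2 of the Jul 24 message: the prefixScan store and the highwater mark store. It is plain Java with in-memory maps, not Kafka Streams code and not the KIP-213 implementation. The class name ForeignKeyJoinSketch, the "fk|pk" composite key layout, the '|' separator, and the use of a per-key offset as the highwater mark are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-ins for two of the state stores discussed
// in this thread. Nothing here is Kafka Streams API.
class ForeignKeyJoinSketch {
    // Assumed separator; it must not occur inside a foreign key.
    static final char SEP = '|';

    // Stand-in for the prefixScan store: sorted by composite key "fk|pk".
    final TreeMap<String, String> byCompositeKey = new TreeMap<>();
    // Stand-in for the highwater mark store: primary key -> highest offset seen.
    final Map<String, Long> highwater = new HashMap<>();

    void put(String foreignKey, String primaryKey, String value) {
        byCompositeKey.put(foreignKey + SEP + primaryKey, value);
    }

    // Returns, in key order, the values of all rows whose key starts with "fk|".
    List<String> prefixScan(String foreignKey) {
        String from = foreignKey + SEP;
        // Smallest string that sorts after every key with this prefix.
        String to = foreignKey + (char) (SEP + 1);
        return new ArrayList<>(byCompositeKey.subMap(from, true, to, false).values());
    }

    // Accepts a result unless a higher offset was already seen for this
    // primary key; equal offsets are accepted (an assumption).
    boolean acceptIfNewest(String primaryKey, long offset) {
        Long seen = highwater.get(primaryKey);
        if (seen != null && offset < seen) {
            return false; // stale, out-of-order result: drop it
        }
        highwater.put(primaryKey, offset);
        return true;
    }
}
```

Because the sorted map keeps all rows for one foreign key adjacent, a single range read finds every row to re-join when the other side changes. The highwater check then discards a joined result that arrives after a newer one for the same primary key.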