All, Joey brought this up over the weekend, and I think a discussion of the topic is overdue.
Streams components were meant to be compatible with other runtime frameworks all along, and for the most part they are implemented in a manner compatible with distributed execution, where coordination, message passing, and lifecycle are handled outside of the streams libraries. By community standards, any component or component configuration object that doesn't serialize cleanly for relocation in a distributed framework is a bug. When the streams project got started in 2012, Storm was the only TLP real-time data processing framework at Apache, but now there are plenty of good choices, all of which are faster and better tested than our streams-runtime-local module.

So, what should the role of streams-runtime-local be? Should we keep it at all? The tests take forever to run, and my organization has stopped using it entirely. The best argument for keeping it is that it is useful for integration testing small pipelines, but perhaps we could just agree to use something else for that purpose?

Do we want to keep the other runtime modules around and continue adding more? I've found that when embedding streams components in other frameworks (Spark and Flink most recently), I end up creating a handful of classes to help bind streams interfaces and instances within the UDFs / functions / transforms / whatever that framework's atomic unit of computation is, and reusing them across all my pipelines (rough sketch below).

How about the StreamBuilder interface? Does anyone still believe we should support (and still want to work on) classes implementing StreamBuilder to build and run a pipeline composed solely of streams components on other frameworks? Personally, I prefer to write code against the framework's APIs at the pipeline level and embed individual streams components at the step level.

Any other thoughts on the topic?

Steve -

What should the focus be? If you look at the code, the project really provides 3 things: (1) a stream processing engine and integration with data persistence mechanisms, (2) a reference implementation of ActivityStreams, AS schemas, and tools for interlinking activity objects and events, and (3) a uniform API for integrating with social network APIs.

I don't think that first thing is needed anymore. Just looking at Apache projects, NiFi, Apex + Apex Malhar, and to some extent Flume are further along here. StreamSets covers some of this too, and arguably Logstash also gets used for this sort of work.

I.e., I think the project would be much stronger if it focused on (2) and (3) and on marrying those up to other Apache projects that fit (1). Minimally, it needs to be disentangled a bit.
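
For concreteness, here is the sketch referenced in the first message above: one plausible shape for a binding class, assuming Flink as the host framework. Only the streams interfaces (StreamsProcessor, StreamsDatum) and the Flink types are real; StreamsProcessorFunction and its constructor arguments are illustrative names, and the configuration object is assumed to be Serializable.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
import org.apache.streams.core.StreamsDatum;
import org.apache.streams.core.StreamsProcessor;

// Binds a StreamsProcessor to Flink's atomic unit of computation.
// The processor is instantiated in open() on each task rather than
// serialized with the job graph, so only the processor class and its
// configuration object need to relocate cleanly.
public class StreamsProcessorFunction
    extends RichFlatMapFunction<StreamsDatum, StreamsDatum> {

  private final Class<? extends StreamsProcessor> processorClass;
  private final Object configuration;  // assumed Serializable

  private transient StreamsProcessor processor;

  public StreamsProcessorFunction(
      Class<? extends StreamsProcessor> processorClass,
      Object configuration) {
    this.processorClass = processorClass;
    this.configuration = configuration;
  }

  @Override
  public void open(Configuration parameters) throws Exception {
    // Lifecycle is owned by the host framework, not by a streams runtime.
    processor = processorClass.newInstance();
    processor.prepare(configuration);
  }

  @Override
  public void flatMap(StreamsDatum datum, Collector<StreamsDatum> out) {
    for (StreamsDatum result : processor.process(datum)) {
      out.collect(result);
    }
  }

  @Override
  public void close() {
    if (processor != null) {
      processor.cleanUp();
    }
  }
}

Usage is then ordinary Flink at the pipeline level, e.g. stream.flatMap(new StreamsProcessorFunction(MyProcessor.class, config)), where MyProcessor is any streams processor. An analogous wrapper covers Spark, which is why a handful of shims like this feels more maintainable than a full StreamBuilder implementation per framework.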