All,

Joey brought this up over the weekend and I think a discussion is overdue on 
the topic.  

Streams components were meant to be compatible with other runtime frameworks 
all along, and for the most part are implemented in a manner compatible with 
distributed execution, where coordination, message passing, and lifecycle are 
handled outside of the streams libraries.  By community standards, any component 
or component configuration object that isn't cleanly serializable for relocation 
in a distributed framework is a bug.
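
For concreteness, "cleanly serializable" means surviving the round-trip a 
distributed framework performs when it ships a task to a worker JVM.  A minimal 
sketch of that check (a hypothetical helper, not something in our codebase):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public class SerializationCheck {

      // Round-trips a component or config through Java serialization,
      // the way a framework would when relocating it to another JVM.
      // Throws NotSerializableException if the object holds
      // non-transient, non-serializable state -- i.e., the kind of
      // bug described above.
      @SuppressWarnings("unchecked")
      public static <T> T roundTrip(T component)
          throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
          out.writeObject(component);
        }
        try (ObjectInputStream in = new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray()))) {
          return (T) in.readObject();
        }
      }
    }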

When the streams project got started in 2012, Storm was the only TLP real-time 
data processing framework at Apache, but now there are plenty of good choices, 
all of which are faster and better tested than our streams-runtime-local module.

So, what should be the role of streams-runtime-local?  Should we keep it at 
all?  The tests take forever to run, and my organization has stopped using it 
entirely.  The best argument for keeping it is that it is useful for 
integration-testing small pipelines, but perhaps we could just agree to use 
something else for that purpose?

Do we want to keep the other runtime modules around and continue adding more?  
I've found that when embedding streams components in other frameworks (Spark 
and Flink most recently), I end up creating a handful of classes to help bind 
streams interfaces and instances within the UDFs / functions / transforms / 
whatever that framework's atomic unit of computation is, and reusing them in 
all my pipelines.
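
To make that concrete, here's roughly the shape of one such binding class for 
Flink (a sketch; it assumes the StreamsProcessor interface with 
prepare/process/cleanUp, and the class name is illustrative):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;
    import org.apache.streams.core.StreamsDatum;
    import org.apache.streams.core.StreamsProcessor;

    // Illustrative adapter: lets any StreamsProcessor run as a Flink
    // flatMap step.  The wrapped processor must be serializable so
    // Flink can ship it to task managers (see above).
    public class StreamsProcessorFlatMap
        extends RichFlatMapFunction<StreamsDatum, StreamsDatum> {

      private final StreamsProcessor processor;

      public StreamsProcessorFlatMap(StreamsProcessor processor) {
        this.processor = processor;
      }

      @Override
      public void open(Configuration parameters) {
        // lifecycle handled by the host framework, per the above
        processor.prepare(null);
      }

      @Override
      public void flatMap(StreamsDatum datum, Collector<StreamsDatum> out) {
        for (StreamsDatum result : processor.process(datum)) {
          out.collect(result);
        }
      }

      @Override
      public void close() {
        processor.cleanUp();
      }
    }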

How about the StreamBuilder interface?  Does anyone still believe we should 
support (and still want to work on) classes implementing StreamBuilder to build 
and run a pipeline composed solely of streams components on other frameworks?  
Personally I prefer to write code using the framework APIs at the pipeline 
level, and embed individual streams components at the step level.
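
In other words, I'd rather write something like the following (a sketch; 
MySource, MyProcessor, and MySink are placeholders, and it reuses the 
hypothetical adapter above) than maintain a StreamBuilder implementation per 
framework:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExamplePipeline {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new MySource())                               // pipeline wired with framework APIs
           .flatMap(new StreamsProcessorFlatMap(new MyProcessor())) // streams component at the step level
           .addSink(new MySink());
        env.execute("streams-on-flink-example");
      }
    }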

Any other thoughts on the topic?

Steve

- What should the focus be? If you look at the code, the project really 
provides 3 things: (1) a stream processing engine and integration with data 
persistence mechanisms, (2) a reference implementation of ActivityStreams, AS 
schemas, and tools for interlinking activity objects and events, and (3) a 
uniform API for integrating with social network APIs. I don't think that first 
thing is needed anymore. Just looking at Apache projects, NiFi, Apex + Apex 
Malhar, and to some extent Flume are further along here. StreamSets covers 
some of this too, and arguably Logstash also gets used for this sort of work. 
I.e., I think the project would be much stronger if it focused on (2) and (3) 
and marrying those up to other Apache projects that fit (1). Minimally, it 
needs to be disentangled a bit.
