Shu: I want to echo Arvind's appreciation for your feedback. It's super helpful to get this kind of input, especially early on. I'm going to answer these points here and answer some of your original questions in a separate email.

On Tue, Dec 6, 2011 at 3:43 PM, Shu Zhang <[email protected]> wrote:

> Thanks Arvind, very helpful explanation.
>
> I think I should clarify what I was saying about the channel. I'm
> completely with you in that, conceptually, every channel needs both put()
> and take(), what I'm suggesting is that channel doesn't necessarily need to
> map to a single interface. That is, if channel interface was split into a
> sender side and a receiver side interface, one sender and one receiver
> would make up a 'channel'.
>

We definitely could separate them. I'm pretty sure this is what Spring Integration does; there are three entities: a MessageSender, a MessageReceiver, and a Channel that implements the union (forgive me, that's from memory). This would enable yet another level of indirection and allow for RPC-like channels as you've mentioned. It would also split the send and receive code into two separately testable chunks. All of that said, I think that all channels would always implement both (see elsewhere in my response for rationale).

> I think it's fine for most channel implementations to implement both the
> sender and receiver interface and I also think it's fine for the SAME
> instance to be passed into corresponding source/sink. But I think there are
> also cases where it makes sense to separate the implementations into
> different classes. Consider a channel representing an rpc connection.
> Conceptually, it's a channel right? It can connect a source and a sink,
> things can be 'put' on it and things can be 'taken' on the other side. But
> you'd probably want to implement the sender and receiver part of the
> channel in different classes, maybe even different sub-projects due to
> dependency needs.
>

You definitely could build a channel that does RPC or uses a traditional message queue. The issue is that the transactional semantics of the channel may be extremely difficult to preserve in those kinds of cases.
The intention is that a source -> channel -> sink always lives within a single JVM. If you wish to make a network hop, the pattern is to do what Flume OG does:

  source -> channel (equivalent) -> sink ---// network //---> source -> channel -> sink

This means that the channel's only two functions are durability and queuing to permit variable production / consumption rates. Any interfacing with other systems is still (as before) only ever accomplished via the source / sink endpoints. By adhering to this convention we can make certain assumptions about deployment and execution that drastically simplify other elements of the system.

> My second point was that, I think generally, it doesn't make sense for a
> component to invoke both put() and take() on the same channel and it's
> always a little safer not to put methods that should not be invoked
> together on the same interface.
>

I thought about this a bit and one thing that has traditionally been hard in Flume OG is to build (reliable) non-1:1 decorators. By enforcing that a channel must have both put() and take(), we can now write inline processing like this:

  public class OneToManyExploder implements PollableSink {
    public Status process() {
      Event e = channel.take();
      String[] tags = e.getHeaders().get("tags").split(", ");

      // Put one copy of the event per tag back onto the same channel.
      for (String tag : tags) {
        Event eTagged = new SimpleEvent();
        eTagged.setBody(e.getBody());
        eTagged.setHeaders(e.getHeaders());
        eTagged.addHeader("type", tag);
        channel.put(eTagged);
      }

      return Status.READY;  // assumed status value; not in the original sketch
    }
  }

In other words, the channel is an event store that can be manipulated in any way while processing an event. Things like fancier prioritization and QOS systems come to mind. These things are still possible if the interfaces are decoupled, but they involve extra legwork.

Let me know what you think. Thanks again!

> Anyways not a big deal in my mind, just some clarification for you guys'
> consideration.
>
> Cheers,
> Shu
>
> ________________________________________
> From: [email protected] [[email protected]]
> Sent: Tuesday, December 06, 2011 11:02 AM
> To: [email protected]
> Cc: Basier Aziz; Robert Mahfoud; Robert Ragno
> Subject: Re: early flume-ng feedback
>
> Hi Shu,
>
> Thanks for taking the time to review NG and for your excellent feedback. I
> wish to respond to the high level questions you raised regarding
> Source/Channel/Sink, and would leave the other questions for other folks to
> jump in on.
>
> > In NG, it seems like Sources map to upstream components of an
> > OG Sink and Sinks map to downstream components of an OG
> > Source. A channel maps to the combination of OG Sources and
> > Sinks. That is, I see the high level modeling as:
> > Sources - A component which puts events on a channel.
> > (Implementation defines where those events come from).
> > Sinks - A component which polls events from a channel and
> > applies some processing on them. (Implementation defines
> > processing)
> > Channel - Transport between sources and sinks (Implementation
> > defines durability, transport mechanism, etc.)
> >
> > Please let me know if I have the high level picture correct.
>
> This is very near the real picture, but a few subtle details need to be
> pointed out: The OG implementation used a channel driver which pushed
> events from source to sink within a node. On the other hand, in the NG
> implementation:
> * There is no Node concept
> * There is no Channel Driver
> * Every source and sink has its own thread of execution
> * A channel decouples the source(s) from sink(s).
> * Together, a set of sources, channels and sinks that operate within the
>   same JVM is called an Agent.
>
> Think of a Flume Agent as a broker in a traditional messaging infrastructure
> which receives messages (events) and then forwards them on their way to the
> intended destination.
> The characteristics of the sender of these messages may
> be very different from the intended receivers and hence the Channel acts as
> a buffer that decouples the two threads of execution that interact with
> these two entities.
>
> > For the most part, I'm a big fan of the high level modeling in
> > NG. One thing I want to bring up is the fact that channel has
> > both put() and take() on it. I'm not seeing the case where the
> > same component would want to both take() an event for
> > processing and also put() an event on to the same channel,
> > since that component has a good chance being the one that
> > ends up take()ing the event back. Because of that, I think it
> > could be a good idea to separate channel into 2 interfaces.
> > I can see channel-like implementations, for which it's more
> > difficult to implement both put() and take() and I don't see
> > the need for both to be in every implementation (though
> > most will be, and that's ok). I guess what I'm thinking is
> > along the lines of
>
> Hopefully my earlier explanation answers this question on why the channel
> must have both put() and take() methods on it. If you imagine a channel
> that has put() but no take(), its functionality overlaps with a terminal
> sink that only consumes messages from the channel it polls.
>
> Thanks,
> Arvind
>
> On Mon, Dec 5, 2011 at 5:15 PM, Shu Zhang <[email protected]> wrote:
>
> > Ooops, nvm about the lack of definition, I missed the main flume ng wiki
> > page :), still... some javadocs on the Source/Sink interfaces would be
> > nice. Hopefully the rest of my comments are still valid.
> >
> > Cheers
> > Shu
> > ________________________________________
> > From: Shu Zhang [[email protected]]
> > Sent: Monday, December 05, 2011 4:30 PM
> > To: [email protected]
> > Cc: Basier Aziz; Robert Mahfoud; Robert Ragno
> > Subject: early flume-ng feedback
> >
> > Hi all, I just took a first look at the alpha2 of flume-ng and thought
> > I'd provide a little early high level feedback. First of all, thanks for
> > doing this re-architecture, we (medio systems) have had problems with
> > the OG and at first glance NG looks very promising and we're very
> > excited to try it out.
> >
> > Ok feedback. Personally I would find it very helpful if
> > Source/Sink/Channel were strongly defined. On the wiki, I see the line:
> >
> > "You still have sources and sinks and they still do the same thing. They
> > are now connected by channels."
> >
> > I'm not finding that to be true. In OG, events are appended to Sinks via
> > append(Event) and polled from Sources via next(). That is, some upstream
> > component is responsible for getting events somehow and appending them
> > to a Sink; some downstream component is responsible for getting events
> > from a Source and processing them. For example, Driver is a special
> > case, acting as the downstream component of a Source and an upstream
> > component of a Sink; its processing is simply appending events to the
> > downstream Sink.
> >
> > In NG, it seems like Sources map to upstream components of an OG Sink and
> > Sinks map to downstream components of an OG Source. A channel maps to the
> > combination of OG Sources and Sinks. That is, I see the high level
> > modeling as:
> > Sources - A component which puts events on a channel. (Implementation
> > defines where those events come from).
> > Sinks - A component which polls events from a channel and applies some
> > processing on them.
> > (Implementation defines processing)
> > Channel - Transport between sources and sinks (Implementation defines
> > durability, transport mechanism, etc.)
> >
> > Please let me know if I have the high level picture correct.
> >
> > It seems to me, most Source/Sink/Channel do follow the above definition.
> > But the avro stuff seems to be a major divergence. I'm a little confused
> > about AvroSource; it doesn't seem to do much. It appears to be just a
> > vanilla Source that lets you manually pass in events, which it'll then
> > pass straight through to a channel. I'm guessing that's a work in
> > progress? Avro transport seems to me best modeled as an AvroChannel,
> > which can link a source and sink which in turn defines where the events
> > come from and what to do with them on the other side; that is, modeling
> > avro transport as a channel seems like it might provide more flexibility
> > for less configuration. Also, modeling avro transport as a channel seems
> > like it would make it easier to configure for different reliability
> > levels through composition with other channels. The way AvroSink is
> > written can work, but I see advantages in modeling avro transport as a
> > channel instead, any thoughts?
> >
> > For the most part, I'm a big fan of the high level modeling in NG. One
> > thing I want to bring up is the fact that channel has both put() and
> > take() on it. I'm not seeing the case where the same component would
> > want to both take() an event for processing and also put() an event on
> > to the same channel, since that component has a good chance being the
> > one that ends up take()ing the event back. Because of that, I think it
> > could be a good idea to separate channel into 2 interfaces. I can see
> > channel-like implementations, for which it's more difficult to implement
> > both put() and take() and I don't see the need for both to be in every
> > implementation (though most will be, and that's ok).
> > I guess what I'm thinking is along the lines of
> >   interface ChannelPoller { Event take(); }
> >   interface ChannelSender { void put(Event e); }
> >   public class FileChannel implements ChannelPoller, ChannelSender {...}
> >   public class SomeSource { ChannelSender _channel; ... }
> >   public class SomeSink { ChannelPoller _channel; ... }
> >
> > One final thing is, if I have the right idea on the high level modeling
> > then it seems like a method like process() should be defined on the Sink
> > interface and a method like List<Event> getNext() should be defined on
> > the Source interface, thoughts? What might a sink do if it doesn't have
> > a process() defined?
> >
> > Anyways thanks again for doing this work, I think it's very positive.
> > I'll be talking to people internally about helping out, I think it could
> > be good for all involved. I apologize if I've misunderstood anything or
> > made any wrong assumptions. When we get around to testing it out, I'll
> > get back to you guys on lower level issues.
> >
> > Cheers,
> > Shu
>

--
Eric Sammer
twitter: esammer
data: www.cloudera.com
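
For reference, the two positions in this thread are compatible, and a rough sketch in plain Java may make that easier to see. It assumes nothing about the actual Flume NG code: Event, ChannelSender, ChannelPoller, QueueChannel and TagExploder are all hypothetical names invented for illustration, and transactions and durability are left out entirely. The channel is split into a sending side and a receiving side, one class implements both, and the same instance is handed to a component that both takes and puts.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an event; not the Flume NG Event type.
final class Event {
    final Map<String, String> headers;
    final String body;

    Event(Map<String, String> headers, String body) {
        this.headers = new HashMap<>(headers); // defensive copy
        this.body = body;
    }
}

// The two halves of a channel, split as suggested in the thread.
interface ChannelSender { void put(Event e); }
interface ChannelPoller { Event take(); } // returns null when empty

// One in-memory implementation of both halves (assumed for the sketch).
final class QueueChannel implements ChannelSender, ChannelPoller {
    private final Deque<Event> queue = new ArrayDeque<>();

    public synchronized void put(Event e) { queue.addLast(e); }
    public synchronized Event take() { return queue.pollFirst(); }
    public synchronized int size() { return queue.size(); }
}

// A one-to-many step: takes one event and puts one copy per tag.
final class TagExploder {
    private final ChannelPoller in;
    private final ChannelSender out;

    TagExploder(ChannelPoller in, ChannelSender out) {
        this.in = in;
        this.out = out;
    }

    // Returns the number of events put on the outgoing side.
    int process() {
        Event e = in.take();
        if (e == null) {
            return 0; // nothing to do
        }
        String tags = e.headers.get("tags");
        if (tags == null) {
            out.put(e); // untagged events are forwarded unchanged
            return 1;
        }
        int count = 0;
        for (String tag : tags.split(", ")) {
            Map<String, String> headers = new HashMap<>(e.headers);
            headers.remove("tags"); // so copies are not exploded again
            headers.put("type", tag);
            out.put(new Event(headers, e.body));
            count++;
        }
        return count;
    }
}
```

Passing the same QueueChannel as both arguments, as in new TagExploder(ch, ch), gives the inline behaviour described above, while a source or sink that only needs one side can be typed against ChannelSender or ChannelPoller alone.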
