[ https://issues.apache.org/jira/browse/SAMZA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981736#comment-13981736 ]
Chris Riccomini commented on SAMZA-138: --------------------------------------- The SystemAdmin class's goal is really to give the SamzaContainer a way to get what it needs from the underlying system. Right now the two things that the SamzaContainer needs are: # The ability to get the next offset, given the last read offset for a certain SSP. This is provided by getOffsetsAfter. # The the offset of the first message, last message, and upcoming message for a certain SSP. For (1), it is safe to make the method a no-op, and just return what was passed in. This will result in the last message processed before a failure being re-processed when the job restarts, but Samza only provides "at-least-once" messaging right now, so this is fine. For (2), this API is a bit problematic. The "oldest" offset will always be 0. The "newest" offset will always be size(file). But what about the "newest" offset? For files, this is dependent on the file format. This implementation uses newlines as the delimiter for messages, which is fine. This means that the "newest" offset will be the offset of the second-to-last newline (\n) in the file. {noformat} msg1\n <-- oldest offset msg2\n <-- newest offset msg3\n <-- upcoming offset not-yet-complete-msg-that-has-no-newline-after-it-yet {noformat} bq. It is 100 by default. I plan not to change it. Will SAMZA-220 affect this? Yea, it should probably be at least 10,000, since that's what the maxMessagesPerPartition setting is right now. Anything lower will be sub-optimal. bq. I wrap all the exceptions to SamzaException. Whenever there is a SamzaException, will the Samza job be killed? Yes. > System that places specified file contents onto stream > ------------------------------------------------------ > > Key: SAMZA-138 > URL: https://issues.apache.org/jira/browse/SAMZA-138 > Project: Samza > Issue Type: New Feature > Affects Versions: 0.7.0 > Environment: RHELinux 2.6.18-371.4.1.el5 > Reporter: Jonathan Poltak Samosir > Priority: Minor > Labels: feature, newbie, patch > Attachments: FileReaderConsumer.java, FileReaderSystemFactory.java > > > A fairly straightforward Samza System that reads from a specified file, and > places that file's contents onto a SystemStreamPartition for use as input for > a StreamTask. > Roughly based off how the hello-samza example project's WikipediaSystem works > (more the SystemConsumerFactory rather than SystemConsumer class). > Probably needs a bit of work, but basic functionality works as intended. > Hopefully useful to some, either as a functioning system or as a base for a > more robust and functionally-promising system that you wish to implement. > Some suggested improvements (not yet implemented): > * handle reading from multiple files ([suggested alternative input > specification|https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201401.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BA7465%40ESV4-MBX01.linkedin.biz%3E]- > point 2) > * use of filepos for IncomingMessageEnvelope offset ([more info > here|https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201401.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BA749D%40ESV4-MBX01.linkedin.biz%3E] > * come up with a reasonable bounded queue threshold (the value of 100 was > arbitrary, as I was unsure of a reasonable value here) > * better handling for the exceptions encountered (I wasn't 100% sure about > some of them) -- This message was sent by Atlassian JIRA (v6.2#6252)