[
https://issues.apache.org/jira/browse/SAMZA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986817#comment-13986817
]
Chris Riccomini commented on SAMZA-138:
---------------------------------------
bq. java 7: use WatchService to monitor the file changes
Hmm, the WatchService is pretty appealing. This would break Java 6 support,
though. That's a pretty big commitment that probably should be brought to the
mailing list to discuss. Personally, I don't think the WatchService is
compelling enough to warrant ditching Java 6 support, so I'd prefer the "old
fashioned" approach. If you feel differently, we should involve more folks to
get more input.
bq. Does Scala has other approaches?
I don't know of any Scala class that will tail out of the box. There are
probably idiomatic ways to do this, but I'm not too worried about that.
bq. This reminds me of SAMZA-200 . MySQL log is just a simple log file. We may
be able to solve that ticket by modifying this reader a little.
I'd rather keep the two separate. Parsing MySQL bin-logs is going to be very
labor-intensive. The format is complex, and properly dealing with bin-logs can
be tricky. I'm thinking of this as more useful for tailing plain-text logs (not
binary logs). I agree we could expand the scope of this reader to handle more
than just new-line files, but I'd rather keep that as a separate JIRA.
bq. Still not sure how I can get newest offset and upcoming offset from a file.
Yea, sorry I'm not able to be more clear about this. JIRAs seem somewhat lossy
as a communication mechanism. Let me try again.
First, definitions:
*oldest*: Registering the oldest offset with a SystemConsumer means that the
next poll() call will return the oldest message in the stream as its first
message.
*newest*: Registering the newest offset with a SystemConsumer means that the
next poll() call will return the LAST message in the strea as its first message.
*upcoming*: Registering the upcoming offset with a SystemConsumer means that
the next poll() will either return nothing (if no new message has been appended
to the log since registration), or it will return the first message that was
written after the upcoming offset was registered.
Visually, for the file reader, things look like this:
msg1\n <-- oldest offset is 0
msg2\n <-- newest offset after this newline
msg3\n <-- upcoming offset after this newline
not-yet-complete-msg-that-has-no-newline-after-it-yet
Edge cases to consider:
# When a file is empty, oldest and newest should be set to null, and upcoming
should be set to 0.
# A message that's partially written will have no newline after it. This
message should not be returned until the trailing new-line has been written for
it.
In the file reader's case, for a non-empty file, the oldest offset will always
be 0, the newest offset will always be the offset immediately after the second
to last newline, and the upcoming offset will always be the offset immediately
after the last newline.
> System that places specified file contents onto stream
> ------------------------------------------------------
>
> Key: SAMZA-138
> URL: https://issues.apache.org/jira/browse/SAMZA-138
> Project: Samza
> Issue Type: New Feature
> Affects Versions: 0.7.0
> Environment: RHELinux 2.6.18-371.4.1.el5
> Reporter: Jonathan Poltak Samosir
> Assignee: Yan Fang
> Priority: Minor
> Labels: feature, newbie, patch
> Attachments: FileReaderConsumer.java, FileReaderSystemFactory.java,
> SAMZA-138.patch
>
>
> A fairly straightforward Samza System that reads from a specified file, and
> places that file's contents onto a SystemStreamPartition for use as input for
> a StreamTask.
> Roughly based off how the hello-samza example project's WikipediaSystem works
> (more the SystemConsumerFactory rather than SystemConsumer class).
> Probably needs a bit of work, but basic functionality works as intended.
> Hopefully useful to some, either as a functioning system or as a base for a
> more robust and functionally-promising system that you wish to implement.
> Some suggested improvements (not yet implemented):
> * handle reading from multiple files ([suggested alternative input
> specification|https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201401.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BA7465%40ESV4-MBX01.linkedin.biz%3E]-
> point 2)
> * use of filepos for IncomingMessageEnvelope offset ([more info
> here|https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201401.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BA749D%40ESV4-MBX01.linkedin.biz%3E]
> * come up with a reasonable bounded queue threshold (the value of 100 was
> arbitrary, as I was unsure of a reasonable value here)
> * better handling for the exceptions encountered (I wasn't 100% sure about
> some of them)
--
This message was sent by Atlassian JIRA
(v6.2#6252)