[
https://issues.apache.org/jira/browse/APEXMALHAR-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182940#comment-15182940
]
Yogi Devendra commented on APEXMALHAR-2009:
-------------------------------------------
Ram,
Does "from some source" specifically exclude files ? If so, we should
explicitly state this.
No. Even file-based sources will be allowed. But when we are processing
tuple by tuple, we assume that each record from the file is a separate entity
and can thus be processed independently.
For the file copy use-case this does not hold: we need to maintain the original
sequence from the source at the destination. Hence each tuple would need
additional metadata, such as which file it came from and at what offset.
Thus the proposal is to have the following 4 components:
1. HDFS input on a per-tuple basis
2. HDFS input for file copy
3. HDFS output on a per-tuple basis
4. HDFS output for file copy
Each of these components will have a separate email thread for its proposal.
#2 and #4 can be connected together (with other operators in between which
work on blocks) to solve the file copy use-case. The idea of keeping them
separate is that the port signatures differ between the tuple-based and file
copy use-cases.
From the end-user perspective:
Tuple-based input/output processes one record/line from a file. Each record is
processed independently; here a tuple represents raw data.
In the file copy case, by contrast, the ports would emit file metadata and
block metadata in addition to the block data, to make provision for retaining
the original sequence at the destination.
Consider the expected typical scenario: an upstream operator X sends tuples
to this proposed operator Y.
1. How does Y know what the file name is, given a tuple (i.e.
implementation of *getFileName()*) ?
The proposed operator Y writes all records into the same file. Basically, a
tuple does not determine which file it is written to; the operator itself
decides where to write it, and all tuples go to the same output file. (This is
a simplification, because we do not want getFileName() to remain abstract; it
is also valid in many use-cases.)
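The fixed-file simplification above can be sketched as follows. This is a
minimal standalone illustration, not the actual Malhar class hierarchy: it
mirrors the two-method contract of AbstractFileOutputOperator, with
getFileName() made concrete, and the property names (fileName,
tupleSeparator) are hypothetical.

```java
// Standalone sketch mirroring the AbstractFileOutputOperator contract
// discussed above; the real operator would extend that Malhar class.
public class HdfsStringOutputSketch {

    // Hypothetical property: all tuples go to this one output file.
    private String fileName = "output.txt";

    // Hypothetical property: separator appended after each tuple.
    private String tupleSeparator = "\n";

    // Mirrors: protected abstract String getFileName(INPUT tuple)
    // Concrete here: the tuple does not influence the target file.
    protected String getFileName(String tuple) {
        return fileName;
    }

    // Mirrors: protected abstract byte[] getBytesForTuple(INPUT tuple)
    // Concrete here: serialize the tuple and append the separator.
    protected byte[] getBytesForTuple(String tuple) {
        return (tuple + tupleSeparator).getBytes();
    }

    public void setFileName(String fileName) { this.fileName = fileName; }
    public void setTupleSeparator(String sep) { this.tupleSeparator = sep; }
}
```

An app developer would then only set fileName and tupleSeparator as
properties instead of subclassing the abstract operator.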
2. How does Y know when to call *requestFinalize()* for a file (multiple
files could be in progress) ?
As discussed in 1, only one file will be in progress at a time. The
*requestFinalize()* call will happen based on time or on the size of the
output file, as discussed in earlier emails on this thread.
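The size-based half of that finalization policy can be sketched as below.
This is a standalone illustration only: the real operator would call
AbstractFileOutputOperator's requestFinalize(), and the threshold name
maxFileSizeBytes is hypothetical.

```java
// Sketch of the size-based finalization policy described above: track bytes
// written to the single in-progress file and finalize (roll over) once a
// configured threshold is crossed.
public class SizeBasedFinalizeSketch {

    private final long maxFileSizeBytes;   // hypothetical threshold property
    private long bytesWrittenToCurrentFile = 0;
    private int finalizedFiles = 0;

    public SizeBasedFinalizeSketch(long maxFileSizeBytes) {
        this.maxFileSizeBytes = maxFileSizeBytes;
    }

    // Called for every tuple written to the in-progress file.
    public void onBytesWritten(int length) {
        bytesWrittenToCurrentFile += length;
        if (bytesWrittenToCurrentFile >= maxFileSizeBytes) {
            requestFinalize();
        }
    }

    // Stand-in for the real requestFinalize(): close the current part file
    // and start counting for the next one.
    private void requestFinalize() {
        finalizedFiles++;
        bytesWrittenToCurrentFile = 0;
    }

    public int getFinalizedFiles() { return finalizedFiles; }
}
```

A time-based trigger would work the same way, with the check driven from a
periodic callback instead of the per-tuple write path.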
3. Is it partitionable ? The base class is not, for some reason, though the
file input operator is.
Since the base class is not partitionable, this operator Y will not be
partitionable either.
4. The directory where files are written is a fixed property in the base
class annotated with *@NotNull*; what if this path is not known upfront but
is dynamically constructed on a per-file basis? How does X send this info
to Y ?
Since there is only a single file, there is no concept of dynamically
constructing the file name.
When looking at files, the simplest example a user will think of is file
copy, so I think we should make
that work, and work well. To do that, the file input operator may also need
to be carefully examined
and changed suitably if necessary.
I guess addressing it in a module is certainly an option but having file
input and output operators
with elaborate features, class hierarchies, and tutorials but where the
simplest possible use case
is not easy is doing users a disservice.
Yes. File copy is the simplest example for file sources and destinations, and
the aim is to make this file copy easy for the end user.
The answer to that lies in the proposal to have dedicated components for file
copy (components #2 and #4) as mentioned above.
This email thread is for discussion of component #3, i.e. HDFS output on a
per-tuple basis.
> concrete operator for writing to HDFS file
> ------------------------------------------
>
> Key: APEXMALHAR-2009
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2009
> Project: Apache Apex Malhar
> Issue Type: Task
> Reporter: Yogi Devendra
> Assignee: Yogi Devendra
>
> Currently, for writing to HDFS file we have AbstractFileOutputOperator in the
> malhar library.
> It has following abstract methods :
> 1. protected abstract String getFileName(INPUT tuple)
> 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> These methods are kept generic to give flexibility to app developers.
> But someone who is new to Apex would look for a ready-made implementation
> instead of extending the abstract implementation.
> Thus, I am proposing to add a concrete operator HDFSOutputOperator to Malhar.
> The aim of this operator would be to serve as a ready-to-use operator for
> the most frequent use-cases.
> Here are my key observations on most frequent use-cases:
> ------------------------------------------------------------------------------
> 1. Writing tuples of type byte[] or String.
> 2. All tuples on a particular stream land in the same output file.
> 3. The app developer may want to add a custom tuple separator (e.g. a
> newline character) between tuples.
> Discussion thread on mailing list here:
> http://mail-archives.apache.org/mod_mbox/apex-dev/201603.mbox/%3CCAHekGF_6KovS4cjYXzCLdU9En0iPaKO%2BBv%3DEJXbrCuhe9%2BtdrA%40mail.gmail.com%3E
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)