[
https://issues.apache.org/jira/browse/SAMZA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475223#comment-15475223
]
Hai commented on SAMZA-967:
---------------------------
[~navina] Thanks so much for your valuable feedback. Please take a look at the
updated RB when you are free. Regarding your comments on the design doc, I
have updated the doc as well. Here are my answers to your questions:
Q: Is the “End of Stream” feature a prerequisite for the HDFS consumer? If yes,
link the corresponding JIRA and design document. Providing a high-level
description of how that feature will be leveraged for solving this problem will
lay more groundwork for readers who are not familiar with it.
A: Yes. I updated the doc and the JIRA to reflect that SAMZA-974 is a
prerequisite.
Q: One of the goals and one of the non-goals overlap slightly: "(Goal) The system
consumer should support a variety of folder structures and filename
conventions" and "(Non-Goal) Support ALL kinds of HDFS folder structures and
filename formats". Can you specifically call out which structures and
conventions you are supporting, or call out which ones you are not supporting?
Just to add more clarity to the document.
A: Updated the doc to be more specific.
Q: Along with the 3rd point under Assumptions, you can call out "write-once,
read-many" as the underlying usage pattern.
A: Done
Q: What do the whitelist and blacklist here consist of? Why do we need both?
Can you provide an example of what this config will look like?
A: As pointed out in the design doc, having two regexes instead of one keeps
each regex simple. Many systems, including Kafka, do this. You can always craft
a single regex that combines the whitelist and blacklist, but it would look
complicated. I updated the doc to give examples.
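
To illustrate the two-regex approach, here is a minimal sketch in Java. It is
not the actual consumer code: the class name, the regexes, and the sample paths
are all hypothetical, and it assumes a path is accepted when it fully matches
the whitelist and does not fully match the blacklist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Samza.
public class WhitelistBlacklistFilter {
    private final Pattern whitelist;
    private final Pattern blacklist;

    public WhitelistBlacklistFilter(String whitelistRegex, String blacklistRegex) {
        this.whitelist = Pattern.compile(whitelistRegex);
        this.blacklist = Pattern.compile(blacklistRegex);
    }

    // Assumed semantics: accept a path only if the whole path matches the
    // whitelist and the whole path does NOT match the blacklist.
    public boolean accept(String path) {
        return whitelist.matcher(path).matches()
            && !blacklist.matcher(path).matches();
    }

    // Keeps the accepted paths in their original order.
    public List<String> filter(List<String> paths) {
        List<String> accepted = new ArrayList<>();
        for (String path : paths) {
            if (accept(path)) {
                accepted.add(path);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Whitelist: any path ending in ".avro".
        // Blacklist: any path whose file name starts with "_" or ".".
        WhitelistBlacklistFilter filter =
            new WhitelistBlacklistFilter(".*\\.avro", ".*/[_.][^/]*");

        List<String> paths = Arrays.asList(
            "/data/events/part-00000.avro",
            "/data/events/_tmp.avro",
            "/data/events/_SUCCESS",
            "/data/events/part-00001.txt");

        for (String path : filter.filter(paths)) {
            System.out.println(path);
        }
    }
}
```

With these sample regexes, a single combined pattern would need a negative
lookahead, for example `(?!.*/[_.][^/]*$).*\.avro`, which is harder to read and
maintain than the two separate ones.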
Q: In the case of the repartitioner, multiple Samza tasks cannot write to the
same file. Hence, each task can write to a separate file within the partition
directory -> what defines the ordering among these files when the downstream
job is consuming? Is it based on timestamp?
A: In this case there is no ordering among these files. Imagine that, instead
of writing to HDFS, we write to Kafka: you also have no ordering within
the topic partition when the events come from different upstream
producers.
Q: When does the HDFSSystemAdmin write the PartitionDescriptor to HDFS? Is it
done by the job coordinator or by each container?
A: This is more of an implementation detail, so I didn't provide specifics in
the doc. You are right, it's done by the job coordinator. In the current
implementation, it happens when getSystemStreamMetadata is called.
Q: Is the PartitionDescriptor file expected to follow any convention? Or is it
simply going to contain a map?
A: It's simply a map in JSON format.
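
As a rough illustration of what "a map in JSON format" could look like, here is
a minimal Java sketch. The key and value types are assumptions made for this
example only (partition id mapped to a list of file paths); the class name, the
sample paths, and the hand-rolled serializer are hypothetical and do not
reflect the actual PartitionDescriptor layout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, for illustration only; not the real descriptor format.
public class PartitionDescriptorSketch {

    // Wraps a string in double quotes, escaping backslashes and quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Serializes the assumed map (partition id -> file paths) as a JSON object.
    static String toJson(Map<Integer, List<String>> descriptor) {
        return descriptor.entrySet().stream()
            .map(e -> quote(String.valueOf(e.getKey())) + ":"
                + e.getValue().stream()
                    .map(PartitionDescriptorSketch::quote)
                    .collect(Collectors.joining(",", "[", "]")))
            .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is deterministic.
        Map<Integer, List<String>> descriptor = new LinkedHashMap<>();
        descriptor.put(0, Arrays.asList("/data/events/part-00000.avro"));
        descriptor.put(1, Arrays.asList(
            "/data/events/part-00001.avro",
            "/data/events/part-00002.avro"));

        System.out.println(toJson(descriptor));
    }
}
```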
> Add HDFS system consumer to Samza
> ---------------------------------
>
> Key: SAMZA-967
> URL: https://issues.apache.org/jira/browse/SAMZA-967
> Project: Samza
> Issue Type: Sub-task
> Reporter: Hai
> Assignee: Hai
> Fix For: 0.12.0
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)