[jira] [Commented] (SAMZA-967) Add HDFS system consumer to Samza

Hai (JIRA) Thu, 08 Sep 2016 15:39:03 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475223#comment-15475223
 ]


Hai commented on SAMZA-967:
---------------------------

[~navina] Thanks so much for your valuable feedback. Please take a look at the 
updated RB when you are free. Regarding your comments on the design doc, I 
have updated the doc as well. Here are my answers to your questions:

Q: Is the “End of Stream” feature a pre-requisite for HDFS consumer? If yes, 
link the corresponding JIRA and design document. Providing a high-level 
description of how that feature will be leveraged for solving this problem will 
lay more groundwork for readers who are not familiar with it.
A: Yes. Updated the doc and the JIRA to reflect that SAMZA-974 is a 
pre-requisite.

Q: One of the goals and one of the non-goals overlap slightly: "(Goal) The 
system consumer should support a variety of folder structures and filename 
conventions" and "(Non-Goal) Support ALL kinds of HDFS folder structures and 
filename formats". Can you specifically call out which structures and 
conventions you are supporting, or which ones you are not supporting? 
Just to add more clarity to the document.
A: Updated the doc to be more specific.

Q: Along with the 3rd point under Assumptions, you can call out "write-once, 
read-many" as the underlying usage pattern.
A: Done

Q: What do the whitelist and blacklist here consist of? Why do we need both? 
Can you provide an example of how this config will look?
A: As pointed out in the design doc, this is to simplify the regexes by having 
two instead of one. Many systems, including Kafka, do this. You can 
always craft a single regex that combines the whitelist and blacklist, but it 
would look complicated. Updated the doc to give examples.
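
For illustration only, here is a minimal Python sketch of the two-regex idea 
next to a single combined regex. The patterns and the function names 
(is_selected, is_selected_combined) are hypothetical and are not taken from 
the design doc or from the Samza code.

```python
import re

# Hypothetical patterns for illustration; the real config keys and
# defaults are defined in the design doc, not here.
WHITELIST = re.compile(r"part-.*\.avro")   # files to consume
BLACKLIST = re.compile(r".*_tmp.*")        # files to skip even if whitelisted

# The same rule squeezed into one regex needs a negative lookahead.
COMBINED = re.compile(r"(?!.*_tmp)part-.*\.avro")


def is_selected(name):
    """Accept a file name only if it matches the whitelist and not the blacklist."""
    return bool(WHITELIST.fullmatch(name)) and not BLACKLIST.fullmatch(name)


def is_selected_combined(name):
    """Single-regex equivalent of is_selected for these two patterns."""
    return bool(COMBINED.fullmatch(name))
```

Both functions accept and reject the same names here, but the combined 
pattern is already harder to read, and it gets worse as more exclusions are 
added.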

Q: In case of repartitioner, multiple Samza tasks cannot write to the same 
file. Hence, each task can write to a separate file within the partition 
directory -> what defines the ordering among these files when the downstream 
job is consuming? Is it based on timestamp?
A: In this case there is no ordering among these files. Let's imagine that, 
instead of writing to HDFS, we write to Kafka: you also have no ordering 
within the topic partition when the events come from different upstream 
producers.

Q: When does the HDFSSystemAdmin write the PartitionDescriptor to HDFS? Is it 
done by the job coordinator or by each container?
A: This is more of an implementation detail, so I didn't provide specifics in 
the doc. You are right, it's done by the job coordinator. In the current 
implementation, it happens when getSystemStreamMetadata is called.

Q: Is the PartitionDescriptor file expected to follow any convention? Or is it 
simply going to contain a map?
A: It's simply a map in JSON format.
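
As a rough sketch of what "a map in JSON" could look like, the Python below 
round-trips a small descriptor. The layout (partition id mapped to a list of 
file paths), the example paths, and the helper names are assumptions made for 
illustration; the actual keys and values are whatever the implementation 
writes.

```python
import json


def serialize_descriptor(descriptor):
    """Render the descriptor map as a JSON string."""
    return json.dumps(descriptor, sort_keys=True, indent=2)


def parse_descriptor(text):
    """Load the descriptor map back from its JSON form."""
    return json.loads(text)


# Assumed layout: partition id -> list of file paths in that partition.
example = {
    "0": ["/data/events/part-00000.avro"],
    "1": ["/data/events/part-00001.avro", "/data/events/part-00002.avro"],
}
```

Because the file is plain JSON, any reader can recover the same map with a 
standard JSON parser and no custom format handling.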

> Add HDFS system consumer to Samza
> ---------------------------------
>
>                 Key: SAMZA-967
>                 URL: https://issues.apache.org/jira/browse/SAMZA-967
>             Project: Samza
>          Issue Type: Sub-task
>            Reporter: Hai
>            Assignee: Hai
>             Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
