[ 
https://issues.apache.org/jira/browse/HDFS-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166823#comment-15166823
 ] 

Andrew Wang commented on HDFS-9782:
-----------------------------------

Based on numbers I've seen, the NN can do a few hundred files per second, so 
throwing a couple hundred or thousand at the NN all at once will result in a 
multi-second blip. A little fuzz goes a long way here, so if you're cool with 
1min or even 30s, I think that's sufficient. Speaking from experience, even big 
cluster operators aren't necessarily more savvy about Hadoop config keys.
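
To make the fuzz idea concrete, here is a minimal sketch of picking a roll time with a random offset. It is not the code from the attached patches: the class `RollJitter`, the method `nextRollMillis`, and all parameter names are hypothetical. As rough illustrative arithmetic, if 1000 hosts roll at the same instant and the NN handles about 300 operations per second, the burst takes roughly 3 seconds to drain. Spreading the same rolls over a 30s window averages about 33 per second.

```java
import java.util.Random;

public class RollJitter {
    /**
     * Hypothetical helper (not from the HDFS-9782 patches): returns the next
     * roll time in milliseconds, i.e. the next interval boundary after
     * nowMillis plus a random offset in [0, maxOffsetMillis).
     */
    static long nextRollMillis(long nowMillis, long intervalMillis,
                               long maxOffsetMillis, Random rng) {
        // Round up to the next boundary, e.g. the top of the next hour.
        long nextBoundary = ((nowMillis / intervalMillis) + 1) * intervalMillis;
        // Each host draws its own offset, so the rolls are spread over the window.
        long offset = maxOffsetMillis > 0
            ? (long) (rng.nextDouble() * maxOffsetMillis)
            : 0L;
        return nextBoundary + offset;
    }
}
```

With an hourly interval and a 30s maximum offset, each host would roll somewhere in the first 30 seconds after the top of the hour rather than exactly on it.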

There is also a meta point about timeliness. There's always going to be 
inaccuracy in data collection (NTP fail, GC pause, dog chewed an Ethernet 
cable), and this needs to be accounted for when processing. This is like the 
famous "lambda architecture" from the streaming world; handle late data in a 
rollup.
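
One way to read "handle late data in a rollup" is to bucket records by the timestamp they carry rather than by the file they arrived in, so a late record still lands in the right hour. The sketch below assumes each record has an event time and a numeric value. The class `LateDataRollup` and its methods are hypothetical and not part of Hadoop.

```java
import java.util.Map;
import java.util.TreeMap;

public class LateDataRollup {
    private final long bucketMillis;
    // Sum of values per bucket, keyed by bucket start time in milliseconds.
    private final Map<Long, Long> sumsByBucket = new TreeMap<>();

    LateDataRollup(long bucketMillis) {
        this.bucketMillis = bucketMillis;
    }

    /** Adds a value to the bucket that its event time falls in, regardless of arrival order. */
    void add(long eventTimeMillis, long value) {
        long bucketStart = (eventTimeMillis / bucketMillis) * bucketMillis;
        sumsByBucket.merge(bucketStart, value, Long::sum);
    }

    /** Returns the current sum for a bucket, or 0 if nothing has arrived for it yet. */
    long total(long bucketStart) {
        return sumsByBucket.getOrDefault(bucketStart, 0L);
    }
}
```

A record from the previous hour that shows up after the roll is simply merged into the earlier bucket, so the exact moment each host flushes matters less.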

> RollingFileSystemSink should have configurable roll interval
> ------------------------------------------------------------
>
>                 Key: HDFS-9782
>                 URL: https://issues.apache.org/jira/browse/HDFS-9782
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>         Attachments: HDFS-9782.001.patch, HDFS-9782.002.patch, 
> HDFS-9782.003.patch, HDFS-9782.004.patch
>
>
> Right now it defaults to rolling at the top of every hour.  Instead that 
> interval should be configurable.  The interval should also allow for some 
> play so that all hosts don't try to flush their files simultaneously.
> I'm filing this in HDFS because I suspect it will involve touching the HDFS 
> tests.  If it turns out not to, I'll move it into common instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
