[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006843#comment-13006843 ]
Suresh Srinivas commented on HDFS-1742:
---------------------------------------

+1 for using some kind of tool on the editlog to do this, as many have suggested. Please see HDFS-1448, which added a tool for viewing the editlog. A tool could be built around that.

> Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on
> the data that appears in HDFS: for example, we have a job that works on a day's
> worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it
> should wait for the directory with externally uploaded data,
> {{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's output to
> appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one possible solution is polling once in a while for the
> files/directories we're waiting for, but generally it's a bad solution. A
> better one is something like a file alteration monitor or [inode activity
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented in
> Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be executed
> on every major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simplistic implementation as follows: the NN defines some interfaces that
> implement a callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> It would be possible to create a class that implements this interface and load
> it somehow (for example, using an extra jar in the classpath) into the NameNode's JVM.
> The NameNode includes a configuration option that specifies the names of such
> class(es); the NameNode then instantiates them and calls their methods (in a
> separate thread) on every valid event.
> There would be a couple of ready-made pluggable implementations of such a
> class that would most likely be distributed as contrib. The default NameNode
> process would stay the same, without any visible differences.
> Hadoop's JobTracker already uses the same paradigm extensively with pluggable
> Scheduler interfaces, such as [Fair
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
> [Capacity
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
> [Dynamic
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
> etc. These are also classes that load and run inside the JobTracker's
> context; a few relatively trusted varieties exist, they're distributed as
> contrib, and they are purely optional, to be enabled by the cluster admin.
> This would allow systems such as the one I described at the beginning to be
> implemented without polling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
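
For readers following the quoted proposal, the callback mechanism it describes (hook classes named in configuration, instantiated by reflection, invoked on a separate thread) could be sketched roughly as below. This is only an illustration under stated assumptions: `HookSketch`, `FileInfo`, `RecordingCallback`, and `HookDispatcher` are hypothetical names that do not exist in HDFS, `FileInfo` stands in for the proposal's `SomeFileInformation` placeholder, and a plain comma-separated string stands in for whatever configuration option the NameNode would actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these types are part of HDFS.
class HookSketch {

    /** Stand-in for the "SomeFileInformation" placeholder in the proposal. */
    static final class FileInfo {
        final String path;
        FileInfo(String path) { this.path = path; }
    }

    /** The callback interface from the proposal, here with no-op defaults. */
    interface NameNodeCallback {
        default void onFileCreate(FileInfo f) {}
        default void onFileClose(FileInfo f) {}
        default void onFileDelete(FileInfo f) {}
    }

    /** Example hook: records the paths of closed files. */
    public static final class RecordingCallback implements NameNodeCallback {
        static final List<String> closed = new CopyOnWriteArrayList<>();
        @Override public void onFileClose(FileInfo f) { closed.add(f.path); }
    }

    /** Loads hooks by class name and runs them off the caller's thread. */
    static final class HookDispatcher {
        private final List<NameNodeCallback> hooks = new ArrayList<>();
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        // classNames plays the role of the configuration option (assumed format).
        HookDispatcher(String classNames) throws ReflectiveOperationException {
            for (String name : classNames.split(",")) {
                Class<?> c = Class.forName(name.trim());
                hooks.add((NameNodeCallback) c.getDeclaredConstructor().newInstance());
            }
        }

        /** Called by the (imaginary) NameNode code path that closes a file. */
        void fileClosed(String path) {
            FileInfo info = new FileInfo(path);
            for (NameNodeCallback hook : hooks) {
                executor.execute(() -> hook.onFileClose(info));
            }
        }

        /** Stops accepting events and waits for queued callbacks to finish. */
        void shutdown() throws InterruptedException {
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        HookDispatcher d = new HookDispatcher(RecordingCallback.class.getName());
        d.fileClosed("/input/2011/03/15/part-00000"); // illustrative path
        d.shutdown();
        System.out.println(RecordingCallback.closed);
    }
}
```

Running callbacks on a single background thread keeps a slow hook from blocking the calling thread, at the cost of delivering events asynchronously; a real design would also need to decide what happens when a hook throws or falls behind.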