[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005134#comment-13005134 ]
Mikhail Yakshin commented on HDFS-1742:
---------------------------------------

I disagree about completely isolating the callback system process. A callback system implementation is *not* end-user code, as map-reduce jobs are, and thus can be fairly reliable. Updating this code requires administrative privileges and a restart of the NameNode.

JobTracker already includes a pluggable Scheduler interface ([HADOOP-3412]) that introduces external classes into the main JobTracker JVM (albeit the choice of classes is fairly limited). There is also the pluggable [JobInProgressListener|http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/JobTracker.html#addJobInProgressListener(org.apache.hadoop.mapred.JobInProgressListener)], which implements exactly the same idea: a listener that receives events.

Thus, I see no harm in having no listeners by default, plus a sample listener implementation that does basic logging of events to a file or some sort of queue.

> Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on
> the data that appears in HDFS: for example, we have a job that works on a day's
> worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it
> should wait for a directory with externally uploaded data,
> {{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's output to
> appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one possible solution is to poll once in a while for the
> files/directories we're waiting for, but generally it's a bad solution.
> The better one is something like a file alteration monitor or [inode activity
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented in
> Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be executed
> on every major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simple implementation as follows: the NN defines some interfaces that
> make up the callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> A user creates a class that implements this interface and loads it somehow (for
> example, using an extra jar in the classpath) into the NameNode's JVM. The NameNode
> includes a configuration option that specifies the names of such class(es); the
> NameNode then instantiates them and calls their methods (in a separate thread)
> on every relevant event.
> This would allow systems such as the one I described at the beginning to be
> implemented without polling.
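
For illustration only, here is a minimal Java sketch of how the proposed mechanism and the sample listener could fit together. Everything beyond the {{NameNodeCallback}} method names is an assumption: {{FileEvent}} stands in for the {{SomeFileInformation}} placeholder, and {{QueueLoggingCallback}} and {{CallbackDispatcher}} are hypothetical names, not existing HDFS classes. The dispatcher instantiates listeners from a list of class names (as a configuration option might supply them) and delivers events on a separate thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for the SomeFileInformation placeholder. */
final class FileEvent {
    final String path;
    FileEvent(String path) { this.path = path; }
}

/** Callback interface following the issue's outline (method set is illustrative). */
interface NameNodeCallback {
    void onFileCreate(FileEvent f);
    void onFileClose(FileEvent f);
    void onFileDelete(FileEvent f);
}

/** Sample listener: records each event as a line in an in-memory queue. */
final class QueueLoggingCallback implements NameNodeCallback {
    final BlockingQueue<String> log = new LinkedBlockingQueue<>();
    @Override public void onFileCreate(FileEvent f) { log.add("CREATE " + f.path); }
    @Override public void onFileClose(FileEvent f)  { log.add("CLOSE " + f.path); }
    @Override public void onFileDelete(FileEvent f) { log.add("DELETE " + f.path); }
}

/** Hypothetical dispatcher: loads listeners by class name, notifies them off-thread. */
final class CallbackDispatcher {
    private final List<NameNodeCallback> listeners = new ArrayList<>();
    // A single worker thread delivers events in the order they were submitted.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Instantiates each configured class through its no-argument constructor. */
    CallbackDispatcher(List<String> classNames) {
        for (String name : classNames) {
            try {
                Class<? extends NameNodeCallback> cls =
                        Class.forName(name).asSubclass(NameNodeCallback.class);
                listeners.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("cannot load listener " + name, e);
            }
        }
    }

    List<NameNodeCallback> listeners() { return listeners; }

    void fileCreated(String path) { FileEvent e = new FileEvent(path); dispatch(l -> l.onFileCreate(e)); }
    void fileClosed(String path)  { FileEvent e = new FileEvent(path); dispatch(l -> l.onFileClose(e)); }
    void fileDeleted(String path) { FileEvent e = new FileEvent(path); dispatch(l -> l.onFileDelete(e)); }

    /** Queues one event; a failing listener does not stop delivery to the others. */
    private void dispatch(Consumer<NameNodeCallback> call) {
        worker.execute(() -> {
            for (NameNodeCallback l : listeners) {
                try {
                    call.accept(l);
                } catch (RuntimeException ex) {
                    System.err.println("listener failed: " + ex);
                }
            }
        });
    }

    /** Stops accepting events and waits for queued ones; false on timeout or interrupt. */
    boolean shutdown(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this sketch the caller (the NameNode, in the proposal) only enqueues the event and returns, so a slow listener delays other notifications but not the filesystem operation itself. With an empty list of class names the dispatcher has no listeners, which matches the "no listeners by default" suggestion.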