[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005134#comment-13005134
 ] 

Mikhail Yakshin commented on HDFS-1742:
---------------------------------------

I disagree that the callback system has to be completely isolated in a separate 
process. A callback implementation is *not* end-user code, as map-reduce jobs 
are, and thus can be fairly reliable. Updating this code requires administrative 
privileges and a NameNode restart.

JobTracker already includes a pluggable Scheduler interface ([HADOOP-3412]) that 
introduces external classes into the main JobTracker JVM (albeit the choice of 
classes is fairly limited). There is also a pluggable 
[JobInProgressListener|http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/JobTracker.html#addJobInProgressListener(org.apache.hadoop.mapred.JobInProgressListener)]
that implements exactly the same idea: a listener that receives events.

Thus, I see no harm in having no listeners by default, plus a sample listener 
implementation that does basic logging of events to a file or some sort of 
queue.
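
To make the "sample listener" idea concrete, here is a minimal sketch of a 
listener that only records events in a queue. Everything in it is an 
assumption for illustration: the NameNodeCallback interface mirrors the one 
proposed in the issue but takes a plain path string, and FileEvent and 
QueueLoggingCallback are hypothetical names, not existing Hadoop classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event record; the issue only calls it "SomeFileInformation".
final class FileEvent {
    final String type; // "CREATE", "CLOSE" or "DELETE"
    final String path; // HDFS path the event refers to

    FileEvent(String type, String path) {
        this.type = type;
        this.path = path;
    }

    @Override
    public String toString() {
        return type + " " + path;
    }
}

// Hypothetical listener interface, modelled on the one proposed in the issue.
interface NameNodeCallback {
    void onFileCreate(String path);
    void onFileClose(String path);
    void onFileDelete(String path);
}

// Sample listener: appends every event to an unbounded in-memory queue
// that some other component could drain and write to a file.
final class QueueLoggingCallback implements NameNodeCallback {
    private final BlockingQueue<FileEvent> queue = new LinkedBlockingQueue<>();

    @Override
    public void onFileCreate(String path) {
        queue.offer(new FileEvent("CREATE", path));
    }

    @Override
    public void onFileClose(String path) {
        queue.offer(new FileEvent("CLOSE", path));
    }

    @Override
    public void onFileDelete(String path) {
        queue.offer(new FileEvent("DELETE", path));
    }

    // Exposes the queue so a consumer can poll or take events from it.
    BlockingQueue<FileEvent> events() {
        return queue;
    }
}
```

A listener like this does no I/O of its own in the callback, which keeps the 
work done on behalf of the NameNode small.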

> Provide hooks / callbacks to execute some code based on events happening in 
> HDFS (file / directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on 
> the data that appears in HDFS: for example, we have a job that works on a day's 
> worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it 
> should wait for a directory with externally uploaded data, 
> {{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's output 
> to appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one possible solution is to poll once in a while for the 
> files/directories we're waiting for, but generally that's a bad solution. A 
> better one is something like a file alteration monitor or [inode activity 
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented 
> in Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be 
> executed on every major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simplistic implementation as follows: the NN defines some interfaces 
> for a callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> A user creates a class that implements this interface and loads it somehow 
> (for example, using an extra jar in the classpath) into the NameNode's JVM. 
> The NameNode includes a configuration option that specifies the names of such 
> class(es); the NameNode then instantiates them and calls their methods (in a 
> separate thread) on every valid event.
> This would allow systems such as the one I described at the beginning to be 
> implemented without polling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
