[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004950#comment-13004950
 ] 

Alejandro Abdelnur commented on HDFS-1742:
------------------------------------------

Agree, this would be a very nice feature to have.

Oozie Coordinator currently polls HDFS to find new files to process. (Mikhail, 
the Oozie Coordinator already does what you describe you are building.)

This polling can put a heavy load on HDFS when there are many Oozie coordinator 
jobs, or large ones (large meaning a large number of input dependencies).

This listener should also be available in the secondary namenode. That would 
allow notifications to be offloaded from the primary namenode, so they put no 
extra load on it.

A default implementation of this listener could be an HTTP, RSS-feed-like 
endpoint that remembers the notifications of the last N minutes and supports 
the 'If-Modified-Since' HTTP header: if the header is present, it returns only 
notifications newer than the given timestamp. It could also support a path 
prefix filter. (Note that this implementation does not guarantee delivery if 
the caller misses the N-minute window, so the caller may still have to do some 
lazy polling.)
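
To make the feed idea above a bit more concrete, here is a rough sketch of the 
in-memory buffer that could sit behind such an endpoint. Everything in it is an 
assumption for illustration only: FsEvent, NotificationFeed and their methods 
are hypothetical names, not existing HDFS or Oozie APIs. The HTTP layer is left 
out; sinceMillis stands for the already-parsed If-Modified-Since value, and 
events are assumed to be published in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical event record; not an HDFS type. */
final class FsEvent {
    final long timestampMillis;
    final String type; // e.g. "CREATE", "CLOSE", "DELETE"
    final String path;

    FsEvent(long timestampMillis, String type, String path) {
        this.timestampMillis = timestampMillis;
        this.type = type;
        this.path = path;
    }
}

/** Hypothetical buffer that remembers only the last windowMillis of events. */
final class NotificationFeed {
    private final long windowMillis;
    private final Deque<FsEvent> events = new ArrayDeque<>();

    NotificationFeed(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records an event and drops everything that fell out of the window. */
    synchronized void publish(FsEvent e) {
        events.addLast(e);
        evict(e.timestampMillis);
    }

    /**
     * Returns retained events strictly newer than sinceMillis whose path starts
     * with pathPrefix. Pass Long.MIN_VALUE when the header is absent and an
     * empty prefix to disable the path filter.
     */
    synchronized List<FsEvent> query(long nowMillis, long sinceMillis, String pathPrefix) {
        evict(nowMillis);
        List<FsEvent> out = new ArrayList<>();
        for (FsEvent e : events) {
            if (e.timestampMillis > sinceMillis && e.path.startsWith(pathPrefix)) {
                out.add(e);
            }
        }
        return out;
    }

    /** True if the caller's last poll is older than the window, so events may be lost. */
    synchronized boolean mayHaveMissed(long nowMillis, long sinceMillis) {
        return sinceMillis < nowMillis - windowMillis;
    }

    /** Removes events older than the retention window, oldest first. */
    private void evict(long nowMillis) {
        while (!events.isEmpty()
                && events.peekFirst().timestampMillis < nowMillis - windowMillis) {
            events.removeFirst();
        }
    }
}
```

A caller such as a coordinator could use mayHaveMissed() to decide when to fall 
back to the lazy polling mentioned above.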





> Provide hooks / callbacks to execute some code based on events happening in 
> HDFS (file / directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on 
> the data that appears in HDFS: for example, we have a job that works on a day's 
> worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it 
> should wait for a directory with externally uploaded data, 
> {{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's data to 
> appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one possible solution is to poll once in a while for the 
> files/directories we're waiting for, but generally that's a bad solution. A 
> better one is something like a file alteration monitor or [inode activity 
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented in 
> Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be executed 
> on every major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simplistic implementation as follows: the NN defines some interfaces that 
> implement a callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> A user creates a class that implements this interface and loads it somehow (for 
> example, using an extra jar in the classpath) into the NameNode's JVM. The NameNode 
> includes a configuration option that specifies the names of such class(es); the 
> NameNode then instantiates them and calls their methods (in a separate thread) 
> on every valid event.
> This would allow systems such as the one I described at the beginning to be 
> implemented without polling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
