[jira] [Commented] (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164530#comment-14164530 ]

Colin Patrick McCabe commented on HDFS-1742:

HDFS-6634 implemented a way to listen for filesystem events. Check it out!

> Provide hooks / callbacks to execute some code based on events happening in
> HDFS (file / directory creation, opening, closing, etc)
> ---
>
> Key: HDFS-1742
> URL: https://issues.apache.org/jira/browse/HDFS-1742
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: namenode
> Reporter: Mikhail Yakshin
> Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on
> the data that appears in HDFS: for example, we have a job that works on a day's
> worth of data and creates output in {{/output//MM/DD}}. For input, it
> should wait for a directory with externally uploaded data as
> {{/input//MM/DD}} to appear, and also wait for the previous day's data to
> appear, i.e. {{/output//MM/DD-1}}.
> Obviously, one of the possible solutions is polling once in a while for
> files/directories we're waiting for, but generally it's a bad solution. The
> better one is something like a file alteration monitor or [inode activity
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented in
> Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be executed
> on every major event happening in HDFS, such as:
> * File created / open
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simple implementation as follows: NN defines some interfaces that
> implement a callback/hook mechanism - i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> It might be possible to create a class that implements this interface and load
> it somehow (for example, using an extra jar in the classpath) in NameNode's JVM.
> NameNode includes a configuration option that specifies the names of such
> class(es) - then NameNode instantiates them and calls methods from them (in a
> separate thread) on every valid event happening.
> There would be a couple of ready-made pluggable implementations of such a
> class that would most likely be distributed as contrib. The default NameNode
> process would stay the same without any visible differences.
> Hadoop's JobTracker already extensively uses the same paradigm with pluggable
> Scheduler interfaces, such as [Fair Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
> [Capacity Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
> [Dynamic Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
> etc. It also uses class(es) that load and run inside JobTracker's
> context; a few relatively trusted varieties exist, they're distributed as
> contrib and are purely optional, to be enabled by the cluster admin.
> This would allow systems such as the one I've described in the beginning to be
> implemented without polling.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
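The hook mechanism proposed in the issue description (a configuration option naming callback classes, which the NameNode instantiates and calls from a separate thread) could look roughly like the sketch below. It is only an illustration under stated assumptions: `SomeFileInformation` is given a single hypothetical `path` field, and `RecordingCallback`, `CallbackDispatcher` and the comma-separated class list are invented for this example and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical payload type; the issue only names it SomeFileInformation.
final class SomeFileInformation {
    final String path;
    SomeFileInformation(String path) { this.path = path; }
}

// The three callbacks spelled out in the issue description.
interface NameNodeCallback {
    void onFileCreate(SomeFileInformation f);
    void onFileClose(SomeFileInformation f);
    void onFileDelete(SomeFileInformation f);
}

// Hypothetical plug-in that only records what it was told.
class RecordingCallback implements NameNodeCallback {
    static final List<String> SEEN = new CopyOnWriteArrayList<>();
    @Override public void onFileCreate(SomeFileInformation f) { SEEN.add("create " + f.path); }
    @Override public void onFileClose(SomeFileInformation f) { SEEN.add("close " + f.path); }
    @Override public void onFileDelete(SomeFileInformation f) { SEEN.add("delete " + f.path); }
}

// Hypothetical stand-in for the NameNode side of the proposal.
final class CallbackDispatcher {
    private final List<NameNodeCallback> callbacks = new ArrayList<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // classNames plays the role of the configuration option: a comma-separated class list.
    CallbackDispatcher(String classNames) {
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) continue;
            try {
                Object instance = Class.forName(trimmed).getDeclaredConstructor().newInstance();
                callbacks.add((NameNodeCallback) instance);
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("cannot load callback " + trimmed, e);
            }
        }
    }

    // Hands the event to every callback on the worker thread, not on the caller's thread.
    void fileCreated(String path) {
        SomeFileInformation info = new SomeFileInformation(path);
        for (NameNodeCallback cb : callbacks) {
            worker.execute(() -> {
                try {
                    cb.onFileCreate(info);
                } catch (RuntimeException e) {
                    // A failing plug-in must not take the dispatcher down.
                }
            });
        }
    }

    // Stops accepting events and waits briefly for queued callbacks to finish.
    boolean shutdown() {
        worker.shutdown();
        try {
            return worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class CallbackDemo {
    public static void main(String[] args) {
        CallbackDispatcher dispatcher = new CallbackDispatcher(RecordingCallback.class.getName());
        dispatcher.fileCreated("/input/example");
        dispatcher.shutdown();
        System.out.println(RecordingCallback.SEEN);
    }
}
```

Even in this toy form the plug-in shares a JVM with the dispatcher, so a slow callback delays every event queued behind it on the single worker thread.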
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006843#comment-13006843 ]

Suresh Srinivas commented on HDFS-1742:
---

+1 for using some kind of a tool on editlog to do this, as many have suggested. Please see HDFS-1448, which added a tool for viewing editlog. A tool could be built around that.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006148#comment-13006148 ]

dhruba borthakur commented on HDFS-1742:

I agree that this is a useful feature; we have many processes that watch the filesystem namespace and do various things when files/directories appear in the HDFS namespace.

However, making the fsedit logging invoke user-specified callbacks seems problematic. What happens when the callback does not return within a specific period of time? What locks can the namenode keep across these callbacks? Who will retry callbacks if the callback returned "failure"?

I would rather vote that the HDFS namenode log all these changes into a file in a well-defined format (aka HDFS-1179). This is the core building block that is needed by an external application to build a notification mechanism, publish-subscribe software, etc.
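To make the "log changes in a well-defined format" alternative from this comment concrete, here is a minimal sketch of an external consumer. The record layout (`txid<TAB>operation<TAB>path`) and the names `ChangeRecord` and `ChangeLogConsumer` are assumptions made up for illustration; they are not the format defined by HDFS-1179 or by the edit log.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// One line of the hypothetical change log: txid<TAB>operation<TAB>path.
final class ChangeRecord {
    final long txId;
    final String op;
    final String path;

    ChangeRecord(long txId, String op, String path) {
        this.txId = txId;
        this.op = op;
        this.path = path;
    }

    static ChangeRecord parse(String line) {
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed record: " + line);
        }
        return new ChangeRecord(Long.parseLong(parts[0]), parts[1], parts[2]);
    }
}

// External consumer that remembers how far it has read, so retries and slow
// handlers are its own problem and never block the process writing the log.
final class ChangeLogConsumer {
    private long lastSeenTxId;

    ChangeLogConsumer(long lastSeenTxId) {
        this.lastSeenTxId = lastSeenTxId;
    }

    // Returns only the records newer than the last one already processed.
    List<ChangeRecord> poll(Reader log) {
        List<ChangeRecord> fresh = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(log)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                ChangeRecord record = ChangeRecord.parse(line);
                if (record.txId > lastSeenTxId) {
                    fresh.add(record);
                    lastSeenTxId = record.txId;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fresh;
    }
}

class ChangeLogDemo {
    public static void main(String[] args) {
        String log = "1\tMKDIR\t/input\n2\tCREATE\t/input/a\n3\tCLOSE\t/input/a\n";
        ChangeLogConsumer consumer = new ChangeLogConsumer(0);
        for (ChangeRecord r : consumer.poll(new StringReader(log))) {
            System.out.println(r.txId + " " + r.op + " " + r.path);
        }
    }
}
```

Because the consumer tracks its own position, the three questions raised above (timeouts, locks, retries) move out of the namenode and into the external application.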
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005365#comment-13005365 ]

Alejandro Abdelnur commented on HDFS-1742:
--

I agree 300% that user code MUST NOT run in the Hadoop services.

Just to make it clear, my suggestion was to have a service interface, like JT has the Scheduler interface, that can be used to augment server behavior. Only the cluster administrators could set this up. Out of the box Hadoop could bundle 1 or 2 implementations. Still, people could implement their own in case they have special requirements. Or just use NIL, which would be today's behavior.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005327#comment-13005327 ]

Mikhail Yakshin commented on HDFS-1742:
---

I seriously doubt that making pubsub-like event transmission the *only* available option is the way to go. The pubsub model is a cool thing, but a proper implementation of it requires a full-blown messaging subsystem akin to the ones that implement [JMS|http://en.wikipedia.org/wiki/Java_Message_Service], such as [ActiveMQ|http://activemq.apache.org/]. In turn, it means a whole other system, matching Hadoop in complexity (it includes daemons, at least a JMS broker, and it requires non-trivial configuration and deployment), being installed and made mandatory by Hadoop.

The only thing I'm arguing for is making this thing *modular* - i.e. making a JMS pubsub producer *an option*, but not the *only* option. Other options might be simple local file logging, sending events across the network, plugging in some local workflow management system, etc.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005280#comment-13005280 ]

Allen Wittenauer commented on HDFS-1742:

Heck, you could build a trivial/poc version based upon the hdfs audit log in no time flat.
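A proof of concept along the lines Allen suggests might start from something like the sketch below, which picks directory-creation events out of audit log lines. It assumes the audit line carries whitespace-separated `key=value` fields including `cmd=` and `src=`; the sample line in `main` is made up, the class name `AuditLogWatcher` is hypothetical, and paths containing whitespace are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for a proof of concept; not part of Hadoop.
final class AuditLogWatcher {

    // Collects the key=value fields of one audit line; tokens without '=' are ignored.
    static Map<String, String> fields(String line) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                out.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return out;
    }

    // Returns the directory path if the line records a mkdirs under watchedPrefix.
    static Optional<String> createdDirectory(String line, String watchedPrefix) {
        Map<String, String> f = fields(line);
        String src = f.get("src");
        if ("mkdirs".equals(f.get("cmd")) && src != null && src.startsWith(watchedPrefix)) {
            return Optional.of(src);
        }
        return Optional.empty();
    }
}

class AuditLogDemo {
    public static void main(String[] args) {
        // Made-up sample line in the assumed layout.
        String line = "2011-03-10 12:00:00,000 INFO FSNamesystem.audit: ugi=etl ip=/10.0.0.5 "
                + "cmd=mkdirs src=/input/2011/03/10 dst=null perm=etl:hadoop:rwxr-xr-x";
        AuditLogWatcher.createdDirectory(line, "/input/")
                .ifPresent(dir -> System.out.println("directory appeared: " + dir));
    }
}
```

A real tool would still need to tail the log file and cope with rotation, which this sketch leaves out.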
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005217#comment-13005217 ]

Todd Lipcon commented on HDFS-1742:
---

Hey folks. I think people generally accept that it would be nice to be able to have an inotify-like interface on top of HDFS. However I don't think the proposed implementation of doing this inside the NN is a good idea for the following reasons:
- it adds "less trusted" code running in the same JVM as the NN, which could crash it, use up memory, etc.
- it adds load to the NN, which is already a scalability limit on large clusters
- it will require a NN restart (or fragile classloader tricks) to reload the set of hooks

I think the right way forward here is to have some kind of service subscribe to the NN edit logs and then publish events to subscribers. This would allow the "pubsub" service to run on a separate machine and not impact the NN in any way.

Monitoring/alerting capability based on lifecycle events in the NN does make sense to me, though - e.g. a trigger when the NN enters or exits safemode. These tend to be lower-load, infrequent events and pluggable listeners would be plenty useful. See HADOOP-5640 for an interface like this.
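The separate "pubsub" service Todd describes could be organised roughly as follows. This is a sketch under assumptions: `NamespaceEvent` and `EventPublisher` are hypothetical names, subscriptions are matched by simple path prefix, and the code that would actually read the NN edit logs is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event type derived from one edit log operation.
final class NamespaceEvent {
    final String type;
    final String path;

    NamespaceEvent(String type, String path) {
        this.type = type;
        this.path = path;
    }
}

// Runs outside the NameNode: whatever reads the edit log calls publish().
final class EventPublisher {
    private static final class Subscription {
        final String prefix;
        final Consumer<NamespaceEvent> listener;

        Subscription(String prefix, Consumer<NamespaceEvent> listener) {
            this.prefix = prefix;
            this.listener = listener;
        }
    }

    private final List<Subscription> subscriptions = new ArrayList<>();

    // Registers interest in every event whose path starts with pathPrefix.
    void subscribe(String pathPrefix, Consumer<NamespaceEvent> listener) {
        subscriptions.add(new Subscription(pathPrefix, listener));
    }

    // Delivers the event to matching subscribers and returns how many accepted it.
    int publish(NamespaceEvent event) {
        int delivered = 0;
        for (Subscription s : subscriptions) {
            if (!event.path.startsWith(s.prefix)) continue;
            try {
                s.listener.accept(event);
                delivered++;
            } catch (RuntimeException e) {
                // One broken subscriber must not stop delivery to the others.
            }
        }
        return delivered;
    }
}

class PublisherDemo {
    public static void main(String[] args) {
        EventPublisher publisher = new EventPublisher();
        publisher.subscribe("/output/", ev -> System.out.println(ev.type + " " + ev.path));
        publisher.publish(new NamespaceEvent("MKDIR", "/output/2011/03/09"));
    }
}
```

Since subscribers are registered with the publisher rather than with the NameNode, adding or removing one needs no NN restart, which addresses the third objection in the list above.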
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005158#comment-13005158 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1742:
----------------------------------------------

> Cluster admins or, more likely, Hadoop developers. ...

Sounds good. I suggest you update the description to avoid confusion.

> It depends on how do you define "internally". ...

I actually mean the same as you suggested, i.e. not for end users to provide a {{JobInProgressListener}} class.

> Provide hooks / callbacks to execute some code based on events happening in
> HDFS (file / directory creation, opening, closing, etc)
> ---
>
> Key: HDFS-1742
> URL: https://issues.apache.org/jira/browse/HDFS-1742
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Reporter: Mikhail Yakshin
> Labels: features, polling
>
> We're working on a system that runs various Hadoop job continuously, based on
> the data that appears in HDFS: for example, we have a job that works on day's
> worth of data and creates output in {{/output//MM/DD}}. For input, it
> should wait for directory with externally uploaded data as
> {{/input//MM/DD}} to appear, and also wait for previous day's data to
> appear, i.e. {{/output//MM/DD-1}}.
> Obviously, one of the possible solutions is polling once in a while for
> files/directories we're waiting for, but generally it's a bad solution. The
> better one is something like file alteration monitor or [inode activity
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as ones implemented in
> Linux filesystems.
> Basic idea is that one can specify (inject) some code that will be executed
> on every major event happening in HDFS, such as:
> * File created / open
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see simplistic implementation as following: NN defines some interfaces that
> implement callback/hook mechanism - i.e. something like:
> {code}
> interface NameNodeCallback {
>   public void onFileCreate(SomeFileInformation f);
>   public void onFileClose(SomeFileInformation f);
>   public void onFileDelete(SomeFileInformation f);
>   ...
> }
> {code}
> A user creates a class that implements this method and loads it somehow (for
> example, using an extra jar in classpath) in NameNode's JVM. NameNode
> includes a configuration option that specifies names of such class(es) - then
> NameNode instantiates them and calls methods from them (in a separate thread)
> on every valid event happening.
> This would allow systems such as I've described in the beginning to be
> implemented without polling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
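The mechanism in the quoted issue description (listener class names taken from a configuration option, instantiated by the NameNode, and invoked on a separate thread) could be sketched roughly as follows. This is only an illustration under assumptions: `CallbackDispatcher`, `loadListeners`, `fileCreated`, and the single `path` field of `SomeFileInformation` are hypothetical names, not part of Hadoop, and the comma-separated format of the option is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the "SomeFileInformation" placeholder in the proposal. */
final class SomeFileInformation {
    final String path;
    SomeFileInformation(String path) { this.path = path; }
}

/** The callback interface as written in the issue description (the "..." is left out). */
interface NameNodeCallback {
    void onFileCreate(SomeFileInformation f);
    void onFileClose(SomeFileInformation f);
    void onFileDelete(SomeFileInformation f);
}

/** Hypothetical dispatcher: holds the listeners and notifies them off the caller's thread. */
final class CallbackDispatcher {
    private final List<NameNodeCallback> listeners;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    CallbackDispatcher(List<NameNodeCallback> listeners) {
        this.listeners = new ArrayList<>(listeners);
    }

    /** Instantiates listeners from a comma-separated list of class names (assumed option format). */
    static List<NameNodeCallback> loadListeners(String classNames) {
        List<NameNodeCallback> loaded = new ArrayList<>();
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                Class<?> cls = Class.forName(trimmed);
                loaded.add((NameNodeCallback) cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("Cannot load listener " + trimmed, e);
            }
        }
        return loaded;
    }

    /** Queues an onFileCreate notification for every listener and returns immediately. */
    void fileCreated(String path) {
        SomeFileInformation info = new SomeFileInformation(path);
        for (NameNodeCallback listener : listeners) {
            executor.submit(() -> {
                try {
                    listener.onFileCreate(info);
                } catch (RuntimeException e) {
                    // A faulty listener must not disturb the thread that reported the event.
                }
            });
        }
    }

    /** Stops accepting events and waits briefly for queued notifications to drain. */
    void shutdown() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Running the listeners on a single-threaded executor keeps notifications ordered and keeps a slow listener from blocking the thread that reported the event, but it does not isolate the host JVM from listener memory use or crashes, which is the concern raised elsewhere in this thread.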
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005150#comment-13005150 ]

Mikhail Yakshin commented on HDFS-1742:
---------------------------------------

>> A user creates a class that implements this method and loads it somehow ...
> By user, do you mean end users or cluster admins?

Cluster admins or, more likely, Hadoop developers. I'd like it to act just like the pluggable Scheduler interface: a few well-known and maintained varieties exist, and 99.9% of Hadoop users/admins just plug in whatever scheduler they see fit.

>> JobTracker already includes pluggable Scheduler interface ...
> {{JobInProgressListener}} is used in {{JobTracker}} internally but not for
> running end user codes.

It depends on how do you define "internally". In fact, the pluggable Scheduler interface extensively uses the {{JobInProgressListener}} infrastructure; for example, [FairScheduler|https://github.com/apache/hadoop/blob/trunk/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/FairScheduler.java#L154] defines its own custom JobInProgressListener.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005138#comment-13005138 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1742:
----------------------------------------------

> A user creates a class that implements this method and loads it somehow ...

By user, do you mean end users or cluster admins?

> JobTracker already includes pluggable Scheduler interface ...

{{JobInProgressListener}} is used in {{JobTracker}} internally but not for running end user codes.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005137#comment-13005137 ]

Doug Cutting commented on HDFS-1742:
------------------------------------

Ha! I now see that this is what Alejandro already suggested!
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005135#comment-13005135 ]

Doug Cutting commented on HDFS-1742:
------------------------------------

I wonder if, rather than callbacks, this might look something like an RSS feed for changes. An application could request the N edits immediately after a given timestamp. Each edit returned would include a timestamp. Edits could be filtered by the server to particular directory paths. The server would only return edits to files and directories that the client is permitted to see.

The server would implement this by retaining edit logs for, e.g., 24 hours. Requests for timestamps before this would result in an error.

This service might only be provided by the secondary namenode, to reduce the load on the namenode.
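The pull-style change feed described in Doug Cutting's comment (up to N edits after a timestamp, filtered by directory, with a bounded retention period) can be illustrated with a small in-memory sketch. It assumes hypothetical names throughout (`Edit`, `EditFeed`, `editsAfter`), passes the current time in explicitly, and leaves out the permission check the comment asks for; it is not how HDFS stores or serves its edit log.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One change record; the fields are illustrative, not the HDFS edit log format. */
final class Edit {
    final long timestampMs;
    final String path;
    final String op;
    Edit(long timestampMs, String path, String op) {
        this.timestampMs = timestampMs;
        this.path = path;
        this.op = op;
    }
}

/** Hypothetical feed that retains edits for a fixed period and serves them on request. */
final class EditFeed {
    private final long retentionMs;
    private final Deque<Edit> edits = new ArrayDeque<>(); // oldest first

    EditFeed(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Records an edit and drops entries that have aged out of the retention period. */
    synchronized void append(Edit e) {
        edits.addLast(e);
        long cutoff = e.timestampMs - retentionMs;
        while (!edits.isEmpty() && edits.peekFirst().timestampMs < cutoff) {
            edits.removeFirst();
        }
    }

    /** Returns at most maxEdits edits newer than sinceMs that lie under the directory dir. */
    synchronized List<Edit> editsAfter(long sinceMs, long nowMs, String dir, int maxEdits) {
        if (sinceMs < nowMs - retentionMs) {
            // The requested start point may already have been discarded, so fail loudly.
            throw new IllegalArgumentException("timestamp is older than the retained edits");
        }
        List<Edit> result = new ArrayList<>();
        for (Edit e : edits) {
            if (result.size() == maxEdits) {
                break;
            }
            if (e.timestampMs > sinceMs && isUnder(e.path, dir)) {
                result.add(e);
            }
        }
        return result;
    }

    /** True if path is dir itself or lies below it, so "/input2" does not match "/input". */
    private static boolean isUnder(String path, String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        return path.equals(dir) || path.startsWith(prefix);
    }
}
```

A client would remember the timestamp of the last edit it received and pass it as `sinceMs` on its next request. Because several edits can share one timestamp, a real feed would probably need a monotonically increasing sequence number to avoid skipping edits; that refinement is outside this sketch.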
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005134#comment-13005134 ]

Mikhail Yakshin commented on HDFS-1742:
---------------------------------------

I disagree about complete isolation of the callback system process. A callback system implementation is *not* end-user code in the way map-reduce jobs are, and thus can be fairly reliable. Updating this code requires administrative privileges and a restart of the NameNode.

JobTracker already includes pluggable Scheduler interface ([HADOOP-3412]) that introduces external classes into the main JobTracker JVM (albeit the choice of classes is fairly limited). There is a pluggable [JobInProgressListener|http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/JobTracker.html#addJobInProgressListener(org.apache.hadoop.mapred.JobInProgressListener)] that implements exactly the same idea: a listener that receives events.

Thus, I see no harm in having no listeners by default, plus a sample listener implementation that does basic logging of events to a file or some sort of queue.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005113#comment-13005113 ]

Allen Wittenauer commented on HDFS-1742:
----------------------------------------

I mean specifically that the current HDFS master processes should know absolutely nothing about callbacks even existing in the system. Users won't talk to them about callbacks, they won't execute them, etc, etc. This whole callback system must be a completely separate daemon so that users can't compromise HDFS in any way/shape/form.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005105#comment-13005105 ]

Uma Maheswara Rao G commented on HDFS-1742:
-------------------------------------------

I also agree with you, Allen. You mean that the user's event listener code will be executed in a separate process? Please correct me if I am wrong.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005092#comment-13005092 ]

Allen Wittenauer commented on HDFS-1742:
----------------------------------------

The namenode, secondary nn, etc should never be running user code directly. It won't scale and it will introduce an incredible amount of instability.

It would be much better if this was designed in such a way that it was a completely separate process (or gang of processes). This process could be fed by receiving the edits stream similar to how Checkpoint and Backup nodes work today.
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004979#comment-13004979 ]

Uma Maheswara Rao G commented on HDFS-1742:
-------------------------------------------

This is a very good feature. These events/callbacks could also be raised when space fills up on the NameNode, when a Datanode unregisters from the NameNode, when a Datanode registers with the NameNode, etc. Based on these events, an application can raise alarms to the administrator.

For HDFS-1594 we can also use the event/callback feature (when the NameNode is going into safemode because of disk space, it can raise an event).
[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004950#comment-13004950 ]

Alejandro Abdelnur commented on HDFS-1742:
------------------------------------------

Agree, this would be a very nice feature to have.

Oozie Coordinator (Mikhail, Oozie coordinator does what you describe you are building) currently polls HDFS to find new files to process. This polling can be heavy in the case of several/large Oozie coordinator jobs (large meaning a large number of input dependencies).

This listener should also be available in the secondary namenode. This would allow offloading the notifications from the primary namenode, thus not putting extra load on it.

A default implementation of this listener could be an HTTP RSS-feed-like endpoint that remembers the last # minutes and supports the 'if-modified-since' HTTP header; if the header is present, it returns only notifications newer than the timestamp. It could also support a path prefix filter. (Note that this implementation does not ensure notification if the # time window is missed by the caller, thus the caller may still have to do some lazy polling.)
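The selection logic behind the HTTP endpoint Alejandro Abdelnur describes (a bounded window of recent notifications, an optional 'if-modified-since' timestamp, and a path prefix filter) might look roughly like this. The sketch leaves out the HTTP server itself and only shows how a handler could choose what to return; `Notification`, `NotificationWindow`, and `select` are hypothetical names, and the header is assumed to arrive in the standard HTTP date format.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** One notification; the fields are illustrative. */
final class Notification {
    final Instant time;
    final String path;
    Notification(Instant time, String path) {
        this.time = time;
        this.path = path;
    }
}

/** Hypothetical store that remembers only the most recent window of notifications. */
final class NotificationWindow {
    private final Duration window;
    private final List<Notification> recent = new ArrayList<>();

    NotificationWindow(Duration window) {
        this.window = window;
    }

    /** Adds a notification and forgets everything older than the window before it. */
    synchronized void add(Notification n) {
        recent.add(n);
        Instant cutoff = n.time.minus(window);
        recent.removeIf(old -> old.time.isBefore(cutoff));
    }

    /**
     * Picks the notifications to return for one request.
     * ifModifiedSince is the raw header value (e.g. "Tue, 15 Mar 2011 10:00:00 GMT"),
     * or null when the client did not send the header.
     */
    synchronized List<Notification> select(String ifModifiedSince, String pathPrefix) {
        Instant since = (ifModifiedSince == null)
                ? Instant.MIN
                : ZonedDateTime.parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        List<Notification> result = new ArrayList<>();
        for (Notification n : recent) {
            if (n.time.isAfter(since) && n.path.startsWith(pathPrefix)) {
                result.add(n);
            }
        }
        return result;
    }
}
```

As the comment notes, a caller that polls less often than the window length can miss notifications, because `add` silently discards old entries; such a caller would still need an occasional full check of the paths it is waiting for.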