[jira] [Commented] (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2014-10-08 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164530#comment-14164530
 ] 

Colin Patrick McCabe commented on HDFS-1742:


HDFS-6634 implemented a way to listen for filesystem events.  Check it out!

> Provide hooks / callbacks to execute some code based on events happening in 
> HDFS (file / directory creation, opening, closing, etc)
> ---
>
> Key: HDFS-1742
> URL: https://issues.apache.org/jira/browse/HDFS-1742
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: namenode
> Reporter: Mikhail Yakshin
> Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on
> the data that appears in HDFS: for example, we have a job that works on a day's
> worth of data and creates output in {{/output//MM/DD}}. For input, it
> should wait for a directory with externally uploaded data,
> {{/input//MM/DD}}, to appear, and also wait for the previous day's data to
> appear, i.e. {{/output//MM/DD-1}}.
> Obviously, one possible solution is to poll once in a while for the
> files/directories we're waiting for, but generally it's a bad solution. A
> better one is something like a file alteration monitor or [inode activity
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented
> in Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be
> executed on every major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simplistic implementation as follows: the NN defines some interfaces
> that implement a callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> It might be possible to create a class that implements this interface and load
> it somehow (for example, using an extra jar in the classpath) in the NameNode's
> JVM. The NameNode includes a configuration option that specifies the names of
> such class(es); the NameNode then instantiates them and calls their methods
> (in a separate thread) on every valid event.
> There would be a couple of ready-made pluggable implementations of such a
> class, most likely distributed as contrib. The default NameNode process would
> stay the same without any visible differences.
> Hadoop's JobTracker already extensively uses the same paradigm with pluggable 
> Scheduler interfaces, such as [Fair 
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
>  [Capacity 
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
>  [Dynamic 
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
>  etc. These are also classes that load and run inside the JobTracker's
> context; a few relatively trusted varieties exist, distributed as contrib and
> purely optional, to be enabled by the cluster admin.
> This would allow systems such as the one I described at the beginning to be
> implemented without polling.
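
The mechanism proposed in the description can be sketched roughly as follows.
This is only an illustration under stated assumptions, not NameNode code: a
plain path string stands in for the unspecified SomeFileInformation type, and
RecordingCallback, CallbackDispatcher and the comma-separated class-name list
are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical interface in the shape the description proposes; a path
// string replaces the unspecified SomeFileInformation type.
interface NameNodeCallback {
    void onFileCreate(String path);
    void onFileClose(String path);
    void onFileDelete(String path);
}

// Illustrative implementation that only records the events it receives.
class RecordingCallback implements NameNodeCallback {
    static final List<String> SEEN =
            Collections.synchronizedList(new ArrayList<>());
    public void onFileCreate(String path) { SEEN.add("create " + path); }
    public void onFileClose(String path)  { SEEN.add("close " + path); }
    public void onFileDelete(String path) { SEEN.add("delete " + path); }
}

class CallbackDispatcher {
    private final List<NameNodeCallback> callbacks = new ArrayList<>();
    // One daemon worker thread, so callbacks never run on the caller's thread.
    private final ExecutorService worker =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "callback-worker");
                t.setDaemon(true);
                return t;
            });

    // classNames: comma-separated class names, as a config option might hold.
    CallbackDispatcher(String classNames) {
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                callbacks.add((NameNodeCallback) Class.forName(trimmed)
                        .getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(
                        "Cannot load callback " + trimmed, e);
            }
        }
    }

    // Queues the event for every registered callback and returns immediately.
    void fileCreated(String path) {
        for (NameNodeCallback cb : callbacks) {
            worker.execute(() -> cb.onFileCreate(path));
        }
    }

    // Stops accepting events; returns true if queued callbacks finished in time.
    boolean shutdownAndWait(long millis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The single worker thread keeps a slow callback from blocking the code that
raised the event, at the cost of events queuing up behind it.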



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-15 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006843#comment-13006843
 ] 

Suresh Srinivas commented on HDFS-1742:
---

+1 for using some kind of tool on the edit log to do this, as many have
suggested. Please see HDFS-1448, which added a tool for viewing the edit log.
A tool could be built around that.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-12 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006148#comment-13006148
 ] 

dhruba borthakur commented on HDFS-1742:


I agree that this is a useful feature; we have many processes that watch the
filesystem namespace and do various things when files/directories appear in
the HDFS namespace. However, making the fsedit logging invoke user-specified
callbacks seems problematic. What happens when the callback does not return
within a specific period of time? What locks can the namenode keep across these
callbacks? Who will retry callbacks if the callback returned "failure"?
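
To make the timeout and retry questions concrete, here is a minimal sketch of
one possible policy: each callback runs on a separate thread with a time
budget and a bounded number of attempts. GuardedCallbackRunner and its
parameters are hypothetical names for this example, not anything in the
namenode. The locking question is left open; the sketch simply assumes the
caller invokes run() after releasing any locks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a callback with a time budget and bounded retries.
class GuardedCallbackRunner {
    // Daemon threads, so a stuck callback cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "guarded-callback");
        t.setDaemon(true);
        return t;
    });
    private final long timeoutMillis;
    private final int maxAttempts;

    GuardedCallbackRunner(long timeoutMillis, int maxAttempts) {
        this.timeoutMillis = timeoutMillis;
        this.maxAttempts = maxAttempts;
    }

    // Returns the attempt number that succeeded, or -1 if every attempt
    // failed, threw, or timed out. The callback returns true for success.
    int run(Callable<Boolean> callback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<Boolean> result = pool.submit(callback);
            try {
                if (Boolean.TRUE.equals(
                        result.get(timeoutMillis, TimeUnit.MILLISECONDS))) {
                    return attempt;
                }
                // Callback reported failure: fall through and retry.
            } catch (TimeoutException e) {
                result.cancel(true); // interrupt a callback that overran
            } catch (ExecutionException e) {
                // Callback threw: treat it as a failure and retry.
            } catch (InterruptedException e) {
                result.cancel(true);
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

Even with such a guard, the caller still blocks for up to timeoutMillis times
maxAttempts, which is part of why an external consumer may be preferable.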

I would rather vote that the HDFS namenode log all these changes into a file in
a well-defined format (aka HDFS-1179). This is the core building block that is
needed by an external application to build a notification mechanism, or
publish-subscribe software, etc.




[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005365#comment-13005365
 ] 

Alejandro Abdelnur commented on HDFS-1742:
--

I agree 300% that user code MUST NOT run in the Hadoop services.

Just to make it clear, my suggestion was to have a service interface, like the
JT has the Scheduler interface, that can be used to augment server behavior.
Only the cluster administrators could set this up. Out of the box, Hadoop could
bundle 1 or 2 implementations. People could still implement their own in case
they have special requirements. Or just use NIL, which would be today's
behavior.
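
A minimal sketch of that idea, assuming a single administrator-set
configuration value: when nothing is configured the NIL implementation is
used, so behavior is unchanged. The interface, the CountingService example and
the configuration key are all hypothetical names made up for illustration.

```java
import java.util.Map;

// Hypothetical service interface, analogous in spirit to a pluggable scheduler.
interface NamespaceEventService {
    void fileCreated(String path);

    // NIL does nothing, which corresponds to today's behavior.
    NamespaceEventService NIL = path -> { };
}

// Illustrative bundled implementation that just counts events.
class CountingService implements NamespaceEventService {
    int created;
    public void fileCreated(String path) { created++; }
}

class NamespaceEventServices {
    // Hypothetical configuration key; not a real Hadoop property.
    static final String KEY = "dfs.namenode.event.service.class";

    // Picks the administrator-configured implementation, or NIL if unset.
    static NamespaceEventService fromConfig(Map<String, String> conf) {
        String className = conf.get(KEY);
        if (className == null || className.trim().isEmpty()) {
            return NamespaceEventService.NIL;
        }
        try {
            return Class.forName(className.trim())
                    .asSubclass(NamespaceEventService.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Bad value for " + KEY + ": " + className, e);
        }
    }
}
```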







[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Mikhail Yakshin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005327#comment-13005327
 ] 

Mikhail Yakshin commented on HDFS-1742:
---

I seriously doubt that making pubsub-like event transmission the *only*
available option is the way to go. The pubsub model is a cool thing, but a
proper implementation of it requires a full-blown messaging subsystem akin to
the ones that implement [JMS|http://en.wikipedia.org/wiki/Java_Message_Service],
such as [ActiveMQ|http://activemq.apache.org/]. In turn, that means a whole
other system, matching Hadoop in complexity (it includes daemons, at least a
JMS broker, and it requires non-trivial configuration and deployment), being
installed and made mandatory by Hadoop.

The only thing I'm arguing for is making this thing *modular*, i.e. making a
JMS pubsub producer *an option*, but not the *only* option. Other options
might be simple local file logging, sending events across the network,
plugging in some local workflow management system, etc.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005280#comment-13005280
 ] 

Allen Wittenauer commented on HDFS-1742:


Heck, you could build a trivial/PoC version based upon the HDFS audit log in no
time flat.
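
As a rough idea of what such a proof of concept might look like, the sketch
below pulls key=value fields out of one audit log line and reports paths
created under a watched prefix. It assumes the line carries whitespace-separated
fields such as cmd= and src= after an "audit:" marker; the exact layout depends
on the Hadoop version and logging configuration, and paths containing spaces
are not handled. AuditLine is a hypothetical helper name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for a proof of concept; not part of Hadoop.
class AuditLine {
    // Extracts the key=value fields that follow the "audit:" marker.
    static Map<String, String> fields(String line) {
        Map<String, String> out = new LinkedHashMap<>();
        int marker = line.indexOf("audit:");
        if (marker < 0) {
            return out; // not an audit line
        }
        String rest = line.substring(marker + "audit:".length()).trim();
        for (String token : rest.split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                out.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return out;
    }

    // Returns the path if the line records a create or mkdirs under prefix.
    static Optional<String> createdUnder(String line, String prefix) {
        Map<String, String> f = fields(line);
        String cmd = f.get("cmd");
        String src = f.get("src");
        boolean creation = "create".equals(cmd) || "mkdirs".equals(cmd);
        if (creation && src != null && src.startsWith(prefix)) {
            return Optional.of(src);
        }
        return Optional.empty();
    }
}
```

A watcher process could tail the log file, feed each line to createdUnder(),
and start the waiting job when the expected directory shows up.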



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005217#comment-13005217
 ] 

Todd Lipcon commented on HDFS-1742:
---

Hey folks. I think people generally accept that it would be nice to be able to 
have an inotify-like interface on top of HDFS. However I don't think the 
proposed implementation of doing this inside the NN is a good idea for the 
following reasons:
- it adds "less trusted" code running in the same JVM as the NN, which could 
crash it, use up memory, etc.
- it adds load to the NN, which is already a scalability limit on large clusters
- it will require a NN restart (or fragile classloader tricks) to reload the 
set of hooks

I think the right way forward here is to have some kind of service subscribe to 
the NN edit logs and then publish events to subscribers. This would allow the 
"pubsub" service to run on a separate machine and not impact the NN in any way.

Monitoring/alerting capability based on lifecycle events in the NN does make
sense to me, though, e.g. a trigger when the NN enters or exits safemode. These
tend to be lower-load, infrequent events, and pluggable listeners would be
plenty useful. See HADOOP-5640 for an interface like this.
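
The separate "pubsub" service suggested above could, in its simplest form, look
something like this sketch. It assumes some external reader (not shown) turns
edit log entries into an operation name plus a path; NamespaceEventBus and the
prefix-based matching are hypothetical choices for the example, and delivery
here is in-process rather than over the network.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical in-process event bus running outside the NameNode.
class NamespaceEventBus {
    // Listeners grouped by the path prefix they are interested in.
    private final Map<String, List<Consumer<String>>> subscribers =
            new ConcurrentHashMap<>();

    void subscribe(String pathPrefix, Consumer<String> listener) {
        subscribers
                .computeIfAbsent(pathPrefix, k -> new CopyOnWriteArrayList<>())
                .add(listener);
    }

    // Called by the (assumed) edit log reader for each operation it sees.
    // Returns how many listeners received the event.
    int publish(String op, String path) {
        int delivered = 0;
        for (Map.Entry<String, List<Consumer<String>>> e
                : subscribers.entrySet()) {
            if (path.startsWith(e.getKey())) {
                for (Consumer<String> listener : e.getValue()) {
                    listener.accept(op + " " + path);
                    delivered++;
                }
            }
        }
        return delivered;
    }
}
```

Because the bus only consumes what the reader hands it, a slow or crashing
subscriber affects this service and not the NN.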

> Provide hooks / callbacks to execute some code based on events happening in 
> HDFS (file / directory creation, opening, closing, etc)
> ---
>
> Key: HDFS-1742
> URL: https://issues.apache.org/jira/browse/HDFS-1742
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Reporter: Mikhail Yakshin
>  Labels: features, polling
>
> We're working on a system that runs various Hadoop job continuously, based on 
> the data that appears in HDFS: for example, we have a job that works on day's 
> worth of data and creates output in {{/output//MM/DD}}. For input, it 
> should wait for directory with externally uploaded data as 
> {{/input//MM/DD}} to appear, and also wait for previous day's data to 
> appear, i.e. {{/output//MM/DD-1}}.
> Obviously, one of the possible solutions is polling once in a while for 
> files/directories we're waiting for, but generally it's a bad solution. The 
> better one is something like file alteration monitor or [inode activity 
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as ones implemented in 
> Linux filesystems.
> Basic idea is that one can specify (inject) some code that will be executed 
> on every major event happening in HDFS, such as:
> * File created / open
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see simplistic implementation as following: NN defines some interfaces that 
> implement callback/hook mechanism - i.e. something like:
> {code}
> interface NameNodeCallback {
> public void onFileCreate(SomeFileInformation f);
> public void onFileClose(SomeFileInformation f);
> public void onFileDelete(SomeFileInformation f);
> ...
> }
> {code}
> It might be possible to creates a class that implements this method and load 
> it somehow (for example, using an extra jar in classpath) in NameNode's JVM. 
> NameNode includes a configuration option that specifies names of such 
> class(es) - then NameNode instantiates them and calls methods from them (in a 
> separate thread) on every valid event happening.
> There would be a couple of ready-made pluggable implementations of such a 
> class that would be most likely distributed as contrib. Default NameNode's 
> process would stay the same without any visible differences.
> Hadoop's JobTracker already extensively uses the same paradigm with pluggable 
> Scheduler interfaces, such as [Fair 
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
>  [Capacity 
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
>  [Dynamic 
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
>  etc. These are also classes that load and run inside JobTracker's 
> context; a few relatively trusted varieties exist, they're distributed as 
> contrib, and enabling them is purely up to the cluster admin.
> This would allow systems such as I've described in the beginning to be 
> implemented without polling.
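
The loading scheme described above (class names taken from a configuration option, instances created by the NameNode, methods called on a separate thread) can be sketched roughly as follows. This is only an illustration under stated assumptions: `CallbackDispatcher`, `RecordingCallback`, the comma-separated class list, and the use of a plain path string in place of the unspecified `SomeFileInformation` are all hypothetical and are not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical callback interface modelled on the NameNodeCallback idea.
// Events carry a plain path string because SomeFileInformation is unspecified.
interface NameNodeCallback {
    void onFileCreate(String path);
    void onFileClose(String path);
    void onFileDelete(String path);
}

// Hypothetical sample callback that only records events in an in-memory queue.
final class RecordingCallback implements NameNodeCallback {
    static final Queue<String> EVENTS = new ConcurrentLinkedQueue<>();
    @Override public void onFileCreate(String path) { EVENTS.add("create " + path); }
    @Override public void onFileClose(String path) { EVENTS.add("close " + path); }
    @Override public void onFileDelete(String path) { EVENTS.add("delete " + path); }
}

// Hypothetical dispatcher: instantiates callbacks by class name and invokes
// them on one background thread, so callers never wait for callback code.
final class CallbackDispatcher {
    private final List<NameNodeCallback> callbacks = new ArrayList<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // classNames stands in for a comma-separated configuration value (assumed format).
    CallbackDispatcher(String classNames) throws ReflectiveOperationException {
        for (String name : classNames.split(",")) {
            if (name.trim().isEmpty()) {
                continue;
            }
            Class<? extends NameNodeCallback> cls =
                    Class.forName(name.trim()).asSubclass(NameNodeCallback.class);
            callbacks.add(cls.getDeclaredConstructor().newInstance());
        }
    }

    void fileCreated(String path) { dispatch(cb -> cb.onFileCreate(path)); }
    void fileClosed(String path) { dispatch(cb -> cb.onFileClose(path)); }
    void fileDeleted(String path) { dispatch(cb -> cb.onFileDelete(path)); }

    // Queues the event; the single worker thread delivers events in order.
    private void dispatch(Consumer<NameNodeCallback> event) {
        executor.execute(() -> {
            for (NameNodeCallback cb : callbacks) {
                try {
                    event.accept(cb);
                } catch (RuntimeException e) {
                    // One failing callback must not stop the others.
                    System.err.println("callback failed: " + e);
                }
            }
        });
    }

    // Stops accepting events and waits for queued ones to be delivered.
    boolean shutdown(long timeoutMillis) throws InterruptedException {
        executor.shutdown();
        return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

A caller would construct the dispatcher once with the configured class names, call `fileCreated(...)` and friends as events occur, and call `shutdown(...)` on exit. The single worker thread keeps event order but means one slow callback delays all later events, which is one of the trade-offs debated in the comments.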

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005158#comment-13005158
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1742:
--

> Cluster admins or, more likely, Hadoop developers. ...

Sounds good.  I suggest you update the description to avoid confusion.

> It depends on how do you define "internally". ...

I actually mean the same as you suggested, i.e. it is not for end users to 
provide a {{JobInProgressListener}} class.

> Provide hooks / callbacks to execute some code based on events happening in 
> HDFS (file / directory creation, opening, closing, etc)
> ---
>
> Key: HDFS-1742
> URL: https://issues.apache.org/jira/browse/HDFS-1742
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Reporter: Mikhail Yakshin
>  Labels: features, polling
>
> We're working on a system that runs various Hadoop job continuously, based on 
> the data that appears in HDFS: for example, we have a job that works on day's 
> worth of data and creates output in {{/output//MM/DD}}. For input, it 
> should wait for directory with externally uploaded data as 
> {{/input//MM/DD}} to appear, and also wait for previous day's data to 
> appear, i.e. {{/output//MM/DD-1}}.
> Obviously, one of the possible solutions is polling once in a while for 
> files/directories we're waiting for, but generally it's a bad solution. The 
> better one is something like file alteration monitor or [inode activity 
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as ones implemented in 
> Linux filesystems.
> Basic idea is that one can specify (inject) some code that will be executed 
> on every major event happening in HDFS, such as:
> * File created / open
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see simplistic implementation as following: NN defines some interfaces that 
> implement callback/hook mechanism - i.e. something like:
> {code}
> interface NameNodeCallback {
> public void onFileCreate(SomeFileInformation f);
> public void onFileClose(SomeFileInformation f);
> public void onFileDelete(SomeFileInformation f);
> ...
> }
> {code}
> A user creates a class that implements this method and loads it somehow (for 
> example, using an extra jar in classpath) in NameNode's JVM. NameNode 
> includes a configuration option that specifies names of such class(es) - then 
> NameNode instantiates them and calls methods from them (in a separate thread) 
> on every valid event happening.
> This would allow systems such as I've described in the beginning to be 
> implemented without polling.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Mikhail Yakshin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005150#comment-13005150
 ] 

Mikhail Yakshin commented on HDFS-1742:
---

>> A user creates a class that implements this method and loads it somehow ...
> By user, do you mean end users or cluster admins?

Cluster admins or, more likely, Hadoop developers. I'd like it to act just as a 
pluggable Scheduler interface: a few well-known and maintained varieties exist, 
99.9% of Hadoop users/admins just plug in whatever scheduler they see fit.

>> JobTracker already includes pluggable Scheduler interface ...
> {{JobInProgressListener}} is used in {{JobTracker}} internally but not for 
> running end user codes.

It depends on how you define "internally". In fact, the pluggable Scheduler 
interface extensively uses the {{JobInProgressListener}} infrastructure; for 
example, 
[FairScheduler|https://github.com/apache/hadoop/blob/trunk/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/FairScheduler.java#L154]
 defines its own custom JobInProgressListener.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005138#comment-13005138
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1742:
--

> A user creates a class that implements this method and loads it somehow ...

By user, do you mean end users or cluster admins?


> JobTracker already includes pluggable Scheduler interface ...

{{JobInProgressListener}} is used in {{JobTracker}} internally but not for 
running end-user code.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005137#comment-13005137
 ] 

Doug Cutting commented on HDFS-1742:


Ha!  I now see that this is what Alejandro already suggested!



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005135#comment-13005135
 ] 

Doug Cutting commented on HDFS-1742:


I wonder if, rather than callbacks, this might look something like an RSS feed 
for changes.  An application could request the N edits immediately after a 
given timestamp.  Each edit returned would include a timestamp.  Edits could be 
filtered by the server to particular directory paths.  The server would only 
return edits to files and directories that the client is permitted to see.

The server would implement this by retaining edit logs for, e.g., 24 hours.  
Requests for timestamps before this would result in an error.  This service 
might only be provided by the secondary namenode, to reduce the load on the 
namenode.
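
As a rough illustration of the feed idea above, the following sketch keeps edits for a retention window, returns at most N edits after a given timestamp, filters by path prefix, and rejects timestamps older than the window. Everything here is an assumption made for illustration: `Edit`, `EditFeed`, and their fields are hypothetical names, the store is a simple in-memory list rather than real edit logs, and the per-client permission check mentioned above is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of one namespace change.
final class Edit {
    final long timestampMillis;
    final String path;
    final String type; // e.g. "CREATE", "CLOSE", "DELETE"

    Edit(long timestampMillis, String path, String type) {
        this.timestampMillis = timestampMillis;
        this.path = path;
        this.type = type;
    }
}

// Hypothetical in-memory feed that retains edits for a fixed window.
final class EditFeed {
    private final long retentionMillis;
    private final List<Edit> edits = new ArrayList<>(); // appended in timestamp order

    EditFeed(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Appends an edit and drops edits that fell out of the retention window.
    synchronized void append(Edit e) {
        edits.add(e);
        long cutoff = e.timestampMillis - retentionMillis;
        edits.removeIf(old -> old.timestampMillis < cutoff);
    }

    // Returns up to maxEdits edits strictly after sinceMillis under pathPrefix.
    // nowMillis is passed in explicitly to keep the sketch deterministic.
    synchronized List<Edit> editsAfter(long sinceMillis, String pathPrefix,
                                       int maxEdits, long nowMillis) {
        if (sinceMillis < nowMillis - retentionMillis) {
            throw new IllegalArgumentException(
                    "timestamp is older than the retention window");
        }
        List<Edit> result = new ArrayList<>();
        for (Edit e : edits) {
            if (result.size() == maxEdits) {
                break;
            }
            if (e.timestampMillis > sinceMillis && e.path.startsWith(pathPrefix)) {
                result.add(e);
            }
        }
        return result;
    }
}
```

A client would remember the timestamp of the last edit it received and pass it as `sinceMillis` on the next request; an `IllegalArgumentException` tells it that it fell behind the window and has to resynchronize some other way, for example by listing the directory.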



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Mikhail Yakshin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005134#comment-13005134
 ] 

Mikhail Yakshin commented on HDFS-1742:
---

I disagree about complete isolation of the callback system process. A callback 
system implementation is *not* end-user code, such as map-reduce jobs are, and 
thus can be fairly reliable. Updating this code requires administrative 
privileges and a restart of the NameNode.

JobTracker already includes a pluggable Scheduler interface ([HADOOP-3412]) that 
introduces external classes into the main JobTracker JVM (albeit the choice of 
classes is fairly limited). There is a pluggable 
[JobInProgressListener|http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/JobTracker.html#addJobInProgressListener(org.apache.hadoop.mapred.JobInProgressListener)]
 that implements exactly the same idea: a listener that receives events.

Thus, I see no harm in having no listeners by default, plus a sample listener 
implementation that does basic logging of events to a file or some sort of 
queue.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005113#comment-13005113
 ] 

Allen Wittenauer commented on HDFS-1742:


I mean specifically that the current HDFS master processes should know 
absolutely nothing about callbacks even existing in the system.  Users won't 
talk to it about them, it won't execute them, etc, etc.  This whole callback 
system must be a completely separate daemon so that users can't compromise HDFS 
in any way/shape/form.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005105#comment-13005105
 ] 

Uma Maheswara Rao G commented on HDFS-1742:
---

I also agree with you, Allen.
You mean the user's event listener code will be executed in a separate process. 
Please correct me if I am wrong.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-10 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005092#comment-13005092
 ] 

Allen Wittenauer commented on HDFS-1742:


The namenode, secondary nn, etc should never be running user code directly.  It 
won't scale and it will introduce an incredible amount of instability.  

It would be much better if this was designed in such a way that it was a 
completely separate process (or gang of processes).  This process could be fed 
by receiving the edits stream similar to how Checkpoint and Backup nodes work 
today.



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-09 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004979#comment-13004979
 ] 

Uma Maheswara Rao G commented on HDFS-1742:
---

This is a very good feature. 
These events/callbacks could also be raised when space fills up in the NameNode, 
on Datanode unregistration from the NameNode, on Datanode registration with the 
NameNode, etc.
Based on these events, an application can raise alarms to the administrator.

We can also use the event/callback feature for HDFS-1594 (when the NameNode 
goes into safemode because of disk space, it can raise an event).


  



[jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

2011-03-09 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004950#comment-13004950
 ] 

Alejandro Abdelnur commented on HDFS-1742:
--

Agreed, this would be a very nice feature to have.

Oozie Coordinator currently polls HDFS to find new files to process. (Mikhail, 
Oozie Coordinator does what you describe you are building.)

This polling can be expensive when there are many Oozie Coordinator jobs, or 
large ones (large meaning a large number of input dependencies).

This listener should also be available in the secondary namenode. That would 
allow notifications to be offloaded from the primary namenode, so they put no 
extra load on it.

A default implementation of this listener could be an HTTP endpoint, similar 
to an RSS feed, that remembers the notifications of the last # minutes and 
supports the 'If-Modified-Since' HTTP header: if the header is present, it 
returns only notifications newer than that timestamp. It could also support a 
path prefix filter. (Note that this implementation does not guarantee 
notification if the caller misses the # minute time window, so the caller may 
still have to do some lazy polling.)
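
The buffering behind such an endpoint could look roughly like the sketch below. It covers only the in-memory part (time window, "newer than" query, path prefix filter) and leaves out the HTTP layer. FsEvent, RecentEventFeed, and all method names are hypothetical, and the caller passes the current time explicitly so the behaviour is easy to check.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical notification record: event type, affected path, and time.
final class FsEvent {
    final String type;
    final String path;
    final Instant time;
    FsEvent(String type, String path, Instant time) {
        this.type = type;
        this.path = path;
        this.time = time;
    }
}

// In-memory buffer that a feed-like HTTP endpoint could sit on top of.
final class RecentEventFeed {
    private final Duration window;
    private final Deque<FsEvent> events = new ArrayDeque<>();

    RecentEventFeed(Duration window) { this.window = window; }

    // Records an event. Events are assumed to arrive in time order.
    synchronized void publish(FsEvent e) {
        events.addLast(e);
        evictBefore(e.time.minus(window));
    }

    // Mirrors an 'If-Modified-Since' request: returns events strictly newer
    // than `since` whose path starts with `pathPrefix`.
    synchronized List<FsEvent> eventsSince(Instant since, String pathPrefix, Instant now) {
        evictBefore(now.minus(window));
        List<FsEvent> result = new ArrayList<>();
        for (FsEvent e : events) {
            if (e.time.isAfter(since) && e.path.startsWith(pathPrefix)) {
                result.add(e);
            }
        }
        return result;
    }

    // False when `since` is older than the retained window: the caller may
    // have missed events and should fall back to polling.
    boolean covers(Instant since, Instant now) {
        return !since.isBefore(now.minus(window));
    }

    // Drops events older than the cutoff from the front of the queue.
    private void evictBefore(Instant cutoff) {
        while (!events.isEmpty() && events.peekFirst().time.isBefore(cutoff)) {
            events.removeFirst();
        }
    }
}
```

A client would remember the time of its last successful request, pass it as `since`, and treat a false result from covers(...) as the signal to do one round of the lazy polling mentioned above.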

