[ 
https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862722#comment-15862722
 ] 

Tibor Kiss edited comment on STORM-2355 at 2/12/17 10:13 AM:
-------------------------------------------------------------

An initial implementation for the 1.0.x branch can be found here: 
https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50

Note that I needed to lower the Guava version to 14.0.1 to be compatible with HDFS.
I have also bumped the Hadoop version to 2.7.3.

The implementation was tested with unit tests and in a three-node dockerized 
cluster, using Flux and a simple passthrough topology via Storm-Spout & Storm 
Bolt. With inotify the load on HDFS was reduced by 15%. Nonetheless, a more 
precise performance measurement would be needed in a non-dockerized environment.


> Storm-HDFS: inotify support
> ---------------------------
>
>                 Key: STORM-2355
>                 URL: https://issues.apache.org/jira/browse/STORM-2355
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-hdfs
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>             Fix For: 2.0.0, 1.1.0
>
>
> This is a proposal to implement inotify-based watch directory monitoring in 
> the Storm-HDFS Spout.
> *Motivation*
> Storm-HDFS currently polls the input directory using Hadoop's 
> {{FileSystem.listFiles}}. This operation is expensive since it returns the 
> block locations and all stat information of the files inside the watch 
> directory. Storm-HDFS currently uses only the Path of one element of the 
> returned list, which is inefficient.
> *Proposed improvement*
> Provide a way to monitor the input directory through HDFS's inotify API.
> In order to keep backward compatibility with the poll-based solution, I 
> propose a new class ({{HdfsDirectoryMonitor}}) which implements both the 
> inotify-based and the poll-based solution through an iterator. The user can 
> enable inotify-based monitoring through a configuration parameter.
> *Caveat*
> HDFS inotify is currently available only to the root user, but there is an 
> ongoing discussion in the Hadoop community about extending its support to 
> other users. See: HDFS-8940
> *Testing related changes*
> The {{TestHdfsSpout}} test case should be parameterized to check both the 
> poll-based and the inotify-based solution.
> *Further work*
> If the design is accepted, the poll-based solution could easily be improved 
> through {{HdfsDirectoryMonitor}} to properly use all the returned items from 
> the watch directory (similar to the inotify-based solution). Such an 
> improvement would reduce the number of calls made to {{FileSystem.listFiles}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
