[ https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862722#comment-15862722 ]
Tibor Kiss edited comment on STORM-2355 at 2/12/17 10:13 AM:
-------------------------------------------------------------

An initial implementation for the 1.0.x branch can be found here:
https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50
Note that I needed to lower the Guava version to 14.0.1 to be HDFS compatible.
I have also bumped the Hadoop version to 2.7.3.

The implementation was tested using unit tests and in a three-node dockerized cluster, using Flux and a simple passthrough topology via Storm-Spout & Storm Bolt.
Using inotify, the load on HDFS was reduced by 15%. Nonetheless, a more precise performance measurement would be needed in a non-dockerized environment.


  was (Author: tibor.k...@gmail.com):
    An initial implementation for the 1.0.x branch can be found here:
https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50
Note that I needed to lower the Guava version to 14.0.1 to be HDFS compatible.
I have also bumped the Hadoop version to 2.7.3.

> Storm-HDFS: inotify support
> ---------------------------
>
>                 Key: STORM-2355
>                 URL: https://issues.apache.org/jira/browse/STORM-2355
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-hdfs
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>             Fix For: 2.0.0, 1.1.0
>
>
> This is a proposal to implement inotify-based watch directory monitoring
> in the Storm-HDFS Spout.
> *Motivation*
> Storm-HDFS currently polls the input directory using Hadoop's
> {{FileSystem.listFiles}}. This operation is expensive since it returns the
> block locations and all stat information of the files inside the watch
> directory. Storm-HDFS currently uses only the Path of one element of the
> returned list, which is inefficient.
> *Proposed improvement*
> Provide a way to monitor the input directory through HDFS's inotify API.
> In order to maintain backward compatibility with the poll-based solution, I
> propose a new class ({{HdfsDirectoryMonitor}}) which implements both the
> inotify and poll-based solutions through an iterator. The user can enable
> inotify-based monitoring through a configuration parameter.
> *Caveat*
> HDFS inotify is currently only available to the root user, but there is an
> ongoing discussion in the Hadoop community to extend its support to other
> users. See: HDFS-8940
> *Testing related changes*
> The {{TestHdfsSpout}} test case should be parametrized to check both the
> poll and inotify-based solutions.
> *Further work*
> If the design is accepted, the poll-based solution could easily be improved
> through {{HdfsDirectoryMonitor}} to properly use all the returned items from
> the work directory (similar to the inotify-based solution). Such an
> improvement will reduce the number of calls made to
> {{FileSystem.listFiles}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)