[ https://issues.apache.org/jira/browse/STORM-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088143#comment-15088143 ]
ASF GitHub Bot commented on STORM-1199:
---------------------------------------

Github user roshannaik commented on a diff in the pull request:

    https://github.com/apache/storm/pull/936#discussion_r49128132

    --- Diff: external/storm-hdfs/README.md ---
    @@ -405,7 +410,123 @@ On worker hosts the bolt/trident-state code will use the keytab file with princi
     Namenode. This method is a little dangerous, as you need to ensure all workers have the keytab
     file at the same location, and you need to remember this as you bring up new hosts in the cluster.
    -## License
    +---
    +
    +# HDFS Spout
    +
    +The HDFS spout feeds data into Storm from files in an HDFS directory.
    +It actively monitors the directory and consumes any new files that appear in it.
    +The HDFS spout does not currently support Trident.
    +
    +**Important**: The HDFS spout assumes that the files made visible to it in the monitored directory
    +are NOT actively being written to. A file should be made visible to the spout only after it has been
    +completely written. This can be achieved either by writing the file to another directory and moving
    +it into the monitored directory once it is completely written, or by creating the file with a
    +'.ignore' suffix in the monitored directory and renaming it without the suffix after the data is
    +completely written. File names with a '.ignore' suffix are ignored by the spout.
    +
    +While the spout is actively consuming a file, it renames the file with a '.inprogress' suffix.
    +After all the contents of the file have been consumed, the file is moved to a configurable *done*
    +directory and the '.inprogress' suffix is dropped.
    +
    +**Concurrency**: If multiple spout instances are used in the topology, each instance consumes
    +a different file. Synchronization among spout instances is done using lock files created in a
    +(by default) '.lock' subdirectory under the monitored directory. A file with the same name
    +as the file being consumed (without the '.inprogress' suffix) is created in the lock directory.
    +Once the file is completely consumed, the corresponding lock file is deleted.
    +
    +**Recovery from failure**:
    +The spout also periodically records in the lock file how much of the file has been consumed.
    +If a spout instance crashes (or the topology is force killed), another spout can take over the
    +file and resume from the location recorded in the lock file.
    +
    +Certain error conditions (such as the spout crashing) can leave lock files behind without deleting
    +them. Such a stale lock file also indicates that the corresponding input file has not been
    +completely processed. When detected, ownership of such stale lock files is transferred to another
    +spout. The configuration 'hdfsspout.lock.timeout.sec' specifies the duration of inactivity after
    +which lock files are considered stale. For lock file ownership transfer to succeed, the HDFS
    +lease on the file (from the previous lock owner) must have expired. Spouts scan for stale lock
    +files before selecting the next file for consumption.
    +
    +**Lock on the *.lock* directory**:
    +HDFS spout instances create a *DIRLOCK* file in the '.lock' directory to coordinate certain
    +accesses to the '.lock' directory itself. A spout tries to create it when it needs access to the
    +'.lock' directory and then deletes it when done. In case of a topology crash or force kill, this
    +file may not get deleted.
    --- End diff --

    You are right, it should. I should reword it. Thanks!
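As an aside for readers of this thread, the sketch below (not part of the PR diff above) shows how a spout like the one described might be wired into a topology. It assumes the org.apache.storm package layout and a no-arg HdfsSpout constructor, both of which may differ from the API actually added by this pull request; PrintBolt is a throwaway helper defined inline, and the only configuration key taken from the README text is 'hdfsspout.lock.timeout.sec'. The spout's other required settings (HDFS URI, source/archive/bad-files directories, reader type) are omitted because their exact keys are defined by the spout itself.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.hdfs.spout.HdfsSpout;          // spout class added by this PR (package assumed)
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class HdfsSpoutTopologySketch {

    // Trivial terminal bolt that just prints each tuple emitted by the spout.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getValues());
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output fields
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Two spout executors: per the Concurrency note above, each instance
        // claims a different file via a lock file in the '.lock' subdirectory.
        builder.setSpout("hdfs-spout", new HdfsSpout(), 2);
        builder.setBolt("print-bolt", new PrintBolt(), 2)
               .shuffleGrouping("hdfs-spout");

        Config conf = new Config();
        // Inactivity period after which a lock file is considered stale and its
        // file may be taken over by another spout instance (key named in the
        // README excerpt above).
        conf.put("hdfsspout.lock.timeout.sec", 300);

        StormSubmitter.submitTopology("hdfs-spout-demo", conf, builder.createTopology());
    }
}
```

With two spout executors, each instance would consume a different file, coordinating through the '.lock' directory mechanism described in the README excerpt.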
> Create HDFS Spout
> -----------------
>
>                 Key: STORM-1199
>                 URL: https://issues.apache.org/jira/browse/STORM-1199
>             Project: Apache Storm
>          Issue Type: New Feature
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>         Attachments: HDFSSpoutforStorm v2.pdf, HDFSSpoutforStorm.pdf, hdfs-spout.1.patch
>
> Create an HDFS spout so that Storm can ingest data from files in an HDFS directory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)