[ https://issues.apache.org/jira/browse/STORM-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088143#comment-15088143 ]
ASF GitHub Bot commented on STORM-1199:
---------------------------------------

Github user roshannaik commented on a diff in the pull request:

    https://github.com/apache/storm/pull/936#discussion_r49128132

    --- Diff: external/storm-hdfs/README.md ---
    @@ -405,7 +410,123 @@ On worker hosts the bolt/trident-state code will use the keytab file with princi
     Namenode. This method is a little dangerous, as you need to ensure all workers have the keytab
     file at the same location, and you need to remember this as you bring up new hosts in the cluster.
    -## License
    +---
    +
    +# HDFS Spout
    +
    +The HDFS spout feeds data into Storm from files in an HDFS directory.
    +It actively monitors the directory and consumes any new files that appear in it.
    +The HDFS spout does not currently support Trident.
    +
    +**Important**: The HDFS spout assumes that the files made visible to it in the monitored directory
    +are NOT actively being written to. A file should be made visible to the spout only after it has been
    +completely written. This can be achieved either by writing the file to another directory and moving
    +it into the monitored directory once it is completely written, or by creating the file with a
    +'.ignore' suffix in the monitored directory and renaming it without the suffix after the data is
    +completely written. File names with a '.ignore' suffix are ignored by the spout.
    +
    +While the spout is actively consuming a file, it renames the file with a '.inprogress' suffix.
    +After all the contents of the file have been consumed, the file is moved to a configurable *done*
    +directory and the '.inprogress' suffix is dropped.
    +
    +**Concurrency**: If multiple spout instances are used in the topology, each instance consumes
    +a different file. Synchronization among spout instances is done using lock files created in a
    +(by default) '.lock' subdirectory under the monitored directory. A file with the same name
    +as the file being consumed (without the '.inprogress' suffix) is created in the lock directory.
    +Once the file is completely consumed, the corresponding lock file is deleted.
    +
    +**Recovery from failure**:
    +The spout also periodically records in the lock file how much of the file has been consumed.
    +If a spout instance crashes (or the topology is force killed), another spout can take over the
    +file and resume from the location recorded in the lock file.
    +
    +Certain error conditions (such as the spout crashing) can leave lock files behind without deleting
    +them. Such a stale lock file also indicates that the corresponding input file has not been
    +completely processed. When detected, ownership of such stale lock files is transferred to another
    +spout. The configuration 'hdfsspout.lock.timeout.sec' specifies the duration of inactivity after
    +which lock files are considered stale. For lock file ownership transfer to succeed, the HDFS
    +lease on the file (from the previous lock owner) must have expired. Spouts scan for stale lock
    +files before selecting the next file for consumption.
    +
    +**Lock on the *.lock* directory**:
    +HDFS spout instances create a *DIRLOCK* file in the '.lock' directory to coordinate certain
    +accesses to the '.lock' directory itself. A spout tries to create it when it needs access to the
    +'.lock' directory and then deletes it when done. In case of a topology crash or force kill, this
    +file may not get deleted.
    --- End diff --

    You are right, it should. I should reword it. Thanks!
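As an aside for readers of this thread, the sketch below (not part of the PR diff above) shows how a spout like the one described might be wired into a topology. It assumes the org.apache.storm package layout and a no-arg HdfsSpout constructor, both of which may differ from the API actually added by this pull request; PrintBolt is a throwaway helper defined inline, and the only configuration key taken from the README text is 'hdfsspout.lock.timeout.sec'. The spout's other required settings (HDFS URI, source/archive/bad-files directories, reader type) are omitted because their exact keys are defined by the spout itself.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.hdfs.spout.HdfsSpout;          // spout class added by this PR (package assumed)
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class HdfsSpoutTopologySketch {

    // Trivial terminal bolt that just prints each tuple emitted by the spout.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getValues());
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output fields
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Two spout executors: per the Concurrency note above, each instance
        // claims a different file via a lock file in the '.lock' subdirectory.
        builder.setSpout("hdfs-spout", new HdfsSpout(), 2);
        builder.setBolt("print-bolt", new PrintBolt(), 2)
               .shuffleGrouping("hdfs-spout");

        Config conf = new Config();
        // Inactivity period after which a lock file is considered stale and its
        // file may be taken over by another spout instance (key named in the
        // README excerpt above).
        conf.put("hdfsspout.lock.timeout.sec", 300);

        StormSubmitter.submitTopology("hdfs-spout-demo", conf, builder.createTopology());
    }
}
```

With two spout executors, each instance would consume a different file, coordinating through the '.lock' directory mechanism described in the README excerpt.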
> Create HDFS Spout
> -----------------
>
>                 Key: STORM-1199
>                 URL: https://issues.apache.org/jira/browse/STORM-1199
>             Project: Apache Storm
>          Issue Type: New Feature
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>         Attachments: HDFSSpoutforStorm v2.pdf, HDFSSpoutforStorm.pdf, hdfs-spout.1.patch
>
> Create an HDFS spout so that Storm can ingest data from files in an HDFS directory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)