GitHub user roshannaik opened a pull request:

    https://github.com/apache/storm/pull/936

     Functionally complete. Not well tested. Have some UTs

    HDFS Spout for initial review (no Trident support).
    
    **Features:**
     - Monitors a directory and picks up new files as they show up.
     - Built-in support for consuming Text and Sequence File formats.
     - Extensible to other file formats.
     - Works with or without ACKers.
     - Concurrent consumption: Each spout instance consumes a different file in 
parallel. 
     - ZooKeeper-free synchronization: an HDFS-based synchronization mechanism 
enables each spout to concurrently decide which of the available files to 
consume next. A spout acquires a lock on a file before consuming it. Lock 
files are stored in a `.lock` subdirectory, and the lock file is deleted once 
the file is consumed.
    - Archiving: Completed files are moved to a (configurable) 'done' folder.
    - Easy progress tracking: Files being consumed are renamed with a 
'.inprogress' suffix, allowing users to check which files are being processed 
simply by listing the directory. For more detailed info, users can rely on 
dashboard metrics.
    - Read model: Buffered reads and lazy tuple parsing. Reads are buffered, 
but parsing out a tuple is delayed until the tuple is needed. This avoids the 
periodic latency spikes involved with 'eager' parsing of an entire batch of 
records (e.g. in the Kafka spout).
    - Rolling window model used to track outstanding unACKed tuples (instead 
of the tumbling window model used in the Kafka spout). The tumbling window 
model prevents new tuple generation even if a single tuple in the batch is not 
ACKed.
    - Support for clusters where NTP is not used to synchronize machine clocks 
(for identifying timed-out locks - see the *spout failure detection* section 
below).
    - Kerberos support
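
    The ZooKeeper-free coordination described above can be sketched roughly as 
follows. This is a minimal illustration (not code from this PR), using 
`java.nio` on the local filesystem as a stand-in for the HDFS `FileSystem` API; 
the class and method names are hypothetical:

    ```java
    import java.io.IOException;
    import java.nio.file.*;

    // Hypothetical sketch: a spout tries to claim a source file by atomically
    // creating a matching lock file in the '.lock' subdirectory. Create-if-absent
    // fails for everyone but one caller, so only one spout instance wins the file.
    public class DirLockSketch {

        // Returns true if this caller acquired the lock for 'fileName'.
        public static boolean tryAcquire(Path lockDir, String fileName) throws IOException {
            Files.createDirectories(lockDir);
            Path lockFile = lockDir.resolve(fileName + ".lock");
            try {
                Files.createFile(lockFile);   // atomic create-if-absent
                return true;
            } catch (FileAlreadyExistsException e) {
                return false;                 // another spout already holds the lock
            }
        }

        // Called after the file is fully consumed: the lock file is deleted.
        public static void release(Path lockDir, String fileName) throws IOException {
            Files.deleteIfExists(lockDir.resolve(fileName + ".lock"));
        }

        public static void main(String[] args) throws IOException {
            Path lockDir = Files.createTempDirectory("source-dir").resolve(".lock");
            System.out.println(tryAcquire(lockDir, "fileA")); // true: lock acquired
            System.out.println(tryAcquire(lockDir, "fileA")); // false: already locked
            release(lockDir, "fileA");
            System.out.println(tryAcquire(lockDir, "fileA")); // true: reacquired after release
        }
    }
    ```

    On HDFS the same effect would come from creating the lock file with 
overwrite disabled; the local-filesystem call here is only a stand-in for 
that behavior.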
    
    **Fault Tolerance :**
    - Un-parseable files are moved to a separate (configurable) directory.
    - I/O errors - reads are retried later if I/O errors are encountered.
    - Duplicate Delivery mitigation
         + Spout tracks its progress on HDFS. So if it dies, another spout can 
resume from where it left off, rather than reprocessing the whole file.
         + Configurable number of outstanding unACKed tuples - helps contain 
the amount of duplicate emits when a spout takes over the job of a failed spout.
    - Spout Failure and Failover :
      + Spout failure detection - Each spout acquires a lock on the file it is 
reading and periodically heartbeats (appends) its progress into the lock file 
with a timestamp. This makes stale/abandoned locks easy to identify (based on 
a configurable lock timeout).
      + Failover to another spout - Spouts periodically check for stale locks. 
A spout can take ownership of a stale lock and resume reading the file 
corresponding to the lock. The latest progress noted in the lock file is used 
to decide the location from which to resume.
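
    One way the NTP-free stale-lock check described above could work is 
sketched below. This is an illustrative guess, not the PR's actual code: 
instead of comparing a remote heartbeat timestamp against the local clock, 
the observer snapshots the lock file's latest heartbeat entry and re-checks 
it later on its *own* clock; if the entry has not advanced within the 
configured lock timeout, the lock is considered stale. All names are 
hypothetical:

    ```java
    import java.util.Objects;

    // Hypothetical sketch of clock-skew-tolerant stale-lock detection.
    // The caller periodically feeds in the newest heartbeat line read from a
    // lock file, plus its own local time; no cross-machine clock comparison
    // is ever made.
    public class StaleLockCheck {
        private String lastSeenHeartbeat; // last heartbeat entry we observed
        private long firstSeenAtMs;       // local time when we first saw it
        private final long lockTimeoutMs; // configurable lock timeout

        public StaleLockCheck(long lockTimeoutMs) {
            this.lockTimeoutMs = lockTimeoutMs;
        }

        // Returns true once the same heartbeat entry has been observed,
        // unchanged, for longer than the lock timeout on the local clock.
        public boolean isStale(String currentHeartbeat, long nowMs) {
            if (!Objects.equals(currentHeartbeat, lastSeenHeartbeat)) {
                lastSeenHeartbeat = currentHeartbeat; // owner is alive; reset timer
                firstSeenAtMs = nowMs;
                return false;
            }
            return nowMs - firstSeenAtMs > lockTimeoutMs;
        }

        public static void main(String[] args) {
            StaleLockCheck check = new StaleLockCheck(5000); // 5s timeout
            System.out.println(check.isStale("offset=100", 0));    // false: first sighting
            System.out.println(check.isStale("offset=200", 2000)); // false: heartbeat advanced
            System.out.println(check.isStale("offset=200", 6000)); // false: unchanged 4s, under timeout
            System.out.println(check.isStale("offset=200", 8000)); // true: unchanged 6s, stale
        }
    }
    ```

    A spout that sees `isStale` return true could then take ownership of the 
lock and resume the file from the last progress entry, as the description 
above outlines.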
     


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/roshannaik/storm STORM-1199

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/936.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #936
    
----
commit 03260cc66a076fc118197f2d7ec8c64bd704d936
Author: Roshan Naik <[email protected]>
Date:   2015-12-09T21:10:32Z

    Functionally complete. Not well tested. Have some UTs

----

