[ 
https://issues.apache.org/jira/browse/STORM-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kellogg updated STORM-411:
-------------------------------
    Component/s: storm-core

> Extend file uploads to support more distributed cache like semantics
> --------------------------------------------------------------------
>
>                 Key: STORM-411
>                 URL: https://issues.apache.org/jira/browse/STORM-411
>             Project: Apache Storm
>          Issue Type: Umbrella
>          Components: storm-core
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> One of the big features we are asked about for a hosted Storm instance is 
> how to distribute and update large shared data sets used by topologies.  
> These could be IP-to-geolocation tables, machine-learned models, or just 
> about anything else.
> Currently with Storm you have to package the data as part of your topology 
> jar, install it on the machine ahead of time, or access an external service 
> to pull the data down.  Packaging it in the jar does not allow users to 
> update the dataset without restarting their topologies; installing it on the 
> machine will not work for a hosted Storm solution; and pulling it from an 
> external service without the supervisors being aware of it means it would be 
> downloaded multiple times and might not be cleaned up properly afterwards.
> I propose that we instead set up something similar to the distributed cache 
> on Hadoop, but with a pluggable backend.  The APIs would be for a simple 
> blob store, so they could be backed by local disk on Nimbus, HDFS, Swift, or 
> even BitTorrent.
> Adding new "files" to the blob store or downloading them would by default go 
> through Nimbus, but if an external store is properly configured, direct 
> access to the store could be used.
> The worker process would access the files through symlinks in the worker's 
> current working directory.  On POSIX systems, when a new version of the file 
> is made available, the symlink would be atomically replaced by a new one 
> pointing to the new version.  Windows does not support atomic replacement of 
> a symlink, so we should provide a simple library that returns resolved paths 
> to use, detects when the links have changed, and has retry logic built in 
> for the case where the symlink disappears in the middle of a replacement.
> We are in the early stages of implementing this functionality and would like 
> some feedback on the concepts before getting too far along. 

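To make the pluggable-backend idea in the description concrete, here is a
minimal Python sketch of what a simple blob store API with a local-disk backend
might look like. It is not the design proposed in STORM-411: the names
BlobStore, LocalDiskBlobStore, put, get, and latest_version are hypothetical,
as are the integer version numbers and the one-directory-per-key layout. Keys
are assumed to be plain file names.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import List, Optional


class BlobStore(ABC):
    """Hypothetical blob store interface; a backend only implements these calls."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> int:
        """Store a new version of the blob and return its version number."""

    @abstractmethod
    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the given version of the blob, or the latest if version is None."""

    @abstractmethod
    def latest_version(self, key: str) -> int:
        """Return the highest stored version number for the key."""


class LocalDiskBlobStore(BlobStore):
    """Assumed layout: <root>/<key>/<version>, one file per version."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _versions(self, key: str) -> List[int]:
        key_dir = os.path.join(self.root, key)
        if not os.path.isdir(key_dir):
            return []
        # Temporary files do not have all-digit names, so they are skipped here.
        return sorted(int(n) for n in os.listdir(key_dir) if n.isdigit())

    def latest_version(self, key: str) -> int:
        versions = self._versions(key)
        if not versions:
            raise KeyError(key)
        return versions[-1]

    def put(self, key: str, data: bytes) -> int:
        key_dir = os.path.join(self.root, key)
        os.makedirs(key_dir, exist_ok=True)
        versions = self._versions(key)
        version = versions[-1] + 1 if versions else 1
        # Write to a temporary file first, then rename it into place, so a
        # reader never observes a partially written version.
        fd, tmp_path = tempfile.mkstemp(dir=key_dir, suffix=".tmp")
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, os.path.join(key_dir, str(version)))
        return version

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        if version is None:
            version = self.latest_version(key)
        path = os.path.join(self.root, key, str(version))
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except FileNotFoundError:
            raise KeyError((key, version))


if __name__ == "__main__":
    store = LocalDiskBlobStore(tempfile.mkdtemp())
    store.put("geo.db", b"first upload")
    store.put("geo.db", b"updated table")
    print(store.latest_version("geo.db"), store.get("geo.db"))
```

An HDFS- or Swift-backed store would implement the same three methods, so
callers would not need to know which backend is configured. This sketch does
not guard against two writers calling put for the same key at the same time.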

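The symlink handling in the description can be sketched in the same way. The
Python example below shows one common way to replace a symlink atomically on
POSIX (create a temporary link, then rename it over the old one), along with a
reader-side helper that retries if the link is briefly missing. The function
names publish_version and resolve_with_retry, the temporary link naming, and
the retry parameters are all hypothetical, and nothing here is taken from an
actual Storm implementation.

```python
import os
import tempfile
import time


def publish_version(link_path: str, target_path: str) -> None:
    """Point link_path at target_path, replacing any existing link.

    On POSIX, os.replace maps to rename(2), which swaps the link in one step,
    so readers see either the old target or the new one.
    """
    # The temporary link lives next to the real one so the rename stays on
    # the same filesystem.
    tmp_link = "{}.tmp.{}".format(link_path, os.getpid())
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_path, tmp_link)
    os.replace(tmp_link, link_path)


def resolve_with_retry(link_path: str, attempts: int = 5, delay: float = 0.05) -> str:
    """Return the path the link points to, retrying if the link is missing.

    This is meant for platforms where the link may vanish for a moment while
    it is being replaced. Relative targets are resolved against the link's
    own directory.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    for attempt in range(attempts):
        try:
            target = os.readlink(link_path)
            resolved = os.path.normpath(os.path.join(link_dir, target))
            if os.path.exists(resolved):
                return resolved
        except OSError:
            pass  # link missing or unreadable right now; try again
        if attempt < attempts - 1:
            time.sleep(delay)
    raise FileNotFoundError(link_path)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    for name in ("model.1", "model.2"):
        with open(os.path.join(workdir, name), "w") as handle:
            handle.write(name)
    link = os.path.join(workdir, "model")
    publish_version(link, os.path.join(workdir, "model.1"))
    before = resolve_with_retry(link)
    publish_version(link, os.path.join(workdir, "model.2"))
    after = resolve_with_retry(link)
    print("link changed:", before != after)
```

A worker could detect that a link has changed by comparing the path returned
by resolve_with_retry with the one it saw previously, as the demo at the
bottom does.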

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
