[
https://issues.apache.org/jira/browse/STORM-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-411:
-------------------------------
Component/s: storm-core
> Extend file uploads to support more distributed cache like semantics
> --------------------------------------------------------------------
>
> Key: STORM-411
> URL: https://issues.apache.org/jira/browse/STORM-411
> Project: Apache Storm
> Issue Type: Umbrella
> Components: storm-core
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
>
> One of the big features that we are asked about for a hosted Storm instance
> is how to distribute and update large shared data sets with topologies.
> These could be things like IP-to-geolocation tables, machine-learned models,
> or just about anything else.
> Currently with Storm you either have to package the data as part of your
> topology jar, install it on the machine ahead of time, or access an external
> service to pull the data down. Packaging it in the jar does not allow users
> to update the dataset without restarting their topologies, installing it on
> the machine will not work for a hosted Storm solution, and pulling it from an
> external service without the supervisors being aware of it would mean it
> would be downloaded multiple times and might not be cleaned up properly
> afterwards.
> I propose that instead we set up something similar to the distributed cache
> on Hadoop, but with a pluggable backend. The APIs would be for a simple
> blobstore, so they could be backed by local disk on nimbus, HDFS, Swift, or
> even BitTorrent.
> Adding new "files" to the blobstore or downloading them would by default go
> through nimbus, but if an external store is properly configured, direct
> access into the store could be used.
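To make the pluggable-backend idea concrete, here is a rough sketch of what a minimal blobstore interface with a local-disk backend might look like. Everything in it is an assumption made for illustration: the names BlobStore, LocalDiskBlobStore, put, get and latest_version are hypothetical and are not Storm's API, and the integer version numbering is just one possible scheme. An HDFS or Swift backend would implement the same three methods.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class BlobStore(ABC):
    """Hypothetical minimal blobstore interface (not Storm's actual API)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> int:
        """Store a new version of the blob and return its version number."""

    @abstractmethod
    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the given version of the blob, or the latest if None."""

    @abstractmethod
    def latest_version(self, key: str) -> int:
        """Return the highest stored version number for the key."""


class LocalDiskBlobStore(BlobStore):
    """Backend that keeps each version as a file under <root>/<key>/<n>."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _key_dir(self, key: str) -> str:
        return os.path.join(self.root, key)

    def _versions(self, key: str) -> list:
        try:
            names = os.listdir(self._key_dir(key))
        except FileNotFoundError:
            return []
        # Temporary files start with ".tmp-" and are skipped by isdigit().
        return sorted(int(n) for n in names if n.isdigit())

    def put(self, key: str, data: bytes) -> int:
        key_dir = self._key_dir(key)
        os.makedirs(key_dir, exist_ok=True)
        versions = self._versions(key)
        version = versions[-1] + 1 if versions else 1
        # Write to a temp file first, then rename it into place, so that
        # readers never observe a partially written blob.
        fd, tmp = tempfile.mkstemp(dir=key_dir, prefix=".tmp-")
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, os.path.join(key_dir, str(version)))
        return version

    def latest_version(self, key: str) -> int:
        versions = self._versions(key)
        if not versions:
            raise KeyError(key)
        return versions[-1]

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        if version is None:
            version = self.latest_version(key)
        path = os.path.join(self._key_dir(key), str(version))
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError((key, version)) from None
```

This sketch does not guard against two writers picking the same version number at once, and it does not validate keys. A real backend would need both, plus cleanup of old versions.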
> The worker process would access the files through symlinks in the current
> working directory of the worker. On POSIX systems, when a new version of the
> file is made available, the symlink would atomically be replaced by a new one
> pointing to the new version. Windows does not support atomic replacement of
> a symlink, so we should provide a simple library that returns resolved paths
> to be used and can detect when the links have changed, with some retry logic
> built in for the case where the symlink disappears in the middle.
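For illustration, the symlink handling described above could be sketched roughly as follows. This is only a sketch under stated assumptions: publish_version and resolve_with_retry are hypothetical names, the temporary-link naming is made up, and the atomic swap relies on POSIX rename semantics (a new symlink is created under a temporary name and then renamed over the old one). The retry defaults are arbitrary.

```python
import os
import time


def publish_version(link_path: str, target_path: str) -> None:
    """Point link_path at target_path by renaming a fresh symlink over it.

    On POSIX, rename replaces the destination atomically, so readers see
    either the old link or the new one, never a missing link.
    """
    tmp_link = f"{link_path}.tmp-{os.getpid()}"
    try:
        os.unlink(tmp_link)  # clear a leftover from an earlier failed run
    except FileNotFoundError:
        pass
    os.symlink(target_path, tmp_link)
    os.replace(tmp_link, link_path)  # renames the link itself, not its target


def resolve_with_retry(link_path: str, attempts: int = 5,
                       delay: float = 0.05) -> str:
    """Return the resolved path behind link_path.

    Retries a few times if the link is briefly missing, which can happen
    on platforms where the link has to be deleted and then recreated.
    """
    for attempt in range(attempts):
        try:
            target = os.readlink(link_path)
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            continue
        # Relative link targets are interpreted relative to the link's directory.
        base = os.path.dirname(link_path)
        return os.path.realpath(os.path.join(base, target))
    raise FileNotFoundError(link_path)
```

A worker could detect a change by comparing the path returned by resolve_with_retry with the one it got last time, and reopening the file when the two differ.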
> We are in the early stages of implementing this functionality and would like
> some feedback on the concepts before getting too far along.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)