> On May 8, 2017, 10:35 p.m., Joseph Wu wrote:
> > src/slave/containerizer/fetcher.cpp
> > Lines 261 (patched)
> > <https://reviews.apache.org/r/58900/diff/2/?file=1710782#file1710782line295>
> >
> >     This patch is still missing cleanup for this directory.
> >     
> >     Before this patch, we would "leak" fetcher cache directories whenever 
> > the SlaveID changed.  This change will make the fetcher leak the cache 
> > every time the agent is started.
> >     
> >     I'm considering the following options:
> >     1) Change the `--fetcher_cache_dir` to be the cache directory, rather 
> > than the _parent_ of the cache directory.  This has some implications when 
> > running multiple agents on a single machine (as the default value of the 
> > flag will no longer be acceptable).  But with a single cache directory, 
> > managing leaks is as easy as clearing the directory on startup.
> >     2) Delete the fetcher cache directory in the fetcher's destructor.  
> > This will only really do anything when the agent is gracefully shut down.  
> > But graceful shutdowns can be considered somewhat rare.
> >     3) Checkpoint the PID of the current process inside the fetcher cache 
> > directory.  Whenever the agent starts, it could scan for these PIDs and 
> > check which processes are still running.  We can then delete the caches of 
> > dead processes.

First of all, I think making `--fetcher_cache_dir` the cache directory itself 
makes sense. I wouldn't be too worried about the case of running multiple 
agents. If we want to run multiple agents, there are many other flags that we 
need to make unique as well (e.g., runtime_dir and cgroups_root).

For 3), can we avoid checkpointing? For instance, can we introduce a `trash` 
directory under `fetcher_cache_dir`, and rely on `rename` to atomically move 
the cached artifacts to the trash directory for removal?
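
A minimal sketch of the trash-directory idea, assuming C++17 
`std::filesystem` and that the trash directory lives on the same filesystem 
as the cache (so that the rename is a single atomic step on POSIX). The 
names `moveToTrash` and `emptyTrash` are hypothetical and are not existing 
Mesos functions; the actual fetcher would presumably use stout's `os::` 
helpers instead.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: moves `entry` (a file or directory inside
// `cacheDir`) into `cacheDir/trash` with a single rename. Returns
// false if the trash directory cannot be created or the rename fails.
bool moveToTrash(const fs::path& cacheDir, const fs::path& entry)
{
  std::error_code ec;
  const fs::path trash = cacheDir / "trash";

  fs::create_directories(trash, ec);
  if (ec) {
    return false;
  }

  // Pick a target name that is not yet present in the trash. This
  // check is not race-free across processes; it is only meant to
  // avoid clobbering entries trashed earlier by the same agent.
  fs::path target;
  for (int i = 0;; ++i) {
    target = trash / (entry.filename().string() + "." + std::to_string(i));
    if (!fs::exists(target, ec)) {
      break;
    }
  }

  fs::rename(entry, target, ec);
  return !ec;
}

// Hypothetical helper: deletes everything in the trash directory.
// Intended to be safe to call at agent startup, since nothing in the
// trash is referenced by the cache anymore.
void emptyTrash(const fs::path& cacheDir)
{
  std::error_code ec;
  fs::remove_all(cacheDir / "trash", ec);
}
```

With this shape, nothing needs to be checkpointed: anything found under 
`trash` at startup is garbage by construction and can simply be removed.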


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58900/#review174232
-----------------------------------------------------------


On May 8, 2017, 10:20 p.m., Joseph Wu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58900/
> -----------------------------------------------------------
> 
> (Updated May 8, 2017, 10:20 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-7304
>     https://issues.apache.org/jira/browse/MESOS-7304
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The fetcher cache directory was historically located (by default)
> in `/tmp/mesos/fetch`.  The agent flag `--fetcher_cache_dir` could
> be used to change this value.
> 
> The fetcher would create a subdirectory underneath `/tmp/mesos/fetch`
> for each `SlaveID`.  This was done because multiple agents can run on
> the same node.  If all the agents use the same default fetcher cache
> directory, they will collide and cause unpredictable results.
> As a result, the `SlaveID` needed to be passed into the fetcher
> after the agent recovers and/or registers with the master, because
> that is when the `SlaveID` is determined.
> 
> This changes the default fetcher cache directory to
> `/tmp/mesos/fetch/XXXXXX`, where the 6 X's are replaced by `::mkdtemp`.
> The `SlaveID` subdirectory has also been removed.
> 
> This change, while technically a breaking change, is safe because of
> how the fetcher uses this directory.  Upon starting up, the fetcher
> "recovers" by clearing this directory.  By using a temporary directory,
> we similarly get an empty fetcher cache upon startup.
> 
> This change will only cause breakages if multiple agents are run
> with the same `--fetcher_cache_dir`.  In this case, each agent
> will delete the fetcher caches of all the other agents.
> 
> ---
> 
> With the removal of the `SlaveID` field in the fetcher's methods,
> it is no longer necessary to pass in the `SlaveID` or agent Flags
> at agent recovery time.  Instead, the flags can be passed in during
> the fetcher's construction.
> 
> Similarly, the fetcher's "recovery" (clearing the fetcher cache)
> can be done immediately upon construction, which simplifies the
> code slightly.
> 
> 
> Diffs
> -----
> 
>   src/local/local.cpp e47980929db2da1f31cf899a0e1fc452070e11f3 
>   src/slave/containerizer/fetcher.hpp 9e3018dc087ed55c61b2824d0105bc5339b83043 
>   src/slave/containerizer/fetcher.cpp a910fea5a5556afb376524c5bb2ff98d7d84e611 
>   src/slave/flags.hpp e5784ef81ad0720c7ec061ee0b28b8fadae77afd 
>   src/slave/flags.cpp ed99fadbf1aa91f7f0a57c9fb351c0247a40c6f4 
>   src/slave/main.cpp 507d59996a90f51c7cd1e0f836a34c2e9c484791 
>   src/slave/slave.cpp a37c6c888e7573209aadb07576cfb727fa1ec4ff 
> 
> 
> Diff: https://reviews.apache.org/r/58900/diff/2/
> 
> 
> Testing
> -------
> 
> See last patch in chain.
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>
