Daryn,

The problem with FileSystem.closeAllForUGI(ugi) for me is that a server can be multi-threaded, and a user could be doing multiple requests at the same time. So if i used closeAllForUGI at the end of one request, isn't there a risk of closing filesystems that other in-flight requests for the same user are still using?
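
To make the concern concrete, here is a toy model in plain Java. It is not Hadoop code: ToyFs, ToyCache and closeAllForUser are hypothetical stand-ins, and the sketch assumes that two concurrent requests for the same user end up with the same cache entry. Whether that assumption holds for the real cache and two UGI objects is exactly the open question.

```java
import java.io.Closeable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: ToyFs and ToyCache are hypothetical stand-ins, not Hadoop classes.
class ToyFsCacheDemo {

    /** Stand-in for a cached file system handle. */
    static final class ToyFs implements Closeable {
        private volatile boolean open = true;

        String read(String path) {
            if (!open) {
                throw new IllegalStateException("filesystem closed");
            }
            return "contents of " + path;
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** Stand-in for a cache that gives every caller for the same user the same instance. */
    static final class ToyCache {
        private final Map<String, ToyFs> byUser = new ConcurrentHashMap<>();

        ToyFs get(String user) {
            return byUser.computeIfAbsent(user, u -> new ToyFs());
        }

        /** Closes and evicts the entry for this user, whoever else still holds it. */
        void closeAllForUser(String user) {
            ToyFs fs = byUser.remove(user);
            if (fs != null) {
                fs.close();
            }
        }
    }

    /** Returns true if request B fails after request A cleans up. */
    static boolean secondRequestBreaks() {
        ToyCache cache = new ToyCache();
        ToyFs requestA = cache.get("koert"); // request A, user "koert"
        ToyFs requestB = cache.get("koert"); // concurrent request B, same user, same instance

        requestA.read("/a");
        cache.closeAllForUser("koert");      // request A finishes first and cleans up

        try {
            requestB.read("/b");             // request B is still running
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second request breaks: " + secondRequestBreaks());
    }
}
```

In this toy the second request fails only because both requests share one cache key. If each request got its own key, the cleanup of request A would not touch request B.
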
On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <[email protected]> wrote:
> Yes, the implementation of fs.close() leaves something to be desired.
> There's actually been debate in the past about close being a no-op for a
> cached fs, but the idea was rejected by the majority of people.
>
> In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end
> of a request to flush all the fs cache entries for the ugi. You'll get the
> benefit of the cache during execution of the request, and be able to close
> the cached fs instances to prevent memory leaks. I hope this helps!
>
> Daryn
>
>
> On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:
>
> ---------- Forwarded message ----------
> From: "Koert Kuipers" <[email protected]>
> Date: Aug 4, 2012 1:54 PM
> Subject: fs cache giving me headaches
> To: <[email protected]>
>
> nothing has confused me as much in hadoop as FileSystem.close().
> any decent java programmer that sees that an object implements Closeable
> writes code like this:
>
>   final FileSystem fs = FileSystem.get(conf);
>   try {
>     // do something with fs
>   } finally {
>     fs.close();
>   }
>
> so i started out using hadoop FileSystem like this, and i ran into all
> sorts of weird errors where FileSystems in unrelated code (sometimes not
> even my code) started misbehaving and streams were unexpectedly shut. Then
> i realized that FileSystem uses a cache and close() closes it for everyone!
> Not pretty in my opinion, but i can live with it. So i checked other code
> and found that basically nobody closes FileSystems. Apparently the expected
> way of using FileSystems is to simply never close them. So i adopted this
> approach (which i think is really contrary to java conventions for a
> Closeable).
>
> Lately i started working on some code for a daemon/server where many
> FileSystem objects are created for different users (UGIs) that use the
> service. As it turns out other projects have run into trouble with the
> FileSystem cache in situations like this (for example, Scribe and Hoop). I
> imagine the cache can get very large and cause problems (i have not tested
> this myself).
>
> Looking at the code for Hoop i noticed they simply turned off the
> FileSystem cache and made sure to close every FileSystem. So here the
> suggested approach to deal with FileSystems seems to be:
>
>   // or FileSystem.get(conf), but with caching turned off in the conf
>   final FileSystem fs = FileSystem.newInstance(conf);
>   try {
>     // do something with fs
>   } finally {
>     fs.close();
>   }
>
> This code bypasses the cache if i understand it correctly, avoiding any
> cache size limitations. However if i adopt this approach i basically
> cannot re-use any existing code or libraries that do not close FileSystems,
> splitting the codebase into two which is pretty ugly. And this code is not
> efficient in situations where there are very few used FileSystem objects
> and a cache would improve performance, so the split works both ways.
>
> In short, there is no single way to code with FileSystem that works in
> both situations! Ideally i would have liked fs.close() to do the right
> thing depending on the settings: if the cache is turned off it closes the
> FileSystem, and if it is turned on it's a no-op. That way i could always use
> FileSystem.get(conf) and always close my filesystems, and the code would be
> usable irrespective of whether the cache is turned on or off.
>
> Any insights or suggestions? Thanks!
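
For reference, the close() behaviour asked for in the original message can be sketched in a few lines of plain Java. This is only a toy model of the proposal, not how Hadoop's FileSystem.close() behaves: ToyFs and its fromCache flag are hypothetical names made up for the illustration.

```java
import java.io.Closeable;

// Toy model only: ToyFs is a hypothetical stand-in, not Hadoop's FileSystem.
class ToyCloseDemo {

    static final class ToyFs implements Closeable {
        private final boolean fromCache; // true if a shared cache owns this instance
        private boolean open = true;

        ToyFs(boolean fromCache) {
            this.fromCache = fromCache;
        }

        boolean isOpen() {
            return open;
        }

        @Override
        public void close() {
            // Cached instances are shared, so close() leaves them alone;
            // uncached instances belong to the caller and are really closed.
            if (!fromCache) {
                open = false;
            }
        }
    }

    /** Runs the usual try/finally pattern and reports whether the instance is still open. */
    static boolean staysOpenAfterClose(boolean cacheEnabled) {
        final ToyFs fs = new ToyFs(cacheEnabled);
        try {
            // do something with fs
        } finally {
            fs.close();
        }
        return fs.isOpen();
    }

    public static void main(String[] args) {
        System.out.println("cache on,  still open: " + staysOpenAfterClose(true));
        System.out.println("cache off, still open: " + staysOpenAfterClose(false));
    }
}
```

With this policy the same try/finally code is safe in both configurations, at the cost that something else (the cache owner) has to close cached instances eventually.
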
