I used the new DataStorage API that showed up as part of PIG-32 a
fair bit this weekend while updating PIG-55, and I'm a little
perplexed by its design. A few questions about why things are the
way they are follow. I'd be happy to make some patches to address
these issues, but I wanted to make sure I'm not missing something first.
Why are the navigation functions on DataStorage and not
ContainerDescriptor? It seems natural to add a couple methods to
ContainerDescriptor to get a subelement or subcontainer given a
String. The current setup seems to require calling getDataStorage on
the Container then calling asContainer or asElement on it with the
same Container as an argument. If the navigation moves to
ContainerDescriptor, DataStorage could just have a single method to
get an ElementDescriptor given a String rather than its current
proliferation of as* methods.
Why does ContainerDescriptor extend ElementDescriptor?
ElementDescriptor exposes several methods that make no sense for a
directory, so this forces every ContainerDescriptor implementation to
disallow those methods and return a dummy InputStream for create. A
common superinterface with the shared operations would make things
much easier for DataStorage implementors.
Why does ContainerDescriptor only expose listing its subelements
through being an Iterable? Having it be Iterable is definitely nice,
but there are always times when you need to look at all the files at
once, so this forces any client code to build an array by hand out of
the Iterable. Since both of the existing implementations are already
turning a returned array into an iterable, why not expose that and
save some work for clients?
What's the distinction between getConfiguration and getStatistics?
Is it that the things in Configuration are settable and Statistics
aren't? If that's the case, why not just have a getProperties method
and note if a given key is settable in its javadoc. A user is
already going to have to lookup the key to figure out how a given
DataStorage implementation's configuration maps into the common
DataStorage operations.
Could keys common to all DataStorage implementations be moved to
methods on ElementDescriptor? The existing keys all seem like they'd
be available from any DataStorage, so making them regular methods on
the ElementDescriptor would make them much more pleasant to use not
to mention that it'd remove the need to create a Map for every access
to these rather commonly used attributes.
Is toString the correct method to produce a String representation of
a ElementDescriptor? I didn't see anything else on there to produce
an absolute String representation of a path, and that's hugely useful
for serialization. It seems like a bad idea to expose that through
toString since toString is generally used for debugging, and there's
nothing to guide DataStorage implementors to use toString as such.
The last thing that's bothering me about the API is the names of the
interfaces: ElementDescriptor and ContainerDescriptor. Those names
tell me almost nothing about what the interfaces do. Container gives
me a little hint that that interface will probably have other things
inside of it, but the other two words are generic enough in
programming to be meaningless. I realize that something other than a
filesystem may be exposed through these interfaces, but the
operations exposed through the interfaces are inherently file-like,
so calling them something like PigFile and PigDirectory would convey
loads more information about how they're to be used to a programmer
encountering them for the first time.
Charlie