Questions about the DataStorage API

Charlie Groves Sun, 02 Mar 2008 17:04:45 -0800

I used the new DataStorage API that showed up as part of PIG-32 a fair bit this weekend while updating PIG-55, and I'm a little perplexed by its design. A few questions about why things are the way they are follow. I'd be happy to make some patches to address these issues, but I wanted to make sure I'm not missing something first.

Why are the navigation functions on DataStorage and not ContainerDescriptor? It seems natural to add a couple of methods to ContainerDescriptor to get a subelement or subcontainer given a String. The current setup seems to require calling getDataStorage on the Container, then calling asContainer or asElement on it with the same Container as an argument. If the navigation moves to ContainerDescriptor, DataStorage could just have a single method to get an ElementDescriptor given a String rather than its current proliferation of as* methods.
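A rough sketch of the shape this suggestion could take. Every name here (element, container, get, and the Memory* classes) is hypothetical and is not the actual Pig API; the trailing-slash convention for containers exists only to keep the toy implementation short.

```java
// Hypothetical, simplified stand-ins for the Pig interfaces; not the real API.
interface ElementDescriptor {
    String path();
}

interface ContainerDescriptor extends ElementDescriptor {
    // Navigation lives on the container itself rather than on DataStorage.
    ElementDescriptor element(String name);
    ContainerDescriptor container(String name);
}

interface DataStorage {
    // Single entry point standing in for the various as* methods.
    ElementDescriptor get(String path);
}

// Tiny in-memory implementation, only here to make the sketch runnable.
class MemoryElement implements ElementDescriptor {
    private final String path;
    MemoryElement(String path) { this.path = path; }
    public String path() { return path; }
}

class MemoryContainer extends MemoryElement implements ContainerDescriptor {
    // Toy convention: a container path always ends with "/".
    MemoryContainer(String path) { super(path.endsWith("/") ? path : path + "/"); }
    public ElementDescriptor element(String name) {
        return new MemoryElement(path() + name);
    }
    public ContainerDescriptor container(String name) {
        return new MemoryContainer(path() + name + "/");
    }
}

class MemoryStorage implements DataStorage {
    public ElementDescriptor get(String path) {
        return path.endsWith("/") ? new MemoryContainer(path) : new MemoryElement(path);
    }
}
```

With this shape, a client that already holds a container can write logs.container("2008").element("part-0") without going back through the storage object.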

Why does ContainerDescriptor extend ElementDescriptor? ElementDescriptor exposes several methods that make no sense for a directory, so this forces every ContainerDescriptor implementation to disallow those methods and return a dummy InputStream for create. A common superinterface with the shared operations would make things much easier for DataStorage implementors.
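One possible layout for such a superinterface, as a sketch only. The Descriptor name, its methods, and the two implementing classes are assumptions made for illustration; a real open() would presumably also declare IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical common superinterface: only operations valid for both kinds.
interface Descriptor {
    String name();
    boolean exists();
}

// Stream access belongs to elements only.
interface ElementDescriptor extends Descriptor {
    InputStream open();
}

// Containers no longer inherit stream methods they would have to reject.
interface ContainerDescriptor extends Descriptor {
    List<Descriptor> children();
}

// Minimal in-memory implementations so the sketch compiles and runs.
class ByteElement implements ElementDescriptor {
    private final String name;
    private final byte[] data;
    ByteElement(String name, byte[] data) {
        this.name = name;
        this.data = data.clone();
    }
    public String name() { return name; }
    public boolean exists() { return true; }
    public InputStream open() { return new ByteArrayInputStream(data); }
}

class ListContainer implements ContainerDescriptor {
    private final String name;
    private final List<Descriptor> children;
    ListContainer(String name, List<Descriptor> children) {
        this.name = name;
        this.children = Collections.unmodifiableList(new ArrayList<Descriptor>(children));
    }
    public String name() { return name; }
    public boolean exists() { return true; }
    public List<Descriptor> children() { return children; }
}
```

Under this layout a container implementation has no stream methods to disallow, and code that needs only the shared operations can accept a Descriptor.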

Why does ContainerDescriptor only expose listing its subelements through being an Iterable? Having it be Iterable is definitely nice, but there are always times when you need to look at all the files at once, so this forces any client code to build an array by hand out of the Iterable. Since both of the existing implementations are already turning a returned array into an iterable, why not expose that and save some work for clients?
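To make the cost concrete, here is a sketch contrasting the hand-built array with a direct listing method. The list() method and the class names are hypothetical, and the container iterates over plain Strings instead of ElementDescriptors purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in: children are Strings here, not ElementDescriptors.
interface ContainerDescriptor extends Iterable<String> {
    // Hypothetical addition: hand back all children at once.
    String[] list();
}

// Mirrors the situation described above: the implementation already holds
// an array and wraps it to satisfy Iterable.
class ArrayBackedContainer implements ContainerDescriptor {
    private final String[] children;
    ArrayBackedContainer(String... children) { this.children = children.clone(); }
    public String[] list() { return children.clone(); }
    public Iterator<String> iterator() { return Arrays.asList(children).iterator(); }
}

class ListingClient {
    // The boilerplate every client needs when only Iterable is exposed.
    static String[] collectByHand(Iterable<String> container) {
        List<String> out = new ArrayList<String>();
        for (String child : container) {
            out.add(child);
        }
        return out.toArray(new String[0]);
    }
}
```

Both routes produce the same array; the difference is only in who writes the copy loop.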

What's the distinction between getConfiguration and getStatistics? Is it that the things in Configuration are settable and Statistics aren't? If that's the case, why not just have a getProperties method and note in its javadoc whether a given key is settable? A user is already going to have to look up the key to figure out how a given DataStorage implementation's configuration maps into the common DataStorage operations.

Could keys common to all DataStorage implementations be moved to methods on ElementDescriptor? The existing keys all seem like they'd be available from any DataStorage, so making them regular methods on the ElementDescriptor would make them much more pleasant to use, not to mention that it'd remove the need to create a Map for every access to these rather commonly used attributes.
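A sketch of the difference from the caller's side. The "length" key, the length() accessor, and the FixedElement class are placeholders chosen for illustration, not the keys or methods the actual API defines, and getStatistics is given a simplified signature.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical view of the interface.
interface ElementDescriptor {
    // Map-based access: the caller must know the key and cast the value.
    Map<String, Object> getStatistics();

    // Hypothetical typed accessor for a commonly used attribute.
    long length();
}

class FixedElement implements ElementDescriptor {
    private final long length;
    FixedElement(long length) { this.length = length; }

    public Map<String, Object> getStatistics() {
        // A fresh Map is allocated on every call, which is the overhead
        // the typed accessor avoids.
        Map<String, Object> stats = new HashMap<String, Object>();
        stats.put("length", Long.valueOf(length));
        return stats;
    }

    public long length() { return length; }
}
```

The map route reads as ((Long) e.getStatistics().get("length")).longValue(), while the typed route is simply e.length() and is checked by the compiler.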

Is toString the correct method to produce a String representation of an ElementDescriptor? I didn't see anything else on there to produce an absolute String representation of a path, and that's hugely useful for serialization. It seems like a bad idea to expose that through toString since toString is generally used for debugging, and there's nothing to guide DataStorage implementors to use toString as such.
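A sketch of separating the two roles. The absolutePath() and fromPath() names and the Simple* classes are hypothetical; the point is only that the serialized form gets its own documented method while toString stays free for debugging output.

```java
// Hypothetical interface: the serializable form has an explicit contract.
interface ElementDescriptor {
    // Absolute path that can be stored and later turned back into a descriptor.
    String absolutePath();
}

class SimpleElement implements ElementDescriptor {
    private final String path;
    SimpleElement(String path) { this.path = path; }

    public String absolutePath() { return path; }

    // Debug-oriented output; implementors may change it without breaking
    // anything that was serialized via absolutePath().
    @Override
    public String toString() {
        return "SimpleElement[path=" + path + "]";
    }
}

class SimpleStorage {
    // Hypothetical lookup used to rebuild a descriptor from its stored path.
    ElementDescriptor fromPath(String path) {
        return new SimpleElement(path);
    }
}
```

A round trip then goes through absolutePath() and fromPath() only, and never depends on what a particular implementation chooses to print.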

The last thing that's bothering me about the API is the names of the interfaces: ElementDescriptor and ContainerDescriptor. Those names tell me almost nothing about what the interfaces do. Container gives me a little hint that that interface will probably have other things inside of it, but the other two words are generic enough in programming to be meaningless. I realize that something other than a filesystem may be exposed through these interfaces, but the operations exposed through the interfaces are inherently file-like, so calling them something like PigFile and PigDirectory would convey loads more information about how they're to be used to a programmer encountering them for the first time.


Charlie
