Hi Scott,

Thanks for your detailed and informative email, giving us the user perspective! 

My comments inline below:

On Jan 11, 2011, at 4:43 PM, Scott Konzem wrote:

> First of all, I'd like to congratulate OODT on becoming a top level project 
> and NASA for making this project available. Thank you!

No problemo! We're very happy to be working on OODT in open source, with the 
rest of the community!

> 
> From all the nasa.gov email addresses around here, I get the impression that 
> in the early days of this project, most of the developers and users have been 
> in direct contact or even within the same organization, so I'd like to share 
> my experience as a complete outsider.  I am familiar with the challenges of 
> managing research data at a large organization with many research groups, so 
> I've been trying to figure out what OODT does and what it could do for me.  
> So far most of what I've found has been written either at a very abstract 
> level for managers (the TLP press release and the OODT main page) or a very 
> detailed level for developers (the javadocs). I haven't seen much so far for 
> the "data people" in the middle -- the people who need enough technical 
> detail to put the system into practice because they're tired of coding their 
> own.  This is my experience trying to get that information.

Sorry that you've had that experience so far. The File Manager guide you 
stumbled upon below is an effort to start addressing some of those concerns. I 
agree that much of the documentation as it stands is either Javadoc-style 
documentation or high-level architecture, but I'd also point you to more guides 
like that one; many of the OODT components have a few such guides that can at 
least help with getting started. I'll say more about those in reply to the 
paragraph below, where they are more applicable.

> 
> The website has a lot of stub pages for the individual components, so I 
> thought that I might be able to get some more information by downloading and 
> running the software.  This started as a NASA project, so there have to be 
> stacks of documentation somewhere, right?  I downloaded the trunk and built 
> it using the instructions I eventually found on the File Manager page 
> (http://oodt.apache.org/components/maven/filemgr/user/basic.html), but now I 
> have a directory with a bunch of folders in it, and I have no idea what to do 
> with them.  The only tutorial I can find is for the File Manager -- which I 
> very much appreciate, even though it doesn't completely work for me -- and 
> there are only two files named README.txt in the entire project.

Thanks. Can you elaborate on what part of the guide doesn't completely work? 

The filemgr, workflow, and resource components are the three canonical services 
that help you implement data processing and management. The File Manager tracks 
file locations and their metadata, handles data transfers, transforms that 
captured metadata in a variety of ways (e.g., outputs it as RSS or RDF via the 
cas-product webapp), and delivers those files and metadata to folks who ask for 
them. The Workflow Manager is a light-weight wrapper where you cook up control 
flow and data flow (sets of Tasks chained together) in XML files; you can 
execute those Tasks locally on a single machine, or you can plug the Workflow 
Manager into the Resource Manager and have those Tasks distributed out onto a 
cluster, a cloud, a grid, or whatever type of hardware you have for executing 
processes and jobs. These components are useful independently of one another; 
in fact, they have no direct dependencies on one another unless you tell them 
to. That means you can use the filemgr as a standalone component simply to 
programmatically capture information about files and metadata, without ever 
involving the Workflow Manager or Resource Manager; you can use the workflow 
system on its own, independent of the filemgr or Resource Manager; and you can 
use the Resource Manager similarly.
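
To make the "filemgr on its own" point concrete, here's a rough Java sketch of 
talking to a running File Manager over XML-RPC. It assumes the default port 
9000 from the basic guide, the "GenericFile" product type from the sample 
policy, and a made-up file name; the client class and methods are from the 
cas-filemgr API as I remember it, so treat this as a sketch to adapt rather 
than a recipe:

  import java.net.URL;

  import org.apache.oodt.cas.filemgr.structs.ProductType;
  import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;

  public class FileManagerPing {
    public static void main(String[] args) throws Exception {
      // Point the client at a running File Manager (default port 9000)
      XmlRpcFileManagerClient fm =
          new XmlRpcFileManagerClient(new URL("http://localhost:9000"));

      // Sanity check: is the server up?
      System.out.println("filemgr alive? " + fm.isAlive());

      // Look up one of the product types defined in your policy files
      ProductType type = fm.getProductTypeByName("GenericFile");
      System.out.println("found product type: " + type.getName());

      // Ask whether a product with a given name is already cataloged
      System.out.println("already ingested? " + fm.hasProduct("blah.txt"));
    }
  }

Nothing above touches the Workflow Manager or Resource Manager, which is 
exactly the point: the catalog and archive pieces stand on their own.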

However, when you put these three services together, you start to have a really 
powerful substrate for data management system functions. For example, the 
crawler framework combines automatic file identification and ingestion with the 
File Manager to rapidly build up your File Manager-based archive and catalog; 
it can also notify the Workflow Manager when files are ingested, to kick off 
the tasks and processes (algorithms) associated with those files. The pushpull 
framework is a remote content acquisition system that can go get ancillary 
files and metadata, pull them down locally, and feed them to the crawler for 
ingestion and management in your data management system. Finally, the PGE 
component is a specialized workflow task jar library that, when dropped into 
the Workflow Manager's lib directory, gives you a high-powered workflow task 
that can easily communicate with the filemgr, Workflow Manager, or Resource 
Manager, and feed information to your algorithm that you would otherwise have 
to write lots of specialized data management code for.
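
To give a feel for the "notify the Workflow Manager when files are ingested" 
part, here's a hedged Java sketch of what a post-ingest hook boils down to: 
send a named event, along with the ingested file's metadata, to the Workflow 
Manager. In a real deployment the crawler's ingest actions do this for you; the 
event name and metadata keys below are made up for illustration, and the 
default port 9001 and client class are from cas-workflow as I recall it:

  import java.net.URL;

  import org.apache.oodt.cas.metadata.Metadata;
  import org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManagerClient;

  public class NotifyWorkflowOnIngest {
    public static void main(String[] args) throws Exception {
      // Default Workflow Manager port is 9001
      XmlRpcWorkflowManagerClient wm =
          new XmlRpcWorkflowManagerClient(new URL("http://localhost:9001"));

      // Metadata about the file we just ingested; these keys are
      // hypothetical -- use whatever your extractors actually produce
      Metadata met = new Metadata();
      met.addMetadata("Filename", "blah.txt");
      met.addMetadata("ProductType", "GenericFile");

      // Fire the event that your workflow XML policy maps to a workflow;
      // "NewFileIngested" is a made-up event name for this sketch
      wm.sendEvent("NewFileIngested", met);
    }
  }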

The above describes what *one set* of OODT components (the CAS family) does; 
there's a whole other set of components that handles information integration. 
The use case here is that you have a bunch of existing databases or data 
systems that you'd like to link together, but you don't control their 
population, schemas, or the business processes associated with them. For that 
case we have the profile (metadata) and product (data) server components, which 
expose the underlying metadata and data from those systems and make it easily 
available for query, representation, and dissemination. Profile and product 
servers run on top of the web-grid WAR file, a Tomcat webapp that turns them 
into RESTful services. The best place to get started here is to look at:

http://oodt.apache.org/components/maven/grid/slides.pdf

NOTE: those slides were made pre-Apache OODT, so some of them contain old 
properties and paths for Web Grid, but they should still give you an idea of 
what's going on. The Apache OODT web-grid is basically the same component you 
see in those slides.
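
One nice consequence of the web-grid approach is that, once the WAR is deployed 
in Tomcat, profile and product servers are just HTTP endpoints, so any HTTP 
client can consume them. The exact servlet paths and query parameters are 
covered in the slides above (and differ slightly between the pre-Apache and 
Apache versions), so this minimal Java sketch deliberately takes the query URL 
as an argument rather than guessing it:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;

  public class WebGridQuery {
    public static void main(String[] args) throws Exception {
      // Pass the full web-grid query URL for your deployment, e.g.
      // something like http://yourhost:8080/web-grid/<servlet>?<query>
      URL query = new URL(args[0]);

      // Print whatever the product/profile handler returns
      // (XML, RSS, or the raw data itself)
      BufferedReader in = new BufferedReader(
          new InputStreamReader(query.openStream()));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
      in.close();
    }
  }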

Once you are familiar with web-grid, there are a few custom, extensible profile 
and product server handlers that we have been working on. xmlps (available as a 
top-level OODT module) is an XML-configurable profile/product server that can 
easily connect to JDBC-accessible databases and dump out the bits and metadata 
from them. OPeNDAPPs is an XML-configurable profile server that can connect to 
OPeNDAP-accessible data servers and extract metadata and data from them.

> 
> As a result, I still have a lot of very basic questions:  What do I do with 
> all of these components? What do they all do?  Which ones do I need, and 
> which are optional? Are they standalone executables?  Web services that 
> require some sort of container?  Do I interact with them using the command 
> line, or do they have web or web services interfaces?  What are the 
> configuration options?  What kinds of data and metadata can I manage? What 
> kinds of roles do I need to have within my organization (administrator, 
> content owner, metadata maintainer), and how does the software handle these? 
> What do I want to do that this project can't? (In this type of software, 
> there's always something that's just a little too specific to the original 
> purpose or organization.)

Hopefully what I mentioned above gives you a basic idea of what's going on. 
Apache OODT is a framework; by itself it doesn't build your data system for 
you. It needs some TLC from a person like you who knows your data system 
requirements and can map those to the specific components and resultant 
architecture that OODT provides for your application.

Check out what I mentioned above, and then if you need more help just jump on 
the list and let us know. At that point it would be nice if you could give us 
some more detail about what you are actually trying to do in terms of data 
management, as that would give us a better idea of how to suggest configuring 
and using OODT for your specific case.

> 
> OODT claims to have a large user community apart from the original 
> developers.  How did it come to be that these organizations and individuals 
> knew how to use the software?  What sort of documentation and support did the 
> developers need to provide in order to get them up and running?  How can I 
> get some of that? :)

Like Dave Kale mentioned in his email, a lot of the work to date has come from 
collaborative research grants and shared effort on projects with folks working 
in the organizations that have used OODT. Now that it's here at Apache, many of 
those folks are lurking on these lists and are available to help out and 
discuss issues with the software, in part in the hope that doing so will also 
help out their specific deployments.

> 
> Again, I'm very grateful that this product exists and am excited to find out 
> more about it.  Thanks for making it available for me to puzzle over!

Thanks for your email and welcome!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++