[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696513#comment-14696513 ]

Chris A. Mattmann commented on TIKA-1699:
-----------------------------------------

I got this working! :-) 

h2. Starting Tika Server
{noformat}
java -Dorg.apache.tika.service.error.warn=true \
  -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* \
  org.apache.tika.server.TikaServerCli --config tika-config.xml
{noformat}
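
The {{tika-config.xml}} referenced above isn't shown in this comment; a minimal sketch of what it could contain (assuming the stock Tika 1.x XML config format, routing PDFs to the GROBID-backed JournalParser and letting everything else fall through to the DefaultParser):
{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Send PDFs to the GROBID-backed journal parser -->
    <parser class="org.apache.tika.parser.journal.JournalParser">
      <mime>application/pdf</mime>
    </parser>
    <!-- All other MIME types use the default parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
  </parsers>
</properties>
{noformat}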

h2. cURL command to test
{noformat}
curl -T $HOME/git/grobid/papers/ICSE06.pdf \
  -H "Content-Disposition: attachment;filename=ICSE06.pdf" \
  http://localhost:9998/rmeta | python -mjson.tool
{noformat}
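
Equivalently (a sketch, not from the ticket), the same /rmeta call can be scripted; this assumes the third-party Python {{requests}} library and a local copy of {{ICSE06.pdf}}:
{noformat}
import requests

# curl -T issues an HTTP PUT, so mirror that here.
# Hypothetical local path; substitute any PDF you want to test.
with open("ICSE06.pdf", "rb") as f:
    resp = requests.put(
        "http://localhost:9998/rmeta",
        data=f,
        headers={"Content-Disposition": "attachment;filename=ICSE06.pdf"},
    )
resp.raise_for_status()

# /rmeta returns a JSON array: one object per parsed (sub-)document,
# with metadata keys plus the extracted text under "X-TIKA:content".
for record in resp.json():
    print(record.get("Content-Type"), record.get("X-Parsed-By"))
{noformat}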

h2. Output

{noformat}
[
    {
        "Author": "End User Computing Services",
        "Company": "ACM",
        "Content-Type": "application/pdf",
        "Creation-Date": "2006-02-15T21:13:58Z",
        "Last-Modified": "2006-02-15T21:16:01Z",
        "Last-Save-Date": "2006-02-15T21:16:01Z",
        "SourceModified": "D:20060215211344",
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.journal.JournalParser"
        ],
        "X-TIKA:content": 
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings
 Template - WORD\n\n\nA Software Architecture-Based Framework for Highly 
\nDistributed and Data Intensive Scientific Applications \n\n \nChris A. 
Mattmann1, 2        Daniel J. Crichton1        Nenad Medvidovic2        Steve 
Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of 
Technology \nPasadena, CA 91109, USA 
\n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science 
Department \nUniversity of Southern California  \n\nLos Angeles, CA 90089, USA 
\n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is 
increasingly conducted by virtual \ncommunities of scientists distributed 
around the world. The data \nvolumes created by these communities are extremely 
large, and \ngrowing rapidly. The management of the resulting highly 
\ndistributed, virtual data systems is a complex task, characterized \nby a 
number of formidable technical challenges, many of which \nare of a software 
engineering nature.  In this paper we describe \nour experience over the past 
seven years in constructing and \ndeploying OODT, a software framework that 
supports large, \ndistributed, virtual scientific communities. We outline the 
key \nsoftware engineering challenges that we faced, and addressed, \nalong the 
way. We argue that a major contributor to the success of \nOODT was its 
explicit focus on software architecture. We \ndescribe several large-scale, 
real-world deployments of OODT, \nand the manner in which OODT helped us to 
address the domain-\nspecific challenges induced by each deployment.  
\n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 
Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software 
Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, 
highly complex, \n\noften widely distributed, increasingly decentralized, 
dynamic, and \nmobile.  There are many causes behind this, spanning virtually 
all \nfacets of human endeavor: desired advances in education, \nentertainment, 
medicine, military technology, \ntelecommunications, transportation, and so on. 
  \n\nOne major driver of software\u2019s growing complexity is \nscientific 
research and exploration.  Today\u2019s scientists are solving \nproblems of 
until recently unimaginable complexity with the help \nof software.  They also 
actively and regularly collaborate with \n\ncolleagues around the world, 
something that has become possible \nonly relatively recently, again ultimately 
thanks to software. They \nare collecting, producing, sharing, and 
disseminating large \namounts of data, which are growing by orders of magnitude 
in \nvolume in remarkably short time periods. \n\nIt is this latter problem 
that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years 
ago.  Until recently, \nJPL would disseminate data collected by various 
instruments \n(Earth-based, orbiting, and in outer space) to the interested 
\nscientists around the United States by \u201cburning\u201d CD-ROMs and 
\nmailing them via the U.S. Postal Service.  In addition to being \nslow, 
sequential, unidirectional, and lacking interactivity, this \nmethod was 
expensive, costing hundreds of thousands of dollars. \nFurthermore, the method 
was prone to security breaches, and the \nexact data distribution (determining 
which data goes to which \ndestinations) had to be calculated for each 
individual shipment. It \nhad become increasingly difficult to manage this 
process as the \nnumber of projects and missions, as well as involved 
scientists, \ngrew.  An even more critical limiting factor became the sheer 
\nvolume of data that the current (e.g., Planetary Data System, or \nPDS), 
pending (e.g., Mars Reconnaissance Orbiter, or MRO), and \nplanned (e.g., Lunar 
Reconnaissance Orbiter, or LRO) missions \nwould produce: from terabytes (PDS), 
to hundreds of terabytes \n(MRO), to petabytes or more (LRO).  Clearly, 
spending millions \nof dollars just to distribute the data to scientists is 
impractical. \n\nThis prompted NASA\u2019s Office of Space Science to explore 
\nconstruction of an end-to-end software framework that would \nlower the cost 
of distributing and managing scientific data, from \nthe inception of data at a 
science processing center to its ultimate \narrival on the desks of interested 
users. Because of increasing data \nvolumes, the framework had to be scalable 
and have native \nsupport for evolution to hundreds of sites and thousands of 
data \ntypes. Additionally, the framework had to enable the \nvirtualization of 
heterogeneous data (and processing) sources, and \nto address wide-scale 
(national and international) distribution of \ndata. The framework needed to be 
flexible: it needed to support \nfully automated processing of data throughout 
its lifecycle, while \nstill allowing interactivity and intervention from an 
operator when \nneeded. Furthermore because data is itself distributed across 
\nNASA agencies, any software framework that distributes NASA\u2019s \ndata 
would require the capability for tailorable levels of security \nand for 
varying types of users belonging to multiple \norganizations. \n\nThere were 
also miscellaneous issues of data ownership that \nneeded to be overcome. 
Ultimately, because NASA\u2019s science data \nis so distributed, the owners of 
data systems (e.g., a Planetary \n\n \n\nPermission to make digital or hard 
copies of all or part of this work for \npersonal or classroom use is granted 
without fee provided that copies are \nnot made or distributed for profit or 
commercial advantage and that \ncopies bear this notice and the full citation 
on the first page. To copy \notherwise, or republish, to post on servers or to 
redistribute to lists, \nrequires prior specific permission and/or a fee. 
\nICSE06\u2019, May 20\u201328, 2006, Shanghai, China. \nCopyright 2006 ACM 
1-58113-000-0/00/0004\u2026$5.00. \n \n\n\n\nScience Principal Investigator) 
feel hard pressed to control their \ndata, as the successful operation and 
maintenance of their data \nsystems are essential services that they provide. 
As such, any \nframework that virtualizes science data sources across NASA 
\nshould be transparent and unobtrusive: it should enable \ndissemination and 
retrieval of data across data systems, each of \nwhich may have their own 
external interfaces and services; at the \nsame time, it should enable 
scientists to maintain and operate their \ndata systems independently. Finally, 
to lower costs, once the \nframework was built and installed, it needed to be 
reusable, free, \nand distributable to other NASA sites and centers for use. 
\n\nOver the past seven years we have designed, implemented \nand deployed a 
framework called OODT (Object Oriented Data \nTechnology) that has met these 
rigorous demands. In this paper \nwe discuss the significant software 
engineering challenges we \nfaced in developing OODT.  The primary objective of 
the paper is \nto demonstrate how OODT\u2019s explicit software architectural 
basis \nenabled us to effectively address these challenges.  In particular, 
\nwe will detail the architectural decisions we found most difficult \nand/or 
critical to OODT\u2019s ultimate success. We highlight several \nrepresentative 
examples of OODT\u2019s use to date both at NASA \nand externally. We contrast 
our solution with related approaches, \nand argue that a major differentiator 
of this work, in addition to its \nexplicit architectural foundation, is its 
native support for \narchitecture-based development of distributed scientific 
\napplications. \n\n2. SOFTWARE ENGINEERING \nCHALLENGES \n\nTo develop OODT, 
we needed to address several significant \nsoftware engineering challenges, the 
bulk of which surfaced in \nlight of the complex data management and 
distribution issues \nregularly faced within a distributed, large-scale 
government \norganization such as NASA. In this paper we will focus on nine 
\nkey challenges: Complexity, Heterogeneity, Location \nTransparency, Autonomy, 
Dynamism, Scalability, Distribution, \nDecentralization, and Performance. 
\n\nComplexity \u2013 We envisioned OODT to be a large, multi-site, 
\nmulti-user, complex system. At the software level, complexity \nranged from 
understanding how to install, integrate, and manage \nthe software remotely 
deployed at participating organizations, to \nunderstanding how to manage 
information such as access \nprivileges and security credentials across both 
NASA and non-\nNASA sites. There were also complexities at the software 
\nnetworking layer, including varying firewall capabilities at each 
\ninstitution, and data repositories that would periodically go offline \nand 
needed to be remotely restarted. Just understanding the \nvarying types of data 
held at sites linked together via OODT was \na significant task. Even sites 
within the same science domain \n(e.g., planetary science) describe similar 
data sets in decidedly \ndifferent ways. Discerning in what ways these 
different data \nmodels were common and what attributes of data could be 
shared, \ndone away with, or amended, was a huge challenge. Finally, the 
\ndifferent interfaces to data, ranging from third-party, well-\nengineered 
database management systems, to in-house data \nsystems, ultimately to flat 
text file-based data was a particularly \ndifficult challenge that we had to 
hurdle. \n\nHeterogeneity \u2013 In order to drive down the data management 
\ncosts for science missions, the same OODT framework needed to \n\nspan 
multiple science domains. The domains initially targeted \nwere earth and 
planetary; this has subsequently been expanded to \nspace, biomedical sciences, 
and the modeling and simulation \ncommunities. As such, the same core set of 
OODT software \ncomponents, system designs, and implementation-level facilities 
\nhad to work across widely varying science domains.  \n\nThe data management 
processes within the organizations that \nuse OODT also added to its 
heterogeneity. For instance, OODT \ncomponents needed to have interfaces with 
end users and support \ninteractive sessions, but also with scientific 
instruments, which \nmost likely were automatic and non-interactive. Scientific 
\ninstruments could push data to certain components in OODT, \nwhile other OODT 
components would need to distribute data to \nusers outside of OODT. End-users 
in some cases wanted to \nperform transformations on the data sent to them by 
OODT, and \nthen to return the data back into OODT. The framework needed to 
\nsupport scenarios such as these seamlessly. \n\nMany other constraints also 
imposed the heterogeneity \nrequirement on OODT. We can group these constraints 
into two \nmajor categories: \n\u2022 Organizational \u2013 As we briefly 
alluded above, discipline \n\nexperts who wanted to disseminate their data via 
OODT \nreally wanted the data to reside at their respective \ninstitutions. 
This constraint non-negotiable, and significantly \nimpacted the space of 
technical solutions that we could \ninvestigate for OODT.  \n\n\u2022 Technical 
\u2013 Since OODT had to federate many different data \nholdings and catalogs, 
we faced the constraints of linking \nthem together and federating very 
different schemas and \nvarying levels of sophistication in the data system 
interfaces \n(e.g., flat files, DBMS, web pages). Even those systems \nmanaging 
data through \u201chigher level APIs\u201d and middleware \n(e.g., RMI, CORBA, 
SOAP) proved non-trivial to integrate. \nThe constraints enjoined by 
heterogeneity alone led us to \n\nrealize that the OODT framework would need to 
draw heavily \nfrom multiple areas. Database systems, although used 
\nsuccessfully for many years to manage large amounts of data at \nmany sites, 
lacked the flexibility and interface capability to \nintegrate data from other 
more crude APIs and storage systems \n(such as a PI-led web site). Databases 
also did not address the \ndistribution of data and \u201cownership\u201d 
issues. The advent of the \nweb, although a promising means for providing 
openness and \nflexible interfaces to data, would not alone address the issues 
such \nas multi-institutional security and access. Furthermore, its 
\nrequest/reply nature would not easily handle other distribution \nscenarios, 
e.g., subscribe/notify. Research in the area of grid \ncomputing [1] has 
defined \u201cout of the box\u201d services for managing \ndata systems (e.g., 
GridFTP), but which utilized alone would not \naddress our other challenges 
(e.g., complexity). \n\nLocation Transparency \u2013 Even though data could 
potentially \nbe input into and output from the system from many 
\ngeographically disparate and distributed sites, it should appear to \nthe 
end-users as if the data flow occurred from a single location. \nThis 
requirement was reinforced by the need to dynamically add \ndata producers and 
consumers to a system supported by OODT, \nas will be further discussed below. 
\n\nAutonomy \u2013 When designing the OODT framework, we could \nnot dictate 
how data providers should store, process, find, evolve, \nor retire their data. 
Instead, the framework needed to be \n\n\n\ntransparent, allowing data 
providers to continue with their regular \nbusiness processes, while managing 
and disseminating their \ninformation unobtrusively.  \n\nDynamism \u2013 It is 
expected that data providers for the most part \nwill be stable organizations. 
However, there are cases in which \nnew data producing (occasionally) and 
consuming (frequently) \nnodes will need to be brought on-line. Back-end data 
sources need \nto be pluggable, with little or no direct impact on the end-user 
of \nthe OODT system, or on the organization that owns the data \nsource. New 
end-users (or client hosts) should also be able to \n\u201ccome and go\u201d 
without any disruption to the rest of the system. In \nthe end, we realized 
this meant the whole infrastructure must be \ncapable of some level of dynamism 
in order to meet these \nconstraints. \n\nScalability \u2013 OODT needed to 
manage large volumes of data, \nfrom at least hundreds of gigabytes at its 
inception to the current \nmissions which will produce hundreds of terabytes. 
The \nframework needed to support at least dozens of institutional data 
\nproviders (which themselves may have subordinate data system \nproviders), 
dozens of user types (e.g., scientists, teachers, \nstudents, policy makers), 
thousands of users, hundreds of \ngeographic sites, and thousands of different 
data types to manage \nand disseminate. \n\nDistribution \u2013 The framework 
should be able to handle the \nphysical distribution of data across sites 
nationally and \ninternationally, and ultimately the physical distribution of 
the \nsystem interfaces which provide the data. \n\nDecentralization \u2013 
Each site may have its own data \nmanagement processes, interfaces and data 
types, which were \noperating independently for some time. We needed to devise 
a \nway of coordinating and managing data between these data sites \nand 
providers without centralizing control of their systems, or \ninformation. In 
other words, the requirement was that the different \nsites retain their full 
autonomy, and that OODT adapts instead. \n\nPerformance \u2013 Despite its 
scale and interaction with many \norganizations, data systems, and providers, 
OODT still needed to \nperform under stringent demands. Queries for information 
needed \nto be serviced quickly: in many cases response time under five 
\nseconds was used as a baseline. Additionally, OODT needed to be \noperational 
whenever any of the participating scientists wanted to \nlocate, access, or 
process their data. \n\n3. BACKGROUND AND RELATED WORK \nSeveral large-scale 
software technologies that distribute, \n\nmanage, and process information have 
been constructed over the \npast decade. Each of these technologies falls into 
one or more of \nfour distinct areas: grid-computing, information integration, 
\ndatabases, and middleware. In this section, we briefly survey \nrelated 
projects in each of these areas and compare their foci and \naccomplishments to 
those of OODT. Additionally, since a major \nfocal point of OODT is software 
architecture, we start out by \nproviding some brief software architecture 
background and \nterminology to set the context. \n\nTraditionally, software 
architecture has referred to the \nabstraction of a software system into its 
fundamental building \nblocks: software components, their methods of 
interaction (or \nsoftware connectors), and the governing rules that guide the 
\n\ncomposition of software components and software connectors 
\n(configurations) [2, 3]. Software architecture has been recognized \nin many 
ways to be the linchpin of the software development \nprocess. Ideally, the 
software requirements are reflected within \nthe software system\u2019s 
components and interactions; the \ncomponents and interactions are captured 
within the system\u2019s \narchitecture; and the architecture is used to guide 
the design, \nimplementation, and evolution of the system. Design guidelines 
\nthat have been proven effective are often codified into \narchitectural 
styles, while specific architectural solutions (e.g., \nconcrete system 
structures, component types and interfaces, and \ninteraction facilities) 
within specific domains are captured as \nreusable reference architectures. 
\n\nGrid computing deals with highly complex and distributed \ncomputational 
problems and large volume data management \ntasks. Massive parallel 
computation, distributed workflow, and \npetabyte scale data distribution are 
only a small cross-section of \nthe grid\u2019s capabilities. Grid projects are 
usually broken down into \ntwo areas. Computational grid systems are concerned 
with \nsolving complex scientific problems involving supercomputing \nscale 
resources dispersed across various organizational \nboundaries. The 
representative computational grid system is the \nGlobus Toolkit [4]. Globus is 
built on top of a web-services [5] \nsubstrate and provides resource management 
components, \ndistributed workflow and security infrastructure. Other 
\ncomputational grid systems provide similar capabilities. For \nexample, 
Alchemi [6] is a .NET-based grid technology that \nsupports distributed job 
scheduling and an object-oriented grid \ndevelopment environment. JCGrid [7] is 
a light weight, Java-\nbased open source computational grid project whose goal 
is to \nsupport distributed job scheduling and the splitting of CPU-\nintensive 
tasks across multiple machines.  \n\nThe other class of grid systems, Data 
grids, is involved in the \nmanagement, processing, and distribution of large 
data volumes to \ndisbursed and heterogeneous users, user types, and geographic 
\nlocations. There are several major data grid projects. The LHC \nComputing 
Grid [8] is a system whose main goal is to provide a \ndata management and 
processing infrastructure for the high \nenergy physics community. The Earth 
System Grid [9] is geared \ntowards supporting climate modeling research and 
distribution of \nclimate data sets and metadata to the climate and weather 
\nscientific community.  \n\nTwo independently conducted studies [10, 11] have 
\nidentified three key areas that the current grid implementations \nmust 
address more effectively in order to promote data and \nsoftware 
interoperability: (1) formality in grid requirements \nspecification, (2) 
rigorous architectural description, and (3) \ninteroperability between grid 
solutions. As we will discuss in this \npaper, our work to date on OODT has the 
potential to be a \nstepping stone in each of these areas: its explicit focus 
on \narchitectures for data-intensive, \u201cgrid-like\u201d systems naturally 
\naddresses the three concerns.  \n\nThere have been several well-known efforts 
within the AI \nand database communities that have delved into the topic of 
\ninformation integration, or the shared access, search, and retrieval \nof 
distributed, heterogeneous information resources. Within the \npast decade, 
there has been significant interest in building \ninformation mediators that 
can integrate information from \nmultiple data sources. Mediators federate 
information by querying \nmultiple data sources, and fusing back the gathered 
results. The \nrepresentative systems using this approach include TSIMMS [12], 
\n\n\n\nInformation Manifold [13], The Internet Softbot [14], InfoSleuth 
\n[15], Infomaster [16], DISCO [17], SIMS [18] and Ariadne  [19]. \nEach of 
these approaches focuses on fundamental algorithmic \ncomponents of information 
integration: (1) formulating \nexpressive, efficient query languages (such as 
Theseus [20]) that \nquery many heterogeneous data stores; (2) accurately and 
reliably \ndescribing both global, and source data models (e.g. the 
Global-\nas-view [12] and Local-as-view [21] approaches); (3) providing a 
\nmeans for global-to-source data model integration; and (4) \nimproving 
queries and deciding which data sources to query (e.g. \nquery reformulation 
[22] and query rewriting [22, 23]).  \n\nHowever, these algorithmic techniques 
fail to address the \nsoftware engineering side of information integration. For 
instance, \nexisting literature fails to answer questions such as, which of the 
\ncomponents in the different systems\u2019 architectures are common; \nhow can 
they be reused; which portions of their implementations \nare tied to (which) 
software components; which software \nconnectors are the components using to 
interact; are the \ninteraction mechanisms replaceable (e.g., can a 
client-server \ninteraction in Ariadne become a peer-to-peer interaction); and 
so \non. Additionally, none of the above related mediator systems have 
\nformalized a process for designing, implementing, deploying, and 
\nmaintaining the software components belonging to each system.  \n\nSeveral 
middleware technologies such as CORBA, \nEnterprise Java Beans [24], Java RMI 
[25], and more recently \nSOAP and Web services [5] have been suggested as 
\u201csilver \nbullets\u201d that address the problem of integrating and 
utilizing \nheterogeneous software computing and data resources. Each of 
\nthese technologies provides three basic services: (1) an \n\nimplementation 
and composition framework for software \ncomponents, possibly written in 
different languages but \nconforming to a specific middleware interface; (2) a 
naming \nregistry used to locate components; and (3) a set of basic services 
\nsuch as (un-)marshalling of data, concurrency, distribution and \nsecurity.  
\n\nAlthough middleware is very useful \u201cglue\u201d that can connect 
\nsoftware components written in different languages or deployed \nin 
heterogeneous environments, middleware technologies do not \nprovide any 
\u201cout of the box\u201d services that deal with computing \nand data 
resource management across organizational boundaries \nand across computing 
environments at a national scale. These \nkinds of services usually have to be 
engineered into the \nmiddleware itself. We should note that in grid computing 
such \nservices are explicitly called out and provided at a higher layer of 
\nabstraction. In fact, the combination of these higher-level grid \nservices 
and an underlying middleware platform is typically \nreferred to as a 
\u201cgrid technology\u201d [11].  \n\n4. OODT ARCHITECTURE \nOODT\u2019s 
architecture is a reference architecture that is \n\nintended to be 
instantiated and tailored for use across science \ndomains and projects. The 
reference architecture comprises \nseveral components and connectors.  A 
particular instance of this \nreference architecture, that of NASA\u2019s 
planetary data system \n(PDS) project, is shown in Figure 1. OODT is installed 
on a given \nhost inside a \u201csandbox\u201d, and is aware of and interacts 
only with \nthe designated external data sources outside its sandbox. 
OODT\u2019s \n\nm\nessaging layer (H\n\nTTP)\n\n\u2026\n.. \u2026..\n\n 
\nFigure 1. The Planetary Data System (PDS) OODT Architecture Instantiation 
\n\n\n\ncomponents are responsible for delivering data from \nheterogeneous 
data stores, identifying and locating data within the \nsystem, and ingesting 
and processing data into underlying data \nstores. The connectors are 
responsible for integrating OODT with \nheterogeneous data sources; providing 
reliable messaging to the \nsoftware components; marshalling resource 
descriptions and \ntransferring data between components; transactional 
\ncommunication between components; and security related issues \nsuch as 
identification, authorization, and authentication. In this \nsection, we 
describe the guiding principles behind the reference \narchitecture. We then 
describe each of the OODT reference \ncomponents and connectors in detail. In 
Section 5, we describe \nspecific instantiations of the reference architecture 
in the context \nof several projects that are using OODT. \n\n4.1 Guiding 
Principles \nThe software engineering challenges discussed in Section 2 
\n\nmotivated and framed the development of OODT. Conquering \nthese challenges 
led us to a set of four guiding principles behind \nthe OODT reference 
architecture.  \n\nThe first guiding principle is division of labor. Each 
\ncapability provided by OODT (e.g., processing, ingestion, search, \nand 
retrieval of data, access to heterogeneous data, and so on) is \ncarefully 
divided among separate, independent architectural \ncomponents and connectors. 
As will be further detailed below, the \nprinciple is upheld through 
OODT\u2019s rigorous separation of \nconcerns, and modularity enforced by 
explicit interfaces. This \nprinciple addresses the complexity, heterogeneity, 
dynamism, and \ndecentralization challenges. \n\nClosely related to the 
preceding principle is technology \nindependence. This principle involves 
keeping up-to-date with the \nevolution of software technology (both in-house 
and third-party), \nwhile avoiding tying the OODT architecture to any specific 
\nimplementation. By allowing us to select the technology most \nappropriate to 
a given task or specific need, this principle helps us \nto address the 
challenges of complexity, scalability, security, \ndistribution, location 
transparency, performance, and dynamism.  \nFor instance, OODT\u2019s initial 
reference implementation used \nCORBA as the substrate for its messaging layer 
connector. When \nthe CORBA vendor decided to begin charging JPL significant 
\nlicense fees (thus violating NASA\u2019s objective of producing a \nsolution 
that would be free to its users), the principle of \ntechnology independence 
came into play. Because the OODT \nmessaging layer connector supports a wrapper 
interface around \nthe lower-level distribution technology, we were able to 
replace \nour initial CORBA-based connector with one using Java\u2019s open 
\nsource RMI middleware, and redeploy the new connector to the \nOODT user 
sites, within three person days.  \n\nAnother guiding principle of OODT is the 
distinguishing of \nmetadata as a first-class citizen in the reference 
architecture, and \nseparating metadata from data. The job of metadata (i.e., 
\u201cdata \nabout data\u201d) is to describe the data universe in which the 
system \nis operating. Since OODT is meant to be a technology that \nintegrates 
diverse data sources, this data universe is highly \nheterogeneous and possibly 
dynamic. Metadata in OODT is \nmeant to catalog information, allowing a user to 
locate and \ndescribe the actual data in which she is interested. On the other 
\nhand, the job of data in OODT is to describe physical or scientific 
\nphenomena; it is the ultimate end user product that an OODT \nsystem should 
deliver. This principle helps to address the \n\nchallenges of heterogeneity, 
autonomy of data providers, and \ndecentralization. \n\nSeparating the data 
model from the software is another key \nprinciple behind the reference 
architecture. Akin to ontology/data-\ndriven systems, OODT components should 
not be tied to the data \nand metadata that they manipulate. Instead, the 
components \nshould be flexible enough to understand many (meta-)data models 
\nused across different scientific domains, without reengineering or 
\ntailoring of the component implementations. This principle helps \nto address 
the challenges of complexity and heterogeneity. \n\nThese four guiding 
principles are reified in a reference \narchitecture comprising four pairs of 
component types and two \nclasses of connectors organized in a canonical 
structure. One \ninstantiation of the reference architecture reflecting the 
canonical \nstructure is depicted in Figure 1.  Each OODT architectural 
\nelement (component and connector) serves a specific purpose, \nwith its 
functionality exported through a well-defined interface.  \nThis supports 
OODT\u2019s constant evolution, allowing us to add, \nremove, and substitute, 
if necessary dynamically (i.e., at runtime), \nelements of a given type. It 
also allows us to introduce flexibility \nin the individual instances of the 
reference architecture while, at \nthe same time, controlling the legal system 
configurations.  \nFinally, the explicit connectors and well-defined component 
\ninterfaces allow OODT in principle to integrate with a wide \nvariety of 
third-party systems (e.g., [26]).  The outcome of the \nguiding principles 
(described above) and design decisions \n(detailed below) is an architecture 
that is \u201ceasy to build, hard to \nbreak\u201d. \n\n4.2 OODT Components 
\n4.2.1 Product Server and Product Client \n\nThe Product Server is used to 
retrieve data from \nheterogeneous data stores. The product server accepts a 
query \nstructure that identifies a set of zero or more products which \nshould 
be returned the issuer of the query. A product is a unit of \ndata in OODT and 
represents anything that a user of the system is \ninterested in retrieving: a 
JPEG image of Mars, an MS Word \ndocument, a zip file containing text file 
results of a cancer study, \nand so on. Product servers can be located at 
remote data sites, \ngeographically and/or institutionally disparate from other 
OODT \ncomponents. Alternatively, product servers can be centralized, \nlocated 
at a single site. The objective of the product server is to \ndeliver data from 
otherwise heterogeneous data stores and \nsystems. As long as a data store (or 
system) provides some kind \nof access interface to get its data, a product 
server can \u201cwrap\u201d \nthose interfaces with the help of Handler 
connectors described in \nSection 4.3 below. \n\nThe Product Client component 
communicates with a product \nserver via the Messaging Layer connectors 
described in Section \n4.3. A product client resides at the end-user\u2019s 
(e.g., scientist\u2019s) \nsite.  It must know the location of at least one 
product server, and \nthe query structure that identifies the set of products 
that the user \nwants to retrieve. At the same time, it is completely insulated 
\nfrom any changes in the physical location or actual representation \nof the 
data; its only interface is to the product server(s).  Many \nproduct clients 
may communicate with the same product server, \nand many product servers can 
return data to the same product \nclient. This adds flexibility to the 
architecture without introducing \nunwanted long-term dependencies: a product 
client can be added, \n\n\n\nremoved, or replaced with another one that depends 
on different \nproduct servers, without any effect on the rest of the 
architecture. \n\n4.2.2 Profile Server and Profile Client \nThe Profile Server 
manages resource description \n\ninformation, i.e., metadata, in a system built 
with OODT. \nResource description information is divided into three main 
\ncategories: \n\u2022 Housekeeping Information \u2013 Metadata such as ID, 
Last \n\nModified Date, Last Revised By. This information is kept \nabout the 
resource descriptions themselves and is used by the \nprofile server to 
inventory and catalog resource descriptions. \nThis is a fixed set of metadata. 
\n\n\u2022 Resource Information \u2013 This includes metadata such as Title, 
\nAuthor, Creator, Publisher, Resource Type, and Resource \nLocation. This 
information is kept for all the data in the \nsystem, and is an extended 
version of the Dublin Core \nMetadata for describing electronic resources [27]. 
This is \nalso a fixed set of metadata. \n\n\u2022 Domain-Specific Information 
\u2013 This includes metadata \nspecific to a particular data domain. For 
instance, in a cancer \nresearch system this may include metadata such as Blood 
\nSpecimen Type, Site ID, and Protocol/Study Description. \nThis set of 
metadata is flexible and is expected to change. \n\nAs with product servers, 
profile servers can be decentralized at \nmultiple sites or centralized at a 
single site. The objective of the \nprofile server is to deliver metadata that 
gives a user enough \ninformation to locate the actual data within OODT 
regardless of \nthe underlying system\u2019s exact configuration, and degrees 
of \ncomplexity and heterogeneity; the user then retrieves the data via \none 
or more product servers. Because profile servers do not serve \nthe actual 
data, they need not have a direct interface to the data \nthat they describe. 
In addition to the complete separation of duties \nbetween profile and product 
servers, this ensures their location \nindependence, allows their separate 
evolution, and minimizes the \neffects of component and/or network failures in 
an OODT system. \n\nProfile Client components communicate with profile servers 
\nover the messaging layer connectors. The client must know the \nlocation of 
the profile server, and must provide a query that \nidentifies the metadata 
that a user is interested in retrieving. There \ncan be many profile clients 
speaking with a single profile server, \nand many profile servers speaking with 
a single profile client.  \nThe architectural effects are analogous to those in 
the case of \nproduct clients and servers. \n\n4.2.3 Query Server and Query 
Client \nThe Query Server component provides an integrated search \n\nand 
retrieval capability for the OODT reference architecture. \nQuery servers 
interact with profile and product servers to retrieve \nmetadata and data 
requested by system users. A query server is \nseeded with an initial set of 
references to profile servers. Upon \nreceiving a query from a user, the query 
server passes it along to \neach profile server from its list, and collects the 
metadata \nreturned. Part of this metadata is a resource location (recall 
\nSection 4.2.2) in the form of a URI [28]. A URI can be a link to a \nproduct 
server, to a web site with the actual data, or to some \nexternal data 
providing system. This directly supports \nheterogeneity, location 
transparency, and autonomy of data \nproviders in OODT.  \n\nAnother novel 
aspect of OODT\u2019s architecture is that if a \nprofile server is unable to 
service the query, or if it believes that \n\nother profile servers it is aware 
of may contain relevant metadata, \nit will return the URIs of those profile 
servers; the query server \nmay then forward the query to them. As a result, 
query servers are \ncompletely decoupled from product servers (and from any 
\n\u201cexposed\u201d external data sources), and are also decoupled from 
\nmost of the profile servers. In turn, this lessens the complexity of 
\nimplementing, integrating, and evolving query servers. Once the \nresource 
metadata is returned, the query server will either allow \nthe user herself to 
use the supplied URIs to find the data in which \nshe was interested 
(interactive mode), or it will retrieve, package, \nand deliver the data to the 
user (non-interactive mode). As with \nthe product and profile servers, query 
servers can be centrally \nlocated at a single site, or they can be 
decentralized across \nmultiple sites.   \n\nQuery Client components 
communicate with the query \nservers. The query client must provide a query 
server with a query \nthat identifies the data in which the user is interested, 
and it must \nset a mode for the query server (interactive or non-interactive 
\nmode). The query client may know the location of the query \nserver that it 
wants to contact, or it may rely on the messaging \nlayer connector to route 
its queries to one or more query servers.   \n\n4.2.4 Catalog and Archive 
Server and Client \nThe Catalog and Archive Server (CAS) component in OODT 
\n\nis responsible for providing a common mechanism for ingestion \nof data 
into a data store, including any processing required as a \nresult of 
ingestion. For instance, prior to the ingestion of a poor-\nresolution image of 
Mars, the image may need to be refined and \nthe resolution improved. CAS would 
handle this type of \nprocessing. Any data ingested into CAS must include 
associated \nmetadata information so that the data can be cataloged for search 
\nand retrieval purposes. Upon ingestion, the data is sent to a data \nstore 
for preservation, and the corresponding metadata is sent to \nthe associated 
catalog. The data store and catalog need not be \nlocated on the same host; 
they may be located on remote sites \nprovided there is an access mechanism to 
store and retrieve data \nfrom each. The goal of CAS is to streamline and 
standardize the \nprocess of adding data to an OODT-aware system.  Note that a 
\nsystem whose data stores were populated prior to its integration \ninto OODT 
can still use CAS for its new data.  Since the CAS \ncomponent populates data 
stores and catalogs with both data and \nmetadata, specialized product and 
profile server components have \nbeen developed to serve data and metadata from 
the CAS backend \ndata stores and catalogs more efficiently. Any older data can 
still \nbe served with existing product and profile servers. \n\nThe Archive 
Client component communicates with CAS. The \narchive client must know the 
location of the CAS component, and \nmust provide it with data to ingest. Many 
archive clients can \ncommunicate with a single CAS component, and vice versa.  
Both \nthe archive client and CAS components are completely \nindependent of 
the preceding three pairs of component types in \nthe OODT reference 
architecture. \n\n4.3 OODT Connectors \n4.3.1 Handler Connectors \n\nHandler 
connectors are responsible for enabling the \ninteraction between OODT\u2019s 
components and third-party data \nstores.  A handler connector performs the 
transformation between \nan underlying (meta-)data store\u2019s internal API 
for retrieving data \nand its (meta-)data format on the one hand, and the OODT 
system \n\n\n\non the other. Each handler connector is typically developed for 
a \nclass of data stores and metadata systems. For example, for a \ngiven DBMS 
such as Oracle, and a given internal representation \nschema for metadata, a 
generic Oracle handler connector is \ntypically developed and then reused. 
Similarly, for a given \nfilesystem scheme for storing data, a generic 
filesystem handler \nconnector is developed and reused across like filesystem 
data \nstores.  \n\nEach profile server and product server relies on one or 
more \nhandler connectors. Profile servers use profile handlers, and \nproduct 
servers use query handlers. Handler connectors thereby \ncompletely insulate 
product and profile servers from the third-\nparty data stores.  Handlers also 
allow for different types of \ntransformations on (meta-)data to be introduced 
dynamically \nwithout any effect on the rest of OODT components. For \nexample, 
a product server that distributes Mars image data might \nbe serviced by a 
query handler connector that returns high-\nresolution (e.g., 10 GB) JPEG image 
files of the latest summit \nclimbed by a Mars rover; if the system ends up 
experiencing \nperformance problems, another handler may be (temporarily) 
\nadded to return lower-resolution (e.g., 1 MB) JPEG image files of \nthe same 
scenario. Likewise, a profile server may have two \nprofile handler connectors, 
one that returns image-quality \nmetadata (e.g., resolution and bits/pixel) and 
another that returns \ninstrument metadata about Mars rover images (e.g., 
instrument \nname or image creation date). \n\n4.3.2 Messaging Layer Connector 
\nThe Messaging Layer connector is responsible for \n\nmarshalling data and 
metadata between components in an OODT \nsystem. The messaging layer must keep 
track of the locations of \nthe components, what types of components reside in 
which \nlocations, and if components are still running or not. Additionally, 
\nthe messaging layer is responsible for taking care of any needed \nsecurity 
mechanisms such as authentication against an LDAP \ndirectory service, or 
authorization of a user to perform certain \nrole-based actions. \n\nThe 
messaging layer in OODT provides synchronous \ninteraction among the 
components, and some delivery guarantees \non messages transferred between the 
software components. \nTypically in any large-scale data system, the 
asynchronous mode \nof interaction is not encouraged because partial data 
transfers are \nof no use to users such as scientists who need to make analysis 
on \nentire data sets. \n\nThe messaging layer supports communication between 
any \nnumber of connected OODT software components. In addition, \nthe 
messaging layer natively supports connections to other \nmessaging layer 
connectors as well.  This provides us with the \nability to extend and adapt an 
OODT system\u2019s architecture, as \nwell as easily tailor the architecture 
for any specific interaction \nneeds (e.g., by adding data encryption and/or 
compression \ncapabilities to the connector). \n\n5. EXPERIENCE AND CASE 
STUDIES \nThe OODT framework has been used both within and \n\noutside NASA. 
JPL, NASA\u2019s Ames Research Center, the \nNational Institutes of Health 
(NIH), the National Cancer Institute \n(NCI), several research universities, 
and U.S. Federally Funded \nResearch and Development Centers (FFRDCs) are all 
using \nOODT in some form or fashion. OODT is also available for \ndownload 
through a large open-source software distributor [29]. \n\nOODT components are 
found in planetary science, earth science, \nbiomedical, and clinical research 
projects. In this section, we \ndiscuss our experience with OODT in several 
representative \nprojects within these scientific areas. We compare and 
contrast \nhow the projects were handled before and after OODT. We sketch 
\nsome of the domain-specific technical challenges we encountered \nand 
identify how OODT helped to solve them. \n\nTo begin using OODT, a user designs 
a deployment \narchitecture from one or more of the reference OODT \ncomponents 
(e.g., product and profile servers), and the reference \nOODT connectors. The 
user must determine if any existing \nhandler connectors can be reused, or if 
specialized handler \nconnectors need to be developed. Once all the components 
are \nready, the user has two options for deploying her architecture to \nthe 
target hosts: (1) the user may translate her design into a \nspecialized OODT 
deployment descriptor XML file, which can \nthen be used to start each program 
on the target host(s); or (2) the \nuser can deploy her OODT architecture using 
a remote server \ncontrol component, adding components, and connectors via a 
\ngraphical user interface. The GUI allows the user to send \ncomponent and 
connector code to the target hosts, to start, shut-\ndown, and restart the 
components and connectors, and to monitor \ntheir health during execution. 
\n\n5.1 Planetary Data System \nOne of the flagship deployments of OODT has 
been for \n\nNASA\u2019s Planetary Data System (PDS) [30]. PDS consists of 
\nseven \u201cdiscipline nodes\u201d and an engineering and management \nnode. 
Each node resides at a different U.S. university or \ngovernment agency, and is 
managed autonomously.  \n\nFor many years PDS distributed its data and metadata 
on \nphysical media, primarily CD-ROM. Each CD-ROM was \nformatted a according 
to a \u201chome-grown\u201d directory layout \nstructure called an archive 
volume, which later was turned into a \nPDS standard. PDS metadata was 
constructed using a common, \nwell-structured set of 1200 metadata elements, 
such as Target \nName and Instrument Type, that were identified from the onset 
of \nthe PDS project by planetary scientists. Beginning in the late \n1990s the 
advent of the WWW and the increasing data volumes of \nmissions led NASA 
managers to impose a new paradigm for \ndistributing data to the users of the 
PDS: data and metadata were \nnow to be distributed electronically, via a 
single, unified web \nportal. The web portal and accompanying infrastructure to 
\ndistribute PDS data and metadata was built in 2001 using OODT \nin the manner 
depicted in Figure 1. \n\nWe faced several technical challenges deploying OODT 
to \nPDS. PDS data and metadata were highly distributed, spanning all \nseven 
of the scientific discipline nodes across the country. \nAlthough the entire 
data volume across PDS at the time was \naround 7 terabytes, it was estimated 
that the volume would grow \nto 10 terabytes by 2004. Consequently, the system 
needed to be \nscalable and respond to large growth spurts caused by new data 
\nproducing missions. The flexibility and modularity of the OODT \nproduct and 
profile server components were particularly useful in \nthis regard. Using a 
product and/or profile server, each new data \nproducing system in the PDS 
could be dynamically \u201cplugged in\u201d \nto the existing PDS 
infrastructure that we constructed, without \ndisturbing existing components 
and processes.  \n\nWe also faced the problem of heterogeneity. Almost every 
\nnode within PDS had a different operating system, ranging from \nLinux, to 
Windows, to Solaris, to Mac OS X.  Each node \n\n\n\nEDRN \nQuery 
\nServer\n\nm\nessaging layer (R\n\nM\nI)\n\nProduct \nServer\n\nDBMS 
\n(Specimen \nMetadata)\n\nmoffitt.usf.edu (win2k server)\n\nMS SQL DBMS 
\n(Specimen \nProducts)\n\nSpecimen \nQuery \n\nHandler\n\nSpecimen Profile 
\nHandler (MS SQL)\n\nOODT \u201cSandbox\u201d\n\nOODT 
\u201cSandbox\u201d\n\nProduct \nServer\n\nProfile 
\nServer\n\nanother.erne.server (AnotherOS)\n\nCAS Profile \nHandler\n\nCAS 
Query \nHandler\n\nOODT \u201cSandbox\u201d\nCatalog and \n\nArchive 
Server\n\nLung Images \n(Filesystem)\n\nOther 
\nApplications\n\nginger.fhcrc.org (win2k)\n\nOther Applications\n\nERNE Web 
\nPortal\n\n(Query Client)\n\nuser host\n\nProfile \nClient\n\nProduct 
\nClient\n\nProfile ServerOther \nApplications\n\nOther \nApplications\n\nOther 
Applications\n\nOther Applications\n\nSpecimen Inventory\n(MS SQL)\n\nOther 
Applications\n\nOther Applications\n\npds.jpl.nasa.gov (Linux)\nLegend:\n\nOODT 
\nComponent\n\nData/metadata \nstore\n\nOODT Connector Hardware \nhost\n\nOODT 
\ncontrolled \nportion of \nmachine\n\ndata/control flow\nBlack Box\n\n \n 
\n\nFigure 2. The Early Detection Research Network (EDRN) OODT Architecture 
Instantiation \n\nmaintained its own local catalog system. Although each node 
in \nPDS had different file system implementations dictated by their \nOS, each 
node stored their data and metadata according to the \narchive volume 
structure. Because of this, we were able to write a \nsingle, reusable PDS 
Query Handler which could serve back \nproducts from a PDS archive volume 
structure located on a file \nsystem. Plugging into each node\u2019s catalog 
system proved to be a \nsignificant challenge. For nearly all of the nodes, 
specialized \nprofile handler connectors were constructed to interface with the 
\nunderlying catalog systems, which ranged from static text files \ncalled PDS 
label files to dynamic web site inventory systems \nconstructed using Java 
Server Pages. Because each of the catalogs \ntagged PDS data using the common 
set of 1200 elements, we \nwere able to share much of the code base among the 
profile \nhandler connectors, ultimately only changing the portion of the 
\ncode that made the particular JSP page call, or read the selected \nset of 
metadata from the label file. The entire code base of the \nPDS including all 
the domain specific handler connectors is only \nslightly over 15 KSLOC, 
illustrating the high degree of \nreusability provided by the OODT framework. 
\n\n5.2 Early Detection Research Network \nOODT is also supporting the National 
Cancer Institute\u2019s \n\n(NCI) Early Detection Research Network (EDRN). EDRN 
is a \ndistributed research program that unites researchers from over \nthirty 
institutions across the United States. Tens of thousands of \nscientists 
participate in the EDRN. Each institution is focused on \nthe discovery of 
cancer biomarkers as indicators for disease [31]. \n\nA critical need for the 
EDRN is an electronic infrastructure to \nsupport discovery and validation of 
these markers.  \n\nIn 2001 we worked with the EDRN program to develop the 
\nfirst component of their electronic biomarker infrastructure called \nthe 
EDRN Resource Network Exchange (ERNE). The (partial) \ncorresponding 
architecture is depicted in Figure 2. One of the \nmajor goals of ERNE was to 
provide real-time access to bio-\nspecimen information across the institutions 
of the EDRN. Bio-\nspecimen information typically consisted of gigabytes of 
\nspecimen images, and location and contact metadata for obtaining \nthe 
specimen from its origin study institution. The previous \nmethod of obtaining 
bio-specimen information was very human-\nintensive: it involved phone calls 
and some forms of electronic \ncommunication such as email. Specimen 
information was not \nsearchable across institutions participating in the EDRN. 
The bio-\nspecimen catalogs were largely out-of-date, and out-of-synch with 
\ncurrent holdings at each participating institution.  \n\nOne of the initial 
technical challenges we faced with EDRN \nwas scale. The EDRN was over three 
times as large as the PDS. \nBecause of this we chose to target ten 
institutions initially, rather \nthan the entire set of thirty one. Again, 
OODT\u2019s modularity and \nscalability came into play as we could phase 
deployment at each \ndeployment institution. As we instantiated new product, 
profile, \nquery, and archive servers at each institution, we could do so 
\nwithout interrupting any existing OODT infrastructure already \ndeployed.  
\n\nAnother challenge that we encountered was dealing with \neach participating 
site\u2019s Institutional Review Board (IRB). An \nIRB is required to review 
and ensure compliance of projects with \n\n\n\nfederal laws related to working 
with data from research projects \ninvolving human subjects. To satisfy the 
IRB, any OODT \ncomponents deployed at an EDRN site had to provide an adequate 
\nsecurity capability in order to get approval to share the data \nexternally 
from an institution. OODT\u2019s separation of data and \nmetadata explicitly 
allowed us to satisfy this requirement. We \ndesigned ERNE so that each 
institution could remain in control of \ntheir specimen holding data by 
instantiating product server \ncomponents at each site, rather than 
distributing the information \nacross ERNE which would have violated the IRB 
agreements.  \n\nAnother significant challenge we faced in developing ERNE 
\nwas lack of a consistent metadata model for each ERNE site. We \nwere forced 
to develop a common specimen metadata model and \nthen to create specific 
mappings to link each local site to the \ncommon model. OODT aided us once 
again in this endeavor as \nthe common mappings we developed were easily 
codified into a \nquery handler connector, and reused across each ERNE site.  
\n\nThe entire code base of ERNE, including all its specialized \nhandler 
connectors is only slightly over 5.3 KSLOC, highlighting \nthe high degree of 
reusability of the shared framework code base \nand the handler code base. \n\n 
\n\n5.3 Science Processing Systems \nOODT has also been deployed in several 
science processing \n\nsystem missions both, operational and under development. 
Due to \nspace limitations, we can only briefly summarize each of the \nOODT 
deployments in these systems.  \n\nSeaWinds, a NASA-funded earth science 
instrument flying \non the Japanese ADEOS-II spacecraft, used the OODT CAS 
\ncomponent as a workflow and processing component for its \nProcessing and 
Analysis Center (SeaPAC). SeaWinds produced \nseveral gigabytes of data during 
its six year mission. CAS was \nused to control the execution and data flow of 
mission-specific \ndata processor components, which calibrated and created 
derived \ndata products from raw instrument data, and archived those \nproducts 
for distribution into the data store managed by CAS. A \nmajor challenge we 
faced during the development of SeaPAC was \nthat  the processor components 
were developed by a group \noutside of the SeaWinds project. We had to provide 
a mechanism \nfor integrating their source code into the OODT SeaPAC 
\nframework. OODT\u2019s separation of concerns allowed us to address \nthis 
issue with relative ease: once the data processors were \nfinished, we were 
able wrap and tailor them internally within \nCAS, without disturbing the 
existing SeaPaC infrastructure. \n\nThe success of the CAS within SeaWinds led 
to its reuse on \nseveral different missions. Another earth science mission 
called \nQuikSCAT retrofitted and replaced some of their existing \nprocessing 
components with CAS, using the SeaWinds experience \nas an example. The 
Orbiting Carbon Observatory (OCO) mission \nthat will fly in 2009, and that is 
currently under development, is \nalso utilizing CAS to ingest and process 
existing FTS CO2 \nspectrometer data from earth-based instruments. The James 
Web \nTelescope (JWT) is using the CAS for to implement its workflow \nand 
processing capabilities for astrophysics data and metadata. \nEach of these 
science processing systems will face similar \ntechnical challenges, including 
separation of concerns between \nthe actual processing framework and the 
developers writing the \nprocessor code, the volume of data that must be 
handled by the \nprocessing system (OCO is projected to produce over 150 
\nterabytes), and the flexibility and tailorability of the workflow \n\nneeded 
to process the data. We believe that OODT is uniquely \npositioned to address 
these difficult challenges. \n\n5.4 Computer Modeling Simulation and 
\nVisualization \n\nOODT has also been deployed to aid the Computer \nModeling 
Simulation and Visualization (CMSV) community at \nJPL, by linking together 
several institutional model repositories \nacross the organizations within the 
lab, and creating a web portal \ninterface to query the integrated model 
repositories. We \ndeveloped specialized profile server components that locate 
and \nlink to different model resources across JPL, such as power \nsubsystem 
models of the Mars Exploration Rovers (MER), CAD-\ndrawing models of different 
spacecraft assembly parts, and \nsystems architecture models for engineering 
and design of \nspacecraft. Each of these different model types lived in 
separate \nindependent repositories across JPL. For instance, the CAD \nmodels 
were stored in a commercial product called TeamCenter \nEnterprise [32], while 
the power and systems architecture models \nwere stored in a commercial product 
called Xerox Docushare \n[33].  \n\nTo integrate these model repositories for 
CMSV, we had to \nderive a common set of metadata across the wide spectrum of 
\ndifferent model types that existed at JPL. OODT\u2019s separation of \ndata 
from metadata allowed us to rapidly instantiate our common \nmetadata model 
once we developed it, by constructing specialized \nprofile handler connectors 
that mapped each repository\u2019s local \nmodel to the common model. 
Reusability levels were high across \nthe connectors, resulting in an extremely 
small code base of 2.57 \nKSLOC.  \n\nAnother challenge in light of this 
mapping activity was \ninterfacing with the APIs of the underlying model 
repositories. In \nthe above two cases, the APIs were commercial products, and 
\npoorly documented. In some cases, such as the Docushare \nrepository, the 
APIs did not fully conform to their stated \nspecifications. The division of 
labor amongst OODT components \ncame into play on this task. It allowed us to 
focus on deploying \nthe rest of the OODT supporting infrastructure, such as 
the web \nportal, and the profile handler connectors, and not getting stalled 
\nwaiting for the support teams from each of the commercial \nvendors to debug 
our API problems. Once the OODT CMSV \ninfrastructure was deployed, the 
modeling and simulation \ncommunity at JPL immediately began adopting it and 
sharing \ntheir models across the lab. During the past year, the system has 
\nreceived around 40,000 hits on the web portal, and over 9,000 \nqueries for 
models. \n\n6. CONCLUSIONS \nWhen the need arose at NASA seven years ago for a 
data \n\ndistribution and management solution that satisfied the formidable 
\nrequirements outlined in this paper, it was not clear to us initially \nhow 
to approach the problem.  On the surface, several applicable \nsolutions 
already existed (middleware, information integration \nsystems, and the 
emerging grid technologies).  Adopting one of \nthem seemed to be a preferable 
path because it would have saved \nus precious time.  However, upon closer 
inspection we realized \nthat each of these options could be instructive, but 
that none of \nthem solved the problem we were facing (and that even some of 
\nthese technologies themselves were facing). \n\nThe observation that directly 
inspired OODT was that we \nwere dealing with software engineering challenges, 
and that those \n\n\n\nchallenges naturally required a software engineering 
solution.  \nOODT is a large, complex, dynamic system, distributed across 
\nmany sites, servicing many different users, and classes of users, \nwith 
large amounts of heterogeneous data, possibly spanning \nmultiple domains. 
Software engineering research and practice \nboth suggest that success in 
developing such a system will be \ndetermined to a large extent by the 
system\u2019s software \narchitecture.  It therefore became imperative that we 
rely on our \nexperience within the domain of data-intensive systems (e.g., 
\nJPL\u2019s PDS project), as well as our study of related research and 
\npractice, in order to develop an architecture for OODT that will \naddress 
the challenges we discussed in Section 2.  Once the \narchitecture was designed 
and evaluated, OODT\u2019s initial \nimplementation and its subsequent 
adaptations followed naturally. \n\nAs OODT\u2019s developers we are heartened, 
but as software \nengineering researchers and practitioners disappointed, that 
\nOODT still appears to be the only system of its kind. The \nintersection of 
middleware, information management, and grid \ncomputing is rapidly growing, 
yet it is still characterized by one-\noff solutions targeted at very specific 
problems in specific \ndomains. Unfortunately, these solutions are sometimes 
clever by \naccident and more frequently little more than \u201chacks\u201d.  
We \nbelieve that OODT\u2019s approach is more appropriate, more \neffective, 
more broadly applicable, and certainly more helpful to \ndevelopers of future 
systems in this area.  We consider OODT\u2019s \ndemonstrated ability to evolve 
and its applicability in a growing \nnumber of science domains to be a 
testament to its explicit, \ncarefully crafted software architecture. \n\n7. 
ACKNOWLEDGEMENTS \nThis material is based upon work supported by the Jet 
\n\nPropulsion Laboratory, managed by the California Institute of \nTechnology. 
Effort also supported by the National Science \nFoundation under Grant Numbers 
CCR-9985441 and ITR-\n0312780.  \n\n8. REFERENCES \n[1] A. Chervenak, I. 
Foster, et al., \"The Data Grid: Towards an \n\nArchitecture for the 
Distributed Management and Analysis of \nLarge Scientific Data Sets,\" J. of 
Network and Computer \nApplications, vol. 23, pp. 187-200, 2000. \n\n[2] N. 
Medvidovic and R. N. Taylor, \"A Classification and \nComparison Framework for 
Software Architecture Description \nLanguages,\" IEEE TSE, vol. 26, pp. 70-93, 
2000. \n\n[3] D. E. Perry and A. L. Wolf, \"Foundations for the Study of 
\nSoftware Architecture,\" Software Engineering Notes (SEN), \nvol. 17, pp. 
40-52, 1992. \n\n[4] \"The Globus Alliance (http://www.globus.org),\" 2005. 
\n[5] \"Webservices.org (http://www.webservices.org),\" 2005. \n[6] A. Luther, 
R. Buyya, et al., \"Alchemi: A .NET-based \n\nEnterprise Grid Computing 
System,\" in Proc. of 6th \nInternational Conference on Internet Computing, Las 
Vegas, \nNV, USA, 2005. \n\n[7] \"JCGrid Web Site 
(http://jcgrid.sourceforge.net),\" 2005. \n[8] \"LHC Computing Grid 
(http://lcg.web.cern.ch/LCG/),\" 2005. \n[9] D. Bernholdt, S. Bharathi, et al., 
\"The Earth System Grid: \n\nSupporting the Next Generation of Climate Modeling 
\nResearch,\" Proceedings of the IEEE, vol. 93, pp. 485-495, \n2005. \n\n[10] 
A. Finkelstein, C. Gryce, et al., \"Relating Requirements and \nArchitectures: 
A Study of Data Grids,\" J. of Grid Computing, \nvol. 2, pp. 207-222, 2004. 
\n\n[11] C. A. Mattmann, N. Medvidovic, et al., \"Unlocking the Grid,\" \nin 
Proc. of CBSE, St. Louis, MO, pp. 322-336, 2005. \n\n[12] J. Hammer, H. 
Garcia-Molina, et al., \"Information translation, \nmediation, and mosaic-based 
browsing in the tsimmis system,\" \nin Proc. of ACM SIGMOD International 
Conference on \nManagement of Data, San Jose, CA, pp. 483-487, 1995. \n\n[13] 
T. Kirk, A. Y. Levy, et al., \"The information manifold,\" \nWorking Notes of 
the AAAI Spring Symposium on Information \nGathering in Heterogeneous, 
Distributed Environment, Menlo \nPark, CA, Technical Report SS-95-08, 1995. 
\n\n[14] O. Etzioni and D. S. Weld, \"A softbot-based interface to the 
\nInternet,\" CACM, vol. 37, pp. 72-76, 1994. \n\n[15] A. Go\u00f1i, A. 
Illarramendi, et al., \"An optimal cache for a \nfederated database system,\" 
Journal of Intelligent Information \nSystems, vol. 9, pp. 125-155, 1997. 
\n\n[16] M. R. Genesereth, A. Keller, et al., \"Infomaster: An \ninformation 
integration system,\" in Proc. of ACM SIGMOD \nInternational Conference on 
Management of Data, Tucson, \nAZ, pp. 539-542, 1997. \n\n[17] A. Tomasic, L. 
Raschid, et al., \"A data model and query \nprocessing techniques for scaling 
access to distributed \nheterogeneous databases in disco,\" IEEE Transactions 
on \nComputers, 1997. \n\n[18] Y. Arens, C. A. Knoblock, et al., \"Query 
Reformulation for \nDynamic Information Integration,\" Journal of Intelligent 
\nInformation Systems, vol. 6, pp. 99-130, 1996. \n\n[19] J. Ambite, N. Ashish, 
et al., \"Ariadne: A system for \nconstructing mediators for internet 
sources,\" in Proc. of ACM \nSIGMOD International Conference on Management of 
Data, \nSeattle, WA, pp. 561-563, 1998. \n\n[20] G. Barish and C. A. Knoblock, 
\"An Expressive and Efficient \nLanguage for Information Gathering on the 
Web,\" in Proc. of \n6th International Conference on AI Planning and Scheduling 
\n(AIPS-2002) Workshop, Toulouse, France, 2002. \n\n[21] A. Y. Halevy, 
\"Answering queries using views: A survey,\" \nVLDB Journal, vol. 10, pp. 
270-294, 2001. \n\n[22] J. L. Ambite, C. A. Knoblock, et al., \"Compiling 
Source \nDescriptions for Efficient and Flexible Information \nIntegration,\" 
Information Systems Journal, vol. 16, pp. 149-\n187, 2001. \n\n[23] E. 
Lambrecht and S. Kambhampati, \"Planning for Information \nGathering:  A 
Tutorial Survey,\" ASU CSE Technical Report \n96-017, May 1997. \n\n[24] 
\"Enterprise Java Beans (http://java.sun.com/ejb),\" pp. 2005. \n[25] \"Java 
RMI (http://java.sun.com/rmi/),\" 2005. \n[26] C. A. Mattmann, S. Malek, et 
al., \"GLIDE:  A Grid-based \n\nLightweight Infrastructure for Data-intensive 
Environments,\" \nin Proc. of European Grid Conference, Amsterdam, the 
\nNetherlands, pp. 68-77, 2005. \n\n[27] DCMI, \"Dublin Core Metadata Element 
Set,\" 1999. \n[28] T. Berners-Lee, R. Fielding, et al., \"Uniform Resource 
\n\nIdentifiers (URI): Generic Syntax,\" 1998. \n[29] \"Open Channel 
Foundation: Request Object Oriented Data \n\nTechnology (OODT) - 
\n(http://openchannelsoftware.com/orders/index.php?group_id=3\n32),\" 2005. 
\n\n[30] J. S. Hughes and S. K. McMahon, \"The Planetary Data System. \nA Case 
Study in the Development and Management of Meta-\nData for a Scientific Digital 
Library.,\" in Proc. of ECDL, pp. \n335-350, 1998. \n\n[31] S. Srivastava, 
Informatics in proteomics. Boca Raton, FL: \nTaylor & Francis/CRC Press, 2005. 
\n\n[32] \"UGS Products: TeamCenter 
\n(http://www.ugs.com/products/teamcenter/),\" 2005. \n\n[33] \"Document 
Management | Xerox Docushare \n(http://docushare.xerox.com/ds/),\" 2005. \n\n 
\n\n\n\n\n\n\n\n\n\n\n\n\n\tINTRODUCTION\n\tSOFTWARE ENGINEERING 
CHALLENGES\n\tBACKGROUND AND RELATED WORK\n\tOODT ARCHITECTURE\n\tGuiding 
Principles\n\tOODT Components\n\tProduct Server and Product Client\n\tProfile 
Server and Profile Client\n\tQuery Server and Query Client\n\tCatalog and 
Archive Server and Client\n\n\tOODT Connectors\n\tHandler 
Connectors\n\tMessaging Layer Connector\n\n\n\tEXPERIENCE AND CASE 
STUDIES\n\tPlanetary Data System\n\tEarly Detection Research Network\n\tScience 
Processing Systems\n\tComputer Modeling Simulation and 
Visualization\n\n\tCONCLUSIONS\n\tACKNOWLEDGEMENTS\n\tREFERENCES\n\n",
        "X-TIKA:parse_time_millis": "11123",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "created": "Wed Feb 15 13:13:58 PST 2006",
        "creator": "End User Computing Services",
        "date": "2006-02-15T21:16:01Z",
        "dc:creator": "End User Computing Services",
        "dc:format": "application/pdf; version=1.4",
        "dc:title": "Proceedings Template - WORD",
        "dcterms:created": "2006-02-15T21:13:58Z",
        "dcterms:modified": "2006-02-15T21:16:01Z",
        "grobid:header_Abstract": "Modern scientific research is increasingly 
conducted by virtual communities of scientists distributed around the world. 
The data volumes created by these communities are extremely large, and growing 
rapidly. The management of the resulting highly distributed, virtual data 
systems is a complex task, characterized by a number of formidable technical 
challenges, many of which are of a software engineering nature. In this paper 
we describe our experience over the past seven years in constructing and 
deploying OODT, a software framework that supports large, distributed, virtual 
scientific communities. We outline the key software engineering challenges that 
we faced, and addressed, along the way. We argue that a major contributor to 
the success of OODT was its explicit focus on software architecture. We 
describe several large-scale, real-world deployments of OODT, and the manner in 
which OODT helped us to address the domain-specific challenges induced by each 
deployment.",
        "grobid:header_AbstractHeader": "ABSTRACT",
        "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 
90089, USA",
        "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California 
Institute of Technology ; 2 Computer Science Department University of Southern 
California",
        "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 
Nenad Medvidovic 2 Steve Hughes 1",
        "grobid:header_BeginPage": "-1",
        "grobid:header_Class": "class org.grobid.core.data.BiblioItem",
        "grobid:header_Email": 
"{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu",
        "grobid:header_EndPage": "-1",
        "grobid:header_Error": "true",
        "grobid:header_FirstAuthorSurname": "Mattmann",
        "grobid:header_FullAffiliations": "[Affiliation{name='null', 
url='null', institutions=[California Institute of Technology], 
departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', 
postCode='91109', postBox='null', region='CA', settlement='Pasadena', 
addrLine='null', marker='1', addressString='null', affiliationString='null', 
failAffiliation=false}, Affiliation{name='null', url='null', 
institutions=[University of Southern California], departments=[Computer Science 
Department], laboratories=null, country='USA', postCode='90089', 
postBox='null', region='CA', settlement='Los Angeles', addrLine='null', 
marker='2', addressString='null', affiliationString='null', 
failAffiliation=false}]",
        "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, 
Nenad Medvidovic, Steve Hughes]",
        "grobid:header_Item": "-1",
        "grobid:header_Keyword": "Categories and Subject Descriptors D2 
Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data 
Management, Software Architecture",
        "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain 
Specific Architectures  (type:subject-headers), Keywords  
(type:subject-headers), OODT, Data Management, Software Architecture  
(type:subject-headers)]",
        "grobid:header_Language": "en",
        "grobid:header_NbPages": "-1",
        "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. 
Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
        "grobid:header_Title": "A Software Architecture-Based Framework for 
Highly Distributed and Data Intensive Scientific Applications",
        "meta:author": "End User Computing Services",
        "meta:creation-date": "2006-02-15T21:13:58Z",
        "meta:save-date": "2006-02-15T21:16:01Z",
        "modified": "2006-02-15T21:16:01Z",
        "pdf:PDFVersion": "1.4",
        "pdf:encrypted": "false",
        "producer": "Acrobat Distiller 6.0 (Windows)",
        "resourceName": "ICSE06.pdf",
        "title": "Proceedings Template - WORD",
        "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
        "xmpTPg:NPages": "10"
    }
]
{noformat}
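
h2. Calling /rmeta from Java

For anyone who wants to script against the server instead of using cURL, here's a minimal plain-JDK sketch that PUTs the same PDF to /rmeta and prints the JSON array shown above. The class name is a placeholder, port 9998 is tika-server's default, and it needs Java 9+ for readAllBytes:

{noformat}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: PUT a PDF to tika-server's /rmeta endpoint and print the raw
// JSON array it returns (same shape as the output pasted above).
public class RmetaClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9998/rmeta");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // tika-server reads the original file name from this header
        conn.setRequestProperty("Content-Disposition",
                "attachment; filename=ICSE06.pdf");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(Paths.get("ICSE06.pdf"), out); // stream the PDF as the request body
        }
        try (InputStream in = conn.getInputStream()) {
            System.out.write(in.readAllBytes()); // raw JSON; pretty-print downstream if needed
            System.out.flush();
        }
        conn.disconnect();
    }
}
{noformat}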

Great work, [~sujenshah]. I'm going to commit this now and start work on the 
Wiki page!
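
h2. Embedded usage sketch (candidate for the Wiki page)

For the Wiki page, a first cut at the embedded (no-server) usage could look like the sketch below. It assumes the GROBID journal parser from this ticket is on the classpath as org.apache.tika.parser.journal.JournalParser, along with the GROBID libs and resources; the class name and input file are placeholders:

{noformat}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.journal.JournalParser;
import org.apache.tika.sax.BodyContentHandler;

// Sketch: run the GROBID-backed journal parser in-process and print the
// grobid:* metadata keys, i.e. the same fields visible in the JSON above.
public class JournalParserDemo {
    public static void main(String[] args) throws Exception {
        JournalParser parser = new JournalParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("ICSE06.pdf"))) {
            // -1 lifts BodyContentHandler's default write limit so long papers aren't truncated
            parser.parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            if (name.startsWith("grobid:")) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }
}
{noformat}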


> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
> library for extracting, parsing and re-structuring raw documents such as PDF 
> into structured TEI-encoded documents with a particular focus on technical 
> and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and 
> help extract extra metadata about the paper, such as authors, publication, 
> citations, etc. 
> It would be nice to have this integrated into Tika. I have tried it locally 
> and will issue a pull request soon.


