[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696513#comment-14696513 ]

Chris A. Mattmann commented on TIKA-1699:
-----------------------------------------

I got this working! :-) 

h2. Starting Tika Server
{noformat}
java -Dorg.apache.tika.service.error.warn=true \
  -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* \
  org.apache.tika.server.TikaServerCli --config tika-config.xml
{noformat}
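
The {{tika-config.xml}} referenced above isn't shown in this comment; a minimal sketch of what it could contain (assuming the stock Tika 1.x XML config format, routing PDFs to the GROBID-backed JournalParser and letting everything else fall through to the DefaultParser):
{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Send PDFs to the GROBID-backed journal parser -->
    <parser class="org.apache.tika.parser.journal.JournalParser">
      <mime>application/pdf</mime>
    </parser>
    <!-- All other MIME types use the default parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
  </parsers>
</properties>
{noformat}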

h2. cURL command to test
{noformat}
curl -T $HOME/git/grobid/papers/ICSE06.pdf \
  -H "Content-Disposition: attachment;filename=ICSE06.pdf" \
  http://localhost:9998/rmeta | python -mjson.tool
{noformat}
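
Equivalently (a sketch, not from the ticket), the same /rmeta call can be scripted; this assumes the third-party Python {{requests}} library and a local copy of {{ICSE06.pdf}}:
{noformat}
import requests

# curl -T issues an HTTP PUT, so mirror that here.
# Hypothetical local path; substitute any PDF you want to test.
with open("ICSE06.pdf", "rb") as f:
    resp = requests.put(
        "http://localhost:9998/rmeta",
        data=f,
        headers={"Content-Disposition": "attachment;filename=ICSE06.pdf"},
    )
resp.raise_for_status()

# /rmeta returns a JSON array: one object per parsed (sub-)document,
# with metadata keys plus the extracted text under "X-TIKA:content".
for record in resp.json():
    print(record.get("Content-Type"), record.get("X-Parsed-By"))
{noformat}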

h2. Output

{noformat}
[
    {
        "Author": "End User Computing Services",
        "Company": "ACM",
        "Content-Type": "application/pdf",
        "Creation-Date": "2006-02-15T21:13:58Z",
        "Last-Modified": "2006-02-15T21:16:01Z",
        "Last-Save-Date": "2006-02-15T21:16:01Z",
        "SourceModified": "D:20060215211344",
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.journal.JournalParser"
        ],
        "X-TIKA:content": 
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings
 Template - WORD\n\n\nA Software Architecture-Based Framework for Highly 
\nDistributed and Data Intensive Scientific Applications \n\n \nChris A. 
Mattmann1, 2        Daniel J. Crichton1        Nenad Medvidovic2        Steve 
Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of 
Technology \nPasadena, CA 91109, USA 
\n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science 
Department \nUniversity of Southern California  \n\nLos Angeles, CA 90089, USA 
\n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is 
increasingly conducted by virtual \ncommunities of scientists distributed 
around the world. The data \nvolumes created by these communities are extremely 
large, and \ngrowing rapidly. The management of the resulting highly 
\ndistributed, virtual data systems is a complex task, characterized \nby a 
number of formidable technical challenges, many of which \nare of a software 
engineering nature.  In this paper we describe \nour experience over the past 
seven years in constructing and \ndeploying OODT, a software framework that 
supports large, \ndistributed, virtual scientific communities. We outline the 
key \nsoftware engineering challenges that we faced, and addressed, \nalong the 
way. We argue that a major contributor to the success of \nOODT was its 
explicit focus on software architecture. We \ndescribe several large-scale, 
real-world deployments of OODT, \nand the manner in which OODT helped us to 
address the domain-\nspecific challenges induced by each deployment.  
\n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 
Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software 
Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, 
highly complex, \n\noften widely distributed, increasingly decentralized, 
dynamic, and \nmobile.  There are many causes behind this, spanning virtually 
all \nfacets of human endeavor: desired advances in education, \nentertainment, 
medicine, military technology, \ntelecommunications, transportation, and so on. 
  \n\nOne major driver of software\u2019s growing complexity is \nscientific 
research and exploration.  Today\u2019s scientists are solving \nproblems of 
until recently unimaginable complexity with the help \nof software.  They also 
actively and regularly collaborate with \n\ncolleagues around the world, 
something that has become possible \nonly relatively recently, again ultimately 
thanks to software. They \nare collecting, producing, sharing, and 
disseminating large \namounts of data, which are growing by orders of magnitude 
in \nvolume in remarkably short time periods. \n\nIt is this latter problem 
that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years 
ago.  Until recently, \nJPL would disseminate data collected by various 
instruments \n(Earth-based, orbiting, and in outer space) to the interested 
\nscientists around the United States by \u201cburning\u201d CD-ROMs and 
\nmailing them via the U.S. Postal Service.  In addition to being \nslow, 
sequential, unidirectional, and lacking interactivity, this \nmethod was 
expensive, costing hundreds of thousands of dollars. \nFurthermore, the method 
was prone to security breaches, and the \nexact data distribution (determining 
which data goes to which \ndestinations) had to be calculated for each 
individual shipment. It \nhad become increasingly difficult to manage this 
process as the \nnumber of projects and missions, as well as involved 
scientists, \ngrew.  An even more critical limiting factor became the sheer 
\nvolume of data that the current (e.g., Planetary Data System, or \nPDS), 
pending (e.g., Mars Reconnaissance Orbiter, or MRO), and \nplanned (e.g., Lunar 
Reconnaissance Orbiter, or LRO) missions \nwould produce: from terabytes (PDS), 
to hundreds of terabytes \n(MRO), to petabytes or more (LRO).  Clearly, 
spending millions \nof dollars just to distribute the data to scientists is 
impractical. \n\nThis prompted NASA\u2019s Office of Space Science to explore 
\nconstruction of an end-to-end software framework that would \nlower the cost 
of distributing and managing scientific data, from \nthe inception of data at a 
science processing center to its ultimate \narrival on the desks of interested 
users. Because of increasing data \nvolumes, the framework had to be scalable 
and have native \nsupport for evolution to hundreds of sites and thousands of 
data \ntypes. Additionally, the framework had to enable the \nvirtualization of 
heterogeneous data (and processing) sources, and \nto address wide-scale 
(national and international) distribution of \ndata. The framework needed to be 
flexible: it needed to support \nfully automated processing of data throughout 
its lifecycle, while \nstill allowing interactivity and intervention from an 
operator when \nneeded. Furthermore because data is itself distributed across 
\nNASA agencies, any software framework that distributes NASA\u2019s \ndata 
would require the capability for tailorable levels of security \nand for 
varying types of users belonging to multiple \norganizations. \n\nThere were 
also miscellaneous issues of data ownership that \nneeded to be overcome. 
Ultimately, because NASA\u2019s science data \nis so distributed, the owners of 
data systems (e.g., a Planetary \n\n \n\nPermission to make digital or hard 
copies of all or part of this work for \npersonal or classroom use is granted 
without fee provided that copies are \nnot made or distributed for profit or 
commercial advantage and that \ncopies bear this notice and the full citation 
on the first page. To copy \notherwise, or republish, to post on servers or to 
redistribute to lists, \nrequires prior specific permission and/or a fee. 
\nICSE06\u2019, May 20\u201328, 2006, Shanghai, China. \nCopyright 2006 ACM 
1-58113-000-0/00/0004\u2026$5.00. \n \n\n\n\nScience Principal Investigator) 
feel hard pressed to control their \ndata, as the successful operation and 
maintenance of their data \nsystems are essential services that they provide. 
As such, any \nframework that virtualizes science data sources across NASA 
\nshould be transparent and unobtrusive: it should enable \ndissemination and 
retrieval of data across data systems, each of \nwhich may have their own 
external interfaces and services; at the \nsame time, it should enable 
scientists to maintain and operate their \ndata systems independently. Finally, 
to lower costs, once the \nframework was built and installed, it needed to be 
reusable, free, \nand distributable to other NASA sites and centers for use. 
\n\nOver the past seven years we have designed, implemented \nand deployed a 
framework called OODT (Object Oriented Data \nTechnology) that has met these 
rigorous demands. In this paper \nwe discuss the significant software 
engineering challenges we \nfaced in developing OODT.  The primary objective of 
the paper is \nto demonstrate how OODT\u2019s explicit software architectural 
basis \nenabled us to effectively address these challenges.  In particular, 
\nwe will detail the architectural decisions we found most difficult \nand/or 
critical to OODT\u2019s ultimate success. We highlight several \nrepresentative 
examples of OODT\u2019s use to date both at NASA \nand externally. We contrast 
our solution with related approaches, \nand argue that a major differentiator 
of this work, in addition to its \nexplicit architectural foundation, is its 
native support for \narchitecture-based development of distributed scientific 
\napplications. \n\n2. SOFTWARE ENGINEERING \nCHALLENGES \n\nTo develop OODT, 
we needed to address several significant \nsoftware engineering challenges, the 
bulk of which surfaced in \nlight of the complex data management and 
distribution issues \nregularly faced within a distributed, large-scale 
government \norganization such as NASA. In this paper we will focus on nine 
\nkey challenges: Complexity, Heterogeneity, Location \nTransparency, Autonomy, 
Dynamism, Scalability, Distribution, \nDecentralization, and Performance. 
\n\nComplexity \u2013 We envisioned OODT to be a large, multi-site, 
\nmulti-user, complex system. At the software level, complexity \nranged from 
understanding how to install, integrate, and manage \nthe software remotely 
deployed at participating organizations, to \nunderstanding how to manage 
information such as access \nprivileges and security credentials across both 
NASA and non-\nNASA sites. There were also complexities at the software 
\nnetworking layer, including varying firewall capabilities at each 
\ninstitution, and data repositories that would periodically go offline \nand 
needed to be remotely restarted. Just understanding the \nvarying types of data 
held at sites linked together via OODT was \na significant task. Even sites 
within the same science domain \n(e.g., planetary science) describe similar 
data sets in decidedly \ndifferent ways. Discerning in what ways these 
different data \nmodels were common and what attributes of data could be 
shared, \ndone away with, or amended, was a huge challenge. Finally, the 
\ndifferent interfaces to data, ranging from third-party, well-\nengineered 
database management systems, to in-house data \nsystems, ultimately to flat 
text file-based data was a particularly \ndifficult challenge that we had to 
hurdle. \n\nHeterogeneity \u2013 In order to drive down the data management 
\ncosts for science missions, the same OODT framework needed to \n\nspan 
multiple science domains. The domains initially targeted \nwere earth and 
planetary; this has subsequently been expanded to \nspace, biomedical sciences, 
and the modeling and simulation \ncommunities. As such, the same core set of 
OODT software \ncomponents, system designs, and implementation-level facilities 
\nhad to work across widely varying science domains.  \n\nThe data management 
processes within the organizations that \nuse OODT also added to its 
heterogeneity. For instance, OODT \ncomponents needed to have interfaces with 
end users and support \ninteractive sessions, but also with scientific 
instruments, which \nmost likely were automatic and non-interactive. Scientific 
\ninstruments could push data to certain components in OODT, \nwhile other OODT 
components would need to distribute data to \nusers outside of OODT. End-users 
in some cases wanted to \nperform transformations on the data sent to them by 
OODT, and \nthen to return the data back into OODT. The framework needed to 
\nsupport scenarios such as these seamlessly. \n\nMany other constraints also 
imposed the heterogeneity \nrequirement on OODT. We can group these constraints 
into two \nmajor categories: \n\u2022 Organizational \u2013 As we briefly 
alluded above, discipline \n\nexperts who wanted to disseminate their data via 
OODT \nreally wanted the data to reside at their respective \ninstitutions. 
This constraint non-negotiable, and significantly \nimpacted the space of 
technical solutions that we could \ninvestigate for OODT.  \n\n\u2022 Technical 
\u2013 Since OODT had to federate many different data \nholdings and catalogs, 
we faced the constraints of linking \nthem together and federating very 
different schemas and \nvarying levels of sophistication in the data system 
interfaces \n(e.g., flat files, DBMS, web pages). Even those systems \nmanaging 
data through \u201chigher level APIs\u201d and middleware \n(e.g., RMI, CORBA, 
SOAP) proved non-trivial to integrate. \nThe constraints enjoined by 
heterogeneity alone led us to \n\nrealize that the OODT framework would need to 
draw heavily \nfrom multiple areas. Database systems, although used 
\nsuccessfully for many years to manage large amounts of data at \nmany sites, 
lacked the flexibility and interface capability to \nintegrate data from other 
more crude APIs and storage systems \n(such as a PI-led web site). Databases 
also did not address the \ndistribution of data and \u201cownership\u201d 
issues. The advent of the \nweb, although a promising means for providing 
openness and \nflexible interfaces to data, would not alone address the issues 
such \nas multi-institutional security and access. Furthermore, its 
\nrequest/reply nature would not easily handle other distribution \nscenarios, 
e.g., subscribe/notify. Research in the area of grid \ncomputing [1] has 
defined \u201cout of the box\u201d services for managing \ndata systems (e.g., 
GridFTP), but which utilized alone would not \naddress our other challenges 
(e.g., complexity). \n\nLocation Transparency \u2013 Even though data could 
potentially \nbe input into and output from the system from many 
\ngeographically disparate and distributed sites, it should appear to \nthe 
end-users as if the data flow occurred from a single location. \nThis 
requirement was reinforced by the need to dynamically add \ndata producers and 
consumers to a system supported by OODT, \nas will be further discussed below. 
\n\nAutonomy \u2013 When designing the OODT framework, we could \nnot dictate 
how data providers should store, process, find, evolve, \nor retire their data. 
Instead, the framework needed to be \n\n\n\ntransparent, allowing data 
providers to continue with their regular \nbusiness processes, while managing 
and disseminating their \ninformation unobtrusively.  \n\nDynamism \u2013 It is 
expected that data providers for the most part \nwill be stable organizations. 
However, there are cases in which \nnew data producing (occasionally) and 
consuming (frequently) \nnodes will need to be brought on-line. Back-end data 
sources need \nto be pluggable, with little or no direct impact on the end-user 
of \nthe OODT system, or on the organization that owns the data \nsource. New 
end-users (or client hosts) should also be able to \n\u201ccome and go\u201d 
without any disruption to the rest of the system. In \nthe end, we realized 
this meant the whole infrastructure must be \ncapable of some level of dynamism 
in order to meet these \nconstraints. \n\nScalability \u2013 OODT needed to 
manage large volumes of data, \nfrom at least hundreds of gigabytes at its 
inception to the current \nmissions which will produce hundreds of terabytes. 
The \nframework needed to support at least dozens of institutional data 
\nproviders (which themselves may have subordinate data system \nproviders), 
dozens of user types (e.g., scientists, teachers, \nstudents, policy makers), 
thousands of users, hundreds of \ngeographic sites, and thousands of different 
data types to manage \nand disseminate. \n\nDistribution \u2013 The framework 
should be able to handle the \nphysical distribution of data across sites 
nationally and \ninternationally, and ultimately the physical distribution of 
the \nsystem interfaces which provide the data. \n\nDecentralization \u2013 
Each site may have its own data \nmanagement processes, interfaces and data 
types, which were \noperating independently for some time. We needed to devise 
a \nway of coordinating and managing data between these data sites \nand 
providers without centralizing control of their systems, or \ninformation. In 
other words, the requirement was that the different \nsites retain their full 
autonomy, and that OODT adapts instead. \n\nPerformance \u2013 Despite its 
scale and interaction with many \norganizations, data systems, and providers, 
OODT still needed to \nperform under stringent demands. Queries for information 
needed \nto be serviced quickly: in many cases response time under five 
\nseconds was used as a baseline. Additionally, OODT needed to be \noperational 
whenever any of the participating scientists wanted to \nlocate, access, or 
process their data. \n\n3. BACKGROUND AND RELATED WORK \nSeveral large-scale 
software technologies that distribute, \n\nmanage, and process information have 
been constructed over the \npast decade. Each of these technologies falls into 
one or more of \nfour distinct areas: grid-computing, information integration, 
\ndatabases, and middleware. In this section, we briefly survey \nrelated 
projects in each of these areas and compare their foci and \naccomplishments to 
those of OODT. Additionally, since a major \nfocal point of OODT is software 
architecture, we start out by \nproviding some brief software architecture 
background and \nterminology to set the context. \n\nTraditionally, software 
architecture has referred to the \nabstraction of a software system into its 
fundamental building \nblocks: software components, their methods of 
interaction (or \nsoftware connectors), and the governing rules that guide the 
\n\ncomposition of software components and software connectors 
\n(configurations) [2, 3]. Software architecture has been recognized \nin many 
ways to be the linchpin of the software development \nprocess. Ideally, the 
software requirements are reflected within \nthe software system\u2019s 
components and interactions; the \ncomponents and interactions are captured 
within the system\u2019s \narchitecture; and the architecture is used to guide 
the design, \nimplementation, and evolution of the system. Design guidelines 
\nthat have been proven effective are often codified into \narchitectural 
styles, while specific architectural solutions (e.g., \nconcrete system 
structures, component types and interfaces, and \ninteraction facilities) 
within specific domains are captured as \nreusable reference architectures. 
\n\nGrid computing deals with highly complex and distributed \ncomputational 
problems and large volume data management \ntasks. Massive parallel 
computation, distributed workflow, and \npetabyte scale data distribution are 
only a small cross-section of \nthe grid\u2019s capabilities. Grid projects are 
usually broken down into \ntwo areas. Computational grid systems are concerned 
with \nsolving complex scientific problems involving supercomputing \nscale 
resources dispersed across various organizational \nboundaries. The 
representative computational grid system is the \nGlobus Toolkit [4]. Globus is 
built on top of a web-services [5] \nsubstrate and provides resource management 
components, \ndistributed workflow and security infrastructure. Other 
\ncomputational grid systems provide similar capabilities. For \nexample, 
Alchemi [6] is a .NET-based grid technology that \nsupports distributed job 
scheduling and an object-oriented grid \ndevelopment environment. JCGrid [7] is 
a light weight, Java-\nbased open source computational grid project whose goal 
is to \nsupport distributed job scheduling and the splitting of CPU-\nintensive 
tasks across multiple machines.  \n\nThe other class of grid systems, Data 
grids, is involved in the \nmanagement, processing, and distribution of large 
data volumes to \ndisbursed and heterogeneous users, user types, and geographic 
\nlocations. There are several major data grid projects. The LHC \nComputing 
Grid [8] is a system whose main goal is to provide a \ndata management and 
processing infrastructure for the high \nenergy physics community. The Earth 
System Grid [9] is geared \ntowards supporting climate modeling research and 
distribution of \nclimate data sets and metadata to the climate and weather 
\nscientific community.  \n\nTwo independently conducted studies [10, 11] have 
\nidentified three key areas that the current grid implementations \nmust 
address more effectively in order to promote data and \nsoftware 
interoperability: (1) formality in grid requirements \nspecification, (2) 
rigorous architectural description, and (3) \ninteroperability between grid 
solutions. As we will discuss in this \npaper, our work to date on OODT has the 
potential to be a \nstepping stone in each of these areas: its explicit focus 
on \narchitectures for data-intensive, \u201cgrid-like\u201d systems naturally 
\naddresses the three concerns.  \n\nThere have been several well-known efforts 
within the AI \nand database communities that have delved into the topic of 
\ninformation integration, or the shared access, search, and retrieval \nof 
distributed, heterogeneous information resources. Within the \npast decade, 
there has been significant interest in building \ninformation mediators that 
can integrate information from \nmultiple data sources. Mediators federate 
information by querying \nmultiple data sources, and fusing back the gathered 
results. The \nrepresentative systems using this approach include TSIMMS [12], 
\n\n\n\nInformation Manifold [13], The Internet Softbot [14], InfoSleuth 
\n[15], Infomaster [16], DISCO [17], SIMS [18] and Ariadne  [19]. \nEach of 
these approaches focuses on fundamental algorithmic \ncomponents of information 
integration: (1) formulating \nexpressive, efficient query languages (such as 
Theseus [20]) that \nquery many heterogeneous data stores; (2) accurately and 
reliably \ndescribing both global, and source data models (e.g. the 
Global-\nas-view [12] and Local-as-view [21] approaches); (3) providing a 
\nmeans for global-to-source data model integration; and (4) \nimproving 
queries and deciding which data sources to query (e.g. \nquery reformulation 
[22] and query rewriting [22, 23]).  \n\nHowever, these algorithmic techniques 
fail to address the \nsoftware engineering side of information integration. For 
instance, \nexisting literature fails to answer questions such as, which of the 
\ncomponents in the different systems\u2019 architectures are common; \nhow can 
they be reused; which portions of their implementations \nare tied to (which) 
software components; which software \nconnectors are the components using to 
interact; are the \ninteraction mechanisms replaceable (e.g., can a 
client-server \ninteraction in Ariadne become a peer-to-peer interaction); and 
so \non. Additionally, none of the above related mediator systems have 
\nformalized a process for designing, implementing, deploying, and 
\nmaintaining the software components belonging to each system.  \n\nSeveral 
middleware technologies such as CORBA, \nEnterprise Java Beans [24], Java RMI 
[25], and more recently \nSOAP and Web services [5] have been suggested as 
\u201csilver \nbullets\u201d that address the problem of integrating and 
utilizing \nheterogeneous software computing and data resources. Each of 
\nthese technologies provides three basic services: (1) an \n\nimplementation 
and composition framework for software \ncomponents, possibly written in 
different languages but \nconforming to a specific middleware interface; (2) a 
naming \nregistry used to locate components; and (3) a set of basic services 
\nsuch as (un-)marshalling of data, concurrency, distribution and \nsecurity.  
\n\nAlthough middleware is very useful \u201cglue\u201d that can connect 
\nsoftware components written in different languages or deployed \nin 
heterogeneous environments, middleware technologies do not \nprovide any 
\u201cout of the box\u201d services that deal with computing \nand data 
resource management across organizational boundaries \nand across computing 
environments at a national scale. These \nkinds of services usually have to be 
engineered into the \nmiddleware itself. We should note that in grid computing 
such \nservices are explicitly called out and provided at a higher layer of 
\nabstraction. In fact, the combination of these higher-level grid \nservices 
and an underlying middleware platform is typically \nreferred to as a 
\u201cgrid technology\u201d [11].  \n\n4. OODT ARCHITECTURE \nOODT\u2019s 
architecture is a reference architecture that is \n\nintended to be 
instantiated and tailored for use across science \ndomains and projects. The 
reference architecture comprises \nseveral components and connectors.  A 
particular instance of this \nreference architecture, that of NASA\u2019s 
planetary data system \n(PDS) project, is shown in Figure 1. OODT is installed 
on a given \nhost inside a \u201csandbox\u201d, and is aware of and interacts 
only with \nthe designated external data sources outside its sandbox. 
OODT\u2019s \n\nm\nessaging layer (H\n\nTTP)\n\n\u2026\n.. \u2026..\n\n 
\nFigure 1. The Planetary Data System (PDS) OODT Architecture Instantiation 
\n\n\n\ncomponents are responsible for delivering data from \nheterogeneous 
data stores, identifying and locating data within the \nsystem, and ingesting 
and processing data into underlying data \nstores. The connectors are 
responsible for integrating OODT with \nheterogeneous data sources; providing 
reliable messaging to the \nsoftware components; marshalling resource 
descriptions and \ntransferring data between components; transactional 
\ncommunication between components; and security related issues \nsuch as 
identification, authorization, and authentication. In this \nsection, we 
describe the guiding principles behind the reference \narchitecture. We then 
describe each of the OODT reference \ncomponents and connectors in detail. In 
Section 5, we describe \nspecific instantiations of the reference architecture 
in the context \nof several projects that are using OODT. \n\n4.1 Guiding 
Principles \nThe software engineering challenges discussed in Section 2 
\n\nmotivated and framed the development of OODT. Conquering \nthese challenges 
led us to a set of four guiding principles behind \nthe OODT reference 
architecture.  \n\nThe first guiding principle is division of labor. Each 
\ncapability provided by OODT (e.g., processing, ingestion, search, \nand 
retrieval of data, access to heterogeneous data, and so on) is \ncarefully 
divided among separate, independent architectural \ncomponents and connectors. 
As will be further detailed below, the \nprinciple is upheld through 
OODT\u2019s rigorous separation of \nconcerns, and modularity enforced by 
explicit interfaces. This \nprinciple addresses the complexity, heterogeneity, 
dynamism, and \ndecentralization challenges. \n\nClosely related to the 
preceding principle is technology \nindependence. This principle involves 
keeping up-to-date with the \nevolution of software technology (both in-house 
and third-party), \nwhile avoiding tying the OODT architecture to any specific 
\nimplementation. By allowing us to select the technology most \nappropriate to 
a given task or specific need, this principle helps us \nto address the 
challenges of complexity, scalability, security, \ndistribution, location 
transparency, performance, and dynamism.  \nFor instance, OODT\u2019s initial 
reference implementation used \nCORBA as the substrate for its messaging layer 
connector. When \nthe CORBA vendor decided to begin charging JPL significant 
\nlicense fees (thus violating NASA\u2019s objective of producing a \nsolution 
that would be free to its users), the principle of \ntechnology independence 
came into play. Because the OODT \nmessaging layer connector supports a wrapper 
interface around \nthe lower-level distribution technology, we were able to 
replace \nour initial CORBA-based connector with one using Java\u2019s open 
\nsource RMI middleware, and redeploy the new connector to the \nOODT user 
sites, within three person days.  \n\nAnother guiding principle of OODT is the 
distinguishing of \nmetadata as a first-class citizen in the reference 
architecture, and \nseparating metadata from data. The job of metadata (i.e., 
\u201cdata \nabout data\u201d) is to describe the data universe in which the 
system \nis operating. Since OODT is meant to be a technology that \nintegrates 
diverse data sources, this data universe is highly \nheterogeneous and possibly 
dynamic. Metadata in OODT is \nmeant to catalog information, allowing a user to 
locate and \ndescribe the actual data in which she is interested. On the other 
\nhand, the job of data in OODT is to describe physical or scientific 
\nphenomena; it is the ultimate end user product that an OODT \nsystem should 
deliver. This principle helps to address the \n\nchallenges of heterogeneity, 
autonomy of data providers, and \ndecentralization. \n\nSeparating the data 
model from the software is another key \nprinciple behind the reference 
architecture. Akin to ontology/data-\ndriven systems, OODT components should 
not be tied to the data \nand metadata that they manipulate. Instead, the 
components \nshould be flexible enough to understand many (meta-)data models 
\nused across different scientific domains, without reengineering or 
\ntailoring of the component implementations. This principle helps \nto address 
the challenges of complexity and heterogeneity. \n\nThese four guiding 
principles are reified in a reference \narchitecture comprising four pairs of 
component types and two \nclasses of connectors organized in a canonical 
structure. One \ninstantiation of the reference architecture reflecting the 
canonical \nstructure is depicted in Figure 1.  Each OODT architectural 
\nelement (component and connector) serves a specific purpose, \nwith its 
functionality exported through a well-defined interface.  \nThis supports 
OODT\u2019s constant evolution, allowing us to add, \nremove, and substitute, 
if necessary dynamically (i.e., at runtime), \nelements of a given type. It 
also allows us to introduce flexibility \nin the individual instances of the 
reference architecture while, at \nthe same time, controlling the legal system 
configurations.  \nFinally, the explicit connectors and well-defined component 
\ninterfaces allow OODT in principle to integrate with a wide \nvariety of 
third-party systems (e.g., [26]).  The outcome of the \nguiding principles 
(described above) and design decisions \n(detailed below) is an architecture 
that is \u201ceasy to build, hard to \nbreak\u201d. \n\n4.2 OODT Components 
\n4.2.1 Product Server and Product Client \n\nThe Product Server is used to 
retrieve data from \nheterogeneous data stores. The product server accepts a 
query \nstructure that identifies a set of zero or more products which \nshould 
be returned the issuer of the query. A product is a unit of \ndata in OODT and 
represents anything that a user of the system is \ninterested in retrieving: a 
JPEG image of Mars, an MS Word \ndocument, a zip file containing text file 
results of a cancer study, \nand so on. Product servers can be located at 
remote data sites, \ngeographically and/or institutionally disparate from other 
OODT \ncomponents. Alternatively, product servers can be centralized, \nlocated 
at a single site. The objective of the product server is to \ndeliver data from 
otherwise heterogeneous data stores and \nsystems. As long as a data store (or 
system) provides some kind \nof access interface to get its data, a product 
server can \u201cwrap\u201d \nthose interfaces with the help of Handler 
connectors described in \nSection 4.3 below. \n\nThe Product Client component 
communicates with a product \nserver via the Messaging Layer connectors 
described in Section \n4.3. A product client resides at the end-user\u2019s 
(e.g., scientist\u2019s) \nsite.  It must know the location of at least one 
product server, and \nthe query structure that identifies the set of products 
that the user \nwants to retrieve. At the same time, it is completely insulated 
\nfrom any changes in the physical location or actual representation \nof the 
data; its only interface is to the product server(s).  Many \nproduct clients 
may communicate with the same product server, \nand many product servers can 
return data to the same product \nclient. This adds flexibility to the 
architecture without introducing \nunwanted long-term dependencies: a product 
client can be added, \n\n\n\nremoved, or replaced with another one that depends 
on different \nproduct servers, without any effect on the rest of the 
architecture. \n\n4.2.2 Profile Server and Profile Client \nThe Profile Server 
manages resource description \n\ninformation, i.e., metadata, in a system built 
with OODT. \nResource description information is divided into three main 
\ncategories: \n\u2022 Housekeeping Information \u2013 Metadata such as ID, 
Last \n\nModified Date, Last Revised By. This information is kept \nabout the 
resource descriptions themselves and is used by the \nprofile server to 
inventory and catalog resource descriptions. \nThis is a fixed set of metadata. 
\n\n\u2022 Resource Information \u2013 This includes metadata such as Title, 
\nAuthor, Creator, Publisher, Resource Type, and Resource \nLocation. This 
information is kept for all the data in the \nsystem, and is an extended 
version of the Dublin Core \nMetadata for describing electronic resources [27]. 
This is \nalso a fixed set of metadata. \n\n\u2022 Domain-Specific Information 
\u2013 This includes metadata \nspecific to a particular data domain. For 
instance, in a cancer \nresearch system this may include metadata such as Blood 
\nSpecimen Type, Site ID, and Protocol/Study Description. \nThis set of 
metadata is flexible and is expected to change. \n\nAs with product servers, 
profile servers can be decentralized at \nmultiple sites or centralized at a 
single site. The objective of the \nprofile server is to deliver metadata that 
gives a user enough \ninformation to locate the actual data within OODT 
regardless of \nthe underlying system\u2019s exact configuration, and degrees 
of \ncomplexity and heterogeneity; the user then retrieves the data via \none 
or more product servers. Because profile servers do not serve \nthe actual 
data, they need not have a direct interface to the data \nthat they describe. 
In addition to the complete separation of duties \nbetween profile and product 
servers, this ensures their location \nindependence, allows their separate 
evolution, and minimizes the \neffects of component and/or network failures in 
an OODT system. \n\nProfile Client components communicate with profile servers 
\nover the messaging layer connectors. The client must know the \nlocation of 
the profile server, and must provide a query that \nidentifies the metadata 
that a user is interested in retrieving. There \ncan be many profile clients 
speaking with a single profile server, \nand many profile servers speaking with 
a single profile client.  \nThe architectural effects are analogous to those in 
the case of \nproduct clients and servers. \n\n4.2.3 Query Server and Query 
Client \nThe Query Server component provides an integrated search \n\nand 
retrieval capability for the OODT reference architecture. \nQuery servers 
interact with profile and product servers to retrieve \nmetadata and data 
requested by system users. A query server is \nseeded with an initial set of 
references to profile servers. Upon \nreceiving a query from a user, the query 
server passes it along to \neach profile server from its list, and collects the 
metadata \nreturned. Part of this metadata is a resource location (recall 
\nSection 4.2.2) in the form of a URI [28]. A URI can be a link to a \nproduct 
server, to a web site with the actual data, or to some \nexternal data 
providing system. This directly supports \nheterogeneity, location 
transparency, and autonomy of data \nproviders in OODT.  \n\nAnother novel 
aspect of OODT\u2019s architecture is that if a \nprofile server is unable to 
service the query, or if it believes that \n\nother profile servers it is aware 
of may contain relevant metadata, \nit will return the URIs of those profile 
servers; the query server \nmay then forward the query to them. As a result, 
query servers are \ncompletely decoupled from product servers (and from any 
\n\u201cexposed\u201d external data sources), and are also decoupled from 
\nmost of the profile servers. In turn, this lessens the complexity of 
\nimplementing, integrating, and evolving query servers. Once the \nresource 
metadata is returned, the query server will either allow \nthe user herself to 
use the supplied URIs to find the data in which \nshe was interested 
(interactive mode), or it will retrieve, package, \nand deliver the data to the 
user (non-interactive mode). As with \nthe product and profile servers, query 
servers can be centrally \nlocated at a single site, or they can be 
decentralized across \nmultiple sites.   \n\nQuery Client components 
communicate with the query \nservers. The query client must provide a query 
server with a query \nthat identifies the data in which the user is interested, 
and it must \nset a mode for the query server (interactive or non-interactive 
\nmode). The query client may know the location of the query \nserver that it 
wants to contact, or it may rely on the messaging \nlayer connector to route 
its queries to one or more query servers.   \n\n4.2.4 Catalog and Archive 
Server and Client \nThe Catalog and Archive Server (CAS) component in OODT 
\n\nis responsible for providing a common mechanism for ingestion \nof data 
into a data store, including any processing required as a \nresult of 
ingestion. For instance, prior to the ingestion of a poor-\nresolution image of 
Mars, the image may need to be refined and \nthe resolution improved. CAS would 
handle this type of \nprocessing. Any data ingested into CAS must include 
associated \nmetadata information so that the data can be cataloged for search 
\nand retrieval purposes. Upon ingestion, the data is sent to a data \nstore 
for preservation, and the corresponding metadata is sent to \nthe associated 
catalog. The data store and catalog need not be \nlocated on the same host; 
they may be located on remote sites \nprovided there is an access mechanism to 
store and retrieve data \nfrom each. The goal of CAS is to streamline and 
standardize the \nprocess of adding data to an OODT-aware system.  Note that a 
\nsystem whose data stores were populated prior to its integration \ninto OODT 
can still use CAS for its new data.  Since the CAS \ncomponent populates data 
stores and catalogs with both data and \nmetadata, specialized product and 
profile server components have \nbeen developed to serve data and metadata from 
the CAS backend \ndata stores and catalogs more efficiently. Any older data can 
still \nbe served with existing product and profile servers. \n\nThe Archive 
Client component communicates with CAS. The \narchive client must know the 
location of the CAS component, and \nmust provide it with data to ingest. Many 
archive clients can \ncommunicate with a single CAS component, and vice versa.  
Both \nthe archive client and CAS components are completely \nindependent of 
the preceding three pairs of component types in \nthe OODT reference 
architecture. \n\n4.3 OODT Connectors \n4.3.1 Handler Connectors \n\nHandler 
connectors are responsible for enabling the \ninteraction between OODT\u2019s 
components and third-party data \nstores.  A handler connector performs the 
transformation between \nan underlying (meta-)data store\u2019s internal API 
for retrieving data \nand its (meta-)data format on the one hand, and the OODT 
system \n\n\n\non the other. Each handler connector is typically developed for 
a \nclass of data stores and metadata systems. For example, for a \ngiven DBMS 
such as Oracle, and a given internal representation \nschema for metadata, a 
generic Oracle handler connector is \ntypically developed and then reused. 
Similarly, for a given \nfilesystem scheme for storing data, a generic 
filesystem handler \nconnector is developed and reused across like filesystem 
data \nstores.  \n\nEach profile server and product server relies on one or 
more \nhandler connectors. Profile servers use profile handlers, and \nproduct 
servers use query handlers. Handler connectors thereby \ncompletely insulate 
product and profile servers from the third-\nparty data stores.  Handlers also 
allow for different types of \ntransformations on (meta-)data to be introduced 
dynamically \nwithout any effect on the rest of OODT components. For \nexample, 
a product server that distributes Mars image data might \nbe serviced by a 
query handler connector that returns high-\nresolution (e.g., 10 GB) JPEG image 
files of the latest summit \nclimbed by a Mars rover; if the system ends up 
experiencing \nperformance problems, another handler may be (temporarily) 
\nadded to return lower-resolution (e.g., 1 MB) JPEG image files of \nthe same 
scenario. Likewise, a profile server may have two \nprofile handler connectors, 
one that returns image-quality \nmetadata (e.g., resolution and bits/pixel) and 
another that returns \ninstrument metadata about Mars rover images (e.g., 
instrument \nname or image creation date). \n\n4.3.2 Messaging Layer Connector 
\nThe Messaging Layer connector is responsible for \n\nmarshalling data and 
metadata between components in an OODT \nsystem. The messaging layer must keep 
track of the locations of \nthe components, what types of components reside in 
which \nlocations, and if components are still running or not. Additionally, 
\nthe messaging layer is responsible for taking care of any needed \nsecurity 
mechanisms such as authentication against an LDAP \ndirectory service, or 
authorization of a user to perform certain \nrole-based actions. \n\nThe 
messaging layer in OODT provides synchronous \ninteraction among the 
components, and some delivery guarantees \non messages transferred between the 
software components. \nTypically in any large-scale data system, the 
asynchronous mode \nof interaction is not encouraged because partial data 
transfers are \nof no use to users such as scientists who need to make analysis 
on \nentire data sets. \n\nThe messaging layer supports communication between 
any \nnumber of connected OODT software components. In addition, \nthe 
messaging layer natively supports connections to other \nmessaging layer 
connectors as well.  This provides us with the \nability to extend and adapt an 
OODT system\u2019s architecture, as \nwell as easily tailor the architecture 
for any specific interaction \nneeds (e.g., by adding data encryption and/or 
compression \ncapabilities to the connector). \n\n5. EXPERIENCE AND CASE 
STUDIES \nThe OODT framework has been used both within and \n\noutside NASA. 
JPL, NASA\u2019s Ames Research Center, the \nNational Institutes of Health 
(NIH), the National Cancer Institute \n(NCI), several research universities, 
and U.S. Federally Funded \nResearch and Development Centers (FFRDCs) are all 
using \nOODT in some form or fashion. OODT is also available for \ndownload 
through a large open-source software distributor [29]. \n\nOODT components are 
found in planetary science, earth science, \nbiomedical, and clinical research 
projects. In this section, we \ndiscuss our experience with OODT in several 
representative \nprojects within these scientific areas. We compare and 
contrast \nhow the projects were handled before and after OODT. We sketch 
\nsome of the domain-specific technical challenges we encountered \nand 
identify how OODT helped to solve them. \n\nTo begin using OODT, a user designs 
a deployment \narchitecture from one or more of the reference OODT \ncomponents 
(e.g., product and profile servers), and the reference \nOODT connectors. The 
user must determine if any existing \nhandler connectors can be reused, or if 
specialized handler \nconnectors need to be developed. Once all the components 
are \nready, the user has two options for deploying her architecture to \nthe 
target hosts: (1) the user may translate her design into a \nspecialized OODT 
deployment descriptor XML file, which can \nthen be used to start each program 
on the target host(s); or (2) the \nuser can deploy her OODT architecture using 
a remote server \ncontrol component, adding components, and connectors via a 
\ngraphical user interface. The GUI allows the user to send \ncomponent and 
connector code to the target hosts, to start, shut-\ndown, and restart the 
components and connectors, and to monitor \ntheir health during execution. 
\n\n5.1 Planetary Data System \nOne of the flagship deployments of OODT has 
been for \n\nNASA\u2019s Planetary Data System (PDS) [30]. PDS consists of 
\nseven \u201cdiscipline nodes\u201d and an engineering and management \nnode. 
Each node resides at a different U.S. university or \ngovernment agency, and is 
managed autonomously.  \n\nFor many years PDS distributed its data and metadata 
on \nphysical media, primarily CD-ROM. Each CD-ROM was \nformatted a according 
to a \u201chome-grown\u201d directory layout \nstructure called an archive 
volume, which later was turned into a \nPDS standard. PDS metadata was 
constructed using a common, \nwell-structured set of 1200 metadata elements, 
such as Target \nName and Instrument Type, that were identified from the onset 
of \nthe PDS project by planetary scientists. Beginning in the late \n1990s the 
advent of the WWW and the increasing data volumes of \nmissions led NASA 
managers to impose a new paradigm for \ndistributing data to the users of the 
PDS: data and metadata were \nnow to be distributed electronically, via a 
single, unified web \nportal. The web portal and accompanying infrastructure to 
\ndistribute PDS data and metadata was built in 2001 using OODT \nin the manner 
depicted in Figure 1. \n\nWe faced several technical challenges deploying OODT 
to \nPDS. PDS data and metadata were highly distributed, spanning all \nseven 
of the scientific discipline nodes across the country. \nAlthough the entire 
data volume across PDS at the time was \naround 7 terabytes, it was estimated 
that the volume would grow \nto 10 terabytes by 2004. Consequently, the system 
needed to be \nscalable and respond to large growth spurts caused by new data 
\nproducing missions. The flexibility and modularity of the OODT \nproduct and 
profile server components were particularly useful in \nthis regard. Using a 
product and/or profile server, each new data \nproducing system in the PDS 
could be dynamically \u201cplugged in\u201d \nto the existing PDS 
infrastructure that we constructed, without \ndisturbing existing components 
and processes.  \n\nWe also faced the problem of heterogeneity. Almost every 
\nnode within PDS had a different operating system, ranging from \nLinux, to 
Windows, to Solaris, to Mac OS X.  Each node \n\n\n\nEDRN \nQuery 
\nServer\n\nm\nessaging layer (R\n\nM\nI)\n\nProduct \nServer\n\nDBMS 
\n(Specimen \nMetadata)\n\nmoffitt.usf.edu (win2k server)\n\nMS SQL DBMS 
\n(Specimen \nProducts)\n\nSpecimen \nQuery \n\nHandler\n\nSpecimen Profile 
\nHandler (MS SQL)\n\nOODT \u201cSandbox\u201d\n\nOODT 
\u201cSandbox\u201d\n\nProduct \nServer\n\nProfile 
\nServer\n\nanother.erne.server (AnotherOS)\n\nCAS Profile \nHandler\n\nCAS 
Query \nHandler\n\nOODT \u201cSandbox\u201d\nCatalog and \n\nArchive 
Server\n\nLung Images \n(Filesystem)\n\nOther 
\nApplications\n\nginger.fhcrc.org (win2k)\n\nOther Applications\n\nERNE Web 
\nPortal\n\n(Query Client)\n\nuser host\n\nProfile \nClient\n\nProduct 
\nClient\n\nProfile ServerOther \nApplications\n\nOther \nApplications\n\nOther 
Applications\n\nOther Applications\n\nSpecimen Inventory\n(MS SQL)\n\nOther 
Applications\n\nOther Applications\n\npds.jpl.nasa.gov (Linux)\nLegend:\n\nOODT 
\nComponent\n\nData/metadata \nstore\n\nOODT Connector Hardware \nhost\n\nOODT 
\ncontrolled \nportion of \nmachine\n\ndata/control flow\nBlack Box\n\n \n 
\n\nFigure 2. The Early Detection Research Network (EDRN) OODT Architecture 
Instantiation \n\nmaintained its own local catalog system. Although each node 
in \nPDS had different file system implementations dictated by their \nOS, each 
node stored their data and metadata according to the \narchive volume 
structure. Because of this, we were able to write a \nsingle, reusable PDS 
Query Handler which could serve back \nproducts from a PDS archive volume 
structure located on a file \nsystem. Plugging into each node\u2019s catalog 
system proved to be a \nsignificant challenge. For nearly all of the nodes, 
specialized \nprofile handler connectors were constructed to interface with the 
\nunderlying catalog systems, which ranged from static text files \ncalled PDS 
label files to dynamic web site inventory systems \nconstructed using Java 
Server Pages. Because each of the catalogs \ntagged PDS data using the common 
set of 1200 elements, we \nwere able to share much of the code base among the 
profile \nhandler connectors, ultimately only changing the portion of the 
\ncode that made the particular JSP page call, or read the selected \nset of 
metadata from the label file. The entire code base of the \nPDS including all 
the domain specific handler connectors is only \nslightly over 15 KSLOC, 
illustrating the high degree of \nreusability provided by the OODT framework. 
\n\n5.2 Early Detection Research Network \nOODT is also supporting the National 
Cancer Institute\u2019s \n\n(NCI) Early Detection Research Network (EDRN). EDRN 
is a \ndistributed research program that unites researchers from over \nthirty 
institutions across the United States. Tens of thousands of \nscientists 
participate in the EDRN. Each institution is focused on \nthe discovery of 
cancer biomarkers as indicators for disease [31]. \n\nA critical need for the 
EDRN is an electronic infrastructure to \nsupport discovery and validation of 
these markers.  \n\nIn 2001 we worked with the EDRN program to develop the 
\nfirst component of their electronic biomarker infrastructure called \nthe 
EDRN Resource Network Exchange (ERNE). The (partial) \ncorresponding 
architecture is depicted in Figure 2. One of the \nmajor goals of ERNE was to 
provide real-time access to bio-\nspecimen information across the institutions 
of the EDRN. Bio-\nspecimen information typically consisted of gigabytes of 
\nspecimen images, and location and contact metadata for obtaining \nthe 
specimen from its origin study institution. The previous \nmethod of obtaining 
bio-specimen information was very human-\nintensive: it involved phone calls 
and some forms of electronic \ncommunication such as email. Specimen 
information was not \nsearchable across institutions participating in the EDRN. 
The bio-\nspecimen catalogs were largely out-of-date, and out-of-synch with 
\ncurrent holdings at each participating institution.  \n\nOne of the initial 
technical challenges we faced with EDRN \nwas scale. The EDRN was over three 
times as large as the PDS. \nBecause of this we chose to target ten 
institutions initially, rather \nthan the entire set of thirty one. Again, 
OODT\u2019s modularity and \nscalability came into play as we could phase 
deployment at each \ndeployment institution. As we instantiated new product, 
profile, \nquery, and archive servers at each institution, we could do so 
\nwithout interrupting any existing OODT infrastructure already \ndeployed.  
\n\nAnother challenge that we encountered was dealing with \neach participating 
site\u2019s Institutional Review Board (IRB). An \nIRB is required to review 
and ensure compliance of projects with \n\n\n\nfederal laws related to working 
with data from research projects \ninvolving human subjects. To satisfy the 
IRB, any OODT \ncomponents deployed at an EDRN site had to provide an adequate 
\nsecurity capability in order to get approval to share the data \nexternally 
from an institution. OODT\u2019s separation of data and \nmetadata explicitly 
allowed us to satisfy this requirement. We \ndesigned ERNE so that each 
institution could remain in control of \ntheir specimen holding data by 
instantiating product server \ncomponents at each site, rather than 
distributing the information \nacross ERNE which would have violated the IRB 
agreements.  \n\nAnother significant challenge we faced in developing ERNE 
\nwas lack of a consistent metadata model for each ERNE site. We \nwere forced 
to develop a common specimen metadata model and \nthen to create specific 
mappings to link each local site to the \ncommon model. OODT aided us once 
again in this endeavor as \nthe common mappings we developed were easily 
codified into a \nquery handler connector, and reused across each ERNE site.  
\n\nThe entire code base of ERNE, including all its specialized \nhandler 
connectors is only slightly over 5.3 KSLOC, highlighting \nthe high degree of 
reusability of the shared framework code base \nand the handler code base. \n\n 
\n\n5.3 Science Processing Systems \nOODT has also been deployed in several 
science processing \n\nsystem missions both, operational and under development. 
Due to \nspace limitations, we can only briefly summarize each of the \nOODT 
deployments in these systems.  \n\nSeaWinds, a NASA-funded earth science 
instrument flying \non the Japanese ADEOS-II spacecraft, used the OODT CAS 
\ncomponent as a workflow and processing component for its \nProcessing and 
Analysis Center (SeaPAC). SeaWinds produced \nseveral gigabytes of data during 
its six year mission. CAS was \nused to control the execution and data flow of 
mission-specific \ndata processor components, which calibrated and created 
derived \ndata products from raw instrument data, and archived those \nproducts 
for distribution into the data store managed by CAS. A \nmajor challenge we 
faced during the development of SeaPAC was \nthat  the processor components 
were developed by a group \noutside of the SeaWinds project. We had to provide 
a mechanism \nfor integrating their source code into the OODT SeaPAC 
\nframework. OODT\u2019s separation of concerns allowed us to address \nthis 
issue with relative ease: once the data processors were \nfinished, we were 
able wrap and tailor them internally within \nCAS, without disturbing the 
existing SeaPaC infrastructure. \n\nThe success of the CAS within SeaWinds led 
to its reuse on \nseveral different missions. Another earth science mission 
called \nQuikSCAT retrofitted and replaced some of their existing \nprocessing 
components with CAS, using the SeaWinds experience \nas an example. The 
Orbiting Carbon Observatory (OCO) mission \nthat will fly in 2009, and that is 
currently under development, is \nalso utilizing CAS to ingest and process 
existing FTS CO2 \nspectrometer data from earth-based instruments. The James 
Web \nTelescope (JWT) is using the CAS for to implement its workflow \nand 
processing capabilities for astrophysics data and metadata. \nEach of these 
science processing systems will face similar \ntechnical challenges, including 
separation of concerns between \nthe actual processing framework and the 
developers writing the \nprocessor code, the volume of data that must be 
handled by the \nprocessing system (OCO is projected to produce over 150 
\nterabytes), and the flexibility and tailorability of the workflow \n\nneeded 
to process the data. We believe that OODT is uniquely \npositioned to address 
these difficult challenges. \n\n5.4 Computer Modeling Simulation and 
\nVisualization \n\nOODT has also been deployed to aid the Computer \nModeling 
Simulation and Visualization (CMSV) community at \nJPL, by linking together 
several institutional model repositories \nacross the organizations within the 
lab, and creating a web portal \ninterface to query the integrated model 
repositories. We \ndeveloped specialized profile server components that locate 
and \nlink to different model resources across JPL, such as power \nsubsystem 
models of the Mars Exploration Rovers (MER), CAD-\ndrawing models of different 
spacecraft assembly parts, and \nsystems architecture models for engineering 
and design of \nspacecraft. Each of these different model types lived in 
separate \nindependent repositories across JPL. For instance, the CAD \nmodels 
were stored in a commercial product called TeamCenter \nEnterprise [32], while 
the power and systems architecture models \nwere stored in a commercial product 
called Xerox Docushare \n[33].  \n\nTo integrate these model repositories for 
CMSV, we had to \nderive a common set of metadata across the wide spectrum of 
\ndifferent model types that existed at JPL. OODT\u2019s separation of \ndata 
from metadata allowed us to rapidly instantiate our common \nmetadata model 
once we developed it, by constructing specialized \nprofile handler connectors 
that mapped each repository\u2019s local \nmodel to the common model. 
Reusability levels were high across \nthe connectors, resulting in an extremely 
small code base of 2.57 \nKSLOC.  \n\nAnother challenge in light of this 
mapping activity was \ninterfacing with the APIs of the underlying model 
repositories. In \nthe above two cases, the APIs were commercial products, and 
\npoorly documented. In some cases, such as the Docushare \nrepository, the 
APIs did not fully conform to their stated \nspecifications. The division of 
labor amongst OODT components \ncame into play on this task. It allowed us to 
focus on deploying \nthe rest of the OODT supporting infrastructure, such as 
the web \nportal, and the profile handler connectors, and not getting stalled 
\nwaiting for the support teams from each of the commercial \nvendors to debug 
our API problems. Once the OODT CMSV \ninfrastructure was deployed, the 
modeling and simulation \ncommunity at JPL immediately began adopting it and 
sharing \ntheir models across the lab. During the past year, the system has 
\nreceived around 40,000 hits on the web portal, and over 9,000 \nqueries for 
models. \n\n6. CONCLUSIONS \nWhen the need arose at NASA seven years ago for a 
data \n\ndistribution and management solution that satisfied the formidable 
\nrequirements outlined in this paper, it was not clear to us initially \nhow 
to approach the problem.  On the surface, several applicable \nsolutions 
already existed (middleware, information integration \nsystems, and the 
emerging grid technologies).  Adopting one of \nthem seemed to be a preferable 
path because it would have saved \nus precious time.  However, upon closer 
inspection we realized \nthat each of these options could be instructive, but 
that none of \nthem solved the problem we were facing (and that even some of 
\nthese technologies themselves were facing). \n\nThe observation that directly 
inspired OODT was that we \nwere dealing with software engineering challenges, 
and that those \n\n\n\nchallenges naturally required a software engineering 
solution.  \nOODT is a large, complex, dynamic system, distributed across 
\nmany sites, servicing many different users, and classes of users, \nwith 
large amounts of heterogeneous data, possibly spanning \nmultiple domains. 
Software engineering research and practice \nboth suggest that success in 
developing such a system will be \ndetermined to a large extent by the 
system\u2019s software \narchitecture.  It therefore became imperative that we 
rely on our \nexperience within the domain of data-intensive systems (e.g., 
\nJPL\u2019s PDS project), as well as our study of related research and 
\npractice, in order to develop an architecture for OODT that will \naddress 
the challenges we discussed in Section 2.  Once the \narchitecture was designed 
and evaluated, OODT\u2019s initial \nimplementation and its subsequent 
adaptations followed naturally. \n\nAs OODT\u2019s developers we are heartened, 
but as software \nengineering researchers and practitioners disappointed, that 
\nOODT still appears to be the only system of its kind. The \nintersection of 
middleware, information management, and grid \ncomputing is rapidly growing, 
yet it is still characterized by one-\noff solutions targeted at very specific 
problems in specific \ndomains. Unfortunately, these solutions are sometimes 
clever by \naccident and more frequently little more than \u201chacks\u201d.  
We \nbelieve that OODT\u2019s approach is more appropriate, more \neffective, 
more broadly applicable, and certainly more helpful to \ndevelopers of future 
systems in this area.  We consider OODT\u2019s \ndemonstrated ability to evolve 
and its applicability in a growing \nnumber of science domains to be a 
testament to its explicit, \ncarefully crafted software architecture. \n\n7. 
ACKNOWLEDGEMENTS \nThis material is based upon work supported by the Jet 
\n\nPropulsion Laboratory, managed by the California Institute of \nTechnology. 
Effort also supported by the National Science \nFoundation under Grant Numbers 
CCR-9985441 and ITR-\n0312780.  \n\n8. REFERENCES \n[1] A. Chervenak, I. 
Foster, et al., \"The Data Grid: Towards an \n\nArchitecture for the 
Distributed Management and Analysis of \nLarge Scientific Data Sets,\" J. of 
Network and Computer \nApplications, vol. 23, pp. 187-200, 2000. \n\n[2] N. 
Medvidovic and R. N. Taylor, \"A Classification and \nComparison Framework for 
Software Architecture Description \nLanguages,\" IEEE TSE, vol. 26, pp. 70-93, 
2000. \n\n[3] D. E. Perry and A. L. Wolf, \"Foundations for the Study of 
\nSoftware Architecture,\" Software Engineering Notes (SEN), \nvol. 17, pp. 
40-52, 1992. \n\n[4] \"The Globus Alliance (http://www.globus.org),\" 2005. 
\n[5] \"Webservices.org (http://www.webservices.org),\" 2005. \n[6] A. Luther, 
R. Buyya, et al., \"Alchemi: A .NET-based \n\nEnterprise Grid Computing 
System,\" in Proc. of 6th \nInternational Conference on Internet Computing, Las 
Vegas, \nNV, USA, 2005. \n\n[7] \"JCGrid Web Site 
(http://jcgrid.sourceforge.net),\" 2005. \n[8] \"LHC Computing Grid 
(http://lcg.web.cern.ch/LCG/),\" 2005. \n[9] D. Bernholdt, S. Bharathi, et al., 
\"The Earth System Grid: \n\nSupporting the Next Generation of Climate Modeling 
\nResearch,\" Proceedings of the IEEE, vol. 93, pp. 485-495, \n2005. \n\n[10] 
A. Finkelstein, C. Gryce, et al., \"Relating Requirements and \nArchitectures: 
A Study of Data Grids,\" J. of Grid Computing, \nvol. 2, pp. 207-222, 2004. 
\n\n[11] C. A. Mattmann, N. Medvidovic, et al., \"Unlocking the Grid,\" \nin 
Proc. of CBSE, St. Louis, MO, pp. 322-336, 2005. \n\n[12] J. Hammer, H. 
Garcia-Molina, et al., \"Information translation, \nmediation, and mosaic-based 
browsing in the tsimmis system,\" \nin Proc. of ACM SIGMOD International 
Conference on \nManagement of Data, San Jose, CA, pp. 483-487, 1995. \n\n[13] 
T. Kirk, A. Y. Levy, et al., \"The information manifold,\" \nWorking Notes of 
the AAAI Spring Symposium on Information \nGathering in Heterogeneous, 
Distributed Environment, Menlo \nPark, CA, Technical Report SS-95-08, 1995. 
\n\n[14] O. Etzioni and D. S. Weld, \"A softbot-based interface to the 
\nInternet,\" CACM, vol. 37, pp. 72-76, 1994. \n\n[15] A. Go\u00f1i, A. 
Illarramendi, et al., \"An optimal cache for a \nfederated database system,\" 
Journal of Intelligent Information \nSystems, vol. 9, pp. 125-155, 1997. 
\n\n[16] M. R. Genesereth, A. Keller, et al., \"Infomaster: An \ninformation 
integration system,\" in Proc. of ACM SIGMOD \nInternational Conference on 
Management of Data, Tucson, \nAZ, pp. 539-542, 1997. \n\n[17] A. Tomasic, L. 
Raschid, et al., \"A data model and query \nprocessing techniques for scaling 
access to distributed \nheterogeneous databases in disco,\" IEEE Transactions 
on \nComputers, 1997. \n\n[18] Y. Arens, C. A. Knoblock, et al., \"Query 
Reformulation for \nDynamic Information Integration,\" Journal of Intelligent 
\nInformation Systems, vol. 6, pp. 99-130, 1996. \n\n[19] J. Ambite, N. Ashish, 
et al., \"Ariadne: A system for \nconstructing mediators for internet 
sources,\" in Proc. of ACM \nSIGMOD International Conference on Management of 
Data, \nSeattle, WA, pp. 561-563, 1998. \n\n[20] G. Barish and C. A. Knoblock, 
\"An Expressive and Efficient \nLanguage for Information Gathering on the 
Web,\" in Proc. of \n6th International Conference on AI Planning and Scheduling 
\n(AIPS-2002) Workshop, Toulouse, France, 2002. \n\n[21] A. Y. Halevy, 
\"Answering queries using views: A survey,\" \nVLDB Journal, vol. 10, pp. 
270-294, 2001. \n\n[22] J. L. Ambite, C. A. Knoblock, et al., \"Compiling 
Source \nDescriptions for Efficient and Flexible Information \nIntegration,\" 
Information Systems Journal, vol. 16, pp. 149-\n187, 2001. \n\n[23] E. 
Lambrecht and S. Kambhampati, \"Planning for Information \nGathering:  A 
Tutorial Survey,\" ASU CSE Technical Report \n96-017, May 1997. \n\n[24] 
\"Enterprise Java Beans (http://java.sun.com/ejb),\" pp. 2005. \n[25] \"Java 
RMI (http://java.sun.com/rmi/),\" 2005. \n[26] C. A. Mattmann, S. Malek, et 
al., \"GLIDE:  A Grid-based \n\nLightweight Infrastructure for Data-intensive 
Environments,\" \nin Proc. of European Grid Conference, Amsterdam, the 
\nNetherlands, pp. 68-77, 2005. \n\n[27] DCMI, \"Dublin Core Metadata Element 
Set,\" 1999. \n[28] T. Berners-Lee, R. Fielding, et al., \"Uniform Resource 
\n\nIdentifiers (URI): Generic Syntax,\" 1998. \n[29] \"Open Channel 
Foundation: Request Object Oriented Data \n\nTechnology (OODT) - 
\n(http://openchannelsoftware.com/orders/index.php?group_id=3\n32),\" 2005. 
\n\n[30] J. S. Hughes and S. K. McMahon, \"The Planetary Data System. \nA Case 
Study in the Development and Management of Meta-\nData for a Scientific Digital 
Library.,\" in Proc. of ECDL, pp. \n335-350, 1998. \n\n[31] S. Srivastava, 
Informatics in proteomics. Boca Raton, FL: \nTaylor & Francis/CRC Press, 2005. 
\n\n[32] \"UGS Products: TeamCenter 
\n(http://www.ugs.com/products/teamcenter/),\" 2005. \n\n[33] \"Document 
Management | Xerox Docushare \n(http://docushare.xerox.com/ds/),\" 2005. \n\n 
\n\n\n\n\n\n\n\n\n\n\n\n\n\tINTRODUCTION\n\tSOFTWARE ENGINEERING 
CHALLENGES\n\tBACKGROUND AND RELATED WORK\n\tOODT ARCHITECTURE\n\tGuiding 
Principles\n\tOODT Components\n\tProduct Server and Product Client\n\tProfile 
Server and Profile Client\n\tQuery Server and Query Client\n\tCatalog and 
Archive Server and Client\n\n\tOODT Connectors\n\tHandler 
Connectors\n\tMessaging Layer Connector\n\n\n\tEXPERIENCE AND CASE 
STUDIES\n\tPlanetary Data System\n\tEarly Detection Research Network\n\tScience 
Processing Systems\n\tComputer Modeling Simulation and 
Visualization\n\n\tCONCLUSIONS\n\tACKNOWLEDGEMENTS\n\tREFERENCES\n\n",
        "X-TIKA:parse_time_millis": "11123",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "created": "Wed Feb 15 13:13:58 PST 2006",
        "creator": "End User Computing Services",
        "date": "2006-02-15T21:16:01Z",
        "dc:creator": "End User Computing Services",
        "dc:format": "application/pdf; version=1.4",
        "dc:title": "Proceedings Template - WORD",
        "dcterms:created": "2006-02-15T21:13:58Z",
        "dcterms:modified": "2006-02-15T21:16:01Z",
        "grobid:header_Abstract": "Modern scientific research is increasingly 
conducted by virtual communities of scientists distributed around the world. 
The data volumes created by these communities are extremely large, and growing 
rapidly. The management of the resulting highly distributed, virtual data 
systems is a complex task, characterized by a number of formidable technical 
challenges, many of which are of a software engineering nature. In this paper 
we describe our experience over the past seven years in constructing and 
deploying OODT, a software framework that supports large, distributed, virtual 
scientific communities. We outline the key software engineering challenges that 
we faced, and addressed, along the way. We argue that a major contributor to 
the success of OODT was its explicit focus on software architecture. We 
describe several large-scale, real-world deployments of OODT, and the manner in 
which OODT helped us to address the domain-specific challenges induced by each 
deployment.",
        "grobid:header_AbstractHeader": "ABSTRACT",
        "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 
90089, USA",
        "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California 
Institute of Technology ; 2 Computer Science Department University of Southern 
California",
        "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 
Nenad Medvidovic 2 Steve Hughes 1",
        "grobid:header_BeginPage": "-1",
        "grobid:header_Class": "class org.grobid.core.data.BiblioItem",
        "grobid:header_Email": 
"{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu",
        "grobid:header_EndPage": "-1",
        "grobid:header_Error": "true",
        "grobid:header_FirstAuthorSurname": "Mattmann",
        "grobid:header_FullAffiliations": "[Affiliation{name='null', 
url='null', institutions=[California Institute of Technology], 
departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', 
postCode='91109', postBox='null', region='CA', settlement='Pasadena', 
addrLine='null', marker='1', addressString='null', affiliationString='null', 
failAffiliation=false}, Affiliation{name='null', url='null', 
institutions=[University of Southern California], departments=[Computer Science 
Department], laboratories=null, country='USA', postCode='90089', 
postBox='null', region='CA', settlement='Los Angeles', addrLine='null', 
marker='2', addressString='null', affiliationString='null', 
failAffiliation=false}]",
        "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, 
Nenad Medvidovic, Steve Hughes]",
        "grobid:header_Item": "-1",
        "grobid:header_Keyword": "Categories and Subject Descriptors D2 
Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data 
Management, Software Architecture",
        "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain 
Specific Architectures  (type:subject-headers), Keywords  
(type:subject-headers), OODT, Data Management, Software Architecture  
(type:subject-headers)]",
        "grobid:header_Language": "en",
        "grobid:header_NbPages": "-1",
        "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. 
Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
        "grobid:header_Title": "A Software Architecture-Based Framework for 
Highly Distributed and Data Intensive Scientific Applications",
        "meta:author": "End User Computing Services",
        "meta:creation-date": "2006-02-15T21:13:58Z",
        "meta:save-date": "2006-02-15T21:16:01Z",
        "modified": "2006-02-15T21:16:01Z",
        "pdf:PDFVersion": "1.4",
        "pdf:encrypted": "false",
        "producer": "Acrobat Distiller 6.0 (Windows)",
        "resourceName": "ICSE06.pdf",
        "title": "Proceedings Template - WORD",
        "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
        "xmpTPg:NPages": "10"
    }
]
{noformat}
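
h2. Calling /rmeta from Java

For anyone who wants to script against the server instead of using cURL, here's a minimal plain-JDK sketch that PUTs the same PDF to /rmeta and prints the JSON array shown above. The class name is a placeholder, port 9998 is tika-server's default, and it needs Java 9+ for readAllBytes:

{noformat}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: PUT a PDF to tika-server's /rmeta endpoint and print the raw
// JSON array it returns (same shape as the output pasted above).
public class RmetaClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9998/rmeta");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // tika-server reads the original file name from this header
        conn.setRequestProperty("Content-Disposition",
                "attachment; filename=ICSE06.pdf");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(Paths.get("ICSE06.pdf"), out); // stream the PDF as the request body
        }
        try (InputStream in = conn.getInputStream()) {
            System.out.write(in.readAllBytes()); // raw JSON; pretty-print downstream if needed
            System.out.flush();
        }
        conn.disconnect();
    }
}
{noformat}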

Great work, [~sujenshah]. I'm going to commit this now and start work on the 
Wiki page!
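
h2. Embedded usage sketch (candidate for the Wiki page)

For the Wiki page, a first cut at the embedded (no-server) usage could look like the sketch below. It assumes the GROBID journal parser from this ticket is on the classpath as org.apache.tika.parser.journal.JournalParser, along with the GROBID libs and resources; the class name and input file are placeholders:

{noformat}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.journal.JournalParser;
import org.apache.tika.sax.BodyContentHandler;

// Sketch: run the GROBID-backed journal parser in-process and print the
// grobid:* metadata keys, i.e. the same fields visible in the JSON above.
public class JournalParserDemo {
    public static void main(String[] args) throws Exception {
        JournalParser parser = new JournalParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("ICSE06.pdf"))) {
            // -1 lifts BodyContentHandler's default write limit so long papers aren't truncated
            parser.parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            if (name.startsWith("grobid:")) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }
}
{noformat}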


> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
> library for extracting, parsing and re-structuring raw documents such as PDF 
> into structured TEI-encoded documents with a particular focus on technical 
> and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and 
> help extract extra metadata about the paper, such as authors, publication, 
> citations, etc. 
> It would be nice to have this integrated into Tika. I have tried it locally 
> and will issue a pull request soon.


