Re: Running operations over data

Cameron Goodale Wed, 26 Feb 2014 22:33:06 -0800

Hey Tom,

TLDR - Crawler ships with some actions, but you can also write your own, and
those actions can be wired into PreIngestion or PostIngestion.  FileManager
has MetExtractors that run before ingestion.  They are traditionally meant to
extract metadata (as the name implies), but you could just as easily have one
run a checksum and store it in the metadata, or convert an incoming file to
PDF and then ingest the PDF.
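
To make the checksum idea concrete, here is a minimal sketch in Python.  It
is not the OODT MetExtractor API (that is Java); the function name and the
metadata keys ("Filename", "ChecksumAlgorithm", "Checksum") are made up for
illustration.  It only shows the shape of the step: read the file, compute a
digest, and hand back key/value metadata to be stored alongside the product.

```python
import hashlib
from pathlib import Path


def extract_checksum_metadata(path, algorithm="sha256", chunk_size=65536):
    """Return a metadata dict holding a checksum of the file at `path`.

    Hypothetical stand-in for a metadata extractor; the key names are
    illustrative and not taken from OODT.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large satellite images are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "Filename": Path(path).name,
        "ChecksumAlgorithm": algorithm,
        "Checksum": digest.hexdigest(),
    }
```

In a real deployment the equivalent logic would live in whatever extractor
you configure to run before ingestion, and the returned keys would end up in
the catalog with the file.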

On the Snow Data System here at JPL we have a lights-out operation that
might be of interest, so I will try to explain it below.

1.  Every hour OODT PushPull wakes up and tries to download new data from a
Near Real Time satellite imagery service via FTP
(http://lance-modis.eosdis.nasa.gov/).
2.  Every 20 minutes OODT Crawler wakes up and crawls the local file staging
area where PushPull downloads the satellite images.
3.  When the crawler encounters files that have been downloaded and are
ready for ingestion, things get interesting.  During the crawl several
preconditions need to be met (the file cannot already be in the catalog,
which guards against duplicates; the file has to be of the correct
mime-type; etc.).
4.  If the preconditions pass, Crawler ingests the file(s) into OODT
FileManager, but things don't stop there.
5.  Crawler has a post-ingest success hook that we leverage: we use the
"TriggerPostIngestWorkflow" action, which automatically submits an event to
the Workflow Manager.
6.  OODT Workflow Manager receives the event (in this example it would be
"MOD09GANRTIngest") and boils it down into tasks to run.
7.  Workflow Manager then sends these tasks to the OODT Resource Manager,
which farms the jobs out to Batchstubs running across 4 different
machines.
8.  When the jobs complete, Crawler ingests the final outputs back into
the FileManager.
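
If it helps to see steps 3 to 5 as code, here is a rough sketch in Python.
None of this is the actual OODT API (OODT is Java and is wired up through
configuration); the Catalog class, the function names, and the ".hdf"
extension check standing in for a mime-type test are all assumptions made
for illustration.  Only the event name "MOD09GANRTIngest" comes from our
setup.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy stand-in for the FileManager catalog: just a set of file names."""
    ingested: set = field(default_factory=set)


def not_in_catalog(name, catalog):
    # Precondition: guard against ingesting duplicates.
    return name not in catalog.ingested


def has_expected_type(name, catalog):
    # Precondition: crude extension check standing in for a mime-type test.
    return name.endswith(".hdf")


def trigger_post_ingest_workflow(name):
    # Post-ingest action: build the event that would go to the workflow side.
    return {"event": "MOD09GANRTIngest", "file": name}


def crawl(staged_files, catalog, preconditions, post_ingest_actions):
    """Ingest every staged file that passes all preconditions.

    Returns the list of events produced by the post-ingest actions.
    """
    events = []
    for name in staged_files:
        if not all(check(name, catalog) for check in preconditions):
            continue  # skipped: duplicate or wrong type
        catalog.ingested.add(name)  # the "ingest" step
        for action in post_ingest_actions:
            events.append(action(name))
    return events
```

Calling crawl() with [not_in_catalog, has_expected_type] as preconditions
and [trigger_post_ingest_workflow] as the post-ingest actions mirrors the
flow above: files already in the catalog or of the wrong type are skipped,
and each newly ingested file produces one workflow event.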

Hope that helps.

Best Regards,

Cameron

On Tue, Feb 25, 2014 at 1:47 PM, Tom Barber <[email protected]> wrote:

>  Hello folks,
>
> Preparing for this talk, so I figure I should probably work out how OODT
> works..... ;)
>
> Anyway I have some ideas as to how to integrate some more non-science-like
> tools into OODT, but I'm still figuring out some of the components, namely
> workflows.
>
> If for example, in OODT world I wanted to ingest a bunch of data and
> perform some operation on them, does this happen during the ingest phase,
> or post ingest?
>
> Normally you guys would write some crazy scientific stuff I guess to
> analyse the data you're ingesting and then dump it in some different format
> into the catalog, does that sound about right?
>
> Thanks
>
> Tom
> --
> *Tom Barber* | Technical Director
>
> meteorite bi
> *T:* +44 20 8133 3730
> *W:* www.meteorite.bi | *Skype:* meteorite.consulting
> *A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG,
> UK
>

-- 

Sent from a Tin Can attached to a String
