On Wed, 11 Mar 2015, davesgonechina wrote:
Hi John,
Good question - we're taking in XLS, CSV, JSON, XML, and on a bad day PDF
of varying file sizes, each requiring different transformation and audit
strategies, on both regular and irregular schedules. New batches often
feature schema changes that require modifying our ingest procedures. We're
trying to automate those as much as possible, but they obviously still
need a human chaperone.
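As a rough illustration of flagging a schema change before a batch reaches
the ingest procedure, here is a minimal Python sketch for the CSV case
only. The function names (schema_diff, needs_review) and the column names
in the usage note are hypothetical, not part of any existing pipeline. It
assumes the first row of each file is a header row.

```python
import csv
import io


def schema_diff(expected_columns, csv_text):
    """Compare a batch's CSV header row with the columns the ingest expects.

    Returns the columns that appeared and the columns that went missing.
    Column order is ignored; only the set of names is compared.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    expected = set(expected_columns)
    found = set(header)
    return {
        "added": sorted(found - expected),
        "missing": sorted(expected - found),
    }


def needs_review(diff):
    """True when the header differs from what the ingest expects,
    i.e. when a human should look at the batch before it is loaded."""
    return bool(diff["added"] or diff["missing"])
```

Calling schema_diff(["isbn", "title"], batch_text) on a batch whose header
gained a "publisher" column would report it under "added", and
needs_review() would then return True so the batch can be held for a
person to check.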
Mediawiki is our default choice at the moment, but then I would still be
looking for a good workflow management model for the structure of the wiki,
especially since in my experience wikis are often a graveyard for the best
intentions.
A few places that you might try asking this question again, to see if you
can find a solution that better answers your question:
The American Society for Information Science & Technology's Research Data
Access & Preservation group. It has a lot of librarians & archivists in
it, as well as people from various research disciplines:
http://mail.asis.org/mailman/listinfo/rdap
http://www.asis.org/rdap/
...
The Research Data Alliance has a number of groups that might be relevant.
Here are a few that I suspect are the best fit:
Libraries for Research Data IG
https://rd-alliance.org/groups/libraries-research-data.html
Reproducibility IG
https://rd-alliance.org/groups/reproducibility-ig.html
Research Data Provenance IG
https://rd-alliance.org/groups/research-data-provenance.html
Data Citation WG
(as this fits into their 'dynamic data' problem)
https://rd-alliance.org/groups/data-citation-wg.html
('IG' stands for 'Interest Group'; these are long-lived. 'WG' stands for
'Working Group'; these are formed to solve a specific problem and then
disband.)
The group 'Publishing Data Workflows' might seem to be appropriate but
it's actually 'Workflows for Publishing Data' not 'Publishing of Data
Workflows' (which falls under 'Data Provenance' and 'Data Citation')
There was a presentation at the meeting earlier this week by Andreas
Rauber in the Data Citation group on workflows using git or SQL databases
to be able to track appending or modification for CSV and similar ASCII
files.
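To make the append-versus-modify distinction concrete, here is a minimal
Python sketch. It is not the method from the presentation, whose details
are not given here; it is only an assumption about one simple way to tell
the two cases apart, by hashing each line of two versions of a file. The
function names are hypothetical.

```python
import hashlib


def row_hashes(text):
    """Return one SHA-256 hex digest per line of the file's text."""
    return [
        hashlib.sha256(line.encode("utf-8")).hexdigest()
        for line in text.splitlines()
    ]


def classify_change(old_text, new_text):
    """Classify how new_text differs from old_text.

    'unchanged' - every line is identical
    'appended'  - all old lines are intact and new lines follow them
    'modified'  - anything else (edited, deleted, or reordered lines)
    """
    old = row_hashes(old_text)
    new = row_hashes(new_text)
    if new == old:
        return "unchanged"
    if len(new) > len(old) and new[:len(old)] == old:
        return "appended"
    return "modified"
```

Under this sketch, a file that only gained rows at the end is reported as
"appended", while any edit to an existing row is reported as "modified".
A version-control system such as git records the same information as
diffs between commits.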
...
Also, I would consider this to be on-topic for Stack Exchange's "Open
Data" site (and I'm one of the moderators for the site):
http://opendata.stackexchange.com/
-Joe
On Tue, Mar 10, 2015 at 8:10 PM, Scancella, John <j...@loc.gov> wrote:
Dave,
How are you getting the metadata streams? Are they actual stream objects,
or files, or database dumps, etc?
As for the tools, I have used a number of the ones you listed below. I
personally prefer JIRA (and it is free for non-profits). If you are OK
with editing in wiki syntax, I would recommend MediaWiki (it is what
powers Wikipedia). You could also take a look at continuous deployment
technologies like virtual machines (VirtualBox), Linux containers
(Docker), and rapid deployment tools (Ansible, Salt). Of course, if you
are doing lots of code changes you will want to test all of this
continually (Jenkins).
John Scancella
Library of Congress, OSI
-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
davesgonechina
Sent: Tuesday, March 10, 2015 6:05 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Data Lifecycle Tracking & Documentation Tools
Hi all,
One of my projects involves harvesting, cleaning and transforming steady
streams of metadata from numerous publishers. It's an infinite loop but
every cycle can be a little bit or significantly different. Many issue
tracking tools are designed for a linear progression that ends in
deployment, not a circular workflow, and I've not hit upon a tool or use
strategy that really fits.
The best illustration I've found so far of the type of workflow I'm
talking about is the DCC Curation Lifecycle Model
<http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf>.
Here are some things I've tried or thought about trying:
- Git comments
- Github Issues
- MySQL comments
- Bash script logs
- JIRA
- Trac
- Trello
- Wiki
- Unfuddle
- Redmine
- Zendesk
- Request Tracker
- Basecamp
- Asana
Thoughts?
Dave