Hi Folks,

I've been watching this thread with interest.

I think the "deposit by reference" issue is an important one to pull out.
We have always said that SWORD can support deposit by reference, using,
for example, an ORE resource map as the deposit package, so that the
server can dereference the content in its own time.

In addition, during work on DataFlow it has become apparent that there are
some situations where a server will accept deposit of some content, but not
unpack it until some time later.

Combined, these two Big Content approaches raise some questions for me over
the current SWORD 2.0 spec (which was developed prior to our work on data
deposit):

1/ How do you really specify "deposit by reference" in SWORD?  Is it enough
that the "Package" format be one which is known to be by-reference (e.g.
ORE), or do we need semantics which say "here is a package, but that
package also contains references to resources that you may
(optionally/mandatorily) dereference"?
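To make (1) concrete, here is a rough sketch, in Python with only the
standard library, of what a by-reference manifest might look like and how a
server could pull the references back out. The URLs and the packaging URI
mentioned in the comment are purely illustrative, not registered SWORD
identifiers; the "aggregates" link relation is borrowed from the ORE
vocabulary.

```python
# Sketch (not from the spec): a minimal Atom entry acting as an ORE-style
# resource map. Each "aggregates" link points at content hosted elsewhere;
# only this manifest is physically deposited. URLs are made up.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

entry = ET.Element("{%s}entry" % ATOM)
ET.SubElement(entry, "{%s}title" % ATOM).text = "Deposit by reference"

for url in ["http://example.org/data/part1.tar.gz",
            "http://example.org/data/part2.tar.gz"]:
    ET.SubElement(entry, "{%s}link" % ATOM,
                  {"rel": ORE_AGGREGATES, "href": url})

# A client might POST this entry with a Packaging header naming some
# by-reference format (a URI like ".../package/OREResourceMap" - hypothetical).
manifest = ET.tostring(entry, encoding="unicode")

# Server side: recover the references to dereference in its own time.
refs = [link.get("href")
        for link in entry.findall("{%s}link" % ATOM)
        if link.get("rel") == ORE_AGGREGATES]
print(refs)
```

The open question is whether the packaging format alone carries enough
meaning, or whether the client also needs a way to say, per link, "you must
dereference this" versus "you may".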

2/ Since the "deposit" of a by-reference object happens in two or more parts
(the manifest, followed by one or more dereferences), at what point is the
deposit action complete?  This is also true of the DataFlow case, where the
deposit of a BagIt-format ZIP file is the first step, but that ZIP is not
unpacked until later (perhaps much later).  This has an impact on how
SWORD Deposit Receipts should be interpreted.  If a Deposit Receipt is
returned which says SUCCESS, this may refer only to the first act of
physical deposit, and not to the subsequent acts of dereferencing or unpacking.
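To illustrate the timing issue in (2), here is a toy simulation (plain
Python; the state names and receipt fields are my inventions for
illustration, not terms from the spec) of a server that acknowledges
physical receipt immediately but unpacks later:

```python
# Toy model of two-phase deposit: the receipt only covers phase 1
# (physical deposit); unpacking happens at some later, unspecified time.
# State names and receipt fields here are illustrative, not spec terms.
class SimulatedServer:
    def __init__(self):
        self.state = None

    def deposit(self, package):
        # Phase 1: store the bytes and acknowledge immediately.
        self.state = "received"
        return {"status": 201, "treatment": "package stored, unpack pending"}

    def tick(self):
        # Phase 2: unpack "in the server's own time".
        if self.state == "received":
            self.state = "unpacked"

server = SimulatedServer()
receipt = server.deposit(b"...bagit zip bytes...")

# The SUCCESS receipt tells the client nothing about phase 2:
assert receipt["status"] == 201
assert server.state == "received"

server.tick()  # perhaps much later
assert server.state == "unpacked"
```

The question is which of these two moments the Deposit Receipt should be
read as describing.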

3/ Carrying on from (2), if there is an error in a downstream operation
which the server considers part of the "deposit" (dereference, unpack), but
the client has already been told that the initial content was deposited
successfully, how does SWORD alert the client that an error has occurred
along the way?  An obvious candidate would be for the Statement to have
semantics for dealing with errors; it may even be that the Statement in
its current form is sufficient, and the answer is for the "state" itself to
be an "error" state that the client understands and can respond to.
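As a sketch of that last option: a hand-written Statement fragment carrying
an illustrative "error" state, plus the few lines a client would need to
detect it. The state URI and the error term are my inventions, and the
feed below is hand-assembled rather than taken from any implementation;
only the general sword:state / sword:stateDescription shape follows the
current Statement format.

```python
# Hand-written Statement fragment with a hypothetical "error" state.
# The state URI is made up; a real server would define its own.
import xml.etree.ElementTree as ET

STATEMENT = """\
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:sword="http://purl.org/net/sword/terms/">
  <title>Deposit a1b2</title>
  <sword:state href="http://example.org/state/error">
    <sword:stateDescription>
      Dereference of http://example.org/data/part2.tar.gz failed (404)
    </sword:stateDescription>
  </sword:state>
</feed>
"""

SWORD = "http://purl.org/net/sword/terms/"
feed = ET.fromstring(STATEMENT)
state = feed.find("{%s}state" % SWORD)

# A client polling the Statement could treat this state as its error channel.
is_error = state.get("href").endswith("/error")
detail = state.find("{%s}stateDescription" % SWORD).text.strip()
print(is_error, detail)
```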

I'd be interested in people's thoughts.

Cheers,

Richard

On Tue, Dec 6, 2011 at 12:31 PM, Sally Rumsey <
sally.rum...@bodleian.ox.ac.uk> wrote:

> To add to the melting pot, we're (Oxford) working on both development of a
> data repository (DataBank) and a data catalogue as part of the JISC funded
> Damaro project.
>
> http://damaro.oucs.ox.ac.uk/
>
> DataBank data repository development also forms part of the DataFlow
> project
> http://www.dataflow.ox.ac.uk/
>
> Sally
>
> --
> Sally Rumsey
> Digital Collections Development Manager
> Bodleian Digital Library Systems and Services (BDLSS)
>
> sally.rum...@bodleian.ox.ac.uk
> Tel: 01865 283860
>
>
> > From: Stuart Lewis <s.le...@auckland.ac.nz>
> > Date: Mon, 5 Dec 2011 23:31:47 +0000
> > To: David FLANDERS <d.fland...@jisc.ac.uk>
> > Cc: <sword-app-tech@lists.sourceforge.net>, <oda...@gmail.com>, Rufus
> > <rufus.poll...@okfn.org>, Kathi Fletcher <kathi.fletc...@gmail.com>
> > Subject: Re: [sword-app-tech] How to send large files
> >
> > FWIW we've got a couple of students under New Zealand's "Summer of
> eResearch"
> > scheme looking at implementing a university data catalogue, with CKAN
> being
> > one of the candidate systems.  Will let you know how we get on.
> >
> >
> > Stuart Lewis
> > Digital Development Manager
> > Te Tumu Herenga The University of Auckland Library
> > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> > Ph: +64 (0)9 373 7599 x81928
> >
> >
> >
> > On 6/12/2011, at 11:28 AM, David FLANDERS wrote:
> >
> >> I’ve bugged Rufus a fair amount on this; one of his project managers,
> >> Mark Macgillvray, has been thinking on this re ‘Open Scholarship’ a
> >> fair amount – wish we could get a University to start playing around
> >> with this; of course the Tardis folk down in Oz have been doing good
> >> things as well. Cc Steve Androulakis.  /dff
> >>
> >> From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
> >> Sent: 05 December 2011 14:50
> >> To: David FLANDERS
> >> Cc: <sword-app-tech@lists.sourceforge.net>; Rufus; Leggett, Pete
> >> Subject: Re: [sword-app-tech] How to send large files
> >>
> >> Hi,
> >>
> >> I have CC'd Rufus Pollack of CKAN in case he has ideas about some sort
> of
> >> system where papers go in document repositories like DSpace, EPrint,
> and data
> >> goes in data repositories like CKAN etc.
> >>
> >> Kathi
> >>
> >> ---------- Forwarded message ----------
> >> From: David FLANDERS <d.fland...@jisc.ac.uk>
> >> Date: 2011/12/5
> >> Subject: Re: [sword-app-tech] How to send large files
> >> To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <
> s.le...@auckland.ac.nz>
> >> Cc: "&lt sword-app-tech@lists.sourceforge.net&gt"
> >> <sword-app-tech@lists.sourceforge.net>, "Leggett, Pete"
> >>
> >> +1
> >>
> >> Why not use systems *built for* data instead of a system built for
> research
> >> papers?  CKAN, Tardis, Kasabi, MongoDB, NoSQL store (triple, graph,
> >> keyValue)...?
> >>
> >> I’d like to hear a good reason not to use these systems and then
> >> interoperate with repositories, rather than building the same
> >> functionality into repositories.
> >> /dff
> >>
> >> From: Ben O'Steen [mailto:bost...@gmail.com]
> >> Sent: 05 December 2011 08:00
> >> To: Stuart Lewis
> >> Cc: <sword-app-tech@lists.sourceforge.net>; Leggett, Pete
> >>
> >> Subject: Re: [sword-app-tech] How to send large files
> >>
> >> While I think I understand the drive to put these files within a
> repository,
> >> I would suggest caution. Just because it might be possible to put a
> file into
> >> the care of a repository doesn't make it a practical or useful thing to
> do.
> >>
> >> - What do you feel you might gain by placing 500GB+ files into a
> repository,
> >> compared with having them in an addressable filestore?
> >> - Have people been able to download files of that size from DSpace,
> Fedora or
> >> EPrints?
> >> - Has the repository been allocated space on a suitable filesystem?
> XFS, EBS,
> >> Thumper or similar?
> >> - Once the file is ingested into DSpace or Fedora for example, is there
> any
> >> other route to retrieve this, aside from HTTP? (Coding your own
> servlet/addon
> >> is not a real answer to this.) Is it easily accessible via Grid-FTP or
> >> HPN-SSH for example?
> >> - Can the workflows you wish to utilise handle the data you are giving
> it? Is
> >> any broad stroke tool aside from fixity useful here?
> >>
> >> Again, I am advising caution here, not besmirching the name of
> repositories.
> >> They do a good job with what we might currently term "small files", but
> were
> >> never developed with research data sizes in mind (3-500GB is a decent
> >> rough guide; 1TB+ sets are certainly not uncommon).
> >>
> >> So, in short, weigh up the benefits against the downsides and not in
> >> hypotheticals. Actually do it, and get real researchers to try and use
> it.
> >> You'll soon have a metric to show what is useful and what isn't.
> >>
> >> On Monday, 5 December 2011, Stuart Lewis wrote:
> >> Hi Pete,
> >>
> >> Thanks for the information.  I've attached a piece of code that we use
> >> locally as part of the curation framework (in DSpace 1.7 or above),
> written
> >> by a colleague: Kim Shepherd.  The curation framework allows small jobs
> to be
> >> run on single items, collections, communities, or the whole repository.
>  This
> >> particular job looks to see if there is a filename in a prescribed
> >> metadata field, and if there is no matching bitstream, it will then
> ingest
> >> the file from disk.
> >>
> >> More details of the curation system can be seen at:
> >>
> >>  - https://wiki.duraspace.org/display/DSPACE/CurationSystem
> >>  - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
> >>
> >> Some other curation tasks that Kim has written:
> >>
> >>  - https://github.com/kshepherd/Curation
> >>
> >> This can be used by depositing the metadata via SWORD, with the
> filename in a
> >> metadata field.  Optionally the code could be changed to copy the file
> from
> >> another source (e.g. FTP, HTTP, Grid, etc).
> >>
> >> Thanks,
> >>
> >>
> >> Stuart Lewis
> >> Digital Development Manager
> >> Te Tumu Herenga The University of Auckland Library
> >> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> >> Ph: +64 (0)9 373 7599 x81928
> >>
> >>
> >>
> >> On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
> >>
> >>> Hi Stuart,
> >>>
> >>> You asked for more info. We are developing a Research Data Repository
> >>> based on DSpace for storing the research data associated with Exeter
> >>> University research publications.
> >>> For some research fields, such as Physics and Biology, this data can
> >>> be very large - TBs, it seems! - hence the need to consider large
> >>> ingests over what might be several days.
> >>> The researcher has the data and would, I am guessing, create the
> >>> metadata, but maybe in collaboration with a data curator. Ideally the
> >>> researcher would perform the deposit with, for large data sets, an
> >>> offline ingest of the data itself. The data can be on the researcher's
> >>> server/workstation/laptop/DVD/USB hard drive etc.
> >>>
> >>> There seem to be at least a couple of ways of approaching this, so
> >>> what I was after was some references to what and how other people have
> >>> done this, to give me a better handle on the best way forward - having
> >>> very little DSpace or repository experience myself. But given the size
> >>> of larger data sets, I do think the best solution will involve as
> >>> little copying of the data as possible - with the ultimate being just
> >>> one copy process, of the data from source into repository, everything
> >>> else being done by reference if that is possible.
> >>>
> >>> Are you perhaps able to point me at some "code" examples for the SWORD
> >>> deposit you talk about, where a second process ingests the files? Would
> >>> this be coded in Java?
> >>> Does the ingest process have to be Java-based, or can it be a Perl
> >>> script, for example? Please forgive my DSpace ignorance!
> >>>
> >>> Best regards,
> >>>
> >>> Pete
> >>>
> >>>
> >>> On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
> >>>
> >>>> Hi Pete,
> >>>>
> >>>> 'Deposit by reference' would probably be used to 'pull' data from a
> >>>> remote server.  If you already have the data on your DSpace server, as
> >>>> Mark points out there might be better ways to perform the import,
> such as
> >>>> registering the bitstreams, or just performing a local import.
> >>>>
> >>>> A SWORD deposit by reference might take place in two parts:
> >>>>
> >>>> - Deposit some metadata, that includes a description of the file(s) to
> >>>> be ingested
> >>>>
> >>>> - A second process (perhaps triggered by the SWORD deposit, or
> >>>> undertaken later, such as via a DSpace curation task) that ingests the
> >>>> file(s) into the DSpace object.
> >>>>
> >>>> Could you tell us a bit more about the process you want to implement?
> >>>> Who has the data, the metadata, who performs the deposit etc?
> >>>>
> >>>> Thanks,
> >>>>
> >>>>
> >>>> Stuart Lewis
> >>>> Digital Development Manager
> >>>> Te Tumu Herenga The University of Auckland Library
> >>>> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> >>>> Ph: +64 (0)9 373 7599 x81928
> >>>>
> >>>>
> >>>>
> >>>> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
> >>>>
> >>>>> Stuart,
> >>>>>
> >>>>> Can you provide any links to examples of using 'deposit by
> >>>>> reference'?
> >>>>>
> >>>>> I am looking at the feasibility of depositing very large items
> >>>>> (tar.gz or zip'd data files), say up to 16TB, into DSpace 1.6.x,
> >>>>> with the obvious problems of doing this using a web interface.
> >>>>> Wondering if EasyDeposit can be adapted to do 'deposit by reference',
> >>>>> with either a utility of some kind on the DSpace server looking for
> >>>>> large items to ingest, or a client pushing the data into a directory
> >>>>> on the DSpace server from where it can be ingested. Ideally I want to
> >>>>> minimise any copies of the data.
> >>>>>
> >>>>> I really want to avoid copying the item once it's on the DSpace server.
> >>>>> Could the item be uploaded directly into the asset store, maybe?
> >>>>> The other problem is how anyone could download the item once it's in
> >>>>> DSpace.
> >>>>>
> >>>>> Anyone else doing this sort of very large item (i.e. TBs) ingest?
> >>>>>
> >>>>> Thank you,
> >>>>>
> >>>>> Pete
> >>>>>
> >>>>>
> >>>>> From: David FLANDERS [mailto:d.fland...@jisc.ac.uk]
> >>>>
> >> ------------------------------------------------------------------------------
> >> All the data continuously generated in your IT infrastructure
> >> contains a definitive record of customers, application performance,
> >> security threats, fraudulent activity, and more. Splunk takes this
> >> data and makes sense of it. IT sense. And common sense.
> >> http://p.sf.net/sfu/splunk-novd2d
> >> _______________________________________________
> >> sword-app-tech mailing list
> >> sword-app-tech@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/sword-app-tech
> >>
> >>
>
> >>
> >>
> >>
> >>
> >> --
> >> Katherine Fletcher, kathi.fletc...@gmail.com
> >> Twitter: kefletcher Blog: kefletcher.blogspot.com
> >>
> >>
> >>
> >>
>
> >
> >
> >
> >
>
>
>
> ------------------------------------------------------------------------------
> Cloud Services Checklist: Pricing and Packaging Optimization
> This white paper is intended to serve as a reference, checklist and point
> of
> discussion for anyone considering optimizing the pricing and packaging
> model
> of a cloud services business. Read Now!
> http://www.accelacomm.com/jaw/sfnl/114/51491232/
> _______________________________________________
> sword-app-tech mailing list
> sword-app-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/sword-app-tech
>
