2011/12/5 David FLANDERS <d.fland...@jisc.ac.uk>:
> I've bugged Rufus a fair amount on this; one of his project managers, Mark
What more do we need to do? :-) We're really interested in the use of a
data hub / "DMS" (a CMS, but for data) like CKAN in the academic
environment. We also talked earlier in the year about having CKAN implement
SWORD (still up for this -- it just needs to be scheduled among the
multitude of other cool features wanted!). BTW: my CKAN colleague Irina (in
cc) will be at DCC in Bristol today and tomorrow and would love to talk
with people more about CKAN, http://thedatahub.org/ and data in general.

rufus

> MacGillivray has been thinking on this re 'Open Scholarship' a fair
> amount -- I wish we could get a University to start playing around with
> this. Of course the Tardis folk down in Oz have been doing good things as
> well; Cc Steve Androulakis. /dff
>
> From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
> Sent: 05 December 2011 14:50
> To: David FLANDERS
> Cc: sword-app-tech@lists.sourceforge.net; Rufus; Leggett, Pete
> Subject: Re: [sword-app-tech] How to send large fiels
>
> Hi,
>
> I have CC'd Rufus Pollock of CKAN in case he has ideas about some sort of
> system where papers go in document repositories like DSpace and EPrints,
> and data goes in data repositories like CKAN etc.
>
> Kathi
>
> ---------- Forwarded message ----------
> From: David FLANDERS <d.fland...@jisc.ac.uk>
> Date: 2011/12/5
> Subject: Re: [sword-app-tech] How to send large fiels
> To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <s.le...@auckland.ac.nz>
> Cc: sword-app-tech@lists.sourceforge.net, "Leggett, Pete"
> <p.f.legg...@exeter.ac.uk>
>
> +1
>
> Why not use systems *built for* data instead of a system built for
> research papers? CKAN, Tardis, Kasabi, MongoDB, a NoSQL store (triple,
> graph, key-value)...?
>
> I'd like to hear a good reason not to use these systems and then
> interoperate with repositories, rather than building the same
> functionality into repositories.
> /dff
>
> From: Ben O'Steen [mailto:bost...@gmail.com]
> Sent: 05 December 2011 08:00
> To: Stuart Lewis
> Cc: sword-app-tech@lists.sourceforge.net; Leggett, Pete
> Subject: Re: [sword-app-tech] How to send large fiels
>
> While I think I understand the drive to put these files within a
> repository, I would suggest caution. Just because it might be possible to
> put a file into the care of a repository doesn't make it a practical or
> useful thing to do.
>
> - What do you feel you might gain by placing 500GB+ files into a
>   repository, compared with having them in an addressable filestore?
> - Have people been able to download files of that size from DSpace,
>   Fedora or EPrints?
> - Has the repository been allocated space on a suitable filesystem? XFS,
>   EBS, Thumper or similar?
> - Once the file is ingested into DSpace or Fedora, for example, is there
>   any route to retrieve it other than HTTP? (Coding your own
>   servlet/add-on is not a real answer to this.) Is it easily accessible
>   via GridFTP or HPN-SSH, for example?
> - Can the workflows you wish to use handle the data you are giving them?
>   Is any broad-stroke tool, aside from fixity checking, useful here?
>
> Again, I am advising caution here, not besmirching the name of
> repositories. They do a good job with what we might currently term
> "small files", but they were never developed with research-data sizes in
> mind (3-500GB is a decent rough guide; 1TB+ sets are certainly not
> uncommon).
>
> So, in short, weigh the benefits against the downsides -- and not in
> hypotheticals. Actually do it, and get real researchers to try to use it.
> You'll soon have a metric to show what is useful and what isn't.
>
> On Monday, 5 December 2011, Stuart Lewis wrote:
>
> Hi Pete,
>
> Thanks for the information. I've attached a piece of code that we use
> locally as part of the curation framework (in DSpace 1.7 or above),
> written by a colleague, Kim Shepherd.
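The fixity checking mentioned above is about the only broad-stroke tool
that scales to files of this size, provided the checksum is computed in
chunks rather than by reading the whole file into memory. A minimal sketch
(the function name is made up for illustration):

```python
import hashlib

def file_sha256(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 fixity checksum by streaming the file in
    1 MiB chunks, so a 500 GB file never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() calls fh.read(chunk_size) until it returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The same streaming pattern applies whatever digest algorithm the
repository records (MD5 is what DSpace historically stored for
bitstreams).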
> The curation framework allows small jobs to be run on single items,
> collections, communities, or the whole repository. This particular job
> looks to see if there is a filename in a prescribed metadata field, and
> if there is no matching bitstream, it then ingests the file from disk.
>
> More details of the curation system can be seen at:
>
> - https://wiki.duraspace.org/display/DSPACE/CurationSystem
> - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
>
> Some other curation tasks that Kim has written:
>
> - https://github.com/kshepherd/Curation
>
> This can be used by depositing the metadata via SWORD, with the filename
> in a metadata field. Optionally, the code could be changed to copy the
> file from another source (e.g. FTP, HTTP, Grid, etc.).
>
> Thanks,
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga, The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
> On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
>
>> Hi Stuart,
>>
>> You asked for more info. We are developing a Research Data Repository
>> based on DSpace for storing the research data associated with Exeter
>> University research publications.
>> For some research fields, such as Physics and Biology, this data can be
>> very large -- TBs, it seems! -- hence the need to consider large
>> ingests over what might be several days.
>> The researcher has the data, and would, I am guessing, create the
>> metadata, though maybe in collaboration with a data curator. Ideally
>> the researcher would perform the deposit with, for large data sets, an
>> offline ingest of the data itself. The data can be on the researcher's
>> server/workstation/laptop/DVD/USB hard drive etc.
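The curation task described above is written in Java against the DSpace
API, but the pattern it implements is simple enough to sketch standalone.
A hedged Python illustration (the metadata field name
`local.import.filename` and the function name are invented; a real task
would use whatever field the repository is configured to read):

```python
import os
import shutil

def ingest_referenced_file(metadata, existing_bitstreams, source_dir, asset_dir):
    """Sketch of the curation-task pattern: if the item's metadata names
    a file and no matching bitstream exists yet, ingest the file from a
    local staging directory into the asset store."""
    filename = metadata.get("local.import.filename")  # hypothetical field
    if not filename:
        return "skipped: no filename in metadata"
    if filename in existing_bitstreams:
        return "skipped: bitstream already present"
    source = os.path.join(source_dir, filename)
    if not os.path.isfile(source):
        return "failed: referenced file not found"
    # copy2 preserves timestamps; a real large-data ingest might hard-link
    # or move instead, to avoid a second full copy of the data.
    shutil.copy2(source, os.path.join(asset_dir, filename))
    return "ingested: " + filename
```

Run repository-wide, a task like this makes the SWORD deposit cheap (just
metadata) while the heavy file transfer happens out of band.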
>>
>> There seem to be at least a couple of ways of approaching this, so
>> what I was after was some references to what other people have done,
>> and how, to give me a better handle on the best way forward -- having
>> very little DSpace or repository experience myself. But given the size
>> of larger data sets, I do think the best solution will involve as
>> little copying of the data as possible -- ultimately just one copy
>> process, of the data from source into repository, with everything else
>> done by reference, if that is possible.
>>
>> Are you perhaps able to point me at some "code" examples for the SWORD
>> deposit you talk about, where a second process ingests the files?
>> Would this be coded in Java?
>> Does the ingest process have to be Java-based, or can it be a Perl
>> script, for example? Please forgive my DSpace ignorance!
>>
>> Best regards,
>>
>> Pete
>>
>> On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
>>
>>> Hi Pete,
>>>
>>> 'Deposit by reference' would probably be used to 'pull' data from a
>>> remote server. If you already have the data on your DSpace server, as
>>> Mark points out, there might be better ways to perform the import,
>>> such as registering the bitstreams or just performing a local import.
>>>
>>> A SWORD deposit by reference might take place in two parts:
>>>
>>> - Deposit some metadata that includes a description of the file(s) to
>>>   be ingested.
>>> - A second process (perhaps triggered by the SWORD deposit, or
>>>   undertaken later, such as via a DSpace curation task) ingests the
>>>   file(s) into the DSpace object.
>>>
>>> Could you tell us a bit more about the process you want to implement?
>>> Who has the data and the metadata, who performs the deposit, etc.?
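The first part of the two-part flow above -- a metadata-only deposit
carrying the file reference -- can be sketched as the construction of an
Atom entry, since SWORD deposits are Atom-based. A minimal illustration
(using dc:source to carry the file reference is an assumption; the real
field would be whatever the second-stage ingest process is configured to
read):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DC = "http://purl.org/dc/elements/1.1/"

def build_deposit_entry(title, referenced_file):
    """Build an Atom entry whose metadata names the file to be ingested
    later by a separate process (e.g. a curation task)."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    # Hypothetical choice of field for the file reference:
    ET.SubElement(entry, "{%s}source" % DC).text = referenced_file
    return ET.tostring(entry, encoding="unicode")
```

The entry would then be POSTed to the repository's SWORD deposit URL; the
multi-terabyte file itself never travels over that HTTP connection.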
>>>
>>> Thanks,
>>>
>>> Stuart Lewis
>>> Digital Development Manager
>>> Te Tumu Herenga, The University of Auckland Library
>>> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
>>> Ph: +64 (0)9 373 7599 x81928
>>>
>>> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
>>>
>>>> Stuart,
>>>>
>>>> Can you provide any links to examples of using 'deposit by
>>>> reference'?
>>>>
>>>> I am looking at the feasibility of depositing very large items
>>>> (tar.gz or zipped data files), say up to 16TB, into DSpace 1.6.x,
>>>> with the obvious problems of doing this using a web interface.
>>>> I am wondering if EasyDeposit can be adapted to do 'deposit by
>>>> reference', with either a utility of some kind on the DSpace server
>>>> looking for large items to ingest, or a client pushing the data into
>>>> a directory on the DSpace server from where it can be ingested.
>>>> Ideally I want to minimise any copies of the data.
>>>>
>>>> I really want to avoid copying the item once it's on the DSpace
>>>> server. Could the item be uploaded directly into the asset store,
>>>> maybe?
>>>> The other problem is how anyone could download the item once it's in
>>>> DSpace.
>>>>
>>>> Is anyone else doing this sort of very large item (i.e. TBs) ingest?
>>>>
>>>> Thank you,
>>>>
>>>> Pete
>>>>
>>>> From: David FLANDERS [mailto:d.fland...@jisc.ac.uk]
> --
> Katherine Fletcher, kathi.fletc...@gmail.com
> Twitter: kefletcher  Blog: kefletcher.blogspot.com

--
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/

_______________________________________________
sword-app-tech mailing list
sword-app-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sword-app-tech