2011/12/5 David FLANDERS <d.fland...@jisc.ac.uk>:
> I've bugged Rufus a fair amount on this; one of his project managers, Mark
What more do we need to do? :-) We're really interested in the use of a
data hub / "DMS" (a CMS, but for data) like CKAN in the academic
environment. We also talked earlier in the year about having CKAN implement
SWORD (still up for this -- it just needs to be scheduled among the
multitude of other cool features wanted!). BTW: my CKAN colleague Irina (in
cc) will be at DCC in Bristol today and tomorrow and would love to talk
with people more about CKAN, http://thedatahub.org/ and data in general.

rufus

> MacGillivray has been thinking on this re 'Open Scholarship' a fair
> amount -- I wish we could get a University to start playing around with
> this. Of course the Tardis folk down in Oz have been doing good things as
> well; Cc Steve Androulakis. /dff
>
> From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
> Sent: 05 December 2011 14:50
> To: David FLANDERS
> Cc: sword-app-tech@lists.sourceforge.net; Rufus; Leggett, Pete
> Subject: Re: [sword-app-tech] How to send large fiels
>
> Hi,
>
> I have CC'd Rufus Pollock of CKAN in case he has ideas about some sort of
> system where papers go in document repositories like DSpace and EPrints,
> and data goes in data repositories like CKAN etc.
>
> Kathi
>
> ---------- Forwarded message ----------
> From: David FLANDERS <d.fland...@jisc.ac.uk>
> Date: 2011/12/5
> Subject: Re: [sword-app-tech] How to send large fiels
> To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <s.le...@auckland.ac.nz>
> Cc: sword-app-tech@lists.sourceforge.net, "Leggett, Pete"
> <p.f.legg...@exeter.ac.uk>
>
> +1
>
> Why not use systems *built for* data instead of a system built for
> research papers? CKAN, Tardis, Kasabi, MongoDB, a NoSQL store (triple,
> graph, key-value)...?
>
> I'd like to hear a good reason not to use these systems and then
> interoperate with repositories, rather than building the same
> functionality into repositories.
> /dff
>
> From: Ben O'Steen [mailto:bost...@gmail.com]
> Sent: 05 December 2011 08:00
> To: Stuart Lewis
> Cc: sword-app-tech@lists.sourceforge.net; Leggett, Pete
> Subject: Re: [sword-app-tech] How to send large fiels
>
> While I think I understand the drive to put these files within a
> repository, I would suggest caution. Just because it might be possible to
> put a file into the care of a repository doesn't make it a practical or
> useful thing to do.
>
> - What do you feel you might gain by placing 500GB+ files into a
>   repository, compared with having them in an addressable filestore?
> - Have people been able to download files of that size from DSpace,
>   Fedora or EPrints?
> - Has the repository been allocated space on a suitable filesystem? XFS,
>   EBS, Thumper or similar?
> - Once the file is ingested into DSpace or Fedora, for example, is there
>   any route to retrieve it other than HTTP? (Coding your own
>   servlet/add-on is not a real answer to this.) Is it easily accessible
>   via GridFTP or HPN-SSH, for example?
> - Can the workflows you wish to use handle the data you are giving them?
>   Is any broad-stroke tool, aside from fixity checking, useful here?
>
> Again, I am advising caution here, not besmirching the name of
> repositories. They do a good job with what we might currently term
> "small files", but they were never developed with research-data sizes in
> mind (3-500GB is a decent rough guide; 1TB+ sets are certainly not
> uncommon).
>
> So, in short, weigh the benefits against the downsides -- and not in
> hypotheticals. Actually do it, and get real researchers to try to use it.
> You'll soon have a metric to show what is useful and what isn't.
>
> On Monday, 5 December 2011, Stuart Lewis wrote:
>
> Hi Pete,
>
> Thanks for the information. I've attached a piece of code that we use
> locally as part of the curation framework (in DSpace 1.7 or above),
> written by a colleague, Kim Shepherd.
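The fixity checking mentioned above is about the only broad-stroke tool
that scales to files of this size, provided the checksum is computed in
chunks rather than by reading the whole file into memory. A minimal sketch
(the function name is made up for illustration):

```python
import hashlib

def file_sha256(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 fixity checksum by streaming the file in
    1 MiB chunks, so a 500 GB file never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() calls fh.read(chunk_size) until it returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The same streaming pattern applies whatever digest algorithm the
repository records (MD5 is what DSpace historically stored for
bitstreams).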
> The curation framework allows small jobs to be run on single items,
> collections, communities, or the whole repository. This particular job
> looks to see if there is a filename in a prescribed metadata field, and
> if there is no matching bitstream, it then ingests the file from disk.
>
> More details of the curation system can be seen at:
>
> - https://wiki.duraspace.org/display/DSPACE/CurationSystem
> - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
>
> Some other curation tasks that Kim has written:
>
> - https://github.com/kshepherd/Curation
>
> This can be used by depositing the metadata via SWORD, with the filename
> in a metadata field. Optionally, the code could be changed to copy the
> file from another source (e.g. FTP, HTTP, Grid, etc.).
>
> Thanks,
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga, The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
> On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
>
>> Hi Stuart,
>>
>> You asked for more info. We are developing a Research Data Repository
>> based on DSpace for storing the research data associated with Exeter
>> University research publications.
>> For some research fields, such as Physics and Biology, this data can be
>> very large -- TBs, it seems! -- hence the need to consider large
>> ingests over what might be several days.
>> The researcher has the data, and would, I am guessing, create the
>> metadata, though maybe in collaboration with a data curator. Ideally
>> the researcher would perform the deposit with, for large data sets, an
>> offline ingest of the data itself. The data can be on the researcher's
>> server/workstation/laptop/DVD/USB hard drive etc.
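The curation task described above is written in Java against the DSpace
API, but the pattern it implements is simple enough to sketch standalone.
A hedged Python illustration (the metadata field name
`local.import.filename` and the function name are invented; a real task
would use whatever field the repository is configured to read):

```python
import os
import shutil

def ingest_referenced_file(metadata, existing_bitstreams, source_dir, asset_dir):
    """Sketch of the curation-task pattern: if the item's metadata names
    a file and no matching bitstream exists yet, ingest the file from a
    local staging directory into the asset store."""
    filename = metadata.get("local.import.filename")  # hypothetical field
    if not filename:
        return "skipped: no filename in metadata"
    if filename in existing_bitstreams:
        return "skipped: bitstream already present"
    source = os.path.join(source_dir, filename)
    if not os.path.isfile(source):
        return "failed: referenced file not found"
    # copy2 preserves timestamps; a real large-data ingest might hard-link
    # or move instead, to avoid a second full copy of the data.
    shutil.copy2(source, os.path.join(asset_dir, filename))
    return "ingested: " + filename
```

Run repository-wide, a task like this makes the SWORD deposit cheap (just
metadata) while the heavy file transfer happens out of band.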
>>
>> There seem to be at least a couple of ways of approaching this, so
>> what I was after was some references to what other people have done,
>> and how, to give me a better handle on the best way forward -- having
>> very little DSpace or repository experience myself. But given the size
>> of larger data sets, I do think the best solution will involve as
>> little copying of the data as possible -- ultimately just one copy
>> process, of the data from source into repository, with everything else
>> done by reference, if that is possible.
>>
>> Are you perhaps able to point me at some "code" examples for the SWORD
>> deposit you talk about, where a second process ingests the files?
>> Would this be coded in Java?
>> Does the ingest process have to be Java-based, or can it be a Perl
>> script, for example? Please forgive my DSpace ignorance!
>>
>> Best regards,
>>
>> Pete
>>
>> On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
>>
>>> Hi Pete,
>>>
>>> 'Deposit by reference' would probably be used to 'pull' data from a
>>> remote server. If you already have the data on your DSpace server, as
>>> Mark points out, there might be better ways to perform the import,
>>> such as registering the bitstreams or just performing a local import.
>>>
>>> A SWORD deposit by reference might take place in two parts:
>>>
>>> - Deposit some metadata that includes a description of the file(s) to
>>>   be ingested.
>>> - A second process (perhaps triggered by the SWORD deposit, or
>>>   undertaken later, such as via a DSpace curation task) ingests the
>>>   file(s) into the DSpace object.
>>>
>>> Could you tell us a bit more about the process you want to implement?
>>> Who has the data and the metadata, who performs the deposit, etc.?
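The first part of the two-part flow above -- a metadata-only deposit
carrying the file reference -- can be sketched as the construction of an
Atom entry, since SWORD deposits are Atom-based. A minimal illustration
(using dc:source to carry the file reference is an assumption; the real
field would be whatever the second-stage ingest process is configured to
read):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DC = "http://purl.org/dc/elements/1.1/"

def build_deposit_entry(title, referenced_file):
    """Build an Atom entry whose metadata names the file to be ingested
    later by a separate process (e.g. a curation task)."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    # Hypothetical choice of field for the file reference:
    ET.SubElement(entry, "{%s}source" % DC).text = referenced_file
    return ET.tostring(entry, encoding="unicode")
```

The entry would then be POSTed to the repository's SWORD deposit URL; the
multi-terabyte file itself never travels over that HTTP connection.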
>>>
>>> Thanks,
>>>
>>> Stuart Lewis
>>> Digital Development Manager
>>> Te Tumu Herenga, The University of Auckland Library
>>> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
>>> Ph: +64 (0)9 373 7599 x81928
>>>
>>> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
>>>
>>>> Stuart,
>>>>
>>>> Can you provide any links to examples of using 'deposit by
>>>> reference'?
>>>>
>>>> I am looking at the feasibility of depositing very large items
>>>> (tar.gz or zipped data files), say up to 16TB, into DSpace 1.6.x,
>>>> with the obvious problems of doing this using a web interface.
>>>> I am wondering if EasyDeposit can be adapted to do 'deposit by
>>>> reference', with either a utility of some kind on the DSpace server
>>>> looking for large items to ingest, or a client pushing the data into
>>>> a directory on the DSpace server from where it can be ingested.
>>>> Ideally I want to minimise any copies of the data.
>>>>
>>>> I really want to avoid copying the item once it's on the DSpace
>>>> server. Could the item be uploaded directly into the asset store,
>>>> maybe?
>>>> The other problem is how anyone could download the item once it's in
>>>> DSpace.
>>>>
>>>> Is anyone else doing this sort of very large item (i.e. TBs) ingest?
>>>>
>>>> Thank you,
>>>>
>>>> Pete
>>>>
>>>> From: David FLANDERS [mailto:d.fland...@jisc.ac.uk]
> --
> Katherine Fletcher, kathi.fletc...@gmail.com
> Twitter: kefletcher  Blog: kefletcher.blogspot.com

--
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/

_______________________________________________
sword-app-tech mailing list
sword-app-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sword-app-tech