Re: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)

2011-11-30 Thread Peter Rice

On 11/30/2011 11:32 AM, Pjotr Prins wrote:


Git is not very good for storing large data files, which we would want
to fetch partially. My suggestion would be to have a plain old file
repo, e.g. on S3, which can be mirrored by others.


We had issues with large files in the EMBOSS release, so we make those 
available via rsync to add to the developers' CVS checkout. They include 
the NCBI taxonomy source and index files and the ontology source and 
index files.


The next EMBOSS release will include http and ftp URLs as valid inputs 
for any data type, so EMBOSS could use remote files for format tests. I'll 
look into how other repositories could be added.


I had to add some extra qualifiers to allow queries and offsets to be 
specified, and rewrote the query language parsing to merge very similar 
code segments.


regards,

Peter Rice
EMBOSS Team
___
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss


Re: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)

2011-11-30 Thread Peter Cock
On Wed, Nov 30, 2011 at 11:38 AM, Peter Rice p...@ebi.ac.uk wrote:
 On 11/30/2011 11:32 AM, Pjotr Prins wrote:

 Git is not very good for storing large data files, which we would want
 to fetch partially. My suggestion would be to have a plain old file
 repo, e.g. on S3, which can be mirrored by others.

 We had issues with large files in the EMBOSS release, and make those
 available via rsync to add to the developers CVS checkout. They include the
 NCBI taxonomy source and index files and the ontology source and index
 files.

 The next EMBOSS release will include http and ftp URLs as valid inputs for
 any data type, so EMBOSS could use remote files for format tests. I'll look
 into how other repositories could be added.

 I had to add some extra qualifiers to allow queries and offsets to be
 specified, and rewrote the query language parsing to merge very similar code
 segments.

 regards,

 Peter Rice
 EMBOSS Team

How about an OBF hosted FTP site then if we want big data?
I guess we'd mostly be adding files, and changes/deletions
should be rare, so a full version tracking repository isn't
essential if we are disciplined about updating README files
or more formal meta data.

Peter


Re: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)

2011-11-30 Thread Peter Cock
On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins pjotr.publi...@thebird.nl wrote:
 On Wed, Nov 30, 2011 at 11:42:22AM +, Peter Cock wrote:
 How about an OBF hosted FTP site then if we want big data?

 Yes :)

 I guess we'd mostly be adding files, and changes/deletions
 should be rare, so a full version tracking repository isn't
 essential if we are disciplined about updating README files
 or more formal meta data.

 We can still have the READMEs and MD5s mirrored in a small repo. That
 would track changes/moving/renaming.

 Pj.

True, or even a hybrid where small files also live in a git
repo, but for larger files we just store the URL and MD5?

Peter
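
A minimal sketch of the hybrid idea discussed above, assuming a small
manifest that records a URL and an MD5 for each large file and is kept in
the git repo while the files themselves live on a plain file server. The
function names, the manifest layout and the ftp://example.org base URL are
all hypothetical; nothing here is an agreed format.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, base_url):
    """Map each file under data_dir to a (hypothetical) URL and its MD5."""
    data_dir = Path(data_dir)
    base_url = base_url.rstrip("/")
    manifest = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(data_dir).as_posix()
            manifest[rel] = {"url": f"{base_url}/{rel}", "md5": md5sum(path)}
    return manifest


def verify(path, entry):
    """Return True if the local file matches the MD5 in its manifest entry."""
    return md5sum(path) == entry["md5"]
```

The manifest could be written out as JSON or as a plain "md5  filename"
list and committed to the small repo, so that renames and replacements show
up as ordinary diffs. A mirror would then run verify() on each file after
fetching it.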


Re: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)

2011-11-30 Thread Fields, Christopher J
On Nov 30, 2011, at 5:58 AM, Peter Cock wrote:

 On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins pjotr.publi...@thebird.nl 
 wrote:
 On Wed, Nov 30, 2011 at 11:42:22AM +, Peter Cock wrote:
 How about an OBF hosted FTP site then if we want big data?
 
 Yes :)
 
 I guess we'd mostly be adding files, and changes/deletions
 should be rare, so a full version tracking repository isn't
 essential if we are disciplined about updating README files
 or more formal meta data.
 
 We can still have the READMEs and MD5s mirrored in a small repo. That
 would track changes/moving/renaming.
 
 Pj.
 
 True, or even a hybrid where small files also live in a git
 repo, but for larger files we just store the URL and MD5?
 
 Peter

There was an initial push for this years ago IIRC, with the biodata repository, 
but it never took off.  Not sure if the dev.open-bio.org CVS repo is even 
browsable anymore (I believe this was all synced to portal for browsing), but 
the old biodata CVS repo is still in /home/repositories/biodata (very little 
there, might as well start from scratch).

chris

