Hello Curtis,
The datacache was originally pointed to the data staging area and is now
pointed to the data published area. The difference is that the published
area contains data and location (.loc) files that are in sync and have
completed final testing. Whether to use the staged-only data is your
choice: it depends on how risk tolerant your project is and whether you
plan on testing. That said, I think it is almost certainly fine, or our
team wouldn't have staged it yet. A vanishingly small number of
datasets are pulled back once they make it to staging, and this is why
we were comfortable pointing datacache there in the first place (we were
unable to point to the published area at first, but wanted to make the
data available ASAP).
Going forward - I can let you know that these indexes are very easy to
create: one command-line execution, then add one line to the associated
.loc file. Instructions are here, see "Bowtie and Tophat":
http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup
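As a rough sketch of those two steps for a Bowtie2 index (not our
production script): the helper names below are hypothetical, the paths
are placeholders, and the four-column .loc layout (build id, dbkey,
display name, index base path) is an assumption that should be checked
against the sample .loc file shipped with your Galaxy version.

```python
import subprocess


def build_command(fasta, index_base):
    """Return the bowtie2-build invocation: reference FASTA, then index basename."""
    return ["bowtie2-build", str(fasta), str(index_base)]


def loc_line(build_id, dbkey, display_name, index_base):
    """Format one tab-separated .loc entry (column order is an assumption)."""
    return "\t".join([build_id, dbkey, display_name, str(index_base)])


def append_loc_entry(loc_file, line):
    """Append one entry to the .loc file, creating the file if it is missing."""
    with open(loc_file, "a") as handle:
        handle.write(line + "\n")


def index_genome(fasta, index_base, loc_file, build_id, dbkey, display_name):
    """Run the one command-line step, then add the one .loc line."""
    # check=True raises CalledProcessError if bowtie2-build fails,
    # so the .loc file is only touched after a successful build.
    subprocess.run(build_command(fasta, index_base), check=True)
    append_loc_entry(loc_file, loc_line(build_id, dbkey, display_name, index_base))
```

A call might look like index_genome("mm9.fa", "/data/mm9/bowtie2_index/mm9",
"tool-data/bowtie2_indices.loc", "mm9", "mm9", "Mouse (mm9)"), with every
path adjusted to your own installation.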
For one or a few genomes, this is not a problem. For hundreds of genomes
with variants, it can become tedious even with helper tools, and in our
case the processing interacted with disk that was undergoing changes (as
we have been working on system configuration most of the summer). Also,
with the Data Manager now available, creating batch indexes for use via
rsync became lower priority. Even so, I would expect more indexes to be
fully published once the final configuration is in place, as many are
already staged or close to being staged (watch the yellow banner on Main).
Hopefully this helps to explain the data, guides you to an informed
decision, and aids with creating your own indexes as needed.
Thanks!
Jen
Galaxy team
On 9/18/13 1:04 PM, Curtis Hendrickson (Campus) wrote:
Folks,
First, I wanted to thank you for making the datacache available
(http://wiki.galaxyproject.org/Admin/Data%20Integration;
rsync://datacache.g2.bx.psu.edu). It's a great resource.
However, what is the best way to stay abreast of changes to what's in
datacache, and understand how these indexes are computed?
We are currently upgrading to bowtie2, but I notice that the bowtie2
indices for mm9, which used to be in
rsync://datacache.g2.bx.psu.edu/indexes/mm9/mm9*/bowtie2_index
have been removed, and only the hg19 genome has bowtie2 indices. Why
only that one, and not the others?
Where are the scripts you use to make these indices, in case I want to
create bowtie2 indices for other
So, how do I find out **why** they were removed? (Can I safely use the
copy I have, or was there a problem with them?)
More generally, how do I understand the policies and logic behind the
datacache indices, and be notified of changes, short of running my own
periodic rsync/diff?
Finally, since I'm doing "reproducible research", is anything planned
for systematically versioning genome indices, so I can easily tell
what version of a system (i.e., what BWA version) was used to create the
index, and be sure that an index will not suddenly disappear?
Thanks,
Curtis
Research Associate/CTSA-Informatics Team
University of Alabama at Birmingham
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/
--
Jennifer Hillman-Jackson
http://galaxyproject.org