Re: [Genome] programmatic interface to get multi-genome alignments

Hiram Clawson Tue, 20 Mar 2012 11:18:29 -0700

Good Morning Jacques:

The data files for all multiple alignments currently total just
under 1 TB (uncompressed).  The most efficient way for you to access
that data is to keep the .maf files at your site and use the maf
selection tools from the kent source code to extract information
from them.  It would be very difficult to access this information
via the DAS or Table Browser interface because of the immense amount
of data in the answer sets and the processing time needed to extract
an answer.


There are several mechanisms you can use to obtain the maf
files for local use.  The rsync server at hgdownload can be
used to obtain a list of files.  For example, to obtain
a list of the uncompressed maf files used by the genome browser:

rsync -navP --exclude 'genbank/' rsync://hgdownload.cse.ucsc.edu/gbdb/ 2>&1 \
    | grep multiz | grep -v "^d" | egrep 'maf$' > /tmp/gbdb.maf.file.list

Alternatively, to list the gzip-compressed maf files from the goldenPath downloads:

rsync -navP rsync://hgdownload.cse.ucsc.edu/goldenPath/ 2>&1 \
    | grep "multiz" | grep "maf.gz" | grep -v upstream \
    > /tmp/goldenPath.maf.gz.file.list

To select the file names from those listings:

awk '{print $NF}' /tmp/goldenPath.maf.gz.file.list > /tmp/fetch.maf.list
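
As a small illustration of what that awk step does, here is a sketch run
against two made-up listing lines.  The sizes and dates are invented for
the example, and the exact column layout of the rsync listing is an
assumption; only the last field (the path) matters here.

```shell
# Two illustrative listing lines (invented values, not real server output).
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-rw-r--     123456789 2009/11/10 12:00:00 hg19/multiz46way/maf/chr1.maf.gz
-rw-rw-r--      98765432 2009/11/10 12:05:00 hg19/multiz46way/maf/chr2.maf.gz
EOF

# $NF is the last whitespace-separated field on each line, which is the
# file path relative to the rsync module.
fetch_list=$(awk '{print $NF}' "$listing")
printf '%s\n' "$fetch_list"

rm -f "$listing"
```

The resulting paths are relative to the goldenPath/ module, which is the
form that rsync --files-from expects.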

And then to transfer just those files:

rsync -avP --files-from=/tmp/fetch.maf.list \
    rsync://hgdownload.cse.ucsc.edu/goldenPath/ ./

The directory hierarchy of those files will be recreated under ./

You can now work directly with the maf files to answer all questions
about the alignment; for example, to extract a list of species in the
alignment:

mafSpeciesList file.maf.gz stdout
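
If the kent utilities are not yet installed, a rough equivalent can be
sketched with awk.  This is not how mafSpeciesList is implemented; it
only assumes the usual MAF convention that each "s" line names its
source as database.chromosome.  The sample block below is made up for
the illustration.

```shell
# A tiny made-up MAF file with two alignment blocks.
sample=$(mktemp)
cat > "$sample" <<'EOF'
##maf version=1
a score=100.0
s hg19.chr1    100 8 + 249250621 ACGTACGT
s panTro2.chr1 120 8 + 229974691 ACGTACGT
s mm9.chr4     500 7 - 155630120 ACG-ACGT

a score=50.0
s hg19.chr1    200 4 + 249250621 ACGT
s mm9.chr4     700 4 - 155630120 ACGT
EOF

# On each "s" line, field 2 is the source name; keep the part before the
# first dot and reduce the result to unique names.
species=$(awk '$1 == "s" { split($2, part, "."); print part[1] }' "$sample" \
    | LC_ALL=C sort -u)
printf '%s\n' "$species"

rm -f "$sample"
```

For a gzipped file, the same awk command can read from
gzip -dc file.maf.gz instead of a file name.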

Note the maf utilities in the kent source tree:

mafAddIRows mafAddQRows mafCoverage mafFetch mafFilter
mafFrag mafFrags mafGene mafMeFirst mafOrder mafRanges
mafSpeciesList mafSpeciesSubset mafSplit mafSplitPos
mafToAxt mafToPsl mafsInRegion
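
For the request about retrieving alignments under a set of BED
coordinates, restricted to chosen species, two of these tools could be
chained roughly as follows.  This is only a sketch: the helper name
extract_region_mafs is hypothetical, and the argument order shown for
mafsInRegion (regions.bed out.maf in.maf) and mafSpeciesSubset
(in.maf species.lst out.maf) is an assumption that should be checked by
running each tool with no arguments to print its usage message.

```shell
# Hypothetical helper: extract_region_mafs regions.bed in.maf species.lst out.maf
extract_region_mafs() {
    local bed=$1 in_maf=$2 species_lst=$3 out_maf=$4
    local tmp status

    tmp=$(mktemp) || return 1

    # Step 1 (assumed usage): keep only alignment blocks overlapping the
    # BED regions.
    # Step 2 (assumed usage): keep only the species named in species.lst,
    # one name per line.
    mafsInRegion "$bed" "$tmp" "$in_maf" &&
        mafSpeciesSubset "$tmp" "$species_lst" "$out_maf"
    status=$?

    rm -f "$tmp"
    return $status
}

# Example call (file names are placeholders):
#   extract_region_mafs peaks.bed chr1.maf species.lst peaks.chr1.maf
```

Keeping the two steps in one function makes it easy to loop over the
per-chromosome maf files in the downloaded hierarchy.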

--Hiram

Jacques van Helden wrote:
> Dear UCSC team,
> 
> First of all, thank you very much for developing and maintaining the UCSC 
> Genome Browser, which is a great resource for all the community. 
> 
> We have been developing, since 1997, a software suite called Regulatory Sequence 
> Analysis Tools (RSAT, http://rsat.ulb.ac.be/rsat/). For a list of supported 
> functionalities, see 
>       http://www.ncbi.nlm.nih.gov/pubmed/18495751
> and the 2011 update
>       http://www.ncbi.nlm.nih.gov/pubmed/21715389
> 
> We recently developed a new tool called peak-motifs, which detects 
> transcription factor binding motifs in full collections of ChIP-seq peaks. 
>       http://www.ncbi.nlm.nih.gov/pubmed/22156162
> 
> We are now extending the approach to analyze conserved motifs under the 
> peaks. 
> We are currently using the MAF files produced by multiz, but this requires 
> us to maintain a local copy of all the multiz alignments, which poses 
> problems of consistency with updates of supported genomes.
> 
> We would thus like to establish a programmatic connection to UCSC Genome 
> Browser, in order to dynamically retrieve multi-genome alignments of the 
> conserved regions covered by a set of peaks (more generally, we would like to 
> obtain the MAFs under a set of genomic coordinates specified as a bed file). 
> 
> We already saw how to use your DAS interface for retrieving single-organism 
> sequences under the peaks, but we did not find the equivalent for retrieving 
> the MAFs and the related taxonomic information. 
> 
> Could you tell us whether there is any programmatic access to UCSC (DAS, 
> SOAP/WSDL, Perl modules, Python modules or anything else) that would allow us 
> to do the following queries?
> 
> 1) Return the list of organisms for which a multi-z alignment is available. 
> Currently, we must first get (with DAS) the list of all supported organisms, 
> and then send one request for each organism in order to know if it contains 
> one or several multizNway attributes.
> 
> 2) Given the name of a reference organism, obtain the list of other organisms 
> aligned with its genome in the multizNway alignments (the list varies from 
> organism to organism).
> 
> 3) Given a clade, obtain the list of included organisms. 
> 
> 4) Given a set of genomic coordinates (bed file), retrieve the subset of MAFs 
> intersecting these coordinates.
> It would be even better if the method allowed the client to specify a 
> subset of organisms for which the aligned sequences would be returned (much 
> in the same way as the UCSC viewer allows one to select a subset of organisms 
> to be displayed in the multiz track). 
> 
> Many thanks for your help
> 
> Pr. Jacques van Helden
> 
> Université d'Aix-Marseille (AMU). 
> Lab. Technological Advances for Genomics and Clinics (TAGC)
> INSERM Unit U1090, 163, Avenue de Luminy, 13288 MARSEILLE cedex 09. France
> Fax: +33 4 91 82 87 01
> Web:  http://jacques.van-helden.perso.luminy.univmed.fr/
> Email: [email protected]
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
