Re: [galaxy-user] extract genome sequence

2012-09-25 Thread Jennifer Jackson

Hello Yan,

Unfortunately, the tool can only extract sequence that is provided as
the mapping target, and this will be a problem with any of the methods.
This tool does avoid generating negative coordinates (which would cause
a problem with the 'Extract' tool), but it still does not give you quite
what you want, assuming that a partially extended sequence, based on the
available data, would be acceptable.


Using the compute tool may be the best option for your case, now that 
the data is clearer. End coordinates that extend past the edge of the 
chromosome are not a problem, but the Start coordinate will need to be 
set to 1 (if using GFF3 as interval directly) or 0 (if you converted to 
BED - this doesn't appear to be the case). The expression below will 
either subtract '5000' from a Start coordinate or change it to a 1, 
depending on how close it is to the leading edge of the scaffold. 
(Modify for BED to be 0-based as needed).


(c2 - 5000) if (c2 > 5000) else (1)

Then add 5000 to the end, 'Cut' columns, and extract as Graham recommended.
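
As a rough sketch of the arithmetic above (this is plain Python, not Galaxy code; the function name `extend_interval` and the `flank` and `one_based` parameters are made up for illustration), the clamped extension might look like this. It assumes 1-based GFF3 coordinates unless told otherwise, and it leaves the end unclamped because running past the scaffold end is not a problem for extraction.

```python
def extend_interval(start, end, flank=5000, one_based=True):
    """Extend an interval by `flank` on both sides, clamping the start.

    Hypothetical helper for illustration only. It mirrors the Compute
    expression `(c2 - 5000) if (c2 > 5000) else (1)` plus `end + 5000`.
    """
    first = 1 if one_based else 0  # smallest valid start coordinate
    new_start = start - flank if start > flank else first
    new_end = end + flank  # may run past the scaffold end
    return new_start, new_end


# Yan's gene on scaffold C16582 (GFF3, 1-based): 35..385
print(extend_interval(35, 385))        # (1, 5385)
# A gene far from the scaffold edge is simply widened
print(extend_interval(12000, 13000))   # (7000, 18000)
```

For BED-style input, passing `one_based=False` clamps the start to 0 instead of 1.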

I am not going to address the GFF3 format except to say that if you have
gene rows in your data, use those if your target genome has spliced
transcripts. If the data is transcript-based rather than gene-based and
is split between rows (multi-exon), the processing becomes more
complicated. One potential solution is the 'Extract' tool: it not only
extracts FASTA sequence, it can also combine records for some GFF/GTF
datasets, so you could try this and output Interval data instead of
FASTA. This creates a new GTF file with global coordinates (but the
sequence output will be spliced). Check that the result is correct, then
run the 'Compute' tool to do the extensions, 'Cut' the columns, and do a
final 'Extract' run to obtain the extended, global sequence. All of this
would have to be tested with your data; much depends on the attributes
in your file.
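
To make the idea of "global coordinates" for a multi-exon transcript concrete, here is a small sketch in plain Python. It is not what the 'Extract' tool does internally; it only shows the outcome one would hope for, namely one start/end span per transcript. The function name `transcript_spans` and the example rows are invented, and it assumes 9-column tab-separated GFF3 rows whose last column carries `ID=...;` or `Parent=...;`.

```python
def transcript_spans(gff_lines):
    """Collapse rows that share a Parent (or ID) into one span each.

    Illustration only: assumes 9-column, tab-separated GFF3 rows.
    Returns {key: (seqid, min_start, max_end, strand)}.
    """
    spans = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # Parse "key=value;" attributes; prefer Parent, fall back to ID.
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        key = attrs.get("Parent", attrs.get("ID"))
        if key in spans:
            _, old_start, old_end, _ = spans[key]
            start, end = min(start, old_start), max(end, old_end)
        spans[key] = (seqid, start, end, strand)
    return spans


# Made-up two-exon transcript "tx1" on scaffold "scaf1"
rows = [
    "scaf1\tsrc\texon\t100\t200\t.\t+\t.\tParent=tx1;",
    "scaf1\tsrc\texon\t400\t550\t.\t+\t.\tParent=tx1;",
]
print(transcript_spans(rows))  # {'tx1': ('scaf1', 100, 550, '+')}
```

The resulting span (100..550 here) is what would then be extended by 5000 on each side.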


Hopefully one of these solutions will work out for you,

Jen
Galaxy team

On 9/25/12 12:41 AM, Yan He wrote:

Hi Jen,

Thanks very much for your help! It is very helpful. However, following
your suggestion, what I got is not what I want. Take one sequence for
example. The annotation for one scaffold is
C16582  GLEAN  mRNA  35  385  0.555898  -  .  ID=OYG_GLEAN_1001;
C16582  GLEAN  CDS   35  385  .         -  0  Parent=OYG_GLEAN_1001;

What I got for this scaffold is
>?_C16582_385_5385_-
GCAAACAAGC
>?_C16582_385_5385_-
GCAAACAAGC


I understand that it is trying to get the sequence of the gene
downstream from 385-5385, but the sequence is short, so I only get what
the scaffold has. I would like to have the upstream+gene+downstream
sequence at the same time, not only the upstream or downstream. How can
I do this using a Galaxy tool? Thanks!

Yan




  Date: Mon, 24 Sep 2012 12:26:03 -0700
  From: j...@bx.psu.edu
  To: yanh...@hotmail.com
  CC: galaxy-user@lists.bx.psu.edu
  Subject: Re: [galaxy-user] extract genome sequence
 
  Hi Yan,
 
  Both of the other suggestions are good. I'll also give you another
  choice: build the coordinates before using the Fetch Sequences ->
  Extract Genomic DNA tool to obtain the FASTA sequence.
 
  Start with your input in BED/Interval format (convert from GFF/GTF if
  necessary, using the tool Convert Formats -> GFF-to-BED), or with the
  first 6 columns if it is a BED12 (use Cut as needed), then run the
  Operate on Genomic Intervals -> Get flanks tool.
 
  Region: Whole feature
  Location of the flanking region/s: Both
  Offset 0
  Length of the flanking region(s): 5000
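
  For orientation, the settings above correspond roughly to the
  following coordinate arithmetic. This is only a sketch in plain
  Python, not the tool's actual code: the function name `get_flanks`
  is reused for readability, strand is ignored, coordinates are assumed
  to be 0-based and half-open (BED style), and clamping negative values
  to 0 is an assumption of the sketch.

```python
def get_flanks(start, end, length=5000, offset=0):
    """Return (upstream, downstream) flanks of a 0-based, half-open interval.

    Illustration only: 'Whole feature', 'Both' sides, strand ignored.
    Negative coordinates are clamped to 0 (an assumption of this sketch).
    """
    upstream = (max(0, start - offset - length), max(0, start - offset))
    downstream = (end + offset, end + offset + length)
    return upstream, downstream


# A feature at GFF 35..385 becomes 34..385 in BED coordinates
print(get_flanks(34, 385))  # ((0, 34), (385, 5385))
```

  Note that the two flanks do not include the feature itself, so a
  combined upstream + gene + downstream region has to be built
  separately.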
 
  Your question is similar to this one (the first part, but I thought you
  might be interested in how to just get the flanks, too).
 
http://user.list.galaxyproject.org/Get-flanks-version-1-0-0-td4604849.html
 
  Good luck with your project!
 
  Jen
  Galaxy team
 
  ps. To search prior questions, please see:
  http://galaxy.psu.edu/search/mailinglists/
 
  On 9/23/12 7:00 PM, Yan He wrote:
   Hi everyone,
  
   I have the genome sequence and gene annotation file. Is there a tool on
   Galaxy to extract the 5,000 bp upstream, 5,000 bp downstream and genome
   sequences of the genes (including exons and introns) from the genome
   sequence? Any suggestions are highly appreciated! Thanks!
  
   Yan
  
  
  
   ___
   The Galaxy User list should be used for the discussion of
   Galaxy analysis and other features on the public server
   at usegalaxy.org. Please keep all replies on the list by
   using reply all in your mail client. For discussion of
   local Galaxy instances and the Galaxy source code, please
   use the Galaxy Development list:
  
   http://lists.bx.psu.edu/listinfo/galaxy-dev
  
   To manage your subscriptions to this and other Galaxy lists,
   please use the interface

Re: [galaxy-user] extract genome sequence

2012-09-24 Thread Björn Grüning
Hi Yan,

do you know the tool extractfeat from the EMBOSS suite (it's in the
Tool Shed)?

I don't know offhand if it can work in batch mode, but it's possible to
add that feature.

Cheers,
Bjoern

 Hi everyone,

 I have the genome sequence and gene annotation file. Is there a tool
 on Galaxy to extract the 5,000 bp upstream, 5,000 bp downstream and
 genome sequences of the genes (including exons and introns) from the
 genome sequence? Any suggestions are highly appreciated! Thanks!
 
  
 
 Yan
 
 

-- 
Björn Grüning
Albert-Ludwigs-Universität Freiburg
Institute of Pharmaceutical Sciences
Pharmaceutical Bioinformatics
Hermann-Herder-Strasse 9
D-79104 Freiburg i. Br.

Tel.:  +49 761 203-4872
Fax.:  +49 761 203-97769
E-Mail: bjoern.gruen...@pharmazie.uni-freiburg.de
Web: http://www.pharmaceutical-bioinformatics.org/



Re: [galaxy-user] extract genome sequence

2012-09-24 Thread graham etherington (TSL)
Yan,
One way to do this is to create an interval file with the new co-ordinates
(+/- 5kb) and then use the Fetch Sequences -> Extract Genomic DNA tool.
To create the new co-ordinates file, input your annotation file into the
Text Manipulation -> Compute tool, using expressions like c3-5000 to
get your new co-ordinates. You'll get 2 new columns in the final output
file; then use the Text Manipulation -> Cut tool to extract the columns
you need to create an interval file.
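
The shape of that Compute-then-Cut workflow can be sketched in plain Python as below. This is not Galaxy code: the function name `compute_and_cut`, its parameters, and the example rows are invented for illustration, and the sketch assumes tab-separated rows with 0-based (BED-like) starts, so it clamps the new start at 0.

```python
def compute_and_cut(rows, start_col=2, end_col=3, flank=5000):
    """Mimic 'Compute' (append columns) followed by 'Cut' (select columns).

    Illustration only. `rows` are tab-separated strings; column numbers
    are 1-based, like c1, c2, ... in Galaxy expressions.
    """
    out = []
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        start = int(cols[start_col - 1])
        end = int(cols[end_col - 1])
        # 'Compute' step: two new columns are appended to the row
        cols.append(str(max(start - flank, 0)))  # clamp at 0 (assumption)
        cols.append(str(end + flank))
        # 'Cut' step: keep chromosome, new start, new end
        out.append("\t".join([cols[0], cols[-2], cols[-1]]))
    return out


# Made-up example rows
rows = ["chr1\t10000\t12000\tgeneA", "chr1\t300\t900\tgeneB"]
for line in compute_and_cut(rows):
    print(line)
```

The first row becomes `chr1 5000 17000` and the second `chr1 0 5900` (tab-separated), which is the interval format the extraction step expects.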
Hope this helps.
Cheers,
Graham

Dr. Graham Etherington
Bioinformatics Support Officer,
The Sainsbury Laboratory,
Norwich Research Park,
Norwich NR4 7UH.
UK
Tel: +44 (0)1603 450601






[galaxy-user] extract genome sequence

2012-09-23 Thread Yan He
Hi everyone,

 

I have the genome sequence and gene annotation file. Is there a tool on
Galaxy to extract the 5,000 bp upstream, 5,000 bp downstream and genome
sequences of the genes (including exons and introns) from the genome
sequence? Any suggestions are highly appreciated! Thanks!

 

Yan
