Re: [Taverna-users] any reliable way to parse out the name of the gene when getting fasta sequences from NCBI?

Katy Wolstencroft Mon, 08 Mar 2010 03:29:37 -0800

As Stian says, you risk missing a lot of information if you rely only
on the human-readable description, but you could combine the two
approaches: perform a BLAST search and then filter the results for
mayfly hits. Some BLAST implementations let you restrict a search to a
single genome, but since you are not working on a model organism, I
suspect you will need to run a general BLAST search and pick out the
sequences of interest from the results.
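
As a rough illustration of that filtering step, here is a minimal
Python sketch. It assumes the BLAST hits have already been reduced to a
list of description strings, and that the organism's genus is the first
word of each description. The hit list, the MAYFLY_GENERA set and the
mayfly_hits function are hypothetical names made up for this example;
they are not part of BLAST or Taverna.

```python
# Hypothetical hit descriptions; a real BLAST report would have to be
# parsed into strings like these first.
hits = [
    "Potamanthellus caenoides voucher BYU:IGCEP220 28S ribosomal RNA gene, partial sequence",
    "Drosophila melanogaster 28S ribosomal RNA gene, partial sequence",
    "Paraleptophlebia submarginata voucher BYU:IGCEP243 28S ribosomal RNA gene, partial sequence",
]

# Assumption: you maintain your own list of mayfly genera of interest.
MAYFLY_GENERA = {"Potamanthellus", "Paraleptophlebia"}


def mayfly_hits(descriptions, genera=MAYFLY_GENERA):
    """Keep only the descriptions whose first word is a known mayfly genus."""
    kept = []
    for description in descriptions:
        words = description.split()
        if words and words[0] in genera:
            kept.append(description)
    return kept


for hit in mayfly_hits(hits):
    print(hit)
```

Matching on a genus list is only as complete as the list itself; if
your BLAST service returns taxonomy information, filtering on that
would probably be more robust.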


Best wishes,

Katy

Stian Soiland-Reyes wrote:
> On Mon, Mar 8, 2010 at 02:54,  <[email protected]> wrote:
>   
>> Potamanthellus caenoides voucher BYU:IGCEP220 28S ribosomal RNA gene,
>> partial sequence
>>
>> Another is like this:
>> Paraleptophlebia submarginata voucher BYU:IGCEP243 28S ribosomal RNA
>> gene, partial sequence
>>
>> Ideally, what I want is some way to get just the name of the gene ie:
>> "28S ribosomal RNA gene"
>> That way it would be easier to know that these 2 sequences are probably
>> homologous.  I have written some logic to try to parse this actual name
>>     
>
> I'm not a bioinformatician, but I believe that if you want to know
> whether two genes are homologous, you need to do a sequence
> similarity search on them, using a tool such as BLAST. Simply checking
> the human-written description would just be lightweight text
> mining, and you would miss lots of genes whose descriptions use more
> specific or more generic terms.
>
> I've forwarded your question to our resident bioinformatician Katy to
> have a look.
>
>
> Purely technically, you could probably split the string using a
> regular expression. If the description always has the form
>
>   
>> Xxxxx Xxxxxxxxxxx xxxxx yyy:zzzzzzzz __________________, partial sequence
>>     
>
> then you can create a pattern such as [A-Za-z0-9]+:[A-Za-z0-9]+ (.*),
> which matches the "yyy:zzzzzzzz" bit and captures the ____ string
> after it, up to the comma.
>
> See the attached workflow, using the
> Filter_list_of_strings_extracting_match_to_a_regex worker.
>
> (You might need to modify this to work with the FASTA format; the
> string constants gene1 and gene2 simply contain the two example
> strings you provided.)
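
Outside Taverna, the same pattern can be tried in a few lines of
Python. This is only a sketch of the regular expression quoted above,
not the attached workflow: the gene_name function is a hypothetical
helper, and gene1 and gene2 hold the two example descriptions from the
original question, joined onto one line each.

```python
import re

# The pattern from the message above: a "yyy:zzzzzzzz" voucher token,
# a space, then a captured group running up to a comma.
PATTERN = re.compile(r"[A-Za-z0-9]+:[A-Za-z0-9]+ (.*),")


def gene_name(description):
    """Return the text between the voucher token and the comma, or None."""
    match = PATTERN.search(description)
    return match.group(1) if match else None


gene1 = ("Potamanthellus caenoides voucher BYU:IGCEP220 "
         "28S ribosomal RNA gene, partial sequence")
gene2 = ("Paraleptophlebia submarginata voucher BYU:IGCEP243 "
         "28S ribosomal RNA gene, partial sequence")

print(gene_name(gene1))
print(gene_name(gene2))
```

Note that (.*) is greedy, so if a description contains more than one
comma the capture runs up to the last one. Descriptions without a
"yyy:zzzzzzzz" token do not match at all, and gene_name returns None
for them.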
>
>
>
> Which service are you using to look up the genes? Some also allow you
> to get the metadata in XML format, which could be more structured and
> allow you to get the 'description' directly.
>
>   
>
> _______________________________________________
> taverna-users mailing list
> [email protected]
> Web site: http://www.taverna.org.uk
> Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
>   


-- 
Dr Katy Wolstencroft
Research Fellow, School of Computer Science,1.17 Kilburn Building
University of Manchester, Oxford Road, Manchester, M13 9PL, UK
Tel: +44(0)161 2756276
 


