Re: [Taverna-users] any reliable way to parse out the name of the gene when getting fasta sequences from NCBI?

Paul Fisher Mon, 08 Mar 2010 02:53:19 -0800

I agree. A BLAST search would be significantly more accurate than a 
parsing task.
If this is what you're doing then I apologise.


I've sent some logic to Alan and Stian to clarify. They should then be 
able to distribute it.
It's basically the same as Stian says, but involves splitting on commas 
and word boundaries.

If the output consists of the same/similar number of words in the exact 
same order, you should be able to parse the info out quite easily.
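
For illustration only, here is a rough Python sketch of both ideas. The 
first function uses the pattern from Stian's message quoted below. The 
second is just my guess at what "splitting on commas and word boundaries" 
could look like; it is not the logic that was sent to Alan and Stian, and 
the function names are made up. Both assume the description is on a single 
line and contains a "yyy:zzzzzzzz" voucher ID before the gene name.

```python
import re

# Pattern from Stian's message: "yyy:zzzzzzzz", a space, then everything
# up to a comma. The (.*) is greedy, so with several commas it captures
# up to the LAST one.
VOUCHER_PATTERN = re.compile(r"[A-Za-z0-9]+:[A-Za-z0-9]+ (.*),")


def gene_name_by_regex(description):
    """Return the text between the voucher ID and the comma, or None."""
    match = VOUCHER_PATTERN.search(description)
    return match.group(1) if match else None


def gene_name_by_splitting(description):
    """Hypothetical alternative: cut at the first comma, split into words,
    and keep the words that follow the token containing a colon."""
    before_comma = description.split(",", 1)[0]
    words = before_comma.split()
    for index, word in enumerate(words):
        if ":" in word:
            remainder = " ".join(words[index + 1:])
            return remainder or None
    return None  # no voucher-style token found


gene1 = ("Potamanthellus caenoides voucher BYU:IGCEP220 "
         "28S ribosomal RNA gene, partial sequence")
gene2 = ("Paraleptophlebia submarginata voucher BYU:IGCEP243 "
         "28S ribosomal RNA gene, partial sequence")

for description in (gene1, gene2):
    print(gene_name_by_regex(description), "|",
          gene_name_by_splitting(description))
```

Either function returns None when the description does not follow the 
expected layout, so unexpected headers can be flagged rather than silently 
mis-parsed.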

regards,
Paul.

Stian Soiland-Reyes wrote:
> On Mon, Mar 8, 2010 at 02:54,  <[email protected]> wrote:
>   
>> Potamanthellus caenoides voucher BYU:IGCEP220 28S ribosomal RNA gene,
>> partial sequence
>>
>> Another is like this:
>> Paraleptophlebia submarginata voucher BYU:IGCEP243 28S ribosomal RNA
>> gene, partial sequence
>>
>> Ideally, what I want is some way to get just the name of the gene ie:
>> "28S ribosomal RNA gene"
>> That way it would be easier to know that these 2 sequences are probably
>> homologous.  I have written some logic to try to parse this actual name
>>     
>
> I'm not a bioinformatician, but I believe that if you want to know
> if two genes are homologous, you need to do a sequence
> similarity search on them, using tools such as BLAST. Simply checking
> the human-specified description would just be light-weight text
> mining, and you would miss out on lots of genes that are more specific
> or more generic and have different terms in their description.
>
> I've forwarded your question to our resident bioinformatician Katy to
> have a look.
>
>
> Purely technically you could probably split the string using a regular
> expression, if you always have
>
>   
>> Xxxxx Xxxxxxxxxxx xxxxx yyy:zzzzzzzz __________________, partial sequence
>>     
>
> then you can create a pattern such as [A-Za-z0-9]+:[A-Za-z0-9]+ (.*),
> to match the "yyy:zzzzzzzz" bit, and get the ____ string after that,
> until the comma.
>
> See the attached workflow, using the
> Filter_list_of_strings_extracting_match_to_a_regex worker.
>
> (You might need to modify this to work with the FASTA format, the
> string constants gene1 and gene2 simply contain the two example
> strings you provided)
>
>
>
> Which service are you using to look up the genes? Some also allow you
> to get the metadata in XML format, which could be more structured and
> allow you to get the 'description' directly.
>

_______________________________________________
taverna-users mailing list
[email protected]
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
