On Mon, Mar 8, 2010 at 02:54,  <[email protected]> wrote:
> Potamanthellus caenoides voucher BYU:IGCEP220 28S ribosomal RNA gene,
> partial sequence
>
> Another is like this:
> Paraleptophlebia submarginata voucher BYU:IGCEP243 28S ribosomal RNA
> gene, partial sequence
>
> Ideally, what I want is some way to get just the name of the gene ie:
> "28S ribosomal RNA gene"
> That way it would be easier to know that these 2 sequences are probably
> homologous.  I have written some logic to try to parse this actual name

I'm not a bioinformatician, but I believe that if you want to know
whether two genes are homologous, you need to do a sequence
similarity search on them, using tools such as BLAST. Simply checking
the human-specified description is just lightweight text mining, and
you would miss lots of genes whose descriptions are more specific or
more generic and use different terms.

I've forwarded your question to our resident bioinformatician Katy to
have a look.


Purely technically, you could split the string using a regular
expression. If the description always has the form

> Xxxxx Xxxxxxxxxxx xxxxx yyy:zzzzzzzz __________________, partial sequence

then you can use a pattern such as

    [A-Za-z0-9]+:[A-Za-z0-9]+ (.*),

The first part matches the "yyy:zzzzzzzz" bit, and the group (.*)
captures the ____ string after that, up to the comma (the trailing
comma is part of the pattern).
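
To illustrate the pattern outside Taverna, here is a minimal Python
sketch. It is not what the attached workflow does internally; the
function name extract_gene_name is made up for this example, and gene1
and gene2 are just the two descriptions from your message joined onto
one line each.

```python
import re

# Pattern from the message: a voucher id like "yyy:zzzzzzzz", one space,
# then a captured group running up to a comma.
VOUCHER_GENE = re.compile(r"[A-Za-z0-9]+:[A-Za-z0-9]+ (.*),")


def extract_gene_name(description):
    """Return the text between the voucher id and the comma, or None.

    Hypothetical helper for illustration only.
    """
    match = VOUCHER_GENE.search(description)
    return match.group(1) if match else None


# The two example descriptions, each joined onto a single line.
gene1 = ("Potamanthellus caenoides voucher BYU:IGCEP220 "
         "28S ribosomal RNA gene, partial sequence")
gene2 = ("Paraleptophlebia submarginata voucher BYU:IGCEP243 "
         "28S ribosomal RNA gene, partial sequence")

for description in (gene1, gene2):
    print(extract_gene_name(description))
```

Note that (.*) is greedy, so if a description contains more than one
comma the group runs up to the last one. Descriptions without a
"yyy:zzzzzzzz" voucher id will not match at all, and the function
returns None for them.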

See the attached workflow, using the
Filter_list_of_strings_extracting_match_to_a_regex worker.

(You might need to modify this to work with the FASTA format. The
string constants gene1 and gene2 simply contain the two example
strings you provided.)



Which service are you using to look up the genes? Some also allow you
to get the metadata in XML format, which could be more structured and
allow you to get the 'description' directly.

-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

Attachment: mcookson-regex.t2flow
Description: Binary data

_______________________________________________
taverna-users mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
