On Mon, Mar 8, 2010 at 02:54, <[email protected]> wrote:

> Potamanthellus caenoides voucher BYU:IGCEP220 28S ribosomal RNA gene,
> partial sequence
>
> Another is like this:
> Paraleptophlebia submarginata voucher BYU:IGCEP243 28S ribosomal RNA
> gene, partial sequence
>
> Ideally, what I want is some way to get just the name of the gene ie:
> "28S ribosomal RNA gene"
> That way it would be easier to know that these 2 sequences are probably
> homologous. I have written some logic to try to parse this actual name

I'm not a bioinformatician, but I believe that if you want to know whether two genes are homologous, you need to run a sequence similarity search on them, using tools such as BLAST. Simply checking the human-specified description would only be lightweight text mining, and you would miss lots of genes whose descriptions are more specific or more generic and use different terms.

I've forwarded your question to our resident bioinformatician Katy to have a look.

Purely technically, you could probably split the string using a regular expression. If you always have

> Xxxxx Xxxxxxxxxxx xxxxx yyy:zzzzzzzz __________________, partial sequence

then you can create a pattern such as

    [A-Za-z0-9]+:[A-Za-z0-9]+ (.*),

to match the "yyy:zzzzzzzz" bit and capture the ____ string after it, up to the comma.

See the attached workflow, which uses the Filter_list_of_strings_extracting_match_to_a_regex worker. (You might need to modify it to work with the FASTA format; the string constants gene1 and gene2 simply contain the two example strings you provided.)

Which service are you using to look up the genes? Some also allow you to get the metadata in XML format, which could be more structured and allow you to get the 'description' directly.

-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

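For anyone reading without the attached workflow, the regular-expression idea above can be sketched outside Taverna in a few lines of Python. This is only an illustration, not what the workflow itself does: the function name `extract_gene_name` is made up for the example, and the capture group is made non-greedy (`.*?`) so that matching stops at the first comma, which is a small variation on the pattern given above.

```python
import re

# Voucher token such as "BYU:IGCEP220", a space, then the gene name
# captured up to the first comma. The non-greedy group is an assumption
# added for this sketch; the original pattern used a greedy (.*).
GENE_PATTERN = re.compile(r"[A-Za-z0-9]+:[A-Za-z0-9]+ (.*?),")


def extract_gene_name(description):
    """Return the text between the voucher token and the next comma, or None."""
    match = GENE_PATTERN.search(description)
    return match.group(1).strip() if match else None


# The two example descriptions from the original question.
gene1 = ("Potamanthellus caenoides voucher BYU:IGCEP220 "
         "28S ribosomal RNA gene, partial sequence")
gene2 = ("Paraleptophlebia submarginata voucher BYU:IGCEP243 "
         "28S ribosomal RNA gene, partial sequence")

for description in (gene1, gene2):
    print(extract_gene_name(description))
```

Descriptions that do not follow the "yyy:zzzzzzzz name, ..." layout return None here, so they can be filtered out or handled separately. As noted above, matching names this way is still only text matching and says nothing certain about homology.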
mcookson-regex.t2flow
Description: Binary data
_______________________________________________
taverna-users mailing list
[email protected]
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
