As Stian says, you run the risk of missing a lot of information if you rely only on the human-readable description, but you could combine the two approaches: perform a BLAST search and then filter the results to find any mayfly hits. Some BLAST implementations allow you to restrict a search to one genome, but since you are not working on a model organism, I suspect you will need to run a general BLAST search and pick out the sequences of interest from the results.
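
As a rough illustration of the "BLAST first, then filter" idea, here is a minimal Python sketch. It assumes the results were saved as tab-separated text with the hit title in the last column (for example via a custom tabular request such as -outfmt "6 qseqid sseqid pident evalue stitle" in NCBI BLAST+), and that the title starts with the genus name. The column list, the genus list, the function name and the sample rows are all made up for illustration.

```python
import csv
import io

# Assumed column layout of the saved tabular results (hypothetical choice).
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "stitle"]

# Hypothetical list of genera of interest; extend it with the mayfly
# genera you actually care about.
MAYFLY_GENERA = {"Potamanthellus", "Paraleptophlebia"}


def mayfly_hits(handle, genera=MAYFLY_GENERA):
    """Keep only the hits whose title starts with one of the given genera."""
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Assumes the first word of the hit title is the genus name.
    return [
        row for row in reader
        if (row["stitle"] or "").split(" ", 1)[0] in genera
    ]


# Invented sample rows; the identifiers and scores are placeholders only.
sample = (
    "q1\thit1\t98.5\t1e-50\tPotamanthellus caenoides voucher BYU:IGCEP220 "
    "28S ribosomal RNA gene, partial sequence\n"
    "q1\thit2\t80.0\t1e-10\tDrosophila melanogaster 28S ribosomal RNA gene\n"
)

hits = mayfly_hits(io.StringIO(sample))
for hit in hits:
    print(hit["sseqid"], hit["stitle"])
```

With a real results file you would pass an open file handle instead of the io.StringIO sample. The similarity search decides what counts as a hit, and the text filter only narrows the hits down by organism.
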
Best wishes,

Katy

Stian Soiland-Reyes wrote:
> On Mon, Mar 8, 2010 at 02:54, <[email protected]> wrote:
>
>> Potamanthellus caenoides voucher BYU:IGCEP220 28S ribosomal RNA gene,
>> partial sequence
>>
>> Another is like this:
>> Paraleptophlebia submarginata voucher BYU:IGCEP243 28S ribosomal RNA
>> gene, partial sequence
>>
>> Ideally, what I want is some way to get just the name of the gene ie:
>> "28S ribosomal RNA gene"
>> That way it would be easier to know that these 2 sequences are probably
>> homologous. I have written some logic to try to parse this actual name
>>
>
> I'm not a bioinformatician, but I believed that if you wanted to know
> if two genes were homologous, you would need to do a sequence
> similarity search on them, using tools such as BLAST. Simply checking
> the human-specified description would just be a light-weight text
> mining, and you would miss out on lots of genes that are more specific
> or more generic and have different terms in their description.
>
> I've forwarded your question to our resident bioinformatician Katy to
> have a look.
>
>
> Purely technically you could probably split the string using a regular
> expression, if you always have
>
>
>> Xxxxx Xxxxxxxxxxx xxxxx yyy:zzzzzzzz __________________, partial sequence
>>
>
> then you can create a pattern such as [A-Za-z0-9]+:[A-Za-z0-9]+ (.*),
> to match the "yyy:zzzzzzzz" bit, and get the ____ string after that,
> until the comma.
>
> See the attached workflow, using the
> Filter_list_of_strings_extracting_match_to_a_regex worker.
>
> (You might need to modify this to work with the FASTA format, the
> string constants gene1 and gene2 simply contain the two example
> strings you provided)
>
>
>
> Which service are you using to look up the genes? Some also allow you
> to get the metadata in XML format, which could be more structured and
> allow you to get the 'description' directly.
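
For anyone who wants to try the quoted regular expression outside Taverna, here is a minimal Python sketch. It uses the pattern exactly as written in the message above and the two example descriptions from the original question; the function name gene_name is only illustrative, and this is not the attached workflow itself.

```python
import re

# Pattern from the quoted message: a "yyy:zzzzzzzz" voucher token, a space,
# then a captured group running up to a comma.
VOUCHER_THEN_GENE = re.compile(r"[A-Za-z0-9]+:[A-Za-z0-9]+ (.*),")


def gene_name(description):
    """Return the text between the voucher token and the comma, or None."""
    match = VOUCHER_THEN_GENE.search(description)
    return match.group(1) if match else None


gene1 = ("Potamanthellus caenoides voucher BYU:IGCEP220 "
         "28S ribosomal RNA gene, partial sequence")
gene2 = ("Paraleptophlebia submarginata voucher BYU:IGCEP243 "
         "28S ribosomal RNA gene, partial sequence")

print(gene_name(gene1))
print(gene_name(gene2))
```

Note that (.*) is greedy, so a description containing more than one comma would be captured up to the last comma, and a description without a "yyy:zzzzzzzz" token gives None.
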

--
Dr Katy Wolstencroft
Research Fellow, School of Computer Science,
1.17 Kilburn Building
University of Manchester,
Oxford Road, Manchester, M13 9PL, UK
Tel: +44(0)161 2756276

_______________________________________________
taverna-users mailing list
[email protected]
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
