Sorry, Here is the attachment. The whole cleanup process could be done with pm;y SED calls I'm sure, but would be beyond my SED comfort level.
MarvS On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller <kell...@ohsu.edu> wrote: > HI Martin, > I am interested i the solution. There was no attachment to the email I > received. Would you mind sending it? > > thank you, > Tom > MMI DNA Services Core Facility > 503-494-2442 > kellert at ohsu.edu > Office: 6588 RJH (CROET/BasicScience) > > > > > > On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote: > >> All thanks for the suggestions. A solution to the GeneBegin..GeneEnd >> problem has been worked out, per the Attachment, for those interested. >> >> But for me the more important problem is making a FASTA repository, >> which is a subset of the gene files in a much larger Repository. This >> is desirable before & after using Usearch - >> http://www.drive5.com/usearch/intro.html >> to select out a minimally homologous gene set of a species. >> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among >> the undesirables. >> >> Specifically, is the command using ENTRET or relatives , to accept a list >> like >> 637008924 >> 637008927 >> 640691430 >> 640691431 >> 637008928 >> 637008954 >> 637008980 >> for extraction and repacking into a single smaller Repository? >> >> If not, could you recommend a software tool/suite for this type of job. >> >> MarvS >> >> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <p...@ebi.ac.uk> wrote: >>> On 14/02/2011 23:35, Marvin Stodolsky wrote: >>>> >>>> This is elementary I’m sure, but I’ve been unable to work out the >>>> syntax from the documentation. >>>> More minor issue. >>>> >>>> When using infoseq to extract all the fasta Headers from a sequence >>>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to >>>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix >>>> for this? >>> >>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they >>> part of the sequence ID in the FASTA file? >>> >>> You can use a delimiter between items for infoseq using: >>> >>> -nocolumn >>> >>> on the command line. >>> >>> For import into a spreadsheet you can set the delimiter to be tab with: >>> >>> -nocolumn -delimiter "\t" >>> >>> on the command line. That should then import nicely into a spreadsheet. >>> >>> Hope that helps >>> >>> Peter Rice >>> EMBOSS Team >>> >> >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS@lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss > >
With respect to using info in the FASTA description field, the intent and partial solution can now be explained. The top level intent is to avoid overlapping genes, in a statiscal analysis being pl anned. The 3rd & 4th lines below from an "infoseq -nocolumns" whole genome retreival. They report an overlap, i.e., the DNA gyrase A is overlapped by seryl-tRNA: serly_begin=7294 - 7322=gyrase_end < 0 DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37] DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37] DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37] seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37] thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37] In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the proteins/codon_usage statistical analysis being planned. Thus I wished an enmass way of recogizing the overlapping genes. A non-elegant fix has been worked out. Pulling the dataset into a spreadsheet, spaces in the description field were next replaced with >< : DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37] DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37] DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37] seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37] thymidylate 8551..9183(+)><[Mycoplasma><genitalium><G37] Next ><[ is replace by "to be field seperator" |[ DNA><polymerase><III,><beta><subunit><686..1828(+)|[Mycoplasma><genitalium><G37] DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37] DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37] DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37] seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37] and the file saved as: Myc637000176m.csv to get rid of >< in the terminal common [Mycoplasma><genitalium><G37], there was done $ cut -d"[" -f1 Myc637000176m.csv > Myc637000176m2.csv resulting in : DNA><polymerase><III,><beta><subunit><686..1828(+)| DnaJ><domain><protein><1828..2760(+)| DNA><gyrase><subunit><B><2845..4797(+)| DNA><gyrase><subunit><A><4812..7322(+)| seryl-tRNA><synthetase><7294..8547(+)| internals are next mostly deleted with: sed -e 's/<.*>//g' Myc637000176m2.csv > Myc637000176m3.csv resulting in: DNA><686..1828(+)| DnaJ><1828..2760(+)| DNA><2845..4797(+)| DNA><4812..7322(+)| seryl-tRNA><7294..8547(+)| The single remmaining >< is replaced with potential separator | sed -e 's/></|/g' Myc637000176m3.csv > Myc637000176m4.csv resulting in: DNA|686..1828(+)| DnaJ|1828..2760(+)| DNA|2845..4797(+)| DNA|4812..7322(+)| seryl-tRNA|7294..8547(+)| BASICALLY, the clever work is now done, and the rest is more routine manipulation. A cleanup was done with: sed -e 's/)|//g' Myc637000176m4.csv > Myc637000176m5.csv sed -e 's/(/|/g' Myc637000176m5.csv > Myc637000176m6.csv together changing the (+)| to |+ ,that is a separated field The replacement of the residual .. with potential separator | was easiest done as a within spreadsheet operation in its own field, because of too many residual "." in the whole file After routine manipulations within the spread sheet, a view of the overlap detection section is: F G H I J fields Start End Begin-nextEnd OR((H2<0),(H1<0)) Stable 0/1 Value, for SORTING on 686 1828 0 FALSE 0 1828 2760 85 FALSE 0 2845 4797 15 TRUE 0 4812 7322 -28 TRUE 1 7294 8547 4 TRUE 1 8551 9183 -27 TRUE 1 9156 9920 3 FALSE 0 9923 11251 0 FALSE 0 The overlapping genes have stable value 1,during sorting, while field I FALSE/TRUE and not stable during SORTing
_______________________________________________ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss