Sorry,
Here is the attachment.
The whole cleanup process could be done with only SED calls, I'm sure,
but that would be beyond my SED comfort level.
MarvS
On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller wrote:
> Hi Martin,
> I am interested in the solution. There was no attachment to the email I
> received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> All, thanks for the suggestions. A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository. This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>>
>> Specifically, is there a command, using ENTRET or relatives, that accepts a list
>> like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job?
>>
>> MarvS
>>
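For the ID-list question above, a minimal shell sketch rather than an EMBOSS answer. It assumes each FASTA header begins with > followed directly by the gene ID as its first word. The names subset_fasta, ids.txt, repo.fasta and subset.fasta are hypothetical. EMBOSS list files (a file of sequence addresses passed with a leading @) may do the same job natively; that route is not exercised here.

```shell
# Assumption: headers look like ">637008924 some description".
# $1 = file with one wanted ID per line, $2 = the large FASTA repository.
subset_fasta() {
    awk '
        NR == FNR { if (NF) want[$1] = 1; next }              # first file: remember wanted IDs
        /^>/      { id = substr($1, 2); keep = (id in want) } # header line: wanted or not?
        keep      { print }                                   # copy header and its sequence lines
    ' "$1" "$2"
}

# Usage (hypothetical file names):
#   subset_fasta ids.txt repo.fasta > subset.fasta
```

Records come out in repository order, not list order, and an ID that is absent from the repository is silently skipped.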
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>> This is elementary I’m sure, but I’ve been unable to work out the
>>>> syntax from the documentation.
>>>> More minor issue.
>>>> When using infoseq to extract all the FASTA headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>> -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>> -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> ___
>> EMBOSS mailing list
>> EMBOSS@lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
>
With respect to using info in the FASTA description field, the intent and
partial solution can now be explained.
The top-level intent is to avoid overlapping genes in a statistical analysis
being planned.
The 3rd & 4th lines below are from an "infoseq -nocolumns" whole genome retrieval.
They report an overlap, i.e.,
the DNA gyrase A is overlapped by seryl-tRNA: seryl_begin - gyrase_end =
7294 - 7322 < 0
DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]
In a few microbes I've checked, about a quarter of the genes have some putative
overlap. These could contaminate the proteins/codon_usage statistical analysis
being planned. Thus I wished for an en masse way of recognizing the overlapping genes.
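The overlap criterion above (next gene's begin minus previous gene's end < 0) can be applied directly to the infoseq description lines. A rough awk sketch, assuming one gene per line, sorted by start position, with coordinates written as begin..end(strand); find_overlaps is a hypothetical name:

```shell
# Assumption: input lines look like
#   DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
# and are sorted by begin coordinate. Reads the named files, or stdin if none.
find_overlaps() {
    awk '
        match($0, /[0-9]+\.\.[0-9]+\([+-]\)/) {
            coords = substr($0, RSTART, RLENGTH)   # e.g. 4812..7322(+)
            split(coords, p, /\.\.|\(/)            # p[1] = begin, p[2] = end
            name = substr($0, 1, RSTART - 2)       # text before the coordinates
            # the criterion from the text: begin - previous end < 0
            if (seen && p[1] - prev_end < 0)
                printf "%s\t%s\t%d\n", prev_name, name, prev_end - p[1] + 1
            # remember the gene that reaches furthest to the right so far
            if (!seen || p[2] + 0 > prev_end) { prev_end = p[2] + 0; prev_name = name }
            seen = 1
        }
    ' "$@"
}
```

Each output line gives the two gene names and the number of shared bases, tab-separated. Genes that merely share an endpoint (a difference of exactly 0) are not flagged; change < 0 to <= 0 if those should count.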
A non-elegant fix has been worked out.
After pulling the dataset into a spreadsheet, spaces in the description field were
next replaced with >< :
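DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37]
Next, ><[ is replaced by the "to be field separator" |[ :
DNA><686..1828(+)|[Mycoplasma><genitalium><G37]
DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37]
The terminal common [Mycoplasma><genitalium><G37] is then removed, with output to Myc637000176m2.csv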
resulting in :
DNA><686..1828(+)|
DnaJ><domain><protein><1828..2760(+)|
DNA><gyrase><subunit><B><2845..4797(+)|
DNA><gyrase><subunit><A><4812..7322(+)|
seryl-tRNA><synthetase><7294..8547(+)|
Internals are next mostly deleted with:
sed -e 's/<.*>//g' Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|
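Why that single sed call strips the internals: .* is greedy, so <.*> matches from the first < on the line to the last >, which swallows every interior word in one go and leaves only the outermost >< pair. A quick check on one line; the input line is an assumption, rebuilt from the gyrase A description by replacing its spaces with >< :

```shell
# Assumed pre-sed form of the gyrase A line (spaces already replaced by ><,
# organism name already removed).
line='DNA><gyrase><subunit><A><4812..7322(+)|'

# Greedy match: deletes "<gyrase><subunit><A>", i.e. first "<" through last ">".
trimmed=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')
printf '%s\n' "$trimmed"
```

This prints DNA><4812..7322(+)| and shows the side effect as well: every word of the gene name after the first is lost, which is why both gyrase subunits end up labelled just DNA.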
The single remaining >< is replaced with potential separator |
sed -e 's/></|/g' Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|
BASICALLY, the clever work is now done, and the rest is more routine
manipulation.
A cleanup was done with:
sed -e 's/)|//g' Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g' Myc637000176m5.csv > Myc637000176m6.csv
together changing the (+)| to |+ , that is, a separate field
The replacement of the residual .. with potential separator | was most easily
done within the spreadsheet.
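For anyone wanting to skip the spreadsheet, the steps above can probably be chained into one sed call. This is a sketch under assumptions, not the exact commands used: the input is taken to be the raw description lines as listed earlier, the third expression (dropping everything from the [ onward) is a guess at how the organism name was removed, and clean_fields is a hypothetical name.

```shell
# Assumption: input lines look like
#   DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
# Reads the named files, or stdin if none.
clean_fields() {
    sed -e 's/ /></g'    \
        -e 's/><\[/|[/'  \
        -e 's/\[.*$//'   \
        -e 's/<.*>//g'   \
        -e 's/></|/g'    \
        -e 's/)|//g'     \
        -e 's/(/|/g'     \
        -e 's/\.\./|/'   "$@"
}
# Expressions in order: spaces to ><, separator before the organism name,
# drop the organism name (assumed step), delete internals (greedy),
# remaining >< to |, drop )|, ( to |, and finally .. to |.

# Usage (hypothetical file names):
#   clean_fields descriptions.txt > fields.csv
```

The gyrase A line comes out as DNA|4812|7322|+ , i.e. four | separated fields: first word of the name, begin, end, strand.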