Re: [EMBOSS] FW: Reducing a FASTA repository, new user

2011-02-17 Thread Marvin Stodolsky
Sorry,

Here is the attachment.
The whole cleanup process could be done with only SED calls, I'm sure,
but that would be beyond my SED comfort level.

MarvS

On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller  wrote:
> HI Martin,
> I am interested in the solution. There was no attachment to the email I
> received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
>
>
>
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> All thanks for the suggestions.  A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository.  This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>>
>> Specifically, is there a command using ENTRET or relatives to accept a list
>> like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job?
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice  wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:

>>>>  This is elementary I’m sure, but I’ve been unable to work out the
>>>> syntax  from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>>  -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>>  -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> ___
>> EMBOSS mailing list
>> EMBOSS@lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
>




With respect to using info in the FASTA description field, the intent and
partial solution can now be explained.
The top-level intent is to avoid overlapping genes in a statistical analysis
being planned.
The 3rd & 4th lines below are from an "infoseq -nocolumns" whole genome
retrieval. They report an overlap, i.e.,
the DNA gyrase A is overlapped by seryl-tRNA:
seryl_begin - gyrase_end = 7294 - 7322 < 0

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]

In a few microbes I've checked, about a quarter of the genes have some putative
overlap. These could contaminate the proteins/codon_usage statistical analysis
being planned. Thus I wished for an en masse way of recognizing the overlapping
genes.
A non-elegant fix has been worked out.
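
For what it's worth, the overlap test itself (next gene's begin minus the previous gene's end < 0) can also be sketched directly on the infoseq description lines, without the spreadsheet round trip. This is only a sketch: the function name find_overlaps is made up, and it assumes the lines are already sorted by start coordinate and that each description carries a begin..end(strand) token as in the five lines above.

```shell
#!/bin/sh
# find_overlaps (hypothetical helper): reads description lines on stdin,
# assumed sorted by start coordinate, and reports every gene that begins
# before the furthest end coordinate seen so far.
find_overlaps() {
    awk '
    match($0, /[0-9]+[.][.][0-9]+[(][+-][)]/) {
        coords = substr($0, RSTART, RLENGTH)   # e.g. 4812..7322(+)
        name   = substr($0, 1, RSTART - 2)     # text before the coordinates
        split(coords, part, "[.(]+")           # part[1] = begin, part[2] = end
        start = part[1] + 0
        stop  = part[2] + 0
        # Same test as in the text: begin - previous end < 0
        if (seen && start - prev_stop < 0) {
            shared = (stop < prev_stop ? stop : prev_stop) - start + 1
            printf "%s overlaps %s by %d bp\n", name, prev_name, shared
        }
        if (!seen || stop > prev_stop) { prev_stop = stop; prev_name = name }
        seen = 1
    }'
}

find_overlaps <<'EOF'
DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]
EOF
```

With the strict < 0 test, two genes that merely share one boundary base (begin equal to the previous end) are not reported; changing the comparison to <= 0 would flag those as well.
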

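Pulling the dataset into a spreadsheet, spaces in the description field were
next replaced with >< :
DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37]

Next  ><[  is replaced by the "to be field separator"  |[ :
DNA><686..1828(+)|[Mycoplasma><genitalium><G37]
DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37]

The terminal common  [Mycoplasma><genitalium><G37]  is then removed, with the
output going to Myc637000176m2.csv,
resulting in :
DNA><686..1828(+)|
DnaJ><domain><protein><1828..2760(+)|
DNA><gyrase><subunit><B><2845..4797(+)|
DNA><gyrase><subunit><A><4812..7322(+)|
seryl-tRNA><synthetase><7294..8547(+)|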

internals are next mostly deleted with:
sed -e 's/<.*>//g'  Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|
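
Why this works, as far as I can tell: .* is greedy, so <.*> matches from the first < on a line to the last > on that line, and everything in between is deleted in one go. A quick check, assuming the m2 lines still carry the middle words of the description (the sample line is rebuilt from the infoseq listing, not copied from the file; the function name collapse_internals is made up):

```shell
#!/bin/sh
# collapse_internals (hypothetical name) wraps the sed call from the text.
collapse_internals() {
    sed -e 's/<.*>//g'
}

printf '%s\n' 'DnaJ><domain><protein><1828..2760(+)|' | collapse_internals
# prints DnaJ><1828..2760(+)|
```

Only the first word of each description survives this step, which is why both gyrase subunits end up labelled just DNA.
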

The single remaining >< is replaced with potential separator |
sed -e 's/></|/g'  Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|
BASICALLY, the clever work is now done, and the rest is more routine 
manipulation.

A cleanup was done with:
sed -e 's/)|//g'  Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g'  Myc637000176m5.csv > Myc637000176m6.csv
together changing the  (+)|  to  |+ , that is, a separate field

The replacement of the residual  ..  with potential separator | was easiest 
done as a within spreadsheet 
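
For anyone more at home with SED: the whole chain above (the >< markers, the m2 to m6 files and the final .. replacement) could probably be collapsed into one substitution applied straight to the infoseq description lines. A sketch only, with a made-up function name split_fields. Unlike the steps above it keeps the full product name rather than just its first word, and it assumes every line has the form  name begin..end(strand) [organism].

```shell
#!/bin/sh
# split_fields (hypothetical helper): turns
#   name begin..end(strand) [organism]
# into
#   name|begin|end|strand
# Lines that do not match the assumed layout pass through unchanged.
split_fields() {
    sed -e 's/^\(.*\) \([0-9][0-9]*\)\.\.\([0-9][0-9]*\)(\([+-]\)).*$/\1|\2|\3|\4/'
}

printf '%s\n' 'DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]' | split_fields
# prints DNA gyrase subunit A|4812|7322|+
```

The | separated output should then import into a spreadsheet with begin, end and strand as separate columns.
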

Re: [EMBOSS] FW: Reducing a FASTA repository, new user

2011-02-16 Thread Marvin Stodolsky
All thanks for the suggestions.  A solution to the GeneBegin..GeneEnd
problem has been worked out, per the Attachment, for those interested.

But for me the more important problem is making a FASTA repository,
which is a subset of the gene files in a much larger Repository.  This
is desirable before & after using Usearch -
http://www.drive5.com/usearch/intro.html
to select out a minimally homologous gene set of a species.
Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
the undesirables.

Specifically, is there a command using ENTRET or relatives to accept a list like
637008924
637008927
640691430
640691431
637008928
637008954
637008980
for extraction and repacking into a single smaller Repository?

If not, could you recommend a software tool/suite for this type of job?
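
One way this kind of subsetting could be done outside EMBOSS, offered only as a sketch: a small awk filter that keeps the FASTA records whose ID appears in the list. The function name subset_fasta and the file names are made up, and it assumes each header line starts with >ID followed by a space or the end of the line (e.g. >637008924 ...). If the ID sits elsewhere in the header, the id = line needs changing.

```shell
#!/bin/sh
# subset_fasta (hypothetical helper)
#   $1 = text file with one gene ID per line
#   $2 = multi-FASTA repository
# Writes the matching records to stdout, in repository order.
subset_fasta() {
    awk -v ids="$1" '
    BEGIN {
        # Load the wanted IDs, ignoring blank lines and trailing whitespace.
        while ((getline line < ids) > 0) {
            sub(/[ \t\r]+$/, "", line)
            if (line != "") want[line] = 1
        }
    }
    /^>/ {
        id = substr($1, 2)        # first word of the header, minus ">"
        keep = (id in want)       # decides the fate of the whole record
    }
    keep                          # print header and sequence lines when wanted
    ' "$2"
}

# Usage (hypothetical file names):
#   subset_fasta wanted_ids.txt all_genes.fasta > subset.fasta
```
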

MarvS

On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice  wrote:
> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>
>>  This is elementary I’m sure, but I’ve been unable to work out the
>> syntax  from the documentation.
>> More minor issue.
>>
>> When using infoseq to extract all the fasta Headers from a sequence
>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>> for this?
>
> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
> part of the sequence ID in the FASTA file?
>
> You can use a delimiter between items for infoseq using:
>
>  -nocolumn
>
> on the command line.
>
> For import into a spreadsheet you can set the delimiter to be tab with:
>
>  -nocolumn -delimiter "\t"
>
> on the command line. That should then import nicely into a spreadsheet.
>
> Hope that helps
>
> Peter Rice
> EMBOSS Team
>



Re: [EMBOSS] FW: Reducing a FASTA repository, new user

2011-02-15 Thread Peter Rice

On 14/02/2011 23:35, Marvin Stodolsky wrote:

>  This is elementary I’m sure, but I’ve been unable to work out the
> syntax  from the documentation.
> More minor issue.
>
> When using infoseq to extract all the fasta Headers from a sequence
> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
> for this?


I don't see the genebegin and geneend in EMBOSS infoseq output. Are they 
part of the sequence ID in the FASTA file?


You can use a delimiter between items for infoseq using:

 -nocolumn

on the command line.

For import into a spreadsheet you can set the delimiter to be tab with:

 -nocolumn -delimiter "\t"

on the command line. That should then import nicely into a spreadsheet.

Hope that helps

Peter Rice
EMBOSS Team


Re: [EMBOSS] FW: Reducing a FASTA repository, new user

2011-02-14 Thread Marvin Stodolsky
 This is elementary I’m sure, but I’ve been unable to work out the
syntax  from the documentation.
More minor issue.

When using infoseq to extract all the fasta Headers from a sequence
Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
for this?

MarvS
