Re: formatting and syntax

2004-02-11 Thread James Edward Gray II
(redirected to Perl Beginners by James)

On Feb 11, 2004, at 10:34 AM, Michael S. Robeson II wrote:

> Hey, thanks again for the perl code.

You're welcome, but let's keep our discussion on the mailing list so we
can all help and learn.

> However, I forgot to take into account that the original input file
> can look one of two ways:

Ah, the old switcheroo.  Gotcha.

> >bob
> atcgactagcatcgatcg
> acacgtacgactagcac
> >fred
> actgactacgatcgaca
> acgcgcgatacggcat
>
> or (as I posted originally)
>
> >bob
> atcgactagcatcgatcgacacgtacgactagcac
> >fred
> actgactacgatcgacaacgcgcgatacggcat
>
> to be output as:
>
> R 1 42
>  a t c g a c t a g c a t c g a t c g a c a c g t a c g a c t a g c a c - - - - - - -   bob
>  a c t g a c t a c g a t c g a c a a c g c g c g a t a c g g c a t - - - - - - - - -   fred

How about this time I give you the code to parse the two types of input 
and you tie it in with the parts we've already figured out to get the 
right output?  Just shout if you run into more problems.

James

#!/usr/bin/perl

use strict;
use warnings;

local $/ = '';  # use "paragraph mode": records are separated by blank lines

while (<DATA>) {
    unless (s/^>(.+?)\s*\n//) {  # find and remove the name
        warn "Skipping unknown format:  $_";
        next;
    }

    my $name = $1;  # save name
    tr/\n //d;      # join multi-line sequences
    print "Name:  $name, Sequence:  $_\n";  # show off our progress
}

__DATA__
>bob
atcgactagcatcgatcg
acacgtacgactagcac

>fred
actgactacgatcgaca
acgcgcgatacggcat

>bob
atcgactagcatcgatcgacacgtacgactagcac

>fred
actgactacgatcgacaacgcgcgatacggcat
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: formatting and syntax

2004-02-05 Thread Jeff 'japhy' Pinyan
On Feb 5, R. Joseph Newton said:

>my $sequence_length = 20;
>my $line = <DATA>;
>chomp $line;
>while ($line) {
>   my $sequence_tag = trim_line($line);
>   $line = <DATA>;
>   chomp $line;
>   my @nucleotides = split //, $line;
>   push @nucleotides, '_' for (1..($sequence_length - @nucleotides));

I'd be in favor of:

  push @nucleotides, ('_') x ($sequence_length - @nucleotides);

The 'x' operator on a list returns the list elements repeated the
specified number of times.
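
A minimal sketch of that list-repetition idiom, for anyone who wants to try it in isolation. The helper name pad_nucleotides and the sample sequence are hypothetical and not part of the code posted in this thread; '_' is the pad character used in the quoted script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pad a list of nucleotides
# out to $length entries using the list form of the 'x' operator.
sub pad_nucleotides {
    my ($length, @nucleotides) = @_;

    # An array in numeric context yields its element count.
    my $missing = $length - @nucleotides;

    # ('_') x $missing builds a list of $missing underscores;
    # the guard leaves sequences that are already long enough untouched.
    push @nucleotides, ('_') x $missing if $missing > 0;

    return @nucleotides;
}

my @padded = pad_nucleotides(8, split //, 'ACGTA');
print join(' ', @padded), "\n";    # prints: A C G T A _ _ _
```

The parentheses around '_' matter: without them, 'x' repeats the string and yields a single element such as '___' rather than a list of separate elements.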

>__DATA__
> >bob
>AGTGATGCCGACG
>A G T G A T G C C G A C G _ _ _ _ _ _ _   bob

Ack.  You're mixing the input with the output!

-- 
Jeff "japhy" Pinyan  [EMAIL PROTECTED]  http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
 what does y/// stand for?   why, yansliterate of course.
[  I'm looking for programming work.  If you like my work, let me know.  ]






Re: formatting and syntax

2004-02-05 Thread R. Joseph Newton
"Michael S. Robeson II" wrote:

> Hi, I am still all too new to Perl and I am having trouble playing with
> formatting my data into a new format. So here is my problem:
>
> I have data (DNA sequence) in a file that looks like this:
>
> 
> # Infile
> 
>  >bob
> AGTGATGCCGACG
>  >fred
> ACGCATATCGCAT
>  >jon
> CAGTACGATTTATC

Good, we can see the input structure here.  What jumps out at me is that the
input file comes in pairs of lines.  You will want to structure your input
routine to read and handle the lines by the pair, then.

>
>
> and I need it converted to:
>
> 
> # Outfile
> 
> R 1 20
>
>   A G U G A T G C C G A C G - - - - - - -   bob
>   A C G C A U A U C G C A U - - - - - - -   fred
>   C A G U A C G A U U U A U C - - - - - -   jon

>

[snip - a picture is worth a thousand words, and you showed us the picture
above.]

Well we have a fairly simple problem here, I'd say:

Greetings! E:\d_drive\perlStuff\giffy>perl -w
my $sequence_length = 20;
my $line = <DATA>;
chomp $line;
while ($line) {
   my $sequence_tag = trim_line($line);
   $line = <DATA>;
   chomp $line;
   my @nucleotides = split //, $line;
   push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
   print join(' ', @nucleotides), "   $sequence_tag\n";
   $line = <DATA>;
   chomp $line;
}

sub trim_line {
  my $in_line = shift;
  $in_line =~ s/^ >//;
  chomp $in_line;
  return $in_line;
}

__DATA__
 >bob
AGTGATGCCGACG
A G T G A T G C C G A C G _ _ _ _ _ _ _   bob
 >fred
ACGCATATCGCAT
A C G C A T A T C G C A T _ _ _ _ _ _ _   fred
 >jon
CAGTACGATTTATC
C A G T A C G A T T T A T C _ _ _ _ _ _   jon

or, better yet...

Greetings! E:\d_drive\perlStuff\giffy>perl -w
my $sequence_length = 20;
my $line = <DATA>;
chomp $line;
while ($line) {
   my $sequence_tag = trim_line($line);
   $line = <DATA>;
   chomp $line;
   $line = print_underscore_padded($line, $sequence_length, $sequence_tag);

}


sub trim_line {
  my $in_line = shift;
  $in_line =~ s/^ >//;
  chomp $in_line;
  return $in_line;
}

sub print_underscore_padded {
   my ($line, $sequence_length, $sequence_tag) = @_;
   my @nucleotides = split //, $line;
   push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
   print join(' ', @nucleotides), "   $sequence_tag\n";
   $line = <DATA>;
   chomp $line;
   return $line;
}

__DATA__
 >bob
AGTGATGCCGACG
A G T G A T G C C G A C G _ _ _ _ _ _ _   bob
 >fred
ACGCATATCGCAT
A C G C A T A T C G C A T _ _ _ _ _ _ _   fred
 >jon
CAGTACGATTTATC
C A G T A C G A T T T A T C _ _ _ _ _ _   jon


Does that help?

Joseph






Re: formatting and syntax

2004-02-04 Thread Jeff 'japhy' Pinyan
On Feb 4, Michael S. Robeson II said:

> >bob
>AGTGATGCCGACG
> >fred
>ACGCATATCGCAT
> >jon
>CAGTACGATTTATC

>R 1 20
>
>  A G U G A T G C C G A C G - - - - - - -   bob
>  A C G C A U A U C G C A U - - - - - - -   fred
>  C A G U A C G A U U U A U C - - - - - -   jon
>
>
>The "R 1" is static and should always appear. The "20" at the top of
>the new file should be a number defined by the user, that is they
>should be prompted for the length they wish the sequence to be. That is
>the total length of the sequence plus the added dashes could be 20 or
>3000 or whatever.  So, if they type 20 and there is only 10 letters in
>that row then the script should add 10 dashes to bring that total up to
>the 20 chosen by the user.

I'll provide one way to do this:

  # assuming $size has the number entered by the user

  # INFILE is assumed here: it is the filehandle opened in the original post
  while (<INFILE>) {
    my ($name) = / >(.+)/;        # get the line name
    chomp(my $DNA = <INFILE>);    # get the next line (the DNA)

    # add $size - length($DNA) dashes to the end of $DNA
    $DNA .= "-" x ($size - length $DNA);

    # print the DNA with spaces, then a tab, then the name
    print join(" ", split //, $DNA), "\t$name\n";
  }







Re: formatting and syntax

2004-02-04 Thread James Edward Gray II
On Feb 4, 2004, at 11:35 AM, Michael S. Robeson II wrote:

Hi, I am still all too new to Perl and I am having trouble playing with
formatting my data into a new format. So here is my problem:

I have data (DNA sequence) in a file that looks like this:


# Infile

>bob
AGTGATGCCGACG
>fred
ACGCATATCGCAT
>jon
CAGTACGATTTATC
and I need it converted to:


# Outfile

R 1 20
 A G U G A T G C C G A C G - - - - - - -   bob
 A C G C A U A U C G C A U - - - - - - -   fred
 C A G U A C G A U U U A U C - - - - - -   jon
The "R 1" is static and should always appear. The "20" at the top of 
the new file should be a number defined by the user, that is they 
should be prompted for the length they wish the sequence to be. That 
is the total length of the sequence plus the added dashes could be 20 
or 3000 or whatever.  So, if they type 20 and there is only 10 letters 
in that row then the script should add 10 dashes to bring that total 
up to the 20 chosen by the user.

Note that there should be a space between all letters and dashes - 
including a space at the beginning. Then there are supposed to be 7 
spaces after the sequence string followed by the name as shown in the 
example output file above. Also, of note is the fact that all of the 
T's are changed to U's. For those of you that know biology I am not 
only switching formats of the data but also changing DNA to RNA.

I hope I am explaining this clear enough, but here (see below) is as 
far as I can get with the code. I just do not know how to structure 
the loop/code to do this. I always have trouble with manipulating data 
the way I want when it comes to a loop. I would prefer an easier to 
understand code rather than an efficient code. This way I can learn 
the simple stuff first and learn the short-cuts later. Thanks to 
anyone who can help.

- Cheers!
- Mike
##
#!/usr/bin/perl
use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";

# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"

chomp (my $infile = <STDIN>);

open(INFILE, $infile)
or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";

# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"

chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
or die "Can't open OUTFILE for output: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed
my $name;
while (<INFILE>) {
	chomp;
	if (/^>(\w+)/) { $name = $1; }
	else {
		tr/T/U/;	# convert Ts to Us
		substr($_, $len) = '' if length($_) > $len;	# shorten, if needed
		$_ .= '-' x ($len - length($_)) if length($_) < $len;	# lengthen with dashes, if needed
		s/\b|\B/ /g;	# add spaces
		print OUTFILE "$_  $name\n";	# print
	}
}

Hope that helps.

James




Re: formatting and syntax

2004-02-04 Thread david
Michael S. Robeson II wrote:

> I have data (DNA sequence) in a file that looks like this:
> 
> 
> # Infile
> 
>  >bob
> AGTGATGCCGACG
>  >fred
> ACGCATATCGCAT
>  >jon
> CAGTACGATTTATC
> 
> and I need it converted to:
> 
> 
> # Outfile
> 
> R 1 20
> 
>   A G U G A T G C C G A C G - - - - - - -   bob
>   A C G C A U A U C G C A U - - - - - - -   fred
>   C A G U A C G A U U U A U C - - - - - -   jon

there are many ways of doing that. here is one:

#!/usr/bin/perl -w
use strict;

#--
#-- discard the first 3 header lines
#--
<DATA> for 1..3;

#--
#-- read each ' >'
#--
$/ = ' >';

while(<DATA>){

next unless(my($n,$s) = /(.+)\n(.+)/);

#--
#-- pad dna sequence to 20 bytes and translate T to U
#-- here, you will prompt the user to enter a number instead
#--
($s .= '-'x(20-length($s))) =~ y/T/U/;

#--
#-- put space after each character
#--
$s =~ s/./$& /g;

print "$s\t$n\n";
}

__DATA__

# Infile

 >bob
AGTGATGCCGACG
 >fred
ACGCATATCGCAT
 >jon
CAGTACGATTTATC

__END__

prints:

A G U G A U G C C G A C G - - - - - - - bob
A C G C A U A U C G C A U - - - - - - - fred
C A G U A C G A U U U A U C - - - - - - jon

david
-- 
sub'_{print"@_ ";* \ = * __ ,\ & \}
sub'__{print"@_ ";* \ = * ___ ,\ & \}
sub'___{print"@_ ";* \ = *  ,\ & \}
sub'{print"@_,\n"}&{_+Just}(another)->(Perl)->(Hacker)





Re: formatting and syntax

2004-02-04 Thread Rob Dixon
Michael S. Robeson II wrote:
>
> Hi, I am still all too new to Perl and I am having trouble playing with
> formatting my data into a new format. So here is my problem:
>
> I have data (DNA sequence) in a file that looks like this:
[snip]

Please don't talk about interesting stuff like DNA sequences on a Perl
group. We need less distraction.

Rob







formatting and syntax

2004-02-04 Thread Michael S. Robeson II
Hi, I am still all too new to Perl and I am having trouble playing with
formatting my data into a new format. So here is my problem:

I have data (DNA sequence) in a file that looks like this:


# Infile

>bob
AGTGATGCCGACG
>fred
ACGCATATCGCAT
>jon
CAGTACGATTTATC
and I need it converted to:


# Outfile

R 1 20
 A G U G A T G C C G A C G - - - - - - -   bob
 A C G C A U A U C G C A U - - - - - - -   fred
 C A G U A C G A U U U A U C - - - - - -   jon
The "R 1" is static and should always appear. The "20" at the top of 
the new file should be a number defined by the user, that is they 
should be prompted for the length they wish the sequence to be. That is 
the total length of the sequence plus the added dashes could be 20 or 
3000 or whatever.  So, if they type 20 and there is only 10 letters in 
that row then the script should add 10 dashes to bring that total up to 
the 20 chosen by the user.

Note that there should be a space between all letters and dashes - 
including a space at the beginning. Then there are supposed to be 7 
spaces after the sequence string followed by the name as shown in the 
example output file above. Also, of note is the fact that all of the 
T's are changed to U's. For those of you that know biology I am not 
only switching formats of the data but also changing DNA to RNA.

I hope I am explaining this clear enough, but here (see below) is as 
far as I can get with the code. I just do not know how to structure the 
loop/code to do this. I always have trouble with manipulating data the 
way I want when it comes to a loop. I would prefer an easier to 
understand code rather than an efficient code. This way I can learn the 
simple stuff first and learn the short-cuts later. Thanks to anyone who 
can help.

- Cheers!
- Mike
##
#!/usr/bin/perl
use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";

# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"

chomp (my $infile = <STDIN>);

open(INFILE, $infile)
or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";

# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"

chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
or die "Can't open OUTFILE for output: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed

# type of loop or structure to follow ?

#
