Re: stupid newbie question

2005-01-17 Thread Andrew Mace
Why not something like:
my %sequences = ();
my $seq;
while (<>) {
    if ($_ =~ m/^Sequence ([^\n]+)$/) {
        $seq = $1;
        $sequences{$1} = [0,0];
    } elsif ($_ =~ m/CR05-C1-10(\d)/) {
        if ($1 == 2) {
            $sequences{$seq}->[0]++;
        } elsif ($1 == 3) {
            $sequences{$seq}->[1]++;
        }
    }
}
my $total_102 = 0;
my $total_103 = 0;
for (keys %sequences) {
    print $_, ": 102 = ", $sequences{$_}->[0], "; 103 = ", $sequences{$_}->[1], "\n";
    $total_102 += $sequences{$_}->[0];
    $total_103 += $sequences{$_}->[1];
}

print "Total 102 = ", $total_102, "\n";
print "Total 103 = ", $total_103, "\n";
Andrew

On Jan 17, 2005, at 2:04 PM, Marco Takita wrote:
Hi guys, sorry for the question not directly related to macosx but 
this is the OS I work with and I know that you guys are really 
helpful.

I'm really new to Perl. Actually, I'm trying to write my very first 
script. Let me try to explain what I need. I have a large text file 
that is basically something like this:

Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498
Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185
Sequence  Contig3774
and so on.
What I need is to count how many times either CR05-C1-102 or 
CR05-C1-103 appears in the text, which I was able to do:

#!/usr/bin/perl
while (<>) {
    chomp;
    @text = ("CR05-C1-102", "CR05-C1-103");
    foreach $wd (split) {
        # match either prefix first, then count each one separately
        if ($wd =~ /$text[0]|$text[1]/) {
            if ($wd =~ /$text[0]/) {
                $score++;
            }
            if ($wd =~ /$text[1]/) {
                $res++;
            }
        }
    }
}
print " CR05-C1-102 $score CR05-C1-103 $res \n\n";
My problem is that I cannot do that for individual blocks like:
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498
I was not able to isolate this block from the rest of the text.
Any idea how to do that?
Thanks a lot
Dr. Marco Aurélio Takita, Ph.D.
Centro APTA Citros Sylvio Moreira
Rodovia Anhanguera Km 158
Caixa Postal 04
13490-970 Cordeirópolis - SP, BRAZIL
Tel.: 55-19-35461399



Re: stupid newbie question

2005-01-17 Thread Marco Takita
Thanks Andrew for your input!
But the script still gives me the result for the total number of times 
they appear in the text. What I need now is to get the results for 
individual blocks, something like this:

input file
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498
Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185
output:
Contig 3772
CR05-C1-102 6 CR05-C1-103 1
Contig 3773
CR05-C1-102 3 CR05-C1-103 1
I believe that it is not very complicated to do that, but I am just 
not able to do it by myself...

Marco Takita




Re: stupid newbie question

2005-01-17 Thread John Delacour
At 5:04 pm -0200 17/1/05, Marco Takita wrote:
What I need is to count how many times either CR05-C1-102 or 
CR05-C1-103 appears in the text, which I was able to do:

#!/usr/bin/perl
while (<>) {


My problem is that I cannot do that for individual blocks like:
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498

There are far shorter ways of doing it than I show here but since you 
say you're new to Perl I'll make it as long as I can:

#!/usr/bin/perl -w
use strict;
my ($i,  $line, @lines, $text);
$text = <<'EOT';
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498
EOT
@lines = split m~[  \012  \015  \x{2029}  ] ~x, $text;
foreach $line ( @lines ) {
  $i++  if  $line  =~  m~CR05-C1-102  |  CR05-C1-103~ix;
}
print $i;
$text is some text delimited by paragraph separators of one of 3 
kinds -- which of them being irrelevant in this case.  We split the 
$calar $text into an @rray of lines.  We then loop through  @lines 
adding 1 to the initial value 0/undefined of $i each time a match (m) 
is found in $line for ..102  or (|) ..103

JD





Re: stupid newbie question

2005-01-17 Thread Kim Helliwell
I don't have time to work out the details, but if I were faced
with this problem, I'd use a hash of hashes to store the blocks, with
the outer key set to the block names, and the inner keys set to
the CR05--- whatever.
Use regular expressions to look for the string Sequence followed
by some stuff (which you store into a scalar until you have the count)
initialize an anonymous hash to store the counts by whatever strings
you need to search for, and increment the counts as you scan. When
you hit the next occurrence of Sequence, store the anonymous hash
as the value in the main hash and create a new anonymous hash.
If you don't know about hashes of hashes and anonymous hashes, read
(and study) chapter 4 of the Camel book.
Kim
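
A minimal sketch of the hash-of-hashes approach described above, added for illustration. The helper name count_by_block and the in-memory sample are assumptions, not code from the thread, and for simplicity the anonymous hash is stored as soon as a Sequence line is seen rather than when the next one is reached.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads lines from a filehandle
# and returns a reference to a hash of hashes,
#   block name => { 'CR05-C1-102' => count, 'CR05-C1-103' => count }.
sub count_by_block {
    my ($fh) = @_;
    my (%blocks, $current);
    while (my $line = <$fh>) {
        if ($line =~ /^Sequence\s+(\S+)/) {
            # A new block starts: give it a fresh anonymous hash of counts.
            $current = $1;
            $blocks{$current} = { 'CR05-C1-102' => 0, 'CR05-C1-103' => 0 };
        }
        elsif (defined $current and $line =~ /(CR05-C1-10[23])/) {
            # Increment the inner count keyed by the prefix that matched.
            $blocks{$current}{$1}++;
        }
    }
    return \%blocks;
}

# Sample data copied from the thread, read through an in-memory filehandle.
my $sample = <<'EOT';
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498
Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185
EOT

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $blocks = count_by_block($fh);
close $fh;

# Print one name line and one counts line per block.
for my $name (sort keys %$blocks) {
    my $counts = $blocks->{$name};
    print "$name\n";
    print join(' ', map { "$_ $counts->{$_}" } sort keys %$counts), "\n";
}
```

To run it against a real file instead of the sample, one could open the file and pass that handle to count_by_block, or pass \*STDIN and pipe the file in.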

Re: stupid newbie question

2005-01-17 Thread John Delacour
At 6:05 pm -0200 17/1/05, you wrote:
Thanks Andrew for your input! But the script still gives me the 
result for the total number of times they appear in the text. What I 
need now is to get the results for individual blocks, something like 
this:

input file
Sequence Contig3772
Assembled_from  CR05-C1-102-004-_A01_-CT.F_008.ab1  -40  955
Assembled_from  CR05-C1-102-006-_E05_-CT.F_035.ab1  -40  972
Assembled_from  CR05-C1-102-004-_B01_-CT.F_007.ab1  -32  1007
Assembled_from  CR05-C1-103-033-_G08_-CT.F_026.ab1  397  1400
Assembled_from  CR05-C1-102-060-_D07_-CT.F_029.ab1  403  1450
Assembled_from  CR05-C1-102-008-_G03_-CT.F_010.ab1  404  1427
Assembled_from  CR05-C1-102-065-_F12_-CT.F_043.ab1  406  1498
Sequence Contig3773
Assembled_from  CR05-C1-103-041-_E11_-CT.F_044.ab1  -694  275
Assembled_from  CR05-C1-102-019-_A11_-CT.F_048.ab1  -626  289
Assembled_from  CR05-C1-102-019-_D03_-CT.F_013.ab1  -625  314
Assembled_from  CR05-C1-102-019-_B11_-CT.F_047.ab1  -733  185

Apologies first of all for my original useless response.  Here's how 
I would do it -- and it works.

while (<>) {
  /Contig([0-9]+)/i and $hash=$1 and eval "my \%$hash";
  /CR05-C1-102|CR05-C1-103/i and eval "\$$hash\{\$&\} += 1";
}
Every time a ...Contig line is encountered a new hash is created. 
When a -102- match is found $hash{-102-} is incremented etc.

Using the above contents for your (\n delimited) file, you can run 
the script and then test the results, as below.  How you decide to 
name the keys etc. is up to you.

## TEST
print  qq~$3772{'CR05-C1-102'} $3772{'CR05-C1-103'}~;
# Result: 6 1
JD





Re: HTML::Tidy Filter for BBEdit?

2005-01-17 Thread Paul McCann
Hi John,
you wrote...

 my question is, how can I create the same sort of script which will 
 use HTML::Tidy?
 
 Both the module and the library are installed and running OK on my 
 Mac, but I can't seem to get it to work, in fact I can't even figure 
 out from the documentation how to get any output from it at all!

I think HTML::Tidy doesn't offer what you're seeking: looks like it's a
validation tool, not a pretty-print tool. So your output is empty if things
are well in the world of your html, and warnings/errors if there are problems
validating the string HTML::Tidy is fed. 

The following works as a filter in BBEdit, but (as above) I don't think
it'll do what you're seeking!

Do the items in the Markup-Tidy submenu do what you're seeking? I'm
pretty sure they hook into a copy of HTML Tidy that's embedded in BBEdit.

Cheers,
Paul
--
#!/usr/local/bin/perl -w
use strict;
use HTML::Tidy;
local $/;                       # slurp mode: read all input at once
my $input = <>;
my $dummyfilename = 'input';    # only used to label the messages
my $tidy = HTML::Tidy->new;
$tidy->parse($dummyfilename, $input);
for my $message ( $tidy->messages ) {
    print $message->as_string, "\n";
}
--