Re: stupid newbie question
Why not something like:

    my %sequences = ();
    my $seq;

    # Reads the files named on the command line, or STDIN if there are none.
    while (<>) {
        if ($_ =~ m/^Sequence ([^\n]+)$/) {
            # Start of a new block: remember its name and zero both counters.
            $seq = $1;
            $sequences{$seq} = [0, 0];
        }
        elsif ($_ =~ m/CR05-C1-10(\d)/) {
            if ($1 == 2) { $sequences{$seq}->[0]++; }
            elsif ($1 == 3) { $sequences{$seq}->[1]++; }
        }
    }

    my $total_102 = 0;
    my $total_103 = 0;
    for (keys %sequences) {
        print $_, ": 102 = ", $sequences{$_}->[0], "; 103 = ", $sequences{$_}->[1], "\n";
        $total_102 += $sequences{$_}->[0];
        $total_103 += $sequences{$_}->[1];
    }
    print "Total 102 = ", $total_102, "\n";
    print "Total 103 = ", $total_103, "\n";

Andrew

On Jan 17, 2005, at 2:04 PM, Marco Takita wrote:

Hi guys, sorry for the question not directly related to macosx, but this is the OS I work with and I know that you guys are really helpful. I'm really new to Perl. Actually, I'm trying to write my very first script. Let me try to explain what I need. I have a large text file that is basically something like this:

    Sequence Contig3772
    Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
    Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
    Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
    Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
    Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
    Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
    Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
    Sequence Contig3773
    Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
    Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
    Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
    Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
    Sequence Contig3774

and so on. What I need is to count how many times either CR05-C1-102 or CR05-C1-103 appears in the text, which I was able to do:

    #!/usr/bin/perl
    while (<>) {
        chomp;
        @text = ("CR05-C1-102", "CR05-C1-103");
        foreach $wd (split) {
            if ($wd =~ @text[0], @text[1]) {
                if ($wd =~ @text[0]) { $score++; }
                if ($wd =~ @text[1]) { $res++; }
            }
        }
    }
    print "CR05-C1-102 $score CR05-C1-103 $res \n\n";

My problem is that I cannot do that for individual blocks like:

    Sequence Contig3772
    Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
    Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
    Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
    Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
    Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
    Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
    Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498

I was not able to isolate this block from the rest of the text. Any idea how to do that?

Thanks a lot

Dr. Marco Aurélio Takita, Ph.D.
Centro APTA Citros Sylvio Moreira
Rodovia Anhanguera Km 158
Caixa Postal 04
13490-970 Cordeirópolis - SP, BRAZIL
Tel.: 55-19-35461399
Re: stupid newbie question
Thanks Andrew for your input! But the script still gives me the result for the total number of times they appear in the text. What I need now is to get the results for individual blocks, something like this:

input file

    Sequence Contig3772
    Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
    Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
    Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
    Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
    Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
    Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
    Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
    Sequence Contig3773
    Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
    Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
    Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
    Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185

output:

    Contig 3772
    CR05-C1-102 6
    CR05-C1-103 1
    Contig 3773
    CR05-C1-102 3
    CR05-C1-103 1

I believe that it is not very complicated to do that, but it is just that I'm not able to do that by myself...

Marco Takita

On Jan 17, 2005, at 5:34 PM, Andrew Mace wrote:

Why not something like: [...]
Re: stupid newbie question
At 5:04 pm -0200 17/1/05, Marco Takita wrote:

What I need is to count how many times either CR05-C1-102 or CR05-C1-103 appears in the text, which I was able to do:

    #!/usr/bin/perl
    while (<>) {

My problem is that I cannot do that for individual blocks like:

    Sequence Contig3772
    Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
    Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
    Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
    Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
    Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
    Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
    Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498

There are far shorter ways of doing it than I show here, but since you say you're new to Perl I'll make it as long as I can:

    #!/usr/bin/perl -w
    use strict;
    my ($i, $line, @lines, $text);

    # Single-quoted here-document: the sample block, taken literally.
    $text = <<'EOT';
    Sequence Contig3772
    Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
    Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
    Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
    Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
    Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
    Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
    Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
    EOT

    # No spaces inside the brackets: /x does not ignore whitespace in a
    # character class, so a space there would also split on every space.
    @lines = split m~[\012\015\x{2029}]~, $text;
    foreach $line ( @lines ) {
        $i++ if $line =~ m~CR05-C1-102 | CR05-C1-103~ix;
    }
    print $i;

(The here-document body and the closing EOT must start in the first column of the real script; they are indented here only for display.)

$text is some text delimited by separators of one of 3 kinds (linefeed, carriage return or paragraph separator); which of them is irrelevant in this case. We split the $calar $text into an @rray of lines. We then loop through @lines, adding 1 to the initial value (0/undefined) of $i each time a match (m) is found in $line for ..102 or (|) ..103.

JD
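As an illustration of JD's remark that there are far shorter ways of doing it, here is a minimal sketch of one such way, not necessarily the one JD had in mind. It assumes the whole text is already in a single scalar; the helper name count_matches is hypothetical, and the character class 10[23] is just a compact way of writing the two prefixes from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every occurrence of
# CR05-C1-102 or CR05-C1-103 in a string.
sub count_matches {
    my ($text) = @_;
    # A global match in list context returns all the matches; assigning
    # that list to () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /CR05-C1-10[23]/g;
    return $count;
}

# The Contig3772 block from the original question.
my $text = <<'EOT';
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
EOT

print count_matches($text), "\n";   # prints 7
```

Like the longer script, this only gives one total for the whole text; it does not separate the counts by Sequence block.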
Re: stupid newbie question
I don't have time to work out the details, but if I were faced with this problem, I'd use a hash of hashes to store the blocks, with the outer keys set to the block names and the inner keys set to the CR05-... strings, whatever they are. Use regular expressions to look for the string "Sequence" followed by the block name (which you store in a scalar until you have the counts), initialize an anonymous hash to store the counts for whatever strings you need to search for, and increment the counts as you scan. When you hit the next occurrence of "Sequence", store the anonymous hash as the value in the main hash and create a new anonymous hash. If you don't know about hashes of hashes and anonymous hashes, read (and study) chapter 4 of the Camel book.

Kim

On Jan 17, 2005, at 12:05 PM, Marco Takita wrote:

Thanks Andrew for your input! [...]
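The hash-of-hashes approach Kim describes can be sketched roughly as follows. This is only one possible reading of that description, not Kim's own code: the helper name count_by_block is hypothetical, and for brevity the anonymous hash is attached to the main hash as soon as a "Sequence" line is seen rather than when the next one is reached, which gives the same result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes a list of input lines
# and returns a reference to a hash of hashes, so that
# $counts->{Contig3772}{'CR05-C1-102'} is the count for that block.
sub count_by_block {
    my @lines = @_;
    my (%counts, $block);
    for my $line (@lines) {
        if ($line =~ /^Sequence\s+(\S+)/) {
            $block = $1;             # outer key: the block name
            $counts{$block} = {};    # fresh anonymous hash for its counts
        }
        elsif (defined $block and $line =~ /(CR05-C1-10[23])/) {
            $counts{$block}{$1}++;   # inner key: the matched prefix
        }
    }
    return \%counts;
}

# The two sample blocks from the thread.
my @sample = split /\n/, <<'EOT';
Sequence Contig3772
Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
Sequence Contig3773
Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185
EOT

my $counts = count_by_block(@sample);
for my $block (sort keys %$counts) {
    print "$block\n";
    for my $id ('CR05-C1-102', 'CR05-C1-103') {
        # A prefix that never occurred in the block is reported as 0.
        printf "%s %d\n", $id, $counts->{$block}{$id} || 0;
    }
}
```

To run this against a real file instead of the embedded sample, the lines could be read with something like my @lines = <>; and passed to count_by_block in the same way.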
Re: stupid newbie question
At 6:05 pm -0200 17/1/05, you wrote:

Thanks Andrew for your input! But the script still gives me the result for the total number of times they appear in the text. What I need now is to get the results for individual blocks, something like this:

input file

    Sequence Contig3772
    Assembled_from CR05-C1-102-004-_A01_-CT.F_008.ab1 -40 955
    Assembled_from CR05-C1-102-006-_E05_-CT.F_035.ab1 -40 972
    Assembled_from CR05-C1-102-004-_B01_-CT.F_007.ab1 -32 1007
    Assembled_from CR05-C1-103-033-_G08_-CT.F_026.ab1 397 1400
    Assembled_from CR05-C1-102-060-_D07_-CT.F_029.ab1 403 1450
    Assembled_from CR05-C1-102-008-_G03_-CT.F_010.ab1 404 1427
    Assembled_from CR05-C1-102-065-_F12_-CT.F_043.ab1 406 1498
    Sequence Contig3773
    Assembled_from CR05-C1-103-041-_E11_-CT.F_044.ab1 -694 275
    Assembled_from CR05-C1-102-019-_A11_-CT.F_048.ab1 -626 289
    Assembled_from CR05-C1-102-019-_D03_-CT.F_013.ab1 -625 314
    Assembled_from CR05-C1-102-019-_B11_-CT.F_047.ab1 -733 185

Apologies first of all for my original useless response. Here's how I would do it, and it works.

    while (<>) {
        # $hash holds the current contig number, used as the name of a hash.
        /Contig([0-9]+)/i and $hash = $1 and eval "my \%$hash";
        # $& is the text that just matched; it becomes the key.
        /CR05-C1-102|CR05-C1-103/i and eval "\$$hash\{\$&\} += 1";
    }

Every time a ...Contig line is encountered a new hash is created. When a -102- match is found, $hash{-102-} is incremented, etc. Using the above contents for your (\n delimited) file, you can run the script and then test the results, as below. How you decide to name the keys etc. is up to you.

    ## TEST
    print qq~$3772{'CR05-C1-102'} $3772{'CR05-C1-103'}~;
    # Result: 6 1

JD
Re: HTML::Tidy Filter for BBEdit?
Hi John, you wrote...

my question is, how can I create the same sort of script which will use HTML::Tidy? Both the module and the library are installed and running OK on my Mac, but I can't seem to get it to work, in fact I can't even figure out from the documentation how to get any output from it at all!

I think HTML::Tidy doesn't offer what you're seeking: it looks like a validation tool, not a pretty-print tool. So your output is empty if all is well in the world of your HTML, and consists of warnings/errors if there are problems validating the string HTML::Tidy is fed.

The following works as a filter in BBEdit, but (as above) I don't think it'll do what you're seeking! Do the items in the Markup-Tidy submenu do what you're seeking? I'm pretty sure they hook into a copy of HTML Tidy that's embedded in BBEdit.

Cheers,
Paul

--
    #!/usr/local/bin/perl -w
    use strict;
    use HTML::Tidy;

    # Slurp the whole input (files named on the command line, or STDIN).
    local $/;
    my $input = <>;

    # parse() wants a filename for its messages; any label will do.
    my $dummyfilename = 'input';
    my $tidy = new HTML::Tidy;
    $tidy->parse($dummyfilename, $input);

    # Print each warning or error that Tidy reported.
    for my $message ( $tidy->messages ) {
        print $message->as_string;
    }
--