Re: Getting the dir structure
On Sat, 2004-12-04 at 17:25 -0800, John W. Krahn wrote:
> Dan Jones wrote:
>> On Thu, 2004-12-02 at 00:13 -0800, Mr M senthil kumar wrote:
>>> <SNIP>
>>> I have a file with thousands of line like :
>>> /abc/def/ijk/test.txt
>>> /pqr/lmn/test1.t
>>> I want to get the directory where the files test.txt and test1.txt are lying.
>>> </SNIP>
>>>
>>> Hi, You can try the following:
>>>
>>>     #!/usr/bin/perl
>>>     open (IN, "input_file") || die "Cannot open file: $!";
>>>     open (OUT, ">output_file") || die "Cannot send the output: $!";
>>
>> I see this a lot. One thing that immediately occurs to me is that if opening IN succeeds but opening OUT fails, the program dies without closing IN. Is this acceptable code in the Perl world or should the code close all open files before dying?
>
> The operating system handles all resources like files and memory so when the program exits the operating system frees up all file handle resources and the memory allocated for the strings "input_file" and "Cannot open file: $!". If the operating system didn't do this then programming would be a *LOT* harder and a *LOT* less robust!
>
> As an analogy: If you rent a hotel/motel room you *could* clean it up yourself before you check out, but most people don't. :-)

I understand that the system will usually clean up the messes you leave behind. However, in application programming with higher level languages, it's considered extremely poor programming practice to rely on this behavior. My question isn't whether the system will close the file, it's whether this is considered acceptable program behavior.

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response
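For anyone who does want the explicit cleanup discussed above, here is a minimal sketch. It is not code from the thread: the helper name open_pair and its arguments are hypothetical, and it uses lexical filehandles with three-argument open rather than the barewords in the quoted script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): open an input and an output
# file, closing the input handle explicitly if the output cannot be opened.
sub open_pair {
    my ($in_path, $out_path) = @_;

    open(my $in, '<', $in_path)
        or die "Cannot open $in_path: $!\n";

    my $out;
    unless (open($out, '>', $out_path)) {
        my $err = $!;      # save the error text before close() can change it
        close($in);        # tidy up the handle that did open
        die "Cannot open $out_path for writing: $err\n";
    }
    return ($in, $out);
}
```

Lexical handles like $in are also closed by Perl itself when the last reference to them goes away, so the explicit close() mainly documents intent.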
Re: Getting the dir structure
On Thu, 2004-12-02 at 00:13 -0800, Mr M senthil kumar wrote:
> <SNIP>
> I have a file with thousands of line like :
> /abc/def/ijk/test.txt
> /pqr/lmn/test1.t
> I want to get the directory where the files test.txt and test1.txt are lying.
> </SNIP>
>
> Hi, You can try the following:
>
>     #!/usr/bin/perl
>     open (IN, "input_file") || die "Cannot open file: $!";
>     open (OUT, ">output_file") || die "Cannot send the output: $!";

I see this a lot. One thing that immediately occurs to me is that if opening IN succeeds but opening OUT fails, the program dies without closing IN. Is this acceptable code in the Perl world or should the code close all open files before dying?
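As for the quoted question itself (getting the directory part of each path), one possible sketch uses the core File::Basename module. The helper name dirs_for is hypothetical, and the sketch assumes Unix-style paths, one per line, as in the quoted example.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper: map each path line to its directory part.
sub dirs_for {
    my @paths = @_;
    my @dirs;
    for my $path (@paths) {
        chomp $path;                  # drop the trailing newline, if any
        next unless length $path;     # skip empty lines
        push @dirs, dirname($path);
    }
    return @dirs;
}

print "$_\n" for dirs_for("/abc/def/ijk/test.txt\n", "/pqr/lmn/test1.txt\n");
```

Reading the lines from a file and writing the results out would wrap this in the usual open/while/print loop.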
Re: Sufficient effort
On Thu, 2004-12-02 at 09:38 +0100, Gunnar Hjalmarsson wrote:
> Casey West wrote:
>> And I should point out that I agree. However, that doesn't excuse bad behavior in response. Furthermore, I expect more from a responder than a questioner.
>
> Okay, I've made up my mind. I don't have the required patience with the lazy dogs, so I don't fit here.
>
>> Of course, you can always follow Jenda's advice and ignore them.
>
> And I will. I'll ignore this list from now on. Good luck! And beware of offending the flashes.

Have you stopped to consider that the reaction you have to your posting style being questioned is the same reaction some people have to your posts? If you're willing to abandon the list over someone questioning your style, how many people have also been willing to abandon it when they're slammed for violating some unknown etiquette?

The intent here wasn't to attack you personally; it was to say that perhaps there's a better way to do things. (Does that sound familiar?)

If you do leave, your knowledge will be sorely missed.
Reading large mbox files
I'm looking to write a utility to do some processing on email messages stored in mbox format. Some mbox files can be quite large, hundreds of megs or perhaps gigs in size. Obviously, reading in the whole file at once isn't feasible.

The most obvious method is to set $/ to the regex /\n\nFrom / (messages in mbox format are separated by a blank line and begin with a From line) and to read in email messages one at a time. It seems to me that this would be quite slow.

Another possibility that springs to mind is to read in chunks of 64k or so and then split those chunks into individual messages. This will complicate the program logic, however, as the chunks will inevitably split the last message in two. I'd then either have to back up the offset into the file to point to the beginning of the message, or store the beginning of the message, read in a new chunk, get the last half of the message off the new chunk, combine it with the stored beginning of the message, then process it.

I'm aware that there are a number of modules which deal with mail and mbox handling, but so far none of them seem to make doing what I'm trying to do easy. Reinventing the wheel isn't always a waste of time - it's sometimes a very good way to learn how wheels are constructed and how to use your tools to construct wheels. This gives you insight and practice when you have to use those same tools to construct non-wheels. :)

Any thoughts or pointers to discussions on how to handle large files in Perl would be welcome.
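To make the chunked idea concrete, here is a rough sketch of the "store the beginning, read a new chunk, combine" variant. The helper name each_message_chunked, its arguments and the callback interface are all hypothetical, and the sketch assumes every message after the first is preceded by a blank line and a "From " line.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in fixed-size chunks and call $callback once
# per message, carrying an incomplete tail over to the next chunk.
sub each_message_chunked {
    my ($fh, $chunk_size, $callback) = @_;
    my $buffer = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # A complete message ends just before the next "\n\nFrom ".
        while ((my $pos = index($buffer, "\n\nFrom ")) >= 0) {
            $callback->(substr($buffer, 0, $pos + 1));  # keep the final newline
            substr($buffer, 0, $pos + 2, '');           # buffer now starts at "From "
        }
    }
    $callback->($buffer) if length $buffer;             # last message in the file
}
```

Because the unprocessed tail always stays in $buffer, a separator that is split across two chunks is still found once the next chunk has been appended.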
Re: Reading large mbox files
On Sat, 2004-11-27 at 23:07 +0100, Gunnar Hjalmarsson wrote:
> Dan Jones wrote:
>> The most obvious method is to set $/ to the regex /\n\nFrom / (messages in mbox format are separated by a blank line and begin with a From line) and to read in email messages one at a time.
>
> From "perldoc perlvar": "Remember: the value of $/ is a string, not a regex."

Yes, I've since realized that.

> So that method isn't obvious at all; it's not available. (But you can set $/ to the *string* "\n\nFrom ".)

Well, the idea of resetting $/ is obvious, even if I misstated the details.

>> It seems to me that this would be quite slow.
>
> What made you draw that conclusion?

Because I cut my teeth on C (and later C++). That means I have lots of habits and mind sets that are great for a C programmer but not so great for a wannabe JAPH. Disk I/O is slow. It's generally _much_ faster to read in decent size chunks from the hard drive than to do repeated reads of small size. What I didn't consider is that Larry's a pretty bright guy, and he's probably forgotten more about low level I/O than I know.

After thinking about it a bit more, I realized that Perl is almost certainly doing the buffering for you. I'd be very surprised if reading via the angle operators correlated to doing actual disk reads. Instead, I'd wager that Perl slurps in a big chunk and just feeds it to you as you request it. One of the advantages of using higher level languages is that they'll do a lot of the schlock work for you, if you remember to let them do it!

> If you want to preserve the message separators with the right messages, line by line processing may be the easiest approach:
>
>     my $msgsep = qr(^From );
>     my $msg;
>     while (<MBOX>) {
>         if ( /$msgsep/ ) {
>             processmsg( \$msg ) if $msg;
>             $msg = $_;
>         } else {
>             $msg .= $_;
>         }
>     }
>     processmsg( \$msg );
>
>     sub processmsg { ... }
>
> (The $msgsep regex should probably be more specified.)

Since the message separator doesn't change, I think it's easier just to remove it from the end of the message and add it back in to the next message. I had to extract the first message before going in to the loop, since the message separator isn't missing from it and adding it back in leads to it being present twice at the beginning of the message.

Here's what I've come up with so far:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Term::ReadKey;

    sub ProcMessage($);
    sub Pause();

    die "usage: rdmbox mailbox" unless ($ARGV[0]);
    open MAILBOX, $ARGV[0] or die "Can't open $ARGV[0]";

    local $/ = "\n\nFrom ";

    $_ = <MAILBOX>;
    $_ =~ s/\n\nFrom $//;
    ProcMessage($_);

    while (<MAILBOX>) {
        $_ =~ s/\n\nFrom $//;
        ProcMessage("From $_");
        if (Pause() == 0) { last; }
    }
    close MAILBOX;

    sub ProcMessage($) {
        my $message = shift;
        print "Message:\n $message \n\n";
    }

    sub Pause() {
        open(TTY, "</dev/tty");
        ReadMode "raw";
        my $key = ReadKey 0, *TTY;
        ReadMode "normal";
        if ($key eq "q") { return 0; }
        else { return 1; }
    }

Comments on the above code are welcome, even if it touches on some other issue. If it isn't obvious by now, I'm trying to learn here! Basic error checking is included but error messages are a bit terse right now.
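The same separator handling can be tried out without a real mailbox or a terminal. The following is only a sketch under assumptions: read_messages is a hypothetical name, it returns the messages as a list instead of pausing between them, and it takes any filehandle, so an in-memory string will do for experiments.

```perl
use strict;
use warnings;

# Hypothetical helper: return the messages found on an mbox filehandle,
# using the string "\n\nFrom " as the input record separator.
sub read_messages {
    my ($fh) = @_;
    local $/ = "\n\nFrom ";    # a plain string, not a regex
    my @messages;
    while (my $record = <$fh>) {
        # Strip the trailing separator but keep the newline that ends the message.
        $record =~ s/\n\nFrom \z/\n/;
        # Every record after the first lost its leading "From " to the separator.
        $record = "From $record" if @messages;
        push @messages, $record;
    }
    return @messages;
}

my $mbox = "From a\nbody1\n\nFrom b\nbody2\n";
open(my $fh, '<', \$mbox) or die "Cannot open in-memory mailbox: $!";
print scalar(read_messages($fh)), " messages\n";
```

Returning a list holds every message in memory at once, so for the large files discussed in this thread a per-message callback would be the safer shape.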
Re: Reading large mbox files
On Sun, 2004-11-28 at 04:09 +0100, Gunnar Hjalmarsson wrote:
> Dan Jones wrote:
>>     local $/ = "\n\nFrom ";
>>
>>     $_ = <MAILBOX>;
>>     $_ =~ s/\n\nFrom $//;
>>     ProcMessage($_);
>>
>>     while (<MAILBOX>) {
>>         $_ =~ s/\n\nFrom $//;
>>         ProcMessage("From $_");
>>         if (Pause() == 0) { last; }
>>     }
>>     close MAILBOX;
>>
>>     sub ProcMessage($) {
>>         my $message = shift;
>>         print "Message:\n $message \n\n";
>>     }
>
> Seems fine to me. The only concern is paragraphs that start with "From ", without e.g. ">" being prepended to those lines. I suppose you know whether that is an issue to count with.

That isn't supposed to happen. The program writing to the mailbox is responsible for checking for that and prepending a ">" to those lines. See here if you're interested: http://en.wikipedia.org/wiki/Mbox

On to the next issue. One of the things I want to do is to check for duplicate messages. The common way to do that is to simply check the Message ID. The widely used procmail recipe uses that method, as does the formail utility. However, Message IDs are not guaranteed to be unique. If a collision does occur, you lose a message. My thought is to hash the message body, and store that hash value. If a Message ID collision occurs as you're processing the mailbox, you check the hash values to be sure they're the same before deleting the message.

The problem is that Perl has a variable type called hash. (Yes, I know, you probably heard that somewhere before.) Searching for information on using hashing functions in Perl leads to pages and pages of information dealing with the hash variable type.

Perl obviously uses an internal hashing function to generate its hash variables. Is it possible to access that function from a script? If not, does anyone know of a module or pointer to information on hashing functions for Perl?
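One module-based way to get a digest of a message body is Digest::SHA (Digest::MD5 offers a similar functional interface). The sketch below is an assumption-laden illustration, not an answer taken from the thread: drop_duplicates is a hypothetical helper, messages are represented as [message_id, body] pairs, and bodies are treated as byte strings.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical helper: given [message_id, body] pairs, drop a message only
# when an earlier one had the same Message ID *and* the same body digest.
sub drop_duplicates {
    my @messages = @_;
    my %seen;    # Message ID => { body digest => count }
    my @kept;
    for my $msg (@messages) {
        my ($id, $body) = @$msg;
        my $digest = sha256_hex($body);
        next if $seen{$id}{$digest}++;    # same ID and same content: duplicate
        push @kept, $msg;
    }
    return @kept;
}
```

A Message ID collision between two different bodies produces two different digests, so both messages are kept.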
RE: Recursively counting a matching pattern on a single line.
On Wed, 2004-10-27 at 20:07, S.A. Birl wrote:
> On Oct 27, [EMAIL PROTECTED] ([EMAIL PROTECTED]:
>
> Brian: If you want to make sure they are alternating like <><> etc... I would do
> Brian: this:
> Brian:
> Brian: $_ = $line;
> Brian:
> Brian: @syms = m/[<>]/g;
> Brian: $string = join("", @syms);
> Brian: if ($string !~ m/^(<>)*$/)
> Brian: {
> Brian:    ## Scream here!
> Brian: }
> Brian:
> Brian: The regular expression:
> Brian:
> Brian: m/^(<>)*$/
> Brian:
> Brian: will ensure that it starts with < and ends with > and anything in between
> Brian: will be >< which I think should do the trick. That logic is pretty hairy
> Brian: though and I could be missing something.
>
> Wouldnt m/[<>]/g literally match <> and not characters? Why wouldnt it be m/[.+]/g ?

Brackets in a regular expression match a single character inside the brackets. It's the RE equivalent to an OR. So m/[<>]/g will match each instance of '<' or '>'. The match will be a single character long.
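A small sketch of the bracket behaviour described above. The specific symbols are an assumption: the archived text lost them, and '<' and '>' are used here purely for illustration. The helper name brackets_alternate is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the '<' and '>' characters in $line strictly
# alternate, starting with '<' and ending with '>'.
sub brackets_alternate {
    my ($line) = @_;
    my @syms = $line =~ /[<>]/g;    # each match is exactly one character
    my $string = join '', @syms;
    return $string =~ /^(?:<>)*$/ ? 1 : 0;
}

print brackets_alternate('a <b> c <d>') ? "ok\n" : "scream\n";
```

Note that a line with none of the symbols also passes, because the empty string matches /^(?:<>)*$/.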
RE: Determine the Binary format of a file
On Wed, 2004-10-27 at 23:19, Jim wrote:
>> Have any backups? Paper reports? If all else fails, you could always hire some interns and turn it into a massive data [re-]entry project, provided that a paper trail exists...
>
> LOL! If I don't figure it out tonight, gonna tell my boss to renew the software :)

If you have access to the software, you might be able to create a new file and put unique data into it, such as strings of repeating numbers or letters, then try to reverse engineer that format.
Re: what is something like this - $seen{$1}
On Tue, 2004-10-26 at 22:00, Chasecreek Systemhouse wrote:
> Interesting. Why doesn't this skip already seen letters, I used the case-insensitive modifier...
>
>     %seen = ( );
>     $string = "AaBbCcDdEeFf";
>     while ($string =~ /(.)/gi) { $seen{$1}++; }
>     print "\n\nunique chars are: ", sort(keys %seen), "\n";
>
> 'A' and 'a' are the same, or is the logic only char() oriented? I guess I'm forced to use lc();

The 'i' modifier only affects the matching of the RE. If you had a letter in your RE, it would match either case of that letter. It doesn't change the case of the letter, just affects which ones match. Since the RE has no letters, the 'i' modifier doesn't do anything here.
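A minimal sketch of the lc() approach mentioned in the quoted message: the captured character is folded to lower case before it is used as a hash key. The helper name unique_chars_ci is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct characters of $string, treating
# upper and lower case as the same character.
sub unique_chars_ci {
    my ($string) = @_;
    my %seen;
    $seen{ lc $1 }++ while $string =~ /(.)/g;    # fold case on the key, not in the RE
    return sort keys %seen;
}

print "unique chars are: ", join('', unique_chars_ci('AaBbCcDdEeFf')), "\n";
# prints "unique chars are: abcdef"
```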
Re: Broken Subroutine
On Wed, 2004-10-20 at 23:19, Ron Smith wrote:
> The following is the code:
>
>     #!/usr/bin/perl -w
>     use strict;
>     my @paths = `dir /b/s`;
>     # print @paths;
>     my @basenames = basenames(@paths);
>
>     sub basenames {
>         foreach (@_) {
>             if ($_ =~ /(\w+)\.\d+\.\w+$/) {
>                 @basenames = $1;
>                 # print "@basenames\n";
>             }
>         }
>     }

First, while it's allowable, it seems to me that you're asking for trouble by using the same name multiple times. Perl may not have difficulty keeping sub var, $var, @var and @var declared again at a different scope separated, but programmers sure do.

For instance, do you mean for the array @basenames inside the subroutine to be the same as the array @basenames that you declared outside the subroutine? If so, why are you trying to assign a value to it when it's already (theoretically) being populated inside the subroutine?

I say theoretically because when you assign a scalar to an array via '=', you're essentially creating a new array with one element. Any values that were in the array are lost. If you're trying to add an additional element to the array, you need to use push (to add to the end of the array) or unshift (to add to the beginning of the array).

If you intended to use the same variable inside and outside the subroutine (just one version of @basenames), then don't bother assigning to the variable. Just call the subroutine. I don't recommend it, but it will work.

On the other hand, if you intended to have two different variables, change the name of one of them (and you'll need to declare it with my inside the subroutine). Then explicitly return the array from the subroutine.

Something like this (untested code!):

bad way

    #!/usr/bin/perl -w
    use strict;
    my @paths = `dir /b/s`;
    my @basenames;
    procbasenames(@paths);

    sub procbasenames {
        foreach (@_) {
            if ($_ =~ /(\w+)\.\d+\.\w+$/) {
                push @basenames, $1;
            }
        }
    }

better way

    #!/usr/bin/perl -w
    use strict;
    my @paths = `dir /b/s`;
    my @basenames = procbasenames(@paths);

    sub procbasenames {
        my @basenamematches;
        foreach (@_) {
            if ($_ =~ /(\w+)\.\d+\.\w+$/) {
                push @basenamematches, $1;
            }
        }
        return @basenamematches;
    }
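The difference between assigning to an array and pushing onto it can be seen in a few lines. This is only an illustration; the array name and values are made up.

```perl
use strict;
use warnings;

my @names = ('alpha', 'beta');

@names = 'gamma';              # assignment replaces the whole array
print scalar(@names), "\n";    # prints 1

push @names, 'delta';          # adds to the end of the array
unshift @names, 'omega';       # adds to the beginning of the array
print "@names\n";              # prints "omega gamma delta"
```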
Re: Regex to match valid host or dns names
On Wed, 2004-10-13 at 15:06, K.Prabakar wrote:
>> example below, it fails to match host-no.top-level as a valid host name. I modify the regex several times - but still don't get the right outlook.
>>
>>     my @hosts = qw(192.168.22.1 192.168.22.18 localhost another.host.domain host-no.top-level my.host.domain.com);
>>     foreach (@hosts){
>>         # Works ok
>>         push (@ips, $_ ) if $_ =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}/;
>>         # Can't match host-no.top-level.
>>         push (@dns, $_) if $_ =~ /^\w+-?[\w+]?\.?[\w+.{1}]*\w+$/;
>>     }
>
> /^\w+-?[\w+]?\.?[\w+.{1}]*\w+$/ -- Here you look for only one - and also not allowing any other non-word characters (like hyphen). The . can match any character even other than - .
>
> You can think like this (for IPs): search for a number with maximum 3 digits and then followed by the same kind of 3 numbers but prefixed with a dot. Try this:
>
>     $_ =~ /^\d{1,3}[\.\d{1,3}]{3}/
>
> You can think like this (for DNS names): search for a WORD which may (-?) contain hyphen within it and then followed by the same kind of zero-or-more WORDs but prefixed with a dot, which is a normal dns name pattern. Try this:
>
>     $_ =~ /^\w\w*-?\w+?[\.\w\w*-?\w+?]*$/
>
> But this will allow IP's also in your @dns because \w can match digits also.

Isn't this easily solved?

    foreach (@hosts) {
        if ($_ =~ /^\d{1,3}[\.\d{1,3}]{3}/) {
            push (@ips, $_);
        }
        elsif ($_ =~ /^\w\w*-?\w+?[\.\w\w*-?\w+?]*$/) {
            push (@dns, $_);
        }
    }
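One caveat about the patterns above: square brackets form a character class that matches a single character, so [\.\d{1,3}]{3} means "any three characters out of that set" rather than "a dot followed by up to three digits, three times". A sketch of the same if/elsif idea using (?:...) groups follows. The helper name classify_hosts is hypothetical, and the patterns are deliberately loose: octets are not range-checked and \w also allows underscores.

```perl
use strict;
use warnings;

# Hypothetical helper: split host strings into dotted-quad addresses and names.
sub classify_hosts {
    my @hosts = @_;
    my (@ips, @dns);
    for my $host (@hosts) {
        # Four groups of 1-3 digits separated by dots.
        if ($host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/) {
            push @ips, $host;
        }
        # Dot-separated labels; each label may contain inner hyphens.
        elsif ($host =~ /^\w+(?:-\w+)*(?:\.\w+(?:-\w+)*)*$/) {
            push @dns, $host;
        }
    }
    return (\@ips, \@dns);
}

my ($ips, $dns) = classify_hosts(qw(192.168.22.1 192.168.22.18 localhost
    another.host.domain host-no.top-level my.host.domain.com));
print "ips: @$ips\ndns: @$dns\n";
```

Testing for the address pattern first is what keeps the dotted quads out of @dns, since the name pattern would accept them too.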