Re: [Boston.pm] merge and compare help
On 8/27/07, Tolkin, Steve <[EMAIL PROTECTED]> wrote:
> This can be easily extended to be a general purpose match/merge program.
> Suppose we call the two inputs A and B. Each ID is in one of three
> possible cases, and so we want three subroutines, named e.g., just_in_a,
> just_in_b, and in_both. (In the original example just_in_a would do
> the same thing as just_in_b, but that is not always desired.)

In the dark ages, this was called the Master File Update algorithm, for
merging the "yesterday's balances/inventory" and "today's transactions"
punch-card decks to produce tonight's new balances/inventory deck and an
exception report (printer), with 2 card readers, 1 card punch, 1 printer,
no online storage, and precious little RAM.

The Other, Other Michael Jackson did a very elegant workup of this
algorithm.

> I am looking for perl code that does this, in a configurable way, e.g.
> let the user specify the ID column/s, sort the two inputs (if not
> already sorted), read them both, call the subs, etc. Please send a link
> or the code itself.

My version assumes the sorted ID is in column 1, iirc. I was thinking of
just such an extension but haven't done it.

___
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
Re: [Boston.pm] merge and compare help
This can be easily extended to be a general purpose match/merge program.
Suppose we call the two inputs A and B. Each ID is in one of three
possible cases, and so we want three subroutines, named e.g., just_in_a,
just_in_b, and in_both. (In the original example just_in_a would do
the same thing as just_in_b, but that is not always desired.)

I am looking for perl code that does this, in a configurable way, e.g.
let the user specify the ID column/s, sort the two inputs (if not
already sorted), read them both, call the subs, etc. Please send a link
or the code itself.

thanks,
Steve
--
Steven Tolkin    Steve-d0t-Tolkin-at-fmr-d0t-com    508-787-9006
Fidelity Investments    400 Puritan Way M3B    Marlborough MA 01752
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Macdonald
Sent: Monday, August 27, 2007 4:01 PM
To: Alex Brelsfoard
Cc: boston-pm@mail.pm.org
Subject: Re: [Boston.pm] merge and compare help

Your solution is the right one. The final trick is to make sure you
keep going with one file after the other file reaches the end.

I usually have the file read routine return a fake record for EOF, that
has a key guaranteed to be higher than any real key. (That requires
knowing what the keys look like, but it will often be something like
"\255\255\255\255".) The merge subroutine checks for that EOF key and
exits.

If a merge is done for a different key, then neither file can be at
EOF. If a record is written without needing a merge, then that file at
least is not at EOF. This trick gets rid of a lot of code that checks
whether either or both files are at EOF when you are deciding whether to
read from a file, and comparing the current records.

On Mon, Aug 27, 2007 at 02:04:57PM -0400, Alex Brelsfoard wrote:
> Hi All,
>
> I'm back and with a new algorithm/solution I need help with.
> I have two csv files, sorted by the first column (ID).
> Each file may have all the same, none of the same, or some of the same IDs.
> I would like to take these two files, and make one out of them.
> Two tricks:
> - When I come across the same ID in each file I need to merge those two
> lines (don't worry about the merge, I can handle that).
> - I want to be looking at the least number of lines from each file as
> possible at any one time (optimally I would like to only be looking at one
> of each file at the same time).
>
> Basically we are dealing with large files here and I don't want to kill my
> RAM by storing all the data from both files into a hash or some other
> object.
>
> I have an algorithm I like, I'm just not certain how to implement it:
> 1. Examine the ID of the first line of each file.
> 2. If they are the same, then merge and print the merge to the final output
> file.
> 3. If they are not the same, find the lesser one and have it print its
> contents to the final output file until its ID is the same or greater than
> the other file's.
> 4. repeat.
>
> Any advice on how to do this?
>
> Thanks.
> --Alex
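A minimal sketch of the kind of configurable match/merge driver Steve asks for. Everything here is an assumption made for illustration: the name match_merge, its option names (a, b, key_col, just_in_a, just_in_b, in_both), plain comma-separated input with no quoted commas, keys compared as strings, and inputs that are already sorted on the key column (the sorting step he mentions is left out).

```perl
use strict;
use warnings;

# Hypothetical driver: walks two inputs that are ALREADY sorted on the key
# column and calls one of three user-supplied callbacks for each key.
sub match_merge {
    my (%opt) = @_;
    my $col = defined $opt{key_col} ? $opt{key_col} : 0;   # 0-based column

    # Read one line and return { key, line }, or undef at end of input.
    # Assumes simple comma-separated fields with no quoted commas.
    my $next = sub {
        my ($fh) = @_;
        my $line = <$fh>;
        return undef unless defined $line;
        chomp $line;
        my @fields = split /,/, $line, -1;
        my $key = defined $fields[$col] ? $fields[$col] : '';
        return { key => $key, line => $line };
    };

    my $rec_a = $next->($opt{a});
    my $rec_b = $next->($opt{b});

    while ($rec_a || $rec_b) {
        # An exhausted input always loses the comparison.
        my $cmp = !$rec_b ? -1
                : !$rec_a ?  1
                :            $rec_a->{key} cmp $rec_b->{key};

        if ($cmp == 0) {
            $opt{in_both}->($rec_a->{line}, $rec_b->{line});
            $rec_a = $next->($opt{a});
            $rec_b = $next->($opt{b});
        }
        elsif ($cmp < 0) {
            $opt{just_in_a}->($rec_a->{line});
            $rec_a = $next->($opt{a});
        }
        else {
            $opt{just_in_b}->($rec_b->{line});
            $rec_b = $next->($opt{b});
        }
    }
    return;
}

# Demo on in-memory "files", keyed on the second column (index 1).
my $data_a = "x,1\ny,3\n";
my $data_b = "p,2\nq,3\nr,5\n";
open my $fh_a, '<', \$data_a or die "cannot open A: $!";
open my $fh_b, '<', \$data_b or die "cannot open B: $!";

my @log;
match_merge(
    a         => $fh_a,
    b         => $fh_b,
    key_col   => 1,
    just_in_a => sub { push @log, "just_in_a: $_[0]" },
    just_in_b => sub { push @log, "just_in_b: $_[0]" },
    in_both   => sub { push @log, "in_both: $_[0] + $_[1]" },
);
print "$_\n" for @log;
```

Only one record from each input is held at a time, which is the memory constraint from the original question. If the keys are numeric, cmp would need to become <=> to match the sort order of the files.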
Re: [Boston.pm] merge and compare help
Your solution is the right one. The final trick is to make sure you
keep going with one file after the other file reaches the end.

I usually have the file read routine return a fake record for EOF, that
has a key guaranteed to be higher than any real key. (That requires
knowing what the keys look like, but it will often be something like
"\255\255\255\255".) The merge subroutine checks for that EOF key and
exits.

If a merge is done for a different key, then neither file can be at
EOF. If a record is written without needing a merge, then that file at
least is not at EOF. This trick gets rid of a lot of code that checks
whether either or both files are at EOF when you are deciding whether to
read from a file, and comparing the current records.

On Mon, Aug 27, 2007 at 02:04:57PM -0400, Alex Brelsfoard wrote:
> Hi All,
>
> I'm back and with a new algorithm/solution I need help with.
> I have two csv files, sorted by the first column (ID).
> Each file may have all the same, none of the same, or some of the same IDs.
> I would like to take these two files, and make one out of them.
> Two tricks:
> - When I come across the same ID in each file I need to merge those two
> lines (don't worry about the merge, I can handle that).
> - I want to be looking at the least number of lines from each file as
> possible at any one time (optimally I would like to only be looking at one
> of each file at the same time).
>
> Basically we are dealing with large files here and I don't want to kill my
> RAM by storing all the data from both files into a hash or some other
> object.
>
> I have an algorithm I like, I'm just not certain how to implement it:
> 1. Examine the ID of the first line of each file.
> 2. If they are the same, then merge and print the merge to the final output
> file.
> 3. If they are not the same, find the lesser one and have it print its
> contents to the final output file until its ID is the same or greater than
> the other file's.
> 4. repeat.
>
> Any advice on how to do this?
>
> Thanks.
> --Alex
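The EOF-sentinel trick described in this message could be sketched roughly as follows. The names read_record, merge_sorted and $EOF_KEY are hypothetical, the input is assumed to be simple CSV keyed on the first column and compared as strings, and the "|" join merely stands in for the real merge. Note that in a Perl double-quoted string "\255" is an octal escape (character 173), so the sketch uses "\xFF" (character 255); the sentinel has to be chosen to sort after every real key in the actual data.

```perl
use strict;
use warnings;

# Sentinel key assumed to sort after every real ID. "\xFF" x 4 is only an
# illustration; pick a value that suits the real key format.
my $EOF_KEY = "\xFF" x 4;

# Return the next record, or a fake record carrying the sentinel key once
# the file is exhausted, so callers never test for EOF themselves.
sub read_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return { key => $EOF_KEY, line => undef } unless defined $line;
    chomp $line;
    my ($key) = split /,/, $line, 2;
    $key = '' unless defined $key;          # tolerate an empty line
    return { key => $key, line => $line };
}

sub merge_sorted {
    my ($fh_a, $fh_b, $emit) = @_;
    my $rec_a = read_record($fh_a);
    my $rec_b = read_record($fh_b);

    while (1) {
        my $cmp = $rec_a->{key} cmp $rec_b->{key};
        if ($cmp == 0) {
            # Equal keys: either a real match, or both inputs are at EOF.
            last if $rec_a->{key} eq $EOF_KEY;
            $emit->("$rec_a->{line}|$rec_b->{line}");   # placeholder merge
            $rec_a = read_record($fh_a);
            $rec_b = read_record($fh_b);
        }
        elsif ($cmp < 0) {
            # The smaller key can never be the sentinel, so A is not at EOF.
            $emit->($rec_a->{line});
            $rec_a = read_record($fh_a);
        }
        else {
            $emit->($rec_b->{line});
            $rec_b = read_record($fh_b);
        }
    }
    return;
}

# Demo on in-memory "files".
my $data_a = "10,x\n30,z\n";
my $data_b = "10,y\n20,w\n40,v\n";
open my $fh_a, '<', \$data_a or die "cannot open A: $!";
open my $fh_b, '<', \$data_b or die "cannot open B: $!";

my @out;
merge_sorted($fh_a, $fh_b, sub { push @out, $_[0] });
print "$_\n" for @out;
```

The loop contains a single exit test, inside the equal-keys branch. The other two branches never see a sentinel on the side they write, which is the simplification the message points out.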
[Boston.pm] merge and compare help
Hi All,

I'm back and with a new algorithm/solution I need help with.
I have two csv files, sorted by the first column (ID).
Each file may have all the same, none of the same, or some of the same IDs.
I would like to take these two files, and make one out of them.
Two tricks:
- When I come across the same ID in each file I need to merge those two
lines (don't worry about the merge, I can handle that).
- I want to be looking at the least number of lines from each file as
possible at any one time (optimally I would like to only be looking at one
of each file at the same time).

Basically we are dealing with large files here and I don't want to kill my
RAM by storing all the data from both files into a hash or some other
object.

I have an algorithm I like, I'm just not certain how to implement it:
1. Examine the ID of the first line of each file.
2. If they are the same, then merge and print the merge to the final output
file.
3. If they are not the same, find the lesser one and have it print its
contents to the final output file until its ID is the same or greater than
the other file's.
4. repeat.

Any advice on how to do this?

Thanks.
--Alex
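The four steps above can be sketched in Perl roughly like this. It is only an illustration under stated assumptions: next_record, merge_records and merge_files are hypothetical names, merge_records is a placeholder for the poster's own merge logic, the CSV is assumed to have no quoted commas, and IDs are compared as strings (use <=> instead of cmp if the files are sorted numerically).

```perl
use strict;
use warnings;

# Hypothetical helper: read one line and return [id, whole line],
# or undef at end of file.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return undef unless defined $line;
    chomp $line;
    my ($id) = split /,/, $line, 2;
    $id = '' unless defined $id;            # tolerate an empty line
    return [ $id, $line ];
}

# Placeholder merge: keep the line from A and append B's non-ID fields.
sub merge_records {
    my ($left, $right) = @_;
    my (undef, $rest) = split /,/, $right->[1], 2;
    return defined $rest ? "$left->[1],$rest" : $left->[1];
}

sub merge_files {
    my ($fh_a, $fh_b, $out) = @_;
    my $rec_a = next_record($fh_a);         # step 1: first line of each file
    my $rec_b = next_record($fh_b);

    while (defined $rec_a || defined $rec_b) {
        # Once one file is exhausted, keep draining the other.
        my $cmp = !defined $rec_b ? -1
                : !defined $rec_a ?  1
                :                    $rec_a->[0] cmp $rec_b->[0];

        if ($cmp == 0) {                    # step 2: same ID, merge
            print {$out} merge_records($rec_a, $rec_b), "\n";
            $rec_a = next_record($fh_a);
            $rec_b = next_record($fh_b);
        }
        elsif ($cmp < 0) {                  # step 3: lesser ID is in A
            print {$out} $rec_a->[1], "\n";
            $rec_a = next_record($fh_a);
        }
        else {                              # step 3: lesser ID is in B
            print {$out} $rec_b->[1], "\n";
            $rec_b = next_record($fh_b);
        }
    }                                       # step 4: repeat
    return;
}

# Demo on in-memory "files"; real code would open the two CSV files.
my $data_a = "1,apple\n3,cherry\n4,date\n";
my $data_b = "2,banana\n3,cranberry\n";
open my $fh_a, '<', \$data_a or die "cannot open A: $!";
open my $fh_b, '<', \$data_b or die "cannot open B: $!";

my $result = '';
open my $out, '>', \$result or die "cannot open output: $!";
merge_files($fh_a, $fh_b, $out);
close $out;
print $result;
```

At most one line from each file is held in memory at any time, so the size of the inputs does not matter.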
Re: [Boston.pm] quick XML::Twig / variable scope problem
Thank you all very much.
I think this gives me all the answers I need (and then some).
I needed to move forward on this project, so for the moment I have taken
the ugly route and made the variables global. But I hope to try applying
some of these better methods by the middle of the week.
I will post again to say what I finally went with.

Thanks again.
--Alex

On 8/27/07, mirod <[EMAIL PROTECTED]> wrote:
>
> Alex Brelsfoard wrote:
>
> > sub main {
> >     my $catalog_timestamp;
> [...]
>
> >     my $catalog_settings = XML::Twig->new(
> >         twig_handlers => {
> >             'Item'    => \&get_catalog_field_names,
> >             'Catalog' => \&get_catalog_timestamp,
> >         }
> >     );
> [...]
> > sub get_catalog_timestamp {
> >     my ($twig, $elt) = @_;
> >
> >     # get the PublishTimeStamp attribute
> >     $catalog_timestamp = $elt->att('PublishTimestamp');
> >
> >     $twig->purge();
> > }
> > However I get an error from this.
> > "Global symbol "$catalog_timestamp" requires explicit package name"
> > The error references the "$catalog_timestamp =
> > $elt->att('PublishTimestamp');" line.
>
> As per the error message ;--) $catalog_timestamp is in scope in the main
> sub, but not in the rest of the file.
>
> 2 options: make it global to the file by declaring it outside of the
> main sub, or (cleaner), pass it to the handler via a closure:
>
>
> my $catalog_settings= XML::Twig->new(
>    twig_handlers => {
>      'Item' => sub { get_catalog_field_names( @_, $catalog_timestamp); }
>      ...
>
> sub get_catalog_timestamp {
>     my ($twig, $elt, $catalog_timestamp) = @_;
>     ...
>
> Does this help?
>
> --
> mirod
>
Re: [Boston.pm] quick XML::Twig / variable scope problem
Alex Brelsfoard wrote:

> sub main {
>     my $catalog_timestamp;
[...]

>     my $catalog_settings = XML::Twig->new(
>         twig_handlers => {
>             'Item'    => \&get_catalog_field_names,
>             'Catalog' => \&get_catalog_timestamp,
>         }
>     );
[...]
> sub get_catalog_timestamp {
>     my ($twig, $elt) = @_;
>
>     # get the PublishTimeStamp attribute
>     $catalog_timestamp = $elt->att('PublishTimestamp');
>
>     $twig->purge();
> }
> However I get an error from this.
> "Global symbol "$catalog_timestamp" requires explicit package name"
> The error references the "$catalog_timestamp =
> $elt->att('PublishTimestamp');" line.

As per the error message ;--) $catalog_timestamp is in scope in the main
sub, but not in the rest of the file.

2 options: make it global to the file by declaring it outside of the
main sub, or (cleaner), pass it to the handler via a closure:


my $catalog_settings= XML::Twig->new(
   twig_handlers => {
     'Item' => sub { get_catalog_field_names( @_, $catalog_timestamp); }
     ...

sub get_catalog_timestamp {
    my ($twig, $elt, $catalog_timestamp) = @_;
    ...

Does this help?

--
mirod
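A runnable sketch of the closure option, with one adjustment worth noting: a handler that copies its arguments with my (...) = @_ gets its own copy of the timestamp, so assigning to that copy would not change the variable in main. The sketch therefore passes a reference (\$catalog_timestamp) and assigns through it. Demo::Twig and Demo::Elt are hypothetical stand-ins, present only so the example runs without XML::Twig installed; they mimic just the purge and att calls used in the thread.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the twig and element objects, so the sketch
# runs without XML::Twig. Only the methods used in the thread are mimicked.
{
    package Demo::Elt;
    sub new { my ($class, %att) = @_; return bless { att => {%att} }, $class }
    sub att { my ($self, $name) = @_; return $self->{att}{$name} }
}
{
    package Demo::Twig;
    sub new   { my ($class) = @_; return bless {}, $class }
    sub purge { return }
}

# The handler receives a REFERENCE to the caller's variable as an extra
# argument, so the assignment is visible back in main().
sub get_catalog_timestamp {
    my ($twig, $elt, $timestamp_ref) = @_;

    # get the PublishTimestamp attribute
    $$timestamp_ref = $elt->att('PublishTimestamp');

    $twig->purge();
    return;
}

sub main {
    my $catalog_timestamp;

    # The anonymous sub closes over $catalog_timestamp. With the real
    # module, this hash would be passed as
    #     XML::Twig->new( twig_handlers => \%handlers )
    my %handlers = (
        'Catalog' => sub { get_catalog_timestamp(@_, \$catalog_timestamp) },
    );

    # Simulate the parser invoking the handler for a <Catalog> element.
    my $elt = Demo::Elt->new(PublishTimestamp => '2007-08-27T12:00:00');
    $handlers{'Catalog'}->(Demo::Twig->new, $elt);

    return $catalog_timestamp;
}

my $timestamp = main();
print "$timestamp\n";
```

The variable stays lexical to main, and the handler remains an ordinary named sub that could be reused with a different variable.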