patch: DateTime::TimeZone PP redundant structures

2005-07-15 Thread Matt Sisk
Hello,

I updated my 'redundant data structure' fix so that it works with the latest
code in the repository and have included it below.

This only works for the PP versions of the zone modules. It was more of a proof
of concept -- there are probably more efficient ways to approach the problem
(for example, I'm using md5 signatures of serialized data structures as keys;
I could probably just use the serialized structures themselves as keys, though
md5 might be necessary in the XS world). The performance hit
*should* only happen the first time each zone module is loaded (and even then
I've got the current threshold set to 25, so loading fewer zones than that
should be more or less similar to current behavior).

The idea is to attack smaller data structures first, reusing them whenever
possible. I also attack larger data sets (say arrays of rules) and objects
(which themselves might contain other structures) and reuse those
super-structures as well, even though they might already be reusing smaller
structures.
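
A minimal sketch of the bottom-up idea described above, outside of DateTime::TimeZone. The names %registry, fingerprint() and share() are made up for illustration and are not part of the patch; setting Sortkeys and Indent is an extra assumption so that equal structures serialize to the same string.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Data::Dumper;

# Hypothetical registry: fingerprint => first structure seen with it.
my %registry;

sub fingerprint {
    my ($ref) = @_;
    # Sort hash keys and drop indentation so equal structures dump identically.
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Indent   = 0;
    return md5_hex( Dumper($ref) );
}

# Return the canonical copy of $ref, registering it if it is new.
sub share {
    my ($ref) = @_;
    return $ref unless ref $ref;
    return $registry{ fingerprint($ref) } ||= $ref;
}

# Two rule sets built independently, with identical inner arrays.
my @rules_a = ( [ 1, 2, 3 ], [ 4, 5, 6 ] );
my @rules_b = ( [ 1, 2, 3 ], [ 4, 5, 6 ] );

# Smaller structures first: share each inner array ...
$_ = share($_) for @rules_a, @rules_b;

# ... then the enclosing arrays, which now hold identical inner refs.
my $outer_a = share( \@rules_a );
my $outer_b = share( \@rules_b );

print "inner arrays shared\n" if $rules_a[0] == $rules_b[0];
print "outer arrays shared\n" if $outer_a == $outer_b;
```

Like the patch, this does not attempt to handle recursive structures.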

I've also included a small test script that loads all the time zone modules and
prints some statistics (not a real benchmark, just reference count info for a
full zone load).

Feel free to play with it and perhaps see if any of it can be applied to the new
XS modifications.

Here's the output I get from the tzt.pl test script below:

---

Files/zones loaded: 373
_juxtapose() calls: 2689
 Refs seen: 21843
   Refs shared: 11006
   Refs eliminated: 10837

breakdown by type of ref:
  ARRAY => 10002 (reduced from 18613)
  DateTime::TimeZone::OlsonDB::Rule => 499 (reduced from 1961)
  DateTime => 193 (reduced from 282)
  HASH => 193 (reduced from 282)
  DateTime::TimeZone::OlsonDB::Observance => 117 (reduced from 141)
  DateTime::Locale::en_US => 1 (reduced from 282)
  DateTime::TimeZone::Floating => 1 (reduced from 282)

---
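
As a cross-check on how the figures above relate, the per-type counts add up to the totals: the "after" column sums to the refs shared, the "reduced from" column sums to the refs seen, and the difference is the refs eliminated. A small sketch with the numbers copied from the output (the variable names are only for illustration):

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Per-type counts copied from the output above: [ after, before ].
my %by_type = (
    'ARRAY'                                   => [ 10002, 18613 ],
    'DateTime::TimeZone::OlsonDB::Rule'       => [ 499,   1961 ],
    'DateTime'                                => [ 193,   282 ],
    'HASH'                                    => [ 193,   282 ],
    'DateTime::TimeZone::OlsonDB::Observance' => [ 117,   141 ],
    'DateTime::Locale::en_US'                 => [ 1,     282 ],
    'DateTime::TimeZone::Floating'            => [ 1,     282 ],
);

my $shared = sum map { $_->[0] } values %by_type;
my $seen   = sum map { $_->[1] } values %by_type;

printf "seen=%d shared=%d eliminated=%d\n", $seen, $shared, $seen - $shared;
# prints: seen=21843 shared=11006 eliminated=10837
```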

Cheers,
Matt
Index: lib/DateTime/TimeZone.pm
===
RCS file: /cvsroot/perl-date-time/modules/DateTime-TimeZone/lib/DateTime/TimeZone.pm,v
retrieving revision 1.115
diff -d -u -r1.115 TimeZone.pm
--- lib/DateTime/TimeZone.pm   8 Jul 2005 02:57:58 -   1.115
+++ lib/DateTime/TimeZone.pm   15 Jul 2005 15:55:29 -
@@ -24,6 +24,14 @@
 use constant IS_DST  => 5;
 use constant SHORT_NAME  => 6;
 
+# for hunting down redundant data structures
+use constant JUXTA_THRESHOLD => 25;
+use vars qw(%juxta_data_registry @juxta_data_queue %juxta_type_count
+$juxta_load_count $juxta_invocation_count $juxta_attempts);
+$juxta_load_count = $juxta_invocation_count = $juxta_attempts = 0;
+use Digest::MD5;
+use Data::Dumper;
+
 sub new
 {
 my $class = shift;
@@ -342,6 +350,44 @@
 return $self;
 }
 
+
+# Class methods for hunting down redundant data structures
+
+# (counts are merely for diagnostics)
+sub _juxta_increment_load { ++$juxta_load_count }
+
+sub _juxtapose {
+  # does not handle recursive structures!
+  my $class = shift;
+  ++$juxta_invocation_count;
+  if ($juxta_load_count < JUXTA_THRESHOLD) {
+    push(@juxta_data_queue, @_);
+    return @_ > 1 ? @_ : $_[0];
+  }
+  if ($juxta_load_count == JUXTA_THRESHOLD) {
+    # start tracking redundant structures only when we've loaded
+    # JUXTA_THRESHOLD timezones (and process the zone objects loaded
+    # thus far).
+    $class->_juxta_increment_load;
+    $class->_juxtapose(@juxta_data_queue);
+    @juxta_data_queue = ();
+  }
+  # we're over the threshold, so crunch our args
+  foreach (@_) {
+    ref or next;
+    ++$juxta_attempts;
+    ++$juxta_type_count{ref($_)};
+    my $key = Digest::MD5::md5_hex(Dumper($_));
+    if ($juxta_data_registry{$key}) {
+      $_ = $juxta_data_registry{$key};
+    }
+    else {
+      $juxta_data_registry{$key} = $_;
+    }
+  }
+  @_ > 1 ? @_ : $_[0];
+}
+
 #
 # Functions
 #
Index: tools/parse_olson
===
RCS file: /cvsroot/perl-date-time/modules/DateTime-TimeZone/tools/parse_olson,v
retrieving revision 1.41
diff -d -u -r1.41 parse_olson
--- tools/parse_olson   8 Jul 2005 21:11:21 -   1.41
+++ tools/parse_olson   15 Jul 2005 15:55:29 -
@@ -391,8 +391,14 @@
 
 [EMAIL PROTECTED]::TimeZone::${mod_name}::ISA = ( 'Class::Singleton', 'DateTime::TimeZone' );
 
-my \$spans =
-$spans;
+# Load counts for redundant data structure threshold
+__PACKAGE__->_juxta_increment_load;
+
+# Crunch redundant data structures
+my \$spans = __PACKAGE__->_juxtapose([
+__PACKAGE__->_juxtapose(
+  [EMAIL PROTECTED]
+)]);
 
 sub _spans { \$spans }
 sub max_span { \$spans->[-1] }
@@ -737,46 +743,95 @@
 
 return '' unless $zone->infinite_rules;
 
-my $generator = <<'EOF';
-my $last_observance = !LAST_OBSERVANCE;
-sub _last_observance { $last_observance }
+my $last_observance = ($zone->sorted_changes)[-1]->observance;
 
-my $rules = !RULES;
-sub _rules { $rules }
-EOF
+my @rules = $zone->infinite_rules;
 
-my $last_observ

Re: patch: DateTime::TimeZone PP redundant structures

2005-07-15 Thread Daisuke Maki
Matt, Dave,

I like this...  but I'd like to generalize this juxtapose thing into a
separate package and release it, if it's okay with you guys (especially
Matt)?

Something as simple as this:

   my $j = Juxtapose->new;
   @data = $j->juxtapose(@data);

If ok, do you guys have any suggestions for the name of the module?

Meanwhile I did a massive update of the XS code, so I need to integrate
all that with this. whoof.

--d


Re: patch: DateTime::TimeZone PP redundant structures

2005-07-16 Thread Matt Sisk
Daisuke Maki wrote:

>I like this...  but I'd like to generalize this juxtapose thing into a
>separate package and release it, if it's okay with you guys (especially
>Matt)?

That's fine with me. How about Data::Juxtapose ? You'd probably want to
design it so that you can plug in different fingerprinting methods (as
opposed to limiting it to md5_hex(Dumper()) ).
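
A rough sketch of what a pluggable fingerprint could look like. Sketch::Juxtapose and its "fingerprint" option are hypothetical names used only for illustration; this is not the API of the released Data::Juxtapose module, just one way the idea might be arranged, with md5_hex(Dumper()) as the default.

```perl
use strict;
use warnings;
use Data::Dumper ();
use Digest::MD5 ();

# Hypothetical class, not the released Data::Juxtapose API.
package Sketch::Juxtapose;

sub new {
    my ( $class, %args ) = @_;
    my $self = {
        registry    => {},
        # Caller may supply any coderef mapping a ref to a string key.
        fingerprint => $args{fingerprint} || \&_md5_dumper,
    };
    return bless $self, $class;
}

# Default fingerprint: md5 of the serialized structure.
sub _md5_dumper {
    my ($ref) = @_;
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Indent   = 0;
    return Digest::MD5::md5_hex( Data::Dumper::Dumper($ref) );
}

# Replace each ref with the first structure seen under the same key.
sub juxtapose {
    my ( $self, @data ) = @_;
    for my $item (@data) {
        next unless ref $item;
        my $key = $self->{fingerprint}->($item);
        $item = $self->{registry}{$key} ||= $item;
    }
    return @data;
}

package main;

# Plug in a plain serialized key (only suitable for flat arrays).
my $j = Sketch::Juxtapose->new(
    fingerprint => sub { join ',', @{ $_[0] } },
);
my ( $x, $y ) = $j->juxtapose( [ 1, 2 ], [ 1, 2 ] );
print "shared\n" if $x == $y;
```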

Matt


Data::Juxtapose 0.01 [Re: patch: DateTime::TimeZone PP redundant structures]

2005-07-18 Thread Daisuke Maki
Here's the repackaged Data-Juxtapose.
I added the use of Scalar::Util (if available) just in case.

   http://www.wafu.ne.jp/~daisuke/Data-Juxtapose-0.01.tar.gz

Let me know if there are any problems. If I don't hear anything, I'll
probably upload it to CPAN tomorrow.

--d


Re: Data::Juxtapose 0.01 [Re: patch: DateTime::TimeZone PP redundant structures]

2005-07-18 Thread Matt Sisk
The basic format seems good. I'm not sure about using weak_ref, though
-- what if an object or structure falls out of scope, yet there's still
a key in the data registry? If another ref comes along whose fingerprint
matches that key, does it end up getting replaced with undef ? (I'm kind
of sleepy at the moment, so I could be misreading the code).
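
To illustrate the concern in isolation (this is not the Data-Juxtapose code, and the %registry name and "demo" key are made up), here is what Scalar::Util's weaken() does to a registry slot once the last strong reference goes away:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my %registry;

{
    my $span = [ 1, 2, 3 ];        # the only strong reference
    $registry{demo} = $span;
    weaken( $registry{demo} );     # registry no longer keeps it alive
    print "inside scope: ",
        ( defined $registry{demo} ? "defined" : "undef" ), "\n";
}

# $span is gone, so the weakened slot has been set to undef,
# but the key itself is still present in the hash.
print "after scope: ",
    ( defined $registry{demo} ? "defined" : "undef" ), "\n";
print "key exists: ", ( exists $registry{demo} ? "yes" : "no" ), "\n";
```

So whether a later match gets replaced with undef depends on the lookup: a test for truth or definedness would treat the slot as empty, while a test with exists would hand back undef.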

Also, as originally implemented in the context of DateTime::TimeZone, I
was using class structures and class methods. You've moved it into
object mode, which is fine, but I'm curious as to how you utilize the
shared object. I'm assuming it's a single object in the DT::TZ base class?

Cheers,
Matt



Re: Data::Juxtapose 0.01 [Re: patch: DateTime::TimeZone PP redundant structures]

2005-07-18 Thread Daisuke Maki

> The basic format seems good. I'm not sure about using weak_ref, though
> -- what if an object or structure falls out of scope, yet there's still
> a key in the data registry? If another ref comes along whose fingerprint
> matches that key, does it end up getting replaced with undef ? (I'm kind
> of sleepy at the moment, so I could be misreading the code).

I had a reason for doing that, but I can't seem to come up with a good
one now. Hmm. I suppose it won't hurt to *not* have it, so I guess I can
just remove it.

> Also, as originally implemented in the context of DateTime::TimeZone, I
> was using class structures and class methods. You've moved it into
> object mode, which is fine, but I'm curious as to how you utilize the
> shared object. I'm assuming it's a single object in the DT::TZ base class?

Yeah, that's what I was planning to do: just hold a class variable that
points to a D::J object.
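
A sketch of that arrangement, under stated assumptions: Sketch::Deduper stands in for a D::J object with a toy fingerprint, and Sketch::TimeZone and its subclasses are hypothetical names, not the real DateTime::TimeZone classes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Data::Juxtapose-like object.
package Sketch::Deduper;

sub new { return bless { registry => {} }, shift }

sub juxtapose {
    my ( $self, $ref ) = @_;
    my $key = join ',', @$ref;     # toy fingerprint, flat arrays only
    return $self->{registry}{$key} ||= $ref;
}

# Hypothetical base class holding one shared deduper for all zones.
package Sketch::TimeZone;

my $deduper;                       # class-level variable, created lazily

sub _deduper { return $deduper ||= Sketch::Deduper->new }

sub _juxtapose {
    my ( $class, $ref ) = @_;
    return $class->_deduper->juxtapose($ref);
}

package Sketch::TimeZone::Tokyo;
our @ISA = ('Sketch::TimeZone');

package Sketch::TimeZone::Chicago;
our @ISA = ('Sketch::TimeZone');

package main;

# Both zone classes go through the same object in the base class,
# so equal structures end up as one shared reference.
my $tokyo   = Sketch::TimeZone::Tokyo->_juxtapose(   [ 0, 3600 ] );
my $chicago = Sketch::TimeZone::Chicago->_juxtapose( [ 0, 3600 ] );
print "same structure\n" if $tokyo == $chicago;
```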

--d