Re: Large Data Set In Mod_Perl

2003-06-13 Thread Patrick Mulvany
On Wed, May 28, 2003 at 10:07:39PM -0400, Dale Lancaster wrote:
 For the perl hash, I would key the hash on the combo of planet and date,
 something like:
 
 my %Planets = (
 
     jupiter => {
         '1900-01-01' => [ "5h 39m 18s", "+22° 4.0'",
                           28.922, -15.128, -164.799, "set" ],
         '1900-01-02' => [ "5h 39m 18s", "+22° 4.0'",
                           28.922, -15.128, -164.799, "set" ],
     },
 
     neptune => {
         '1900-01-01' => [ "5h 39m 18s", "+22° 4.0'",
                           28.922, -15.128, -164.799, "set" ],
         '1900-01-02' => [ "5h 39m 18s", "+22° 4.0'",
                           28.922, -15.128, -164.799, "set" ],
     },
 );

my $Planets = {
    jupiter => {
        1900 => {
            '01' => {
                '01' => 1,   # record number in a file
                '02' => 2,
            },
            '02' => { ... },
        },
    },
};


This would not require the entire data set to be stored in memory, just an offset 
to a file position that could be accessed randomly.

However, if I ever heard of a case for a fixed-width ASCII file with padding 
records, this is it.

If you had one file per planet, and assuming that you wanted to start at 1900-01-01:

my $record_width=90;
my $offset = (($year-1900)*372+(($month-1)*31)+($day-1))*$record_width; 
# 1900-01-01 would be offset 0
# 2003-06-13 would be offset 3463470

This format would require blank records to be inserted for non-existent dates such as 
1900-02-30, but a simple script could auto-generate the file.

One advantage of this approach is that the OS would keep the read-only file in its file cache.
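
A minimal sketch of such a lookup; the jupiter.dat file name is illustrative, and
how the 90-byte record is split into fields is up to whatever script generates the file:

use strict;
use warnings;

my $record_width = 90;

sub lookup_record {
    my ($planet, $year, $month, $day) = @_;
    my $offset = (($year - 1900) * 372 + ($month - 1) * 31 + ($day - 1))
                 * $record_width;
    open my $fh, '<', "$planet.dat" or die "Can't open $planet.dat: $!";
    seek $fh, $offset, 0 or die "Can't seek to $offset: $!";   # 0 = SEEK_SET
    my $record;
    read $fh, $record, $record_width;
    close $fh;
    return $record;    # fixed-width record; unpack/split as needed
}

my $rec = lookup_record('jupiter', 2003, 6, 13);

Under mod_perl the per-planet filehandles could be opened once at server startup
and reused, avoiding an open() on every request.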

Just my thoughts, hope it helps.

Paddy 


Re: Large Data Set In Mod_Perl

2003-06-13 Thread Perrin Harkins
On Fri, 2003-06-13 at 12:02, Patrick Mulvany wrote:
 However, if I ever heard of a case for a fixed-width ASCII file with padding 
 records, this is it.

Why make your life difficult?  Just use a dbm file.
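
A minimal sketch of the dbm approach, assuming the file has already been built
with a planet|date key and pipe-joined columns (the file name and field order
are illustrative):

use strict;
use warnings;
use Fcntl;
use SDBM_File;   # DB_File or GDBM_File would work the same way here

tie my %planets, 'SDBM_File', 'planets', O_RDONLY, 0644
    or die "Can't tie planets dbm: $!";

# each value was stored as join('|', @columns) when the file was built
my @fields   = split /\|/, $planets{'mercury|1900-01-01'};
my $distance = $fields[2];   # assuming distance is the third column

untie %planets;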

- Perrin


RE: Large Data Set In Mod_Perl

2003-05-30 Thread Marc M. Adkins
  perhaps something such as copying the whole 800,000 rows to
  memory (as a hash?) on apache startup?

 That would be the fastest by far, but it will use a boatload of RAM.
 It's pretty easy to try, so test it and see if you can spare the RAM it
 requires.

Always one of my favorite solutions to this sort of problem (dumb and fast)
but in mod_perl won't this eat RAM x number of mod_perl threads???  In this
case one of the advantages of the DBMS is that it is one copy of the data
that everyone shares.

mma



RE: Large Data Set In Mod_Perl

2003-05-30 Thread Perrin Harkins
On Thu, 2003-05-29 at 11:59, Marc M. Adkins wrote:
   perhaps something such as copying the whole 800,000 rows to
   memory (as a hash?) on apache startup?
 
  That would be the fastest by far, but it will use a boatload of RAM.
  It's pretty easy to try, so test it and see if you can spare the RAM it
  requires.
 
 Always one of my favorite solutions to this sort of problem (dumb and fast)
 but in mod_perl won't this eat RAM x number of mod_perl threads???

No.  If you load the data during startup (before the fork) it will be
shared unless you modify it.
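
A minimal sketch of that pattern; the package name, file path, and pipe-delimited
layout are all illustrative, not taken from the thread:

# startup.pl, loaded via PerlRequire before Apache forks its children
package PlanetData;

use strict;
use warnings;

our %PLANETS;   # read-only after startup, so the pages stay shared

open my $fh, '<', '/path/to/planets.txt'
    or die "Can't open planet data: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($planet, $date, @cols) = split /\|/, $line;
    $PLANETS{"$planet|$date"} = \@cols;
}
close $fh;

1;

Request handlers then read %PlanetData::PLANETS without writing to it, which
keeps the copy-on-write pages shared.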

- Perrin


RE: Large Data Set In Mod_Perl

2003-05-30 Thread Marc M. Adkins
 On Thu, 2003-05-29 at 12:59, Marc M. Adkins wrote:
  That's news to me (not being facetious).  I was under the impression that
  cloning Perl 5.8 ithreads cloned everything, that there was no sharing of
  read-only data.

 We're not talking about ithreads here, just processes.  The data is
 shared by copy-on-write.  It's an OS-level feature.  See the mod_perl
 docs for more info.

My original comment was regarding threads, not processes.  I run on Windows
and see only two Apache processes, yet I have a number of Perl interpreters
running in their own ithreads.  My understanding of Perl ithreads is that
while the syntax tree is reused, data stored in the parent ithread is
cloned.

In addition, since I'm on Windows, I'm not convinced that the type of
OS-level code sharing you're talking about is in fact done.  Windows doesn't
fork().

mma



RE: Large Data Set In Mod_Perl

2003-05-30 Thread Perrin Harkins
On Thu, 2003-05-29 at 13:10, Marc M. Adkins wrote:
 My original comment was regarding threads, not processes.  I run on Windows
 and see only two Apache processes, yet I have a number of Perl interpreters
 running in their own ithreads.  My understanding of Perl ithreads is that
 while the syntax tree is reused, data stored in the parent ithread is
 cloned.

Remember, this is an OS-level feature.  Perl doesn't have to do
anything.  The OS keeps track of the fact that the pages in memory have
not been touched since the fork and doesn't actually bother to copy
them.

 In addition, since I'm on Windows, I'm not convinced that the type of
 OS-level code sharing you're talking about is in fact done.  Windows doesn't
 fork().

It's not about forking, it's about having a modern virtual memory
system.  Windows definitely has this feature.

- Perrin


Re: Large Data Set In Mod_Perl

2003-05-30 Thread Ranga Nathan
Perrin Harkins wrote:

 simran wrote:
 
  I need to be able to say:
 
  * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_
 
 On the face of it, a relational database is best for that kind of query.
 However, if you won't get any fancier than that, you can get by with
 MLDBM or something similar.
 
  Currently i do this using a postgres database, however, my question is,
  is there a quicker way to do this in mod_perl - would a DB_File or some
  other structure be better?

Query speed comes into question only when there is heavy use. 
PostgreSQL has an EXPLAIN facility via psql: just add EXPLAIN before 
the query and you will get the cost of the query. By creating proper 
indexes you can get good optimization. What if you add a table later and 
need to join it with the planet table? If you keep your planet 
data somewhere else, then access becomes cumbersome as well as 
slower. There are many ways to speed up PostgreSQL. I recommend the 
PostgreSQL book by Korry and Susan Douglas; I got it from Barnes and 
Noble. IMHO, stay with the relational database you are on and find ways 
to optimize.
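
A rough sketch of what that looks like from DBI; the DSN, credentials, and the
planet_data table and column names are assumptions based on the sample data,
not the poster's actual schema:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=planets', 'user', 'password',
                       { RaiseError => 1 });

# a composite index on (planet, date) lets the lookup hit a single index entry
$dbh->do('CREATE INDEX planet_date_idx ON planet_data (planet, date)');

# EXPLAIN reports the planner's cost estimate for the lookup query
my $plan = $dbh->selectall_arrayref(
    "EXPLAIN SELECT distance FROM planet_data
     WHERE planet = 'mercury' AND date = '1900-01-01'"
);
print "$_->[0]\n" for @$plan;

$dbh->disconnect;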
 A DBM file will be faster.  What you can do is build a key out of planet 
 + date, so that you grab the right record with a single access.  Either 
 use MLDBM for storing hashes inside each record, or just a simple 
 join/split approach.

This would be a good idea if you are implementing your own tool and know 
what limitations you will be subject to.

 MySQL would probably also be faster than PostgreSQL for this kind of 
 simple read-only querying, but not as fast as a DBM file.  SDBM_File is 
 the fastest DBM around, if you can live with the space limitations it has.
 
  perhaps something such as copying the whole 800,000 rows to
  memory (as a hash?) on apache startup?
 
PostgreSQL may have a way to 'stick' a table in memory like MySQL.

 That would be the fastest by far, but it will use a boatload of RAM. 
 It's pretty easy to try, so test it and see if you can spare the RAM it 
 requires.

- Perrin






Re: Large Data Set In Mod_Perl

2003-05-29 Thread Perrin Harkins
simran wrote:
 I need to be able to say:
 
 * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_
 
On the face of it, a relational database is best for that kind of query. 
 However, if you won't get any fancier than that, you can get by with 
MLDBM or something similar.

 Currently i do this using a postgres database, however, my question is,
 is there a quicker way to do this in mod_perl - would a DB_File or some
 other structure be better?
 
A DBM file will be faster.  What you can do is build a key out of planet 
+ date, so that you grab the right record with a single access.  Either 
use MLDBM for storing hashes inside each record, or just a simple 
join/split approach.
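
A minimal sketch of the MLDBM variant, assuming the file was built beforehand
with a planet|date key and a hashref per record (the file and field names are
illustrative):

use strict;
use warnings;
use Fcntl;
use MLDBM qw(DB_File Storable);   # underlying DBM and serializer

tie my %planets, 'MLDBM', 'planets.db', O_RDONLY, 0644
    or die "Can't tie planets.db: $!";

# each record is a hashref serialized into the DBM value
my $rec      = $planets{'mercury|1900-01-01'};
my $distance = $rec->{distance};

untie %planets;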

MySQL would probably also be faster than PostgreSQL for this kind of 
simple read-only querying, but not as fast as a DBM file.  SDBM_File is 
the fastest DBM around, if you can live with the space limitations it has.

 perhaps something such as copying the whole 800,000 rows to
 memory (as a hash?) on apache startup?
 
That would be the fastest by far, but it will use a boatload of RAM. 
It's pretty easy to try, so test it and see if you can spare the RAM it 
requires.

- Perrin



Re: Large Data Set In Mod_Perl

2003-05-29 Thread Dale Lancaster
I've dealt with fairly large sets, but not as static as yours.  If your only
keys for searching are planet and date, then a perl lookup with a hash will
be faster overall, since a DB lookup involves connecting to the database and
doing the standard prepare/execute/fetch, which could be as costly (for a
single lookup) as the lookup itself.  The actual lookup of the record in the
database is probably as fast or faster than Perl (especially after the
initial lookup that primes the caches) if you have indexed the columns on
the table properly.

If you are planning to do lots of lookups on this dataset, preloading the
dataset into a perl hash would definitely be the better approach.  If you are
doing only a few lookups over a given period, it may not be worth taking up
lots of memory for no reason, and sticking with the db lookup would probably
be best.

For the perl hash, I would key the hash on the combo of planet and date,
something like:

my %Planets = (

    jupiter => {
        '1900-01-01' => [ "5h 39m 18s", "+22° 4.0'",
                          28.922, -15.128, -164.799, "set" ],
        '1900-01-02' => [ "5h 39m 18s", "+22° 4.0'",
                          28.922, -15.128, -164.799, "set" ],
    },

    neptune => {
        '1900-01-01' => [ "5h 39m 18s", "+22° 4.0'",
                          28.922, -15.128, -164.799, "set" ],
        '1900-01-02' => [ "5h 39m 18s", "+22° 4.0'",
                          28.922, -15.128, -164.799, "set" ],
    },
);

You could also just combine the planet and date into a single string for the
hash key, like "jupiter1900-01-01", but I'm not really sure this buys you any
performance - it might even be slightly slower, since it's working on a much
larger single hash rather than a two-dimensional hash - it might be interesting
to benchmark it on your size of dataset to see what really happens.  As to
using DB_File, it would probably fall somewhere between the Perl hash approach
and using the standard SQL database interface.
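
A rough sketch of such a benchmark with the core Benchmark module; the generated
data is fake and only there to give the hashes some volume:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my (%nested, %flat);
for my $planet (qw(mercury venus mars jupiter)) {
    for my $n (1 .. 1000) {
        my $date = sprintf '1900-01-%03d', $n;   # fake dates, just for volume
        my $row  = [ "5h 39m 18s", "+22° 4.0'", 28.922, -15.128, -164.799, "set" ];
        $nested{$planet}{$date} = $row;
        $flat{"$planet$date"}   = $row;
    }
}

cmpthese(-2, {
    nested => sub { my $r = $nested{jupiter}{'1900-01-500'} },
    flat   => sub { my $r = $flat{'jupiter1900-01-500'} },
});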

dale

- Original Message - 
From: simran [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, May 28, 2003 9:29 PM
Subject: Large Data Set In Mod_Perl


 Hi All,

 For one of the websites i have developed (/am developing), i have a
 dataset that i must refer to for some of the dynamic pages.

 The data is planetary data that is pretty much in spreadsheet format,
 aka, i have just under 800,000 rows of data. I don't do any complex
 searches or functions on the data. I simply need to look at certain
 columns at certain times.

 sample data set:

  planet  |    date    | right_ascension | declination | distance | altitude | azimuth  | visibility
 ---------+------------+-----------------+-------------+----------+----------+----------+------------
  jupiter | 1900-01-01 | 15h 57m 7s      | -19° 37.2'  |    6.108 |   10.199 |   39.263 | up
  mars    | 1900-01-01 | 19h 2m 20s      | -23° 36.7'  |    2.401 |   14.764 |    -4.65 | up
  mercury | 1900-01-01 | 17h 15m 16s     | -21° 59.7'  |    1.151 |   14.041 |   20.846 | up
  moon    | 1900-01-01 | 18h 41m 17s     | -21° 21.8'  |     58.2 |   17.136 |    0.343 | transit
  neptune | 1900-01-01 | 5h 39m 18s      | +22° 4.0'   |   28.922 |  -15.128 | -164.799 | set


 I need to be able to say:

 * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_

 Currently i do this using a postgres database, however, my question is,
 is there a quicker way to do this in mod_perl - would a DB_File or some
 other structure be better?

 I would be interested in knowing if others have dealt with large data
 sets as above and what solutions they have used.

 A DB is quick, but is there something one can use in mod_perl that would
 be quicker? perhaps something such as copying the whole 800,000 rows to
 memory (as a hash?) on apache startup?

 simran.





Re: Large Data Set In Mod_Perl

2003-05-29 Thread Ged Haywood
Hi there,

On Wed, 28 May 2003, Perrin Harkins wrote:

 simran wrote:
[snip]
  * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_ 
[snip]
 you can get by with MLDBM or something similar.

You might also want to investigate using a compiled C Btree library which
could be tuned specifically to your dataset.  Hard work.

[snip]
  perhaps something such as copying the whole 800,000 rows to memory
[snip]
 That would be the fastest by far, but it will use a boatload of RAM. 

To economise on memory you could compress the data (or part of it)
before storage/lookup using a fast compress/decompress algorithm.
There would be a tradeoff between memory consumption and processor
cycles of course.  That kind of thing can get a bit complicated... :)
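
A minimal sketch of that tradeoff using Compress::Zlib, with the records kept in
a plain in-memory hash and a pipe-joined value format that is purely illustrative:

use strict;
use warnings;
use Compress::Zlib;   # exports compress() and uncompress()

my %planets;

# store: compress the joined record before keeping it in memory
sub store_record {
    my ($planet, $date, @cols) = @_;
    $planets{"$planet|$date"} = compress(join '|', @cols);
}

# fetch: decompress on lookup, trading CPU cycles for RAM
sub fetch_record {
    my ($planet, $date) = @_;
    return split /\|/, uncompress($planets{"$planet|$date"});
}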

73,
Ged.