Re: [Boston.pm] emergency social meeting

2004-11-12 Thread Aaron Sherman
On Fri, 2004-11-12 at 10:26 -0500, Uri Guttman wrote:
> > "bdf" == brian d foy <[EMAIL PROTECTED]> writes:
> 
>   bdf> I'm in Boston from Nov 15-18.

> for those of you who don't know, bdf teaches for stonehenge (randal's
> biz) and is the founder of perl mongers (which you are a member of!) so
> let's give him his due props with a nice emergency social meeting.

Yeah, yeah... whatever. How cool is his camera?!


___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Aaron Sherman
On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
> On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
> [...]
> > this talk about mmap makes little sense to me. it may save some i/o and
> > even some buffering but you still need the ram and mmap still causes
> > disk accesses.

Just to throw in my own two cents before I critique the reply: "some
disk buffering" can mean a factor of 10-1000 performance improvement in
real-world applications; that is my personal experience with real-world
programs. Of course, if all you want is one linear pass over a file,
then mmap doesn't help. But if you want random access to a file,
nothing beats mmap, because people spend their LIVES tuning paging
strategies, and that code has knowledge of the hardware that you cannot
otherwise take advantage of in a general-purpose I/O layer.
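(To make the paging point concrete, here's a minimal sketch of random access through a memory-mapped file -- in Python rather than Perl, purely to keep it short and self-contained. The scratch file and its contents are made up for the demo; the point is that "access" is plain indexing, and the kernel's page cache does the rest.)

```python
import mmap
import os
import tempfile

# Create a scratch file to map (hypothetical data, just for the demo).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * 1_000_000)

with open(path, "rb") as f:
    # Map the whole file read-only. Nothing is read yet; pages are
    # faulted in on demand, using the kernel's tuned paging strategy.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first, last = mm[0], mm[len(mm) - 1]   # two "random" accesses
    mm.close()

os.remove(path)
assert first == ord("x") and last == ord("x")
```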

> Um, mmap does not (well should not - Windows may vary) use any
> RAM

You are confusing two issues. "using RAM" is not the same as "allocating
process address space". Allocating process address space is, of course,
required for mmap (same way you allocate address space when you load a
shared library, which is also mmap-based under Unix and Unix-like
systems). All systems have to limit address space at some point. Linux
caps it at 3GB up through 2.6.x, where it becomes more configurable and
can be as large as 3.5GB, I think.

To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB
file, assuming you allowed for over-committed allocation in the kernel
(assuming Linux... filthy habit, I know).

> mmap should not cause any more or less disk accesses than
> reading from the file in the same pattern should have.  It just lets
> you do things like use Perl's RE engine directly on the file
> contents.

Actually, no, it doesn't, as far as I know (unless the copy-on-write code
got MUCH better recently).

Like I said, you probably won't get the win out of mmap in Perl that you
would expect. In Parrot you would, but that's another story.




Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Aaron Sherman
On Fri, 2004-11-12 at 21:47 -0500, William Ricker wrote:
> > This is at best 2/3 correct.
>  
> > First you're right that mmap has a 2 GB limit because it maps
> > things into your address space, and so the size of your pointers
> > limit what you can address.
> (unless you have 64bit pointers of course)

No, even without 64-bit pointers, you can have a 4GB address space
(pointers aren't signed). The trick is that under Linux you're usually
limited to 3GB, because the rest is reserved, and other OSes impose
similar limitations.

I have worked with an application that allocates about 2.5GB of RAM on
startup, so I have occasion to know this ;-)




Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread William Ricker
> Well mmap is a Unix concept.  

IIRC, it's a Multics concept that Brian and Dennis presumably took south to NJ
with them; whether it was original with Multics or is even older, I'd have to
check with the retired Multicians list or an archeobibliography.

> To the best of my knowledge
> it is not natively supported in Windows. 

Right. The ActiveState module repository does not include mmap.pm builds for
Windows, only for several *nix platforms.

MKS rocks, if you need commercial grade support for *nix-on-winDos. Cygwin is 
fine if you don't need 800# support and don't want to manage the WINDOS ENV 
from KSH for launching those closed source programs from a nice shell.

> Like him, I have no idea why pagefile.sys would enter into the
> picture.  It certainly doesn't on Linux.


It oughtn't, but a lame enough emulation *of the interface* on a lame "os"
might have to copy the whole thing into the swapfile instead of doing a
memory-mapped file operation - in which case, kiss the efficiency goodbye.

Bill


---
William Ricker [EMAIL PROTECTED]



Re: [Boston.pm] emergency social meeting

2004-11-12 Thread William Ricker

> i don't know how mobile or boston road aware brian is, so

So the key questions for brian d foy are, where are you staying and would you 
like to get further afield or stay close to base?  

If bdf's non-mobile but would like to explore, perhaps a mobile monger can 
share him a camel ride.

Do we get a better turnout for a social when we're subway accessible, or do 
enough suburbanites show up out Rt9 that no one misses us urban trolley 
dodgers? If so, it's good to do some out there too.

(Given a free choice, Cambridge Brew and Boston Beer Works (either site) would 
get my vote, but there's plenty of good drinking around.)

bill

---
William Ricker [EMAIL PROTECTED]



[Boston.pm] new edition of book available

2004-11-12 Thread Greg London
Hey all. For those of you who were at the monger meeting,
I plugged my sci-fi book "Hunger Pangs" and said a new
edition was coming out this week. Well, it's out as
of last night. Check out the preview (part of which
is licensed CreativeCommons-NonCommercial) and buy the
book if you like hard sf or military sf.

"Hunger Pangs"

http://www.greglondon.com/hunger/




Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Ben Tilly
On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
[...]
> this talk about mmap makes little sense to me. it may save some i/o and
> even some buffering but you still need the ram and mmap still causes
> disk accesses.

Um, mmap does not (well should not - Windows may vary) use any
RAM (other than what the memory manager needs to keep track of
the fact that the mapping has happened).  Using mmap does not
imply any particular algorithm; it is an optimization saying that
you're going to leave the ugly details of paging the file to/from your
process to the OS.

mmap should not cause any more or less disk accesses than
reading from the file in the same pattern should have.  It just lets
you do things like use Perl's RE engine directly on the file
contents.

> if the original file is too big for ram then the
> algorithm chosen must be one to minimize disk accesses and mmap doesn't
> save those. this is why disk/tape sorts were invented, to minimize the
> slow disk and tape accesses. so you would still need my algorithm or
> something similar regardless of how you actually get the data from disk
> to ram. and yes i have used mmap on many projects.

I'm not sure what you mean by "something similar", but yes,
you'll need SOME algorithm to solve the problem.  Which
statement is so general as to be meaningless.  I'm sure that
there are some possible algorithms that you'd never have
thought of.  (Mostly because they're bad.)

Disk/tape sorts were invented because back in the day there
was not enough RAM to do anything useful and so everything
had to go to disk.  Of course once you're forced to go to disk,
why not optimize it...?

Of course, this problem asked us to guarantee being able to do the
sort, not necessarily to do it most efficiently.  Therefore no
single criterion - including disk accesses - necessarily MUST
dominate your choice.  Furthermore, disk accesses are not
created equal.  There are multiple levels of cache between
you and the disk.  Accessing data in a way that is friendly to cache
will improve performance greatly.

In particular managing to access data sequentially is orders of
magnitude faster than jumping around.  The key is not how
often you "access disk", it is how often your hard drive has to
do a seek.  When it does seek, the drive reads far more data than
it was asked for and puts that in cache.  When you read
sequentially, most of your accesses come from cache, not
disk.  That is why databases use merge sort so much: it
accesses data in exactly the way that hard drives are designed
to be accessed most efficiently.  A quicksort has fewer disk
accesses, but far more of them cause an unwanted seek.
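(A toy illustration of that point, in Python rather than Perl and not from the thread: a k-way merge of pre-sorted runs reads each run strictly front to back -- exactly the sequential pattern drives and caches reward -- yet still yields fully sorted output.)

```python
import heapq

# Three sorted "runs", standing in for sorted on-disk chunks.
runs = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]

# heapq.merge consumes each run strictly in order (sequential access),
# holding only one element per run in memory at a time.
merged = list(heapq.merge(*runs))
assert merged == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```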

> when analyzing algorithm efficiency you must work out which is the
> slowest operation that has the steepest growth curve and work on
> minimizing it. since disk access is so much slower than ram access it
> becomes the key element rather than the classic comparison in sorts. in

You must, must, must.  What is this preoccupation with must?

As I just pointed out, disk accesses are not all equal.

Secondly in many applications you will *parallelize* the
slowest step, not minimize it.  For instance good databases
not only like to use mergesort internally, they often distribute
the job to several processes or threads that all work at once,
that way if one process is waiting on a disk read, others may
be going at the same time.

Thirdly, and most importantly, it is more important to make
code work than to make it efficient.  If a stupid solution will
work and a smart one should be faster, code the stupid
solution first.

> a matrix transposition in ram, i would count the matrix accesses and/or
> copies of elements. with a larger matrix, then ram accesses would be
> key. my solution would load as much matrix into ram as possible (maybe
> using mmap but that is not critical anymore) and transpose it. then
> write the section out. that is 2 (large) disk accesses per chunk (or 1
> per disk block). then you do a merge (assuming you can access all the
> section files at one time) which is another disk access per section (or
> block). and one more to write out the final matrix (in row order). so
> that is O((2 + 2) * section_count) disk accesses which isn't too bad.

You said that you want to assume that we can access all section
files at once.  Well suppose that I take a CSV file which is 100
columns by 10 million rows, transpose it, then try to transpose it
again.  Your assumption just broke.  Maybe it would work for the
person with the original problem, maybe not.

Here is the outline of a solution that avoids all such assumptions.

1. Run through the CSV file and output a file of lines of the format:
  $column,$row:$field
You'll need to encode embedded newlines some way, for instance
s/\\//g; s/\n/\\n/g; - you may also want to pre-pad the columns and
rows with some number of 0's so that an ASCII-betical sort does the
right thing.
[...]
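The archive cuts the outline off here, but the remaining steps are presumably: sort the emitted lines, then stream through them reassembling rows. A toy Python sketch of the whole scheme (everything in memory here, but each phase is a single sequential pass, so the sort can be pushed out to sort(1) or a disk-based merge sort; all names and data are illustrative):

```python
import csv
import io

raw = "a,b,c\nd,e,f\n"          # toy CSV standing in for the large file

# Phase 1: emit one "key:field" record per cell, keyed (column, row),
# zero-padded so a plain lexical (ASCII-betical) sort orders them right.
records = []
for r, row in enumerate(csv.reader(io.StringIO(raw))):
    for c, field in enumerate(row):
        records.append(f"{c:06d},{r:06d}:{field}")

# Phase 2: the sort groups each output row's cells together, in order.
# (Externally, this is the step handed off to a disk-based merge sort.)
records.sort()

# Phase 3: one sequential pass reassembles the transposed rows.
out, current, cells = [], None, []
for rec in records:
    key, field = rec.split(":", 1)
    col = key.split(",")[0]
    if col != current:
        if cells:
            out.append(cells)
        current, cells = col, []
    cells.append(field)
out.append(cells)

assert out == [["a", "d"], ["b", "e"], ["c", "f"]]
```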

Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Ben Tilly
On Fri, 12 Nov 2004 09:01:32 -0500, Tolkin, Steve <[EMAIL PROTECTED]> wrote:
> I think there may be a more restrictive limit, at least on Windows.
> The OS must be able to find a contiguous block
> of virtual memory, i.e. in pagefile.sys.
> The paging file may not be able to grow, (depending on
> how it is configured) and there may not be a large enough block.

Given how mmap is supposed to work, I'd doubt this diagnosis.
I'm not saying that you're wrong, I'm just saying that this would
be a strange limitation.

What mmap is supposed to do is cause a section of your
address space to transparently be a file on disk.  It should not
actually read that data in; that is done for you, on demand, when
you access the data.  (I guess an implementation could do that, but
the Unix ones certainly don't.)

However it does need to find a contiguous block of address
space in your process memory space to map that data into.

> I would like to learn more about the exact situation of memory mapping
> files
> on Windows -- the above is just based on a hour of Googling
> and the info below.

Well mmap is a Unix concept.  To the best of my knowledge
it is not natively supported in Windows.  If you have it, it will
be implemented by whatever Unix emulation you're using.
For instance if you're using MKS then see
http://mkssoftware.com/docs/man3/mmap.3.asp, if you're
using cygwin see your local cygwin documentation.

It may be that the emulations have additional limits that I
don't know about.  (I don't use Windows.)

> Non perl related info follows:
> I hit this limitation in the otherwise excellent disk indexing program
> Wilbur at http://wilbur.redtree.com which is free (as in beer) and open
> source too.
> (But for Windows only.)
> 
> It uses memory mapped files and when one of the indexes exceeds
> about 500 MB it says something like "unable to map view of a file"
> even though my pagefile.sys is 1536 MB.
> 
> The developer said:
> This is a system message that occurs when Wilbur is unable to memory map
> one
> of its index files and due to the way memory mapping works on Windows, I
> think
> this is normally a symptom of insufficient virtual memory space.
> Possible
> solutions might be increasing the size of your paging file (dig through
> the
> performance options on the system control panel to find this),
> defragmenting
> the disk and of course adding more real memory.

Googling for that, I see him saying the same thing at

http://wilbur.redtree.com/cgi-bin/wilburtalk.pl?noframes;read=1770

Like him, I have no idea why pagefile.sys would enter into the
picture.  It certainly doesn't on Linux.

Cheers,
Ben


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Ben Tilly
On Fri, 12 Nov 2004 07:38:57 -0500, Gyepi SAM <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
[...]
> I think mmap would be just as ideal in Perl and a lot less work too.
> Rather than indexing and parsing a *large* file, you just mmap
> and parse it. In fact, the CSV code, which was left as an exercise in your
> pseudo-code, would be the only code required.

It depends on your definition of ideal.  A Perl string is far more
complex than a C string, and translating between the two adds
complexity.  It requires an external module and adds platform
dependencies.

> I should point out though that mmap has a 2GB limit on systems
> without 64bit support. Such systems can't store files larger than
> that anyhow.

This is at best 2/3 correct.

First you're right that mmap has a 2 GB limit because it maps
things into your address space, and so the size of your pointers
limit what you can address.

It is also correct that there are complications in handling large
files on 32 bit systems.  Most operating systems didn't handle
that case.

However today most 32 bit operating systems have support
for large files, and Perl added the necessary hooks to take
advantage of it several versions ago.  So if you have a
relatively up to date system, odds are very good that you
don't have a 2 GB limit.  Certainly not on Windows or Linux.

Cheers,
Ben


[Boston.pm] emergency social meeting

2004-11-12 Thread Uri Guttman

> "bdf" == brian d foy <[EMAIL PROTECTED]> writes:

  bdf> I'm in Boston from Nov 15-18.

brian d foy wants to hang with boston.pm some night next week. i
mentioned this a little while ago but nothing happened so it is time to
get it scheduled. how is next tuesday, nov 16 at 7pm? we could meet at
the steakhouse BEHIND legal's on rte 9 (i was told it was decent with
no waiting). i don't know how mobile or boston road aware brian is, so
we could also meet in town if that works out.

for those of you who don't know, bdf teaches for stonehenge (randal's
biz) and is the founder of perl mongers (which you are a member of!) so
let's give him his due props with a nice emergency social meeting.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Uri Guttman
> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:

  GS> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
  >> Seriously, while mmap is ideal in C, in Perl I would just build an array
  >> of tell()s for each line in the file and then walk through the lines,
  >> storing the offset of the last delimiter that I'd seen.

  GS> I think mmap would be just as ideal in Perl and a lot less work too.
  GS> Rather than indexing and parsing a *large* file, you just mmap
  GS> and parse it. In fact, the CSV code, which was left as an exercise in your
  GS> pseudo-code, would be the only code required.

  GS> I should point out though that mmap has a 2GB limit on systems
  GS> without 64bit support. Such systems can't store files larger than
  GS> that anyhow.

  >> Let the kernel file buffer do your heavy lifting for you.

  GS> Exactly, if s/kernel file/mmap/

this talk about mmap makes little sense to me. it may save some i/o and
even some buffering but you still need the ram and mmap still causes
disk accesses. if the original file is too big for ram then the
algorithm chosen must be one to minimize disk accesses and mmap doesn't
save those. this is why disk/tape sorts were invented, to minimize the
slow disk and tape accesses. so you would still need my algorithm or
something similar regardless of how you actually get the data from disk
to ram. and yes i have used mmap on many projects.

when analyzing algorithm efficiency you must work out which is the
slowest operation that has the steepest growth curve and work on
minimizing it. since disk access is so much slower than ram access it
becomes the key element rather than the classic comparison in sorts. in
a matrix transposition in ram, i would count the matrix accesses and/or
copies of elements. with a larger matrix, then ram accesses would be
key. my solution would load as much matrix into ram as possible (maybe
using mmap but that is not critical anymore) and transpose it. then
write the section out. that is 2 (large) disk accesses per chunk (or 1
per disk block). then you do a merge (assuming you can access all the
section files at one time) which is another disk access per section (or
block). and one more to write out the final matrix (in row order). so
that is O((2 + 2) * section_count) disk accesses which isn't too bad.
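(A toy sketch of the chunked scheme above, in Python rather than Perl just to keep it short. The "section files" are plain lists here, and chunk_rows stands in for "as much matrix as fits in RAM"; both are illustrative, not from the original post.)

```python
# Toy chunked transpose: split rows into chunks that "fit in RAM",
# transpose each chunk, then merge sections column-wise.
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
chunk_rows = 2   # pretend only two rows fit in memory at once

sections = []
for i in range(0, len(matrix), chunk_rows):
    chunk = matrix[i:i + chunk_rows]
    # Transpose the in-RAM chunk (this is what would go to a section file).
    sections.append([list(col) for col in zip(*chunk)])

# The "merge": row r of the output is row r of each section, concatenated.
result = [sum((s[r] for s in sections), []) for r in range(len(sections[0]))]
assert result == [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```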

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


RE: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Tolkin, Steve
I think there may be a more restrictive limit, at least on Windows.  
The OS must be able to find a contiguous block 
of virtual memory, i.e. in pagefile.sys.
The paging file may not be able to grow (depending on
how it is configured), and there may not be a large enough block.

I would like to learn more about the exact situation of memory mapping
files on Windows -- the above is just based on an hour of Googling
and the info below.

Non-Perl-related info follows:
I hit this limitation in the otherwise excellent disk indexing program
Wilbur at http://wilbur.redtree.com, which is free (as in beer) and open
source too (but for Windows only).

It uses memory mapped files and when one of the indexes exceeds
about 500 MB it says something like "unable to map view of a file"
even though my pagefile.sys is 1536 MB.

The developer said:
This is a system message that occurs when Wilbur is unable to memory map
one of its index files and, due to the way memory mapping works on
Windows, I think this is normally a symptom of insufficient virtual
memory space. Possible solutions might be increasing the size of your
paging file (dig through the performance options on the system control
panel to find this), defragmenting the disk and of course adding more
real memory.

But I already have a paging file of 1536 MB, which is the recommended
size in Windows XP (3 times my physical memory of 512 MB).

I also do not think that defragmenting the disk helps, except possibly
if done at boot time to defrag pagefile.sys. (However, I did do that
and it still failed.)

I was able to work around the problem by putting fewer words
in the index, e.g. kept the default of min length = 3 and no numbers.
I also have lots more files than most people, so I think very few people
will hit this limitation.


Hopefully helpfully yours,
Steve
-- 
Steve TolkinSteve . Tolkin at FMR dot COM   617-563-0516 
Fidelity Investments   82 Devonshire St. V4D Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.



-Original Message-
From: Gyepi SAM [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 12, 2004 7:39 AM
To: [EMAIL PROTECTED]
Subject: Re: [Boston.pm] transposing rows and columns in a CSV file


On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
> Seriously, while mmap is ideal in C, in Perl I would just build an
> array of tell()s for each line in the file and then walk through the lines,
> storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too.
Rather than indexing and parsing a *large* file, you just mmap
and parse it. In fact, the CSV code, which was left as an exercise in your
pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems
without 64bit support. Such systems can't store files larger than
that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient.
--me



Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Gyepi SAM
On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
> Seriously, while mmap is ideal in C, in Perl I would just build an array
> of tell()s for each line in the file and then walk through the lines,
> storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too.
Rather than indexing and parsing a *large* file, you just mmap
and parse it. In fact, the CSV code, which was left as an exercise in your
pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems
without 64bit support. Such systems can't store files larger than
that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient.
--me