Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-15 Thread Ben Tilly
On Mon, 15 Nov 2004 18:46:15 -0500 (EST), Dan Sugalski <[EMAIL PROTECTED]> 
wrote:
> On Mon, 15 Nov 2004, Ben Tilly wrote:
> 
> > On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman
> > <[EMAIL PROTECTED]> wrote:
> > > On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote:
> > > > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> 
> > > > wrote:
> > > > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
> [Massive snippage of two ships passing in the night]

Heh. :-)

> > > > In Perl I'd expect it to be possible but fragile.  If Parrot could make
> > > > it possible and not fragile, that would be great.
> > >
> > > In parrot it's quite robust. Parrot supports "buffers" as core PMC
> > > types. A buffer can refer to any part of memory with any read-only or
> > > copy-on-write semantics you like.
> >
> > That would be nice.
> >
> > Incidentally will Parrot also support efficiently building strings
> > incrementally?  I like the fact that in Perl 5 it is O($n) to do
> > something like:
> >
> >   $string .= "hello" for 1..$n;
> >
> > In most other languages that is quadratic, and I'm wondering
> > what to expect in Perl 6.
> 
> That's not O(n) in Perl 5, it's just smaller than O(n^2). The same's true
> for Parrot -- we've got mutable strings and generally over-allocate, so
> it's not going to be quadratic time. Neither, though, is it going to be
> linear. Expect somewhere in between.

I don't have the source code in front of me, but my memory says that
when Perl 5 has to reallocate a string, the amount of extra
space it allocates is proportional to the length of the string.  (If
that is not how Perl does it, then Perl darned well should!)  In that
case it really is O(n).

What happens is that the total recopying work can be bounded
above by a geometric series that converges to something O(n).
Everything else is O(n).  So the result is O(n).  I gave you more
details at http://www.perlmonks.org/?node_id=276051.  (Hey, my
math background has to be good for something...)
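
To make the accounting concrete, here is a small sketch (not Perl's
actual internals, just an illustration of the strategy) that counts
how many bytes get recopied when capacity grows by a constant factor:

  use strict;
  use warnings;

  # Simulate geometric over-allocation: when an append would overflow
  # the current capacity, grow the capacity proportionally and charge
  # ourselves for recopying the bytes already in the string.
  my $n = 100_000;
  my ($len, $capacity, $recopied) = (0, 0, 0);
  for (1 .. $n) {
      my $needed = $len + length("hello");
      if ($needed > $capacity) {
          $recopied += $len;        # cost of moving the old contents
          $capacity = 2 * $needed;  # proportional over-allocation
      }
      $len = $needed;
  }
  printf "appended %d bytes, recopied only %d bytes\n", $len, $recopied;

The total recopying stays bounded by a small constant times the final
length, which is the geometric-series argument above.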

Perl uses the same strategy for other data structures, including
when growing a hash and in various array operations.

Of course the main reason that I'm familiar with this trick is that my
two minuscule core contributions to Perl were performance
enhancements from using this trick in places where it had not
previously been used (namely unshift and map).  Unlike you, I
don't have any other performance knowledge to confuse me...

Cheers,
Ben
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-15 Thread Dan Sugalski
On Mon, 15 Nov 2004, Ben Tilly wrote:

> On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman
> <[EMAIL PROTECTED]> wrote:
> > On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote:
> > > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> 
> > > wrote:
> > > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
[Massive snippage of two ships passing in the night]
> > > In Perl I'd expect it to be possible but fragile.  If Parrot could make
> > > it possible and not fragile, that would be great.
> >
> > In parrot it's quite robust. Parrot supports "buffers" as core PMC
> > types. A buffer can refer to any part of memory with any read-only or
> > copy-on-write semantics you like.
>
> That would be nice.
>
> Incidentally will Parrot also support efficiently building strings
> incrementally?  I like the fact that in Perl 5 it is O($n) to do
> something like:
>
>   $string .= "hello" for 1..$n;
>
> In most other languages that is quadratic, and I'm wondering
> what to expect in Perl 6.

That's not O(n) in Perl 5, it's just smaller than O(n^2). The same's true
for Parrot -- we've got mutable strings and generally over-allocate, so
it's not going to be quadratic time. Neither, though, is it going to be
linear. Expect somewhere in between.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-15 Thread Ben Tilly
On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman
<[EMAIL PROTECTED]> wrote:
> On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote:
> > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
> > [...]
> > > > Um, mmap does not (well should not - Windows may vary) use any
> > > > RAM
> 
> > > You are confusing two issues. "using RAM" is not the same as "allocating
> > > process address space".
> 
> > How was I confusing issues?
> 
> Let me demonstrate:
>
> > What I meant is that calling mmap does
> > not use significant amounts of RAM.
> 
> Calling mmap uses NO RAM. It doesn't interact with RAM at all. But it
> does allocate (potentially huge) amounts of process address space, and
> reserves it in such a way that your process can no longer allocate it
> for uses like libc's memory allocator (which you access through
> functions like malloc).

I feel like we are all talking past each other.  Let's go back to basics.

When I say RAM I mean the physical RAM on the computer,
whether or not that RAM is currently allocated to your process.  So
if you do something that makes something else get paged out,
then you've used RAM in my view.  Whether that RAM is in pages
attached to your process or was used by the kernel, I still
see that as you using RAM.

> If you mmap a 3GB file (actually less than 3GB, but I'll use that
> number as an example for now) on an x86 linux box and then call
> "malloc", you get back a NULL pointer because malloc will fail. This
> is actually not quite true. That malloc will likely work because it
> will be allocated from some existing page of address space that
> malloc's internal page allocator reserved before you called mmap, but
> that won't work for long.

I'm aware of this and wasn't disputing it.

> >  (The OS needs some to track
> > that the mapping exists, but that should be it.)
> 
> Actually, no. The place that mmap is tracked is a) in the file
> descriptor table, which is outside of your 3GB process space in
> kernel-space and b) in the system page table, which is not in your
> address space at all, but in hardware.

Where it is tracked doesn't concern me.  That it is tracked, does.

However, I realize that I don't know enough about how the memory
management is handled.  I would expect this to be dynamic
in some way - on creating a process the kernel should need to
write very little data, but will then write more later.  But I don't know
enough to verify that one way or the other.

> > Once you actually use
> > the data that you mmapped in, file contents will be swapped in, and
> > RAM will be taken, but not until then.
> 
> "RAM will be taken" is a meaningless term here. Ignore RAM for
> purposes of this conversation.

On the one hand you are saying that I'm confused about what I
meant by a comment about using RAM, and on the other you
are telling me to ignore RAM for the purposes of this
conversation because it is meaningless.  There is a contradiction there.
For discussing what *you* want to talk about it may be
meaningless, but for discussing what *I* had been talking
about it isn't.  And for deciding whether or not I was confused
it most definitely isn't.  (Perhaps you're confused about what I
was talking about?)

It appears that you want to discuss what the world looks like
to a process.  For that I wholeheartedly agree that talking about
what is in RAM is generally counterproductive: if the
abstraction of virtual memory works, then you should never
know or care about what is or is not in RAM.

But I was talking about what things lead to resource
consumption that could adversely affect a machine which is
carrying out a particular computation.  For that it matters a
great deal whether particular operations are going to cause
pages of RAM to be discarded and allocated for something
else.  Because when it comes to actual performance, the
abstraction of virtual memory leaks badly.

(And what I was saying about resource consumption is that
mmap doesn't.  Consume in meaningful amounts that is.)

> > As for a 3 GB limit, now that you mention it, I heard something
> > about that.  But I didn't pay attention since I don't need it right now.
> 
> Suffice to say that your process cannot be larger than 3GB under x86
> Linux. There are extensions, options and hacks if you want to go
> larger, but after 3GB it gets very dicey.

Are you saying that Linux does not give a user-level API to Intel's
addressing extensions?  Or that it does but you recommend
against using it?

> > I've also heard about Intel's large addressing extensions (keep 2GB
> > in normal address space, page around the top 2 GB, you get 64 GB
> > of addressable memory).  I'm curious about how (or if) the two can
> > cooperate.
> 
> The ability to re-map memory like this is quite common, and the *OS*
> can take advantage of it, but as long as you're on an x86 and using
> 32-bit pointers, your one process ca

Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-15 Thread Jason Gloudon

> >   BT> I've also heard about Intel's large addressing extensions (keep 2GB
> >   BT> in normal address space, page around the top 2 GB, you get 64 GB

The extension referred to above concerns the 4 gig limit inherent in a 32 bit
address space with byte addressable memory. The 4 gig limit is on the amount of
physical memory that the CPU can directly address. The virtual address space on
x86 also happens to be 32 bits in size.

Historically, on many x86 Unix-like OSes the upper half of the virtual address
space (basically any memory address whose most significant bit was 1) was used
by the operating system and was not accessible to application code. This meant
that a single process could only usefully address 2 gigabytes of virtual memory
at a time.

For some time now it has been possible to build linux kernels where the amount
of virtual memory available to the apps is 3 gigabytes, with the kernel making
do with 1 gig of virtual memory. This was purely a software change.

The CPU page addressing extensions allow the CPU to use more than 4 gigs of
physical memory. However, there is no way around the virtual memory address size
limit, so not even the operating system can address it all simultaneously.
It can, however, allocate all of that memory to different processes, so if you
have a single process that needs to access 30 gigs of memory you'll want a 64
bit machine, but if you have 12 processes that each need 2.5 gigs of RAM you're
all set.

> > eww, that sounds like manual management of extended memory. like the
> > days of overlays or even pdp-11 extended memory (which i used to access
> > 22 bit address (yes, 22 bit PHYSICAL space) from a 16 bit cpu.). not

Fortunately only the operating system has to deal with the addressing hacks
required.

> Personally I think it would be great to have a system call to say,
> "Page this in, because I'll want to be there soon."  The
> multi-process solution is the next best thing.

It is called madvise().
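
From Perl, one way to issue that hint is through the CPAN File::Map
module (a sketch; the module choice and file name are my assumptions,
not something from this thread):

  use strict;
  use warnings;
  use File::Map qw(map_file advise);

  # Map the file read-only, then tell the kernel we will want it soon.
  map_file my $map, 'big_matrix.csv', '<';
  advise $map, 'willneed';
  # $map now reads like an ordinary Perl string backed by the file.
  my $first_line = substr $map, 0, index($map, "\n") + 1;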

-- 
Jason
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-15 Thread Dan Sugalski
At 12:12 PM -0800 11/15/04, Ben Tilly wrote:
> On Sat, 13 Nov 2004 17:43:37 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > > "BT" == Ben Tilly <[EMAIL PROTECTED]> writes:
> >
> >   BT> In Perl I'd expect it to be possible but fragile.  If Parrot could make
> >   BT> it possible and not fragile, that would be great.
> >
> > parrot will have mmap in bytecode which will be nice and fast and
> > shared. i dunno about mmaped data as that is always an issue and with
> > garbage collection it can be tricky.

Nah, mapping mmapped data to internal strings won't be a big deal,
and GC doesn't care. The memory gets marked as external so parrot
won't try to move it or free it, and if you wanted it could be COW as
well so the mmap segment could be read-only. Or not, either way.

> I see, so you compile Perl to bytecode, and then load it through mmap.
> Sure, it might be faster to compile Perl than to read the file. But if the
> bytecode has been used recently...

More than that, it means not loading in unused code, and deferring
loading of code until you actually need it. Not as big a
deal for one-liners (which wouldn't get compiled to disk, I expect)
but if you had something like, say, Tk or CGI compiled to bytecode
you could see a significant win, since 90%+ of those modules never
get used by any particular program. (And the commonly used library
modules are potentially already loaded in system memory by another
process.)
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-15 Thread Ben Tilly
On Sat, 13 Nov 2004 17:43:37 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > "BT" == Ben Tilly <[EMAIL PROTECTED]> writes:
> 
>   BT> How was I confusing issues?  What I meant is that calling mmap does
>   BT> not use significant amounts of RAM.  (The OS needs some to track
>   BT> that the mapping exists, but that should be it.)  Once you actually use
>   BT> the data that you mmapped in, file contents will be swapped in, and
>   BT> RAM will be taken, but not until then.
> 
> mmap uses virtual ram which means physical ram and swap space. so mmap can
> suck up as much physical ram as you have if you allocate it.

We are both right, and we are both wrong.  The reality is that
any such behaviour on the part of mmap is OS and
implementation dependent.  And an intelligent OS will make
much of it configurable.  

Let me refer to my local Linux manpage.  For this purpose I'd
specify MAP_SHARED.  In that case edits made to memory are
made to the file, and there is no need to reserve RAM or swap.
(Not that Linux would reserve RAM anyway; Linux allows
overcommitting.)  Something close to your described behaviour
may happen if you use MAP_PRIVATE and don't specify
MAP_NORESERVE.  If you specify MAP_NORESERVE you won't use
swap.  (You might get a SIGSEGV if you write when RAM
is not available.)

Which reinforces the point that specific details about the
side-effects of mmap (or any system call) are implementation
dependent and should never be assumed.
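
For the record, the MAP_SHARED read-only case looks like this from
Perl with the Sys::Mmap module discussed elsewhere in this thread
(a sketch; the file name is made up):

  use strict;
  use warnings;
  use Sys::Mmap;

  open my $fh, '<', 'big_matrix.csv' or die "open: $!";
  # Length 0 means "map the whole file".  With MAP_SHARED the pages
  # are backed by the file itself, so no swap needs to be reserved.
  mmap(my $contents, 0, PROT_READ, MAP_SHARED, $fh)
      or die "mmap: $!";
  my $lines = ($contents =~ tr/\n//);  # count lines with no read loop
  print "$lines lines\n";
  munmap($contents);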

>   BT> As for a 3 GB limit, now that you mention it, I heard something
>   BT> about that.  But I didn't pay attention since I don't need it right now.
>   BT> I've also heard about Intel's large addressing extensions (keep 2GB
>   BT> in normal address space, page around the top 2 GB, you get 64 GB
>   BT> of addressable memory).  I'm curious about how (or if) the two can
>   BT> cooperate.
> 
> eww, that sounds like manual management of extended memory. like the
> days of overlays or even pdp-11 extended memory (which i used to access
> 22 bit address (yes, 22 bit PHYSICAL space) from a 16 bit cpu.). not
> something you want to deal with unless you have to.

You got it.  Worse yet, from the way that I see it, in the next 5 years our
industry must decide whether the bad old days are going to return, and
I don't know which way it will jump.

The problem is that consumer computers will soon use over 4 GB of
RAM.  The obvious clean solution is to switch to 64 bit CPUs. But as
soon as you go to 64-bit pointers, a lot of programs and data
structures grow (worst case they double) so there is a big jump in
memory needs.  (Rather than being a bit over your needs, you are a
lot over.)  And when data bloats, moving that data around the
computer slows down as well.  Programmer pain is invisible.  That
size and performance hit is not.

A couple of years ago Intel quietly added an addressing extension to
allow for up to 64 GB of RAM, and then pursued a non-consumer
64-bit strategy (which flopped).  The way that I read that is that Intel
expects the consumer industry to do as it did for a decade after the
16 to 32 bit conversion should have happened - stick with smaller
pointers and swallow the addressing difficulty.

AMD's strategy was more public.  They came out with a 64-bit CPU
that solved a bunch of problems with the x86 architecture, meaning
that if you switch to 64-bit mode there is a good chance that your
code will speed up.  Their CPU has been successful enough that
Intel has been compelled to issue a copycat CPU.

But which way the industry will jump is still unclear to me.  To me
the first good test is what happens when high-end games start
having memory requirements that are too big for straightforward
32-bit access.  (A limit which conventionally is placed at 2 GB, but can
be stretched to somewhere between 2 GB and 4 GB.)  Will they
manually manage some big chunks of data, or will they require an
AMD Athlon-compatible computer?

> as for the original
> problem, i keep saying that mmap will give little help if the input
> matrix won't fit into physical ram. once you start swapping (manually or
> virtually) all bets are off on speed of any transposition algorithm. you
> have to start counting disk accesses and once you do, who cares how it
> was done (mmap or whatever)?

mmap with MAP_SHARED may reduce your RAM requirements and
improve your access speed.  I agree that the difference is pretty
marginal.

As for your "counting disk accesses", I've already pointed out that
disk accesses are not created equal.  If you really care about the
performance of the application, you need to benchmark, not make
simplistic estimates.  Because unless you know a lot of detail about
how the disk drive works, you can't easily predict what the actual
performance will be.

[...]
>   BT> However the over-committed allocation comment confuses me.
>   BT> Why would a single mmap result in over committing memo

Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-13 Thread Uri Guttman
> "BT" == Ben Tilly <[EMAIL PROTECTED]> writes:

  BT> How was I confusing issues?  What I meant is that calling mmap does
  BT> not use significant amounts of RAM.  (The OS needs some to track
  BT> that the mapping exists, but that should be it.)  Once you actually use
  BT> the data that you mmapped in, file contents will be swapped in, and
  BT> RAM will be taken, but not until then.

mmap uses virtual ram which means physical ram and swap space. so mmap can
suck up as much physical ram as you have if you allocate it.

  BT> As for a 3 GB limit, now that you mention it, I heard something
  BT> about that.  But I didn't pay attention since I don't need it right now.
  BT> I've also heard about Intel's large addressing extensions (keep 2GB
  BT> in normal address space, page around the top 2 GB, you get 64 GB
  BT> of addressable memory).  I'm curious about how (or if) the two can
  BT> cooperate.

eww, that sounds like manual management of extended memory. like the
days of overlays or even pdp-11 extended memory (which i used to access
22 bit address (yes, 22 bit PHYSICAL space) from a 16 bit cpu.). not
something you want to deal with unless you have to. as for the original
problem, i keep saying that mmap will give little help if the input
matrix won't fit into physical ram. once you start swapping (manually or
virtually) all bets are off on speed of any transposition algorithm. you
have to start counting disk accesses and once you do, who cares how it
was done (mmap or whatever)?

  >> To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB
  >> file, assuming you allowed for over-committed allocation in the kernel
  >> (assuming Linux... filthy habit, I know).

  BT> Exactly what I was referring to.

  BT> However the over-committed allocation comment confuses me.
  BT> Why would a single mmap result in over committing memory?


you can allocate all the virtual space allowed when you don't need
it. then mmap gains you little. mmap is best used as a window into a
large file. note that page faults when using mmap will block a process
until that page is swapped in. so some systems use small processes to
touch shared (mmaped) virtual ram so that they block, and when they wake up
the main process can use the real ram.

  BT> I was thinking that you'd use something like Sys::Mmap's mmap
  BT> call directly so that there is a Perl variable that Perl thinks is a
  BT> regular variable but which at a C level has its data at an mmapped
  BT> location.  Fragile, I know (because Perl doesn't know that it cannot
  BT> reallocate the variable), but as long as you are careful to not cause
  BT> it to be reallocated or copied, there should be no limitations on
  BT> what you can do.

but those are bad limitations. i wouldn't ever design in such a beast. i
would use other methods of IPC than mmap in perl.

  BT> In Perl I'd expect it to be possible but fragile.  If Parrot could make
  BT> it possible and not fragile, that would be great.

parrot will have mmap in bytecode which will be nice and fast and
shared. i dunno about mmaped data as that is always an issue and with
garbage collection it can be tricky.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-13 Thread Ben Tilly
On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
[...]
> > Um, mmap does not (well should not - Windows may vary) use any
> > RAM
> 
> You are confusing two issues. "using RAM" is not the same as "allocating
> process address space". Allocating process address space is, of course,
> required for mmap (same way you allocate address space when you load a
> shared library, which is also mmap-based under Unix and Unix-like
> systems). All systems have to limit address space at some point. Linux
> does this at 3GB up to 2.6.x where it becomes more configurable and can
> be as large as 3.5, I think.

How was I confusing issues?  What I meant is that calling mmap does
not use significant amounts of RAM.  (The OS needs some to track
that the mapping exists, but that should be it.)  Once you actually use
the data that you mmapped in, file contents will be swapped in, and
RAM will be taken, but not until then.

As for a 3 GB limit, now that you mention it, I heard something
about that.  But I didn't pay attention since I don't need it right now.
I've also heard about Intel's large addressing extensions (keep 2GB
in normal address space, page around the top 2 GB, you get 64 GB
of addressable memory).  I'm curious about how (or if) the two can
cooperate.

> To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB
> file, assuming you allowed for over-committed allocation in the kernel
> (assuming Linux... filthy habit, I know).

Exactly what I was referring to.

However the over-committed allocation comment confuses me.
Why would a single mmap result in over committing memory?

> > mmap should not cause any more or less disk accesses than
> > reading from the file in the same pattern should have.  It just lets
> > you do things like use Perl's RE engine directly on the file
> > contents.
> 
> Actually, no it doesn't as far as I know (unless the copy-on-write code
> got MUCH better recently).

Where does a write happen?  I was thinking in terms of using the
RE engine (with pos) as a tokenizer.

I was thinking that you'd use something like Sys::Mmap's mmap
call directly so that there is a Perl variable that Perl thinks is a
regular variable but which at a C level has its data at an mmapped
location.  Fragile, I know (because Perl doesn't know that it cannot
reallocate the variable), but as long as you are careful to not cause
it to be reallocated or copied, there should be no limitations on
what you can do.
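
A sketch of that tokenizer idea (Sys::Mmap again; the simple
field/separator pattern is a stand-in for real CSV parsing):

  use strict;
  use warnings;
  use Sys::Mmap;

  open my $fh, '<', 'big.csv' or die "open: $!";
  mmap(my $buf, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap: $!";

  # Walk the mapped string with \G and pos(), so the big scalar is
  # never copied; only the small captured fields are.
  while ($buf =~ /\G([^,\n]*)([,\n])/gc) {
      my ($field, $end_of_row) = ($1, $2 eq "\n");
      # ... handle one field ...
  }
  munmap($buf);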

> Like I said, you probably won't get the win out of mmap in Perl that you
> would expect. In Parrot you would, but that's another story.

In Perl I'd expect it to be possible but fragile.  If Parrot could make
it possible and not fragile, that would be great.

Cheers,
Ben
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Aaron Sherman
On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
> On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
> [...]
> > this talk about mmap makes little sense to me. it may save some i/o and
> > even some buffering but you still need the ram and mmap still causes
> > disk accesses.

Just to throw in my own two cents before I critique the reply: "some
disk buffering" can mean a factor of 10-1000 performance improvement in
real-world applications. This is my personal experience with real-world
programs. Of course, if all you want is linear access to a file once,
then mmap doesn't help. But, if you want random access to a file,
nothing beats mmap, because people spend their LIVES tuning paging
strategies, and such code has knowledge of the hardware that you cannot
otherwise take advantage of in a general-purpose IO layer.

> Um, mmap does not (well should not - Windows may vary) use any
> RAM

You are confusing two issues. "using RAM" is not the same as "allocating
process address space". Allocating process address space is, of course,
required for mmap (same way you allocate address space when you load a
shared library, which is also mmap-based under Unix and Unix-like
systems). All systems have to limit address space at some point. Linux
does this at 3GB up to 2.6.x where it becomes more configurable and can
be as large as 3.5, I think.

To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB
file, assuming you allowed for over-committed allocation in the kernel
(assuming Linux... filthy habit, I know).

> mmap should not cause any more or less disk accesses than
> reading from the file in the same pattern should have.  It just lets
> you do things like use Perl's RE engine directly on the file
> contents.

Actually, no it doesn't as far as I know (unless the copy-on-write code
got MUCH better recently).

Like I said, you probably won't get the win out of mmap in Perl that you
would expect. In Parrot you would, but that's another story.


___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Aaron Sherman
On Fri, 2004-11-12 at 21:47 -0500, William Ricker wrote:
> > This is at best 2/3 correct.
>  
> > First you're right that mmap has a 2 GB limit because it maps
> > things into your address space, and so the size of your pointers
> > limits what you can address.
> (unless you have 64bit pointers of course)

No, even without 64 bit pointers, you can have a 4GB address space (not
signed). The trick is that under Linux you're usually limited to 3GB
because the rest is reserved and other OSes impose other similar
limitations.

I have worked with an application that allocates about 2.5GB of RAM on
startup, so I have occasion to know this ;-)


___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread William Ricker
> Well mmap is a Unix concept.  

IIRC, it's a Multics concept that Brian and Dennis presumably took south to NJ 
with them; if it was original with Multics or is even older I'd have to check 
with the retired Multicians list or an archeobibliography.

> To the best of my knowledge
> it is not natively supported in Windows. 

Right. The ActiveState module repository does not include mmap.pm builds for 
Windows, only for several *nix platforms.

MKS rocks, if you need commercial grade support for *nix-on-winDos. Cygwin is 
fine if you don't need 800# support and don't want to manage the WINDOS ENV 
from KSH for launching those closed source programs from a nice shell.

> Like him, I have no idea why pagefile.sys would enter into the
> picture.  It certainly doesn't on Linux.


It oughtn't, but a lame enough emulation *of the interface* on a lame "os" 
might have to copy the whole thing into the swapfile instead of doing a 
memory-map file operation - in which case, kiss the efficiency goodbye.

Bill


---
William Ricker [EMAIL PROTECTED]

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread William Ricker
> This is at best 2/3 correct.
 
> First you're right that mmap has a 2 GB limit because it maps
> things into your address space, and so the size of your pointers
> limits what you can address.
(unless you have 64bit pointers of course)
 
Does the holder of the original problem (sorry, I've lost track of whose 
question it was now)
1. have *nix or windos?
32/64 if *nix? 
w/cygwin or MKS if win?
2. have a file >2GB or <2GB ?
3. have to do this once or many times?

Unfortunately, I don't immediately see a Debian pkg for Mmap.pm for Alpha, 
which is the only platform I have perl64 on. Maybe next rev's perl 5.8 will 
bring one.

bill


---
William Ricker [EMAIL PROTECTED]

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Ben Tilly
On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
[...]
> this talk about mmap makes little sense to me. it may save some i/o and
> even some buffering but you still need the ram and mmap still causes
> disk accesses.

Um, mmap does not (well should not - Windows may vary) use any
RAM (other than what the memory manager needs to keep track of
the fact that the mapping has happened).  Using mmap does not
imply any particular algorithm; it is an optimization saying that
you're going to leave the ugly details of paging the file to/from your
process to the OS.

mmap should not cause any more or less disk accesses than
reading from the file in the same pattern should have.  It just lets
you do things like use Perl's RE engine directly on the file
contents.

> if the original file is too big for ram then the
> algorithm chosen must be one to minimize disk accesses and mmap doesn't
> save those. this is why disk/tape sorts were invented, to minimize the
> slow disk and tape accesses. so you would still need my algorithm or
> something similar regardless of how you actually get the data from disk
> to ram. and yes i have used mmap on many projects.

I'm not sure what you mean by "something similar", but yes,
you'll need SOME algorithm to solve the problem.  Which
statement is so general as to be meaningless.  I'm sure that
there are some possible algorithms that you'd never have
thought of.  (Mostly because they're bad.)

Disk/tape sorts were invented because back in the day there
was not enough RAM to do anything useful and so everything
had to go to disk.  Of course once you're forced to go to disk,
why not optimize it...?

Of course this problem asked us to guarantee being able to do the
sort, not necessarily to do it most efficiently.  Therefore no
single criterion - including disk accesses - necessarily MUST
dominate your choice.  Furthermore disk accesses are not
created equal.  There are multiple levels of cache between
you and disk.  Accessing data in a way that is friendly to cache
will improve performance greatly.

In particular managing to access data sequentially is orders of
magnitude faster than jumping around.  The key is not how
often you "access disk", it is how often your hard drive has to
do a seek.  When it needs to seek it reads far more data than
it is asked for and puts that in cache.  When you read
sequentially, most of your accesses come from cache, not
disk.  That is why databases use merge-sort so much: it
accesses data in exactly the way that hard drives are designed
to be accessed most efficiently.  A quick sort has fewer disk
accesses, but far more of them cause an unwanted seek.

> when analyzing algorithm efficiency you must work out which is the
> slowest operation that has the steepest growth curve and work on
> minimizing it. since disk access is so much slower than ram access it
> becomes the key element rather than the classic comparison in sorts. in

You must, must, must.  What is this preoccupation with must?

As I just pointed out, disk accesses are not all equal.

Secondly in many applications you will *parallelize* the
slowest step, not minimize it.  For instance good databases
not only like to use mergesort internally, they often distribute
the job to several processes or threads that all work at once,
that way if one process is waiting on a disk read, others may
be going at the same time.

Thirdly, and most importantly, it is more important to make
code work than to make it efficient.  If a stupid solution will
work and a smart one should be faster, code the stupid
solution first.

> a matrix transposition in ram, i would count the matrix accesses and/or
> copies of elements. with a larger matrix, then ram accesses would be
> key. my solution would load as much matrix into ram as possible (maybe
> using mmap but that is not critical anymore) and transpose it. then
> write the section out. that is 2 (large) disk accesses per chunk (or 1
> per disk block). then you do a merge (assuming you can access all the
> section files at one time) which is another disk access per section (or
> block). and one more to write out the final matrix (in row order). so
> that is O((2 + 2) * section_count) disk accesses which isn't too bad.

You said that you want to assume that we can access all section
files at once.  Well suppose that I take a CSV file which is 100
columns by 10 million rows, transpose it, then try to transpose it
again.  Your assumption just broke.  Maybe it would work for the
person with the original problem, maybe not.

Here is the outline of a solution that avoids all such assumptions.

1. Run through the CSV file and output a file with lines of the format:
  $column,$row:$field
You'll need to encode embedded newlines some way, for instance
s/\\/\\\\/g; s/\n/\\n/g; - you may also want to pre-pad the columns and
rows with some number of 0's so that an ASCII-betical sort does
The R
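
A sketch of step 1 as described (padding widths are arbitrary, and
splitting on bare commas stands in for real CSV parsing):

  use strict;
  use warnings;

  # Emit one line per field, keyed by "column,row", with newlines
  # escaped and both indices zero-padded so that a plain ASCII sort
  # orders the keys numerically.
  my $row = 0;
  while (my $line = <STDIN>) {
      chomp $line;
      my $col = 0;
      for my $field (split /,/, $line, -1) {
          $field =~ s/\\/\\\\/g;
          $field =~ s/\n/\\n/g;
          printf "%09d,%09d:%s\n", $col++, $row, $field;
      }
      $row++;
  }
  # Sorting this output (e.g. with sort(1)) then stripping the keys
  # yields the fields in transposed order.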

Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Ben Tilly
On Fri, 12 Nov 2004 09:01:32 -0500, Tolkin, Steve <[EMAIL PROTECTED]> wrote:
> I think there may be a more restrictive limit, at least on Windows.
> The OS must be able to find a contiguous block
> of virtual memory, i.e. in pagefile.sys.
> The paging file may not be able to grow, (depending on
> how it is configured) and there may not be a large enough block.

Given how mmap is supposed to work, I'd doubt this diagnosis.
I'm not saying that you're wrong, I'm just saying that this would
be a strange limitation.

What mmap is supposed to do is cause a section of your
address space to transparently be a file on disk.  It should not
actually read that data in, that is done for you on demand when
you access data.  (I guess an implementation could do that, but
the Unix ones certainly don't.)

However it does need to find a contiguous block of address
space in your process memory space to map that data into.

> I would like to learn more about the exact situation of memory mapping files
> on Windows -- the above is just based on an hour of Googling
> and the info below.

Well mmap is a Unix concept.  To the best of my knowledge
it is not natively supported in Windows.  If you have it, it will
be implemented by whatever Unix emulation you're using.
For instance if you're using MKS then see
http://mkssoftware.com/docs/man3/mmap.3.asp, if you're
using cygwin see your local cygwin documentation.

It may be that the emulations have additional limits that I
don't know about.  (I don't use Windows.)

> Non perl related info follows:
> I hit this limitation in the otherwise excellent disk indexing program
> Wilbur at http://wilbur.redtree.com which is free (as in beer) and open
> source too.
> (But for Windows only.)
> 
> It uses memory mapped files and when one of the indexes exceeds
> about 500 MB it says something like "unable to map view of a file"
> even though my pagefile.sys is 1536 MB.
> 
> The developer said:
> This is a system message that occurs when Wilbur is unable to memory map
> one of its index files and due to the way memory mapping works on Windows,
> I think this is normally a symptom of insufficient virtual memory space.
> Possible solutions might be increasing the size of your paging file (dig
> through the performance options on the system control panel to find this),
> defragmenting the disk and of course adding more real memory.

Googling for that, I see him saying the same thing at

http://wilbur.redtree.com/cgi-bin/wilburtalk.pl?noframes;read=1770

Like him, I have no idea why pagefile.sys would enter into the
picture.  It certainly doesn't on Linux.

Cheers,
Ben
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Ben Tilly
On Fri, 12 Nov 2004 07:38:57 -0500, Gyepi SAM <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
[...]
> I think mmap would be just as ideal in Perl and a lot less work too.
> Rather than indexing and parsing a *large* file, you just mmap
> and parse it. In fact, the CSV code, which was left as an exercise in your
> pseudo-code, would be the only code required.

It depends on your definition of ideal.  A Perl string is far more
complex than a C string, and translating between the two adds
complexity.  It requires an external module and adds platform
dependencies.

> I should point out though that mmap has a 2GB limit on systems
> without 64bit support. Such systems can't store files larger than
> that anyhow.

This is at best 2/3 correct.

First you're right that mmap has a 2 GB limit because it maps
things into your address space, and so the size of your pointers
limits what you can address.

It is also correct that there are complications in handling large
files on 32 bit systems.  Most operating systems didn't handle
that case.

However today most 32 bit operating systems have support
for large files, and Perl added the necessary hooks to take
advantage of it several versions ago.  So if you have a
relatively up to date system, odds are very good that you
don't have a 2 GB limit.  Certainly not on Windows or Linux.
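
A quick way to check what your perl was built with (the Config keys
are standard):

  use Config;
  print 'large file support: ',
      ($Config{uselargefiles} ? 'yes' : 'no'), "\n";
  print "seek offset size: $Config{lseeksize} bytes\n";  # 8 = 64-bit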

Cheers,
Ben
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Uri Guttman
> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:

  GS> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
  >> Seriously, while mmap is ideal in C, in Perl I would just build an array
  >> of tell()s for each line in the file and then walk through the lines,
  >> storing the offset of the last delimiter that I'd seen.

  GS> I think mmap would be just as ideal in Perl and a lot less work too.
  GS> Rather than indexing and parsing a *large* file, you just mmap
  GS> and parse it. In fact, the CSV code, which was left as an exercise in your
  GS> pseudo-code, would be the only code required.

  GS> I should point out though that mmap has a 2GB limit on systems
  GS> without 64bit support. Such systems can't store files larger than
  GS> that anyhow.

  >> Let the kernel file buffer do your heavy lifting for you.

  GS> Exactly, if s/kernel file/mmap/

this talk about mmap makes little sense to me. it may save some i/o and
even some buffering but you still need the ram and mmap still causes
disk accesses. if the original file is too big for ram then the
algorithm chosen must be one to minimize disk accesses and mmap doesn't
save those. this is why disk/tape sorts were invented, to minimize the
slow disk and tape accesses. so you would still need my algorithm or
something similar regardless of how you actually get the data from disk
to ram. and yes i have used mmap on many projects.

when analyzing algorithm efficiency you must work out which is the
slowest operation that has the steepest growth curve and work on
minimizing it. since disk access is so much slower than ram access it
becomes the key element rather than the classic comparison in sorts. in
a matrix transposition in ram, i would count the matrix accesses and/or
copies of elements. with a larger matrix, then ram accesses would be
key. my solution would load as much matrix into ram as possible (maybe
using mmap but that is not critical anymore) and transpose it. then
write the section out. that is 2 (large) disk accesses per chunk (or 1
per disk block). then you do a merge (assuming you can access all the
section files at one time) which is another disk access per section (or
block). and one more to write out the final matrix (in row order). so
that is O((2 + 2) * section_count) disk accesses which isn't too bad.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


RE: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Tolkin, Steve
I think there may be a more restrictive limit, at least on Windows.  
The OS must be able to find a contiguous block 
of virtual memory, i.e. in pagefile.sys.
The paging file may not be able to grow, (depending on
how it is configured) and there may not be a large enough block.

I would like to learn more about the exact situation of memory mapping files
on Windows -- the above is just based on an hour of Googling
and the info below.

Non perl related info follows:
I hit this limitation in the otherwise excellent disk indexing program
Wilbur at http://wilbur.redtree.com which is free (as in beer) and open
source too.
(But for Windows only.)

It uses memory mapped files and when one of the indexes exceeds
about 500 MB it says something like "unable to map view of a file"
even though my pagefile.sys is 1536 MB.

The developer said:
This is a system message that occurs when Wilbur is unable to memory map
one of its index files and due to the way memory mapping works on Windows,
I think this is normally a symptom of insufficient virtual memory space.
Possible solutions might be increasing the size of your paging file (dig
through the performance options on the system control panel to find this),
defragmenting the disk and of course adding more real memory.

But I already have a paging file of 1536 MB, which is the recommended
size in Windows XP (3 times physical memory of 512 MB).

I also do not think that defragmenting the disk helps, except possibly
if done at boot time to defrag pagefile.sys (However I did do that
and it still failed.)

I was able to work around the problem by putting fewer words
in the index, e.g. keeping the default of min length = 3 and no numbers.
I also have lots more files than most people, so I think very few people
will hit this limitation.


Hopefully helpfully yours,
Steve
-- 
Steve TolkinSteve . Tolkin at FMR dot COM   617-563-0516 
Fidelity Investments   82 Devonshire St. V4D Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.



-Original Message-
From: Gyepi SAM [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 12, 2004 7:39 AM
To: [EMAIL PROTECTED]
Subject: Re: [Boston.pm] transposing rows and columns in a CSV file


On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
> Seriously, while mmap is ideal in C, in Perl I would just build an array
> of tell()s for each line in the file and then walk through the lines,
> storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too.
Rather than indexing and parsing a *large* file, you just mmap
and parse it. In fact, the CSV code, which was left as an exercise in your
pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems
without 64bit support. Such systems can't store files larger than
that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient.
--me
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-12 Thread Gyepi SAM
On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
> Seriously, while mmap is ideal in C, in Perl I would just build an array
> of tell()s for each line in the file and then walk through the lines,
> storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too.
Rather than indexing and parsing a *large* file, you just mmap
and parse it. In fact, the CSV code, which was left as an exercise in your
pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems
without 64bit support. Such systems can't store files larger than
that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient.
--me
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-11 Thread Aaron Sherman
On Wed, 2004-11-10 at 23:13 -0500, Uri Guttman wrote:

> sorry i missed the meeting but i have a nasty cold i am fighting off.

I just got over that one :-(

As for transposing a matrix that won't fit in ram... that's easy. Mail
it to someone who has more ram.

Aaron "Gordian Knot" Sherman, at your service ;-)

Seriously, while mmap is ideal in C, in Perl I would just build an array
of tell()s for each line in the file and then walk through the lines,
storing the offset of the last delimiter that I'd seen. That way, every
read looks like this:

# XXX WARNING, untested pseudo-code
next if $offsets[$thisline] == -1;
# sysseek pairs with the sysread below (mixing seek with sysread is unreliable)
sysseek(INPUT, $tells[$thisline] + $offsets[$thisline], 0);
$bigbuf = '';
for ($i = 0; ; $i++) {             # loop until we hit a delimiter or EOF
    last unless sysread(INPUT, $buf, $blocksz);
    # Proper CSV parsing left as an exercise...
    if ($buf =~ /[,\n]/) {
        $pos = length($`) + 1;
        $sep = $&;
        $bigbuf .= $`;             # keep what preceded the delimiter
        if ($sep eq "\n") {
            $offsets[$thisline] = -1; # Done here
        } else {
            $offsets[$thisline] += $i*$blocksz + $pos;
        }
        last;
    } else {
        $bigbuf .= $buf;
    }
}
handle_one_field($bigbuf);

Let the kernel file buffer do your heavy lifting for you.
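
For completeness, one way to build the @tells index that the snippet
assumes, with every @offsets entry starting at zero ($csv_path is a
stand-in):

  open(INPUT, '<', $csv_path) or die "open: $!";
  my (@tells, @offsets);
  while (1) {
      my $pos = tell(INPUT);          # byte offset of the next line
      defined(my $line = <INPUT>) or last;
      push @tells, $pos;
  }
  @offsets = (0) x @tells;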

-- 

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-10 Thread Uri Guttman
> "WR" == William Ricker <[EMAIL PROTECTED]> writes:

  WR> This problem of transposing a matrix bigger than you want to
  WR> consider slurping feels familiar. The techniques used to sort
  WR> "big" files in the bad old days, disk or tape based sort-merge,
  WR> might be adaptable. The trick is to do it in 2 (or small * log N)
  WR> passes, by creating a simpler suitable intermediate form.

interestingly, when doing tape merges with multiple tapes, the most
efficient way is based on the fibonacci sequence. but in that case you
can see the first records of each tape in a set to do the merge. in the
large matrix you can't see all the needed elements so that may not be a
viable direction to pursue.

but here is another thought based on the old disk sort/merges and
such. you can transpose sections of the input matrix and then join/merge
those. so if you can only fit one part of the matrix into ram (including
output buffers), you transpose it and output it to a file. do the same
for the other sections and output them. then it is a fairly easy merge
where you can read the first records of each of the transposed sections
and merge into a single record and output it to a final file.
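
A sketch of that final merge, assuming each section file holds one
transposed record per line (the file names are made up):

  use strict;
  use warnings;

  my @fhs = map { open my $fh, '<', $_ or die "$_: $!"; $fh }
            sort glob 'section.*.out';
  open my $final, '>', 'transposed.csv' or die "open: $!";
  while (1) {
      my @parts;
      for my $fh (@fhs) {
          defined(my $line = <$fh>) or last;
          chomp $line;
          push @parts, $line;
      }
      last if @parts < @fhs;   # a section ran out of lines: all done
      print {$final} join(',', @parts), "\n";
  }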

see, i am not all about stem :)

sorry i missed the meeting but i have a nasty cold i am fighting off.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-10 Thread William Ricker
This problem of transposing a matrix bigger than you want to consider slurping 
feels familiar. The techniques used to sort "big" files in the bad old days, 
disk or tape based sort-merge, might be adaptable. The trick is to do it in 2 
(or small * log N) passes, by creating a simpler suitable intermediate form.

Bill

---
William Ricker [EMAIL PROTECTED]

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] transposing rows and columns in a CSV file

2004-11-10 Thread Gyepi SAM
On Wed, Nov 10, 2004 at 04:38:26PM -0500, Tom Metro wrote:
> At last night's meeting someone asked about ways to transpose the rows and 
> columns in a CSV file that was too large to fit into memory and 
> potentially had thousands of columns and hundreds of thousands of rows.

All of the suggested solutions are great, but ...

If a file is too large to slurp into memory, use mmap. It's perfect for the
task and allows for the most natural solution.
There are at least two mmap modules on CPAN.

Of course, since I was not at the meeting, this may well have been discussed
and I could be a day late! ;)

-Gyepi
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


[Boston.pm] transposing rows and columns in a CSV file

2004-11-10 Thread Tom Metro
At last night's meeting someone asked about ways to transpose the rows and 
columns in a CSV file that was too large to fit into memory and 
potentially had thousands of columns and hundreds of thousands of rows.

If your data fits the criteria of having reasonably uniform cell widths, 
say for example no cells holding more than 50 bytes, you could use a 
sparse file with a fixed width record structure. You would make a single 
pass through your source data file, parsing the CSV only once, and make 
random (as in random access) seeks into the target file, following a 
simple formula to determine the cell's offset.

(If a sparse file isn't doable, but you have lots of disk space, you can 
instead create the target file by padding it with enough nulls to hold 
the number of cells you are expecting. Using dd to copy blocks from 
/dev/zero would do the trick.)
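
A sketch of the seek arithmetic (the cell width, dimensions, and file
name are all assumptions; on most Unix filesystems, writing past the
end of the file leaves the untouched regions sparse):

  use strict;
  use warnings;

  use constant CELL_W => 50;              # fixed record width
  my ($rows, $cols) = (100_000, 1_000);   # source dimensions

  open my $out, '+>', 'transposed.fixed' or die "open: $!";

  # Source cell ($r, $c) becomes target cell ($c, $r); with $rows
  # cells per output row, that is record number $c * $rows + $r.
  sub write_cell {
      my ($r, $c, $field) = @_;
      seek($out, ($c * $rows + $r) * CELL_W, 0) or die "seek: $!";
      print {$out} pack('A' . CELL_W, $field);   # space-padded
  }

  write_cell(0, 2, 'hello');   # e.g. row 0, column 2 of the source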

Conceptually this isn't all that different from using a database, as 
someone suggested, but should have much less overhead.

Of course if the desired end result is to have another CSV file, you'll 
have to write another chunk of code to convert the fixed record file 
back to CSV.

Whether millions of random seeks and a post conversion step really 
provide enough of a performance gain over making multiple passes through 
the source file is debatable. If this is a one-time conversion effort, 
the multi pass approach might be the way to go.

A variation on this idea, which would eliminate copying the data to an 
intermediary file, and would be able to tolerate widely varying cell 
widths, is to make one pass through the source file and generate an 
index for it - say the byte offset of each row, and then the relative 
offset and length of each cell - which, if you're lucky, might fit into 
memory; otherwise you'd again use a sparse file. Then you'd walk through 
the index, make random seeks into the source file, and write the cells 
to your target CSV file in the desired order. Not quite a single pass on 
the source file, but at least you'd have the overhead of performing the 
CSV parsing only once.

 -Tom
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm