Re: [Haskell-cafe] High memory usage with 1.4 Million records?

2012-06-08 Thread Andrew Myers
Thanks for the responses, everyone; I'll try them out and see what happens :)
Andrew

On Fri, Jun 8, 2012 at 4:40 PM, Johan Tibell  wrote:

> Hi Andrew,
>
> On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers  wrote:
> > Hi Cafe,
> > I'm working on inspecting some data that I'm trying to represent as
> > records in Haskell, and I'm seeing about twice the memory footprint I was
> > expecting.  I've got roughly 1.4 million records in a CSV file (400M on
> > disk) that I parse using bytestring-csv.  bytestring-csv returns a
> > [[ByteString]] (wrapped in `type`s), which I then convert into a list of
> > records with the following structure:
> >
> >> 3 * Int
> >> 1 * Text (length 3)
> >> 1 * Text (length 11)
> >> 12 * Float
> >> 1 * UTCTime
> >
> > All fields are marked strict and have {-# UNPACK #-} pragmas (I'm
> > guessing that doesn't do anything for non-primitives).  (Side note: is
> > there a way to check if things are actually being unpacked?)
>
> GHC used to complain when you used UNPACK on something that can't be
> unpacked, but that warning seems to have been (accidentally) removed
> in 7.4.1.
>
> The rule for unpacking is:
>
> * all product types (i.e. types with only one constructor) can be
> unpacked. This includes Int, Char, Double, etc., and tuples or records
> thereof.
> * sum types (i.e. data types with more than one constructor) and
> polymorphic fields can't be unpacked.
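>
> As a concrete illustration, here's a minimal sketch of such a record
> (field names invented; only a few of the fields shown) with notes on
> which pragmas actually take effect:
>
> import Data.Text (Text)
> import Data.Time (UTCTime)
>
> data Record = Record
>   { rA :: {-# UNPACK #-} !Int    -- unpacks: one constructor, primitive payload
>   , rB :: !Text                  -- the character data stays on the heap either way
>   , rC :: {-# UNPACK #-} !Float  -- unpacks (stands in for the 12 Float fields)
>   , rD :: !UTCTime               -- the Integers inside can never be unpacked
>   }
>
> One way to check what was actually unpacked is to compile with
> ghc -O2 -ddump-simpl and read off how the constructor's fields are
> represented in the Core output.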
>
> > My back-of-the-napkin memory estimates, based on the assumption that
> > nothing is being unpacked (and my very spotty understanding of Haskell
> > data structures):
> >
> > Platform: 64-bit Linux
> > #  Type (sizeof type (occasionally a guess))
> >
> > 3 * Int (8)
> > 14 * Char (4) -- Text is some kind of bytestring but I'm guessing it
> > can't be worse than the same number of Char?
> > 12 * Float (4)
> > 18 * sizeOf (ptr) (8)
> > UTC:  -- From what I can gather through :info in ghci
> > 4 * (ptr) (8)
> > 2 * Integer (16) -- Shouldn't be overly large, times are within 2012
>
> All fields in a constructor are word-aligned. This means that all
> primitive types take 8 bytes on a 64-bit platform, including Char and
> Float. You might find the following blog posts of mine useful for
> computing the size of data structures:
>
>
> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
> http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html
> http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html
>
> Here's some more on the topic:
>
>
> http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types
>
> http://stackoverflow.com/questions/657/how-to-find-out-ghcs-memory-representations-of-data-types
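>
> As a rough sanity check under those rules, here's a hedged recount of
> the record above, assuming every field stays boxed and ignoring the
> Text and UTCTime payloads entirely:
>
> wordsPerRecord, bytesTotal :: Int
> wordsPerRecord = (1 + 18)   -- record itself: header + 18 pointer fields
>                + 3 * 2      -- boxed Int: header + payload each
>                + 12 * 2     -- boxed Float: header + payload each
>                + 3          -- one (:) cell: header + head + tail
> bytesTotal = 1408113 * wordsPerRecord * 8
>   -- 52 words = 416 bytes per record, ~586 MB in total, before Text
>   -- and UTCTime payloads, thunks, or any retained ByteStrings; that
>   -- is likely where the rest of the gap to 6.9G comes from.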
>
> > I've written a small driver test program that just parses the CSV, finds
> > the minimum value for a couple of the Float fields, and exits.  In the
> > process monitor, the memory usage is 6.9G before the program exits.  I've
> > tried profiling with +RTS -hc but it ran for >3 hours without finishing;
> > the program normally finishes within 4 minutes.  Anyone have any ideas
> > for me?  Things to try?
> > Thanks,
> > Andrew
>
> You could try to use a 32-bit GHC, which would use about half the
> memory. You're at the limit of the size of data that you can
> comfortably fit in memory on a normal desktop machine, so it might be
> time to consider a streaming approach.
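>
> A minimal streaming sketch (assuming a plain comma layout with no
> quoted fields, the interesting Float in a hypothetical column 5, and
> Unix line endings; a real file wants an incremental CSV parser):
>
> import qualified Data.ByteString.Lazy.Char8 as BL
> import Data.List (foldl')
>
> main :: IO ()
> main = do
>   contents <- BL.readFile "records.csv"  -- hypothetical file name
>   let col5 = [ read (BL.unpack (BL.split ',' line !! 5)) :: Float
>              | line <- BL.lines contents ]
>   -- foldl' only keeps the running minimum alive, so the file is
>   -- consumed chunk by chunk and memory use stays roughly constant.
>   print (foldl' min (1/0) col5)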
>
> -- Johan
>


[Haskell-cafe] High memory usage with 1.4 Million records?

2012-06-07 Thread Andrew Myers
Hi Cafe,
I'm working on inspecting some data that I'm trying to represent as records
in Haskell, and I'm seeing about twice the memory footprint I was
expecting.  I've got roughly 1.4 million records in a CSV file (400M on
disk) that I parse using bytestring-csv.  bytestring-csv returns a
[[ByteString]] (wrapped in `type`s), which I then convert into a list of
records with the following structure:

> 3 * Int
> 1 * Text (length 3)
> 1 * Text (length 11)
> 12 * Float
> 1 * UTCTime

All fields are marked strict and have {-# UNPACK #-} pragmas (I'm guessing
that doesn't do anything for non-primitives).  (Side note: is there a way
to check if things are actually being unpacked?)

My back-of-the-napkin memory estimates, based on the assumption that nothing
is being unpacked (and my very spotty understanding of Haskell data
structures):

Platform: 64-bit Linux
#  Type (sizeof type (occasionally a guess))

3 * Int (8)
14 * Char (4) -- Text is some kind of bytestring but I'm guessing it can't
be worse than the same number of Char?
12 * Float (4)
18 * sizeOf (ptr) (8)
UTC:  -- From what I can gather through :info in ghci
4 * (ptr) (8)
2 * Integer (16) -- Shouldn't be overly large, times are within 2012

List: (Pointer to element and next cons cell)
1408113 * 8 * 2

=
~451M + 21.5M
So even if the original bytestring file is being kept entirely in memory
somehow, that's not more than 3G.
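
Spelling out the arithmetic behind those two numbers as a quick Haskell
sketch (the per-record figure just sums the sizes listed above):

perRecordBytes, recordsBytes, listBytes :: Int
perRecordBytes = 3*8 + 14*4 + 12*4 + 18*8 + 4*8 + 2*16  -- 336 bytes
recordsBytes   = 1408113 * perRecordBytes               -- ~451 MiB
listBytes      = 1408113 * 8 * 2                        -- ~21.5 MiB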

I've written a small driver test program that just parses the CSV, finds
the minimum value for a couple of the Float fields, and exits.  In the
process monitor, the memory usage is 6.9G before the program exits.  I've
tried profiling with +RTS -hc but it ran for >3 hours without finishing;
the program normally finishes within 4 minutes.  Anyone have any ideas for
me?  Things to try?
Thanks,
Andrew