Re: [Go] High memory usage on CSV read into table

2018-12-04 Thread Daniel Harper
Sorry, I've been away at re:Invent.

I've just tried out what's currently on master (with the chunked change that
looks like it has been merged). I'll break down the different parts later,
but as a high-level look at running the same script as described above,
these are the numbers:

https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing

It looks to me like the change has definitely helped, with memory usage
dropping to around 300MB, although usage doesn't really change much once
the chunk size is > 1000.
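
For reference, a minimal sketch of what reading with the chunking option
looks like (this assumes the csv.WithChunk option from the PR linked below;
the schema fields here are illustrative, not the actual columns of
concurrent_streams.csv, which would all need to be listed):

package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	f, err := os.Open("concurrent_streams.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Illustrative schema; the remaining columns of the real 12-column
	// file would need to be listed here as well.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.BinaryTypes.String},
		{Name: "cdn", Type: arrow.BinaryTypes.String},
		{Name: "concurrents", Type: arrow.PrimitiveTypes.Int64},
	}, nil)

	// WithChunk controls how many CSV rows are packed into each Record;
	// per the spreadsheet above, memory usage stops changing much beyond
	// a chunk size of around 1000.
	r := csv.NewReader(f, schema, csv.WithComma(','), csv.WithChunk(1000))
	defer r.Release()

	for r.Next() {
		rec := r.Record() // valid until the next call to Next() unless retained
		_ = rec
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}
}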




Daniel Harper
http://djhworld.github.io


On Fri, 23 Nov 2018 at 10:58, Sebastien Binet  wrote:

> On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney  wrote:
>
> > That seems buggy then. There is only 4.125 bytes of overhead per
> > string value on average (a 32-bit offset, plus a valid bit)
> > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper 
> > wrote:
> > >
> > > Uncompressed
> > >
> > > $ ls -la concurrent_streams.csv
> > > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> > >
> > > $ wc -l concurrent_streams.csv
> > >  1007481 concurrent_streams.csv
> > >
> > >
> > > Daniel Harper
> > > http://djhworld.github.io
> > >
> > >
> > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney 
> wrote:
> > >
> > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > strings in memory. Is it compressed?
> > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper 
> > > > wrote:
> > > > >
> > > > > Thanks,
> > > > >
> > > > > I've tried the new code and that seems to have shaved about 1GB of
> > memory
> > > > > off, so the heap is about 8.84GB now, here is the updated pprof
> > output
> > > > > https://i.imgur.com/itOHqBf.png
> > > > >
> > > > > It looks like the majority of allocations are in the
> > memory.GoAllocator
> > > > >
> > > > > (pprof) top
> > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > Showing top 10 nodes out of 41
> > > > >       flat  flat%   sum%        cum   cum%
> > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > > >
> > > > >
> > > > > I'm a bit busy at the moment but I'll probably repeat the same test
> > on
> > > > the
> > > > > other Arrow implementations (e.g. Java) to see if they allocate a
> > similar
> > > > > amount.
> >
>
> I've implemented chunking over there:
>
> - https://github.com/apache/arrow/pull/3019
>
> could you try with a couple of chunking values?
> e.g.:
> - csv.WithChunk(-1): reads the whole file into memory, creates one big
> record
> - csv.WithChunk(nrows/10): creates 10 records
>
> also, it would be great to try to disentangle the memory usage of the "CSV
> reading part" from the "Table creation" one:
> - have some perf numbers w/o storing all these Records into a []Record
> slice,
> - have some perf numbers w/ only storing these Records into a []Record
> slice,
> - have some perf numbers w/ storing the records into the slice + creating
> the Table.
>
> hth,
> -s
>
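
One rough way to get the three sets of numbers Sebastien asks for (a
hypothetical helper, not something from the thread) is to snapshot the live
heap after each stage: reading without storing, storing into a []Record
slice, and storing plus building the Table.

package main

import (
	"fmt"
	"runtime"
)

// heapMB reports the current live heap in MiB.
func heapMB() float64 {
	var ms runtime.MemStats
	runtime.GC() // collect garbage first so only live data is counted
	runtime.ReadMemStats(&ms)
	return float64(ms.HeapAlloc) / (1 << 20)
}

func main() {
	fmt.Printf("baseline:    %.1f MiB\n", heapMB())

	// stage 1: iterate the csv.Reader and drop every Record
	// stage 2: iterate and retain every Record into a []array.Record
	// stage 3: stage 2 plus array.NewTableFromRecords over the slice
	// ... run one stage here, then:

	fmt.Printf("after stage: %.1f MiB\n", heapMB())
}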


Re: [Go] High memory usage on CSV read into table

2018-11-19 Thread Daniel Harper
Uncompressed

$ ls -la concurrent_streams.csv
-rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv

$ wc -l concurrent_streams.csv
 1007481 concurrent_streams.csv
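
As a rough back-of-envelope check (assuming ~11 string columns and one int64
column per row, as in the sample rows quoted later in the thread, and the
~4.125 bytes of per-string-value overhead Wes mentions elsewhere in the
thread), an Arrow table for this file should be on the order of:

  raw string bytes                                      <= ~112 MB  (roughly the file itself, minus quotes/commas)
  offsets + validity: 1,007,481 rows x 11 cols x 4.125 B ~=   46 MB
  int64 column:       1,007,481 rows x 8 B               ~=    8 MB
  ------------------------------------------------------------------
  expected order of magnitude                             ~ 170 MB

which is a long way from the ~9GB observed, suggesting most of the usage is
builder/allocator overhead rather than the data itself.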


Daniel Harper
http://djhworld.github.io


On Mon, 19 Nov 2018 at 21:55, Wes McKinney  wrote:

> I'm curious how the file is only 100MB if it's producing ~6GB of
> strings in memory. Is it compressed?
> On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper 
> wrote:
> >
> > Thanks,
> >
> > I've tried the new code and that seems to have shaved about 1GB of memory
> > off, so the heap is about 8.84GB now, here is the updated pprof output
> > https://i.imgur.com/itOHqBf.png
> >
> > It looks like the majority of allocations are in the memory.GoAllocator
> >
> > (pprof) top
> > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > Showing top 10 nodes out of 41
> >       flat  flat%   sum%        cum   cum%
> >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> >
> >
> > I'm a bit busy at the moment but I'll probably repeat the same test on
> the
> > other Arrow implementations (e.g. Java) to see if they allocate a similar
> > amount.
> >
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> >
> > On Mon, 19 Nov 2018 at 10:17, Sebastien Binet  wrote:
> >
> > > hi Daniel,
> > > On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper 
> > > wrote:
> > >
> > > > Sorry just realised SVG doesn't work.
> > > >
> > > > PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> > > >
> > > >
> > > > Daniel Harper
> > > > http://djhworld.github.io
> > > >
> > > >
> > > > On Sun, 18 Nov 2018 at 21:07, Daniel Harper 
> > > wrote:
> > > >
> > > > > Wasn't sure where the best place to discuss this, but I've noticed
> that
> > > > > when running the following piece of code
> > > > >
> > > > > https://play.golang.org/p/SKkqPWoHPPS
> > > > >
> > > > > On a CSV file that contains roughly 1 million records (about
> 100MB of
> > > > > data), the memory usage of the process leaps to about 9.1GB
> > > > >
> > > > > The records look something like this
> > > > >
> > > > >
> > > > >
> > > >
> > >
> "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > > > >
> > > > >
> > > >
> > >
> "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> > > > >
> > > > > I've attached a pprof output of the process.
> > > > >
> > > > > From the looks of it the heavy use of _strings_ might be where
> most of
> > > > the
> > > > > memory is going.
> > > > >
> > > > > Is this expected? I'm new to the code, happy to help where
> possible!
> > > >
> > >
> > > it's somewhat expected.
> > >
> > > you use `ioutil.ReadFile` to get your data.
> > > this will read the whole file in memory and stick it there: so there's
> > > that.
> > > for much bigger files, I would recommend using `os.Open`.
> > >
> > > also, you don't release the individual records once passed to the
> table, so
> > > you have a memory leak.
> > > here is my current attempt:
> > > - https://play.golang.org/p/ns3GJW6Wx3T
> > >
> > > finally, as I was alluding to on the #data-science slack channel,
> right now
> > > Go arrow/csv will create a new Record for each row in the incoming CSV
> > > file.
> > > so you get a bunch of overhead for every row/record.
> > >
> > > a much more efficient way would be to chunk `n` rows into a single
> Record.
> > > an even more efficient way would be to create a dedicated csv.table
> type
> > > that implements array.Table (as it seems you're interested in using
> that
> > > interface) but only reads the incoming CSV file piecewise (ie:
> implementing
> > > the chunking I was alluding to above but w/o having to load the whole
> > > []Record slice.)
> > >
> > > as a first step to improve this issue, implementing chunking would
> already
> > > shave off a bunch of overhead.
> > >
> > > -s
> > >
>


Re: [Go] High memory usage on CSV read into table

2018-11-19 Thread Daniel Harper
Thanks,

I've tried the new code and that seems to have shaved about 1GB of memory
off, so the heap is about 8.84GB now, here is the updated pprof output
https://i.imgur.com/itOHqBf.png

It looks like the majority of allocations are in the memory.GoAllocator

(pprof) top
Showing nodes accounting for 8.84GB, 100% of 8.84GB total
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
    4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
    2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
    1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
    0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
    0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
    0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
    0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
    0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
         0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
         0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve


I'm a bit busy at the moment but I'll probably repeat the same test on the
other Arrow implementations (e.g. Java) to see if they allocate a similar
amount.
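
For anyone wanting to reproduce numbers like the above, a minimal sketch of
dumping a heap profile from Go (an assumption about the tooling, not
necessarily how this profile was captured):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// ... run the CSV-to-table code here first, then dump the live heap:
	f, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // run a GC so the profile reflects live, reachable data
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

The profile can then be inspected with something like `go tool pprof
heap.pprof` and `top`, or rendered with `go tool pprof -png heap.pprof >
heap.png`.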


Daniel Harper
http://djhworld.github.io


On Mon, 19 Nov 2018 at 10:17, Sebastien Binet  wrote:

> hi Daniel,
> On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper 
> wrote:
>
> > Sorry just realised SVG doesn't work.
> >
> > PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> >
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> >
> > On Sun, 18 Nov 2018 at 21:07, Daniel Harper 
> wrote:
> >
> > > Wasn't sure where the best place to discuss this, but I've noticed that
> > > when running the following piece of code
> > >
> > > https://play.golang.org/p/SKkqPWoHPPS
> > >
> > > On a CSV file that contains roughly 1 million records (about 100MB of
> > > data), the memory usage of the process leaps to about 9.1GB
> > >
> > > The records look something like this
> > >
> > >
> > >
> >
> "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > >
> > >
> >
> "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> > >
> > > I've attached a pprof output of the process.
> > >
> > > From the looks of it the heavy use of _strings_ might be where most of
> > the
> > > memory is going.
> > >
> > > Is this expected? I'm new to the code, happy to help where possible!
> >
>
> it's somewhat expected.
>
> you use `ioutil.ReadFile` to get your data.
> this will read the whole file in memory and stick it there: so there's
> that.
> for much bigger files, I would recommend using `os.Open`.
>
> also, you don't release the individual records once passed to the table, so
> you have a memory leak.
> here is my current attempt:
> - https://play.golang.org/p/ns3GJW6Wx3T
>
> finally, as I was alluding to on the #data-science slack channel, right now
> Go arrow/csv will create a new Record for each row in the incoming CSV
> file.
> so you get a bunch of overhead for every row/record.
>
> a much more efficient way would be to chunk `n` rows into a single Record.
> an even more efficient way would be to create a dedicated csv.table type
> that implements array.Table (as it seems you're interested in using that
> interface) but only reads the incoming CSV file piecewise (ie: implementing
> the chunking I was alluding to above but w/o having to load the whole
> []Record slice.)
>
> as a first step to improve this issue, implementing chunking would already
> shave off a bunch of overhead.
>
> -s
>
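
Putting Sebastien's suggestions together, here is a minimal sketch of the
streaming + release pattern he describes (assuming the arrow/go csv reader,
csv.WithChunk, and array.NewTableFromRecords; the schema and column names
are illustrative, not the real layout of the file):

package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	// os.Open streams the file rather than pulling it all into memory
	// the way ioutil.ReadFile does.
	f, err := os.Open("concurrent_streams.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.BinaryTypes.String},
		{Name: "concurrents", Type: arrow.PrimitiveTypes.Int64},
		// ... the remaining columns of the real file would be listed here.
	}, nil)

	r := csv.NewReader(f, schema, csv.WithChunk(1000))
	defer r.Release()

	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // keep the record alive beyond the next Next() call
		recs = append(recs, rec)
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}

	tbl := array.NewTableFromRecords(schema, recs)
	defer tbl.Release()

	// Once the table holds its own references, release the individual
	// records so their buffers are not leaked.
	for _, rec := range recs {
		rec.Release()
	}

	log.Printf("table has %d rows", tbl.NumRows())
}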


Re: [Go] High memory usage on CSV read into table

2018-11-18 Thread Daniel Harper
Sorry just realised SVG doesn't work.

PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png


Daniel Harper
http://djhworld.github.io


On Sun, 18 Nov 2018 at 21:07, Daniel Harper  wrote:

> Wasn't sure where the best place to discuss this, but I've noticed that
> when running the following piece of code
>
> https://play.golang.org/p/SKkqPWoHPPS
>
> On a CSV file that contains roughly 1 million records (about 100MB of
> data), the memory usage of the process leaps to about 9.1GB
>
> The records look something like this
>
>
> "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
>
> "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
>
> I've attached a pprof output of the process.
>
> From the looks of it the heavy use of _strings_ might be where most of the
> memory is going.
>
> Is this expected? I'm new to the code, happy to help where possible!
>
>
> Daniel Harper
> http://djhworld.github.io
>