On Tue, Dec 4, 2018 at 10:23 PM Daniel Harper <djharpe...@gmail.com> wrote:

> Sorry, I've been away at re:Invent.
>
> Just tried out what's currently on master (with the chunked change that
> looks like it has been merged). I'll do the breakdown of the different
> parts later, but at a high level, just running the same script as described
> above gives these numbers:
>
>
> https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing
>
> Looks to me like the change has definitely helped, with memory usage
> dropping to around 300MB, although the usage doesn't really change much
> once the chunk size is > 1000.
>

good. you might want to try with a chunk size of -1 (this loads the whole
CSV file into memory in one fell swoop).
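
something along these lines, for instance (an untested sketch: csv.NewReader
and csv.WithChunk are the real go/arrow/csv API, but the schema and column
names below are made up, so adapt them to the actual CSV):

    package main

    import (
        "fmt"
        "log"
        "os"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/csv"
    )

    func main() {
        // hypothetical schema: adjust names/types to the real columns.
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "stream_id", Type: arrow.BinaryTypes.String},
            {Name: "viewers", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        f, err := os.Open("concurrent_streams.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // WithChunk(-1) loads the whole file as one big Record;
        // WithChunk(n), n > 0, yields Records of n rows each.
        r := csv.NewReader(f, schema, csv.WithChunk(-1))

        for r.Next() {
            rec := r.Record()
            fmt.Printf("record: %d cols, %d rows\n", rec.NumCols(), rec.NumRows())
        }
        if err := r.Err(); err != nil {
            log.Fatal(err)
        }
    }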

also, there's this PR which should probably reduce the memory pressure further:
- https://github.com/apache/arrow/pull/3073

cheers,
-s


>
> Daniel Harper
> http://djhworld.github.io
>
>
> On Fri, 23 Nov 2018 at 10:58, Sebastien Binet <bi...@cern.ch> wrote:
>
> > On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> >
> > > That seems buggy then. There are only 4.125 bytes of overhead per
> > > string value on average (a 32-bit offset plus a validity bit, i.e.
> > > 4 + 1/8 bytes), so ~1M string values account for only ~4MB of overhead.
> > > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <djharpe...@gmail.com>
> > > wrote:
> > > >
> > > > Uncompressed
> > > >
> > > > $ ls -la concurrent_streams.csv
> > > > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> > > >
> > > > $ wc -l concurrent_streams.csv
> > > >  1007481 concurrent_streams.csv
> > > >
> > > >
> > > > Daniel Harper
> > > > http://djhworld.github.io
> > > >
> > > >
> > > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmck...@gmail.com>
> > > > wrote:
> > > >
> > > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > > strings in memory. Is it compressed?
> > > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharpe...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > I've tried the new code and that seems to have shaved about 1GB of
> > > > > > memory off, so the heap is about 8.84GB now. Here is the updated
> > > > > > pprof output:
> > > > > > https://i.imgur.com/itOHqBf.png
> > > > > >
> > > > > > It looks like the majority of allocations are in the
> > > > > > memory.GoAllocator:
> > > > > >
> > > > > > (pprof) top
> > > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > > Showing top 10 nodes out of 41
> > > > > >       flat  flat%   sum%        cum   cum%
> > > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%
> > > > > > github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%
> > > > > > github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%
> > > > > > github.com/apache/arrow/go/arrow/array.NewData
> > > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%
> > > > > > github.com/apache/arrow/go/arrow/array.NewStringData
> > > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%
> > > > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%
> > > > > > github.com/apache/arrow/go/arrow/array.NewChunked
> > > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%
> > > > > > github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > > >     0.01GB  0.15%   100%     0.21GB  2.37%
> > > > > > github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > > >          0     0%   100%        6GB 67.91%
> > > > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > > >          0     0%   100%     4.03GB 45.54%
> > > > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > > > >
> > > > > >
> > > > > > I'm a bit busy at the moment but I'll probably repeat the same
> > > > > > test on the other Arrow implementations (e.g. Java) to see if they
> > > > > > allocate a similar amount.
> > >
> >
> > I've implemented chunking in this PR:
> >
> > - https://github.com/apache/arrow/pull/3019
> >
> > could you try with a couple of chunking values?
> > e.g.:
> > - csv.WithChunk(-1): reads the whole file into memory, creates one big
> > record
> > - csv.WithChunk(nrows/10): creates 10 records
> >
> > also, it would be great to try to disentangle the memory usage of the
> > "CSV reading part" from the "Table creation" one (see the sketch after
> > the list below):
> > - have some perf numbers w/o storing all these Records into a []Record
> > slice,
> > - have some perf numbers w/ only storing these Records into a []Record
> > slice,
> > - have some perf numbers w/ storing the records into the slice + creating
> > the Table.
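> >
> > for the last two measurements, something along these lines (an untested
> > sketch: it assumes a csv.Reader r created via csv.NewReader(f, schema, ...)
> > and the github.com/apache/arrow/go/arrow/array package):
> >
> >     // storing the Records into a []Record slice: Retain each one,
> >     // since the reader releases the previous record on the next
> >     // call to Next.
> >     var recs []array.Record
> >     for r.Next() {
> >         rec := r.Record()
> >         rec.Retain()
> >         recs = append(recs, rec)
> >     }
> >
> >     // creating the Table: stitch the stored records together.
> >     tbl := array.NewTableFromRecords(schema, recs)
> >     defer tbl.Release()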
> >
> > hth,
> > -s
> >
>
