Hi Anton,

On 12/07/2019 at 23:21, Malakhov, Anton wrote:
> 
> The result is that all these execution nodes scale well enough and run under 
> 100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz, 128Gb RAM while CSV 
> reader takes several seconds to complete even reading from in-memory file 
> (8Gb), thus it is not IO bound yet even with good consumer-grade SSDs. Thus 
> my focus recently has been around optimization of CSV parser where I have 
> achieved 50% improvement substituting all the small object allocations via 
> TBB scalable allocator and using TBB-based memory pool instead of default one 
> with pre-allocated huge (2Mb) memory pages (echo 30000 > 
> /proc/sys/vm/nr_hugepages). I found no way yet how to do both of these tricks 
> with jemalloc, so please try to beat or meet my times without TBB allocator.
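For anyone wanting to reproduce this setup without patching the code: TBB ships a malloc proxy library that replaces the standard allocator process-wide via LD_PRELOAD, and its scalable allocator can be asked to use huge pages through an environment variable. A sketch, assuming a Linux system with TBB installed and a hypothetical benchmark binary name (`arrow-csv-benchmark`); the exact proxy soname depends on your TBB version:

```shell
# Route all malloc/free in the process through TBB's scalable allocator,
# with no code changes (tbbmalloc_proxy's documented usage):
LD_PRELOAD=libtbbmalloc_proxy.so ./arrow-csv-benchmark

# Pre-allocate the kernel huge-page pool (Anton's command), then tell
# the TBB allocator to back large allocations with 2 MB huge pages:
echo 30000 | sudo tee /proc/sys/vm/nr_hugepages
TBB_MALLOC_USE_HUGE_PAGES=1 LD_PRELOAD=libtbbmalloc_proxy.so ./arrow-csv-benchmark
```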

That sounds interesting, though optimizing memory allocations is
probably not the most enticing use case for TBB.  Memory allocators
fare differently on different workloads, and the fact that TBB's
allocator wins in some situations doesn't mean it will always win.
Similarly, jemalloc is not the best choice for every use case.

Note that, as Arrow is a library, we don't want to impose a memory
allocator on the user, which is why jemalloc is merely optional.

(By the way, one reason we added the jemalloc option is that jemalloc
has non-standard APIs for aligned allocation and reallocation.)

> I also see other hotspots and opportunities for optimizations, some examples 
> are memset is being heavily used while resizing buffers (why and why?) and 
> the column builder trashes caches by not using of streaming stores.

Could you open JIRA issues with your investigations?  I'd be interested
to know what the actual execution bottlenecks are in the CSV reader.

> I used TBB directly to make the execution nodes parallel, however I have also 
> implemented a simple TBB-based ThreadPool and TaskGroup as you can see in 
> this PR: https://github.com/aregm/arrow/pull/6
> I see consistent improvement (up to 1200%!) on BM_ThreadedTaskGroup and 
> BM_ThreadPoolSpawn microbenchmarks, however applying it to the real world 
> task of CSV reader, I don't see any improvements yet.

One thing you could try is to shrink the block size in the CSV reader
and see where performance starts to fall off significantly.  With the
current TaskGroup overhead, small block sizes will suffer a lot; I
expect TBB to fare better.

(and / or try a CSV file with a hundred columns or so)
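The experiment can be mimicked with the standard library alone: keep the total work fixed, shrink the per-task block size, and watch per-task scheduling overhead grow.  A sketch, where `process_block` is a stand-in for parsing one CSV block (this is not the Arrow reader itself, whose knob is its read block size):

```python
# Stand-in for the block-size experiment: fixed total work, varying
# task granularity.  As block_size shrinks, the number of tasks grows
# and scheduling overhead starts to dominate the useful work.
import time
from concurrent.futures import ThreadPoolExecutor


def process_block(block):
    """Stand-in for parsing/converting one CSV block."""
    return sum(block)


def run(data, block_size, pool):
    """Split data into blocks and process them on the pool."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return sum(pool.map(process_block, blocks))


if __name__ == "__main__":
    data = list(range(1_000_000))
    with ThreadPoolExecutor(max_workers=4) as pool:
        for block_size in (100_000, 10_000, 1_000):
            t0 = time.perf_counter()
            total = run(data, block_size, pool)
            dt = time.perf_counter() - t0
            print(f"block_size={block_size:>7}: {dt * 1e3:.1f} ms")
    assert total == sum(range(1_000_000))
```

The point at which the timings degrade indicates the per-task overhead of the pool; a cheaper task-group implementation (TBB's, say) pushes that point toward smaller blocks.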

> Or even worse, while reading the file, TBB wastes some cycles spinning.

That doesn't sound good (though it is a separate issue from the main
TaskGroup usage, IMHO).  Perhaps TBB doesn't provide a facility for
background IO threads?

> I'll be looking into applying more sophisticated NUMA and locality-aware 
> tricks as I'll be cleaning paths for the data streams in the parser.

Hmm, as a first approach, I don't think we should waste time trying such
sophisticated optimizations (well, of course, you are free to do so :-)).

Regards

Antoine.
