[ https://issues.apache.org/jira/browse/IMPALA-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107595#comment-17107595 ]

Tim Armstrong commented on IMPALA-1964:
---------------------------------------

This is pretty old and there have been a lot of improvements since. There are 
some bigger ones that I'm aware of: IMPALA-3902 will allow multiple threads to 
be used for inserts, and IMPALA-4356 made better use of codegen for evaluating 
expressions - it looked like most of the time was spent in EncodeTimer. I'm 
going to close this for now.
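For context, once IMPALA-3902 lands, the degree of write parallelism should be 
controllable per query. A minimal sketch, assuming the existing MT_DOP query 
option ends up gating the multi-threaded insert path (table names are 
illustrative):

```
$ impala-shell
[impalad:21000] > SET MT_DOP=8;  -- illustrative value; assumes inserts honor it
[impalad:21000] > INSERT INTO target_tbl SELECT * FROM source_tbl;
```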

> Improve Write Performance
> -------------------------
>
>                 Key: IMPALA-1964
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1964
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Perf Investigation
>    Affects Versions: Impala 2.3.0
>            Reporter: Prateek Rungta
>            Priority: Major
>              Labels: performance
>         Attachments: Profile_After_Hashing.txt, Profile_Before_Hashing.txt
>
>
> Impala write performance is bottlenecked by the single-threaded 
> HDFSTableSink. Consider the following use case:
> - For a bare-metal cluster with 3 managers & 7 datanodes, 11 drives/datanode:
>     - Each drive gave me a respectable ~100 MB/s read and write performance:
> ```
> [root@c3kuhdpnode1 ~]# hdparm -t -T /dev/sdk1
> /dev/sdk1:
>  Timing cached reads:   18650 MB in  1.99 seconds = 9349.32 MB/sec
>  Timing buffered disk reads: 348 MB in  3.00 seconds = 115.88 MB/sec
> [root@c3kuhdpnode1 ~]# time dd if=/dev/sdk1 of=/$drive/sdk1.zero bs=1024 count=10000000
> 10000000+0 records in
> 10000000+0 records out
> 10240000000 bytes (10 GB) copied, 90.6001 s, 113 MB/s
> real    1m30.602s
> user    0m0.929s
> sys     0m35.295s
> ```
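> For comparison, a pure sequential-write test of one drive might look like 
> this (a sketch; the /data11 mount point and file name are illustrative):
> ```
> # write 10 GiB of zeros with O_DIRECT to bypass the page cache,
> # so the drive itself is measured rather than memory
> dd if=/dev/zero of=/data11/ddwrite.zero bs=1M count=10240 oflag=direct
> ```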
> - For a single CTAS query:
>     - The query running by itself on the cluster took 37 min. See 
> Profile_Before_Hashing.txt for the explain plan.
>     - 10 modified versions of the query, each performing 1/10 of the 
> writes, took ~22 min running concurrently. See Profile_After_Hashing.txt for 
> one of the 10 explain plans. A sketch of this manual split follows below.
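> A minimal sketch of that manual split (the source/target table names and the 
> integer column id are hypothetical; fnv_hash() is an Impala built-in):
> ```
> # launch the 10 slice queries concurrently; each writes a disjoint 1/10
> for i in $(seq 0 9); do
>   impala-shell -q "CREATE TABLE tgt_slice_${i} STORED AS PARQUET AS
>     SELECT * FROM src WHERE abs(fnv_hash(id)) % 10 = ${i}" &
> done
> wait
> ```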
> In both cases, I found the majority of the time is spent in HDFSTableSink. 
> I'd expect that portion of the query to be able to write nearly as fast as a 
> Teragen on the cluster could, which is not what we are observing. 
> This is pretty terrible for a few different reasons, primarily that we can't 
> use Hive to generate Parquet because it might generate multiple HDFS blocks.


