[jira] [Commented] (IMPALA-8284) KuduTableSink spends a lot of CPU copying KuduColumnSchemas

2019-03-11 Thread ASF subversion and git services (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790090#comment-16790090 ]

ASF subversion and git services commented on IMPALA-8284:
-

Commit 49027de9ff4450a438c8e631373c572ce189b36e in impala's branch 
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=49027de ]

IMPALA-8284. KuduTableSink spends too much CPU in KuduSchema::Column()

The KuduSchema::Column() accessor actually returns a copy of the
KuduColumnSchema object, which is not lightweight. We were inadvertently
calling this function once for every null cell seen during an insertion.
This caused a performance bottleneck for datasets with large numbers of
NULL cells.

This improves the situation by caching the nullability of the Kudu
columns in our own vector. The vector lookups should be inlined and much
faster than copying a KuduColumnSchema.

No new tests included as this is a perf fix.

Change-Id: I1b4d14d20252bdb190f50ebaaf6179a46eafb932
Reviewed-on: http://gerrit.cloudera.org:8080/12692
Reviewed-by: Will Berkeley 
Reviewed-by: Thomas Marshall 
Tested-by: Impala Public Jenkins 
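
As an illustration of the fix described in the commit message above, the
following is a minimal, self-contained C++ sketch of caching column
nullability outside the hot loop. It is not the Impala patch and does not
use the Kudu client library: FakeColumnSchema, FakeSchema, SinkSketch,
g_schema_copies and CountCopiesForNullCells are hypothetical stand-ins,
and the only behaviour assumed from the source is that Column() returns
its result by value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Counts copy constructions of FakeColumnSchema (illustration only).
int g_schema_copies = 0;

// Hypothetical stand-in for a column schema object that is costly to copy.
// This is NOT the real KuduColumnSchema.
struct FakeColumnSchema {
  std::string name;
  bool nullable;

  FakeColumnSchema(std::string n, bool is_nullable)
      : name(std::move(n)), nullable(is_nullable) {}
  FakeColumnSchema(const FakeColumnSchema& other)
      : name(other.name), nullable(other.nullable) {
    ++g_schema_copies;
  }
};

// Hypothetical stand-in for a schema whose accessor returns by value.
class FakeSchema {
 public:
  explicit FakeSchema(std::vector<FakeColumnSchema> cols)
      : cols_(std::move(cols)) {}
  // Returning by value means every call copies the column object.
  FakeColumnSchema Column(std::size_t i) const { return cols_[i]; }
  std::size_t num_columns() const { return cols_.size(); }

 private:
  std::vector<FakeColumnSchema> cols_;
};

// Hypothetical sink: Open() caches nullability once, the hot path reads it.
class SinkSketch {
 public:
  explicit SinkSketch(const FakeSchema& schema) : schema_(schema) {}

  // One-time setup: one Column() copy per column, keeping only the flag.
  void Open() {
    nullable_.clear();
    nullable_.reserve(schema_.num_columns());
    for (std::size_t i = 0; i < schema_.num_columns(); ++i) {
      nullable_.push_back(schema_.Column(i).nullable);
    }
  }

  // Slow variant: copies a column object on every call.
  bool NullAllowedSlow(std::size_t col) const {
    return schema_.Column(col).nullable;
  }

  // Cached variant: a plain vector lookup, no schema copy.
  bool NullAllowedCached(std::size_t col) const {
    assert(col < nullable_.size());
    return nullable_[col];
  }

 private:
  const FakeSchema& schema_;  // must outlive the sink
  std::vector<bool> nullable_;
};

struct CopyCounts {
  int slow_copies;
  int cached_copies;
  bool answers_agree;
};

// Simulates `null_cells` null checks on a nullable column with each variant
// and reports how many column copies each one triggered.
CopyCounts CountCopiesForNullCells(int null_cells) {
  std::vector<FakeColumnSchema> cols;
  cols.emplace_back("id", false);
  cols.emplace_back("note", true);
  FakeSchema schema(std::move(cols));
  SinkSketch sink(schema);
  sink.Open();

  int before = g_schema_copies;
  int slow_allowed = 0;
  for (int i = 0; i < null_cells; ++i) {
    slow_allowed += sink.NullAllowedSlow(1) ? 1 : 0;
  }
  const int slow = g_schema_copies - before;

  before = g_schema_copies;
  int cached_allowed = 0;
  for (int i = 0; i < null_cells; ++i) {
    cached_allowed += sink.NullAllowedCached(1) ? 1 : 0;
  }
  const int cached = g_schema_copies - before;

  return CopyCounts{slow, cached, slow_allowed == cached_allowed};
}
```

In this sketch, calling CountCopiesForNullCells(n) reports n column
copies for the uncached path and none for the cached path, while both
paths give the same answer. The copies made inside Open() are paid once
per column rather than once per null cell.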


> KuduTableSink spends a lot of CPU copying KuduColumnSchemas
> ---
>
> Key: IMPALA-8284
> URL: https://issues.apache.org/jira/browse/IMPALA-8284
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 3.1.0
> Reporter: Will Berkeley
> Assignee: Todd Lipcon
> Priority: Major
> Labels: kudu, newbie, ramp-up
> Fix For: Impala 3.2.0
>
>
> I noticed Impala spending a significant amount of CPU time in 
> {{KuduTableSink::Send}} creating and destroying KuduColumnSchemas.
> See KUDU-2731 for more information.
> Impala could wait for a better option from the Kudu API, or could cache 
> information about nullability of columns outside the hot loop in Send.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-8284) KuduTableSink spends a lot of CPU copying KuduColumnSchemas

2019-03-05 Thread Tim Armstrong (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785122#comment-16785122 ]

Tim Armstrong commented on IMPALA-8284:
---

[~wdberkeley] Yes, we could definitely cache it in Open(). It seems like the 
author probably didn't realise that Column() was a heavyweight operation.



