[jira] [Comment Edited] (IMPALA-7367) Pack StringValue, CollectionValue and TimestampValue slots

2018-09-04 Thread Pooja Nilangekar (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603589#comment-16603589
 ] 

Pooja Nilangekar edited comment on IMPALA-7367 at 9/4/18 9:25 PM:
--

There is an issue with packing TimestampValue. GCC doesn't pack the class 
because its variables are not explicitly declared as packed. So the compiler 
complains with the following error for the date_ and time_ attributes:
{code:java}
error: ignoring packed attribute because of unpacked non-POD field
{code}
However, boost::posix_time::time_duration and boost::gregorian::date each 
contain single elements of uint64_t and int32_t respectively. I am unaware of 
any method to explicitly declare them as packed structures. How should we 
handle this? Would it be acceptable to pack StringValue and CollectionValue for 
now and create a Jira to track TimestampValue for handling it when boost marks 
them as packed?
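
For reference, the 16-byte versus 12-byte difference can be reproduced with a minimal sketch. It assumes a 64-bit target and GCC or Clang, which both accept __attribute__((packed)). UnpackedSlot, PackedSlot and BytesSaved are hypothetical names used only for illustration; they are not the Impala classes. Both structs hold only POD members, so the attribute is honoured here, unlike the TimestampValue case above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for a slot holding an 8-byte pointer and a 4-byte
// length, i.e. 12 bytes of actual data. These are NOT the Impala classes.
struct UnpackedSlot {
  char* ptr;
  int32_t len;
};

// Same members, but the compiler is asked to drop the tail padding.
struct __attribute__((packed)) PackedSlot {
  char* ptr;
  int32_t len;
};

// The sketch assumes a 64-bit target where pointers are 8 bytes wide.
static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");

// Default layout: 4 bytes of tail padding round the size up to 16.
static_assert(sizeof(UnpackedSlot) == 16, "expected padded size");

// Packed layout: only the 12 bytes of data remain.
static_assert(sizeof(PackedSlot) == 12, "expected packed size");

// Bytes saved when storing `num_slots` values in packed form.
constexpr std::size_t BytesSaved(std::size_t num_slots) {
  return num_slots * (sizeof(UnpackedSlot) - sizeof(PackedSlot));
}
```

Under these assumptions each packed slot saves 4 bytes, a quarter of the unpacked size.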


was (Author: poojanilangekar):
There is an issue with packing TimestampValue. GCC doesn't pack the class 
because its variables are not explicitly declared as packed. So the compiler 
complains with the following error for the date_ and time_ attributes:
{code:java}
error: ignoring packed attribute because of unpacked non-POD field
{code}
However, boost::posix_time::time_duration and boost::gregorian::date are indeed 
packed. This is a known GCC bug 
[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60972]. Since the attributes 
belong to the boost library, I am unaware of any method to explicitly declare 
them as packed structures. How should we handle this? Would it be acceptable to 
pack StringValue and CollectionValue for now and create a Jira to track 
TimestampValue for when the GCC bug gets resolved?

> Pack StringValue, CollectionValue and TimestampValue slots
> --
>
> Key: IMPALA-7367
> URL: https://issues.apache.org/jira/browse/IMPALA-7367
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Pooja Nilangekar
>Priority: Major
>  Labels: perfomance
> Attachments: 0001-WIP.patch
>
>
> This is a follow-on to finish up the work from IMPALA-2789. IMPALA-2789 
> didn't actually fully pack the memory layout because StringValue, 
> TimestampValue and CollectionValue still occupy 16 bytes but only have 12 
> bytes of actual data. This results in a higher memory footprint, which leads 
> to higher memory requirements and worse performance. We don't get any benefit 
> from the padding since the majority of tuples are not actually aligned in 
> memory anyway.
> I did a quick version of the change for StringValue only which improves TPC-H 
> performance.
> {noformat}
> Report Generated on 2018-07-30
> Run Description: "b5608264b4552e44eb73ded1e232a8775c3dba6b vs 
> f1e401505ac20c0400eec819b9196f7f506fb927"
> Cluster Name: UNKNOWN
> Lab Run Info: UNKNOWN
> Impala Version:  impalad version 3.1.0-SNAPSHOT RELEASE ()
> Baseline Impala Version: impalad version 3.1.0-SNAPSHOT RELEASE (2018-07-27)
> +----------+-----------------------+---------+------------+------------+----------------+
> | Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
> +----------+-----------------------+---------+------------+------------+----------------+
> | TPCH(10) | parquet / none / none | 2.69    | -4.78%     | 2.09       | -3.11%         |
> +----------+-----------------------+---------+------------+------------+----------------+
> +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
> | Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
> +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
> | TPCH(10) | TPCH-Q22 | parquet / none / none | 0.94   | 0.93        | +0.75%     | 3.37%     | 2.84%          | 1           | 30    |
> | TPCH(10) | TPCH-Q13 | parquet / none / none | 3.32   | 3.32        | +0.13%     | 1.74%     | 2.09%          | 1           | 30    |
> | TPCH(10) | TPCH-Q11 | parquet / none / none | 0.99   | 0.99        | -0.02%     | 3.74%     | 3.16%          | 1           | 30    |
> | TPCH(10) | TPCH-Q5  | parquet / none / none | 2.30   | 2.33        | -0.96%     | 2.15%     | 2.45%          | 1           | 30    |
> | TPCH(10) | TPCH-Q2  | parquet / none / none | 1.55   | 1.57        | -1.45%     | 1.65%     | 1.49%          | 1           | 30    |
> | TPCH(10) | TPCH-Q8  | parquet / none / none | 2.89   | 2.93        | -1.51%     | 2.69%     | 1.34%          | 1           | 3

[jira] [Comment Edited] (IMPALA-7367) Pack StringValue, CollectionValue and TimestampValue slots

2018-09-20 Thread Pooja Nilangekar (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622879#comment-16622879
 ] 

Pooja Nilangekar edited comment on IMPALA-7367 at 9/20/18 11:49 PM:


I ran TPCH with a scale factor of 60 on a minicluster with a patch for 
StringValue and CollectionValue slots. Here is the summary of results: 

{noformat}
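+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(60) | parquet / none / none | 12.45   | -29.84%    | 8.63       | -11.30%        |
+----------+-----------------------+---------+------------+------------+----------------+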
{noformat}


The queries that showed a significant performance gain did use strings or 
timestamps stored as strings. I can understand that we should see an 
improvement; however, I am not sure about the magnitude. 

Also, only 2 queries showed a regression of more than 1%. In both cases, 
the absolute difference was less than 5 ms while the query took a few seconds to 
run, so this could just be system noise. 


