subject:"\[Impala\-ASF\-CR\] IMPALA\-5522\:Use tracked memory for DictDecoder and DictEncoder"

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-11-09 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 16:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@54
PS15, Line 54: DCHECK(buffered_indices_.empty());
> Ok to leave for now but we've generally been moving towards using the recen
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@59
PS15, Line 59: / This
> DCHECK_EQ(dict_bytes_cnt_, 0);
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   /// indices. Used to size the buffer passed to WriteData().
> To add to what Tim said, this code didn't actually change, so it is nice to
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   /// indices. Used to size the buffer passed to WriteData().
> We generally use the more concise one-line formatting for very short single
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@123
PS15, Line 123:
> formatting nit: * goes on the left with the type name.
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@152
PS15, Line 152:   }
  :   num_call_track_++;
  : }
  :   }
  : };
> I think we can omit this destructor, since DictEncoderBase does the same th
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@168
PS15, Line 168: /// th
> nit: inline isn't necessary. It isn't harmful but can be confusing to have
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@269
PS15, Line 269:
> nit: * should be on left.
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@296
PS15, Line 296:
  : template
  : class DictDecoder : public DictDecoderBase {
  :  public:
  :   //
> I think we can omit this destructor, as DictDecoderBase does this.
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@330
PS15, Line 330: /// It
> nit: unnecessary inline
Done



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 16
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 10 Nov 2017 06:22:29 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-11-09 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#16).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 142 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/16
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 16
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-11-03 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 15:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   void ClearIndices() {
> We generally use the more concise one-line formatting for very short single
To add to what Tim said, this code didn't actually change, so it is nice to 
avoid touching code if you can. It makes it easier to track down what code 
change last impacted a piece of code.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@152
PS15, Line 152:   /// Destructor, release the bytes used by Memtracker.
  :   ~DictEncoder() {
  : ReleaseBytes();
  : DCHECK(dict_bytes_cnt_ == 0);
  :   }
I think we can omit this destructor, since DictEncoderBase does the same thing.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@296
PS15, Line 296:   /// Destructor, release the bytes used by Memtracker and 
close.
  :   ~DictDecoder() {
  : ReleaseBytes();
  : DCHECK(dict_bytes_cnt_ == 0);
  :}
I think we can omit this destructor, as DictDecoderBase does this.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 15
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 03 Nov 2017 23:14:19 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-11-03 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 15:

(8 comments)

Pretty close, have some formatting nits (nothing horrible, just trying to keep 
the codebase consistent) and one perf concern.

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@54
PS15, Line 54: : dict_encoded_size_(0), pool_(NULL), dict_bytes_cnt_(0), 
dict_mem_tracker_(NULL) {}
Ok to leave for now but we've generally been moving towards using the recent 
C++ extension that allows initialising member variables to constants at their 
declartion.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@59
PS15, Line 59: DCHECK
DCHECK_EQ(dict_bytes_cnt_, 0);


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   void ClearIndices() {
We generally use the more concise one-line formatting for very short 
single-statement functions like this one.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@123
PS15, Line 123:
formatting nit: * goes on the left with the type name.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@168
PS15, Line 168: inline
nit: inline isn't necessary. It isn't harmful but can be confusing to have 
unnecessary modifiers.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@269
PS15, Line 269:  *
nit: * should be on left.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@330
PS15, Line 330: inline
nit: unnecessary inline


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@374
PS15, Line 374:   ConsumeBytes(sizeof(node));
I'm concerned that calling MemTracker::Consume for every node added to the 
table could hurt performance in some workloads since it will end up 
incrementing the memory consumption counter in the root process-wide MemTracker 
up to 40k times per dictionary.

(In contrast, MemPool::Allocate() is generally fine since it allocates memory 
in chunks).

How about we track the memory for some number of nodes at a time. E.g.

  const int NODE_MEM_TRACKING_GRANULARITY = 4096;
  ...
  if (nodes % NODE_MEM_TRACKING_GRANULARITY == 0) {
ConsumeBytes(sizeof(node) * NODE_MEM_TRACKING_GRANULARITY);
  }



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 15
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 03 Nov 2017 23:07:31 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-11-01 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#15).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 140 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/15
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 15
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-11-01 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   /// dict_encoded_size() bytes.
 :   virtual void Writ
> Close() should call ClearIndices() as one of its steps, ClearIndices() stil
My bad didn't realize it, now that's fixed.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 14
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:44:24 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-31 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   /// dict_encoded_size() bytes.
 :   virtual void Writ
> implemented the logic of Close() inside ClearIndices()
Close() should call ClearIndices() as one of its steps, ClearIndices() still 
needs to exist separately and cannot incorporate the logic of Close().



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 14
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 01:16:49 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-31 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#14).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 134 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/14
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 14
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-31 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#13).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 137 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/13
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 13
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-31 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 12:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h@352
PS12, Line 352:   MemTracker* dict_mem_tracker() { return 
scan_node_->mem_tracker(); }
> See comment in HdfsParquetTableWriter. Also, see how we pass the mem_tracke
Done


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h@81
PS12, Line 81:   /// Memory tracker to track the memory used by encoder.
 :   MemTracker* dict_mem_tracker();
> When I'm reading the code, I'm thinking that this is hiding more than it is
Removed the function made it in line where the memtracker is used so it's more 
explicit


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@179
PS12, Line 179:   dict_encoder_base_->ClearIndices();
  :   dict_encoder_base_->Close();
> When Close() does a call to ClearIndices(), this can be simplified to just
implemented the Close() logic inside ClearIndices()


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@322
PS12, Line 322:plain_encoded_value_size_,
  :parent_->dict_mem_tracker()));
> Nit: Indentation in this case should be even with the "D" in new DictEncode
aligned the indentation with D in next line.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   void Close() {
 : ReleaseBytes();
> I think Close() should do ClearIndices().
implemented the logic of Close() inside ClearIndices()



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 01:07:09 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-27 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 12:

(5 comments)

I think this is very close.

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h@352
PS12, Line 352:   MemTracker* dict_mem_tracker() { return 
scan_node_->mem_tracker(); }
See comment in HdfsParquetTableWriter. Also, see how we pass the mem_tracker to 
data_page_pool_.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h@81
PS12, Line 81:   /// Memory tracker to track the memory used by encoder.
 :   MemTracker* dict_mem_tracker();
When I'm reading the code, I'm thinking that this is hiding more than it is 
illuminating. We use this in exactly one place. I would rather have the exact 
code right there with a good comment explaining which mem_tracker we are using.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@179
PS12, Line 179:   dict_encoder_base_->ClearIndices();
  :   dict_encoder_base_->Close();
When Close() does a call to ClearIndices(), this can be simplified to just a 
Close() call.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@322
PS12, Line 322:plain_encoded_value_size_,
  :parent_->dict_mem_tracker()));
Nit: Indentation in this case should be even with the "D" in new DictEncoder. 
It goes against every editor's typical behavior, but it's the standard we have. 
Also, feel free to keep plain_encoded_value_size_ on the first line. (It's on 
the edge of our 90 char limit.)


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   void Close() {
 : ReleaseBytes();
I think Close() should do ClearIndices().



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Oct 2017 23:59:23 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-23 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#12).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 146 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/12
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-20 Thread Bikramjeet Vig (Code Review)

Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352
PS10, Line 352: return mem_tracker_;
> I tried that earlier but I was getting the below error for parquet table wr
I see, that probably happens because HdfsTableSink class only has a forward 
declaration in hdfs-table-write.h

If you would like to give this another try:

You can try including the hdfs-table-sink.h in the hdfs-parquet-table-writer.h 
and it should work.
OR
you can move the implementation of dict_mem_tracker() to 
hdfs-parquet-table-writers.cc and add the include there instead in order to 
minimize the includes in the header file.


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356
PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close();
> GetDictionaryDecoder() returns dict_decoder_base_ it should be the same cal
Sorry, I should have been more clearer about what I was suggesting. What I 
meant was, instead of changing the method GetDictionaryDecoder() in 
parquet-column-readers.h where you return dict_decoder_base_, we can instead 
get rid of dict_decoder_base_ member var altogether and use 
GetDictionaryDecoder() because the overridden method in ScalarColumnReader 
would return the actual dict_decoder_ object that we can use directly and in 
all other cases the default implementation will return a nullptr.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 11
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 21 Oct 2017 00:25:32 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-20 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#11).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 160 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/11
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 11
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-20 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352
PS10, Line 352: return mem_tracker_;
> if this is the same mem tracker in scan_node_ perhaps we can just do
I tried that earlier but I was getting the below error for parquet table 
writer, so in order to have similar change in scanner and writer went this route

Error:
-
hdfs-parquet-table-writer.h:83:57: error: invalid use of incomplete type ‘class 
impala::HdfsTableSink’


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc@337
PS10, Line 337: if (mem_tracker_ != nullptr) {
  : mem_tracker_ = nullptr;
  :   }
> nit: one line
Done


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356
PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close();
> instead of dict_decoder_base_ we can just use the virtual function GetDicti
GetDictionaryDecoder() returns dict_decoder_base_ it should be the same calling 
the function GetDictionaryDecoder() or accessing member variable  
dict_decoder_base_ directly.


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@206
PS10, Line 206:
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@324
PS10, Line 324:
> nit: extra line
Done



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 10
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 23:04:41 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-19 Thread Bikramjeet Vig (Code Review)

Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352
PS10, Line 352: return mem_tracker_;
if this is the same mem tracker in scan_node_ perhaps we can just do
scan_node_->mem_tracker();
over here and avoid adding a new member variable?
Same in the hdfs-table-writer.


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc@337
PS10, Line 337: if (mem_tracker_ != nullptr) {
  : mem_tracker_ = nullptr;
  :   }
nit: one line

if (mem_tracker_ != nullptr) mem_tracker_ = nullptr;


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356
PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close();
instead of dict_decoder_base_ we can just use the virtual function 
GetDictionaryDecoder() to get the dict if it exists


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@206
PS10, Line 206:
nit: extra line


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@324
PS10, Line 324:
nit: extra line



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 10
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 19 Oct 2017 21:41:48 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-17 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#10).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 166 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/10
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 10
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-17 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: f the column.
> We should be using Close() here instead of unregistering every tracker (thi
Removed the Close() since I'm using the MemTracker of
scan_node_->mem_tracker so won't be using 1000's of memtrackers now


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@133
PS8, Line 133: di
> nit: inline here and nearby isn't necessary for a couple of reason. First,
Removed inline


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@184
PS8, Line 184: Node(const T& v, const NodeIndex& n) : value(v), next(n) { }
> Nit: check isn't necessary - Release() does this for you.
Removed the check


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@194
PS8, Line 194: enum { INVALID_INDEX = 4 };
> The TODO should be to use TryConsume(). The asynchronous checks are not the
Changed the TODO to use TryConsume()


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: / This function
> I agree with Joe's comment. All codepaths should call Close(), so we can ju
Added a DCHECK found during the testing that some of the column readers do not 
call Close, so I've added ReleaseBytes() so that accounting of memory used for 
dictionary is release when DictDecoder is destroyed.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 9
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 18 Oct 2017 05:28:47 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-13 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: ReleaseBytes();
> Checking the dict_bytes_cnt_ to be zero seems to be tricky say even if it i
I agree with Joe's comment. All codepaths should call Close(), so we can just 
add a DCHECK here to enforce that they did so.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 13 Oct 2017 22:50:19 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-13 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 8:

(4 comments)

I took a look at it and just wanted to point out some cases where the new code 
is diverging from best practices in our codebase.

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: CloseAndUnregisterFromParent
> I found that MemTracker::Close() is not unregistering with the parent MemTr
We should be using Close() here instead of unregistering every tracker (this is 
documented in the comment for CloseAndUnregisterFromParent()).

That may be moot since I don't think we need a special MemTracker for this - 
it's fine if we just track this against the scan node tracker. There are 
pros/cons of tracking things at a finer granularity but generally we don't 
track memory at a very fine granularity. Looking at the JIRA description, I can 
see it being read as "track the memory against a new MemTracker" but that 
wasn't what I intended.

As-is the behaviour added by this patch is particularly weird because we don't 
have per-scanner MemTrackers, so you'll end up with tens of these "dict 
decoder" trackers hanging off the scan node.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@133
PS8, Line 133: inline
nit: inline here and nearby isn't necessary for a couple of reason. First, the 
compiler can generally figure out that trivial functions can be inlined with or 
without a hint. Second, functions defined in C++ class bodies have an implicit 
inline hint implied anyway: 
https://isocpp.org/wiki/faq/inline-functions#inline-member-fns-more


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@184
PS8, Line 184: if (dict_bytes_cnt_ != 0) {
Nit: check isn't necessary - Release() does this for you.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@194
PS8, Line 194: // TODO: AnyLimitExceeded() can be called to check if memory 
limit has been exceeded.
The TODO should be to use TryConsume(). The asynchronous checks are not the 
direction we want to take things in. Michael Ho had a nice comment here 
explaining the direction we're trying to take things: 
https://issues.apache.org/jira/browse/IMPALA-2399



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 13 Oct 2017 22:49:17 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-13 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#9).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 170 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/9
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 9
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-13 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@274
PS8, Line 274:   // Before closing the Column readers, the accounted bytes in 
mem_tracker for
 :   // dict_decoder_ is released.
 :   if (mem_tracker_ != nullptr) {
 : for (ParquetColumnReader* col_reader : column_readers_) 
col_reader->Cleanup();
 :   }
> I think this code won't be necessary if col_reader->Close() does the dictio
Removed this code and wrapped it up with the col_reader->Close()


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: CloseAndUnregisterFromParent
> Same question here as elsewhere: Do we really need CloseAndUnregisterFromPa
I found that MemTracker::Close() is not unregistering with the parent 
MemTracker , so I found CloseAndUnregisterFromParent to be more appropriate.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@178
PS8, Line 178: dict_encoder_base_->ClearIndices();
> This should call the new base Close() function (which should do ClearIndice
Done


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@1057
PS8, Line 1057: CloseAndUnregisterFromParent();
> I think we want to limit use of CloseAndUnregisterFromParent() to cases tha
I checked the MemTracker::Close(), it does not unregister the mem_tracker_ from 
it's parent,

 Since this mem_tracker_ should not be used beyond this point I have reset it 
to NULL


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h@234
PS8, Line 234:   /// Delete the bytes used for memory tracking.
 :   virtual void Cleanup() {}
> I think this can be folded into Close() and removed.
Done


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc@269
PS8, Line 269:   virtual void Cleanup() {
 : dict_decoder_.Close();
 :   }
> I think this can be folded into Close() without a separate Cleanup() functi
Done


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@107
PS8, Line 107:   DictEncoder(MemPool* pool, int encoded_value_size, MemTracker* 
mem_tracker) :
 :   DictEncoderBase(pool), buckets_(HASH_TABLE_SIZE, 
Node::INVALID_INDEX),
 :   encoded_value_size_(encoded_value_size),
 :   dict_mem_tracker_(mem_tracker),
 :   dict_bytes_cnt_(0) { }
> As I understand it, the DictEncoderBase/DictEncoder split is that DictEncod
Moved the functionality to the base class as you suggested


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: ReleaseBytes();
> Is there any codepath that should legitimately use this? If not, DCHECK tha
Checking the dict_bytes_cnt_ to be zero seems to be tricky say even if it is 
not used in a function it can have a non zero value.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Fri, 13 Oct 2017 20:45:53 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-12 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@274
PS8, Line 274:   // Before closing the Column readers, the accounted bytes in 
mem_tracker for
 :   // dict_decoder_ is released.
 :   if (mem_tracker_ != nullptr) {
 : for (ParquetColumnReader* col_reader : column_readers_) 
col_reader->Cleanup();
 :   }
I think this code won't be necessary if col_reader->Close() does the dictionary 
close.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: CloseAndUnregisterFromParent
Same question here as elsewhere: Do we really need 
CloseAndUnregisterFromParent() or will Close() do?


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@178
PS8, Line 178: dict_encoder_base_->ClearIndices();
This should call the new base Close() function (which should do ClearIndices()).


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@1057
PS8, Line 1057: CloseAndUnregisterFromParent();
I think we want to limit use of CloseAndUnregisterFromParent() to cases that 
really need it. I think the ordinary Close() should work here. Check if that 
works.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h@234
PS8, Line 234:   /// Delete the bytes used for memory tracking.
 :   virtual void Cleanup() {}
I think this can be folded into Close() and removed.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc@269
PS8, Line 269:   virtual void Cleanup() {
 : dict_decoder_.Close();
 :   }
I think this can be folded into Close() without a separate Cleanup() function.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@107
PS8, Line 107:   DictEncoder(MemPool* pool, int encoded_value_size, MemTracker* 
mem_tracker) :
 :   DictEncoderBase(pool), buckets_(HASH_TABLE_SIZE, 
Node::INVALID_INDEX),
 :   encoded_value_size_(encoded_value_size),
 :   dict_mem_tracker_(mem_tracker),
 :   dict_bytes_cnt_(0) { }
As I understand it, the DictEncoderBase/DictEncoder split is that 
DictEncoderBase contains everything that does not rely on T and DictEncoder 
contains the type-specific stuff. The same split applies for 
DictDecoderBase/DictDecoder. Since the memory tracking is not type specific, I 
think it makes sense to put it in the base classes. This should simplify the 
code in HdfsParquetWriter, because then you can call Close() on the 
DictEncoderBase rather than having to go to the typed version. Close() on 
DictEncoderBase should also do ClearIndices().


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: ReleaseBytes();
Is there any codepath that should legitimately use this? If not, DCHECK that 
dict_bytes_cnt_ is zero.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Thu, 12 Oct 2017 17:23:41 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-12 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#8).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 157 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/8
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-12 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#7).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 158 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/7
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 7
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-04 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h@485
PS6, Line 485:   /// Tracks all the memory allocation used by dictionary in 
DictDecoder.
 :   struct DictMemTrack {
 : std::unique_ptr mem_tracker;
 :
 : DictMemTrack(MemTracker *parent_memtrack):mem_tracker(new 
MemTracker(-1,
 : "Decoder parquet dict", parent_memtrack, false)) { };
 :
 : MemTracker*  GetMemTracker() {
 :   return  mem_tracker.get();
 : }
 :
 : ~DictMemTrack() {
 :   mem_tracker->CloseAndUnregisterFromParent();
 : }
 :   };
> I dont think we need to add a new struct for this, you can just call CloseA
I had added a new structure to have the creation /destruction of memtracker in 
one place

and also the CloseAndUnregisterFromParent() has to be called after
releasing all bytes owned by MemTracker in the DictDecoder class (which happens 
in it's destructor), since the destructor of member variables are called later 
than the owner making a separate struct/class for DictMemTrack  solved the 
problem.


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@113
PS6, Line 113:  // Bytes used for memory tracking by dictionary in DictEncoder 
is decremented.
 :   virtual void Cleanup() {
 : if (dict_encoder_base_ != nullptr) {
 :   dict_encoder_base_->Cleanup();
 : }
 :   }
> you can move this to the BaseColumnWriter::Close() method
This can't be done because I'm not tracking the bytes at the BaseColumnWriter


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@767
PS6, Line 767:   // Below code will decrement the bytes used for memory 
tracking by dictionary in
 :   // DictEncoder class for each ColumnWriter.
 :   for (int i = 0; i < columns_.size(); i++) {
 :  if (columns_[i] != nullptr) {
 :columns_[i]->Cleanup();
 :  }
 :   }
> similarly you can move this to the HdfsParquetTableWriter::Close() method
I'll try moving the functionality in Close() for HdfsParquetTableWriter/Reader


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@138
PS6, Line 138: SizeofDict
> I think we can use a more precise name here, SizeOfDict might be confused w
I've changed it to be DictByteSize


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@438
PS6, Line 438: bytes_alloc += sizeof(value);
> can you switch to what joe suggested earlier, that is, instead of bytes_all
This looks fine what is the issue with this?



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 6
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Thu, 05 Oct 2017 01:58:59 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-04 Thread Bikramjeet Vig (Code Review)

Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h@485
PS6, Line 485:   /// Tracks all the memory allocation used by dictionary in 
DictDecoder.
 :   struct DictMemTrack {
 : std::unique_ptr mem_tracker;
 :
 : DictMemTrack(MemTracker *parent_memtrack):mem_tracker(new 
MemTracker(-1,
 : "Decoder parquet dict", parent_memtrack, false)) { };
 :
 : MemTracker*  GetMemTracker() {
 :   return  mem_tracker.get();
 : }
 :
 : ~DictMemTrack() {
 :   mem_tracker->CloseAndUnregisterFromParent();
 : }
 :   };
I dont think we need to add a new struct for this, you can just call 
CloseAndUnregisterFromParent() on the dict_mem_tracker in 
HdfsParquetScanner::Close

Similarly for the writer


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@113
PS6, Line 113:  // Bytes used for memory tracking by dictionary in DictEncoder 
is decremented.
 :   virtual void Cleanup() {
 : if (dict_encoder_base_ != nullptr) {
 :   dict_encoder_base_->Cleanup();
 : }
 :   }
you can move this to the BaseColumnWriter::Close() method


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@767
PS6, Line 767:   // Below code will decrement the bytes used for memory 
tracking by dictionary in
 :   // DictEncoder class for each ColumnWriter.
 :   for (int i = 0; i < columns_.size(); i++) {
 :  if (columns_[i] != nullptr) {
 :columns_[i]->Cleanup();
 :  }
 :   }
similarly you can move this to the HdfsParquetTableWriter::Close() method


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@138
PS6, Line 138: SizeofDict
I think we can use a more precise name here, SizeOfDict might be confused with 
returning the number of elements in the Dictionary.
Something on the lines of DictByteSize?


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@438
PS6, Line 438: bytes_alloc += sizeof(value);
can you switch to what joe suggested earlier, that is, instead of bytes_alloc, 
using something like the function SizeofDict() that you wrote to update 
ConsumeBytes at the end.
Refer Joe's first comment.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 6
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Wed, 04 Oct 2017 22:29:59 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-03 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#6).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 170 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/6
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 6
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-03 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#5).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 171 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/5
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 5
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-03 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@113
PS3, Line 113:
> Remove trailing spaces. This also applies to lines that don't have any code
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@164
PS3, Line 164: mtrackp
> For class fields, the convention is to end the field name with an underscor
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> using num_values * size of type might not work when dealing with variable s
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> The parquet DictionaryPageHeader contains a num_values field. Look where we
Ok I'll allocate a local variable to track the bytes consumed.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> For strings, the StringValue is a pointer and a length. In the case of the
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc@39
PS3, Line 39: tracker
> can you add test cases that verify that the encoder/decoder is keeping trac
I've added the checks now



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Tue, 03 Oct 2017 21:45:51 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-03 Thread Pranay Singh (Code Review)

Hello Joe McDonnell, Bikramjeet Vig,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#4).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 171 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/4
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 4
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-03 Thread Bikramjeet Vig (Code Review)

Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> For strings, the StringValue is a pointer and a length. In the case of the
yes I agree



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Tue, 03 Oct 2017 18:59:59 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-10-02 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> using num_values * size of type might not work when dealing with variable s
For strings, the StringValue is a pointer and a length. In the case of the 
dictionary, the pointer is pointing into the dictionary (allocated from 
dictionary_pool_ in InitDictionary()). Since the memory for the dictionary page 
is already tracked, we only need to track these StringValue's, which are 
fixed-size. Either way, we should aggregate the Consume() calls here.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Mon, 02 Oct 2017 21:48:41 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-28 Thread Bikramjeet Vig (Code Review)

Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> The parquet DictionaryPageHeader contains a num_values field. Look where we
using num_values * size of type might not work when dealing with variable sized 
type like string.

I would recommend keeping count of the bytes used in a local variable and then 
do a ConsumeBytes when we exit the loop. This is because mtrackp loops on all 
its parent trackers to update mem usage every time Consume(num_bytes) is 
called. Not a huge optimization but I think it might be worth avoiding another 
loop inside the hot path.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc@39
PS3, Line 39: tracker
can you add test cases that verify that the encoder/decoder is keeping track 
correctly.
You can do this by using tracker.consumption() to get the num of bytes consumed 
and compare it to the expected size you calculate separately.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Fri, 29 Sep 2017 01:16:40 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-28 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/data-sink.h
File be/src/exec/data-sink.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/data-sink.h@126
PS3, Line 126:   /// This memtracker tracks all the allocations done by parquet 
dictonary encoder.
 :   boost::scoped_ptr dict_mem_tracker_;
Can we move this down to HdfsParquetTableWriter?


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/exec-node.h@325
PS3, Line 325:   /// Tracks all the memory allocation by parquet dictonary 
decoders
 :   boost::scoped_ptr dict_mem_tracker_;
Can this move down to HdfsParquetScanner? Other ExecNodes don't need this.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@113
PS3, Line 113:
Remove trailing spaces. This also applies to lines that don't have any code 
(there are a couple others in this file). These are easily visible in the 
Gerrit UI.

If using emacs, the following will highlight trailing whitespace:
(setq-default show-trailing-whitespace t)


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@164
PS3, Line 164: mtrackp
For class fields, the convention is to end the field name with an underscore. 
Also, look at other code that uses MemTrackers and see what variable name that 
code uses. It is good to harmonize these things across the codebase.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
The parquet DictionaryPageHeader contains a num_values field. Look where we 
call CreateDictionaryDecoder and you'll see that we reference to it on the next 
line. I think we should consider doing a single ConsumeBytes using num_values * 
size of type.



--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Thu, 28 Sep 2017 16:55:24 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-28 Thread Pranay Singh (Code Review)

Hello Joe McDonnell,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#3).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class ExecNode level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class DataSink to track the memory
used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
---
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/data-sink.cc
M be/src/exec/data-sink.h
M be/src/exec/exec-node.cc
M be/src/exec/exec-node.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 93 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/3
--
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-15 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS2, Line 213: dict_decoder_(parent_->dictionary_pool_.get()),
We should think about having a separate MemPool for parts of the dictionary 
that will not be referenced by the RowBatch and can be freed at the end of the 
row group.

Since some datatypes are copied by value rather than pointing into the 
dictionary, this might be used for those as well.


http://gerrit.cloudera.org:8080/#/c/8034/2/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

PS2, Line 229: memcpy(val_ptr, dict_[index], sizeof(T));
The memcpy is doing a dereference here and is not different from *val_ptr = 
*dict_[index];


PS2, Line 238:   std::vector dict_;
Using a T* means that GetValue() must do a dereference to obtain the value. 
This is a very performance sensitive codepath, so we need to avoid this extra 
dereference. This should maintain the value directly in the vector.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-12 Thread Pranay Singh (Code Review)

Pranay Singh has uploaded a new patch set (#2).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses existing dictionary_pool_ to track the memory used
by std::vector in DictDecoder class, whereas an existing memory pool
for DictEncoder class is used when allocating memory for std::vector
in DictEncoder class.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
3 files changed, 42 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-12 Thread Pranay Singh (Code Review)

Pranay Singh has posted comments on this change.

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

PS1, Line 215:  MemTracker* decoder_mem_tracker() { return 
decoder_mem_tracker_.get(); }
> The dictionary is only used for Parquet, so I think the MemPool should not 
I also think using existing memory pool, dictionary_pool_  is a better option, 
I missed it. I'll use dictionary_pool_ instead


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS1, Line 213: dict_decoder_(new 
MemPool(parent->scan_node_->decoder_mem_tracker())),
> It is important not to create objects on a per-column basis unless truly ne
I'll use dictionary_pool_ instead of creating a new memory pool for every 
column reads


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

PS1, Line 234: *val_ptr = *dict_[index];
> dict_ needs to return T's directly rather than T*'s. This is a performance 
I'll use memcpy to copy the contents of dictionary to the buffer passed.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Pranay Singh
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-12 Thread Pranay Singh (Code Review)

Pranay Singh has uploaded a new patch set (#2).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses existing dictionary_pool_ to track the memory used
by std::vector in DictDecoder class, whereas an existing memory pool
for DictEncoder class is used when allocating memory for std::vector
in DictEncoder class.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
3 files changed, 42 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-11 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

PS1, Line 215:  MemTracker* decoder_mem_tracker() { return 
decoder_mem_tracker_.get(); }
The dictionary is only used for Parquet, so I think the MemPool should not be 
in the ExecNode, as that is used for a large number of other things. Look into 
moving this down to HdfsParquetScanner (or reuse dictionary_pool_).


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS1, Line 213: dict_decoder_(new 
MemPool(parent->scan_node_->decoder_mem_tracker())),
It is important not to create objects on a per-column basis unless truly 
necessary. I think it makes more sense to have a single MemPool up at the 
HdfsParquetScanner level. Look at how dictionary_pool_ works. I think it might 
be better to reuse that pool.


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

PS1, Line 234: *val_ptr = *dict_[index];
dict_ needs to return T's directly rather than T*'s. This is a performance 
critical path, and an extra dereference is too expensive.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

2017-09-11 Thread Pranay Singh (Code Review)

Pranay Singh has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8034

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
..

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces a new MemTracker to track the memory used
by std::vector in DictDecoder class, whereas an existing memory pool
for DictEncoder class is used when allocating memory for std::vector
in DictEncoder class.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/exec-node.cc
M be/src/exec/exec-node.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 52 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh

42 matches

Mail list logo