[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 16: (10 comments) http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@54 PS15, Line 54: DCHECK(buffered_indices_.empty()); > Ok to leave for now but we've generally been moving towards using the recen Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@59 PS15, Line 59: / This > DCHECK_EQ(dict_bytes_cnt_, 0); Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77 PS15, Line 77: /// indices. Used to size the buffer passed to WriteData(). > To add to what Tim said, this code didn't actually change, so it is nice to Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77 PS15, Line 77: /// indices. Used to size the buffer passed to WriteData(). > We generally use the more concise one-line formatting for very short single Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@123 PS15, Line 123: > formatting nit: * goes on the left with the type name. Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@152 PS15, Line 152: } : num_call_track_++; : } : } : }; > I think we can omit this destructor, since DictEncoderBase does the same th Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@168 PS15, Line 168: /// th > nit: inline isn't necessary. It isn't harmful but can be confusing to have Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@269 PS15, Line 269: > nit: * should be on left. Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@296 PS15, Line 296: : template : class DictDecoder : public DictDecoderBase { : public: : // > I think we can omit this destructor, as DictDecoderBase does this. Done http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@330 PS15, Line 330: /// It > nit: unnecessary inline Done -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 16 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 10 Nov 2017 06:22:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#16). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 5 files changed, 142 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/16 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 16 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 15: (3 comments) http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77 PS15, Line 77: void ClearIndices() { > We generally use the more concise one-line formatting for very short single To add to what Tim said, this code didn't actually change, so it is nice to avoid touching code if you can. It makes it easier to track down what code change last impacted a piece of code. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@152 PS15, Line 152: /// Destructor, release the bytes used by Memtracker. : ~DictEncoder() { : ReleaseBytes(); : DCHECK(dict_bytes_cnt_ == 0); : } I think we can omit this destructor, since DictEncoderBase does the same thing. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@296 PS15, Line 296: /// Destructor, release the bytes used by Memtracker and close. : ~DictDecoder() { : ReleaseBytes(); : DCHECK(dict_bytes_cnt_ == 0); :} I think we can omit this destructor, as DictDecoderBase does this. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 15 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 03 Nov 2017 23:14:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 15: (8 comments) Pretty close, have some formatting nits (nothing horrible, just trying to keep the codebase consistent) and one perf concern. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@54 PS15, Line 54: : dict_encoded_size_(0), pool_(NULL), dict_bytes_cnt_(0), dict_mem_tracker_(NULL) {} Ok to leave for now but we've generally been moving towards using the recent C++ extension that allows initialising member variables to constants at their declartion. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@59 PS15, Line 59: DCHECK DCHECK_EQ(dict_bytes_cnt_, 0); http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77 PS15, Line 77: void ClearIndices() { We generally use the more concise one-line formatting for very short single-statement functions like this one. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@123 PS15, Line 123: formatting nit: * goes on the left with the type name. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@168 PS15, Line 168: inline nit: inline isn't necessary. It isn't harmful but can be confusing to have unnecessary modifiers. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@269 PS15, Line 269: * nit: * should be on left. http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@330 PS15, Line 330: inline nit: unnecessary inline http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@374 PS15, Line 374: ConsumeBytes(sizeof(node)); I'm concerned that calling MemTracker::Consume for every node added to the table could hurt performance in some workloads since it will end up incrementing the memory consumption counter in the root process-wide MemTracker up to 40k times per dictionary. (In contrast, MemPool::Allocate() is generally fine since it allocates memory in chunks). How about we track the memory for some number of nodes at a time. E.g. const int NODE_MEM_TRACKING_GRANULARITY = 4096; ... if (nodes % NODE_MEM_TRACKING_GRANULARITY == 0) { ConsumeBytes(sizeof(node) * NODE_MEM_TRACKING_GRANULARITY); } -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 15 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 03 Nov 2017 23:07:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#15). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 5 files changed, 140 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/15 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 15 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63 PS12, Line 63: /// dict_encoded_size() bytes. : virtual void Writ > Close() should call ClearIndices() as one of its steps, ClearIndices() stil My bad didn't realize it, now that's fixed. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 14 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 20:44:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63 PS12, Line 63: /// dict_encoded_size() bytes. : virtual void Writ > implemented the logic of Close() inside ClearIndices() Close() should call ClearIndices() as one of its steps, ClearIndices() still needs to exist separately and cannot incorporate the logic of Close(). -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 14 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 01:16:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#14). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 5 files changed, 134 insertions(+), 15 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/14 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 14 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#13). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 5 files changed, 137 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/13 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 13 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 12: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h@352 PS12, Line 352: MemTracker* dict_mem_tracker() { return scan_node_->mem_tracker(); } > See comment in HdfsParquetTableWriter. Also, see how we pass the mem_tracke Done http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h@81 PS12, Line 81: /// Memory tracker to track the memory used by encoder. : MemTracker* dict_mem_tracker(); > When I'm reading the code, I'm thinking that this is hiding more than it is Removed the function made it in line where the memtracker is used so it's more explicit http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@179 PS12, Line 179: dict_encoder_base_->ClearIndices(); : dict_encoder_base_->Close(); > When Close() does a call to ClearIndices(), this can be simplified to just implemented the Close() logic inside ClearIndices() http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@322 PS12, Line 322:plain_encoded_value_size_, :parent_->dict_mem_tracker())); > Nit: Indentation in this case should be even with the "D" in new DictEncode aligned the indentation with D in next line. http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63 PS12, Line 63: void Close() { : ReleaseBytes(); > I think Close() should do ClearIndices(). implemented the logic of Close() inside ClearIndices() -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 12 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 01:07:09 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 12: (5 comments) I think this is very close. http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h@352 PS12, Line 352: MemTracker* dict_mem_tracker() { return scan_node_->mem_tracker(); } See comment in HdfsParquetTableWriter. Also, see how we pass the mem_tracker to data_page_pool_. http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h@81 PS12, Line 81: /// Memory tracker to track the memory used by encoder. : MemTracker* dict_mem_tracker(); When I'm reading the code, I'm thinking that this is hiding more than it is illuminating. We use this in exactly one place. I would rather have the exact code right there with a good comment explaining which mem_tracker we are using. http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@179 PS12, Line 179: dict_encoder_base_->ClearIndices(); : dict_encoder_base_->Close(); When Close() does a call to ClearIndices(), this can be simplified to just a Close() call. http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@322 PS12, Line 322:plain_encoded_value_size_, :parent_->dict_mem_tracker())); Nit: Indentation in this case should be even with the "D" in new DictEncoder. It goes against every editor's typical behavior, but it's the standard we have. Also, feel free to keep plain_encoded_value_size_ on the first line. (It's on the edge of our 90 char limit.) http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63 PS12, Line 63: void Close() { : ReleaseBytes(); I think Close() should do ClearIndices(). -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 12 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Oct 2017 23:59:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#12). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 7 files changed, 146 insertions(+), 15 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/12 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 12 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352 PS10, Line 352: return mem_tracker_; > I tried that earlier but I was getting the below error for parquet table wr I see, that probably happens because HdfsTableSink class only has a forward declaration in hdfs-table-write.h If you would like to give this another try: You can try including the hdfs-table-sink.h in the hdfs-parquet-table-writer.h and it should work. OR you can move the implementation of dict_mem_tracker() to hdfs-parquet-table-writers.cc and add the include there instead in order to minimize the includes in the header file. http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356 PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close(); > GetDictionaryDecoder() returns dict_decoder_base_ it should be the same cal Sorry, I should have been more clearer about what I was suggesting. What I meant was, instead of changing the method GetDictionaryDecoder() in parquet-column-readers.h where you return dict_decoder_base_, we can instead get rid of dict_decoder_base_ member var altogether and use GetDictionaryDecoder() because the overridden method in ScalarColumnReader would return the actual dict_decoder_ object that we can use directly and in all other cases the default implementation will return a nullptr. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 11 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 21 Oct 2017 00:25:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#11). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 8 files changed, 160 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/11 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 11 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 10: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352 PS10, Line 352: return mem_tracker_; > if this is the same mem tracker in scan_node_ perhaps we can just do I tried that earlier but I was getting the below error for parquet table writer, so in order to have similar change in scanner and writer went this route Error: - hdfs-parquet-table-writer.h:83:57: error: invalid use of incomplete type ‘class impala::HdfsTableSink’ http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc@337 PS10, Line 337: if (mem_tracker_ != nullptr) { : mem_tracker_ = nullptr; : } > nit: one line Done http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356 PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close(); > instead of dict_decoder_base_ we can just use the virtual function GetDicti GetDictionaryDecoder() returns dict_decoder_base_ it should be the same calling the function GetDictionaryDecoder() or accessing member variable dict_decoder_base_ directly. http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@206 PS10, Line 206: > nit: extra line Done http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@324 PS10, Line 324: > nit: extra line Done -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 23:04:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 10: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352 PS10, Line 352: return mem_tracker_; if this is the same mem tracker in scan_node_ perhaps we can just do scan_node_->mem_tracker(); over here and avoid adding a new member variable? Same in the hdfs-table-writer. http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc@337 PS10, Line 337: if (mem_tracker_ != nullptr) { : mem_tracker_ = nullptr; : } nit: one line if (mem_tracker_ != nullptr) mem_tracker_ = nullptr; http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356 PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close(); instead of dict_decoder_base_ we can just use the virtual function GetDictionaryDecoder() to get the dict if it exists http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@206 PS10, Line 206: nit: extra line http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@324 PS10, Line 324: nit: extra line -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Oct 2017 21:41:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#10). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by dictionary in DictDecoder class. Similary it uses memory tracker of HdfsTableSink to track the memory used by dictionary in DictEncoder class. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 8 files changed, 166 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/10 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 9: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348 PS8, Line 348: f the column. > We should be using Close() here instead of unregistering every tracker (thi Removed the Close() since I'm using the MemTracker of scan_node_->mem_tracker so won't be using 1000's of memtrackers now http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@133 PS8, Line 133: di > nit: inline here and nearby isn't necessary for a couple of reason. First, Removed inline http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@184 PS8, Line 184: Node(const T& v, const NodeIndex& n) : value(v), next(n) { } > Nit: check isn't necessary - Release() does this for you. Removed the check http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@194 PS8, Line 194: enum { INVALID_INDEX = 4 }; > The TODO should be to use TryConsume(). The asynchronous checks are not the Changed the TODO to use TryConsume() http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251 PS8, Line 251: / This function > I agree with Joe's comment. All codepaths should call Close(), so we can ju Added a DCHECK found during the testing that some of the column readers do not call Close, so I've added ReleaseBytes() so that accounting of memory used for dictionary is release when DictDecoder is destroyed. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 9 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 18 Oct 2017 05:28:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251 PS8, Line 251: ReleaseBytes(); > Checking the dict_bytes_cnt_ to be zero seems to be tricky say even if it i I agree with Joe's comment. All codepaths should call Close(), so we can just add a DCHECK here to enforce that they did so. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 8 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 13 Oct 2017 22:50:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 8: (4 comments) I took a look at it and just wanted to point out some cases where the new code is diverging from best practices in our codebase. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348 PS8, Line 348: CloseAndUnregisterFromParent > I found that MemTracker::Close() is not unregistering with the parent MemTr We should be using Close() here instead of unregistering every tracker (this is documented in the comment for CloseAndUnregisterFromParent()). That may be moot since I don't think we need a special MemTracker for this - it's fine if we just track this against the scan node tracker. There are pros/cons of tracking things at a finer granularity but generally we don't track memory at a very fine granularity. Looking at the JIRA description, I can see it being read as "track the memory against a new MemTracker" but that wasn't what I intended. As-is the behaviour added by this patch is particularly weird because we don't have per-scanner MemTrackers, so you'll end up with tens of these "dict decoder" trackers hanging off the scan node. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@133 PS8, Line 133: inline nit: inline here and nearby isn't necessary for a couple of reason. First, the compiler can generally figure out that trivial functions can be inlined with or without a hint. Second, functions defined in C++ class bodies have an implicit inline hint implied anyway: https://isocpp.org/wiki/faq/inline-functions#inline-member-fns-more http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@184 PS8, Line 184: if (dict_bytes_cnt_ != 0) { Nit: check isn't necessary - Release() does this for you. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@194 PS8, Line 194: // TODO: AnyLimitExceeded() can be called to check if memory limit has been exceeded. The TODO should be to use TryConsume(). The asynchronous checks are not the direction we want to take things in. Michael Ho had a nice comment here explaining the direction we're trying to take things: https://issues.apache.org/jira/browse/IMPALA-2399 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 8 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 13 Oct 2017 22:49:17 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#9). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class HdfsParquetScanner level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class HdfsParquetTableWriter to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 8 files changed, 170 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/9 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 9 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 8: (8 comments) http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@274 PS8, Line 274: // Before closing the Column readers, the accounted bytes in mem_tracker for : // dict_decoder_ is released. : if (mem_tracker_ != nullptr) { : for (ParquetColumnReader* col_reader : column_readers_) col_reader->Cleanup(); : } > I think this code won't be necessary if col_reader->Close() does the dictio Removed this code and wrapped it up with the col_reader->Close() http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348 PS8, Line 348: CloseAndUnregisterFromParent > Same question here as elsewhere: Do we really need CloseAndUnregisterFromPa I found that MemTracker::Close() is not unregistering with the parent MemTracker , so I found CloseAndUnregisterFromParent to be more appropriate. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@178 PS8, Line 178: dict_encoder_base_->ClearIndices(); > This should call the new base Close() function (which should do ClearIndice Done http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@1057 PS8, Line 1057: CloseAndUnregisterFromParent(); > I think we want to limit use of CloseAndUnregisterFromParent() to cases tha I checked the MemTracker::Close(), it does not unregister the mem_tracker_ from it's parent, Since this mem_tracker_ should not be used beyond this point I have reset it to NULL http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h@234 PS8, Line 234: /// Delete the bytes used for memory tracking. : virtual void Cleanup() {} > I think this can be folded into Close() and removed. Done http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc@269 PS8, Line 269: virtual void Cleanup() { : dict_decoder_.Close(); : } > I think this can be folded into Close() without a separate Cleanup() functi Done http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@107 PS8, Line 107: DictEncoder(MemPool* pool, int encoded_value_size, MemTracker* mem_tracker) : : DictEncoderBase(pool), buckets_(HASH_TABLE_SIZE, Node::INVALID_INDEX), : encoded_value_size_(encoded_value_size), : dict_mem_tracker_(mem_tracker), : dict_bytes_cnt_(0) { } > As I understand it, the DictEncoderBase/DictEncoder split is that DictEncod Moved the functionality to the base class as you suggested http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251 PS8, Line 251: ReleaseBytes(); > Is there any codepath that should legitimately use this? If not, DCHECK tha Checking the dict_bytes_cnt_ to be zero seems to be tricky say even if it is not used in a function it can have a non zero value. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 8 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Fri, 13 Oct 2017 20:45:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 8: (8 comments) http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@274 PS8, Line 274: // Before closing the Column readers, the accounted bytes in mem_tracker for : // dict_decoder_ is released. : if (mem_tracker_ != nullptr) { : for (ParquetColumnReader* col_reader : column_readers_) col_reader->Cleanup(); : } I think this code won't be necessary if col_reader->Close() does the dictionary close. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348 PS8, Line 348: CloseAndUnregisterFromParent Same question here as elsewhere: Do we really need CloseAndUnregisterFromParent() or will Close() do? http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@178 PS8, Line 178: dict_encoder_base_->ClearIndices(); This should call the new base Close() function (which should do ClearIndices()). http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@1057 PS8, Line 1057: CloseAndUnregisterFromParent(); I think we want to limit use of CloseAndUnregisterFromParent() to cases that really need it. I think the ordinary Close() should work here. Check if that works. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h@234 PS8, Line 234: /// Delete the bytes used for memory tracking. : virtual void Cleanup() {} I think this can be folded into Close() and removed. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc@269 PS8, Line 269: virtual void Cleanup() { : dict_decoder_.Close(); : } I think this can be folded into Close() without a separate Cleanup() function. http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@107 PS8, Line 107: DictEncoder(MemPool* pool, int encoded_value_size, MemTracker* mem_tracker) : : DictEncoderBase(pool), buckets_(HASH_TABLE_SIZE, Node::INVALID_INDEX), : encoded_value_size_(encoded_value_size), : dict_mem_tracker_(mem_tracker), : dict_bytes_cnt_(0) { } As I understand it, the DictEncoderBase/DictEncoder split is that DictEncoderBase contains everything that does not rely on T and DictEncoder contains the type-specific stuff. The same split applies for DictDecoderBase/DictDecoder. Since the memory tracking is not type specific, I think it makes sense to put it in the base classes. This should simplify the code in HdfsParquetWriter, because then you can call Close() on the DictEncoderBase rather than having to go to the typed version. Close() on DictEncoderBase should also do ClearIndices(). http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251 PS8, Line 251: ReleaseBytes(); Is there any codepath that should legitimately use this? If not, DCHECK that dict_bytes_cnt_ is zero. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 8 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Thu, 12 Oct 2017 17:23:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#8). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class HdfsParquetScanner level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class HdfsParquetTableWriter to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 8 files changed, 157 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/8 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 8 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#7). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class HdfsParquetScanner level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class HdfsParquetTableWriter to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 8 files changed, 158 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/7 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 7 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 6: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h@485 PS6, Line 485: /// Tracks all the memory allocation used by dictionary in DictDecoder. : struct DictMemTrack { : std::unique_ptr mem_tracker; : : DictMemTrack(MemTracker *parent_memtrack):mem_tracker(new MemTracker(-1, : "Decoder parquet dict", parent_memtrack, false)) { }; : : MemTracker* GetMemTracker() { : return mem_tracker.get(); : } : : ~DictMemTrack() { : mem_tracker->CloseAndUnregisterFromParent(); : } : }; > I dont think we need to add a new struct for this, you can just call CloseA I had added a new structure to have the creation /destruction of memtracker in one place and also the CloseAndUnregisterFromParent() has to be called after releasing all bytes owned by MemTracker in the DictDecoder class (which happens in it's destructor), since the destructor of member variables are called later than the owner making a separate struct/class for DictMemTrack solved the problem. http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@113 PS6, Line 113: // Bytes used for memory tracking by dictionary in DictEncoder is decremented. : virtual void Cleanup() { : if (dict_encoder_base_ != nullptr) { : dict_encoder_base_->Cleanup(); : } : } > you can move this to the BaseColumnWriter::Close() method This can't be done because I'm not tracking the bytes at the BaseColumnWriter http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@767 PS6, Line 767: // Below code will decrement the bytes used for memory tracking by dictionary in : // DictEncoder class for each ColumnWriter. : for (int i = 0; i < columns_.size(); i++) { : if (columns_[i] != nullptr) { :columns_[i]->Cleanup(); : } : } > similarly you can move this to the HdfsParquetTableWriter::Close() method I'll try moving the functionality in Close() for HdfsParquetTableWriter/Reader http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@138 PS6, Line 138: SizeofDict > I think we can use a more precise name here, SizeOfDict might be confused w I've changed it to be DictByteSize http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@438 PS6, Line 438: bytes_alloc += sizeof(value); > can you switch to what joe suggested earlier, that is, instead of bytes_all This looks fine what is the issue with this? -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 6 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Thu, 05 Oct 2017 01:58:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 6: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h@485 PS6, Line 485: /// Tracks all the memory allocation used by dictionary in DictDecoder. : struct DictMemTrack { : std::unique_ptr mem_tracker; : : DictMemTrack(MemTracker *parent_memtrack):mem_tracker(new MemTracker(-1, : "Decoder parquet dict", parent_memtrack, false)) { }; : : MemTracker* GetMemTracker() { : return mem_tracker.get(); : } : : ~DictMemTrack() { : mem_tracker->CloseAndUnregisterFromParent(); : } : }; I dont think we need to add a new struct for this, you can just call CloseAndUnregisterFromParent() on the dict_mem_tracker in HdfsParquetScanner::Close Similarly for the writer http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@113 PS6, Line 113: // Bytes used for memory tracking by dictionary in DictEncoder is decremented. : virtual void Cleanup() { : if (dict_encoder_base_ != nullptr) { : dict_encoder_base_->Cleanup(); : } : } you can move this to the BaseColumnWriter::Close() method http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@767 PS6, Line 767: // Below code will decrement the bytes used for memory tracking by dictionary in : // DictEncoder class for each ColumnWriter. : for (int i = 0; i < columns_.size(); i++) { : if (columns_[i] != nullptr) { :columns_[i]->Cleanup(); : } : } similarly you can move this to the HdfsParquetTableWriter::Close() method http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@138 PS6, Line 138: SizeofDict I think we can use a more precise name here, SizeOfDict might be confused with returning the number of elements in the Dictionary. Something on the lines of DictByteSize? http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@438 PS6, Line 438: bytes_alloc += sizeof(value); can you switch to what joe suggested earlier, that is, instead of bytes_alloc, using something like the function SizeofDict() that you wrote to update ConsumeBytes at the end. Refer Joe's first comment. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 6 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Wed, 04 Oct 2017 22:29:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#6). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class HdfsParquetScanner level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class HdfsParquetTableWriter to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 7 files changed, 170 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/6 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 6 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#5). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class HdfsParquetScanner level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class HdfsParquetTableWriter to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 7 files changed, 171 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/5 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 5 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@113 PS3, Line 113: > Remove trailing spaces. This also applies to lines that don't have any code Done http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@164 PS3, Line 164: mtrackp > For class fields, the convention is to end the field name with an underscor Done http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); > using num_values * size of type might not work when dealing with variable s Done http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); > The parquet DictionaryPageHeader contains a num_values field. Look where we Ok I'll allocate a local variable to track the bytes consumed. http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); > For strings, the StringValue is a pointer and a length. In the case of the Done http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc File be/src/util/dict-test.cc: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc@39 PS3, Line 39: tracker > can you add test cases that verify that the encoder/decoder is keeping trac I've added the checks now -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Tue, 03 Oct 2017 21:45:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#4). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class HdfsParquetScanner level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class HdfsParquetTableWriter to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 7 files changed, 171 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/4 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 4 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); > For strings, the StringValue is a pointer and a length. In the case of the yes I agree -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Tue, 03 Oct 2017 18:59:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); > using num_values * size of type might not work when dealing with variable s For strings, the StringValue is a pointer and a length. In the case of the dictionary, the pointer is pointing into the dictionary (allocated from dictionary_pool_ in InitDictionary()). Since the memory for the dictionary page is already tracked, we only need to track these StringValue's, which are fixed-size. Either way, we should aggregate the Consume() calls here. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Mon, 02 Oct 2017 21:48:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); > The parquet DictionaryPageHeader contains a num_values field. Look where we using num_values * size of type might not work when dealing with variable sized type like string. I would recommend keeping count of the bytes used in a local variable and then do a ConsumeBytes when we exit the loop. This is because mtrackp loops on all its parent trackers to update mem usage every time Consume(num_bytes) is called. Not a huge optimization but I think it might be worth avoiding another loop inside the hot path. http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc File be/src/util/dict-test.cc: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc@39 PS3, Line 39: tracker can you add test cases that verify that the encoder/decoder is keeping track correctly. You can do this by using tracker.consumption() to get the num of bytes consumed and compare it to the expected size you calculate separately. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Fri, 29 Sep 2017 01:16:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 ) Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 3: (5 comments) http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/data-sink.h File be/src/exec/data-sink.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/data-sink.h@126 PS3, Line 126: /// This memtracker tracks all the allocations done by parquet dictonary encoder. : boost::scoped_ptr dict_mem_tracker_; Can we move this down to HdfsParquetTableWriter? http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/exec-node.h File be/src/exec/exec-node.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/exec-node.h@325 PS3, Line 325: /// Tracks all the memory allocation by parquet dictonary decoders : boost::scoped_ptr dict_mem_tracker_; Can this move down to HdfsParquetScanner? Other ExecNodes don't need this. http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@113 PS3, Line 113: Remove trailing spaces. This also applies to lines that don't have any code (there are a couple others in this file). These are easily visible in the Gerrit UI. If using emacs, the following will highlight trailing whitespace: (setq-default show-trailing-whitespace t) http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@164 PS3, Line 164: mtrackp For class fields, the convention is to end the field name with an underscore. Also, look at other code that uses MemTrackers and see what variable name that code uses. It is good to harmonize these things across the codebase. http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412 PS3, Line 412: ConsumeBytes(sizeof(value)); The parquet DictionaryPageHeader contains a num_values field. Look where we call CreateDictionaryDecoder and you'll see that we reference to it on the next line. I think we should consider doing a single ConsumeBytes using num_values * size of type. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-Comment-Date: Thu, 28 Sep 2017 16:55:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Hello Joe McDonnell, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8034 to look at the new patch set (#3). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces memory tracker to track the memory used by dictionary used in DictDecoder class and DictEncoder class when a parquet file is read from or written to. Memory tracker is introduced at the class ExecNode level to track all the memory used by dictonary in DictDecoders. Similarly memory tracker is introduced in class DataSink to track the memory used by dictionary in DictEncoders. Memory for the dictionary, stored as std::vector is still allocated from std:allocator but the amount allocated is accounted by introducing a counter which is incremented and decremented as the memory is consumed and released by vector. Testing --- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/data-sink.cc M be/src/exec/data-sink.h M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 8 files changed, 93 insertions(+), 10 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/3 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-Change-Number: 8034 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/8034/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS2, Line 213: dict_decoder_(parent_->dictionary_pool_.get()), We should think about having a separate MemPool for parts of the dictionary that will not be referenced by the RowBatch and can be freed at the end of the row group. Since some datatypes are copied by value rather than pointing into the dictionary, this might be used for those as well. http://gerrit.cloudera.org:8080/#/c/8034/2/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: PS2, Line 229: memcpy(val_ptr, dict_[index], sizeof(T)); The memcpy is doing a dereference here and is not different from *val_ptr = *dict_[index]; PS2, Line 238: std::vector dict_; Using a T* means that GetValue() must do a dereference to obtain the value. This is a very performance sensitive codepath, so we need to avoid this extra dereference. This should maintain the value directly in the vector. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has uploaded a new patch set (#2). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses existing dictionary_pool_ to track the memory used by std::vector in DictDecoder class, whereas an existing memory pool for DictEncoder class is used when allocating memory for std::vector in DictEncoder class. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 3 files changed, 42 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/2 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has posted comments on this change. Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/exec-node.h File be/src/exec/exec-node.h: PS1, Line 215: MemTracker* decoder_mem_tracker() { return decoder_mem_tracker_.get(); } > The dictionary is only used for Parquet, so I think the MemPool should not I also think using existing memory pool, dictionary_pool_ is a better option, I missed it. I'll use dictionary_pool_ instead http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS1, Line 213: dict_decoder_(new MemPool(parent->scan_node_->decoder_mem_tracker())), > It is important not to create objects on a per-column basis unless truly ne I'll use dictionary_pool_ instead of creating a new memory pool for every column reads http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: PS1, Line 234: *val_ptr = *dict_[index]; > dict_ needs to return T's directly rather than T*'s. This is a performance I'll use memcpy to copy the contents of dictionary to the buffer passed. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Pranay Singh Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has uploaded a new patch set (#2). Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch uses existing dictionary_pool_ to track the memory used by std::vector in DictDecoder class, whereas an existing memory pool for DictEncoder class is used when allocating memory for std::vector in DictEncoder class. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 3 files changed, 42 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/2 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Joe McDonnell has posted comments on this change. Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/exec-node.h File be/src/exec/exec-node.h: PS1, Line 215: MemTracker* decoder_mem_tracker() { return decoder_mem_tracker_.get(); } The dictionary is only used for Parquet, so I think the MemPool should not be in the ExecNode, as that is used for a large number of other things. Look into moving this down to HdfsParquetScanner (or reuse dictionary_pool_). http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS1, Line 213: dict_decoder_(new MemPool(parent->scan_node_->decoder_mem_tracker())), It is important not to create objects on a per-column basis unless truly necessary. I think it makes more sense to have a single MemPool up at the HdfsParquetScanner level. Look at how dictionary_pool_ works. I think it might be better to reuse that pool. http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: PS1, Line 234: *val_ptr = *dict_[index]; dict_ needs to return T's directly rather than T*'s. This is a performance critical path, and an extra dereference is too expensive. -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Joe McDonnell Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
Pranay Singh has uploaded a new change for review. http://gerrit.cloudera.org:8080/8034 Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder .. IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder Currently DictDecoder class and DictEncoder class uses std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage by these tables when they becomes very large, since this memory is not accounted by Impala's memory mangement infrastructure. This patch introduces a new MemTracker to track the memory used by std::vector in DictDecoder class, whereas an existing memory pool for DictEncoder class is used when allocating memory for std::vector in DictEncoder class. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 --- M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/parquet-column-readers.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc 5 files changed, 52 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/1 -- To view, visit http://gerrit.cloudera.org:8080/8034 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pranay Singh