[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742863#comment-13742863 ]

Hudson commented on HIVE-4838:
------------------------------

ABORTED: Integrated in Hive-trunk-hadoop2 #365 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/365/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and readability (Brock Noland via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractMapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinUtil.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapJoinMetaData.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mapjoin
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mapjoin/MapJoinMemoryExhaustionException.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mapjoin/MapJoinMemoryExhaustionHandler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/AbstractMapJoinKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/AbstractMapJoinTableContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/AbstractRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/DCLLItem.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/HashMapWrapper.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MRU.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinDoubleKeys.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinObjectKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinObjectSerDeContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinObjectValue.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinSingleKey.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestHashMapWrapper.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/mapjoin
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/mapjoin/TestMapJoinMemoryExhaustionHandler.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinEqualityTableContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinKey.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinKeys.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinRowContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestMapJoinTableContainer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/Utilities.java

> Refactor MapJoin HashMap code to improve testability and readability
> --------------------------------------------------------------------
>
> Key: HIVE-4838
> URL: https://issues.apache.org/jira/browse/HIVE-4838
> Project: Hive
> Issue Type: Bug
> Reporter: Brock Noland
> Assignee: Brock Noland
> Fix For: 0.12.0
>
> Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch
>
>
> MapJoin is an essential component for high performance joins in Hive and the current code has done great service for many years. However, the code is showing its age and currently suffers from the following issues:
> * Uses static state via the MapJoinMetaData class to pass serialization metadata to the Key, Row classes.
> * The API of a logical "Table Container" is not defined and therefore it's unclear what APIs HashMapWrapper needs to publicize. Additionally HashMapWrapper has many unused public methods.
> * HashMapWrapper contains logic to serialize, test memory bounds, and implement the table container. Ideally thes
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742851#comment-13742851 ]

Hudson commented on HIVE-4838:
------------------------------

FAILURE: Integrated in Hive-trunk-h0.21 #2273 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2273/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and readability (Brock Noland via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742622#comment-13742622 ]

Hudson commented on HIVE-4838:
------------------------------

FAILURE: Integrated in Hive-trunk-hadoop1-ptest #130 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/130/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and readability (Brock Noland via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742550#comment-13742550 ]

Hudson commented on HIVE-4838:
------------------------------

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #61 (See [https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/61/])
HIVE-4838 : Refactor MapJoin HashMap code to improve testability and readability (Brock Noland via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514760)
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742343#comment-13742343 ]

Brock Noland commented on HIVE-4838:
------------------------------------

Thanks!! I have opened HIVE-5110 to look at the memory consumption stuff we discussed.

> Refactor MapJoin HashMap code to improve testability and readability
> --------------------------------------------------------------------
>
> Key: HIVE-4838
> URL: https://issues.apache.org/jira/browse/HIVE-4838
> Project: Hive
> Issue Type: Bug
> Reporter: Brock Noland
> Assignee: Brock Noland
> Fix For: 0.12.0
>
> Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch
>
>
> MapJoin is an essential component for high performance joins in Hive and the current code has done great service for many years. However, the code is showing its age and currently suffers from the following issues:
> * Uses static state via the MapJoinMetaData class to pass serialization metadata to the Key, Row classes.
> * The API of a logical "Table Container" is not defined and therefore it's unclear what APIs HashMapWrapper needs to publicize. Additionally HashMapWrapper has many unused public methods.
> * HashMapWrapper contains logic to serialize, test memory bounds, and implement the table container. Ideally these logical units could be separated.
> * HashTableSinkObjectCtx has unused fields and unused methods.
> * CommonJoinOperator and children use ArrayList on left hand side when only List is required.
> * There are unused classes MRU, DCLLItem and classes which duplicate functionality MapJoinSingleKey and MapJoinDoubleKeys.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
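To illustrate the first three issues in the description above, here is a minimal Java sketch of how serialization metadata could be handed over explicitly instead of being read from static state, with the table container reduced to a small interface and serialization kept in its own class. All names here (SketchSerDeContext, SketchTableContainer, SketchHashMapContainer, SketchContainerWriter, MapJoinSketch) are hypothetical and chosen for illustration only; they are not the classes or signatures from the HIVE-4838 patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; none of this is Hive's actual API.

/** Serialization metadata passed to whoever needs it, not read from a static field. */
final class SketchSerDeContext {
  private final String serdeName;

  SketchSerDeContext(String serdeName) {
    this.serdeName = serdeName;
  }

  String serdeName() {
    return serdeName;
  }
}

/** The narrow API that a logical "table container" might expose. */
interface SketchTableContainer {
  void put(List<Object> key, List<Object> row);

  List<List<Object>> get(List<Object> key);

  int size();
}

/** Plain in-memory container: no serialization and no memory-bound checks. */
final class SketchHashMapContainer implements SketchTableContainer {
  // Declared against Map and List; only the constructor calls name concrete classes.
  private final Map<List<Object>, List<List<Object>>> rows = new HashMap<>();

  @Override
  public void put(List<Object> key, List<Object> row) {
    rows.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
  }

  @Override
  public List<List<Object>> get(List<Object> key) {
    return rows.getOrDefault(key, Collections.emptyList());
  }

  @Override
  public int size() {
    // Number of distinct keys, not number of rows.
    return rows.size();
  }
}

/** Serialization lives in its own unit and receives its context explicitly. */
final class SketchContainerWriter {
  private final SketchSerDeContext context;

  SketchContainerWriter(SketchSerDeContext context) {
    this.context = context;
  }

  /** Stand-in for real serialization: reports which serde would be used and for how many keys. */
  String describe(SketchTableContainer container) {
    return context.serdeName() + ":" + container.size();
  }
}

final class MapJoinSketch {
  public static void main(String[] args) {
    SketchTableContainer table = new SketchHashMapContainer();
    table.put(Arrays.asList("k1"), Arrays.asList("a", 1));
    table.put(Arrays.asList("k1"), Arrays.asList("b", 2));

    SketchContainerWriter writer =
        new SketchContainerWriter(new SketchSerDeContext("example-serde"));
    System.out.println(writer.describe(table));
  }
}
```

Because the context is a constructor argument, a unit test can build a container and a writer in isolation without first setting up any global state, which is the testability benefit the issue title refers to.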
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742316#comment-13742316 ]

Brock Noland commented on HIVE-4838:
------------------------------------

That test has been failing since commit. I believe Gunther asked someone to look at it.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742310#comment-13742310 ]

Hive QA commented on HIVE-4838:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12598209/HIVE-4838.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 2884 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_udtf_not_supported2
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/463/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/463/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests failed with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742184#comment-13742184 ]

Brock Noland commented on HIVE-4838:
------------------------------------

Done, looks like the last build had a connection error to source control.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741882#comment-13741882 ]

Ashutosh Chauhan commented on HIVE-4838:
----------------------------------------

[~brocknoland] Can you trigger a Hive QA run for this?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737492#comment-13737492 ]

Ashutosh Chauhan commented on HIVE-4838:
----------------------------------------

Ok. Sounds good.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737482#comment-13737482 ]

Brock Noland commented on HIVE-4838:
------------------------------------

Sounds good, I will address them. Regarding the moves, I don't believe there are any true "mv's". MapJoinObjectKey -> MapJoinKey is kind of a move, but I'd say it's more of a complete re-implementation.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737471#comment-13737471 ]

Ashutosh Chauhan commented on HIVE-4838:

Good work, Brock. I left some comments on Phabricator. Another question: it seems there are a few file mvs? To preserve history, how shall we proceed with applying this patch on trunk?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736168#comment-13736168 ]

Hive QA commented on HIVE-4838:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12597317/HIVE-4838.patch

{color:green}SUCCESS:{color} +1 2779 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/387/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/387/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735950#comment-13735950 ]

Brock Noland commented on HIVE-4838:

Good call, I will make the change tonight and upload a new patch.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735785#comment-13735785 ]

Ashutosh Chauhan commented on HIVE-4838:

[~brocknoland] Let's get this in before the patch gets stale.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732597#comment-13732597 ]

Ashutosh Chauhan commented on HIVE-4838:

bq. I am fine with removing the memory handling and using OOM. I think that I will allocate a buffer of say 1MB and then when the OOM is hit free that buffer so we can cleanly exit and log.

Sounds good. Let's proceed with that. Though I believe 256KB should be more than sufficient to generate the exception and exit cleanly.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732571#comment-13732571 ]

Brock Noland commented on HIVE-4838:

What I was saying is that the local task JVM could be a different size than what mapred.child.java.opts specifies on the server. I haven't heard of people hitting this much, so it must not be too much of an issue. Good to know the ORC stuff is only used on write, so it won't be an issue.

I am fine with removing the memory handling and using OOM. I think that I will allocate a buffer of, say, 1MB and then, when the OOM is hit, free that buffer so we can cleanly exit and log.
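The reserve-buffer idea in the comment above can be sketched roughly as follows. This is an illustration only, not code from the patch: the class name ReserveBufferSketch, the run method, and the returned strings are hypothetical, and the 1 MB figure is simply the size floated in the comment.

```java
// Hypothetical sketch: hold a small reserve so that, when an OutOfMemoryError
// is caught, releasing it leaves enough headroom to report the failure.
final class ReserveBufferSketch {
  // 1 MB, the size mentioned in the comment; purely illustrative.
  private static final int RESERVE_BYTES = 1024 * 1024;

  private byte[] reserve = new byte[RESERVE_BYTES];

  boolean hasReserve() {
    return reserve != null;
  }

  // Runs the work and returns a status string (a stand-in for logging and exiting).
  String run(Runnable work) {
    try {
      work.run();
      return "ok";
    } catch (OutOfMemoryError e) {
      // Drop the reserve first so the reporting below has memory to work with.
      reserve = null;
      return "out of memory: " + e.getMessage();
    }
  }
}
```

In a real task the catch block would log and terminate the process; returning a string here just keeps the sketch easy to exercise, for example with a Runnable that throws a simulated OutOfMemoryError.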
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732293#comment-13732293 ]

Ashutosh Chauhan commented on HIVE-4838:

Actually, the memory monitoring I was talking about concerns the local task that generates the hashtable, which happens locally on the client. To generate a hashtable (which is then shipped to the task nodes) we launch a local job on the client in a separate process. The memory-management logic for this local task is convoluted (not that of the MR job, which actually does the join in the mapper). This local task monitors its own memory, but it seems MapredLocalTask is catching the OOM exception anyway. One of these is not required.

My thinking is that there shouldn't be any memory monitoring and we should just catch the OOM exception when it fails. In any case, a join is converted into a mapjoin only when the small table is small enough (governed by a config knob), so this OOM should be very rare. So my suggestion is to remove MemoryHandler altogether.

The ORC memory manager won't be a problem here, since ORC makes use of its memory manager only while writing data, and here we are dumping the hashtable in Java serialized format, so that won't be relevant. For a similar reason (that this is a local task), java.opts and io.sort.mb aren't relevant either.
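For readers unfamiliar with what "monitors its own memory" means in practice, a self-check of this general kind can be sketched with the standard java.lang.management API. This is not the handler from the Hive code base: the class name HeapCheckSketch, both method names, and the threshold parameter are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical sketch of a heap self-check; not the Hive memory handler.
final class HeapCheckSketch {
  // True when used/max exceeds the given fraction. A non-positive max means
  // the JVM reports no defined maximum, so no judgment can be made.
  static boolean overThreshold(long usedBytes, long maxBytes, double maxFraction) {
    if (maxBytes <= 0) {
      return false;
    }
    return (double) usedBytes / maxBytes > maxFraction;
  }

  // Applies the same check to the current JVM heap.
  static boolean currentHeapOverThreshold(double maxFraction) {
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    return overThreshold(heap.getUsed(), heap.getMax(), maxFraction);
  }
}
```

The alternative argued for in the comment is to drop this kind of polling entirely and rely on catching OutOfMemoryError at the task boundary.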
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727102#comment-13727102 ]

Brock Noland commented on HIVE-4838:

I guess we could go that route. My thought was that the memory consumption was monitored to be conservative? I've always wondered about this. I mean, if an admin sets mapred.child.java.opts and io.sort.mb final on the cluster, the settings we are using from a client perspective could be completely different, so it's possible that it "works" locally but fails on the cluster.

Another question I had about this is that ORC has a memory manager that assumes it can use a certain percentage of RAM, but that could conflict with our work here? That is, the ORC memory manager could use memory while creating the hash table that we won't use when reading the hash table?

Additionally, I thought it might make sense to store only offsets into a side file in the hash map to reduce memory consumption, and then put, say, a 25MB LRU cache in front of lookups into the file. Since the file is small, it should be in the OS buffer cache when not in the LRU cache.

Maybe we should take up memory management during map joins in another JIRA?
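The LRU cache floated in the comment above could look roughly like the sketch below, built on the access-order mode of java.util.LinkedHashMap. Nothing here comes from the patch: the class name LruCacheSketch is hypothetical, and the cache is bounded by entry count rather than by bytes (the comment talks about 25MB) to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a least-recently-used cache keyed by, for example,
// an offset into a side file. Bounded by entry count for simplicity.
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruCacheSketch(int maxEntries) {
    // accessOrder = true makes get() move an entry to the most-recent position.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Called after each insertion; evicts the least recently used entry when over capacity.
    return size() > maxEntries;
  }
}
```

On a cache miss the caller would seek to the stored offset in the side file, read the row, and put it into the cache; that file-access part is left out here.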
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726791#comment-13726791 ]

Ashutosh Chauhan commented on HIVE-4838:

Yeah, I misunderstood that piece.

Another question: I see that you have improved memory handling, but I am confused about why we need to monitor memory usage here at all. This predates your patch, so the real question is whether we need a memory handler here. It seems it was put in place so that we can proactively kill the local task before it throws OOM. But since MapredLocalTask catches the OOM exception anyway, it seems that even if the local task didn't kill itself before OOM'ing, we should be fine, since MapredLocalTask will take care of the OOM exception.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726507#comment-13726507 ]

Brock Noland commented on HIVE-4838:

Hey, can you explain a little bit more? We aren't writing out the metadata per key or anything like that; we are passing the metadata down into new read/write methods. AFAICT the current approach did the static stuff because it was using the Externalizable interface, which did not allow any push-down of metadata during serialization or deserialization.

If you look at MapJoinTableContainer read and write in the new patch, you'll see us pushing the metadata (called context) down into the *new* read/write methods, and the corresponding read/write methods are not serializing that metadata.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726497#comment-13726497 ]

Ashutosh Chauhan commented on HIVE-4838:

bq. The current code is using this static code because by using java serialization there is no way to pass any "context" information down to the class when the read/write methods are being called. In the new patch I define my own read/write methods

By tracking metadata info per key, is this going to increase the size of the hashtable? Earlier, the metadata info was passed as one blob and loaded statically, where every key could look it up. Agreed, that is not a clean way of doing it, but now this patch is storing metadata info per key, which looks like it will increase the size of the hashtable.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723403#comment-13723403 ]

Edward Capriolo commented on HIVE-4838:

Hey, I think I may have mistakenly come to the conclusion that https://issues.apache.org/jira/browse/HIVE-2906 passed tests when it did not. We might be best off reverting 2906 if it is a problem.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723390#comment-13723390 ]

Hive QA commented on HIVE-4838:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12594724/HIVE-4838.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 2741 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_serde_user_properties
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/224/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/224/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests failed with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717719#comment-13717719 ]

Brock Noland commented on HIVE-4838:

Updated review https://reviews.facebook.net/D11679
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717716#comment-13717716 ] Brock Noland commented on HIVE-4838: Hey, yes I have. I'll upload an updated patch here in a few minutes. The current code uses this static state because, with Java serialization, there is no way to pass any "context" information down to the class when the read/write methods are called. In the new patch I define my own read/write methods (example below).
{noformat}
public void read(MapJoinObjectSerDeContext context, ObjectInputStream in, Writable container)
    throws IOException, SerDeException {
{noformat}
and use those to serialize/deserialize the objects. Specifically, in the new patch MapJoinRowContainer.read/write, MapJoinTableContainerSerDe.load/persist and MapJoinKey.read/write will be interesting.
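The idea of passing serialization context as a method argument, rather than through static state, can be sketched as follows. This is only an illustration under stated assumptions: `SketchContext`, `SketchKey` and `ContextSerDeSketch` are hypothetical names, the context here carries nothing but a field count, and none of this is the code from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-in for serialization metadata (not Hive's MapJoinObjectSerDeContext).
final class SketchContext {
    final int fieldCount;

    SketchContext(int fieldCount) {
        this.fieldCount = fieldCount;
    }
}

// Hypothetical key class: the context arrives as a parameter, so no static state is needed.
final class SketchKey {
    String[] fields = new String[0];

    void write(SketchContext context, DataOutputStream out) throws IOException {
        for (int i = 0; i < context.fieldCount; i++) {
            String field = fields[i];
            out.writeBoolean(field != null); // marker so nulls survive the round trip
            if (field != null) {
                out.writeUTF(field);
            }
        }
    }

    void read(SketchContext context, DataInputStream in) throws IOException {
        fields = new String[context.fieldCount];
        for (int i = 0; i < context.fieldCount; i++) {
            fields[i] = in.readBoolean() ? in.readUTF() : null;
        }
    }
}

final class ContextSerDeSketch {
    // Writes a key and reads it back, using the same context object on both sides.
    static String[] roundTrip(String[] fields) {
        try {
            SketchContext context = new SketchContext(fields.length);
            SketchKey original = new SketchKey();
            original.fields = fields;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(context, out);
            }
            SketchKey copy = new SketchKey();
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.read(context, in);
            }
            return copy.fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new String[] {"148", null})));
    }
}
```

Because the context is an ordinary argument, a unit test can construct one directly and exercise `read`/`write` in isolation, which is the testability benefit the comment above is after.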
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717703#comment-13717703 ] Ashutosh Chauhan commented on HIVE-4838: [~brocknoland] One of the items listed in the description is: * Uses static state via the MapJoinMetaData class to pass serialization metadata to the Key, Row classes. Have you addressed this in this patch? If yes, how did you fix it? I haven't dived into the patch to figure that out yet.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707751#comment-13707751 ] Brock Noland commented on HIVE-4838: Correct, I believe this only affects the null-safe operator.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707658#comment-13707658 ] Yin Huai commented on HIVE-4838: From the code, it seems this issue only affects the <=> operator.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707592#comment-13707592 ] Yin Huai commented on HIVE-4838: Hi Brock, I have a question. Does this correctness issue only affect joins with the <=> operator, or does it also affect the = operator?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707332#comment-13707332 ] Brock Noland commented on HIVE-4838: I think the equals method has been broken since HIVE-1754 but as far as I can tell it only affects joins with nulls in the join keys.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707318#comment-13707318 ] Edward Capriolo commented on HIVE-4838: This is pretty sad news. How long has map-side join been broken for?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707269#comment-13707269 ] Brock Noland commented on HIVE-4838: Map-side was wrong and reduce-side was correct. For that query, on the map side, rows which should be joined are not. For example, the reduce side outputs this row:
{noformat}
a.key  a.value  b.key  b.value
148    NULL     148    NULL
{noformat}
which makes sense since a.key is equal to b.key and a.value is equal to b.value, but the current map-side code omits this row. The reason is that MapJoinDoubleKeys is used for the map-side join, and it doesn't properly compare null values.
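To make the expected behaviour concrete, here is a rough sketch of a hash-table build and probe in which NULL matches NULL, as `<=>` requires. It is not Hive's implementation: `NullSafeProbeSketch` and its `join` method are hypothetical, and a `java.util.List` is used as the composite key only because its `equals` and `hashCode` already treat null elements consistently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NullSafeProbeSketch {
    // Joins rows on their first two columns with <=> semantics: NULL is equal to NULL.
    static List<String> join(List<Object[]> small, List<Object[]> big) {
        // Build phase: hash the small table on (key, value).
        Map<List<Object>, List<Object[]>> table = new HashMap<>();
        for (Object[] row : small) {
            table.computeIfAbsent(Arrays.asList(row[0], row[1]), k -> new ArrayList<>())
                 .add(row);
        }
        // Probe phase: look up each row of the big table.
        List<String> joined = new ArrayList<>();
        for (Object[] row : big) {
            List<Object[]> matches = table.get(Arrays.asList(row[0], row[1]));
            if (matches == null) {
                continue;
            }
            for (Object[] match : matches) {
                joined.add(Arrays.toString(match) + " " + Arrays.toString(row));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Object[]> a = new ArrayList<>();
        a.add(new Object[] {148, null});
        List<Object[]> b = new ArrayList<>();
        b.add(new Object[] {148, null});
        b.add(new Object[] {148, "val_148"}); // sample value made up for the sketch
        System.out.println(join(a, b));
    }
}
```

With a key type whose `equals` mishandles nulls, the lookup in the probe phase would miss the `148 NULL` row, which matches the symptom described above.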
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707239#comment-13707239 ] Edward Capriolo commented on HIVE-4838: So which version is correct, the map join or the map-reduce join? Or were both producing the wrong results?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707213#comment-13707213 ] Brock Noland commented on HIVE-4838: Fair enough, I'll have a patch for HIVE-4845 shortly.
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707205#comment-13707205 ] Ashutosh Chauhan commented on HIVE-4838: Interesting. Let's tease that part out from the refactoring then. We need to fix the correctness issue first. Can you create a separate jira for this issue and submit a minimal patch which fixes it?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707181#comment-13707181 ] Brock Noland commented on HIVE-4838: Hi, correct, there is. It's related to the snippet of code I posted earlier. Basically, the equals implementation of MapJoinDoubleKeys (and MapJoinObjectKey) is incorrect, resulting in different results for the following query depending on how it is executed (map-side vs reduce-side):
{noformat}
SELECT /*+ MAPJOIN(a) */ * FROM smb_input1 a JOIN smb_input1 b ON a.key <=> b.key AND a.value <=> b.value ORDER BY a.key, a.value, b.key, b.value;
{noformat}
Brock
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707154#comment-13707154 ] Ashutosh Chauhan commented on HIVE-4838: I see there is an update to a .q.out file. Does that mean there is a correctness issue in the existing code?
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705401#comment-13705401 ] Brock Noland commented on HIVE-4838: Hey, thanks for the feedback! Yes, I thought about those items as well. I have a patch just about ready, which I'd like to get in before the optimizations since it fixes some correctness bugs, but I'd love to pursue those two items in a follow-up jira. For example, the following code produces unexpected results :)
{noformat}
public static void main(String[] args) {
  // Two keys with identical contents, including a null second field.
  MapJoinDoubleKeys left = new MapJoinDoubleKeys(148, null);
  MapJoinDoubleKeys right = new MapJoinDoubleKeys(148, null);
  System.out.println(left.equals(right));
  // Two keys that differ only in their non-null second element.
  MapJoinObjectKey leftObject = new MapJoinObjectKey(new Object[]{null, "left"});
  MapJoinObjectKey rightObject = new MapJoinObjectKey(new Object[]{null, "right"});
  System.out.println(leftObject.equals(rightObject));
}
{noformat}
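For comparison, a null-safe `equals`/`hashCode` pair for a composite key might look like the sketch below. `NullSafeKeySketch` is a hypothetical class, not the fix that went into Hive; it assumes the desired behaviour is that two keys are equal exactly when every position matches, with null matching null, and it leans on `java.util.Arrays.equals` and `Arrays.hashCode`, which handle null elements.

```java
import java.util.Arrays;

// Hypothetical composite key whose equality treats null as equal to null.
final class NullSafeKeySketch {
    private final Object[] values;

    NullSafeKeySketch(Object[] values) {
        this.values = values.clone(); // defensive copy so the key stays stable in a HashMap
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof NullSafeKeySketch)) {
            return false;
        }
        // Element-wise comparison; a null element only equals another null.
        return Arrays.equals(values, ((NullSafeKeySketch) other).values);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(values); // consistent with equals, null elements hash to 0
    }

    public static void main(String[] args) {
        NullSafeKeySketch left = new NullSafeKeySketch(new Object[] {148, null});
        NullSafeKeySketch right = new NullSafeKeySketch(new Object[] {148, null});
        System.out.println(left.equals(right)); // identical contents

        NullSafeKeySketch leftObject = new NullSafeKeySketch(new Object[] {null, "left"});
        NullSafeKeySketch rightObject = new NullSafeKeySketch(new Object[] {null, "right"});
        System.out.println(leftObject.equals(rightObject)); // differ in second element
    }
}
```

Under the stated assumption, the first comparison should report the keys as equal and the second as different.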
[jira] [Commented] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability
[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704832#comment-13704832 ] Ashutosh Chauhan commented on HIVE-4838: I am glad you are taking a stab at this, Brock. I looked at it a couple of days ago and immediately felt the need for a refactor. I was looking at it from a performance point of view. There are a couple of things which are worth considering in this refactor. * We are using Java serialization to serialize the hash table. If we use some custom serialization we can possibly increase both memory efficiency and speed for this piece of code. * Keys and values of the map are wrapper Java objects; if we can use better data structures, that will be a further win. I am just putting these up as thoughts which came to my mind in a 15-minute perusal of that class. Feel free to ignore them for now; we can take these up later once this basic cleanup is in.