[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704832#comment-13704832
 ] 

Ashutosh Chauhan commented on HIVE-4838:
----------------------------------------

I am glad you are taking a stab at this, Brock. I looked at it a couple of days 
ago and immediately felt the need for a refactor. I was looking at it from a 
performance point of view. There are a couple of things worth considering in 
this refactor.
* We are using Java serialization to serialize the hash table. If we use some 
custom serialization, we can possibly improve both memory efficiency and 
speed for this piece of code.
* Keys and values of the map are Java wrapper objects; if we can use better 
data structures, that will be a further win.
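
To make the first point concrete, here is a rough sketch comparing default 
Java serialization with a hand-rolled binary format. It is not the Hive code: 
the class name SerializationSketch, the Long-to-Long table, and the 
"entry count, then raw key/value pairs" layout are all assumptions made only 
for illustration. The real MapJoin keys and rows are more complex.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hive.
public class SerializationSketch {

    // Baseline: default Java serialization of the whole map. Every boxed
    // Long is written as an object, with per-object stream overhead.
    static byte[] javaSerialize(HashMap<Long, Long> table) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(table);
        }
        return bytes.toByteArray();
    }

    // Assumed custom format: a 4-byte entry count followed by raw
    // 8-byte key and 8-byte value for each entry.
    static byte[] customSerialize(Map<Long, Long> table) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(table.size());
            for (Map.Entry<Long, Long> e : table.entrySet()) {
                out.writeLong(e.getKey());
                out.writeLong(e.getValue());
            }
        }
        return bytes.toByteArray();
    }

    // Reads back the format written by customSerialize.
    static Map<Long, Long> customDeserialize(byte[] data) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            int n = in.readInt();
            Map<Long, Long> table = new HashMap<>(Math.max(16, n * 2));
            for (int i = 0; i < n; i++) {
                long key = in.readLong();
                long value = in.readLong();
                table.put(key, value);
            }
            return table;
        }
    }

    public static void main(String[] args) throws IOException {
        HashMap<Long, Long> table = new HashMap<>();
        for (long i = 0; i < 1000; i++) {
            table.put(i, i * 10);
        }
        byte[] viaJava = javaSerialize(table);
        byte[] viaCustom = customSerialize(table);
        System.out.println("java serialization bytes: " + viaJava.length);
        System.out.println("custom format bytes:      " + viaCustom.length);
        System.out.println("round trip ok: "
            + customDeserialize(viaCustom).equals(table));
    }
}
```

The sketch still rebuilds a HashMap of boxed Longs on read, so it only speaks 
to the serialization point. The second point would need a primitive-based 
structure in place of the map itself.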

I am just putting these up as thoughts that came to mind during a 15-minute 
perusal of that class. Feel free to ignore them for now; we can take these up 
later once this basic cleanup is in.
                
> Refactor MapJoin HashMap code to improve testability and readability
> --------------------------------------------------------------------
>
>                 Key: HIVE-4838
>                 URL: https://issues.apache.org/jira/browse/HIVE-4838
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>
> MapJoin is an essential component for high performance joins in Hive and the 
> current code has done great service for many years. However, the code is 
> showing its age and currently suffers from the following issues:
> * Uses static state via the MapJoinMetaData class to pass serialization 
> metadata to the Key, Row classes.
> * The API of a logical "Table Container" is not defined and therefore it's 
> unclear what APIs HashMapWrapper needs to publicize. Additionally 
> HashMapWrapper has many unused public methods.
> * HashMapWrapper contains logic to serialize, test memory bounds, and 
> implement the table container. Ideally these logical units could be separated.
> * HashTableSinkObjectCtx has unused fields and unused methods
> * CommonJoinOperator and children use ArrayList on left hand side when only 
> List is required
> * There are unused classes MRU, DCLLItem, MapJoinSingleKey, and 
> MapJoinDoubleKeys
