[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724594#action_12724594 ]
Alan Gates commented on PIG-793:
--------------------------------

Using jmap, I've been toying around with our DefaultTuple implementation to see how much memory it takes. For a tuple with 3 elements (one int, one double, and one 20-character string), I see it taking:

16 bytes for the Tuple object
24 bytes for the ArrayList<Object> in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data

Pointers in the ArrayList and character data in the String appear to be padded and vary somewhat depending on how I run the experiments.

I played with changing the ArrayList<Object> in DefaultTuple to an Object[]. There are two advantages: the 24 bytes of ArrayList shrink to 12 for the Object[], and because I wrote it so that the Object[] is always exactly the right size, there is no padding cost. The downside is that append becomes a more expensive operation, because it grows the Object[] by one every time. However, after some investigation I believe that most places we use append can be changed to use set, thus alleviating this issue. I'm working on a patch to change this. Once I have that done I'll report on how it changes memory usage, as well as any performance gains or losses.

A related item I would like to look into is using Hadoop's Text instead of String to back chararray. Text takes 16 bytes of overhead + 36 bytes for string data to store 20 characters, versus the 24 / 52 of String. Obviously this would be a huge change and would need very impressive results to be considered. I'll play with it and report results here.

> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
>
> Currently, our tuple is a real pig and uses a lot of extra memory.
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than ArrayList.
> There might be more.
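
For illustration only, here is a minimal sketch of the trade-off described in the comment above: a tuple backed by an exactly-sized Object[], where set is cheap and append has to reallocate. The class name ArrayTuple and its methods are hypothetical and chosen for this example. This is not Pig's DefaultTuple and not the patch mentioned in the comment.

```java
import java.util.Arrays;

// Hypothetical sketch of an array-backed tuple; not Pig's DefaultTuple.
final class ArrayTuple {
    // Always sized exactly to the number of fields, so no spare capacity is carried.
    private Object[] fields;

    ArrayTuple(int size) {
        fields = new Object[size];
    }

    int size() {
        return fields.length;
    }

    Object get(int i) {
        return fields[i];
    }

    // Cheap: writes into a slot that already exists.
    void set(int i, Object value) {
        fields[i] = value;
    }

    // Expensive: allocates a new array one slot larger and copies every field.
    void append(Object value) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = value;
    }

    public static void main(String[] args) {
        // When the field count is known up front: allocate once, then set.
        ArrayTuple t = new ArrayTuple(3);
        t.set(0, 42);
        t.set(1, 3.14);
        t.set(2, "twenty characters ok");
        System.out.println(t.size());

        // Still works, but copies the whole array on every call.
        t.append(7L);
        System.out.println(t.size());
    }
}
```

Under these assumptions, a caller that knows the number of fields in advance would construct the tuple at full size and use set, keeping append for the cases where the size really is unknown.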