[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724880#action_12724880 ]
Alan Gates commented on PIG-793:
--------------------------------

The cost for storing data raw is:

16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short[] to hold offsets into the byte[]

Then, as you say above, there is the data itself, plus 1 byte per field to store type and nullness. So our example tuple would take ~85 bytes. But in general, yes, you can do much better with raw bytes.

We played with this some and found that the cost of Tuple.get/set goes up 10x because of the need to turn the bytes into objects. In a typical query this added about 2x to the overall run time. The solution would be to rewrite all the Pig operators to work on byte data instead of objects. This is a large project, and it doesn't solve the problem for UDFs. We could pay the performance penalty for UDFs, or we could change the UDFs to take byte data. Currently many of our users are asking for the ability to write UDFs in Python or other scripting languages. If we instead go the other way and basically make them write C-style Java, I don't think that will be popular.

What we're playing with now (changing ArrayList<Object> to Object[] and String to Text) should give somewhere around 50% of the memory savings of going to fully raw data, for around 10% of the work.

I'm not excluding moving to storing everything in a byte[] in the future. But I want to see whether a little work now can get us a decent amount of improvement.

> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
>
> Currently, our tuple is a real pig and uses a lot of extra memory.
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than
> ArrayList.
> There might be more.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
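
For reference, the raw-storage cost breakdown in the comment can be written out as simple arithmetic. This is only a sketch: the class name RawTupleCost, the method estimate, and the 3-field, 20-byte tuple in main are hypothetical and do not come from Pig; the per-object sizes are the figures quoted in the comment, not measurements.

```java
/**
 * Rough size estimate for a tuple stored as raw bytes, using the
 * per-object figures quoted in the comment. Hypothetical helper,
 * not part of Pig.
 */
final class RawTupleCost {
    static final int TUPLE_OBJECT_BYTES = 16;        // the tuple object itself
    static final int BYTE_ARRAY_OBJECT_BYTES = 12;   // the byte[] holding the data
    static final int OFFSET_ARRAY_OBJECT_BYTES = 12; // the short[] of offsets
    static final int OFFSET_BYTES_PER_FIELD = 2;     // one short per field
    static final int TYPE_BYTES_PER_FIELD = 1;       // type and nullness marker

    /**
     * @param fieldCount number of fields in the tuple
     * @param dataBytes  total bytes of serialized field data
     * @return estimated bytes for the raw representation
     */
    static int estimate(int fieldCount, int dataBytes) {
        if (fieldCount < 0 || dataBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        int fixed = TUPLE_OBJECT_BYTES + BYTE_ARRAY_OBJECT_BYTES
                + OFFSET_ARRAY_OBJECT_BYTES;
        int perField = OFFSET_BYTES_PER_FIELD + TYPE_BYTES_PER_FIELD;
        return fixed + fieldCount * perField + dataBytes;
    }

    public static void main(String[] args) {
        // Hypothetical tuple: 3 fields carrying 20 bytes of data in total.
        System.out.println(estimate(3, 20));
    }
}
```

The fixed overhead is 40 bytes regardless of tuple width, and each field adds 3 bytes on top of its data, which is why the raw layout does well against one Java object per field.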