[ https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170754#comment-13170754 ]
Dmitriy V. Ryaboy commented on PIG-2359:
----------------------------------------

Thanks for the close read, Alan!

bq. In PrimitiveTuple.get(), I wonder if you'd get faster access if you removed the array bounds check. Java is going to do that for you anyway. You can catch the IndexOutOfBoundsException and rethrow it with a nicer error message.

Will do.

bq. The same comment applies to checking whether the buffer capacity will be exceeded by reading the requested field.

Will do.

bq. Also applies to set()

Will do.

bq. Does append ever make sense for these types of tuples? Should it just throw NotSupportedException?

I think it makes sense, painful as it is -- users can get a PTuple handed to their UDF and unwittingly call append on it. I don't want existing scripts to crash, so I'm trying to make things degrade nicely.

bq. In the P*Tuple classes, when a user calls set(int pos, Object o), you are forcing o into the type of the tuple (e.g., for PIntTuple you are forcing it into an int). This is a change of semantics from the general tuple contract where whatever you pass to set is taken to be the value for that field. I would like to understand more about the use case when you would expect to see this used. Is it that you want to force this to int because the data may or may not be all ints (like there may be some floats)? I think it would be better to just take an int, and return a null and issue a warning if what you get isn't an int. This still violates the semantic, but at least it doesn't silently produce a different result. If the use case is only for the internal use of passing data between map and reducer or between MR jobs, then I definitely think we should forget all the checks and just assume the data is correct.

The use case isn't just internal; I started this in the first place because I needed to construct large tuple bags in a UDF. My reasoning for taking the int value was that this is what we do when people "cast" a float to an int in Pig.
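To make the get()/set() points above concrete, here is a minimal sketch. It is not the code from the patch: the class name IntTupleSketch, its fields, and the error message wording are all made up for illustration. It assumes a tuple backed by a java.nio.ByteBuffer. It relies on the buffer's own bounds check, rethrowing with a friendlier message, and it coerces whatever is passed to set() into an int, the way a float-to-int cast would.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for an int-only tuple; illustrative only, not the patch's class.
final class IntTupleSketch {
    private final ByteBuffer buf;
    private final int size;

    IntTupleSketch(int size) {
        this.size = size;
        this.buf = ByteBuffer.allocate(size * Integer.BYTES);
    }

    // No explicit range check: ByteBuffer.getInt(int) already throws
    // IndexOutOfBoundsException for an offset outside the buffer,
    // so we only translate that into a message that names the field.
    int get(int pos) {
        try {
            return buf.getInt(pos * Integer.BYTES);
        } catch (IndexOutOfBoundsException e) {
            throw new IndexOutOfBoundsException(
                "Cannot read field " + pos + ": tuple has " + size + " field(s)");
        }
    }

    // Coerces any Number to int, truncating toward zero like a (int) cast.
    // A non-Number argument fails with ClassCastException in this sketch.
    void set(int pos, Object o) {
        int value = ((Number) o).intValue();
        try {
            buf.putInt(pos * Integer.BYTES, value);
        } catch (IndexOutOfBoundsException e) {
            throw new IndexOutOfBoundsException(
                "Cannot write field " + pos + ": tuple has " + size + " field(s)");
        }
    }
}
```

With this sketch, set(0, 3.9f) followed by get(0) yields 3, while get(2) on a two-field tuple fails with the rethrown message rather than the buffer's bare exception.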
If you declare the schema to be an int, and put in a float... seems to me like having an int come out is ok. Could also die abruptly. I think null would be most surprising of the available choices.

bq. You added new methods to the TupleFactory class, which is marked as Stable. You'll need to provide default implementations of those to avoid breaking backward compatibility.

Good call, will do.

bq. Why is this patch changing http libraries? (See the changes to ivy/library.properties.)

I did that when I was going to use ByteArrayBuffer, offered by httpcore. The nice thing about it is that it's resizable, but then again it doesn't have the r/wLong, r/wInt, etc methods, so I reverted to regular nio.ByteBuffer. There is no strict reason to change the libs, but it's a safe bump -- and the libs we are currently using are deprecated and replaced by the ones I bumped to (it's the same project, which got moved inside Apache). See http://hc.apache.org/httpclient-3.x/

> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch, PIG-2359.2.patch, PIG-2359.3.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible
> to avoid this overhead, which would result in significant memory savings.