[jira] [Commented] (PIG-2359) Support more efficient Tuples when schemas are known

Dmitriy V. Ryaboy (Commented) (JIRA) Thu, 15 Dec 2011 20:14:06 -0800

    [ https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170754#comment-13170754 ]


Dmitriy V. Ryaboy commented on PIG-2359:
----------------------------------------

Thanks for the close read, Alan!

bq. In PrimitiveTuple.get(), I wonder if you'd get faster access if you removed 
the array bounds check. Java is going to do that for you anyway. You can catch 
the IndexOutOfBoundsException and rethrow it with a nicer error message.

Will do.
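
For illustration only, a minimal sketch of what relying on the built-in check 
could look like. The class and method names are hypothetical and not taken 
from the patch, and the real Tuple.get() would presumably wrap the failure in 
Pig's own exception type rather than rethrowing IndexOutOfBoundsException.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Hypothetical int-only tuple backed by a ByteBuffer; not the class from the patch.
final class IntTupleSketch {
    private final IntBuffer ints; // int view over the backing ByteBuffer
    private final int size;

    IntTupleSketch(int size) {
        this.size = size;
        this.ints = ByteBuffer.allocate(size * Integer.BYTES).asIntBuffer();
    }

    // No explicit range check: IntBuffer.get(int) already validates the index.
    int get(int pos) {
        try {
            return ints.get(pos);
        } catch (IndexOutOfBoundsException e) {
            throw badField("read", pos);
        }
    }

    // Same idea for writes: let the buffer reject a bad position.
    void set(int pos, int value) {
        try {
            ints.put(pos, value);
        } catch (IndexOutOfBoundsException e) {
            throw badField("write", pos);
        }
    }

    // Builds the friendlier message that replaces the buffer's default one.
    private IndexOutOfBoundsException badField(String action, int pos) {
        return new IndexOutOfBoundsException(
            "Cannot " + action + " field " + pos + " of a tuple with " + size + " fields");
    }
}
```

The try/catch costs nothing on the normal path, so the only per-call check 
left is the one the buffer performs anyway.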

bq. The same comment applies to checking whether the buffer capacity will be 
exceeded by reading the requested field.

Will do.

bq. Also applies to set()

Will do.

bq. Does append ever make sense for these types of tuples? Should it just throw 
NotSupportedException?

I think it makes sense, painful as it is -- users can get a PTuple handed to 
their UDF and unwittingly call append on it. I don't want existing scripts to 
crash, so I'm trying to make things degrade gracefully.
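
One way such degradation could work, as a rough sketch under my own 
assumptions (the class below is hypothetical and is not how the patch 
necessarily does it): keep the compact primitive storage until append() is 
called, then copy the fields into generic Object storage once and carry on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tuple that starts compact and falls back to Object storage on append().
final class DegradingIntTuple {
    private int[] ints;           // compact storage while every field is an int
    private List<Object> generic; // non-null once append() has been called

    DegradingIntTuple(int... values) {
        this.ints = values.clone();
    }

    // Appending gives up the memory savings but keeps the caller's script working.
    void append(Object o) {
        if (generic == null) {
            generic = new ArrayList<>(ints.length + 1);
            for (int v : ints) {
                generic.add(v); // boxes each int exactly once
            }
            ints = null;
        }
        generic.add(o);
    }

    Object get(int pos) {
        return generic != null ? generic.get(pos) : Integer.valueOf(ints[pos]);
    }

    int size() {
        return generic != null ? generic.size() : ints.length;
    }

    boolean isCompact() {
        return generic == null;
    }
}
```

The cost is paid only by callers that actually append; everyone else keeps the 
primitive layout.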

bq. In the P*Tuple classes, when a user calls set(int pos, Object o), you are 
forcing o into the type of the tuple (e.g., for PIntTuple you are forcing it 
into an int). This is a change of semantics from the general tuple contract 
where whatever you pass to set is taken to be the value for that field. I would 
like to understand more about the use case when you would expect to see this 
used. Is it that you want to force this to int because the data may or may not 
be all ints (like there may be some floats?). I think it would be better to 
just take an int, and return a null and issue a warning if what you get isn't 
an int. This still violates the semantic, but at least it doesn't silently 
produce a different result. If the use case is only for the internal use of 
passing data between map and reducer or between MR jobs, then I definitely 
think we should forget all the checks and just assume the data is correct.

The use case isn't just internal; I started this in the first place because I 
needed to construct large tuple bags in a UDF. My reasoning for taking the int 
value was that this is what we do when people "cast" a float to an int in Pig. 
If you declare the schema to be an int and put in a float, it seems to me that 
having an int come out is OK. It could also die abruptly. I think null would 
be the most surprising of the available choices.
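
To make the two behaviours under discussion concrete, here is a small sketch 
(the helper names are hypothetical, and the truncation shown is simply what 
Java's Number.intValue() does for a Float, which I am assuming matches the 
cast behaviour described above):

```java
// Hypothetical helpers contrasting the options for set(int pos, Object o) on an int tuple.
final class IntCoercionSketch {
    private IntCoercionSketch() {}

    // Cast-like behaviour: any Number is narrowed to int, anything else fails loudly.
    static int coerceLikeCast(Object o) {
        if (o instanceof Number) {
            return ((Number) o).intValue(); // a Float of 3.7 becomes 3
        }
        throw new IllegalArgumentException("Not a number: " + o);
    }

    // The reviewer's suggestion: accept only Integer and store null otherwise.
    static Integer nullIfNotInt(Object o) {
        return (o instanceof Integer) ? (Integer) o : null;
    }
}
```

With an input of 3.7f, the first helper yields 3 and the second yields null, 
which is the difference in user-visible results being debated.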

bq. You added new methods to the TupleFactory class, which is marked as Stable. 
You'll need to provide default implementations of those to avoid breaking 
backward compatibility.

Good call, will do.
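
Roughly what I have in mind, sketched with made-up names (these are not the 
actual TupleFactory signatures): the new method gets a concrete body that 
delegates to an existing abstract one, so subclasses written against the old 
API still compile and run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Stable abstract factory; not the real TupleFactory API.
abstract class TupleFactorySketch {
    // Pre-existing abstract method that every subclass already implements.
    abstract List<Object> newTuple(int size);

    // Newly added method: it has a default body, so old subclasses are unaffected.
    List<Object> newTupleForSchema(Class<?>... fieldTypes) {
        return newTuple(fieldTypes.length);
    }
}

// A subclass written before the new method existed; it needs no changes.
class LegacyFactory extends TupleFactorySketch {
    @Override
    List<Object> newTuple(int size) {
        List<Object> fields = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            fields.add(null);
        }
        return fields;
    }
}
```

A factory that knows about primitive tuples can override the new method, while 
third-party factories silently get the generic behaviour.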

bq. Why is this patch changing http libraries? (See the changes to 
ivy/library.properties.)

I did that when I was going to use ByteArrayBuffer, offered by httpcore. The 
nice thing about it is that it's resizable, but then again it doesn't have the 
r/wLong, r/wInt, etc. methods, so I reverted to the regular nio.ByteBuffer. 
There is no strict reason to change the libs, but it's a safe bump -- the libs 
we are currently using are deprecated and have been replaced by the ones I 
bumped to (it's the same project, which got moved inside Apache). See 
http://hc.apache.org/httpclient-3.x/
                
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch, PIG-2359.2.patch, PIG-2359.3.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are 
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible 
> to avoid this overhead, which would result in significant memory savings.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
