[
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247999#comment-13247999
]
[email protected] commented on PIG-2632:
----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------
This is an incredible piece of work!
See my comments bellow
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14556>
it would be nice if Pig called outputSchema() once during the init and that
was saved somewhere.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14557>
So the side effect is that the Tuple class get generated here ? This should
be encapsulated in a more obvious method here.
like: generateTupleForSchema(tmpS) and not necessarily create an instance
of a tuple.
also that would remove duplication here.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14558>
those are the inputSchema and the outputSchema. The fields should probably
called just that and have boolean flag to know if we use generated Tuples?
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14559>
I would make it:
TupleFactory.geInstanceForSchema().newTuple()
trunk/src/org/apache/pig/data/BinInterSedes.java
<https://reviews.apache.org/r/4651/#comment14560>
should SchemaTuple extend TypeAwareTuple ?
trunk/src/org/apache/pig/data/PrimitiveTuple.java
<https://reviews.apache.org/r/4651/#comment14564>
BytesKey ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14573>
statics on top
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14574>
largestSetValue should already be correct. Why does largestSetValue stays
at the size of the schema?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14575>
isAppendNull()
Should the append...() methods be public?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14576>
the result would be the same, right?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14580>
setAppend()
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14578>
http://www.ibm.com/developerworks/java/library/j-codetoheap/index.html
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14631>
if it is always overridden, it should be abstract
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14579>
getAppendType()
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14584>
set? stream?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14583>
getSizeNoAppend() ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14585>
redundant with the 2 others?
idClassMap.get(schemaIdMap.get(schema))
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14586>
could be computed statically by the generated class if you make a static
getSchema(schemaString).
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14587>
setAll() ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14588>
setAll() ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14589>
do we need all of set(Tuple), set(List), setEqualTo(List),
setEqualTo(Tuple) ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14591>
you could add a properties file to the jar with this info. Or a generated
factory that knows all the classes generated. You mainly need the mapping
schema -> id
you could also use true. Although that would not be THE answer to THE
question.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14592>
As we have schema aware Tuples, in the future we could add get(String
fieldName). We may not want to strip the aliases to reuse the class.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14593>
looking forward to support for those :)
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14596>
there's only one code path where updateMaps is true. It could be simplified
a bit.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14595>
Logger
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14597>
they should not be public
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14598>
refactor DataByteArray to provide this?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14599>
Same comment, the String serialization should be shared with the regular
Tuple.
Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and
convert lazilyto String (possibly caching the result)
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14600>
same comment as before.
The same logic should be use with regular tuples.
The same logic should be shared across types.
It could also use a variable size int format.
https://developers.google.com/protocol-buffers/docs/encoding#types
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14601>
is (null, null, null) the same as null ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14609>
this.set(t) ?
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14610>
DefaultTuple replaces null with ""
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14611>
We could have a list implementation that just wraps a tuple.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14612>
yep, sounds like it could be generated and fall back to this if the other
is a regular tuple
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14614>
you may want to log compilation warnings/errors
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14613>
Class.forName(className) uses the classLoader of the current class.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14615>
if it does not need to be public then yes
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14616>
same comment as before about stripped aliases
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14617>
the parameter t is thrown away ?
they don't need to be public.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14618>
depending on how those are used, java already does auto un-boxing.
The following works so you may not need them:
long unbox(Long v) {
return v;
}
they don't need to be public
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14620>
same comment, the following works:
Float box(float v) {
return v;
}
they don't need to be public
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14622>
this file has 1500 lines :)
I think those classes have earned their own package.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14625>
this could be a method in the parent class
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14626>
this could be a method in the parent class
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14635>
yes, an array is an object, so reference + sizeof(array)
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14638>
CheckIfNullString.masks
maybe they should be in a Masks class then, along with the masking code
generation (" & (byte)+...)
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14643>
This feels weird.
what about:
"SchemaTuple.unbox(("+getBoxedTypeName()+")appendGet(diff))"
getBoxedTypeName() {
switch (type) {
case (DataType.INTEGER): return "Integer";
...
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14621>
use
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for
all those string concatenations.
trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14619>
what does isObjNotTup mean? What is this case?
- Julien
On 2012-04-05 01:31:00, Jonathan Coveney wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/4651/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2012-04-05 01:31:00)
bq.
bq.
bq. Review request for Julien Le Dem.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. This work builds on Dmitriy's PrimitiveTuple work. The idea is that,
knowing the Schema on the frontend, we can code generate Tuples which can be
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x
better, and it's ~15% smaller serialized (heavily heavily depends on the data,
though). Need to do get/set tests, but assuming that it's on par (or even
faster) than Tuple, the memory gain is huge.
bq.
bq. Need to clean up the code and add tests.
bq.
bq. Right now, it generates a SchemaTuple for every inputSchema and
outputSchema given to UDF's. The next step is to make a SchemaBag, where I
think the serialization savings will be really huge.
bq.
bq. Needs tests and comments, but I want the code to settle a bit.
bq.
bq.
bq. This addresses bug PIG-2632.
bq. https://issues.apache.org/jira/browse/PIG-2632
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
1309628
bq. trunk/src/org/apache/pig/data/BinInterSedes.java 1309628
bq. trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628
bq. trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/TupleFactory.java 1309628
bq. trunk/src/org/apache/pig/impl/PigContext.java 1309628
bq.
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java
1309628
bq.
bq. Diff: https://reviews.apache.org/r/4651/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq.
bq. Thanks,
bq.
bq. Jonathan
bq.
bq.
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
> Key: PIG-2632
> URL: https://issues.apache.org/jira/browse/PIG-2632
> Project: Pig
> Issue Type: Improvement
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing
> the Schema on the frontend, we can code generate Tuples which can be used for
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better,
> and it's ~15% smaller serialized (heavily heavily depends on the data,
> though). Need to do get/set tests, but assuming that it's on par (or even
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema
> given to UDF's. The next step is to make a SchemaBag, where I think the
> serialization savings will be really huge.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira