[ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249683#comment-13249683 ]
jirapos...@reviews.apache.org commented on PIG-2632: ---------------------------------------------------- ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6784 ----------------------------------------------------------- trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java <https://reviews.apache.org/r/4651/#comment15009> instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment15010> How complex is this class ? Not sure it is worth pulling the whole mahout just for this. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment15011> append...() methods should not be null (part of your TODO list?) trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment15013> handle the case where other == null Here you get NullPointerException trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment15014> same comment trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment15015> If you override equals(), you should override hashCode(). 2 object that are equal must return the same hashcode trunk/src/org/apache/pig/data/TupleFactory.java <https://reviews.apache.org/r/4651/#comment15052> that's where we need a proper classloader trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15016> try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15017> possibly we want to rename this and/or add something else for this file. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15018> ? either get rid of the empty catch block or document why it is Ok. Here I don't see why it would be Ok that we can not instantiate that class trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15019> "org.apache.pig.generated." + classname ? trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15020> "org/apache/pig/generated/" + classname + ".class" ? trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15021> we should probably not compile in the current dir. Put it in the temp dir intead trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15022> this could be calling a method in the parent trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15023> this could be in the parent trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15024> stuff which is not calling a generated field directly should be pulled up. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15025> same comment trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15026> same trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15027> this is wasteful. Maybe DataByteArray could provide a static compare(byte[], byte[]) ? trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15028> duplicated code trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15029> this can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15030> the exception handling can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15031> "setPos_" + fieldPos + "((Boolean)val);" trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15032> same with a mapping for the java object for the type trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15033> can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15034> you can just remove the call to box, it should work trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15035> this can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15036> can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15037> can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15038> pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15039> pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15040> pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15041> pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java <https://reviews.apache.org/r/4651/#comment15042> pull up trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15043> Ok, temporary trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15044> do we need this. Let's just fall back to regular tuple when generation fails, so that we don't need to keep those in sync trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15045> The factory could be generated as well so that we don't need to use reflection here. Reflection is slower to create new instances. trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15046> The parameter type is enough to differentiate. getSchemaTupleFactory(Schema schema) trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15047> same trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15048> getTupleClass() trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15049> getTupleClass() trunk/src/org/apache/pig/data/SchemaTupleFactory.java <https://reviews.apache.org/r/4651/#comment15050> getTupleClass() trunk/src/org/apache/pig/data/Tuple.java <https://reviews.apache.org/r/4651/#comment15051> typo - Julien On 2012-04-08 22:26:29, Jonathan Coveney wrote: bq. bq. ----------------------------------------------------------- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/4651/ bq. ----------------------------------------------------------- bq. bq. (Updated 2012-04-08 22:26:29) bq. bq. bq. Review request for pig and Julien Le Dem. bq. bq. bq. Summary bq. ------- bq. bq. This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. bq. bq. Need to clean up the code and add tests. bq. bq. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. bq. bq. Needs tests and comments, but I want the code to settle a bit. bq. bq. bq. This addresses bug PIG-2632. bq. https://issues.apache.org/jira/browse/PIG-2632 bq. bq. bq. Diffs bq. ----- bq. bq. trunk/bin/pig 1310666 bq. trunk/build.xml 1310666 bq. trunk/ivy.xml 1310666 bq. trunk/ivy/libraries.properties 1310666 bq. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 bq. trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 bq. trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION bq. trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 bq. trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION bq. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION bq. trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION bq. trunk/src/org/apache/pig/data/Tuple.java 1310666 bq. trunk/src/org/apache/pig/data/TupleFactory.java 1310666 bq. trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 bq. trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION bq. trunk/src/org/apache/pig/impl/PigContext.java 1310666 bq. trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 bq. bq. Diff: https://reviews.apache.org/r/4651/diff bq. bq. bq. Testing bq. ------- bq. bq. bq. Thanks, bq. bq. Jonathan bq. bq. > Create a SchemaTuple which generates efficient Tuples via code gen > ------------------------------------------------------------------ > > Key: PIG-2632 > URL: https://issues.apache.org/jira/browse/PIG-2632 > Project: Pig > Issue Type: Improvement > Reporter: Jonathan Coveney > Assignee: Jonathan Coveney > Fix For: 0.11 > > Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch > > > This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing > the Schema on the frontend, we can code generate Tuples which can be used for > fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, > and it's ~15% smaller serialized (heavily heavily depends on the data, > though). Need to do get/set tests, but assuming that it's on par (or even > faster) than Tuple, the memory gain is huge. > Need to clean up the code and add tests. > Right now, it generates a SchemaTuple for every inputSchema and outputSchema > given to UDF's. The next step is to make a SchemaBag, where I think the > serialization savings will be really huge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira