[ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247999#comment-13247999 ]
jirapos...@reviews.apache.org commented on PIG-2632: ---------------------------------------------------- ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6722 ----------------------------------------------------------- This is an incredible piece of work! See my comments bellow trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java <https://reviews.apache.org/r/4651/#comment14556> it would be nice if Pig called outputSchema() once during the init and that was saved somewhere. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java <https://reviews.apache.org/r/4651/#comment14557> So the side effect is that the Tuple class get generated here ? This should be encapsulated in a more obvious method here. like: generateTupleForSchema(tmpS) and not necessarily create an instance of a tuple. also that would remove duplication here. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java <https://reviews.apache.org/r/4651/#comment14558> those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples? trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java <https://reviews.apache.org/r/4651/#comment14559> I would make it: TupleFactory.geInstanceForSchema().newTuple() trunk/src/org/apache/pig/data/BinInterSedes.java <https://reviews.apache.org/r/4651/#comment14560> should SchemaTuple extend TypeAwareTuple ? trunk/src/org/apache/pig/data/PrimitiveTuple.java <https://reviews.apache.org/r/4651/#comment14564> BytesKey ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14573> statics on top trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14574> largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14575> isAppendNull() Should the append...() methods be public? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14576> the result would be the same, right? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14580> setAppend() trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14578> http://www.ibm.com/developerworks/java/library/j-codetoheap/index.html trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14631> if it is always overridden, it should be abstract trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14579> getAppendType() trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14584> set? stream? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14583> getSizeNoAppend() ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14585> redundant with the 2 others? idClassMap.get(schemaIdMap.get(schema)) trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14586> could be computed statically by the generated class if you make a static getSchema(schemaString). trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14587> setAll() ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14588> setAll() ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14589> do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14591> you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id you could also use true. Although that would not be THE answer to THE question. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14592> As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14593> looking forward to support for those :) trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14596> there's only one code path where updateMaps is true. It could be simplified a bit. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14595> Logger trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14597> they should not be public trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14598> refactor DataByteArray to provide this? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14599> Same comment, the String serialization should be shared with the regular Tuple. Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result) trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14600> same comment as before. The same logic should be use with regular tuples. The same logic should be shared across types. It could also use a variable size int format. https://developers.google.com/protocol-buffers/docs/encoding#types trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14601> is (null, null, null) the same as null ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14609> this.set(t) ? trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14610> DefaultTuple replaces null with "" trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14611> We could have a list implementation that just wraps a tuple. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14612> yep, sounds like it could be generated and fall back to this if the other is a regular tuple trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14614> you may want to log compilation warnings/errors trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14613> Class.forName(className) uses the classLoader of the current class. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14615> if it does not need to be public then yes trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14616> same comment as before about stripped aliases trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14617> the parameter t is thrown away ? they don't need to be public. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14618> depending on how those are used, java already does auto un-boxing. The following works so you may not need them: long unbox(Long v) { return v; } they don't need to be public trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14620> same comment, the following works: Float box(float v) { return v; } they don't need to be public trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14622> this file has 1500 lines :) I think those classes have earned their own package. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14625> this could be a method in the parent class trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14626> this could be a method in the parent class trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14635> yes, an array is an object, so reference + sizeof(array) trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14638> CheckIfNullString.masks maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...) trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14643> This feels weird. what about: "SchemaTuple.unbox(("+getBoxedTypeName()+")appendGet(diff))" getBoxedTypeName() { switch (type) { case (DataType.INTEGER): return "Integer"; ... trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14621> use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations. trunk/src/org/apache/pig/data/SchemaTuple.java <https://reviews.apache.org/r/4651/#comment14619> what does isObjNotTup mean? What is this case? - Julien On 2012-04-05 01:31:00, Jonathan Coveney wrote: bq. bq. ----------------------------------------------------------- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/4651/ bq. ----------------------------------------------------------- bq. bq. (Updated 2012-04-05 01:31:00) bq. bq. bq. Review request for Julien Le Dem. bq. bq. bq. Summary bq. ------- bq. bq. This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. bq. bq. Need to clean up the code and add tests. bq. bq. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. bq. bq. Needs tests and comments, but I want the code to settle a bit. bq. bq. bq. This addresses bug PIG-2632. bq. https://issues.apache.org/jira/browse/PIG-2632 bq. bq. bq. Diffs bq. ----- bq. bq. trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 bq. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 bq. trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 bq. trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 bq. trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION bq. trunk/src/org/apache/pig/data/TupleFactory.java 1309628 bq. trunk/src/org/apache/pig/impl/PigContext.java 1309628 bq. trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 bq. bq. Diff: https://reviews.apache.org/r/4651/diff bq. bq. bq. Testing bq. ------- bq. bq. bq. Thanks, bq. bq. Jonathan bq. bq. > Create a SchemaTuple which generates efficient Tuples via code gen > ------------------------------------------------------------------ > > Key: PIG-2632 > URL: https://issues.apache.org/jira/browse/PIG-2632 > Project: Pig > Issue Type: Improvement > Reporter: Jonathan Coveney > Assignee: Jonathan Coveney > Fix For: 0.11 > > Attachments: PIG-2632-0.patch, PIG-2632-1.patch > > > This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing > the Schema on the frontend, we can code generate Tuples which can be used for > fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, > and it's ~15% smaller serialized (heavily heavily depends on the data, > though). Need to do get/set tests, but assuming that it's on par (or even > faster) than Tuple, the memory gain is huge. > Need to clean up the code and add tests. > Right now, it generates a SchemaTuple for every inputSchema and outputSchema > given to UDF's. The next step is to make a SchemaBag, where I think the > serialization savings will be really huge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira