[ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248133#comment-13248133 ]
jirapos...@reviews.apache.org commented on PIG-2632: ---------------------------------------------------- bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 133 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133> bq. > bq. > it would be nice if Pig called outputSchema() once during the init and that was saved somewhere. It calls it many times, alas. I feel like changing that is a job for another patch, but I could tackle that here. But given there is already the expectation that it will be called an arbitrary number of times, I don't think that calling it once more on the front end will be an issue? bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 136-145 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line136> bq. > bq. > So the side effect is that the Tuple class get generated here ? This should be encapsulated in a more obvious method here. bq. > like: generateTupleForSchema(tmpS) and not necessarily create an instance of a tuple. bq. > also that would remove duplication here. I was trying to leverage Dmitriy's newTupleForSchema, but I think you're right. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 151-152 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151> bq. > bq. > those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples? Well, we need to know what the tuple to generate is. I will change it save the ID that serves as the base, but as is the Schema info isn't saved and it isn't available. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188> bq. > bq. > I would make it: bq. > TupleFactory.geInstanceForSchema().newTuple() I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/BinInterSedes.java, lines 601-602 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100076#file100076line601> bq. > bq. > should SchemaTuple extend TypeAwareTuple ? Agreed, and my working version has that update. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/PrimitiveTuple.java, line 213 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100077#file100077line213> bq. > bq. > BytesKey ? good call bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 46-48 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line46> bq. > bq. > statics on top I really need to clean up where everything goes, noted. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70> bq. > bq. > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema? I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 77 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line77> bq. > bq. > isAppendNull() bq. > Should the append...() methods be public? I think a lot of these methods should be protected, agreed bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 81 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line81> bq. > bq. > the result would be the same, right? yeah, you're right bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 94 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line94> bq. > bq. > setAppend() yeah need to rename all of these. naming is what I am worst at :S bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105> bq. > bq. > if it is always overridden, it should be abstract It's overriden, but the overriden version calls super. Right now it returns 0 because I don't want to take the time to calculate it out while the class is still evolving. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 119 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line119> bq. > bq. > set? stream? This originally meant "Schema Tuple Element Write" but yeah, it should be more descriptive. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125> bq. > bq. > redundant with the 2 others? bq. > idClassMap.get(schemaIdMap.get(schema)) Perhaps this is what a SchemaTupleFactory would make cleaner, but my desire is for Tuple creation to be as fast as possible. This is why there are 5 maps... my thought is that they are all static, so the extra cost of caching the extra info is worth avoiding two gets in a row to link the info together when we need to make Tuples. Am open to thoughts. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130> bq. > bq. > could be computed statically by the generated class if you make a static getSchema(schemaString). Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization? bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144> bq. > bq. > setAll() ? It's not really a setAll imho. It's setting it equal to. Maybe that's what setAll means, but it seems to convey something different. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159> bq. > bq. > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ? My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191> bq. > bq. > you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id bq. > bq. > you could also use true. Although that would not be THE answer to THE question. I think the jar properties is the way to go, at least as the ultimate source of the classes that were generated. Alternately, it'd be nice if there was an easy way to browse a jar for certain class files, but I couldn't find a clean one? If there was, we could just search for SchemaTuple_*.class and keep track of which files went across. Do you know if there is an easier way to search jars like that? I can dig around as well. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209> bq. > bq. > As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class. Hmm, it seems pretty silly that two Schemas that are essentially "int,int,long" except with different names would necessitate a different class. More importantly, however, I ran into cases where schema names were different due to the way that names were generated by different pieces of code, and so the lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you think the win from get(String) is worth the potential complication? It could be convenient... bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270> bq. > bq. > Logger woops, I usually mark these //remove and whack them before the patch goes out... bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 268 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line268> bq. > bq. > there's only one code path where updateMaps is true. It could be simplified a bit. agreed bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310> bq. > bq. > refactor DataByteArray to provide this? So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static). bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325> bq. > bq. > Same comment, the String serialization should be shared with the regular Tuple. bq. > bq. > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result) This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs). bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 344-353 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line344> bq. > bq. > same comment as before. bq. > The same logic should be use with regular tuples. bq. > The same logic should be shared across types. bq. > bq. > It could also use a variable size int format. bq. > https://developers.google.com/protocol-buffers/docs/encoding#types varints is a good idea. general question of where this logic should live still holds as well, as this logic is also directly ripped from BinInterSedes. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452> bq. > bq. > is (null, null, null) the same as null ? In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474> bq. > bq. > this.set(t) ? reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 479 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line479> bq. > bq. > DefaultTuple replaces null with "" Good idea bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484> bq. > bq. > We could have a list implementation that just wraps a tuple. Hmm, I think this is an intriguing idea, though it might get sills, because that means you'd have a list which wraps a tuple which wraps a list. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495> bq. > bq. > yep, sounds like it could be generated and fall back to this if the other is a regular tuple My working copy almost has this finished. there are three comparison functions: compareTo(Object), compareTo(SchemaTuple), and compareToSpecific(SchemaTuple_). It will check (unless you tell it not to) if you were actually given a SchemaTuple_X (or just a schemaTuple) and do progressively faster comparisons. I also need to generate a RawComparator but I'd rather clean everything else up first (especially the layout of the API and where pieces live). bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535> bq. > bq. > you may want to log compilation warnings/errors Is there a need to log it if the runtime exception is going to be thrown anyway? Because those errors get written to stderr. But I suppose I could pipe them through the Logger, if that's what you mean? bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 545 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line545> bq. > bq. > Class.forName(className) uses the classLoader of the current class. nice! bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 585 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line585> bq. > bq. > if it does not need to be public then yes it depends heavily on how we end up forcing people to make these things. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609> bq. > bq. > the parameter t is thrown away ? bq. > bq. > they don't need to be public. This is more for conciseness in the code that generates the SchemaTuple_ code. It allows me to make a bunch of if then elses smaller. I suppose there are other ways to do it that don't involve a method call... I go back and forth on that. But the reason is so later on I can do: add(" setPos_X(SchemaTuple.unbox(val, pos_X));"); which means that the right one will be based on the type of the field being filled (ie saving me a bunch of lines, but perhaps that should be trumped by cleanliness of the output code/API). bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 637-659 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line637> bq. > bq. > depending on how those are used, java already does auto un-boxing. bq. > The following works so you may not need them: bq. > long unbox(Long v) { bq. > return v; bq. > } bq. > bq. > they don't need to be public Good point. I hate auto-boxing/un-boxing as a language feature, but if it can be utilized to clean things up it is worth it. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 667-685 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line667> bq. > bq. > same comment, the following works: bq. > Float box(float v) { bq. > return v; bq. > } bq. > bq. > they don't need to be public Yeah same as above as well. I think I can actually also leverage auto boxing/unboxing deeper in the code to remove some of the need for these functions. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698> bq. > bq. > this file has 1500 lines :) bq. > I think those classes have earned their own package. Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified? bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857> bq. > bq. > this could be a method in the parent class do you mean in SchemaTuple, or in TypeInFunctionStringOut? bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138> bq. > bq. > CheckIfNullString.masks bq. > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...) I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283> bq. > bq. > This feels weird. bq. > what about: bq. > "SchemaTuple.unbox(("+getBoxedTypeName()+")appendGet(diff))" bq. > bq. > getBoxedTypeName() { bq. > switch (type) { bq. > case (DataType.INTEGER): return "Integer"; bq. > ... I actually was able to replace it with: add(" case ("+fieldNum+"): return "+fs.type+";"); sometimes I miss the forest for the trees. bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413> bq. > bq. > use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations. yeahh this is a good idea. I doubt the code gen takes up much time at all but given the gen'd files are potentially large, it's probably good practice bq. On 2012-04-06 03:30:08, Julien Le Dem wrote: bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524 bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522> bq. > bq. > what does isObjNotTup mean? What is this case? it was an obscure case that was obsolete. I removed it - Jonathan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6722 ----------------------------------------------------------- On 2012-04-05 01:31:00, Jonathan Coveney wrote: bq. bq. ----------------------------------------------------------- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/4651/ bq. ----------------------------------------------------------- bq. bq. (Updated 2012-04-05 01:31:00) bq. bq. bq. Review request for Julien Le Dem. bq. bq. bq. Summary bq. ------- bq. bq. This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. bq. bq. Need to clean up the code and add tests. bq. bq. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. bq. bq. Needs tests and comments, but I want the code to settle a bit. bq. bq. bq. This addresses bug PIG-2632. bq. https://issues.apache.org/jira/browse/PIG-2632 bq. bq. bq. Diffs bq. ----- bq. bq. trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 bq. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 bq. trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 bq. trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 bq. trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION bq. trunk/src/org/apache/pig/data/TupleFactory.java 1309628 bq. trunk/src/org/apache/pig/impl/PigContext.java 1309628 bq. trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 bq. bq. Diff: https://reviews.apache.org/r/4651/diff bq. bq. bq. Testing bq. ------- bq. bq. bq. Thanks, bq. bq. Jonathan bq. bq. > Create a SchemaTuple which generates efficient Tuples via code gen > ------------------------------------------------------------------ > > Key: PIG-2632 > URL: https://issues.apache.org/jira/browse/PIG-2632 > Project: Pig > Issue Type: Improvement > Reporter: Jonathan Coveney > Assignee: Jonathan Coveney > Fix For: 0.11 > > Attachments: PIG-2632-0.patch, PIG-2632-1.patch > > > This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing > the Schema on the frontend, we can code generate Tuples which can be used for > fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, > and it's ~15% smaller serialized (heavily heavily depends on the data, > though). Need to do get/set tests, but assuming that it's on par (or even > faster) than Tuple, the memory gain is huge. > Need to clean up the code and add tests. > Right now, it generates a SchemaTuple for every inputSchema and outputSchema > given to UDF's. The next step is to make a SchemaBag, where I think the > serialization savings will be really huge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira