[
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248654#comment-13248654
]
[email protected] commented on PIG-2632:
----------------------------------------------------
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. >
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
line 188
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>
bq. >
bq. > I would make it:
bq. > TupleFactory.geInstanceForSchema().newTuple()
bq.
bq. Jonathan Coveney wrote:
bq. I think it's worth thinking about how we want to do this. Dmitriy
implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I
use that for this... however I agree, I think there should be a
"SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the
same time, that might be a lot of scaffolding for not a lot of gain: once you
generate the code for a given SchemaTuple, there's not a lot of work to be done
(though this could encapsulate some of the static maps that come later). One
other factor is that the TupleFactory interface doesn't really lend itself very
nicely to the SchemaTuple, but I do think it could be extended and be made to
work. Would appreciate more of your thoughts on this.
bq.
bq. Julien Le Dem wrote:
bq. I meant:
bq. TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()
bq.
bq. The change is relatively small, it is just moving this from one class
to the other and getting rid of the Maps. PrimitiveTuple should follow the same
logic.
bq.
bq. Asking for the schema in the factory initialization will simplify the
code. As you say, you won't need to constantly look up the schema. Also pig is
about writing a lot of tuples of the same type.
bq.
bq. The block of code (see 2 comments above) that triggers the code
generation would just get the factories instead of actually generating Tuples.
This is caused by the api asking for the schema in the wrong place.
bq.
bq. TupleFactory is an abstract class so it should be doable to add
methods without breaking compatibility.
bq.
bq. If needed, backard compatibility can be maintained by having
newTupleForSchema(inputSchemaForGen) call
TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()
bq.
I'd like your thoughts fleshing out the TupleFactory interface for a
SchemaTupleFactory (which imho at this point makes the most sense --
TupleFactory.getInstanceForSchema(inputSchema); returns a SchemaTupleFactory
which extends TupleFactory).
newTuple();
easy
newTuple(int size);
meaningless, throw an error?
newTuple(List c);
uses set(List object), throws error otherwise?
newTuple(Object datum);
meaningless?
newTupleNoCopy(List list);
same as newTuple(list);
If that sounds reasonable, I like this approach
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>
bq. >
bq. > largestSetValue should already be correct. Why does largestSetValue
stays at the size of the schema?
bq.
bq. Jonathan Coveney wrote:
bq. I guess once it is at the value of the SchemaTuple (excluding
appends), there's no need to update it.
bq.
bq. Julien Le Dem wrote:
bq. Correct but know you have to check this every time you update. I was
thinking it would simplify the code a little if it was just the actual size of
the tuple (including past the original size).
Well, I guess that depends how we define the size of the tuple. To me, the size
= number of fields (independently of whether they are set) + number of append
fields. The reason we need this field, then, is to keep track of whether or not
we've set one of the generated fields. We could redefine the size to be the
largest non-null field, but that's going to make it slower for minimal gain.
Thoughts?
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>
bq. >
bq. > could be computed statically by the generated class if you make a
static getSchema(schemaString).
bq.
bq. Jonathan Coveney wrote:
bq. Well, my thought is that a static cache wouldn't take up much memory,
and it would avoid having to recreate the Schema object every time getSchema()
is called. I don't forsee it being called a ton, but possibly enough that
recreating the object a bunch would be wasteful. Perhaps this is premature
optimization?
bq.
bq. Julien Le Dem wrote:
bq. caching is fine.
bq. I meant that you could have the Schema object "cached" in a static
field of the generated class
bq. instead of a runtime generation + cache
bq.
bq. in the generated class:
bq.
bq. static schemaString = "....";
bq. static Schema schema = Utils.getSchemaFromString(schemaString);
bq.
bq. public Schema getSchema() {
bq. return schema;
bq. }
bq.
bq.
Ah, very good call
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>
bq. >
bq. > do we need all of set(Tuple), set(List), setEqualTo(List),
setEqualTo(Tuple) ?
bq.
bq. Jonathan Coveney wrote:
bq. My working version adds even more. I think a lot of it should be made
protected, or at least, I should be more thoughtful about what it should be. I
think "set" is probably enough, but perhaps it should just be the void version?
I guess this is where working with ruby where everything generally returns
"this" by default, thus allowing for nice chaining of methods, is the norm.
bq.
bq. Julien Le Dem wrote:
bq. I'm fine with the chaining mechanism.
bq. We should try to avoid having both as it makes the code harder to
read. You cant set and ignore the return value;
good call
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>
bq. >
bq. > refactor DataByteArray to provide this?
bq.
bq. Jonathan Coveney wrote:
bq. So here is the thing about this. Currently, a ton of the useful
methods in BinInterSedes (where this logic was taken from) is private. Also,
they aren't static (though there's no real reason why they shouldn't be?) This
snippet (and other logic like it) should probably be made into a protected
static method of BinInterSedes. Thoughts? It could also make sense for classes
to have a static method to serialize themselves, but I'd argue that the
BinInterSedes approach is probably more consistent with how pig is laid out
(though I have no idea why most of the methods there are private instead of
protected static).
bq.
bq. Julien Le Dem wrote:
bq. private stuff can be safely moved around and refactored :) (from a
compatibility point of view)
bq. Let's think of a way to have the same code used in all cases. There
used to be one type of tuple so what made sense before may need to changee.
bq.
bq. - parent class for both ?
bq. - TupleSerializer class ?
bq. - static helpers?
per your suggestion and dmitriy's go ahead on the list serv, I'm moving a bit
of the shared logic to a SedesHelper class which both BinInterSedes and
SchemaTuple will use
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>
bq. >
bq. > Same comment, the String serialization should be shared with the
regular Tuple.
bq. >
bq. > Also we could keep the a UTF8 in the SchemaTuple to reduce memory
usage and convert lazilyto String (possibly caching the result)
bq.
bq. Jonathan Coveney wrote:
bq. This logic is stolen directly from BinInterSedes (see comment above).
Can you flesh out what you mean about UTF8? If we change it here we should
probably change it there as well (and make that the canonical protected static
source of all your string serialization needs).
bq.
bq. Julien Le Dem wrote:
bq. Following your goal of reducing memory utilization and as the
serialized format is UTF8, we could keep UTF8 as the memory representation
instead of String.
bq. Java Strings are UTF16 which is minimum 2 bytes per characters. If the
data is mostly ascii, this is wasteful.
bq. To maintain backward compatibility the get(int) would lazily convert
to String on demand, possibly caching the result in a second field to save
processing, but this may be overkill.
What is the latency on utf8 -> utf16 conversion? That's all I'd be worried
about but your suggestion is a good one.
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>
bq. >
bq. > is (null, null, null) the same as null ?
bq.
bq. Jonathan Coveney wrote:
bq. In DefaultTuple, this is deprecated and not really used. Given that, I
use this to basically mean "has any information been written to this." So in
your case, both would be null. The more important question is to make sure I
use it consistently throughout. Honestly, the whole null thing is a pain, and I
need to really comb over things to make sure I incorporated it properly.
bq.
bq. Julien Le Dem wrote:
bq. agreed, we should just make sure this is the same behavior as the
DefaultTuple
DefaultTuple doesn't implement isNull(). It's just a dummy null implementation.
But the difference is that I have fields that can be "null" but aren't actually
null (ie primitives), so it's a bit more useful.
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>
bq. >
bq. > this.set(t) ?
bq.
bq. Jonathan Coveney wrote:
bq. reference is this weird Franken-method that isn't used anywhere in the
codebase. I don't know that I want to implement it when the semantics of what
it should do don't seem clear at all. Open to thoughts on this though. The
original intent of it doesn't make much sense for a SchemaTuple since the
underlying structures are primitives, not objects.
bq.
bq. Julien Le Dem wrote:
bq. "the underlying structures are primitives, not objects"
bq. for now, as Bags and Maps will come.
bq. For backward compatibility, I think we should implement this as a
set(). It seems this is intended as a cheap set, not sure if "modifying one
modifies the others" behavior is expected. some UDFs could depend on this.
bq.
bq. Also let's mark reference() as deprecated right now so that we can
remove it later.
I have no problem making this a set. The nice part about doing that is it
means you can essentially get the functionality of a set without having to cast
it. The downside being the weird semantics around it. Probably a net win. Agree
on deprecating reference
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698
bq. > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>
bq. >
bq. > this file has 1500 lines :)
bq. > I think those classes have earned their own package.
bq.
bq. Jonathan Coveney wrote:
bq. Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which
is why I don't like the idea of splitting it out. Yeah, it makes the file
larger, but it makes 0 sense outside of the context of what is going on here.
Perhaps that should be rectified?
bq.
bq. Julien Le Dem wrote:
bq. I don't feel strongly about it.
bq. Maybe you want to separate the base class SchemaTuple (runtime) from
the Generator ?
Agreed, very good suggestion
bq. On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138
bq. >
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>
bq. >
bq. > CheckIfNullString.masks
bq. > maybe they should be in a Masks class then, along with the masking
code generation (" & (byte)+...)
bq.
bq. Jonathan Coveney wrote:
bq. I don't know how worthwhile it is to split all of this stuff out into
classes. Maybe it is. A lot of stuff is used once or twice in disparate places,
and I fear that splitting it out more and more will just make people jump
through 50 classes to understand what a line of code gen is doing. On the other
hand, it would make fixing bugs much cleaner. What probably needs to be done is
some sort of cleaner code generation framework... would appreciate more
thoughts here.
bq.
bq. Julien Le Dem wrote:
bq. My goal is to have the bit management thing implemented once so that
it is easier to change/improve/bugfix.
bq. agreed that we should keep things simple.
bq. If we have a SchemaTupleClassGenerator that contains bit management
helpers that could simplify things.
bq. Separating the runtime from the generation would reduce the amount of
things you have to look at at the same time.
+1
- Jonathan
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------
On 2012-04-06 17:20:20, Jonathan Coveney wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/4651/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2012-04-06 17:20:20)
bq.
bq.
bq. Review request for pig and Julien Le Dem.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. This work builds on Dmitriy's PrimitiveTuple work. The idea is that,
knowing the Schema on the frontend, we can code generate Tuples which can be
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x
better, and it's ~15% smaller serialized (heavily heavily depends on the data,
though). Need to do get/set tests, but assuming that it's on par (or even
faster) than Tuple, the memory gain is huge.
bq.
bq. Need to clean up the code and add tests.
bq.
bq. Right now, it generates a SchemaTuple for every inputSchema and
outputSchema given to UDF's. The next step is to make a SchemaBag, where I
think the serialization savings will be really huge.
bq.
bq. Needs tests and comments, but I want the code to settle a bit.
bq.
bq.
bq. This addresses bug PIG-2632.
bq. https://issues.apache.org/jira/browse/PIG-2632
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
1309628
bq. trunk/src/org/apache/pig/data/BinInterSedes.java 1309628
bq. trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628
bq. trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/TupleFactory.java 1309628
bq. trunk/src/org/apache/pig/impl/PigContext.java 1309628
bq.
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java
1309628
bq.
bq. Diff: https://reviews.apache.org/r/4651/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq.
bq. Thanks,
bq.
bq. Jonathan
bq.
bq.
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
> Key: PIG-2632
> URL: https://issues.apache.org/jira/browse/PIG-2632
> Project: Pig
> Issue Type: Improvement
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing
> the Schema on the frontend, we can code generate Tuples which can be used for
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better,
> and it's ~15% smaller serialized (heavily heavily depends on the data,
> though). Need to do get/set tests, but assuming that it's on par (or even
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema
> given to UDF's. The next step is to make a SchemaBag, where I think the
> serialization savings will be really huge.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira