[
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249921#comment-13249921
]
[email protected] commented on PIG-2632:
----------------------------------------------------
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. >
Thanks again for taking a look Julien! I think we are close to having the
structure where we want it, there are just a couple of outstanding higher level
points:
- How should PigContext be referenced?
- How should the intermediate compiled data be handle?
- What should the semantics of getting a SchemaTupleFactory be?
- Generating the SchemaTupleFactory so it can directly instantiate the given
SchemaTuple
- What should and shouldn't be in the generated code
And probably some more. I responded below, and hopefully we can keep honing in
on something that looks good :)
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. >
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
lines 135-136
bq. >
<https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135>
bq. >
bq. > instead of checking here if the tuple is generatable, the factory
could default to the regular TupleFactory if the generation fails.
I don't know that I like that semantic though. In this case, in the client code
it's clear that they know that they either are or are not getting a
SchemaTupleFactory... if we default to a regular TupleFactory, then we might
think we are getting a SchemaTuple (and all the speed and memory benefits) when
we actually aren't. There are two cases where we would ask for a
SchemaTupleFactory and it wouldn't be available: a Pig bug in our
implementation, or user error. Shouldn't pig fail out intelligently instead of
giving them a Factory that doesn't do what they think?
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 31
bq. >
<https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line31>
bq. >
bq. > How complex is this class ? Not sure it is worth pulling the whole
mahout just for this.
It's very not complicated. I'll put the logic into SedesHelper.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84
bq. >
<https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84>
bq. >
bq. > append...() methods should not be null
bq. > (part of your TODO list?)
Not sure what you mean here
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 550
bq. >
<https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line550>
bq. >
bq. > handle the case where other == null
bq. > Here you get NullPointerException
good call, will fix everywhere
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTuple.java, line 662
bq. >
<https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line662>
bq. >
bq. > If you override equals(), you should override hashCode(). 2 object
that are equal must return the same hashcode
whoops, I thought I had. will do
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64>
bq. >
bq. > try avoiding accessing the PigContext statically. Add it as a
parameter and see if it can be passed from somewhere it is actually known.
In the pig code that I looked at, this seemed by far the more common idiom (not
that I love it, but it's pretty hard to get the PigContext in there). I can't
really find a single example of code in this sort of path that doesn't get the
PigContext statically. I could rewire Pig to try and make this possible, but do
you think that it is worth it/do you see any examples in the code base that I
missed where PigContext isn't called statically in this sort of case? There are
some examples of more purely front end things, and that's how I'd pipe it in I
suppose, but it seems like a big change for something that barely touches the
codebase. I do think a JIRA that tackles this issue in pig is reasonable
though... trying to decrease the amount of statically referenced jank. Await
your thoughts.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71>
bq. >
bq. > possibly we want to rename this and/or add something else for this
file.
Question about class files: does a class file have to have the same name as the
class it contains? I know this is true for .java files with the java compiler,
but not sure if it is true of class files. If it is true of class files, then
we can't muck with the name. If it isn't, then ok.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 86-87
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line86>
bq. >
bq. > ?
bq. > either get rid of the empty catch block or document why it is Ok.
bq. >
bq. > Here I don't see why it would be Ok that we can not instantiate that
class
Hmm, the idea was that if you couldn't instantiate you'd regenerate, but
looking at it now you're right; it is odd that if we think we have the class,
that we cannot instantiate it. I'll throw a runtime error.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127>
bq. >
bq. > "org.apache.pig.generated." + classname ?
Java question: if we're managing the compilation, does the compiled class have
to be in org/apache/pig/generated (or whatever) in the file directory?
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138>
bq. >
bq. > "org/apache/pig/generated/" + classname + ".class" ?
What's the win here? Or just general cleanliness?
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140>
bq. >
bq. > we should probably not compile in the current dir.
bq. > Put it in the temp dir intead
I will see if this is possible. Do you know if Pig has an official local temp
directory? I found an hdfs temp directory flag, but couldn't find a local one.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 322
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line322>
bq. >
bq. > this is wasteful. Maybe DataByteArray could provide a static
compare(byte[], byte[]) ?
Lol, I checked and it exists. Will use.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines
265-269
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265>
bq. >
bq. > stuff which is not calling a generated field directly should be
pulled up.
I don't know that I agree with this. Then we're going to have a bunch of really
random methods that have no context getting called for no real reason (I guess
so that the compiler will run it's checks instead of getting an error at
compile time)? Some of the logic makes more sense to do this with (the
comparison of appends above, for example), but do you think the win for highly
specific pieces of code like this is worth it?
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines
392-396
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392>
bq. >
bq. > the exception handling can be pulled up
Same as above. Should it really be our goal to move everything humanly possible
into SchemaTuple?
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518>
bq. >
bq. > same with a mapping for the java object for the type
Not sure what you want the code to look like here. If you want to explicitly do
all the casts, it will make the generated code bigger, and would seem to
contradict your suggestion to call as much from SchemaTuple as possible?
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 559
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line559>
bq. >
bq. > you can just remove the call to box, it should work
Good call
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 42
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line42>
bq. >
bq. > do we need this. Let's just fall back to regular tuple when
generation fails, so that we don't need to keep those in sync
See above.
bq. On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq. > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 59
bq. > <https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line59>
bq. >
bq. > The factory could be generated as well so that we don't need to use
reflection here.
bq. > Reflection is slower to create new instances.
I didn't realize that doing Class.newInstance was slower than direct creation.
I'll generate it.
- Jonathan
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6784
-----------------------------------------------------------
On 2012-04-08 22:26:29, Jonathan Coveney wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/4651/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2012-04-08 22:26:29)
bq.
bq.
bq. Review request for pig and Julien Le Dem.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. This work builds on Dmitriy's PrimitiveTuple work. The idea is that,
knowing the Schema on the frontend, we can code generate Tuples which can be
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x
better, and it's ~15% smaller serialized (heavily heavily depends on the data,
though). Need to do get/set tests, but assuming that it's on par (or even
faster) than Tuple, the memory gain is huge.
bq.
bq. Need to clean up the code and add tests.
bq.
bq. Right now, it generates a SchemaTuple for every inputSchema and
outputSchema given to UDF's. The next step is to make a SchemaBag, where I
think the serialization savings will be really huge.
bq.
bq. Needs tests and comments, but I want the code to settle a bit.
bq.
bq.
bq. This addresses bug PIG-2632.
bq. https://issues.apache.org/jira/browse/PIG-2632
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/bin/pig 1310666
bq. trunk/build.xml 1310666
bq. trunk/ivy.xml 1310666
bq. trunk/ivy/libraries.properties 1310666
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
1310666
bq. trunk/src/org/apache/pig/data/BinInterSedes.java 1310666
bq. trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666
bq. trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
PRE-CREATION
bq. trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/Tuple.java 1310666
bq. trunk/src/org/apache/pig/data/TupleFactory.java 1310666
bq. trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666
bq. trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION
bq. trunk/src/org/apache/pig/impl/PigContext.java 1310666
bq.
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java
1310666
bq.
bq. Diff: https://reviews.apache.org/r/4651/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq.
bq. Thanks,
bq.
bq. Jonathan
bq.
bq.
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
> Key: PIG-2632
> URL: https://issues.apache.org/jira/browse/PIG-2632
> Project: Pig
> Issue Type: Improvement
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing
> the Schema on the frontend, we can code generate Tuples which can be used for
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better,
> and it's ~15% smaller serialized (heavily heavily depends on the data,
> though). Need to do get/set tests, but assuming that it's on par (or even
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema
> given to UDF's. The next step is to make a SchemaBag, where I think the
> serialization savings will be really huge.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira