[
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249324#comment-13249324
]
Julien Le Dem commented on PIG-2632:
------------------------------------
* Class loading:
The main issue I see is for long-running processes that execute many Pig
queries. In that case, loading classes in the application ClassLoader would be a
memory leak, and they could eventually fill up the perm gen space. We don't
really need the classes on the frontend for the Map/Reduce execution mode, so
the class definitions can just be added to the job jar. If we want to use the
generated tuples on the frontend (for example for local mode), they can be
generated in a location outside of the classpath (like a temporary folder).
Then we can use a URLClassLoader pointing to this location. Discarding the
classloader after the execution would let the garbage collector free up this
memory. We can even extend ClassLoader so that the bytes never have to be
written to disk.
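To make the last point concrete, here is a minimal sketch of a ClassLoader
that defines classes straight from byte arrays. The name InMemoryClassLoader
and its addClass method are made up for illustration; nothing like this is in
the attached patches.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical loader (not part of Pig) that defines generated classes
 * directly from byte arrays, so nothing has to be written to disk.
 */
final class InMemoryClassLoader extends ClassLoader {
    // Bytecode registered but not yet defined, keyed by binary class name.
    private final Map<String, byte[]> pending = new ConcurrentHashMap<>();

    InMemoryClassLoader(ClassLoader parent) {
        super(parent);
    }

    /** Registers the bytecode of a generated class under its binary name. */
    void addClass(String binaryName, byte[] bytecode) {
        pending.put(binaryName, bytecode.clone());
    }

    /** Called by loadClass() only after the parent failed to find the class. */
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = pending.remove(name);
        if (bytes == null) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytes, 0, bytes.length);
    }
}
```

The idea would be to create one loader per query (or per script), load the
generated tuple classes through it, and drop every reference to the loader,
its classes and their instances once execution is done. At that point the
classes become eligible for unloading by the garbage collector.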
* ASM:
I looked into ASM and the corresponding Eclipse plugin. It is pretty cool. You
can take the class that was generated and ask the plugin to generate the ASM
code that would produce the same class (not just show the bytecode). That
should make it relatively easy to move from Java source generation to direct
bytecode generation.
* evolution of the generation and fail safe:
This should be hidden behind a factory so that it can be changed easily. Also,
if anything goes wrong with generation, it should fall back to the regular Tuple.
The data storage format being modified is the intermediary format in between
Pig jobs or for spills, so we don't need to maintain backward compatibility.
Correct?
If we stick with Java code gen for the first version, it should be easy to check
if javax.tools.JavaCompiler is present at runtime and fall back to regular
tuples.
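A rough sketch of that check, combined with the fall-back behaviour described
above. GeneratedTupleSupport, compilerAvailable and createOrFallback are
hypothetical names, not existing Pig classes; the only real API used is
javax.tools.ToolProvider.getSystemJavaCompiler(), which returns null when no
compiler is available (for example on a JRE-only install). It is called
through reflection here so that the check itself still works if javax.tools
is missing.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Pig) for the "fail safe" behaviour. */
final class GeneratedTupleSupport {
    private GeneratedTupleSupport() {}

    /** True only if javax.tools is present and reports a system compiler. */
    static boolean compilerAvailable() {
        try {
            Class<?> provider = Class.forName("javax.tools.ToolProvider");
            Object compiler = provider.getMethod("getSystemJavaCompiler").invoke(null);
            return compiler != null;
        } catch (ReflectiveOperationException | LinkageError e) {
            // javax.tools is absent or unusable: treat as "no compiler".
            return false;
        }
    }

    /**
     * Tries the code-generated path first and falls back to the regular one
     * if no compiler is available or generation fails for any reason.
     */
    static <T> T createOrFallback(Supplier<T> generated, Supplier<T> regular) {
        if (!compilerAvailable()) {
            return regular.get();
        }
        try {
            T result = generated.get();
            return result != null ? result : regular.get();
        } catch (RuntimeException | LinkageError e) {
            return regular.get();
        }
    }
}
```

A factory could route every tuple creation through something like
createOrFallback, so that callers never need to know which implementation
they received.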
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
> Key: PIG-2632
> URL: https://issues.apache.org/jira/browse/PIG-2632
> Project: Pig
> Issue Type: Improvement
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing
> the Schema on the frontend, we can code generate Tuples which can be used for
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better,
> and it's ~15% smaller serialized (heavily heavily depends on the data,
> though). Need to do get/set tests, but assuming that it's on par (or even
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema
> given to UDF's. The next step is to make a SchemaBag, where I think the
> serialization savings will be really huge.