[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

Julien Le Dem (Commented) (JIRA) Sat, 07 Apr 2012 13:08:42 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249324#comment-13249324
 ]


Julien Le Dem commented on PIG-2632:
------------------------------------

* Class loading:
The main issue I see is for long-running processes that execute many Pig 
queries. In that case, loading classes in the application ClassLoader would be a 
memory leak, and they could eventually fill up the perm space. We don't really 
need the classes on the FrontEnd for the Map/Reduce execution mode, so the 
class definition can just be added to the job jar. If we want to use the 
generated tuples on the Frontend (for example for local mode) they can be 
generated in a location outside of the classpath (like a temporary folder). 
Then we can use a URLClassLoader pointing to this location. Discarding the 
classloader after the execution would let the garbage collector free up this 
memory. We can even extend ClassLoader so that bytes don't even have to be 
written to disk.
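
A minimal sketch of the temporary-folder approach, assuming a JDK (not just a JRE) is available at runtime. The class name TempDirClassLoading, the method compileLoadAndDescribe, and the generated class GeneratedTuple0 are hypothetical and only illustrate the idea; none of them are Pig APIs.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirClassLoading {

    /**
     * Compiles one public class into a fresh temporary directory (outside the
     * classpath), loads it through a throwaway URLClassLoader, and returns
     * toString() of a new instance. All names are illustrative, not Pig APIs.
     * Deleting the temporary directory afterwards is omitted for brevity.
     */
    static String compileLoadAndDescribe(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler (running on a JRE?)");
        }
        try {
            Path dir = Files.createTempDirectory("generated-tuples");
            Path src = dir.resolve(className + ".java");
            Files.write(src, source.getBytes(StandardCharsets.UTF_8));

            // Null streams mean System.in/out/err; a non-zero result signals failure.
            int rc = compiler.run(null, null, null, "-d", dir.toString(), src.toString());
            if (rc != 0) {
                throw new IllegalStateException("compilation failed for " + className);
            }

            URL[] urls = { dir.toUri().toURL() };
            // try-with-resources closes the loader. Once nothing references the
            // loader, the class, or its instances, the class becomes collectable.
            try (URLClassLoader loader =
                     new URLClassLoader(urls, TempDirClassLoading.class.getClassLoader())) {
                Class<?> generated = loader.loadClass(className);
                return generated.getDeclaredConstructor().newInstance().toString();
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("could not load " + className, e);
        }
    }

    public static void main(String[] args) {
        String source = "public class GeneratedTuple0 {"
            + " @Override public String toString() { return \"generated\"; } }";
        System.out.println(compileLoadAndDescribe("GeneratedTuple0", source));
    }
}
```

The in-memory variant would replace the URLClassLoader with a ClassLoader subclass that calls defineClass on the compiled bytes; the lifetime argument is the same either way, since the classes live only as long as the loader that defined them.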

* ASM: 
I looked into ASM and the corresponding eclipse plugin. It is pretty cool. You 
can take the class that was generated and ask the plugin to generate the ASM 
code that would do the same (not just the bytecode). That should make it 
relatively easy to move from Java source generation to direct bytecode generation.

* Evolution of the generation and fail-safe:
This should be hidden behind a factory so that it can be changed easily. Also, 
if anything goes wrong with generation, it should fall back to the regular tuple. 
The data storage format being modified is the intermediary format in between 
Pig jobs or for spills, so we don't need to maintain backward compatibility. 
Correct?
If we stick with Java code gen for the first version, it should be easy to check 
if javax.tools.JavaCompiler is present at runtime and fall back to regular 
tuples.
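
A rough sketch of what such a factory with a fallback might look like. SimpleTuple, ArrayTuple, and TupleFactorySketch are hypothetical stand-ins rather than Pig's actual Tuple or TupleFactory classes, and the generator argument stands for whatever code-gen path ends up being used. The only real API relied on is javax.tools.ToolProvider.getSystemJavaCompiler(), which returns null when no compiler is provided.

```java
import javax.tools.ToolProvider;
import java.util.function.IntFunction;

public class TupleFactorySketch {

    /** Hypothetical stand-in for Pig's Tuple interface, reduced to three methods. */
    interface SimpleTuple {
        Object get(int index);
        void set(int index, Object value);
        int size();
    }

    /** The "regular" tuple used whenever generation is unavailable or fails. */
    static final class ArrayTuple implements SimpleTuple {
        private final Object[] fields;
        ArrayTuple(int size) { fields = new Object[size]; }
        @Override public Object get(int index) { return fields[index]; }
        @Override public void set(int index, Object value) { fields[index] = value; }
        @Override public int size() { return fields.length; }
    }

    /** True when a JDK compiler is present; false on a plain JRE. */
    static boolean compilerAvailable() {
        return ToolProvider.getSystemJavaCompiler() != null;
    }

    /**
     * Tries the generated-tuple path first and falls back to ArrayTuple if the
     * compiler is missing, the generator is absent, or generation fails.
     */
    static SimpleTuple newTuple(int size, IntFunction<SimpleTuple> generator) {
        if (generator != null && compilerAvailable()) {
            try {
                SimpleTuple generated = generator.apply(size);
                if (generated != null) {
                    return generated;
                }
            } catch (RuntimeException | LinkageError e) {
                // Generation or class loading failed; use the regular tuple below.
            }
        }
        return new ArrayTuple(size);
    }

    public static void main(String[] args) {
        // A generator that always fails, to exercise the fallback path.
        SimpleTuple t = newTuple(2, n -> { throw new IllegalStateException("codegen failed"); });
        t.set(0, "a");
        System.out.println(t.getClass().getSimpleName() + " of size " + t.size());
    }
}
```

Callers only ever see SimpleTuple, so the generation strategy (Java source today, ASM later) can change behind newTuple without touching them.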

                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
