[ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248657#comment-13248657
 ] 

Scott Carey commented on PIG-2632:
----------------------------------

{quote}varints is a good idea. general question of where this logic should live 
still holds as well, as this logic is also directly ripped from 
BinInterSedes.{quote}

If you go that far, it may be better to simply replace BinInterSedes with Avro, 
which provides roughly half of what this patch implements for free, though it 
lacks other things you would need.  If the schema is known, it can be 
mapped to an Avro schema, and then all the serialization and deserialization 
details come for free.  Code gen could be done with a custom Velocity template 
(an Avro extensibility feature) instead of the manual string manipulation here. 
You'd need a custom Avro DatumWriter/DatumReader that handles these 
schema-aware tuples and also honors the Pig Tuple contracts. 

This also seems to require a JDK and not just a JRE, because you are 
using javax.tools.JavaCompiler.  There are alternative approaches to creating 
classes at runtime without generating Java source strings first, with tools 
like ASM, CGLIB, or Java's dynamic proxies, but those have more of a 
learning curve.  See also Jackson's MrBean feature 
http://jackson.codehaus.org/1.9.3/javadoc/org/codehaus/jackson/mrbean/BeanBuilder.html
and its source for some interesting examples.
 
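As a rough illustration of both points, here is a minimal sketch using only the JDK. ToolProvider.getSystemJavaCompiler() returns null when no compiler is present, which is what happens on a plain JRE. java.lang.reflect.Proxy builds an implementation at runtime with no source strings, though it can only implement interfaces, not extend classes. The names RuntimeClassSketch, NameAge, and viewOf are hypothetical and are not part of Pig or the patch.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Map;
import javax.tools.ToolProvider;

public class RuntimeClassSketch {

    /** Hypothetical typed view of a two-field tuple; not a Pig interface. */
    public interface NameAge {
        String getName();
        int getAge();
    }

    /** True when an in-process Java compiler is available; false on a plain JRE. */
    public static boolean compilerAvailable() {
        return ToolProvider.getSystemJavaCompiler() != null;
    }

    /**
     * Builds a NameAge at runtime whose getters read from the given map.
     * No Java source is generated or compiled.
     */
    public static NameAge viewOf(Map<String, Object> fields, ClassLoader loader) {
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            if (name.length() > 3 && name.startsWith("get")
                    && method.getParameterCount() == 0) {
                // Map "getName" to the key "name".
                String key = Character.toLowerCase(name.charAt(3)) + name.substring(4);
                return fields.get(key);
            }
            // equals/hashCode/toString are deliberately left out of this sketch.
            throw new UnsupportedOperationException(name);
        };
        return (NameAge) Proxy.newProxyInstance(
                loader, new Class<?>[] {NameAge.class}, handler);
    }
}
```

A caller could check compilerAvailable() before attempting code gen and fall back to a generic Tuple when it returns false. The proxy route pays for reflection on every call, so it is unlikely to match generated classes on speed.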

I am very interested in this but do not have time to help out in the near term.


{quote}
Class.forName(className) uses the classLoader of the current class
{quote}

Be careful, this approach is dangerous and not OSGi friendly.  You may want to 
consider using the thread context class loader, or passing in a 
ClassLoader as a parameter, instead of statically binding the ClassLoader 
through a reference to a specific class.  Even Main.class _could_ have two 
copies in two different class loaders.  If you know for sure of a single class 
that must be in the same ClassLoader as the classes you are instantiating, it 
may make sense to get that class's loader and cache it for all your uses.

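A minimal sketch of the alternatives, assuming a hypothetical helper class named ClassLoading (not in the patch). It prefers a loader passed in by the caller, then the thread context class loader, and only then falls back to the loader of the helper class itself. It uses the three-argument Class.forName so that the loader is explicit.

```java
public class ClassLoading {

    /**
     * Loads className with an explicitly chosen ClassLoader.
     * Order of preference: the caller-supplied loader, the thread context
     * class loader, then the loader that loaded this helper class.
     */
    public static Class<?> load(String className, ClassLoader explicit)
            throws ClassNotFoundException {
        ClassLoader loader = explicit;
        if (loader == null) {
            loader = Thread.currentThread().getContextClassLoader();
        }
        if (loader == null) {
            loader = ClassLoading.class.getClassLoader();
        }
        // true = run static initializers, matching Class.forName(String).
        return Class.forName(className, true, loader);
    }
}
```

If one class is known to share a ClassLoader with the generated classes, its loader can be fetched once, cached, and passed as the explicit argument on every call.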
                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

        
