[ 
https://issues.apache.org/jira/browse/PIG-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4697:
------------------------------------
    Description: 
  What HCatLoader/HCatStorer puts in UDFContext is huge, and if a pig script 
uses several of them, both the data sent to the Tez AM and the data the Tez AM 
sends to tasks become huge, causing RPC limit exceeded and OOM issues 
respectively.  If Pig serializes only the part of the udfcontext that each 
vertex requires, it will save a lot.  HCat folks are also looking at cleaning 
up what goes into the conf (it ends up serializing the whole job conf, not just 
hive-site.xml) and moving out the common part to be shared by all hcat loaders 
and stores. 

Also looking at other options for faster and more compact serialization. Will 
create separate jiras for that. Will use PIG-4653 to clean up all pig config 
other than udfcontext.

  was:
  What HCatLoader/HCatStorer put in UDFContext is huge and if there are 
multiple of them in the pig script, the size of data sent to Tez AM is huge and 
the size of data that Tez AM to tasks is huge and causing either RPC limit 
exceeded or OOM issues.  If Pig serializes only part of the udfcontext that is 
required for each vertex, it will save a lot.  HCat folks are also looking up 
at cleaning what goes into the conf (it ends up serializing whole job conf, not 
just hive-site.xml) and moving out the common part to be shared by all hcat 
loaders and stores. 

Also looking at other options for faster and compact serialization. Will create 
separate jiras for that. Will use PIG-4653 to cleanup all other pig config 
other than udfcontext.

        Summary: Serialize relevant part of the udfcontext per vertex to reduce 
payload size  (was: Pig needs to serialize only part of the udfcontext for each 
vertex)

> Serialize relevant part of the udfcontext per vertex to reduce payload size
> ---------------------------------------------------------------------------
>
>                 Key: PIG-4697
>                 URL: https://issues.apache.org/jira/browse/PIG-4697
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4697-1.patch
>
>
>   What HCatLoader/HCatStorer puts in UDFContext is huge, and if a pig script 
> uses several of them, both the data sent to the Tez AM and the data the Tez 
> AM sends to tasks become huge, causing RPC limit exceeded and OOM issues 
> respectively.  If Pig serializes only the part of the udfcontext that each 
> vertex requires, it will save a lot.  HCat folks are also looking at cleaning 
> up what goes into the conf (it ends up serializing the whole job conf, not 
> just hive-site.xml) and moving out the common part to be shared by all hcat 
> loaders and stores. 
> Also looking at other options for faster and more compact serialization. Will 
> create separate jiras for that. Will use PIG-4653 to clean up all pig config 
> other than udfcontext.
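
A rough illustration of the per-vertex idea described above, not the code in 
PIG-4697-1.patch: the sketch assumes the udfcontext can be modelled as a map 
from a UDF signature string to the Properties stored for that UDF. The class 
UdfContextSlicer, the method sliceForVertex and the signature strings are all 
hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper, NOT part of Pig. It models the udfcontext as a plain
// map from a UDF signature to the Properties that the UDF stored.
class UdfContextSlicer {

    // Returns a new map holding only the entries whose signature is used
    // by the given vertex. The full context is left unmodified.
    static Map<String, Properties> sliceForVertex(
            Map<String, Properties> fullContext,
            Set<String> signaturesInVertex) {
        Map<String, Properties> slice = new LinkedHashMap<>();
        for (Map.Entry<String, Properties> entry : fullContext.entrySet()) {
            if (signaturesInVertex.contains(entry.getKey())) {
                slice.put(entry.getKey(), entry.getValue());
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        // Assumed example data: two loaders and one storer in one script.
        Map<String, Properties> full = new LinkedHashMap<>();
        for (String sig : Arrays.asList("loaderA", "loaderB", "storerC")) {
            Properties props = new Properties();
            props.setProperty("owner", sig);
            full.put(sig, props);
        }

        // A vertex that only runs loaderA needs only loaderA's entry.
        Set<String> used = new HashSet<>(Arrays.asList("loaderA"));
        Map<String, Properties> slice = sliceForVertex(full, used);

        System.out.println("full context keys:  " + full.keySet());
        System.out.println("vertex slice keys:  " + slice.keySet());
    }
}
```

With a slice like this, only the selected entries would need to be serialized 
into a vertex's payload, so the payload would grow with the number of 
loaders/storers in that vertex rather than in the whole script.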



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
