[jira] [Updated] (PIG-5365) Add support for PARALLEL clause in LOAD statement

Satish Subhashrao Saley (JIRA) Fri, 12 Oct 2018 12:24:07 -0700


     [ 
https://issues.apache.org/jira/browse/PIG-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Satish Subhashrao Saley updated PIG-5365:
-----------------------------------------
    Description: 
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
512 MB or 1 GB when they are reading TBs of data, just to avoid launching too 
many map tasks (50-100K) for loading data. That many tasks adds unnecessary 
container-launch overhead and wastes a lot of resources. 

It would be good to have a new setting that configures the maximum number of 
tasks; it would override pig.maxCombinedSplitSize and combine more splits into 
one task. For example, with pig.max.input.splits=30000 and a data size of 2 TB, 
it will combine more than 128 MB (the default pig.maxCombinedSplitSize) per 
task to have a maximum of 30K tasks. The setting would go as a default into 
pig-default.properties and apply to all users.
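
The arithmetic behind such a cap can be sketched as follows. This is only an illustration of the proposal, not Pig's actual split-combination code: pig.max.input.splits is the setting proposed in this issue, the function names are hypothetical, and the sketch assumes the per-task size is simply raised until the task count fits under the cap (ignoring block boundaries and locality).

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def effective_split_size(total_bytes,
                         max_combined_split_size=128 * MIB,
                         max_input_splits=30000):
    """Hypothetical helper: per-task combined split size in bytes.

    max_combined_split_size mirrors pig.maxCombinedSplitSize (128 MB default,
    as stated in the issue); max_input_splits mirrors the proposed
    pig.max.input.splits. The cap only ever raises the split size.
    """
    if max_input_splits <= 0:
        # Assumption: a non-positive value disables the cap.
        return max_combined_split_size
    size_needed_for_cap = ceil_div(total_bytes, max_input_splits)
    return max(max_combined_split_size, size_needed_for_cap)


def estimated_tasks(total_bytes, split_size):
    """Approximate number of map tasks for a given combined split size."""
    return ceil_div(total_bytes, split_size)


if __name__ == "__main__":
    for total in (2 * TIB, 10 * TIB):
        size = effective_split_size(total)
        print(total // TIB, "TiB:", size // MIB, "MiB per task,",
              estimated_tasks(total, size), "tasks")
```

Under these assumptions a 2 TiB input at 128 MiB per task already yields 16384 tasks, so the 30K cap would only raise the per-task size for larger inputs, such as the 10 TiB case in the loop above.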

 Thank you [~rohini] for filing the issue.

  was:
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
512 MB or 1 GB when they are reading TBs of data, just to avoid launching too 
many map tasks (50-100K) for loading data. That many tasks adds unnecessary 
container-launch overhead and wastes a lot of resources. 

It would be good to have a new setting that configures the maximum number of 
tasks; it would override pig.maxCombinedSplitSize and combine more splits into 
one task. For example, with pig.max.input.splits=30000 and a data size of 2 TB, 
it will combine more than 128 MB (the default pig.maxCombinedSplitSize) per 
task to have a maximum of 30K tasks. The setting would go as a default into 
pig-default.properties and apply to all users.

 


> Add support for PARALLEL clause in LOAD statement
> -------------------------------------------------
>
>                 Key: PIG-5365
>                 URL: https://issues.apache.org/jira/browse/PIG-5365
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>
> It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
> 512 MB or 1 GB when they are reading TBs of data, just to avoid launching too 
> many map tasks (50-100K) for loading data. That many tasks adds unnecessary 
> container-launch overhead and wastes a lot of resources. 
> It would be good to have a new setting that configures the maximum number of 
> tasks; it would override pig.maxCombinedSplitSize and combine more splits 
> into one task. For example, with pig.max.input.splits=30000 and a data size 
> of 2 TB, it will combine more than 128 MB (the default 
> pig.maxCombinedSplitSize) per task to have a maximum of 30K tasks. The 
> setting would go as a default into pig-default.properties and apply to all 
> users.
>  Thank you [~rohini] for filing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
