[
https://issues.apache.org/jira/browse/PIG-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Satish Subhashrao Saley updated PIG-5365:
-----------------------------------------
Description:
It is tiresome to keep telling users who read TBs of data to increase
pig.maxCombinedSplitSize to 512MB or 1GB so that they avoid launching too many
map tasks (50-100K) just to load the data. That many tasks adds unnecessary
container launch overhead and wastes a lot of resources.
It would be good to have a new setting for the maximum number of tasks that
overrides pig.maxCombinedSplitSize and combines more splits into one task.
For example, with pig.max.input.splits=30000 and a data size of 2TB, it will
combine more than 128MB (the default pig.maxCombinedSplitSize) per task to have
a maximum of 30K tasks. The setting will go into pig-default.properties as a
default and apply to all users.
Thank you [~rohini] for filing the issue.
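
The arithmetic behind the proposal can be sketched as follows. This is only an
illustration, not Pig's implementation: pig.max.input.splits is the setting
name proposed in this issue, the helper names are hypothetical, and the model
assumes splits combine perfectly, whereas real split combination works on
actual blocks and their locality. Under this simplified model, 2TB at 128MB
per task already comes to roughly 16K tasks, so the cap would only raise the
per-task size for larger inputs, such as the 10TB case below.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding on large byte counts."""
    return -(-a // b)


def effective_split_size(total_bytes, max_combined_split_size=128 * MIB,
                         max_input_splits=None):
    """Hypothetical helper: per-task bytes needed to respect a task cap.

    Assumption: the cap only ever raises the combined split size above
    pig.maxCombinedSplitSize; it never lowers it.
    """
    if max_input_splits is None:
        return max_combined_split_size
    size_for_cap = ceil_div(total_bytes, max_input_splits)
    return max(max_combined_split_size, size_for_cap)


def estimated_tasks(total_bytes, split_size):
    """Idealized task count if every task reads exactly split_size bytes."""
    return ceil_div(total_bytes, split_size)


if __name__ == "__main__":
    for total in (2 * TIB, 10 * TIB):
        uncapped = estimated_tasks(total, 128 * MIB)
        size = effective_split_size(total, max_input_splits=30000)
        capped = estimated_tasks(total, size)
        print(total // TIB, "TiB:", uncapped, "tasks uncapped,",
              capped, "tasks capped,", size // MIB, "MiB per task")
```

Taking the maximum of the two sizes keeps today's behaviour for small inputs,
where the default split size already stays under the cap.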
> Add support for PARALLEL clause in LOAD statement
> -------------------------------------------------
>
> Key: PIG-5365
> URL: https://issues.apache.org/jira/browse/PIG-5365
> Project: Pig
> Issue Type: New Feature
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major