Satish Subhashrao Saley created PIG-5365:
--------------------------------------------
Summary: Add support for PARALLEL clause in LOAD statement
Key: PIG-5365
URL: https://issues.apache.org/jira/browse/PIG-5365
Project: Pig
Issue Type: New Feature
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley
It is tiresome to keep telling users who read TBs of data to increase
pig.maxCombinedSplitSize to 512MB or 1GB so that they do not launch too many map
tasks (50-100K) just to load the data. That many tasks adds unnecessary
container-launch overhead and wastes a lot of resources.
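
To make the scale concrete, here is a rough back-of-the-envelope sketch of how the map task count shrinks as pig.maxCombinedSplitSize grows. It assumes one map task per combined split and that splits pack perfectly, ignoring file counts, block boundaries, and locality. The 10TB input size and the function name are hypothetical and only for illustration.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def estimated_map_tasks(input_bytes: int, max_combined_split_bytes: int) -> int:
    """Rough estimate: one map task per combined split, perfect packing assumed."""
    return math.ceil(input_bytes / max_combined_split_bytes)


if __name__ == "__main__":
    input_bytes = 10 * TB  # hypothetical input size
    for label, split_bytes in [("128MB", 128 * MB), ("512MB", 512 * MB), ("1GB", 1024 * MB)]:
        tasks = estimated_map_tasks(input_bytes, split_bytes)
        print(f"pig.maxCombinedSplitSize={label}: about {tasks} map tasks")
```

With these assumptions, the 128MB case lands inside the 50-100K range mentioned above, while 512MB and 1GB bring it down to roughly 20K and 10K tasks.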
It would be good to have a new setting that configures the maximum number of
tasks, overrides pig.maxCombinedSplitSize, and combines more splits into one
task. For example, with pig.max.input.splits=30000 and a data size of 2TB, Pig
would combine more than 128MB (the default pig.maxCombinedSplitSize) per task
so that there are at most 30K tasks. The setting would go into
pig-default.properties as a default and apply to all users.
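
One way the proposed cap could interact with pig.maxCombinedSplitSize is sketched below. This is not Pig's implementation; the function names and the formula (raise the per-task size just enough to stay at or under the cap, never below the configured size) are assumptions made for illustration. Under these assumptions a 2TB input at 128MB is only about 16,384 splits, so the cap comes into play for larger inputs, above roughly 3.66TB (30,000 x 128MB). The sketch uses a hypothetical 8TB input to show the effect.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def effective_split_size(input_bytes: int,
                         max_combined_split_bytes: int,
                         max_input_splits: int) -> int:
    """Per-task size that keeps the task count at or below the cap.

    Assumed rule: never go below the configured combined split size,
    and grow it only when the cap would otherwise be exceeded.
    """
    size_needed_for_cap = math.ceil(input_bytes / max_input_splits)
    return max(max_combined_split_bytes, size_needed_for_cap)


def task_count(input_bytes: int, split_bytes: int) -> int:
    """One task per combined split, perfect packing assumed."""
    return math.ceil(input_bytes / split_bytes)


if __name__ == "__main__":
    cap = 30000  # value of the proposed pig.max.input.splits
    for size_tb in (2, 8):  # 8TB is a hypothetical larger input
        total = size_tb * TB
        split = effective_split_size(total, 128 * MB, cap)
        print(f"{size_tb}TB: {split / MB:.1f}MB per task, "
              f"{task_count(total, split)} tasks")
```

The design choice in this sketch is that the cap only ever increases the amount of data per task; small inputs keep the configured split size unchanged.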