Hi Wei,

Markus (in CC) offered the following explanation:

"The Sqoop1 default is 4 map tasks. When working with customers I usually start with 1 and double the number of map tasks (e.g. 1, 2, 4, 8) until finding a performance sweet spot while keeping in mind the potential RDBMS impact.
Estimating the real RDBMS impact is often challenging for some of the following reasons:

1. DBAs are often not present
2. Jobs are often reviewed in isolation (excluding other simultaneous Sqoop or non-Sqoop workloads)
3. Tests are often performed against smaller data volumes and/or virtual resources than what will be in production (includes RDBMS, network and Hadoop cluster)
4. There is not a uniform way to monitor/analyze impact across RDBMS vendors.
4.1. I have not really tried to review Sqoop console debug output from a DB impact context; perhaps it could be used.
5. Once deployed, production job volumes often change

Thanks,
Markus"

On Wed, Jun 29, 2016 at 7:35 PM, Wei Yan <[email protected]> wrote:

> Hi,
>
> Would like to check whether Sqoop supports this type of ingestion:
> consider we have records with range [1,12], and we have 3 mappers. So by
> default, the 3 mappers will be assigned [1,4], [5,8], [9,12].
>
> Not sure whether we can split the range into smaller ones, like [1], [2],
> [3], ..., [12], but still using 3 mappers instead of 12 mappers. We want
> this feature because: (1) if we configure a smaller mapper number, each mapper
> will be assigned a larger range and take much longer to finish, and
> the infra may kill long-running queries; (2) but if we configure a larger
> mapper number, each mapper has a smaller range, but meanwhile we generate
> lots of network traffic to the database, which will also be bad. One good
> way we want is: still 12 ranges, but 3 mappers, and at most 3 concurrent
> connections.
>
> Appreciate any help here.
>
> -Wei
>

--
Erzsebet Szilagyi
Software Engineer
<http://www.cloudera.com>
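
P.S. For reference, the doubling approach Markus describes (1, 2, 4, 8 map tasks) could be sketched as a small Python driver. This is only an illustration and not something from the thread: the function names and the `dry_run` switch are hypothetical, authentication options are left out, and the connection string, table and target directory in the usage line are placeholders. `--connect`, `--table`, `--target-dir`, `--delete-target-dir` and `--num-mappers` are standard Sqoop1 import options.

```python
import shlex
import subprocess
import time


def mapper_counts(max_mappers):
    """Return 1, 2, 4, ... doubling each step, never exceeding max_mappers."""
    counts = []
    n = 1
    while n <= max_mappers:
        counts.append(n)
        n *= 2
    return counts


def sqoop_import_command(connect, table, target_dir, num_mappers):
    """Build the argument list for one trial Sqoop1 import (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--delete-target-dir",  # let each trial overwrite the previous output
        "--num-mappers", str(num_mappers),
    ]


def time_trials(connect, table, target_dir, max_mappers, dry_run=True):
    """Print (dry_run) or run one import per mapper count; return seconds per count."""
    timings = {}
    for n in mapper_counts(max_mappers):
        cmd = sqoop_import_command(connect, table, target_dir, n)
        if dry_run:
            # Only show what would be executed; nothing touches the database.
            print(" ".join(shlex.quote(part) for part in cmd))
            continue
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        timings[n] = time.monotonic() - start
    return timings
```

Calling `time_trials("jdbc:mysql://db.example.com/sales", "orders", "/tmp/sqoop_trial", 8)` with the default `dry_run=True` just prints the four commands. With `dry_run=False` it records wall-clock time per mapper count, which only covers the Hadoop side; as Markus points out, the RDBMS impact still has to be judged separately.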
