Right. That seems to be what's happening. Thank you for all the help understanding. It's making sense now.
- Dave

On Wed, Jun 19, 2013 at 7:30 PM, Abraham Elmahrek <[email protected]> wrote:

> David,
>
> It's really just a hint. So the splitters will try to hit whatever is
> defined, but an extra may be created. For instance, BigDecimalSplitter will
> create 4 splits for certain ranges with 3 MR tasks specified.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 5:03 PM, David Kincaid <[email protected]> wrote:
>
>> We don't have that set on our cluster and aren't specifying it in our
>> job. When I look at the different sqoop jobs I see both 3 for some and 4
>> for others on the jobs.
>>
>>
>> On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <[email protected]> wrote:
>>
>>> David,
>>>
>>> Well I think sqoop is looking at "mapred.map.tasks". Do you have that
>>> set in mapred-site.xml? I thought that defaults to 2.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <[email protected]> wrote:
>>>
>>>> David,
>>>>
>>>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>>>> the documentation issue. Thanks for bringing this to the community's
>>>> attention!
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <[email protected]> wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> With oracle, the BigDecimalSplitter will be used by default for all
>>>>> number types.
>>>>>
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <[email protected]> wrote:
>>>>>
>>>>>> Abe, the database is Oracle.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <[email protected]> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> What database are you importing from? The description I gave was for
>>>>>>> datatypes that map to the BigDecimalSplitter. The user guide might be
>>>>>>> referring to the IntegerSplitter, which will add the remainder to the
>>>>>>> last value.
>>>>>>>
>>>>>>> -Abe
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us
>>>>>>>> 4. I understand your explanation, but it seems to conflict with the
>>>>>>>> Sqoop user guide
>>>>>>>> (http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism):
>>>>>>>>
>>>>>>>> "When performing parallel imports, Sqoop needs a criterion by
>>>>>>>> which it can split the workload. Sqoop uses a *splitting column* to
>>>>>>>> split the workload. By default, Sqoop will identify the primary key
>>>>>>>> column (if present) in a table and use it as the splitting column.
>>>>>>>> The low and high values for the splitting column are retrieved from
>>>>>>>> the database, and the map tasks operate on evenly-sized components
>>>>>>>> of the total range. For example, if you had a table with a primary
>>>>>>>> key column of id whose minimum value was 0 and maximum value was
>>>>>>>> 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four
>>>>>>>> processes which each execute SQL statements of the form
>>>>>>>> SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi)
>>>>>>>> set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the
>>>>>>>> different tasks."
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey David,
>>>>>>>>>
>>>>>>>>> Here's the algorithm:
>>>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever
>>>>>>>>> is left is tacked on at the end. So in this case,
>>>>>>>>> (288272191-2110)/3 = 96090027 (the division happens to be exact
>>>>>>>>> here; I'm assuming any fractional part would be rounded down), so
>>>>>>>>> split lengths will be of length 96090027. Sqoop will then create
>>>>>>>>> splits with the following points: (min) + (range length)*(n).
>>>>>>>>> We can see that 2110 + 96090027*0 = 2110,
>>>>>>>>> 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164, and
>>>>>>>>> 2110 + 96090027*3 = 288272191 will be generated based off of this
>>>>>>>>> algorithm. The last point to be added will be 288272192 because
>>>>>>>>> the max value is not part of the generated split points. Then
>>>>>>>>> sqoop will distribute the work accordingly based off of these
>>>>>>>>> points, as you've pointed out above.
>>>>>>>>>
>>>>>>>>> Just to be sure, did you configure sqoop to use 3 mappers?
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>> -Abe
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We're seeing a strange thing happen with a sqoop import job with
>>>>>>>>>> the way the key range is getting distributed among the 4 mappers
>>>>>>>>>> that are running. The minimum key value is 2110 and the maximum
>>>>>>>>>> value is 288272191. We are getting one mapper that is only
>>>>>>>>>> getting one record to import. Here is the distribution among the
>>>>>>>>>> mappers:
>>>>>>>>>>
>>>>>>>>>> [2110, 96092137)
>>>>>>>>>> [96092137, 192182164)
>>>>>>>>>> [192182164, 288272191)
>>>>>>>>>> [288272191, 288272192)
>>>>>>>>>>
>>>>>>>>>> You can see that the fourth mapper is given a range with only one
>>>>>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dave
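
For anyone finding this thread later, the arithmetic Abe describes can be reproduced with a short sketch. This is only an illustration of the description given in this thread, not Sqoop's actual BigDecimalSplitter source; the function names `split_points` and `split_ranges` are made up for the example, and the floor division reflects Abe's assumption about rounding.

```python
def split_points(min_val, max_val, num_mappers):
    """Split points as described in the thread (illustrative, not Sqoop code)."""
    # Split length is (max - min) / (# mappers), rounded down (assumed).
    step = max(1, (max_val - min_val) // num_mappers)
    # Points are min + step * n, kept while they do not exceed max.
    points = [min_val + step * n for n in range(num_mappers + 1)]
    points = [p for p in points if p <= max_val]
    # A final point one past max closes the last half-open range.
    points.append(max_val + 1)
    return points


def split_ranges(points):
    """Pair consecutive points into half-open [lo, hi) ranges."""
    return list(zip(points, points[1:]))


if __name__ == "__main__":
    for lo, hi in split_ranges(split_points(2110, 288272191, 3)):
        print(f"[{lo}, {hi})")
```

With min 2110, max 288272191 and 3 requested mappers, the split length divides the range exactly, so the third step lands on the maximum itself and the extra closing point produces a fourth range holding a single key. That matches the four ranges in the original report.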
