Good point. The only reasons I can think of are that offsets might be slower (the DB has to scan and count rows internally to reach each offset) and that there may be an expectation that certain ranges of data end up in certain files (though I doubt this one). I'll defer to the broader community, as I'm not sure myself.
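
To make the trade-off concrete, here is a rough Python sketch of the two splitting strategies discussed in this thread. It is not Sqoop's actual code: the function names, the table name `t`, and the column name `id` are made up for illustration, and the split column is assumed to be an integer. Range splitting cuts [min, max] of the split column into equal-width intervals, so skewed values give uneven splits; offset splitting gives even row counts, but each query needs an ORDER BY and the database has to walk past the skipped rows.

```python
def range_splits(lo, hi, num_splits):
    """Cut the closed integer range [lo, hi] into half-open intervals [a, b)
    of near-equal width. The last interval ends at hi + 1 so hi is included."""
    span = hi - lo + 1
    bounds = [lo + (span * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def offset_splits(row_count, num_splits):
    """Return (offset, limit) pairs that cover row_count rows as evenly as possible."""
    base, extra = divmod(row_count, num_splits)
    splits, offset = [], 0
    for i in range(num_splits):
        limit = base + (1 if i < extra else 0)
        splits.append((offset, limit))
        offset += limit
    return splits


def range_query(table, col, lo, hi):
    # Boundary-based split: the database can use an index on col directly.
    return f"SELECT * FROM {table} WHERE {col} >= {lo} AND {col} < {hi}"


def offset_query(table, col, offset, limit):
    # Offset-based split: ORDER BY is needed so the pages are deterministic.
    return f"SELECT * FROM {table} ORDER BY {col} LIMIT {limit} OFFSET {offset}"


def rows_per_range(values, num_splits):
    """Count how many of the given column values land in each range split."""
    splits = range_splits(min(values), max(values), num_splits)
    return [sum(a <= v < b for v in values) for a, b in splits]


if __name__ == "__main__":
    ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # skewed split column
    print(rows_per_range(ids, 2))            # rows per mapper with range splits
    print(offset_splits(len(ids), 2))        # (offset, limit) per mapper
    for a, b in range_splits(min(ids), max(ids), 2):
        print(range_query("t", "id", a, b))
    for off, lim in offset_splits(len(ids), 2):
        print(offset_query("t", "id", off, lim))
```

With the skewed ids above, the two range splits hold 9 rows and 1 row, while the offset splits hold 5 rows each. That is the skew being described, and the ORDER BY in the offset query is part of why I suspect offsets cost more on the database side.
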

On Tue, Sep 9, 2014 at 5:31 PM, [email protected] <[email protected]> wrote:

> Hey, brother.
> Glad to hear from you! I think we can use limit/offset (if the database
> supports this operation), or we can use a sub-select (if the database does
> not support limit/offset).
> For example:
> For MySQL: select * from table limit 0,5; select * from table limit 5,5; ...
> For Oracle we can use rownum.
> I just cannot understand why Sqoop overrides the operation above. This
> override can lead to data skew.
>
>
> *From:* Abraham Elmahrek <[email protected]>
> *Date:* 2014-09-10 00:38
> *To:* [email protected]
> *Subject:* Re: the confusion of --split-by parameter
> Hey there,
>
> For databases, there needs to be a way to actually infer boundaries for a
> particular column. Simply performing a "select *" would not be enough
> because we would not know how to query the database.
>
> -Abe
>
> On Mon, Sep 8, 2014 at 8:33 PM, [email protected] <[email protected]> wrote:
>
>> Hi, all.
>> In Sqoop we can specify the --split-by parameter, which determines
>> which field is used to split the records among the map tasks.
>> But if the split field's data is skewed, the workload between the maps
>> will be imbalanced. I want to know why Sqoop does not use
>> select count(*) from table/num-maps to determine each map's workload.
>> As far as I know, a base class of DataDrivenDBInputFormat has an
>> implementation of select count(*) from table/num-maps. So why does
>> Sqoop override this?
