Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"?
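For reference, one way to get dirty-read semantics from a Spark 1.6 JDBC read against SQL Server is the WITH (NOLOCK) table hint inside the pushed-down query (on SQL Server it is equivalent to READ UNCOMMITTED). This is only a sketch: the table name, key column, bounds, credentials, and URL below are placeholders, not your actual schema.

    import java.util.Properties;
    import org.apache.spark.sql.DataFrame;

    // Sketch only; "hiveContext" is the HiveContext from your snippet.
    // WITH (NOLOCK) makes the scan take no shared locks and not block on
    // concurrent writers (READ UNCOMMITTED / dirty-read semantics).
    String query =
        "(SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS row_num"
      + " FROM myTable WITH (NOLOCK)) AS t";   // myTable/id are placeholders

    Properties jdbcOptions = new Properties();
    jdbcOptions.put("user", "etl_user");       // placeholder credentials
    jdbcOptions.put("password", "secret");

    long upperLimit = 1000000L;   // e.g. from a SELECT COUNT(*) run beforehand
    int noOfPartitions = 8;       // as in your snippet

    DataFrame df = hiveContext
        .read()
        .jdbc(
            "jdbc:sqlserver://host:1433;databaseName=mydb",  // placeholder URL
            query,
            "row_num",
            1L,
            upperLimit,
            noOfPartitions,
            jdbcOptions);

Note that dirty reads only avoid blocking; they do not stop ROW_NUMBER from shifting when rows are inserted or deleted mid-batch.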
On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <manjunathshe...@live.com> wrote:
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise single-table reads into multiple tasks without missing any
> data.
>
> We have tried this:
>
> - Use the `ROW_NUMBER` window function in the SQL query
> - Then do:
>
> DataFrame df = hiveContext
>     .read()
>     .jdbc(
>         *<url>*,
>         query,
>         "row_num",
>         1,
>         <upper_limit>,
>         noOfPartitions,
>         jdbcOptions);
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values will
> change and we may miss some records.
>
> Is there any approach to fixing this issue? Any pointers will be helpful.
>
> *Note*: I am on Spark 1.6
>
> Thanks,
> Manjunath Shetty
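A separate thought on the missed-records problem itself: ROW_NUMBER is recomputed on every partition query, so concurrent inserts and deletes can shift the numbering between queries. If the table has an immutable, monotonically increasing key (an IDENTITY primary key, say), partitioning directly on that key avoids the shifting, because a row's key never changes across queries. A rough sketch under that assumption; all names, bounds, and the URL are placeholders:

    // Sketch only: assumes a stable numeric key column named "id";
    // jdbcOptions and noOfPartitions as in the sketch above.
    // Snapshot the key range once at the start of the 15-min batch, then
    // bound the subquery with it so rows inserted mid-batch cannot leak
    // into, or move between, partitions.
    long minId = 1L;         // from SELECT MIN(id), fetched at batch start
    long maxId = 5000000L;   // from SELECT MAX(id), fetched at batch start

    String table =
        "(SELECT * FROM myTable WITH (NOLOCK)"
      + " WHERE id BETWEEN " + minId + " AND " + maxId + ") AS t";

    DataFrame df = hiveContext
        .read()
        .jdbc(
            "jdbc:sqlserver://host:1433;databaseName=mydb",
            table,
            "id",            // partition on the stable key, not ROW_NUMBER
            minId,
            maxId,
            noOfPartitions,
            jdbcOptions);

Rows inserted after the snapshot are simply picked up by the next 15-min batch, since its MIN/MAX snapshot starts where this one ended.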