Why don't you apply proper change data capture (CDC)? It will be more complex, though.
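To make that concrete: with SQL Server change tracking (a lightweight CDC-style option; full Change Data Capture is another), each batch reads only rows whose change version is above the version recorded after the previous batch, and `SYS_CHANGE_VERSION` is monotonically increasing, so it can also serve as the JDBC partition column. A minimal sketch in plain Java, where the table name `dbo.orders`, its primary key `id`, and the helper name are hypothetical, not from this thread:

```java
public class ChangeTrackingQuery {
    // Builds an incremental-read query using SQL Server change tracking.
    // lastSyncVersion is the CHANGE_TRACKING_CURRENT_VERSION() value stored
    // after the previous batch; only rows changed since then are returned.
    static String build(String table, long lastSyncVersion) {
        return "SELECT t.*, ct.SYS_CHANGE_VERSION AS row_num "
             + "FROM CHANGETABLE(CHANGES " + table + ", " + lastSyncVersion + ") AS ct "
             + "JOIN " + table + " t ON t.id = ct.id";
    }

    public static void main(String[] args) {
        // The resulting string can be passed as the `query` argument to the
        // same partitioned jdbc() call, partitioning on SYS_CHANGE_VERSION.
        System.out.println(build("dbo.orders", 42L));
    }
}
```

Rows updated while a batch runs simply reappear with a higher change version in the next batch, so nothing is silently skipped.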
On Mon, 25 May 2020 at 07:38, Manjunath Shetty H <manjunathshe...@live.com> wrote:

> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, a data miss can happen, right? The fetch is
> based on the last watermark (the maximum timestamp of the rows the last
> batch job fetched). Take a scenario like this, with a table:
>
> a : 1
> b : 2
> c : 3
> d : 4
> *f : 6*
> g : 7
> h : 8
> e : 5
>
> - a, b, c, d, e get picked by one task
> - by the time the second task starts, e has been updated, so the row
>   order changes
> - as f moves up, it gets missed completely in the fetch
>
> Thanks
> Manjunath
>
> ------------------------------
> *From:* Mike Artz <michaelea...@gmail.com>
> *Sent:* Monday, May 25, 2020 10:50 AM
> *To:* Manjunath Shetty H <manjunathshe...@live.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: Parallelising JDBC reads in spark
>
> Does anything different happen when you set the isolationLevel to do
> dirty reads, i.e. "READ_UNCOMMITTED"?
>
> On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H
> <manjunathshe...@live.com> wrote:
>
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how
> to parallelise single-table reads into multiple tasks without missing
> any data.
>
> We have tried this:
>
> - Use the `ROW_NUMBER` window function in the SQL query
> - Then do:
>
>     DataFrame df = hiveContext
>         .read()
>         .jdbc(
>             <url>,
>             query,
>             "row_num",
>             1,
>             <upper_limit>,
>             noOfPartitions,
>             jdbcOptions);
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values will
> change and we may miss some records.
>
> Any pointers on how to fix this issue would be helpful.
>
> *Note*: I am on Spark 1.6
>
> Thanks
> Manjunath Shetty
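The race Manjunath describes can be reproduced without Spark or SQL Server. The plain-Java sketch below (class and helper names are mine, not from the thread) simulates two ROW_NUMBER-range fetches against an in-memory table, with e updated between them, and shows row f being missed:

```java
import java.util.*;
import java.util.stream.*;

public class RowNumberRace {
    // key -> last-modified timestamp (the watermark column)
    static Map<String, Integer> table = new LinkedHashMap<>();

    // Rows ordered by timestamp, sliced by a ROW_NUMBER range [from, to], 1-based.
    static List<String> fetchRange(int from, int to) {
        List<String> ordered = table.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
        return ordered.subList(from - 1, Math.min(to, ordered.size()));
    }

    public static void main(String[] args) {
        for (String[] r : new String[][] {{"a","1"},{"b","2"},{"c","3"},{"d","4"},
                                          {"e","5"},{"f","6"},{"g","7"},{"h","8"}})
            table.put(r[0], Integer.parseInt(r[1]));

        List<String> task1 = fetchRange(1, 5);   // [a, b, c, d, e]
        table.put("e", 9);                       // e updated between the two tasks
        List<String> task2 = fetchRange(6, 8);   // [g, h, e] -- f moved into rows 1..5

        Set<String> seen = new HashSet<>(task1);
        seen.addAll(task2);
        System.out.println("missed: " + table.keySet().stream()
            .filter(k -> !seen.contains(k)).collect(Collectors.toList()));
        // prints: missed: [f]
    }
}
```

Because the ranges are computed against a live ordering, any update that reorders rows across a partition boundary drops a row, which is why a stable change-version column (CDC-style) is safer than ROW_NUMBER for partitioning.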