Does anything different happen when you set the isolationLevel to allow
dirty reads, i.e. "READ_UNCOMMITTED"?
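
If you have not tried it yet, here is a minimal sketch, assuming the
Microsoft SQL Server JDBC driver: dirty reads can be requested per query
with the NOLOCK table hint in the pushed-down subquery, so the partitioned
read is not blocked by concurrent writers (table and column names are
placeholders):

    // Sketch only: WITH (NOLOCK) gives READ UNCOMMITTED semantics for this
    // table reference; "my_table" and "id" are placeholder names.
    String query =
        "(SELECT ROW_NUMBER() OVER (ORDER BY id) AS row_num, t.*"
            + " FROM my_table t WITH (NOLOCK)) AS src";

Be aware, though, that READ UNCOMMITTED/NOLOCK can itself skip or
double-read rows while the table is being modified, so it may not fix the
missing-records problem on its own.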

On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <manjunathshe...@live.com>
wrote:

> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise a single table read into multiple tasks without missing any
> data.
>
> We have tried this,
>
>
>    - Use the `ROW_NUMBER` window function in the SQL query (see the
>      sketch after the snippet below)
>    - Then do:
>
>    DataFrame df =
>        hiveContext
>            .read()
>            .jdbc(
>                <url>,
>                query,
>                "row_num",        // partition column
>                1,                // lower bound
>                <upper_limit>,    // upper bound
>                noOfPartitions,
>                jdbcOptions);
>
>
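>    For reference, `query` is the ROW_NUMBER subquery and `jdbcOptions`
>    holds the JDBC connection properties; roughly, with placeholder
>    table/column names:
>
>    String query =
>        "(SELECT ROW_NUMBER() OVER (ORDER BY id) AS row_num, t.*"
>            + " FROM my_table t) AS src";
>
>    Properties jdbcOptions = new Properties();  // java.util.Properties
>    jdbcOptions.put("user", "<user>");
>    jdbcOptions.put("password", "<password>");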
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values will
> change and we may miss some records.
>
>
> Is there any approach to fix this issue? Any pointers would be helpful.
>
>
> *Note*: I am on Spark 1.6
>
>
> Thanks
>
> Manjunath Shetty
>
>
