Re: Parallelising JDBC reads in spark

2020-05-25 Thread Manjunath Shetty H
Thanks Dhaval for the suggestion, but in the case I mentioned in the previous mail 
data can still be missed, as the row number will change.


-
Manjunath

From: Dhaval Patel 
Sent: Monday, May 25, 2020 3:01 PM
To: Manjunath Shetty H 
Subject: Re: Parallelising JDBC reads in spark

If possible, set the watermark before reading the data: read the max of the watermark 
column first, then add that bound to the query that reads the actual data, 
e.g. watermark <= current_watermark.

It may query the DB twice, but it makes sure you are not missing any records.

Regards
Dhaval
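
A minimal sketch of the watermark-first read suggested above, using the Spark 1.6 Java 
API that appears later in this thread. The table and column names (dbo.my_table, 
modified_ts), the jdbcUrl/jdbcOptions variables and the stored previous watermark 
(lastWatermark) are illustrative assumptions, not details from the mail:

// Assumes: hiveContext, jdbcUrl and jdbcOptions as in the snippet further down the
// thread; "modified_ts" is a hypothetical watermark (timestamp) column; lastWatermark
// is the java.sql.Timestamp stored by the previous batch.

// 1) Fix the upper watermark first with one extra, cheap query.
java.sql.Timestamp currentWatermark =
    hiveContext
        .read()
        .jdbc(jdbcUrl,
              "(SELECT MAX(modified_ts) AS max_ts FROM dbo.my_table) AS w",
              jdbcOptions)
        .collect()[0]
        .getTimestamp(0);

// 2) Read the actual data bounded by that fixed watermark, so rows committed while
//    the tasks are running fall outside the interval. (String concatenation is for
//    brevity only.)
String boundedQuery =
    "(SELECT * FROM dbo.my_table"
  + " WHERE modified_ts > '" + lastWatermark + "'"
  + " AND modified_ts <= '" + currentWatermark + "') AS src";

DataFrame df = hiveContext.read().jdbc(jdbcUrl, boundedQuery, jdbcOptions);

This pins the data set for one batch; as noted at the top of this message, it does not 
by itself stop ROW_NUMBER from shifting when rows inside the bound are updated while 
the partitioned tasks are still running.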



Re: Parallelising JDBC reads in spark

2020-05-25 Thread Manjunath Shetty H
Thanks Georg for the suggestion, but at this point changing the design is not 
really an option.

Any other pointers would be helpful.


Thanks
Manjunath



Re: Parallelising JDBC reads in spark

2020-05-24 Thread Georg Heiler
Well, you seem to have both performance and consistency problems. With a CDC tool
that fits your database you might be able to fix both.
However, streaming the change events from the database log might be a bit more
complicated. Tools like https://debezium.io/ could be useful, depending on
your source database.

Best,
Georg



Re: Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi Georg,

Thanks for the response. Can you please elaborate on what you mean by change data 
capture?

Thanks
Manjunath



Re: Parallelising JDBC reads in spark

2020-05-24 Thread Georg Heiler
Why don't you apply proper change data capture?
This will be more complex though.



Re: Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi Mike,

Thanks for the response.

Even with that flag set, data can still be missed, right? The fetch is based on 
the last watermark (the maximum timestamp of the rows the last batch job fetched). 
Take a scenario like this with a table:

a :  1
b :  2
c :  3
d :  4
f  :  6
g :  7
h :  8
e :  5


  *   a, b, c, d, e get picked by the first task (the first five row numbers)
  *   by the time the second task starts, e has been updated, so the row numbering is 
recomputed and the order changes
  *   f moves up into the range the first task has already read, so no remaining task 
will request it and it is missed completely

Thanks
Manjunath



Re: Parallelising JDBC reads in spark

2020-05-24 Thread Mike Artz
Does anything different happen when you set the isolationLevel to do dirty reads, 
i.e. "READ_UNCOMMITTED"?



Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi,

We are writing an ETL pipeline using Spark that fetches data from SQL Server 
in batch mode (every 15 minutes). The problem we are facing is how to 
parallelise reads of a single table into multiple tasks without missing any data.

We have tried this:


  *   Use the `ROW_NUMBER` window function in the SQL query
  *   Then do a partitioned JDBC read on the `row_num` column:

DataFrame df =
    hiveContext
        .read()
        .jdbc(
            jdbcUrl,        // connection URL (placeholder; elided in the original mail)
            query,          // subquery containing the ROW_NUMBER column
            "row_num",      // partition column
            1,              // lower bound
            upperBound,     // upper bound (placeholder; elided in the original mail)
            noOfPartitions,
            jdbcOptions);
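
For reference, a hedged sketch of what the first bullet and the two elided arguments 
could look like; the table and ordering column (dbo.my_table, modified_ts) are 
assumptions, not details from the mail:

// Hypothetical query for the first bullet: assign ROW_NUMBER in a subquery so the
// read can be split on the "row_num" column.
String query =
    "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.modified_ts) AS row_num"
  + " FROM dbo.my_table t) AS src";

// One way to obtain the upper bound: count the rows up front (COUNT_BIG returns a
// bigint on SQL Server, which maps to a Spark LongType).
long upperBound =
    hiveContext
        .read()
        .jdbc(jdbcUrl, "(SELECT COUNT_BIG(*) AS cnt FROM dbo.my_table) AS c", jdbcOptions)
        .collect()[0]
        .getLong(0);

Spark turns the lowerBound/upperBound/noOfPartitions triple into one WHERE clause per 
partition on row_num (roughly row_num < x for the first partition, x <= row_num < y in 
the middle, row_num >= z for the last), and each task runs its own query against the 
database. That is why a row whose ROW_NUMBER shifts between task executions, as in the 
scenario earlier in the thread, can land in a range that has already been read and 
never be fetched.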



The problem with this approach is that if our tables get updated in SQL Server while 
the tasks are still running, the `ROW_NUMBER` values change and we may miss some 
records.


Is there any approach to fix this issue? Any pointers would be helpful.


Note: I am on Spark 1.6.


Thanks

Manjunath Shetty