Re: Parallelising JDBC reads in spark

2020-05-24 Thread Georg Heiler
Why don't you apply proper change data capture? This will be more complex, though.
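For readers unfamiliar with the suggestion: SQL Server's change data capture writes committed changes to per-table change tables, which can themselves be read through the Spark JDBC source. A minimal sketch, assuming CDC has already been enabled on the table with sys.sp_cdc_enable_table (the URL, credentials, and table names are placeholders):

    # Read the CDC change table instead of the base table; CDC records only
    # committed changes, which sidesteps the late-commit watermark problem.
    df_changes = (spark.read
        .format("jdbc")
        .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")  # placeholder
        .option("dbtable", "cdc.dbo_my_table_CT")  # default capture-instance change table
        .option("user", "etl_user")                # placeholder credentials
        .option("password", "...")
        .load())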

Re: Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi Mike, Thanks for the response. Even with that flag set, a data miss can happen, right? As the fetch is based on the last watermark (the maximum timestamp of the rows the last batch job fetched), take a scenario like this with a table: a : 1, b : 2, c : 3, d : 4, f : 6, g : 7, h : 8, e : 5 *
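To make the scenario concrete, here is a small plain-Python simulation of the failure mode (the fetch_since helper is hypothetical): rows a-d and f-h are visible when the first batch runs, the watermark advances to 8, and row e (timestamp 5) commits afterwards, so no later batch ever fetches it.

    # Simulated table: key -> timestamp. Row e (ts=5) commits late.
    visible_in_batch1 = {"a": 1, "b": 2, "c": 3, "d": 4, "f": 6, "g": 7, "h": 8}

    def fetch_since(table, watermark):
        # Incremental fetch: everything with a timestamp above the watermark
        return {k: ts for k, ts in table.items() if ts > watermark}

    batch1 = fetch_since(visible_in_batch1, 0)      # picks up a-d, f-h
    watermark = max(batch1.values())                # watermark advances to 8

    after_late_commit = {**visible_in_batch1, "e": 5}
    batch2 = fetch_since(after_late_commit, watermark)
    print(batch2)                                   # {} -- row e is missed forever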

Re: Parallelising JDBC reads in spark

2020-05-24 Thread Mike Artz
Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"?
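A sketch of how that suggestion might be tried on the JDBC reader (connection details are placeholders). One caveat worth checking: the Spark documentation describes isolationLevel as a write-side JDBC option, so verify it actually affects reads with your driver.

    df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")  # placeholder
        .option("dbtable", "dbo.my_table")                              # placeholder
        .option("isolationLevel", "READ_UNCOMMITTED")  # dirty reads, per the suggestion
        .load())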

Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi, We are writing an ETL pipeline using Spark that fetches data from SQL Server in batch mode (every 15 mins). The problem we are facing is how to parallelise single-table reads into multiple tasks without missing any data. We have tried this: * Use the `ROW_NUMBER` window function in
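For context, the built-in way to parallelise a single JDBC table read in Spark is the partitionColumn / lowerBound / upperBound / numPartitions options; a minimal sketch with placeholder connection details is below. Note this splits the read into parallel range queries but does not by itself solve the missed-rows problem discussed above.

    df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")  # placeholder
        .option("dbtable", "dbo.my_table")                              # placeholder
        .option("partitionColumn", "id")   # must be numeric, date, or timestamp
        .option("lowerBound", "1")         # bounds define the stride per partition,
        .option("upperBound", "1000000")   # not a filter: rows outside are still read
        .option("numPartitions", "8")      # 8 tasks, each issuing a range query
        .load())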

Re: unsubscribe

2020-05-24 Thread Sunil Prabhakara
On Sat, 16 May 2020, 22:34 Punna Yenumala wrote:

Re: ETL Using Spark

2020-05-24 Thread vijay.bvp
Hi Avadhut Narayan Joshi, The use case is achievable using Spark. Connection to SQL Server is possible, as Mich mentioned below, as long as there is a JDBC driver that can connect to SQL Server. For production workloads, important points to consider: what are the QoS requirements for your case? At least
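On the driver point: one way to make a SQL Server JDBC driver available to the job is through spark.jars.packages (the artifact version here is illustrative; pick whichever matches your environment):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("sqlserver-etl")  # hypothetical app name
        # Pull Microsoft's JDBC driver from Maven at startup (version illustrative)
        .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:8.2.2.jre8")
        .getOrCreate())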

Re: [apache-spark]-spark-shuffle

2020-05-24 Thread vijay.bvp
How a Spark job reads data sources depends on the underlying source system and on the job configuration: the number of executors and the cores per executor. https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets About shuffle operations.
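A small illustration of both points (the input path is a placeholder): the source and read options determine the initial partitioning, while wide operations such as reduceByKey repartition the data through a shuffle.

    rdd = spark.sparkContext.textFile("hdfs:///data/input", minPartitions=8)  # placeholder path
    print(rdd.getNumPartitions())  # initial partitioning, driven by the source splits

    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))  # wide dependency -> shuffle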

[no subject]

2020-05-24 Thread Vijaya Phanindra Sarma B

Cleanup hook for temporary files produced as part of a spark job

2020-05-24 Thread jelmer
I am writing something that partitions a data set and then trains a machine learning model on the data in each partition. The resulting model is very big, and right now I am storing it in an RDD as a pair of partition_id and very_big_model_that_is_hundreds_of_megabytes_big, but it is becoming
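A common workaround for this pattern is to write each partition's model to shared storage from inside the job and keep only (partition_id, path) pairs in the RDD, then remove the temporary directory when the driver exits. A sketch under those assumptions (TMP_DIR, the train() helper, and data_rdd are hypothetical):

    import atexit, os, pickle, shutil, uuid

    TMP_DIR = "/shared/tmp/models"  # hypothetical directory visible to all executors

    def train_and_save(partition_index, rows):
        model = train(list(rows))  # hypothetical per-partition training function
        path = os.path.join(TMP_DIR, "model-%d-%s.pkl" % (partition_index, uuid.uuid4()))
        with open(path, "wb") as f:
            pickle.dump(model, f)
        yield (partition_index, path)  # keep only the path, not the model itself

    model_paths = data_rdd.mapPartitionsWithIndex(train_and_save).collect()

    # Driver-side cleanup hook: delete the temporary files when the driver exits
    atexit.register(shutil.rmtree, TMP_DIR, ignore_errors=True)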

spark kafka option properties

2020-05-24 Thread Gunjan Kumar
Hi, while reading streaming data from Kafka we use the following API:

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
        .option("subscribe", "topic1") \
        .option("startingOffsets", "earliest") \
        .load()

My question is how to
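If the question (per the subject line) is about passing Kafka's own consumer properties: the structured streaming Kafka source forwards any option whose key is prefixed with kafka. to the underlying Kafka consumer, for example:

    df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        # Any kafka.-prefixed key is passed through to the Kafka consumer config
        .option("kafka.security.protocol", "SSL")     # illustrative property
        .option("kafka.session.timeout.ms", "60000")  # illustrative property
        .load())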