Hi

The RDBMS context is quite broad: it has both large fact tables with
billions of rows and hundreds of small normalized tables. Depending on
the Spark transformation, the source data can be one or several tables,
and anywhere from a few rows to millions or even billions of them. When
new data is inserted into some tables, it should trigger a Spark job
that might also fetch data from older and even static related tables.
In most cases, the joins are done in Spark rather than in the RDBMS, to
keep the load off the database. In all cases, the sooner/faster the
Spark jobs get the data, the better.

I have explored four approaches so far: CDC, batch, Spark Streaming and
Apache Livy.

CDC (such as Debezium) looks interesting. It can be combined with
triggers to populate a change table that is then consumed through Spark
Streaming, Kafka and so on. However this approach is quite complex and
adds some processing/storage overhead on the RDBMS side.
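
To make the idea concrete, here is a minimal PySpark sketch of the
consumer side, assuming Debezium publishes the changes to a Kafka topic
and the ExtractNewRecordState transform flattens the envelope (broker
address, topic name and schema below are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # Requires the spark-sql-kafka-0-10 package on the classpath.
    spark = (SparkSession.builder
             .appName("cdc-stream-sketch")
             .getOrCreate())

    # Hypothetical row schema, assuming Debezium's ExtractNewRecordState
    # SMT has flattened the change envelope into plain row JSON.
    row_schema = StructType([
        StructField("id", LongType()),
        StructField("label", StringType()),
    ])

    changes = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "kafka:9092")     # placeholder broker
               .option("subscribe", "dbserver1.public.fact_table")  # placeholder topic
               .load()
               .select(from_json(col("value").cast("string"),
                                 row_schema).alias("row"))
               .select("row.*"))

    query = (changes.writeStream
             .format("parquet")
             .option("path", "/data/fact_table_changes")
             .option("checkpointLocation", "/checkpoints/fact_table_changes")
             .start())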

Batch is simple. However, as said, it's quite slow and
resource-consuming for both the RDBMS and the Spark cluster.
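
For reference, a batch run boils down to a plain JDBC read along these
lines (connection URL, credentials, table and partitioning bounds are
placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-jdbc-sketch").getOrCreate()

    # Full (or predicate-bounded) read of the source table over JDBC,
    # split into parallel partitions to spread the load.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "public.fact_table")
          .option("user", "spark_reader")
          .option("password", "change_me")
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000000")
          .option("numPartitions", "32")
          .load())

    df.write.mode("overwrite").parquet("/data/fact_table_snapshot")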

Spark Streaming is faster than batch, but more difficult to maintain
(to me). It also hits the RDBMS frequently.
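
To illustrate why it hits the RDBMS so often, a simple polling variant
could look like the sketch below, assuming an increasing id column can
serve as a high-water mark (URL, credentials, table and interval are
placeholders):

    import time

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-polling-sketch").getOrCreate()

    last_seen = 0  # high-water mark; a modification timestamp would also work

    while True:
        # Every iteration issues a query against the RDBMS, which is why
        # this approach keeps hitting the database.
        delta = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://dbhost:5432/mydb")
                 .option("dbtable",
                         "(SELECT * FROM fact_table "
                         f"WHERE id > {last_seen}) AS delta")
                 .option("user", "spark_reader")
                 .option("password", "change_me")
                 .load())
        if delta.count() > 0:
            last_seen = delta.agg({"id": "max"}).collect()[0][0]
            delta.write.mode("append").parquet("/data/fact_table_delta")
        time.sleep(60)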

Apache Livy looks the best. Its REST API allows triggering jobs on
Spark contexts that are already started and properly sized. Even
better, the job can be triggered from the client application that loads
the RDBMS, right after the RDBMS has been populated. Finally, this is
also flexible since it can handle any workload as well as
PySpark/SparkR/Scala Spark.
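
As an illustration, the client application could submit the job with a
simple REST call to Livy's /batches endpoint, along these lines (Livy
URL, job file and resource sizes are placeholders):

    import json

    import requests

    LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

    # Submit a pre-packaged PySpark job as a Livy batch; the file path,
    # arguments and sizing below are illustrative only.
    payload = {
        "file": "hdfs:///jobs/process_new_rows.py",
        "args": ["--table", "fact_table"],
        "driverMemory": "4g",
        "executorMemory": "8g",
        "numExecutors": 10,
    }

    resp = requests.post(LIVY_URL + "/batches",
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    batch_id = resp.json()["id"]

    # The loading application can poll for completion, or fire and forget.
    state = requests.get(LIVY_URL + "/batches/" + str(batch_id)).json()["state"]
    print(batch_id, state)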


On Fri, Dec 28, 2018 at 05:41:51PM +0000, Thakrar, Jayesh wrote:
> Yes, you can certainly use spark streaming, but reading from the original 
> source table may still be time consuming and resource intensive.
> 
> Having some context on the RDBMS platform, data size/volumes involved and the 
> tolerable lag (between changes being created and it being processed by Spark) 
> will help people give you better recommendations/best practices.
> 
> All the same, one approach is to create triggers on the source table and 
> insert data into a different table and then read from there.
> Another approach is to push the delta data into something like Kafka and then 
> use Spark streaming against that.
> Taking that Kafka approach further, you can capture the delta upstream so 
> that the processing that pushes it into the RDBMS can also push it to Kafka 
> directly.
> 
> On 12/27/18, 4:52 PM, "Nicolas Paris" <nicolas.pa...@riseup.net> wrote:
> 
>     Hi
>     
>     I have this living RDBMS and I d'like to apply a spark job on several
>     tables once new data get in.
>     
>     I could run batch spark jobs thought cron jobs every minutes. But the
>     job takes time and resources to begin (sparkcontext, yarn....)
>     
>     I wonder if I could run one instance of a spark streaming job to save
>     those resources. However I haven't seen about structured streaming from
>     jdbc source in the documentation.
>     
>     Any recommendation ?
>     
>     
>     -- 
>     nicolas
>     
>     
> 

-- 
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
