Cascading Spark Structured Streams

2017-12-28 Thread Eric Dain
I need to write a Spark Structured Streaming pipeline that involves
multiple aggregations, splitting the data into multiple sub-pipes, and
unioning them. It also needs stateful aggregation with a timeout.
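For concreteness, the timeout behavior I have in mind resembles the processing-time timeout of Spark's mapGroupsWithState. Here is a minimal pure-Python sketch of that state machine (the class and method names are made up for illustration; this is not the Spark API itself):

```python
# Sketch of per-key state with a processing-time timeout, mimicking the
# semantics of mapGroupsWithState. All names here are hypothetical.
class TimedState:
    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.state = {}      # key -> running aggregate
        self.deadline = {}   # key -> expiry timestamp (ms)

    def update(self, key, value, now_ms):
        """Fold a value into the key's aggregate and push back its timeout."""
        self.state[key] = self.state.get(key, 0) + value
        self.deadline[key] = now_ms + self.timeout_ms

    def evict_expired(self, now_ms):
        """Emit and drop every aggregate whose timeout has fired."""
        expired = [k for k, t in self.deadline.items() if t <= now_ms]
        out = [(k, self.state.pop(k)) for k in expired]
        for k in expired:
            del self.deadline[k]
        return out
```

In real Spark code this corresponds to updating GroupState and calling setTimeoutDuration inside the user function, with the engine invoking the function again when the timeout fires.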

Spark Structured Streaming supports all of the required functionality, but
not within a single stream. I did a proof of concept that divides the
pipeline into three sub-streams cascaded through Kafka, and it seems to
work. But I was wondering whether it would be a good idea to skip Kafka and
use HDFS files as the integration layer instead. Or maybe there is another
way to cascade streams without needing an extra service like Kafka.
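To make the Kafka-cascading approach concrete, one stage of such a proof of concept might look like the sketch below, assuming PySpark (2.2 or later) with the spark-sql-kafka package on the classpath. The broker address, topic names, and checkpoint paths are all made up for illustration:

```python
def stage_topic(stage: int) -> str:
    """Hypothetical naming convention for the intermediate Kafka topics
    that glue the cascaded sub-streams together."""
    return "pipeline-stage%d-out" % stage

def run_stage(spark, in_topic: str, out_topic: str, checkpoint: str):
    """One sub-stream in the cascade: read from one Kafka topic, aggregate,
    and publish to the topic the next sub-stream subscribes to."""
    # Deferred import so this sketch can be read/loaded without Spark installed.
    from pyspark.sql import functions as F

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
              .option("subscribe", in_topic)
              .load())

    counts = (events
              .select(F.col("value").cast("string").alias("item"))
              .groupBy("item")
              .count())

    return (counts
            .select(F.to_json(F.struct("item", "count")).alias("value"))
            .writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("topic", out_topic)
            .option("checkpointLocation", checkpoint)  # required for exactly-once
            .outputMode("update")
            .start())
```

The files-as-integration variant would swap the Kafka sink/source for a parquet sink writing to an HDFS directory and the file source reading that directory; the file source does pick up newly committed files, though with higher latency and a growing small-files problem compared to Kafka.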

Thanks,


Ingesting a large CSV file into a relational database

2017-01-25 Thread Eric Dain
Hi,

I need to write a nightly job that ingests large CSV files (~15 GB each)
and adds/updates/deletes the changed rows in a relational database.

If a row is identical to what is in the database, I don't want to re-write
it. Also, if the same item comes from multiple sources (files), I need
logic to decide whether the new source is preferred or whether the current
row in the database should be kept unchanged.

Obviously, I don't want to query the database for each item to check
whether it has changed. I would prefer to maintain that state inside Spark.

Is there a preferred, performant way to do this in Apache Spark?
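Independent of how it would be executed in Spark, the row-diffing described above (skip identical rows, prefer one source over another) could be sketched as below. In Spark itself this comparison maps naturally onto a full outer join between the new file and the previous snapshot on the item key. The priority table and record layout are hypothetical:

```python
# Sketch of the change-detection logic described above. SOURCE_PRIORITY and
# the record layout are made-up examples; in Spark this would typically be
# a full outer join on the item key between snapshot and incoming data.
SOURCE_PRIORITY = {"feed_a": 2, "feed_b": 1}  # higher value wins

def diff_rows(previous, incoming):
    """previous/incoming: dict of key -> (source, payload).
    Returns (upserts, deletes): rows to write and keys to remove."""
    upserts = {}
    for key, (src, payload) in incoming.items():
        if key not in previous:
            upserts[key] = (src, payload)            # new item -> insert
            continue
        old_src, old_payload = previous[key]
        if payload == old_payload:
            continue                                 # identical -> skip the write
        if SOURCE_PRIORITY.get(src, 0) >= SOURCE_PRIORITY.get(old_src, 0):
            upserts[key] = (src, payload)            # preferred source -> update
    deletes = [k for k in previous if k not in incoming]
    return upserts, deletes
```

The `previous` side would be the state kept inside Spark (e.g., a snapshot DataFrame or checkpointed state), so only the `upserts` and `deletes` ever touch the database.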

Best,
Eric