Cascading Spark Structured Streams
I need to write a Spark Structured Streaming pipeline that involves multiple aggregations, splits the data into multiple sub-pipes, and unions them. It also needs stateful aggregation with a timeout. Spark Structured Streaming supports all of the required functionality, but not in a single stream. I did a proof of concept that divides the pipeline into 3 sub-streams cascaded through Kafka, and it seems to work. But I was wondering whether it would be a good idea to skip Kafka and use HDFS files as the integration layer instead, or whether there is another way to cascade streams without needing an extra service like Kafka. Thanks,
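For reference, the file-based cascade can be sketched roughly as below (paths, column names, and window sizes are placeholders, not from the original post): a downstream stage reads the parquet directory written by an upstream stage's file sink. Because the file sink maintains a `_spark_metadata` log, the downstream file source only sees files that were committed exactly once. Note the file sink only supports append output mode, so the aggregation needs a watermark.

```python
def build_stage2(spark, upstream_path, downstream_path, checkpoint_path, schema):
    """Hypothetical second stage of a cascaded pipeline: read the parquet
    output of an upstream streaming stage as a file source and run a
    further windowed aggregation over it.

    The upstream stage must write with .format("parquet") so that its
    _spark_metadata directory makes the output safely readable here.
    """
    # Lazy import so the sketch can be inspected without a Spark install.
    from pyspark.sql import functions as F

    stage1_out = (
        spark.readStream
        .schema(schema)          # streaming file sources require an explicit schema
        .parquet(upstream_path)  # only picks up files committed by stage 1
    )

    stage2 = (
        stage1_out
        .withWatermark("event_time", "15 minutes")  # required for append mode
        .groupBy(F.window("event_time", "10 minutes"), "key")
        .count()
    )

    return (
        stage2.writeStream
        .format("parquet")
        .option("path", downstream_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")    # the file sink supports append only
        .start()
    )
```

The trade-off versus Kafka: end-to-end latency becomes at least the sum of the stages' trigger intervals plus file-commit visibility, and you give up Kafka's retention/replay controls, but you drop an external service.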
Ingesting Large CSV Files into a Relational Database
Hi, I need to write a nightly job that ingests large CSV files (~15 GB each) and adds/updates/deletes the changed rows in a relational database. If a row is identical to what is already in the database, I don't want to rewrite it. Also, if the same item comes from multiple sources (files), I need logic to decide whether the new source is preferred or the current row in the database should be kept unchanged. Obviously, I don't want to query the database for each item to check whether it has changed. I would prefer to maintain the state inside Spark. Is there a preferred, performant way to do this with Apache Spark? Best, Eric