What relational database are you using? We do this at work, and the way we handle it is to unload the database into Spark (actually, we unload it to S3 and then read it into Spark). Redshift is very efficient at dumping tables this way.
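Roughly, the flow looks like the sketch below. This is only an illustration, not our exact job: the cluster endpoint, credentials, bucket, table name, and IAM role are all placeholders, and it assumes the Redshift JDBC driver is on the classpath.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("nightly-ingest").getOrCreate()

// 1. Ask Redshift to UNLOAD the current table to S3. The statement runs on
//    Redshift itself; we only submit it over JDBC. All identifiers are placeholders.
val conn = DriverManager.getConnection(
  "jdbc:redshift://example-cluster:5439/mydb", "db_user", "db_password")
try {
  conn.createStatement().execute(
    """UNLOAD ('SELECT * FROM items')
      |TO 's3://my-bucket/unloads/items_'
      |IAM_ROLE 'arn:aws:iam::123456789012:role/unload-role'
      |DELIMITER '|' GZIP ALLOWOVERWRITE""".stripMargin)
} finally {
  conn.close()
}

// 2. Read the unloaded snapshot back into Spark as a DataFrame.
val current = spark.read
  .option("delimiter", "|")
  .csv("s3://my-bucket/unloads/items_*")
```

Once the snapshot is a DataFrame, you can join or diff it against the incoming CSV data entirely inside Spark, so you never have to probe the database row by row.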
Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP

_________________________________________________________________________________________________
From: Eric Dain [mailto:ericdai...@gmail.com]
Sent: Wednesday, January 25, 2017 11:14 PM
To: user@spark.apache.org
Subject: Ingesting Large csv File to relational database

Hi,

I need to write a nightly job that ingests large csv files (~15GB each) and adds/updates/deletes the changed rows in a relational database. If a row is identical to what is in the database, I don't want to re-write the row to the database. Also, if the same item comes from multiple sources (files), I need to implement logic to choose whether the new source is preferred or the current one in the database should be kept unchanged.

Obviously, I don't want to query the database for each item to check whether the item has changed. I prefer to maintain the state inside Spark.

Is there a preferred and performant way to do that using Apache Spark?

Best,
Eric
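For the compare-and-prefer logic described in the question, a minimal sketch of the in-Spark diff might look like the following. The key column `item_id`, the `source_priority` column, and the paths are all hypothetical names introduced for illustration, and it assumes the database snapshot and the incoming files share the same columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.appName("nightly-ingest-diff").getOrCreate()

// Database snapshot previously unloaded to S3 (see the reply above).
// Column names are illustrative and assumed to match the incoming files.
val current = spark.read.option("header", "true")
  .csv("s3://my-bucket/unloads/items_*")

// Nightly CSV files; in practice you would assign source_priority per file/source.
val incoming = spark.read.option("header", "true")
  .csv("s3://my-bucket/incoming/*.csv")
  .withColumn("source_priority", lit(1))

// When the same item arrives from several files, keep only the row from the
// highest-priority source.
val byItem = Window.partitionBy("item_id").orderBy(col("source_priority").desc)
val deduped = incoming
  .withColumn("rn", row_number().over(byItem))
  .filter(col("rn") === 1)
  .drop("rn", "source_priority")

// Rows that are new or differ from the snapshot; identical rows drop out,
// so only real changes need to be written back to the database.
val changed = deduped.except(current)
```

The same idea extends to deletes (rows present in `current` but missing from `deduped`) with the `except` call reversed; how you then apply the changes back to the database depends on the target system.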