What relational DB are you using? We do this at work; the way we handle it
is to unload the database into Spark (actually, we unload it to S3 and then
read it into Spark). Redshift is very efficient at dumping tables this way.
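
For concreteness, here is a minimal sketch of what that pipeline can look like
(assuming a Redshift UNLOAD to pipe-delimited files on S3 and PySpark; the
bucket paths, column names, and the source-preference rule are placeholders,
not our production code):

# Sketch only. Assumes the current table state has already been dumped to S3
# with something like:
#   UNLOAD ('SELECT * FROM items')
#   TO 's3://my-bucket/unload/items_'
#   IAM_ROLE '<your-unload-role>'
#   DELIMITER '|' ALLOWOVERWRITE;
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("nightly-csv-ingest").getOrCreate()

# Current state of the table, unloaded from Redshift to S3 (schema is assumed
# to match the incoming CSVs).
current = (spark.read
           .option("sep", "|")
           .csv("s3://my-bucket/unload/items_*")
           .toDF("item_id", "source", "price", "updated_at"))

# Tonight's incoming CSV files (~15 GB each).
incoming = (spark.read
            .option("header", "true")
            .csv("s3://my-bucket/incoming/*.csv")
            .select("item_id", "source", "price", "updated_at"))

# Keep only rows that are not identical to something already in the database:
# a left anti-join on all columns drops the unchanged rows.
changed = incoming.join(current, on=incoming.columns, how="left_anti")

# If the same item arrives from several sources, keep one row per item_id
# according to whatever preference rule applies (the ranking here is made up).
source_rank = F.when(F.col("source") == "feed_a", 1).otherwise(2)
w = Window.partitionBy("item_id").orderBy(source_rank)
preferred = (changed
             .withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))

# Write the delta somewhere the database can bulk-load it (COPY, etc.).
preferred.write.mode("overwrite").parquet("s3://my-bucket/delta/")

The point is that the full comparison happens as one distributed join in Spark
rather than per-row lookups against the database, and the resulting delta can
then be bulk-loaded back.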



_________________________________________________________________________________________________

Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
Tel. + ▪ Mobile +

_________________________________________________________________________________________________

From: Eric Dain [mailto:ericdai...@gmail.com]
Sent: Wednesday, January 25, 2017 11:14 PM
To: user@spark.apache.org
Subject: Ingesting Large csv File to relational database

Hi,

I need to write a nightly job that ingests large CSV files (~15 GB each) and
adds/updates/deletes the changed rows in a relational database.

If a row is identical to what is in the database, I don't want to re-write it.
Also, if the same item comes from multiple sources (files), I need to
implement logic to choose whether the new source is preferred or the current
row in the database should be kept unchanged.

Obviously, I don't want to query the database for each item to check whether
it has changed. I would prefer to maintain the state inside Spark.

Is there a preferred and performant way to do that using Apache Spark?

Best,
Eric

