Also try reading about SCD (slowly changing dimensions); Hive may be a very good alternative as well for running updates on data.
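As a rough illustration of the merge step described further down in the thread, here is a minimal PySpark sketch of the row_number() approach (the bucket and column names are the ones from the mail; the load_timestamp column name and the TSV read options are assumptions):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-merge").getOrCreate()

def read_tsv(path):
    # Assumes tab-separated files with a header row.
    return spark.read.option("sep", "\t").option("header", True).csv(path)

history = read_tsv("s3a://customer.history.data/")
daily = read_tsv("s3a://customer.daily.data/")

# History rows carry no cdc_field, so add an empty one before the union.
history = history.withColumn("cdc_field", F.lit(None).cast("string"))
merged = history.unionByName(daily)

# Keep only the newest record per customer; "load_timestamp" stands in
# for whatever name the ingestion timestamp column actually has.
w = Window.partitionBy("customerid").orderBy(F.col("load_timestamp").desc())
latest = (merged
          .withColumn("version_number", F.row_number().over(w))
          .filter("version_number = 1")
          .drop("version_number"))

# Drop customers whose most recent change was a delete.
dimension = latest.filter((F.col("cdc_field").isNull()) |
                          (F.col("cdc_field") != "Customer-is-Deleted"))

(dimension.write.mode("overwrite")
    .option("sep", "\t").option("header", True)
    .csv("s3a://customer.dimension.data/"))
```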
Regards,
Gourav

On Wed, 24 Oct 2018, 14:53, <omer.ozsaka...@sony.com> wrote:

> Thank you very much 😊
>
> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
> *Date:* 24 October 2018 Wednesday 11:20
> *To:* "Ozsakarya, Omer" <omer.ozsaka...@sony.com>
> *Cc:* Spark Forum <user@spark.apache.org>
> *Subject:* Re: Triggering SQL on AWS S3 via Apache Spark
>
> This is interesting: you asked and then answered the questions (almost)
> as well.
>
> Regards,
> Gourav
>
> On Tue, 23 Oct 2018, 13:23, <omer.ozsaka...@sony.com> wrote:
>
> Hi guys,
>
> We are using Apache Spark on a local machine.
>
> I need to implement the scenario below.
>
> In the initial load:
>
> 1. The CRM application will send a file to a folder. This file contains
>    the customer information of all customers, and it sits in a folder on
>    the local server. The file name is customer.tsv.
> 2. customer.tsv contains customerid, country, birth_month,
>    activation_date, etc.
> 3. I need to read the contents of customer.tsv.
> 4. I will add current timestamp info to the file.
> 5. I will transfer customer.tsv to the S3 bucket customer.history.data.
>
> In the daily loads:
>
> 1. The CRM application will send a new file which contains the
>    updated/deleted/inserted customer information. The file name is
>    daily_customer.tsv.
> 2. daily_customer.tsv contains customerid, cdc_field, country,
>    birth_month, activation_date, etc. The cdc_field can be New-Customer,
>    Customer-is-Updated, or Customer-is-Deleted.
> 3. I need to read the contents of daily_customer.tsv.
> 4. I will add current timestamp info to the file.
> 5. I will transfer daily_customer.tsv to the S3 bucket
>    customer.daily.data.
> 6. I need to merge the two buckets customer.history.data and
>    customer.daily.data:
>    - Both buckets have timestamp fields, so I need to query all records
>      whose timestamp is the latest one.
>    - I can use row_number() over (partition by customer_id order by
>      timestamp_field desc) as version_number.
>    - Then I can put the records whose version is one into the final
>      bucket: customer.dimension.data.
>
> I am running Spark on premise.
>
> - Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs on
>   a local Spark cluster?
> - Is this approach efficient? Will the queries transfer all historical
>   data from AWS S3 to the local cluster?
> - How can I implement this scenario in a more effective way, like just
>   transferring daily data to AWS S3 and then running the queries on AWS?
>   - For instance, Athena can query data on AWS, but it is just a query
>     engine. As far as I know, I cannot call it by using an SDK and
>     cannot write the results to a bucket/folder.
>
> Thanks in advance,
> Ömer
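On the first bullet above: yes, an on-premise Spark cluster can query S3 buckets directly through the s3a connector. A minimal sketch, assuming the hadoop-aws module and its matching AWS SDK jar are on the classpath (the keys are placeholders; credentials can equally come from the environment or a credentials provider):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("query-s3-from-onprem")
         # Inline keys are only for illustration; prefer environment
         # variables or a credentials provider in real deployments.
         .config("spark.hadoop.fs.s3a.access.key", "<your-access-key>")
         .config("spark.hadoop.fs.s3a.secret.key", "<your-secret-key>")
         .getOrCreate())

customers = (spark.read
             .option("sep", "\t").option("header", True)
             .csv("s3a://customer.history.data/"))

customers.createOrReplaceTempView("customer_history")
spark.sql("""
    SELECT country, count(*) AS customer_count
    FROM customer_history
    GROUP BY country
""").show()
```

On the second bullet: with plain TSV there is no partition or column pruning, so every query pulls all the files it scans across the network to the local executors. Storing the data as partitioned Parquet would let Spark skip irrelevant partitions and columns and substantially reduce the transfer.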