It is interesting that you asked the questions and then (almost) answered them yourself.

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23, <omer.ozsaka...@sony.com> wrote:

> Hi guys,
>
> We are using Apache Spark on a local machine.
>
> I need to implement the scenario below.
>
> In the initial load:
>
>    1. CRM application will send a file to a folder. This file contains
>    customer information of all customers. This file is in a folder in the
>    local server. File name is: customer.tsv
>       1. Customer.tsv contains customerid, country, birty_month,
>       activation_date etc
>    2. I need to read the contents of customer.tsv.
>    3. I will add current timestamp info to the file.
>    4. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
> In the daily loads:
>
>    1. CRM application will send a new file which contains the
>    updated/deleted/inserted customer information.
>
>    File name is daily_customer.tsv
>
>       1. Daily_customer.tsv contains customerid, cdc_field,
>       country, birty_month, activation_date etc
>
>    Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>
>    1. I need to read the contents of daily_customer.tsv.
>    2. I will add current timestamp info to the file.
>    3. I will transfer daily_customer.tsv to the S3 bucket:
>    customer.daily.data
>    4. I need to merge two buckets customer.history.data and
>    customer.daily.data.
>       1. Two buckets have timestamp fields. So I need to query all
>       records whose timestamp is the last timestamp.
>       2. I can use row_number() over(partition by customer_id order by
>       timestamp_field desc) as version_number
>       3. Then I can put the records whose version is one, to the final
>       bucket: customer.dimension.data
>
> I am running Spark on premise.
>
>    - Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD
>    on a local Spark cluster?
>    - Is this approach efficient? Will the queries transfer all historical
>    data from AWS S3 to the local cluster?
>    - How can I implement this scenario in a more effective way? Like just
>    transferring daily data to AWS S3 and then running queries on AWS.
>    - For instance Athena can query on AWS. But it is just a query
>    engine. As far as I know, I cannot call it by using an sdk and I
>    cannot write the results to a bucket/folder.
>
> Thanks in advance,
>
> Ömer
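
For what it is worth, the merge step in the quoted message (keep the row whose version_number is one per customer) can be sketched in plain Python without a cluster. This is only an illustration of the logic, not a Spark job: the column name load_ts, the sample rows, and the fixed timestamps are made up for the example, and dropping customers whose newest row is Customer-is-Deleted is my assumption, since the original message only says to keep the latest version.

```python
import csv
import io
from datetime import datetime, timezone

DELETED = "Customer-is-Deleted"  # cdc_field value taken from the quoted message


def read_tsv(text, load_ts):
    """Parse TSV text and stamp every row with the load timestamp."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["load_ts"] = load_ts  # hypothetical column name
    return rows


def latest_per_customer(rows):
    """Keep the newest row per customerid (the version_number = 1 row),
    then drop customers whose newest row is a delete (an assumption)."""
    newest = {}
    for row in rows:
        key = row["customerid"]
        if key not in newest or row["load_ts"] > newest[key]["load_ts"]:
            newest[key] = row
    return {k: r for k, r in newest.items() if r.get("cdc_field") != DELETED}


# Made-up sample data standing in for customer.tsv and daily_customer.tsv.
history_tsv = "customerid\tcountry\n1\tTR\n2\tDE\n3\tJP\n"
daily_tsv = (
    "customerid\tcdc_field\tcountry\n"
    "2\tCustomer-is-Updated\tFR\n"
    "3\tCustomer-is-Deleted\tJP\n"
    "4\tNew-Customer\tUS\n"
)

# Fixed timestamps keep the example repeatable; a real load would use
# datetime.now(timezone.utc) at read time.
history = read_tsv(history_tsv, datetime(2018, 10, 22, tzinfo=timezone.utc))
daily = read_tsv(daily_tsv, datetime(2018, 10, 23, tzinfo=timezone.utc))
dimension = latest_per_customer(history + daily)

for customer_id in sorted(dimension):
    print(customer_id, dimension[customer_id]["country"])
```

In Spark the same selection would be the row_number() window from the quoted message followed by a filter on version_number = 1; the dictionary here just plays the role of the partition by customerid.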