Thank you very much 😊

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 24 October 2018 Wednesday 11:20
To: "Ozsakarya, Omer" <omer.ozsaka...@sony.com>
Cc: Spark Forum <user@spark.apache.org>
Subject: Re: Triggering SQL on AWS S3 via Apache Spark

This is interesting: you asked and then answered the questions (almost) as well.

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23, <omer.ozsaka...@sony.com> wrote:
Hi guys,

We are using Apache Spark on a local machine.

I need to implement the scenario below.

In the initial load:

  1.  The CRM application will send a file to a folder on the local server. This file contains the customer information for all customers. The file name is customer.tsv.

     *   customer.tsv contains customerid, country, birth_month, activation_date, etc.

  2.  I need to read the contents of customer.tsv.
  3.  I will add current timestamp info to the file.
  4.  I will transfer customer.tsv to the S3 bucket: customer.history.data
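
Something like the following is what I have in mind for the initial load (a minimal Scala sketch, assuming the s3a connector and credentials are already configured, that the history is stored as Parquet with an added load_timestamp column, and that the local path and the "customers/" prefix are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder()
  .appName("InitialCustomerLoad")
  .getOrCreate()

// Read the full customer extract delivered by the CRM application
// (placeholder local path).
val customers = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("file:///data/crm/customer.tsv")

// Add a load timestamp so later loads can be versioned against it.
val customersWithTs = customers.withColumn("load_timestamp", current_timestamp())

// Write to the history bucket.
customersWithTs.write
  .mode("overwrite")
  .parquet("s3a://customer.history.data/customers/")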

In the daily loads:

  1.  The CRM application will send a new file which contains the updated/deleted/inserted customer information. The file name is daily_customer.tsv.

     *   daily_customer.tsv contains customerid, cdc_field, country, birth_month, activation_date, etc.
     *   The cdc_field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.

  2.  I need to read the contents of daily_customer.tsv.
  3.  I will add current timestamp info to the file.
  4.  I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  5.  I need to merge the two buckets, customer.history.data and customer.daily.data (sketched below).

     *   Both buckets have timestamp fields, so I need to query all records whose timestamp is the latest one.
     *   I can use row_number() over (partition by customer_id order by timestamp_field desc) as version_number.
     *   Then I can put the records whose version_number is 1 into the final bucket: customer.dimension.data
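
For the daily load and merge, I am thinking of something like this (again a Scala sketch; it assumes the same Parquet layout and load_timestamp column as above, that the historical rows can be treated as New-Customer so the schemas line up, and that deleted customers should be dropped from the final dimension):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, current_timestamp, lit, row_number}

val spark = SparkSession.builder()
  .appName("DailyCustomerMerge")
  .getOrCreate()

// Read the daily CDC file, stamp it, and land it in the daily bucket.
val daily = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("file:///data/crm/daily_customer.tsv")
  .withColumn("load_timestamp", current_timestamp())

daily.write.mode("append").parquet("s3a://customer.daily.data/customers/")

// The initial load has no cdc_field, so add a default before the union
// (assumption: every historical row is treated as New-Customer).
val history = spark.read
  .parquet("s3a://customer.history.data/customers/")
  .withColumn("cdc_field", lit("New-Customer"))

// unionByName requires Spark 2.3 or later.
val all = history.unionByName(daily)

// Keep only the latest version of each customer.
val w = Window.partitionBy("customerid").orderBy(col("load_timestamp").desc)

val latest = all
  .withColumn("version_number", row_number().over(w))
  .filter(col("version_number") === 1)
  .drop("version_number")

// Drop customers whose latest state is a delete, then write the dimension.
latest
  .filter(col("cdc_field") =!= "Customer-is-Deleted")
  .write
  .mode("overwrite")
  .parquet("s3a://customer.dimension.data/customers/")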

I am running Spark on-premises.

  *   Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs on a local Spark cluster? (See the configuration sketch after this list.)
  *   Is this approach efficient? Will the queries transfer all of the historical data from AWS S3 to the local cluster?
  *   How can I implement this scenario in a more effective way? For example, by just transferring the daily data to AWS S3 and then running the queries on AWS.

     *   For instance, Athena can query on AWS, but it is just a query engine. As far as I know, I cannot call it by using an SDK and I cannot write the results to a bucket/folder.
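
For the first question, I assume a local Spark session can read s3a:// paths directly once the hadoop-aws connector and credentials are in place; a minimal sketch (the connector version and the environment-variable credentials are assumptions that need to match the local Hadoop build):

import org.apache.spark.sql.SparkSession

// Requires hadoop-aws (and its AWS SDK dependency) on the classpath,
// e.g. spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.7
// (version is an assumption; it must match the local Hadoop build).
val spark = SparkSession.builder()
  .appName("QueryS3FromLocalCluster")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// Spark SQL over the history bucket works the same way as over local files,
// but whatever survives column/partition pruning is pulled over the network
// to the local cluster.
spark.read.parquet("s3a://customer.history.data/customers/")
  .createOrReplaceTempView("customer_history")

spark.sql("SELECT country, COUNT(*) FROM customer_history GROUP BY country").show()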

Thanks in advance,
Ömer



