Also try to read about SCD (slowly changing dimensions); Hive may be a very
good alternative as well for running updates on data.

Regards,
Gourav

On Wed, 24 Oct 2018, 14:53 , <omer.ozsaka...@sony.com> wrote:

> Thank you very much 😊
>
>
>
> *From: *Gourav Sengupta <gourav.sengu...@gmail.com>
> *Date: *24 October 2018 Wednesday 11:20
> *To: *"Ozsakarya, Omer" <omer.ozsaka...@sony.com>
> *Cc: *Spark Forum <user@spark.apache.org>
> *Subject: *Re: Triggering sql on AWS S3 via Apache Spark
>
>
>
> This is interesting: you asked and then answered the questions (almost)
> yourself as well.
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Tue, 23 Oct 2018, 13:23 , <omer.ozsaka...@sony.com> wrote:
>
> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>    1. The CRM application will send a file to a folder. This file contains
>    the customer information of all customers. The file is in a folder on
>    the local server. The file name is: customer.tsv
>       - customer.tsv contains customerid, country, birth_month,
>       activation_date, etc.
>    2. I need to read the contents of customer.tsv.
>    3. I will add the current timestamp to the data.
>    4. I will transfer customer.tsv to the S3 bucket: customer.history.data
>    (a rough sketch of these steps is below).
>
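> What I have in mind for the initial load is roughly the following (just a
> sketch: it assumes the hadoop-aws / s3a connector is on the classpath and
> that S3 credentials are already configured, the local input path is a
> placeholder, and I write the stamped data out as Parquet rather than raw
> TSV on the assumption that it makes the later merge cheaper to read):
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.functions.current_timestamp
>
>     val spark = SparkSession.builder()
>       .appName("customer-initial-load")
>       .getOrCreate()
>
>     // customer.tsv is tab separated; assuming it has a header row
>     val customers = spark.read
>       .option("sep", "\t")
>       .option("header", "true")
>       .csv("/data/inbox/customer.tsv")        // placeholder local path
>       .withColumn("load_timestamp", current_timestamp())
>
>     // write the full customer snapshot to the history bucket
>     customers.write
>       .mode("overwrite")
>       .parquet("s3a://customer.history.data/customer/")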
>
>
> In the daily loads:
>
>    1. The CRM application will send a new file which contains the
>    updated/deleted/inserted customer information. The file name is
>    daily_customer.tsv
>       - daily_customer.tsv contains customerid, cdc_field, country,
>       birth_month, activation_date, etc.
>       - The cdc_field can be New-Customer, Customer-is-Updated, or
>       Customer-is-Deleted.
>    2. I need to read the contents of daily_customer.tsv.
>    3. I will add the current timestamp to the data.
>    4. I will transfer daily_customer.tsv to the S3 bucket:
>    customer.daily.data
>    5. I need to merge the two buckets customer.history.data and
>    customer.daily.data:
>       - Both buckets have timestamp fields, so I need to query, per
>       customer, the record with the latest timestamp.
>       - I can use row_number() over (partition by customer_id order by
>       timestamp_field desc) as version_number.
>       - Then I can put the records whose version_number is 1 into the
>       final bucket: customer.dimension.data (see the sketch below).
>
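> For step 5 I am thinking of something along these lines (a sketch: the
> column and bucket names are the ones listed above, the local input path is
> a placeholder, and unionByName with allowMissingColumns needs Spark 3.1+):
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.expressions.Window
>     import org.apache.spark.sql.functions.{col, current_timestamp, row_number}
>
>     val spark = SparkSession.builder().appName("customer-daily-merge").getOrCreate()
>
>     // stamp today's CDC file and append it to the daily bucket
>     val daily = spark.read
>       .option("sep", "\t").option("header", "true")
>       .csv("/data/inbox/daily_customer.tsv")   // placeholder local path
>       .withColumn("load_timestamp", current_timestamp())
>     daily.write.mode("append").parquet("s3a://customer.daily.data/customer/")
>
>     // merge: the latest record per customer wins, deleted customers are dropped
>     val history = spark.read.parquet("s3a://customer.history.data/customer/")
>     val all = history.unionByName(
>       spark.read.parquet("s3a://customer.daily.data/customer/"),
>       allowMissingColumns = true)              // history rows have no cdc_field
>
>     val w = Window.partitionBy("customerid").orderBy(col("load_timestamp").desc)
>     val latest = all
>       .withColumn("version_number", row_number().over(w))
>       .filter(col("version_number") === 1)
>       .filter(col("cdc_field").isNull || col("cdc_field") =!= "Customer-is-Deleted")
>       .drop("version_number", "cdc_field")
>
>     latest.write.mode("overwrite").parquet("s3a://customer.dimension.data/customer/")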
>
>
> I am running Spark on premise.
>
>    - Can I query the AWS S3 buckets by using Spark SQL / DataFrames or
>    RDDs on a local Spark cluster? (I sketched below what I assume the
>    basic setup would look like.)
>    - Is this approach efficient? Will the queries transfer all of the
>    historical data from AWS S3 to the local cluster?
>    - How can I implement this scenario in a more effective way? For
>    example, by transferring only the daily data to AWS S3 and then running
>    the queries on AWS.
>
>
>    - For instance, Athena can run queries on AWS, but it is just a query
>    engine. As far as I know, I cannot call it by using an SDK and I cannot
>    write the results to a bucket/folder.
>
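> For the first question, I assume the setup on the local cluster would look
> roughly like this (a sketch: the credential values are placeholders, and
> the hadoop-aws and AWS SDK jars have to be on the classpath):
>
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .appName("query-s3-from-onprem")
>       .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")   // placeholder
>       .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")   // placeholder
>       .getOrCreate()
>
>     // query the final dimension bucket with Spark SQL from the local cluster
>     val dim = spark.read.parquet("s3a://customer.dimension.data/customer/")
>     dim.createOrReplaceTempView("customer_dimension")
>     spark.sql("select country, count(*) from customer_dimension group by country").show()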
>
>
> Thanks in advance,
>
> Ömer
>
