Re: Triggering sql on Was S3 via Apache Spark

Gourav Sengupta Wed, 24 Oct 2018 01:20:54 -0700

This is interesting you asked and then answered the questions (almost) as
well


Regards,
Gourav

On Tue, 23 Oct 2018, 13:23 , <omer.ozsaka...@sony.com> wrote:

> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>    1. CRM application will send a file to a folder. This file contains
>    customer information of all customers. This file is in a folder in the
>    local server. File name is: customer.tsv
>       1. Customer.tsv contains customerid, country, birty_month,
>       activation_date etc
>    2. I need to read the contents of customer.tsv.
>    3. I will add current timestamp info to the file.
>    4. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
>
>
> In the daily loads:
>
>    1.  CRM application will send a new file which contains the
>    updated/deleted/inserted customer information.
>
>   File name is daily_customer.tsv
>
>    1. Daily_customer.tsv contains contains customerid, cdc_field,
>       country, birty_month, activation_date etc
>
> Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>
>    1. I need to read the contents of daily_customer.tsv.
>    2. I will add current timestamp info to the file.
>    3. I will transfer daily_customer.tsv to the S3 bucket:
>    customer.daily.data
>    4. I need to merge two buckets customer.history.data and
>    customer.daily.data.
>       1. Two buckets have timestamp fields. So I need to query all
>       records whose timestamp is the last timestamp.
>       2. I can use row_number() over(partition by customer_id order by
>       timestamp_field desc) as version_number
>       3. Then I can put the records whose version is one, to the final
>       bucket: customer.dimension.data
>
>
>
> I am running Spark on premise.
>
>    - Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD
>    on a local Spark cluster?
>    - Is this approach efficient? Will the queries transfer all historical
>    data from AWS S3 to the local cluster?
>    - How can I implement this scenario in a more effective way? Like just
>    transferring daily data to AWS S3 and then running queries on AWS.
>       - For instance Athena can query on AWS. But it is just a query
>       engine. As I know I can not call it by using an sdk and I can not write 
> the
>       results to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer
>
>
>
>
>
>
>
>
>

Re: Triggering sql on Was S3 via Apache Spark

Reply via email to