Hi Omer,
Here are a couple of solutions you can implement for your use case:
*Option 1:*
You can mount the S3 bucket as a local file system.
Here are the details:
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
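
With the bucket mounted, Spark can read it like any local path. A minimal PySpark sketch, assuming the bucket is mounted at /mnt/customer-data and the TSV has a header row (the mount point and the load_ts column name are just placeholders):

# Sketch only: read the mounted TSV and stamp it with the load time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-initial-load").getOrCreate()

customers = (spark.read
             .option("sep", "\t")
             .option("header", "true")
             .csv("file:///mnt/customer-data/customer.tsv"))  # assumed mount path

customers = customers.withColumn("load_ts", F.current_timestamp())
customers.show(5)
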
*Option 2:*
You can use AWS Glue for your use case.
Here are the details:
https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
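
Whichever way Glue connects to your source (the post above covers the on-premises connection setup), the job body itself is ordinary PySpark. A minimal job skeleton, just as a sketch; the bucket names come from your mail, everything else is assumed:

# Minimal Glue job skeleton (a sketch, not a complete job).
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Glue runs inside AWS, so plain s3:// paths work without extra jars.
daily = (spark.read
         .option("sep", "\t")
         .option("header", "true")
         .csv("s3://customer.daily.data/daily_customer.tsv"))

daily.write.mode("overwrite").parquet("s3://customer.dimension.data/staging/")  # assumed staging location
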

*Option 3:*
Store the file in the local file system and later push it to the S3 bucket.
Here are the details:
https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
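
This is probably the simplest starting point from an on-premises server. A minimal sketch, assuming the local drop folder is /data/crm and credentials come from the default boto3 chain; the bucket name is taken from your scenario, everything else is a placeholder:

# Sketch: enrich the daily file locally, then push it to S3 with boto3.
import glob
import boto3
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-customer-push").getOrCreate()

daily = (spark.read
         .option("sep", "\t")
         .option("header", "true")
         .csv("/data/crm/daily_customer.tsv"))  # assumed local drop folder

daily = daily.withColumn("load_ts", F.current_timestamp())

# Write a single TSV locally (coalesce(1) is fine for modest daily volumes).
(daily.coalesce(1)
      .write.mode("overwrite")
      .option("sep", "\t")
      .option("header", "true")
      .csv("/data/out/daily_customer_ts"))

# Spark names the output part file itself, so pick it up with a glob.
part_file = glob.glob("/data/out/daily_customer_ts/part-*.csv")[0]
boto3.client("s3").upload_file(part_file, "customer.daily.data",
                               "daily_customer.tsv")
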

Thanks,
Divya

On Tue, 23 Oct 2018 at 15:53, <omer.ozsaka...@sony.com> wrote:

> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>    1. The CRM application will send a file to a folder on the local
>    server. This file contains the customer information of all customers.
>    The file name is customer.tsv.
>       1. Customer.tsv contains customerid, country, birty_month,
>       activation_date, etc.
>    2. I need to read the contents of customer.tsv.
>    3. I will add the current timestamp info to the file.
>    4. I will transfer customer.tsv to the S3 bucket customer.history.data.
>
>
>
> In the daily loads:
>
>    1. The CRM application will send a new file which contains the
>    updated/deleted/inserted customer information. The file name is
>    daily_customer.tsv.
>       1. Daily_customer.tsv contains customerid, cdc_field, country,
>       birty_month, activation_date, etc. The cdc_field can be
>       New-Customer, Customer-is-Updated, or Customer-is-Deleted.
>
>    2. I need to read the contents of daily_customer.tsv.
>    3. I will add the current timestamp info to the file.
>    4. I will transfer daily_customer.tsv to the S3 bucket
>    customer.daily.data.
>    5. I need to merge the two buckets customer.history.data and
>    customer.daily.data.
>       1. Both buckets have timestamp fields, so I need to query the
>       records with the latest timestamp.
>       2. I can use row_number() over (partition by customer_id order by
>       timestamp_field desc) as version_number.
>       3. Then I can put the records whose version_number is 1 into the
>       final bucket: customer.dimension.data.
>
>
>
> I am running Spark on premises.
>
>    - Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs
>    on a local Spark cluster?
>    - Is this approach efficient? Will the queries transfer all historical
>    data from AWS S3 to the local cluster?
>    - How can I implement this scenario in a more effective way, such as
>    transferring only the daily data to AWS S3 and then running the
>    queries on AWS?
>       - For instance, Athena can query data on AWS, but it is just a
>       query engine. As far as I know, I cannot call it through an SDK and
>       I cannot write the results to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer
>
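
For the merge step you describe above (the row_number() over a partition by customer approach), here is a minimal PySpark sketch of the dedup. It assumes the timestamp column is called load_ts, the key is customerid, and that the local cluster has hadoop-aws on the classpath so it can use s3a:// paths (allowMissingColumns needs Spark 3.1+); the bucket names come from your mail:

# Sketch: union history and daily data, keep the latest record per
# customer, and write the result to the final bucket.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-merge").getOrCreate()

def read_tsv(path):
    return (spark.read
            .option("sep", "\t")
            .option("header", "true")
            .csv(path))

history = read_tsv("s3a://customer.history.data/")
daily = read_tsv("s3a://customer.daily.data/")

# The daily feed has an extra cdc_field column, hence allowMissingColumns.
merged = history.unionByName(daily, allowMissingColumns=True)

# version_number = 1 is the most recent record per customer.
w = Window.partitionBy("customerid").orderBy(F.col("load_ts").desc())
latest = (merged
          .withColumn("version_number", F.row_number().over(w))
          .filter(F.col("version_number") == 1)
          .drop("version_number"))

latest.write.mode("overwrite").parquet("s3a://customer.dimension.data/")

Note that this pulls both buckets over the network into the on-premises cluster, which relates to your second question; running the same merge inside AWS (for example as a Glue job, option 2) avoids transferring the history data.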
