[jira] [Commented] (HUDI-311) Support AWS DMS source on DeltaStreamer

2020-02-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028629#comment-17028629
 ] 

leesf commented on HUDI-311:


Fixed via master: 350b0ecb4d137411c6231a1568add585c6d7b7d5

> Support AWS DMS source on DeltaStreamer
> ---
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change 
> logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-311) Support AWS DMS source on DeltaStreamer

2019-11-26 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982449#comment-16982449
 ] 

Vinoth Chandar commented on HUDI-311:
-

This is what we get as parquet files on S3 for a bulk load followed by an 
insert, update, delete sequence off a MySQL CDC:

{code}
scala> spark.read.parquet("file:///home/vinoth/Downloads/LOAD0001.parquet").show(10, false)
+-------+----------+-------------+-------------+
|user_id|first_name|last_name    |company      |
+-------+----------+-------------+-------------+
|1      |vinoth    |chandar      |confluent inc|
|2      |balaji    |varadarajan  |uber         |
|3      |sudha     |saktheeswaran|uber         |
+-------+----------+-------------+-------------+


scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-124151666.parquet").show(10, false)
+---+-------+----------+-----------+---------+
|Op |user_id|first_name|last_name  |company  |
+---+-------+----------+-----------+---------+
|I  |4      |prasanna  |rajaperumal|snowflake|
+---+-------+----------+-----------+---------+


scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-124528981.parquet").show(10, false)
+---+-------+----------+---------+-------+
|Op |user_id|first_name|last_name|company|
+---+-------+----------+---------+-------+
|U  |1      |vinoth    |chandar  |       |
+---+-------+----------+---------+-------+


scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-125001909.parquet").show(10, false)
+---+-------+----------+-----------+---------+
|Op |user_id|first_name|last_name  |company  |
+---+-------+----------+-----------+---------+
|D  |4      |prasanna  |rajaperumal|snowflake|
+---+-------+----------+-----------+---------+

scala> 
{code}

We need:
- a special payload implementation that looks at the Op type and issues deletes
- a custom SQL transformer that can add the Op column if it is not present (it 
seems to be absent from the bulk load schema)


cc [~uditme] [~rbhartia] Seems doable. Just FYI.

> Support AWS DMS source on DeltaStreamer
> ---
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.5.1
>
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change 
> logs. 


