Thanks for starting the thread, Minh!

We do the same thing at Uber, actually. It's handy to join these two at times,
and it's a common pattern.
So I'm curious to know what others think.

The DeltaStreamer option seems like a good idea. There are some implementation
considerations, such as how we configure this second table,
but we can figure those out on the PR/JIRA.

> Can we update both tables transactionally? This would be a nice
> property to have. The current 2-job pattern does not support this.
It's achievable with some caveats. For example, you can write to both datasets,
then commit the second one only after the first one succeeds. If the second
commit fails, we restore/roll back the first one. Note that, technically
speaking, some queries may have already picked up the first commit's changes
(the race window will be small). General support for this needs more work,
such as overlaying timelines. You are welcome to take this on if you are
interested. :)
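
To make the ordering concrete, here is a rough sketch of that commit/rollback
sequence. It is only a toy in-memory model: the Table class, write_both, and
the fail_on_commit flag are hypothetical names for illustration, not Hudi APIs.

```python
class Table:
    """Toy stand-in for a dataset with a commit timeline (hypothetical, not a Hudi API)."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit  # lets us simulate a failed commit
        self.commits = []  # list of (instant, records)

    def commit(self, instant, records):
        if self.fail_on_commit:
            raise RuntimeError(f"commit {instant} failed on {self.name}")
        self.commits.append((instant, list(records)))

    def rollback(self, instant):
        # Drop the commit with this instant, if present.
        self.commits = [c for c in self.commits if c[0] != instant]


def write_both(first, second, instant, records):
    """Commit to `first`, then `second`; undo `first` if `second` fails.

    Returns True only when both commits landed. Readers of `first` can still
    observe the commit between the two steps, so this is not a true
    cross-table transaction.
    """
    first.commit(instant, records)
    try:
        second.commit(instant, records)
    except Exception:
        first.rollback(instant)
        return False
    return True
```

Anything that reads the first table between its commit and the rollback sees
data that later disappears, which is the small race window mentioned above.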

> Can we share the Avro logs? This might save some time as well
> as achieving the transactionality mentioned above but it increases
> complexity.
Yes. It would change the core models and design a lot. In some cases, the
logs may not even be the same across these tables. For example, if you take
the HBase data model, you might get new cells out of your change stream, which
is the raw change log. The snapshot/row table can hold either cells or full
row images in its Avro log, depending on where you want to pay the cost of
the merge. Let me know what you think.
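
As a small illustration of that trade-off, here is a sketch with made-up data.
The (column, timestamp, value) cell layout and the merge_cells helper are
assumptions for the example, not the HBase or Hudi formats.

```python
def merge_cells(base_row, cells):
    """Apply cell-level changes to a row; the latest timestamp per column wins.

    Assumes every cell is newer than the values already in base_row.
    """
    row = dict(base_row)
    latest = {}
    for column, ts, value in cells:
        if column not in latest or ts >= latest[column]:
            latest[column] = ts
            row[column] = value
    return row


base = {"id": "u1", "name": "Ann", "city": "Hanoi"}
cells = [("city", 2, "Hue"), ("name", 1, "Anne"), ("city", 3, "Da Nang")]

# Option A: the log holds raw cells, so the reader pays the merge cost.
cell_log = list(cells)
row_at_read = merge_cells(base, cell_log)

# Option B: the writer merges in timestamp order and logs full row images,
# so the reader just takes the latest image.
image_log = []
current = base
for cell in sorted(cells, key=lambda c: c[1]):
    current = merge_cells(current, [cell])
    image_log.append(current)
row_from_images = image_log[-1]
```

Both options end at the same row; they differ in log size and in whether the
writer or the reader does the merge work.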



On Mon, May 6, 2019 at 10:19 PM Minh Pham <m...@csscompany.com> wrote:

> Hi,
>
> A common pattern that I see is having 1 Kafka topic for data change events
> and 2 Hudi ingestion jobs (1 in insert mode and 1 in upsert mode). This
> creates 2 tables, 1 with all raw data change events and 1 with the latest
> snapshot of data.
>
> What do you guys think about adding support for this as an option in
> DeltaStreamer?
>
> There are some complications to consider:
> - Can we update both tables transactionally? This would be a nice property
> to have. The current 2-job pattern does not support this.
> - Can we share the Avro logs? This might save some time as well as
> achieving the transactionality mentioned above but it increases complexity.
>
> Best,
> Minh
>
