Very good points, Gwen.  I hadn't thought of the Oracle Streams case of
dependencies.  I wonder if GoldenGate handles this better?

The tradeoff of these approaches is that each RDBMS has its own proprietary
way of exposing this CDC information.  I guess GoldenGate can serve as a
standard interface across RDBMSs, but there really isn't anything covering
NoSQL stores like HBase, Cassandra, and MongoDB.

I wish we poor analytics guys had more say in which OLTP stores the
application teams use :)  LinkedIn solved this pretty well by having their
teams use Espresso, which has a nice CDC pattern with a MySQL engine under
the covers.


On Sun, May 10, 2015 at 12:48 AM, Gwen Shapira <gshap...@cloudera.com>
wrote:

> Hi Jonathan,
>
> I agree we can have topic-per-table, but some transactions may span
> multiple tables and therefore will get applied partially out-of-order. I
> suspect this can be a consistency issue and create a state that is
> different than the state in the original database, but I don't have good
> proof of it.
>
> I know that Oracle Streams has a "Parallel Apply" feature where it figures
> out whether transactions have dependencies and applies them in parallel
> only if they don't. So it sounds like dependencies may be an issue.
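
To make the dependency point concrete, here is a toy sketch of what a
dependency-aware apply could look like.  This is only my illustration, not
Oracle's actual algorithm; the Txn type, row keys, and thread pool are made
up.  A transaction is submitted only after every in-flight transaction that
touches an overlapping set of rows has finished, so independent transactions
run in parallel and dependent ones fall back to log order:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelApplySketch {
        // Hypothetical unit of change: a transaction id plus the rows it touches.
        record Txn(long id, Set<String> rowKeys) {}

        public static void main(String[] args) throws Exception {
            List<Txn> log = List.of(
                    new Txn(1, Set.of("orders:42")),
                    new Txn(2, Set.of("users:7")),      // independent of txn 1
                    new Txn(3, Set.of("orders:42")));   // touches the same row as txn 1

            ExecutorService pool = Executors.newFixedThreadPool(4);
            Map<Txn, Future<?>> inFlight = new HashMap<>();

            for (Txn txn : log) {
                // Wait for any in-flight transaction that shares a row key, so
                // dependent transactions are applied in log order.
                for (Map.Entry<Txn, Future<?>> entry : inFlight.entrySet()) {
                    if (!Collections.disjoint(entry.getKey().rowKeys(), txn.rowKeys())) {
                        entry.getValue().get();
                    }
                }
                inFlight.put(txn, pool.submit(() ->
                        System.out.println("applying txn " + txn.id())));
            }
            pool.shutdown();
        }
    }
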
>
> Planning to give this more thought :)
>
> Gwen
>
> On Fri, May 1, 2015 at 7:56 PM, Jonathan Hodges <hodg...@gmail.com> wrote:
>
> > Hi Gwen,
> >
> > As you said I see Bottled Water and Sqoop managing slightly different use
> > cases so I don't see this feature as a Sqoop killer.  However I did have a
> > question on your comment that the transaction log or CDC approach will
> > have problems with very large, very active databases.
> >
> > I get that you need a single producer that transmits the transaction log
> > changes to Kafka in order.  However, on the consumer side you can have a
> > topic per table and then partition these topics by primary key to achieve
> > nice parallelism.  So it seems the producer is the potential bottleneck,
> > but I imagine you can scale it vertically as needed and put proper HA in
> > place.
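
For what it's worth, here is a minimal sketch of the shape I have in mind on
the producer side (the topic naming, the Change type, and the sample payload
are made up for illustration): one producer reads the log in commit order and
writes each change to a per-table topic, keyed by primary key, so the
consumer side gets partition-level parallelism while per-key order is
preserved.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CdcShipperSketch {
        // Hypothetical change record; a real CDC reader would supply this.
        static class Change {
            final String table, primaryKey, payload;
            Change(String table, String primaryKey, String payload) {
                this.table = table; this.primaryKey = primaryKey; this.payload = payload;
            }
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // A single producer preserves transaction-log order; each table
            // gets its own topic and records are keyed by primary key, so the
            // default partitioner spreads them across partitions while keeping
            // per-key order.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                Change change = new Change("orders", "42", "{\"status\":\"shipped\"}");
                producer.send(new ProducerRecord<>("cdc." + change.table,
                        change.primaryKey, change.payload));
            }
        }
    }
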
> >
> > Would love to hear your thoughts on this.
> >
> > Jonathan
> >
> >
> >
> > On Thu, Apr 30, 2015 at 5:09 PM, Gwen Shapira <gshap...@cloudera.com>
> > wrote:
> >
> > > I feel a need to respond to the Sqoop-killer comment :)
> > >
> > > 1) Note that most databases have a single transaction log per DB, and in
> > > order to get the correct view of the DB you need to read it in order
> > > (otherwise transactions will get messed up). This means you are limited
> > > to a single producer reading data from the log, writing it to a single
> > > partition, and having it read by a single consumer. If the database is
> > > very large and very active, you may run into some issues there...
> > >
> > > Because Sqoop doesn't try to catch up with all the changes, but takes a
> > > snapshot (from multiple mappers in parallel), we can very rapidly Sqoop
> > > 10TB databases.
> > >
> > > 2) If HDFS is the target of getting data from Postgres, then postgresql
> > > -> kafka -> HDFS seems less optimal than postgresql -> HDFS directly (in
> > > parallel). There are good reasons to get Postgres data to Kafka, but if
> > > the eventual goal is HDFS (or HBase), I suspect Sqoop still has a place.
> > >
> > > 3) Due to its parallelism and general-purpose JDBC connector, I suspect
> > > that Sqoop may even be a very viable way of getting data into Kafka.
> > >
> > > Gwen
> > >
> > >
> > > On Thu, Apr 30, 2015 at 2:27 PM, Jan Filipiak <jan.filip...@trivago.com>
> > > wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I am quite excited about the recent example of replicating PostgreSQL
> > > > changes to Kafka. My view of the log compaction feature had always been
> > > > a very sceptical one, but now that its great potential has been exposed
> > > > to the wider public, I think it's an awesome feature. Especially when
> > > > pulling this data into HDFS as a snapshot, it is (IMO) a Sqoop killer.
> > > > So I want to thank everyone who had the vision to build these kinds of
> > > > systems at a time when I could not imagine them.
> > > >
> > > > There is one open question that I would like people to help me with.
> > > > When pulling a snapshot of a partition into HDFS using a Camus-like
> > > > application, I feel the need to keep a set of all keys read so far and
> > > > to stop as soon as I find a key that is already in my set. I use this
> > > > as an indicator of how far log compaction has already progressed, and
> > > > only pull up to this point. This works quite well, as I do not need to
> > > > keep the messages in memory but only their keys.
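
For what it's worth, that stop-at-the-first-duplicate-key idea is easy to
sketch with the Java consumer API (the topic name, partition, and the
snapshot sink below are placeholders): read the partition from the beginning,
remember every key, and stop as soon as a key repeats, since duplicate keys
can only appear in the portion of the log that compaction has not reached
yet.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class CompactedSnapshotSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("auto.offset.reset", "earliest");   // start from the log head
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(Collections.singletonList(new TopicPartition("cdc.orders", 0)));

                // Remember every key seen so far; the first repeated key means
                // we have reached the part of the log that compaction has not
                // cleaned yet, so the snapshot stops there.
                Set<String> seenKeys = new HashSet<>();
                boolean done = false;
                while (!done) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        if (!seenKeys.add(record.key())) {
                            done = true;            // key already seen -> stop pulling
                            break;
                        }
                        writeToSnapshot(record);    // placeholder for the HDFS write
                    }
                }
            }
        }

        // Stand-in for whatever actually lands the record in HDFS.
        static void writeToSnapshot(ConsumerRecord<String, String> record) {
            System.out.printf("%s -> %s%n", record.key(), record.value());
        }
    }
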
> > > >
> > > > The question I want to raise with the community is:
> > > >
> > > > How do you prevent pulling the same record twice (in different
> > > > versions), and would it be beneficial if the "OffsetResponse" also
> > > > returned the last offset that has been compacted so far, so that the
> > > > application could just pull up to that point?
> > > >
> > > > Looking forward to some recommendations and comments.
> > > >
> > > > Best
> > > > Jan
> > > >
> > > >
> > >
> >
>
