Thanks Harsha, makes sense.

Ryanne

On Mon, Feb 4, 2019 at 5:53 PM Harsha Chintalapani <ka...@harsha.io> wrote:

> "I think you are saying that this enables additional (potentially cheaper)
> storage options without *requiring* an existing ETL pipeline. “
> Yes.
>
> " But it's not really a replacement for the sort of pipelines people build
> with Connect, Gobblin etc.”
>
> It is not. But it is also wrong to assume that everyone runs these
> pipelines to store raw Kafka data in HDFS or S3.
> The aim of this KIP is to provide tiered storage as a whole package, not
> to ask users to ship the data on their own using existing ETL, which means
> running a consumer and maintaining those pipelines.
>
> " My point was that, if you are already offloading records in an ETL
> pipeline, why do you need a new pipeline built into the broker to ship the
> same data to the same place?”
>
> As you said, it's an ETL pipeline, which means users of these pipelines
> read the data from the broker, transform it, and store it somewhere.
> The point of this KIP is to store log segments as they are, without
> changing their structure, so that we can use the existing offset mechanisms
> to look them up when a consumer needs to read old data. When you load the
> data via your existing pipelines, you are reading the topic as a whole,
> which doesn't guarantee that it lands in HDFS or S3 in the same order, and
> it leaves open who regenerates the index files.
>
>
> "So you'd end up with one of 1)cold segments are only useful to Kafka; 2)
> you have the same data written to HDFS/etc twice, once for Kafka and once
> for everything else, in two separate formats”
>
> Those are two different use cases. If someone is storing raw data out of
> Kafka for long-term access, storing the data as-is in HDFS through Kafka
> solves this issue. They do not need to run another pipeline to ship these
> logs.
>
> If they are running pipelines to store the data in HDFS in a different
> format, that's a different use case. Maybe they are transforming Kafka logs
> to ORC so that they can query them through Hive. Once you transform the log
> segment, it loses the ability to use the existing offset index.
> The main objective here is to leave the existing protocol unchanged while
> still being able to write and read logs from remote storage.
>
>
> -Harsha
>
> On Feb 4, 2019, 2:53 PM -0800, Ryanne Dolan <ryannedo...@gmail.com>,
> wrote:
> > Thanks Harsha, makes sense for the most part.
> >
> > > tiered storage is to get away from this and make this transparent to
> > > the user
> >
> > I think you are saying that this enables additional (potentially cheaper)
> > storage options without *requiring* an existing ETL pipeline. But it's not
> > really a replacement for the sort of pipelines people build with Connect,
> > Gobblin etc. My point was that, if you are already offloading records in an
> > ETL pipeline, why do you need a new pipeline built into the broker to ship
> > the same data to the same place? I think in most cases this will be an
> > additional pipeline, not a replacement, because the segments written to
> > cold storage won't be useful outside Kafka. So you'd end up with one of 1)
> > cold segments are only useful to Kafka; 2) you have the same data written
> > to HDFS/etc twice, once for Kafka and once for everything else, in two
> > separate formats; 3) you use your existing ETL pipeline and read cold data
> > directly.
> >
> > To me, an ideal solution would let me spool segments from Kafka to any
> > sink I would like, and then let Kafka clients seamlessly access that cold
> > data. Today I can do that in the client, but ideally the broker would do it
> > for me via some HDFS/Hive/S3 plugin. The KIP seems to accomplish that --
> > just without leveraging anything I've currently got in place.
> >
> > Ryanne
> >
> > On Mon, Feb 4, 2019 at 3:34 PM Harsha <ka...@harsha.io> wrote:
> >
> > > Hi Eric,
> > > Thanks for your questions. Answers are in-line
> > >
> > > "The high-level design seems to indicate that all of the logic for
> when and
> > > how to copy log segments to remote storage lives in the RLM class. The
> > > default implementation is then HDFS specific with additional
> > > implementations being left to the community. This seems like it would
> > > require anyone implementing a new RLM to also re-implement the logic
> for
> > > when to ship data to remote storage."
> > >
> > > RLM will be responsible for shipping log segments, and it will decide
> > > when a log segment is ready to be shipped over.
> > > Once log segments are identified as rolled over, RLM will delegate this
> > > responsibility to a pluggable remote storage implementation. Users who
> > > want to add their own implementation to enable other storage systems only
> > > need to implement the copy and read mechanisms, not re-implement RLM
> > > itself.
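> > >
> > > As a rough sketch, the pluggable piece could be as small as an interface
> > > like the one below (names and signatures here are only illustrative, not
> > > the final API in the KIP):
> > >
> > >     import java.io.File;
> > >     import java.nio.ByteBuffer;
> > >     import org.apache.kafka.common.TopicPartition;
> > >
> > >     public interface RemoteStorageManager {
> > >         // Copy a rolled-over (immutable) segment and its index files to
> > >         // the remote store; returns an opaque id for later reads.
> > >         String copySegment(TopicPartition tp, File logSegment,
> > >                            File offsetIndex, File timeIndex);
> > >
> > >         // Read up to maxBytes starting at startOffset from a segment
> > >         // that was copied earlier.
> > >         ByteBuffer read(TopicPartition tp, long startOffset, int maxBytes);
> > >     }
> > >
> > > RLM itself would keep owning the "when" (watching for rolled-over
> > > segments), and only delegate the "how" to this interface.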
> > >
> > >
> > > "Would it not be better for the Remote Log Manager implementation to be
> > > non-configurable, and instead have an interface for the remote storage
> > > layer? That way the "when" of the logic is consistent across all
> > > implementations and it's only a matter of "how," similar to how the
> > > Streams StateStores are managed."
> > >
> > > It's possible that we could make RLM non-configurable. But for the
> > > initial release, and to keep backward compatibility, we want to make this
> > > configurable, so that users who are not interested in having log segments
> > > shipped to remote storage don't need to worry about it.
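> > >
> > > For example (the property names here are hypothetical, just to show the
> > > opt-in idea), a cluster that doesn't want tiered storage would simply
> > > leave the feature off in the broker config:
> > >
> > >     # disabled by default; brokers behave exactly as they do today
> > >     remote.log.storage.enable=false
> > >
> > >     # opting in would also require pointing RLM at an implementation,
> > >     # e.g. remote.log.storage.manager.class=<your plugin class>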
> > >
> > >
> > > Hi Ryanne,
> > > Thanks for your questions.
> > >
> > > "How could this be used to leverage fast key-value stores, e.g.
> Couchbase,
> > > which can serve individual records but maybe not entire segments? Or
> is the
> > > idea to only support writing and fetching entire segments? Would it
> make
> > > sense to support both?"
> > >
> > > Log segments, once rolled over, are immutable objects, and we want to
> > > keep the current structure of log segments and the corresponding index
> > > files. It is easier to copy the whole segment as it is than to re-read
> > > each file and load it into a key/value store.
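> > >
> > > Concretely, shipping a rolled-over segment could be little more than
> > > copying its on-disk files verbatim. A rough sketch (RemoteStore below is
> > > a made-up stand-in for whatever the HDFS/S3 plugin provides):
> > >
> > >     import java.io.IOException;
> > >     import java.io.InputStream;
> > >     import java.nio.file.Files;
> > >     import java.nio.file.Path;
> > >
> > >     public class SegmentShipper {
> > >
> > >         // Hypothetical sink interface implemented by the storage plugin.
> > >         public interface RemoteStore {
> > >             void put(String name, InputStream data) throws IOException;
> > >         }
> > >
> > >         // Copy the .log, .index and .timeindex files of one rolled-over
> > >         // segment byte for byte, so the existing offset/time index
> > >         // lookups keep working against the remote copy.
> > >         public static void ship(Path logDir, long baseOffset, RemoteStore store)
> > >                 throws IOException {
> > >             String base = String.format("%020d", baseOffset); // e.g. 00000000000000042000
> > >             for (String ext : new String[] {".log", ".index", ".timeindex"}) {
> > >                 Path file = logDir.resolve(base + ext);
> > >                 try (InputStream in = Files.newInputStream(file)) {
> > >                     store.put(file.getFileName().toString(), in);
> > >                 }
> > >             }
> > >         }
> > >     }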
> > >
> > > "
> > > - Instead of defining a new interface and/or mechanism to ETL segment
> files
> > > from brokers to cold storage, can we just leverage Kafka itself? In
> > > particular, we can already ETL records to HDFS via Kafka Connect,
> Gobblin
> > > etc -- we really just need a way for brokers to read these records
> back.
> > > I'm wondering whether the new API could be limited to the fetch, and
> then
> > > existing ETL pipelines could be more easily leveraged. For example, if
> you
> > > already have an ETL pipeline from Kafka to HDFS, you could leave that
> in
> > > place and just tell Kafka how to read these records/segments from cold
> > > storage when necessary."
> > >
> > > This is pretty much what everyone does, and it has the additional
> > > overhead of keeping those pipelines operating and monitored.
> > > What's proposed in the KIP is not ETL. It just looks at the logs that
> > > are written and rolled over and copies the files as they are.
> > > With a traditional ETL pipeline, each new topic needs to be onboarded
> > > (sure, we can do so via a wildcard or another mechanism) before its data
> > > is shipped into remote storage.
> > > Once the data lands somewhere like HDFS/Hive etc., users need to write
> > > another processing step to re-read this data, similar to how they do it
> > > in their stream processing pipelines. Tiered storage is to get away from
> > > this and make it transparent to the user. They don't need to run another
> > > ETL process to ship the logs.
> > >
> > > "I'm wondering if we could just add support for loading segments from
> > > remote URIs instead of from file, i.e. via plugins for s3://, hdfs://
> etc.
> > > I suspect less broker logic would change in that case -- the broker
> > > wouldn't necessarily care if it reads from file:// or s3:// to load a
> given
> > > segment."
> > >
> > > Yes, this is what we are discussing in the KIP. We are leaving the
> > > details of loading segments to the RLM read path instead of exposing this
> > > directly in the broker. This way we can keep the current Kafka code as it
> > > is, without changing its assumptions around the local disk, and let the
> > > RLM handle the remote storage part.
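> > >
> > > To illustrate, the broker-side fetch change could stay as small as an
> > > offset check that falls back to the RLM; the sketch below is only
> > > illustrative, and the LocalLog/RemoteLogManager shapes are made up:
> > >
> > >     import org.apache.kafka.common.TopicPartition;
> > >     import org.apache.kafka.common.record.Records;
> > >
> > >     public class TieredFetch {
> > >         // Stand-ins for the real broker components (hypothetical shapes).
> > >         public interface LocalLog {
> > >             long logStartOffset();
> > >             Records read(long offset, int maxBytes);
> > >         }
> > >
> > >         public interface RemoteLogManager {
> > >             Records read(TopicPartition tp, long offset, int maxBytes);
> > >         }
> > >
> > >         // Serve from the local log if the offset is still on disk,
> > >         // otherwise delegate to the RLM read path.
> > >         public static Records read(LocalLog localLog, RemoteLogManager rlm,
> > >                                    TopicPartition tp, long fetchOffset,
> > >                                    int maxBytes) {
> > >             if (fetchOffset >= localLog.logStartOffset()) {
> > >                 return localLog.read(fetchOffset, maxBytes); // unchanged path
> > >             }
> > >             return rlm.read(tp, fetchOffset, maxBytes); // new, pluggable path
> > >         }
> > >     }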
> > >
> > >
> > > Thanks,
> > > Harsha
> > >
> > >
> > >
> > > On Mon, Feb 4, 2019, at 12:54 PM, Ryanne Dolan wrote:
> > > > Harsha, Sriharsha, Suresh, a couple thoughts:
> > > >
> > > > - How could this be used to leverage fast key-value stores, e.g.
> > > > Couchbase, which can serve individual records but maybe not entire
> > > > segments? Or is the idea to only support writing and fetching entire
> > > > segments? Would it make sense to support both?
> > > >
> > > > - Instead of defining a new interface and/or mechanism to ETL segment
> > > > files from brokers to cold storage, can we just leverage Kafka itself?
> > > > In particular, we can already ETL records to HDFS via Kafka Connect,
> > > > Gobblin etc -- we really just need a way for brokers to read these
> > > > records back. I'm wondering whether the new API could be limited to the
> > > > fetch, and then existing ETL pipelines could be more easily leveraged.
> > > > For example, if you already have an ETL pipeline from Kafka to HDFS,
> > > > you could leave that in place and just tell Kafka how to read these
> > > > records/segments from cold storage when necessary.
> > > >
> > > > - I'm wondering if we could just add support for loading segments from
> > > > remote URIs instead of from file, i.e. via plugins for s3://, hdfs://
> > > > etc. I suspect less broker logic would change in that case -- the
> > > > broker wouldn't necessarily care if it reads from file:// or s3:// to
> > > > load a given segment.
> > > >
> > > > Combining the previous two comments, I can imagine a URI resolution
> > > > chain for segments. For example, first try
> > > > file:///logs/{topic}/{segment}.log, then
> > > > s3://mybucket/{topic}/{date}/{segment}.log, etc, leveraging your
> > > > existing ETL pipeline(s).
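> > > >
> > > > Something like the following sketch is what I have in mind (purely
> > > > illustrative; each scheme handler would be a plugin):
> > > >
> > > >     import java.io.InputStream;
> > > >     import java.net.URI;
> > > >     import java.util.List;
> > > >     import java.util.Optional;
> > > >
> > > >     public class SegmentUriResolver {
> > > >         // One handler per URI scheme (file://, s3://, hdfs://, ...).
> > > >         public interface SegmentSource {
> > > >             boolean canResolve(URI uri);
> > > >             Optional<InputStream> open(URI uri);
> > > >         }
> > > >
> > > >         private final List<SegmentSource> chain;
> > > >
> > > >         public SegmentUriResolver(List<SegmentSource> chain) {
> > > >             this.chain = chain;
> > > >         }
> > > >
> > > >         // Try each candidate URI in order (local file first, then S3,
> > > >         // etc.) and return the first segment that resolves.
> > > >         public Optional<InputStream> resolve(List<URI> candidates) {
> > > >             for (URI uri : candidates) {
> > > >                 for (SegmentSource source : chain) {
> > > >                     if (source.canResolve(uri)) {
> > > >                         Optional<InputStream> segment = source.open(uri);
> > > >                         if (segment.isPresent()) {
> > > >                             return segment;
> > > >                         }
> > > >                     }
> > > >                 }
> > > >             }
> > > >             return Optional.empty();
> > > >         }
> > > >     }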
> > > >
> > > > Ryanne
> > > >
> > > >
> > > > On Mon, Feb 4, 2019 at 12:01 PM Harsha <ka...@harsha.io> wrote:
> > > >
> > > > > Hi All,
> > > > > We are interested in adding tiered storage to Kafka. More details
> > > > > about motivation and design are in the KIP. We are working towards an
> > > > > initial POC. Any feedback or questions on this KIP are welcome.
> > > > >
> > > > > Thanks,
> > > > > Harsha
> > > > >
> > >
>
