Suggestion over architecture

2018-03-09 Thread adrien ruffie
Hello all,


in my company we plan to set up the following architecture for our clients:


An internal Kafka cluster inside our company, plus a webapp (our software
solution) deployed on premises for our clients.


We are thinking of creating one producer per "webapp" client, in order to push
into a global topic (in our Kafka cluster) a message that represents an email.


The idea behind this is to offload the mass-mailing work from the client webapp
and handle it ourselves on dedicated servers in our infrastructure. Each
dedicated server will be a consumer of the topic to which the messages (emails)
are streamed.
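Concretely, each webapp client's producer could be as small as this rough sketch
(the topic name, serializers and the email payload are only placeholders, nothing
is decided yet):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EmailProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.internal.example.com:9092"); // placeholder
        props.put("acks", "all"); // don't lose queued emails
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // One long-lived producer instance per webapp client; producers are
        // thread-safe, so a single instance per client application is enough.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String emailJson = "{\"to\":\"user@example.com\",\"subject\":\"hello\"}";
            // Keying by client id keeps one client's emails in order on one partition.
            producer.send(new ProducerRecord<>("mass-mailing", "client-42", emailJson));
        }
    }
}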


My main question is: do you think each client can be a producer? (if we have,
for example, 200/300 clients?)

Second question: should each client really be its own producer?

Do you have another idea for this subject?


Thank you & best regards.


Adrien


Re: committing offset metadata in kafka streams

2018-03-09 Thread Stas Chizhov
Sure. Sorry I was not clear.

Thank you!


On Sat, 10 Mar 2018 at 00:54, Matthias J. Sax wrote:

> If there is only one partition by task, processing order is guaranteed.
>
> For default partitions grouper, it depends on your program. If you read
> from multiple topics and join/merge them, a task gets multiple
> partitions (from different topics) assigned.
>
>
> -Matthias
>
> On 3/9/18 2:42 PM, Stas Chizhov wrote:
> >> Also note, that the processing order might slightly differ if you
> > process the same data twice 
> >
> > Is this still a problem when default partition grouper is used (with 1
> > partition per task)?
> >
> > Thank you,
> > Stanislav.
> >
> >
> >
> > 2018-03-09 3:19 GMT+01:00 Matthias J. Sax :
> >
> >> Thanks for the explanation.
> >>
> >> Not sure if setting the metadata you want to get committed in
> >> punctuation() would be sufficient. But I would think about it in more
> >> details if we get a KIP for this.
> >>
> >> It's correct that flushing and committing offsets is correlated. But
> >> it's not related to punctuation.
> >>
> >> Also note, that the processing order might slightly differ if you
> >> process the same data twice (it depends in which order the brokers
> >> return data on poll() and that it something Streams cannot fully
> >> control). Thus, you code would need to be "robust" against different
> >> processing orders (ie, if there are multiple input partitions, you might
> >> get data first for partition 0 and there for partition 1 or the other
> >> way round -- the order per partitions is guaranteed to be in offset
> order).
> >>
> >>
> >> -Matthias
> >>
> >>
> >>
> >> On 3/6/18 2:17 AM, Stas Chizhov wrote:
> >>> Thank you, Matthias!
> >>>
> >>> We currently do use kafka consumer and store event time highwatermarks
> as
> >>> offset metadata. This is used during backup procedure, which is to
> >> create a
> >>> snapshot of the target storage with all events up to certain timestamp
> >> and
> >>> no other.
> >>>
> >>> As for the API - I guess being able to provide partition-to-metadata
> map
> >> in
> >>> the context commit method would do it (to be called from within
> punctuate
> >>> method). BTW as far as I understand if Processor API is used flushing
> >>> producers and committing offsets is correlated and both output topic
> >> state
> >>> and committed offsets do correspond to a state at the moment of some
> >>> punctuation. Meaning that if I do have a deterministic processing
> >> topology
> >>> my output topic is going to be deterministic as well (modulo duplicates
> >> of
> >>> course).  Am I correct here?
> >>>
> >>> Best regards,
> >>> Stanislav.
> >>>
> >>>
> >>> 2018-03-05 2:31 GMT+01:00 Matthias J. Sax :
> >>>
>  You are correct. This is not possible atm.
> 
>  Note, that commits happen "under the hood" and users cannot commit
>  explicitly. Users can only "request" as commit -- this implies that
>  Kafka Streams will commit as soon as possible -- but when
>  `context#commit()` returns, the commit is not done yet (it only sets a
>  flag).
> 
>  What is your use case for this? How would you want to use this from an
>  API point of view?
> 
>  Feel free to open a feature request JIRA -- we don't have any plans to
>  add this atm -- it's the first time anybody asks for this feature. If
>  there is a JIRA, maybe somebody picks it up :)
> 
> 
>  -Matthias
> 
>  On 3/3/18 6:51 AM, Stas Chizhov wrote:
> > Hi,
> >
> > There seems to be no way to commit custom metadata along with offsets
>  from
> > within Kafka Streams.
> > Are there any plans to expose this functionality or have I missed
>  something?
> >
> > Best regards,
> > Stanislav.
> >
> 
> 
> >>>
> >>
> >>
> >
>
>


Re: Re: kafka streams with TimeWindows, incorrect result

2018-03-09 Thread 杰 杨
Thanks for your reply!
I see that it is designed to operate on an infinite, unbounded stream of data.
Now I want to process an unbounded stream, but divided by time intervals.
So what can I do to achieve this?


funk...@live.com

From: Guozhang Wang
Date: 2018-03-10 02:50
To: users
Subject: Re: kafka steams with TimeWindows ,incorrect result
Hi Jie,

This is by design of Kafka Streams, please read this doc for more details
(search for "outputs of the Wordcount application is actually a continuous
stream of updates"):

https://kafka.apache.org/0110/documentation/streams/quickstart

Note this semantics applies for both windowed and un-windowed tables.


Guozhang

On Fri, Mar 9, 2018 at 5:36 AM, 杰 杨  wrote:

> Hi:
> I used TimeWindow for aggregate data in kafka.
>
> this is code snippet ;
>
>   view.flatMap(new MultipleKeyValueMapper(client)
> ).groupByKey(Serialized.with(Serdes.String(),
> Serdes.serdeFrom(new CountInfoSerializer(), new
> CountInfoDeserializer(
> .windowedBy(TimeWindows.of(6)).reduce(new
> Reducer() {
> @Override
> public CountInfo apply(CountInfo value1, CountInfo value2) {
> return new CountInfo(value1.start + value2.start,
> value1.active + value2.active, value1.fresh + value2.fresh);
> }
> }) .toStream(new KeyValueMapper String>() {
> @Override
> public String apply(Windowed key, CountInfo value) {
> return key.key();
> }
> }).print(Printed.toSysOut());
>
> KafkaStreams streams = new KafkaStreams(builder.build(),
> KStreamReducer.getConf());
> streams.start();
>
> and I test 3 data in kafka .
> and I print key value .
>
>
> [KTABLE-TOSTREAM-07]: [9_9_2018-03-09_hour_
> 21@152060130/152060136], CountInfo{start=12179, active=12179,
> fresh=12179}
> [KTABLE-TOSTREAM-07]: 
> [9_9_2018-03-09@152060130/152060136],
> CountInfo{start=12179, active=12179, fresh=12179}
> [KTABLE-TOSTREAM-07]: [9_9_2018-03-09_hour_
> 21@152060130/152060136], CountInfo{start=3, active=3,
> fresh=3}
> [KTABLE-TOSTREAM-07]: 
> [9_9_2018-03-09@152060130/152060136],
> CountInfo{start=3, active=3, fresh=3}
> why in one window duration will be print two result but not one result ?
>
> 
> funk...@live.com
>



--
-- Guozhang


Re: [VOTE] 1.1.0 RC1

2018-03-09 Thread Jeff Chao
Great! Have a good weekend.

On Fri, Mar 9, 2018 at 4:41 PM, Ismael Juma  wrote:

> Thanks Jeff:
>
> https://github.com/apache/kafka/pull/4678
>
> Ismael
>
> On Fri, Mar 9, 2018 at 1:56 AM, Damian Guy  wrote:
>
> > Hi Jeff,
> >
> > Thanks, we will look into this.
> >
> > Regards,
> > Damian
> >
> > On Thu, 8 Mar 2018 at 18:27 Jeff Chao  wrote:
> >
> > > Hello,
> > >
> > > We at Heroku have run 1.1.0 RC1 through our normal performance and
> > > regression test suite and have found performance to be comparable to
> > 1.0.0.
> > >
> > > That said, we're however -1 (non-binding) since this release includes
> > > Zookeeper 3.4.11 
> > which
> > > is affected by the critical regression ZOOKEEPER-2960
> > > . As 3.4.12
> isn't
> > > released yet, it might be better to have 3.4.10 included instead.
> > >
> > > Jeff
> > > Heroku
> > >
> > >
> > > On Tue, Mar 6, 2018 at 1:19 PM, Ted Yu  wrote:
> > >
> > > > +1
> > > >
> > > > Checked signature
> > > > Ran test suite - apart from flaky testMetricsLeak, other tests
> passed.
> > > >
> > > > On Tue, Mar 6, 2018 at 2:45 AM, Damian Guy 
> > wrote:
> > > >
> > > > > Hello Kafka users, developers and client-developers,
> > > > >
> > > > > This is the second candidate for release of Apache Kafka 1.1.0.
> > > > >
> > > > > This is minor version release of Apache Kakfa. It Includes 29 new
> > KIPs.
> > > > > Please see the release plan for more details:
> > > > >
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.
> > > > action?pageId=71764913
> > > > >
> > > > > A few highlights:
> > > > >
> > > > > * Significant Controller improvements (much faster and session
> > > expiration
> > > > > edge cases fixed)
> > > > > * Data balancing across log directories (JBOD)
> > > > > * More efficient replication when the number of partitions is large
> > > > > * Dynamic Broker Configs
> > > > > * Delegation tokens (KIP-48)
> > > > > * Kafka Streams API improvements (KIP-205 / 210 / 220 / 224 / 239)
> > > > >
> > > > > Release notes for the 1.1.0 release:
> > > > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/
> RELEASE_NOTES.html
> > > > >
> > > > > *** Please download, test and vote by Friday, March 9th, 5pm PT
> > > > >
> > > > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > > http://kafka.apache.org/KEYS
> > > > >
> > > > > * Release artifacts to be voted upon (source and binary):
> > > > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/
> > > > >
> > > > > * Maven artifacts to be voted upon:
> > > > > https://repository.apache.org/content/groups/staging/
> > > > >
> > > > > * Javadoc:
> > > > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/javadoc/
> > > > >
> > > > > * Tag to be voted upon (off 1.1 branch) is the 1.1.0 tag:
> > > > > https://github.com/apache/kafka/tree/1.1.0-rc1
> > > > >
> > > > >
> > > > > * Documentation:
> > > > > http://kafka.apache.org/11/documentation.html
> > > > >
> > > > > * Protocol:
> > > > > http://kafka.apache.org/11/protocol.html
> > > > >
> > > > > * Successful Jenkins builds for the 1.1 branch:
> > > > > Unit/integration tests:
> > > https://builds.apache.org/job/kafka-1.1-jdk7/68
> > > > > System tests: https://jenkins.confluent.io/
> > > > job/system-test-kafka/job/1.1/
> > > > > 30/
> > > > >
> > > > > /**
> > > > >
> > > > > Thanks,
> > > > > Damian Guy
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] 1.1.0 RC1

2018-03-09 Thread Ismael Juma
Thanks Jeff:

https://github.com/apache/kafka/pull/4678

Ismael

On Fri, Mar 9, 2018 at 1:56 AM, Damian Guy  wrote:

> Hi Jeff,
>
> Thanks, we will look into this.
>
> Regards,
> Damian
>
> On Thu, 8 Mar 2018 at 18:27 Jeff Chao  wrote:
>
> > Hello,
> >
> > We at Heroku have run 1.1.0 RC1 through our normal performance and
> > regression test suite and have found performance to be comparable to
> 1.0.0.
> >
> > That said, we're however -1 (non-binding) since this release includes
> > Zookeeper 3.4.11 
> which
> > is affected by the critical regression ZOOKEEPER-2960
> > . As 3.4.12 isn't
> > released yet, it might be better to have 3.4.10 included instead.
> >
> > Jeff
> > Heroku
> >
> >
> > On Tue, Mar 6, 2018 at 1:19 PM, Ted Yu  wrote:
> >
> > > +1
> > >
> > > Checked signature
> > > Ran test suite - apart from flaky testMetricsLeak, other tests passed.
> > >
> > > On Tue, Mar 6, 2018 at 2:45 AM, Damian Guy 
> wrote:
> > >
> > > > Hello Kafka users, developers and client-developers,
> > > >
> > > > This is the second candidate for release of Apache Kafka 1.1.0.
> > > >
> > > > This is minor version release of Apache Kakfa. It Includes 29 new
> KIPs.
> > > > Please see the release plan for more details:
> > > >
> > > > https://cwiki.apache.org/confluence/pages/viewpage.
> > > action?pageId=71764913
> > > >
> > > > A few highlights:
> > > >
> > > > * Significant Controller improvements (much faster and session
> > expiration
> > > > edge cases fixed)
> > > > * Data balancing across log directories (JBOD)
> > > > * More efficient replication when the number of partitions is large
> > > > * Dynamic Broker Configs
> > > > * Delegation tokens (KIP-48)
> > > > * Kafka Streams API improvements (KIP-205 / 210 / 220 / 224 / 239)
> > > >
> > > > Release notes for the 1.1.0 release:
> > > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/RELEASE_NOTES.html
> > > >
> > > > *** Please download, test and vote by Friday, March 9th, 5pm PT
> > > >
> > > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > http://kafka.apache.org/KEYS
> > > >
> > > > * Release artifacts to be voted upon (source and binary):
> > > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/
> > > >
> > > > * Maven artifacts to be voted upon:
> > > > https://repository.apache.org/content/groups/staging/
> > > >
> > > > * Javadoc:
> > > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/javadoc/
> > > >
> > > > * Tag to be voted upon (off 1.1 branch) is the 1.1.0 tag:
> > > > https://github.com/apache/kafka/tree/1.1.0-rc1
> > > >
> > > >
> > > > * Documentation:
> > > > http://kafka.apache.org/11/documentation.html
> > > >
> > > > * Protocol:
> > > > http://kafka.apache.org/11/protocol.html
> > > >
> > > > * Successful Jenkins builds for the 1.1 branch:
> > > > Unit/integration tests:
> > https://builds.apache.org/job/kafka-1.1-jdk7/68
> > > > System tests: https://jenkins.confluent.io/
> > > job/system-test-kafka/job/1.1/
> > > > 30/
> > > >
> > > > /**
> > > >
> > > > Thanks,
> > > > Damian Guy
> > > >
> > >
> >
>


Re: committing offset metadata in kafka streams

2018-03-09 Thread Matthias J. Sax
If there is only one partition per task, processing order is guaranteed.

For the default partition grouper, it depends on your program. If you read
from multiple topics and join/merge them, a task gets multiple
partitions (from different topics) assigned.
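
As a side note: with the plain consumer (which is what you mention using for the
backup procedure), custom metadata can already be attached to a commit, roughly
like this (an untested sketch; topic, group and watermark values are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MetadataCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "backup-procedure"); // placeholder group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // placeholder topic
            consumer.assign(Collections.singleton(tp));

            // After processing records up to some offset, commit that offset together
            // with the event-time high watermark encoded as the metadata string.
            long nextOffsetToConsume = 42L;
            String eventTimeHighWatermark = "2018-03-09T21:00:00Z";
            consumer.commitSync(Collections.singletonMap(
                    tp, new OffsetAndMetadata(nextOffsetToConsume, eventTimeHighWatermark)));

            // The metadata can be read back later via consumer.committed(tp).metadata().
        }
    }
}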


-Matthias

On 3/9/18 2:42 PM, Stas Chizhov wrote:
>> Also note, that the processing order might slightly differ if you
> process the same data twice 
> 
> Is this still a problem when default partition grouper is used (with 1
> partition per task)?
> 
> Thank you,
> Stanislav.
> 
> 
> 
> 2018-03-09 3:19 GMT+01:00 Matthias J. Sax :
> 
>> Thanks for the explanation.
>>
>> Not sure if setting the metadata you want to get committed in
>> punctuation() would be sufficient. But I would think about it in more
>> details if we get a KIP for this.
>>
>> It's correct that flushing and committing offsets is correlated. But
>> it's not related to punctuation.
>>
>> Also note, that the processing order might slightly differ if you
>> process the same data twice (it depends in which order the brokers
>> return data on poll() and that it something Streams cannot fully
>> control). Thus, you code would need to be "robust" against different
>> processing orders (ie, if there are multiple input partitions, you might
>> get data first for partition 0 and there for partition 1 or the other
>> way round -- the order per partitions is guaranteed to be in offset order).
>>
>>
>> -Matthias
>>
>>
>>
>> On 3/6/18 2:17 AM, Stas Chizhov wrote:
>>> Thank you, Matthias!
>>>
>>> We currently do use kafka consumer and store event time highwatermarks as
>>> offset metadata. This is used during backup procedure, which is to
>> create a
>>> snapshot of the target storage with all events up to certain timestamp
>> and
>>> no other.
>>>
>>> As for the API - I guess being able to provide partition-to-metadata map
>> in
>>> the context commit method would do it (to be called from within punctuate
>>> method). BTW as far as I understand if Processor API is used flushing
>>> producers and committing offsets is correlated and both output topic
>> state
>>> and committed offsets do correspond to a state at the moment of some
>>> punctuation. Meaning that if I do have a deterministic processing
>> topology
>>> my output topic is going to be deterministic as well (modulo duplicates
>> of
>>> course).  Am I correct here?
>>>
>>> Best regards,
>>> Stanislav.
>>>
>>>
>>> 2018-03-05 2:31 GMT+01:00 Matthias J. Sax :
>>>
 You are correct. This is not possible atm.

 Note, that commits happen "under the hood" and users cannot commit
 explicitly. Users can only "request" as commit -- this implies that
 Kafka Streams will commit as soon as possible -- but when
 `context#commit()` returns, the commit is not done yet (it only sets a
 flag).

 What is your use case for this? How would you want to use this from an
 API point of view?

 Feel free to open a feature request JIRA -- we don't have any plans to
 add this atm -- it's the first time anybody asks for this feature. If
 there is a JIRA, maybe somebody picks it up :)


 -Matthias

 On 3/3/18 6:51 AM, Stas Chizhov wrote:
> Hi,
>
> There seems to be no way to commit custom metadata along with offsets
 from
> within Kafka Streams.
> Are there any plans to expose this functionality or have I missed
 something?
>
> Best regards,
> Stanislav.
>


>>>
>>
>>
> 





Re: committing offset metadata in kafka streams

2018-03-09 Thread Stas Chizhov
> Also note, that the processing order might slightly differ if you
process the same data twice 

Is this still a problem when the default partition grouper is used (with 1
partition per task)?

Thank you,
Stanislav.



2018-03-09 3:19 GMT+01:00 Matthias J. Sax :

> Thanks for the explanation.
>
> Not sure if setting the metadata you want to get committed in
> punctuation() would be sufficient. But I would think about it in more
> details if we get a KIP for this.
>
> It's correct that flushing and committing offsets is correlated. But
> it's not related to punctuation.
>
> Also note, that the processing order might slightly differ if you
> process the same data twice (it depends in which order the brokers
> return data on poll() and that it something Streams cannot fully
> control). Thus, you code would need to be "robust" against different
> processing orders (ie, if there are multiple input partitions, you might
> get data first for partition 0 and there for partition 1 or the other
> way round -- the order per partitions is guaranteed to be in offset order).
>
>
> -Matthias
>
>
>
> On 3/6/18 2:17 AM, Stas Chizhov wrote:
> > Thank you, Matthias!
> >
> > We currently do use kafka consumer and store event time highwatermarks as
> > offset metadata. This is used during backup procedure, which is to
> create a
> > snapshot of the target storage with all events up to certain timestamp
> and
> > no other.
> >
> > As for the API - I guess being able to provide partition-to-metadata map
> in
> > the context commit method would do it (to be called from within punctuate
> > method). BTW as far as I understand if Processor API is used flushing
> > producers and committing offsets is correlated and both output topic
> state
> > and committed offsets do correspond to a state at the moment of some
> > punctuation. Meaning that if I do have a deterministic processing
> topology
> > my output topic is going to be deterministic as well (modulo duplicates
> of
> > course).  Am I correct here?
> >
> > Best regards,
> > Stanislav.
> >
> >
> > 2018-03-05 2:31 GMT+01:00 Matthias J. Sax :
> >
> >> You are correct. This is not possible atm.
> >>
> >> Note, that commits happen "under the hood" and users cannot commit
> >> explicitly. Users can only "request" as commit -- this implies that
> >> Kafka Streams will commit as soon as possible -- but when
> >> `context#commit()` returns, the commit is not done yet (it only sets a
> >> flag).
> >>
> >> What is your use case for this? How would you want to use this from an
> >> API point of view?
> >>
> >> Feel free to open a feature request JIRA -- we don't have any plans to
> >> add this atm -- it's the first time anybody asks for this feature. If
> >> there is a JIRA, maybe somebody picks it up :)
> >>
> >>
> >> -Matthias
> >>
> >> On 3/3/18 6:51 AM, Stas Chizhov wrote:
> >>> Hi,
> >>>
> >>> There seems to be no way to commit custom metadata along with offsets
> >> from
> >>> within Kafka Streams.
> >>> Are there any plans to expose this functionality or have I missed
> >> something?
> >>>
> >>> Best regards,
> >>> Stanislav.
> >>>
> >>
> >>
> >
>
>


Re: [DISCUSS] KIP-267: Add Processor Unit Test Support to Kafka Streams Test Utils

2018-03-09 Thread John Roesler
Sweet! I think this pretty much wraps up all the discussion points.

I'll update the KIP with all the relevant aspects we discussed and call for
a vote.

I'll also comment on the TopologyTestDriver ticket noting this modular test
strategy.

Thanks, everyone.
-John

On Fri, Mar 9, 2018 at 10:57 AM, Guozhang Wang  wrote:

> Hey John,
>
> Re: Mock Processor Context:
>
> That's a good point, I'm convinced that we should keep them as two classes.
>
>
> Re: test-utils module:
>
> I think I agree with your proposed changes, in fact in order to not scatter
> the test classes in two places maybe it's better to move all of them to the
> new module. One caveat is that it will make streams' project hierarchy
> inconsistent with other projects where the unit test classes are maintained
> inside the main artifact package, but I think it is a good cost to pay,
> plus once we start publishing test-util artifacts for other projects like
> client and connect, we may face the same issue and need to do this
> refactoring as well.
>
>
>
> Guozhang
>
>
>
>
> On Fri, Mar 9, 2018 at 9:54 AM, John Roesler  wrote:
>
> > Hi Guozhang and Bill,
> >
> > I'll summarize what I'm currently thinking in light of all the
> discussion:
> >
> > Mock Processor Context:
> > ===
> >
> > Here's how I see the use cases for the two mocks differing:
> >
> > 1. o.a.k.test.MPC: Crafted for testing Streams use cases. Implements
> > AbstractProcessorContext, actually forward to child processor nodes,
> allow
> > restoring a state store. Most importantly, the freedom to do stuff
> > convenient for our tests without impacting anyone.
> >
> > 2. (test-utils) MPC: Crafted for testing community Processors (and
> > friends). Very flat and simple implementation (so people can read it in
> one
> > sitting); i.e., doesn't drag in other data models like RecordContext.
> Test
> > one processor in isolation, so generally don't bother with complex logic
> > like scheduling punctuators, forwarding results, or restoring state
> stores.
> > Most importantly, an API that can be stable.
> >
> > So, I really am leaning toward keeping both implementations. I like
> Bill's
> > suggestion of renaming the unit testing class to
> > InternalMockProcessorContext, since having classes with the same name in
> > different packages is confusing. I look forward to the day when Java 9
> > takes off and we can actually hide internal classes from the public
> > interface.
> >
> > test-utils module:
> > =
> >
> > This is actually out of scope for this KIP if we keep both MPC
> > implementations, but it has been a major feature of this discussion, so
> we
> > may as well see it though.
> >
> > I've waffled a bit on this point, but right now I would propose we
> > restructure the streams directory thusly:
> >
> > streams/ (artifact name := "streams", the actual streams code lives here)
> > - test-utils/ (this is the current test-utils artifact, depends on
> > "streams")
> > - tests/ (new module, depends on "streams" and "test-utils", *NO
> published
> > artifact*)
> >
> > This gets us out of the circular dependency without having to engage in
> any
> > Gradle shenanigans while preserving "test-utils" as a separate artifact.
> > This is good because: 1) the test-utils don't need to be in production
> > code, so it's nice to have a separate artifact, 2) test-utils is already
> > public in 1.1, and it's a bummer to introduce users' code when we can so
> > easily avoid it.
> >
> > Note, though, that if we agree to keep both MPC implementations, then
> this
> > really is just important for rewriting our tests to use
> TopologyTestDriver,
> > and in fact only the tests that need it should move to "streams/tests/".
> >
> > What say you?
> >
> > -John
> >
> > On Fri, Mar 9, 2018 at 9:01 AM, Guozhang Wang 
> wrote:
> >
> > > Hmm.. it seems to be a general issue then, since we were planning to
> also
> > > replace the KStreamTestDriver and ProcessorTopologyTestDriver with the
> > new
> > > TopologyTestDriver soon, so if the argument that testing dependency
> could
> > > still cause circular dependencies holds it means we cannot do that as
> > well.
> > >
> > > My understanding on gradle dependencies has been that test dependencies
> > are
> > > not required to compile when compiling the project, but only required
> > when
> > > testing the project; and the way we script gradle follows the way that
> > for
> > > any test tasks of the project we require compiling it first so this is
> > > fine. John / Bill, could you elaborate a bit more on the maintenance
> > > complexity concerns?
> > >
> > >
> > > Guozhang
> > >
> > > On Fri, Mar 9, 2018 at 7:40 AM, Bill Bejeck  wrote:
> > >
> > > > John,
> > > >
> > > > Sorry for the delayed response.  Thanks for the KIP, I'm +1 on it,
> and
> > I
> > > > don't have any further comments on the KIP itself aside from the
> > comments
> > > > that others have 

Re: Delayed processing

2018-03-09 Thread John Roesler
Hi Wim,

One off-the-cuff idea is that you maybe don't need to actually delay
anonymizing the data. Instead, you can just create a separate pathway that
immediately anonymizes the data. Something like this:

(raw-input topic, GDPR retention period set)
 |\->[streams apps that need non-anonymized data]
 |
 \->[anonymizer app]->(anonymized-input topic, arbitrary retention period)
       ->[any apps that can handle anonymous data]

This way, you can keep your anonymized-input topic forever if you want,
using log compaction. You also get a very clean data lineage, so you know
for sure which components are viewing non-anonymized data (because they
consume the raw-input topic) versus the ones that are "safe" according to
GDPR, since they consume only the anonymized-input topic.

For downstream apps that are happy using anonymized input, it seems like it
wouldn't matter whether the input is anonymized right away or 6 months
delayed.

And since you know clearly which components may be "dirty" because they are
downstream of raw-input, you can audit those components and make sure they
are either stateless or that they also have proper retention periods set.

Of course, without knowing the details, I can't say whether this actually
works for you, but I wanted to share the thought.
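
In Streams DSL terms the anonymizer leg could be as small as this rough sketch
(topic names and the anonymize() function are placeholders for whatever your
pipeline actually does):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class AnonymizerApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "anonymizer-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("raw-input")     // short, GDPR-bounded retention
               .mapValues(AnonymizerApp::anonymize)     // scrub the personal data
               .to("anonymized-input");                 // long or compacted retention

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for whatever anonymization your records actually need.
    private static String anonymize(String record) {
        return record.replaceAll("\"email\":\"[^\"]*\"", "\"email\":\"***\"");
    }
}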

Thanks,
-John

On Thu, Mar 8, 2018 at 11:21 PM, adrien ruffie 
wrote:

> Hi Wim,
> this topic (processing order) has been cropping up for a while; several
> articles, benchmarks, and other write-ups on the subject
> reach this conclusion.
>
> That said, you can always ask someone else for another opinion on the
> subject.
>
> regards,
>
> Adrien
>
> 
> De : Wim Van Leuven 
> Envoyé : vendredi 9 mars 2018 08:03:15
> À : users@kafka.apache.org
> Objet : Re: Delayed processing
>
> Hey Adrien,
>
> thank you for the elaborate explanation!
>
> We are ingesting call data records here, which due to the nature of a telco
> network might not arrive in absolute logical order.
>
> If I understand your explanation correctly, you are saying that with your
> setup, Kafka guarantees the processing in order of ingestion of the
> messages. Correct?
>
> Thanks!
> -wim
>
> On Thu, 8 Mar 2018 at 22:58 adrien ruffie 
> wrote:
>
> > Hello Wim,
> >
> >
> > does it matter (I think), because one of the big and principal features
> of
> > Kafka is:
> >
> > Kafka does load balancing of messages and guarantees ordering in a
> > distributed cluster.
> >
> >
> > The order of the messages should be guaranteed, except in several cases:
> >
> > 1] The producer can cause data loss when block.on.buffer.full = false,
> > retries are exhausted, or messages are sent without using acks=all
> >
> > 2] unclean leader election is enabled: because if one of the followers (out
> > of sync) becomes the new leader, messages that were not synced to the new
> > leader are lost.
> >
> >
> > message reordering might happen when:
> >
> > 1] max.in.flight.requests.per.connection > 1 and retries are enabled
> >
> > 2] when a producer is not correctly closed, e.g. without calling .close()
> >
> > The close() method ensures that the accumulator is closed first, to
> > guarantee that no more appends are accepted after breaking the send
> > loop.
> >
> >
> >
> > If you want to avoid these cases:
> >
> > - close the producer in the error callback
> >
> > - close producer with close(0) to prevent sending after previous message
> > send failed
> >
> >
> > Avoid data loss:
> >
> > - block.on.buffer.full=true
> >
> > - retries=Long.MAX_VALUE
> >
> > - acks=all
> >
> >
> > Avoid reordering:
> >
> > max.in.flight.request.per.connection=1 (be aware about latency)
> >
> >
> > Note that if your producer goes down, messages still in its buffer will
> > be lost ... (perhaps manage local storage if you are punctilious)
> >
> > Moreover, at least two replicas are needed at any time to guarantee data
> > persistence, e.g. replication factor = 3, min.insync.replicas = 2, unclean
> > leader election disabled.
> >
> >
> > Also keep in mind that a consumer can lose messages when offsets are not
> > correctly committed. Disable enable.auto.commit and commit offsets only
> > after you have finished processing each message (or commit several processed
> > messages at one time, keeping them in a local memory buffer).
> >
> >
> > I hope these suggestions help you.
> >
> >
> > Best regards,
> >
> > Adrien
> >
> > 
> > De : Wim Van Leuven 
> > Envoyé : jeudi 8 mars 2018 21:35:13
> > À : users@kafka.apache.org
> > Objet : Delayed processing
> >
> > Hello,
> >
> > I'm wondering how to design a KStreams or regular Kafka application that
> > can hold of processing of messages until a future time.
> >
> > This related to EU's data protection regulation: we can store raw
> messages
> > for a given time; afterwards we have to store the anonymised message.
> So, I
> > was 

Re: kafka streams with TimeWindows, incorrect result

2018-03-09 Thread Guozhang Wang
Hi Jie,

This is by design of Kafka Streams, please read this doc for more details
(search for "outputs of the Wordcount application is actually a continuous
stream of updates"):

https://kafka.apache.org/0110/documentation/streams/quickstart

Note that these semantics apply to both windowed and un-windowed tables.
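
If the many intermediate updates are a problem for the downstream consumer, one
knob that reduces them (without changing these update-stream semantics) is the
record cache together with the commit interval. A rough sketch with placeholder
values:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class CountInfoConfig {
    // A larger record cache plus a longer commit interval make the windowed
    // KTable emit fewer intermediate updates downstream. It still will not emit
    // exactly one "final" result per window, but updates get batched.
    public static Properties conf() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "count-info-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L); // 10 MB
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30 * 1000);       // flush every 30 s
        return props;
    }
}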


Guozhang

On Fri, Mar 9, 2018 at 5:36 AM, 杰 杨  wrote:

> Hi:
> I used TimeWindow for aggregate data in kafka.
>
> this is code snippet ;
>
>   view.flatMap(new MultipleKeyValueMapper(client)
> ).groupByKey(Serialized.with(Serdes.String(),
> Serdes.serdeFrom(new CountInfoSerializer(), new
> CountInfoDeserializer(
> .windowedBy(TimeWindows.of(6)).reduce(new
> Reducer() {
> @Override
> public CountInfo apply(CountInfo value1, CountInfo value2) {
> return new CountInfo(value1.start + value2.start,
> value1.active + value2.active, value1.fresh + value2.fresh);
> }
> }) .toStream(new KeyValueMapper String>() {
> @Override
> public String apply(Windowed key, CountInfo value) {
> return key.key();
> }
> }).print(Printed.toSysOut());
>
> KafkaStreams streams = new KafkaStreams(builder.build(),
> KStreamReducer.getConf());
> streams.start();
>
> and I test 3 data in kafka .
> and I print key value .
>
>
> [KTABLE-TOSTREAM-07]: [9_9_2018-03-09_hour_
> 21@152060130/152060136], CountInfo{start=12179, active=12179,
> fresh=12179}
> [KTABLE-TOSTREAM-07]: 
> [9_9_2018-03-09@152060130/152060136],
> CountInfo{start=12179, active=12179, fresh=12179}
> [KTABLE-TOSTREAM-07]: [9_9_2018-03-09_hour_
> 21@152060130/152060136], CountInfo{start=3, active=3,
> fresh=3}
> [KTABLE-TOSTREAM-07]: 
> [9_9_2018-03-09@152060130/152060136],
> CountInfo{start=3, active=3, fresh=3}
> why in one window duration will be print two result but not one result ?
>
> 
> funk...@live.com
>



-- 
-- Guozhang


Re: Delayed processing

2018-03-09 Thread Guozhang Wang
Hello Wim,

Thanks for your explanations, it makes more sense to me now. I think your
scenario may be better described as a "re-processing" case than a "delayed
processing" case, since you are effectively processing the un-anonymised
data once and sending results to the topic; then later you will re-process
the same, but anonymised, data and send results to the same topic.

In Kafka Streams, late-arriving records can be handled correctly since the
library supports timestamp-based processing order, and in practice you can
specify "windowing" operations that wait for late-arriving records and
apply them. You can read more about that in the existing docs below.
Please let me know if you have further questions:

https://kafka.apache.org/0110/documentation/streams/core-concepts#streams_time

https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-joins
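
To make that concrete, a windowed aggregation that keeps windows around long
enough to absorb late arrivals might look roughly like this (a sketch; the topic
name, window sizes and serdes are placeholders):

import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LateArrivalExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdr-windowed-counts"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("call-data-records") // placeholder topic
               .groupByKey()
               // 1-minute windows kept for 1 hour, so records arriving late (by
               // event time) are still applied to the window they belong to.
               .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1))
                                      .until(TimeUnit.HOURS.toMillis(1)))
               .count()
               .toStream()
               .print(Printed.toSysOut());

        new KafkaStreams(builder.build(), props).start();
    }
}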


Guozhang


On Thu, Mar 8, 2018 at 11:07 PM, Wim Van Leuven <
wim.vanleu...@highestpoint.biz> wrote:

> Hello Guozhang,
>
> we ingest messages that we outgest is user facing datastore, after some
> additional processing. Due to GDPR, we can only retain that unanonymised
> data for a maximum period of time. Let's say 6 months.
>
> So, right before sending the data to the out topic, we'll branch the data
> into an anonymisation leg of the processing chain. Now, we want to keep the
> messages in that chain until the expiration of the unanonymised messages.
> At that point we want to send the anonymous record to the outgoing topic so
> that the anonymous message replaces the unanonymised version.
>
> Does that make the problem/idea more clear?
> --
> wim
>
>
>
> On Fri, 9 Mar 2018 at 00:47 Guozhang Wang  wrote:
>
> > Hello Wim,
> >
> > Not sure if I understand your motivations for delayed processing, could
> you
> > elaborate a bit more? Do you want to process raw messages, or do you want
> > to process anonymised messages?
> >
> >
> > Guozhang
> >
> >
> > On Thu, Mar 8, 2018 at 12:35 PM, Wim Van Leuven <
> > wim.vanleu...@highestpoint.biz> wrote:
> >
> > > Hello,
> > >
> > > I'm wondering how to design a KStreams or regular Kafka application
> that
> > > can hold of processing of messages until a future time.
> > >
> > > This related to EU's data protection regulation: we can store raw
> > messages
> > > for a given time; afterwards we have to store the anonymised message.
> > So, I
> > > was thinking about branching the stream, anonymise the messages into a
> > > waiting topic and than continue from there until the retention time
> > passes.
> > >
> > > But that approach has some caveats:
> > >
> > >- This is not an exact solution as order of events is not
> guaranteed:
> > we
> > >might encounter a message that triggers the stop processing while
> some
> > >events arriving later should normally still pass
> > >- how to stop properly stop processing if we encounter a message
> that
> > >indicates to not continue?
> > >- ...
> > >
> > > Are there better know solutions or best practices to delay message
> > > processing with Kafka streams / consumers+producers?
> > >
> > > Thanks for any insights/help here!
> > > -wim
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang


Unit Test with Compacted Topics having non consecutive offsets

2018-03-09 Thread Sirisha Sindiri
Hi,

I need to write unit tests with compacted topics in a local cluster. Has
anyone done something like that? Any tips/guidance will be much appreciated.
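
For context, what I have in mind so far is roughly this: create the compacted
topic on the local test broker with the AdminClient, using aggressive compaction
settings so the cleaner actually runs during the test (a sketch; broker address,
topic name and settings are only placeholders):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // local test broker

        Map<String, String> configs = new HashMap<>();
        configs.put(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT);
        // Aggressive settings so compaction (and the resulting offset gaps)
        // actually show up within a test run.
        configs.put(TopicConfig.SEGMENT_MS_CONFIG, "100");
        configs.put(TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.01");
        configs.put(TopicConfig.DELETE_RETENTION_MS_CONFIG, "100");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("compacted-test-topic", 1, (short) 1).configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}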

Thanks
Sirisha


Re: [DISCUSS] KIP-267: Add Processor Unit Test Support to Kafka Streams Test Utils

2018-03-09 Thread Bill Bejeck
John,

Sorry for the delayed response.  Thanks for the KIP, I'm +1 on it, and I
don't have any further comments on the KIP itself aside from the comments
that others have raised.

Regarding the existing MockProcessorContext and its removal in favor of the
one added by this KIP: I'm actually in favor of keeping both.

IMHO it's reasonable to have both because the testing requirements are
different.  Most users are trying to verify their logic works as expected
within a Kafka Streams application and aren't concerned (or shouldn't be at
least, again IMHO) with testing Kafka Streams itself, that is the
responsibility of the Kafka Streams developers and contributors.

However, for the developers and contributors of Kafka Streams, the need to
test the internals of how Streams works is the primary concern and could at
times require different logic or available methods from a given mock object.

I have a couple of thoughts on mitigating the downside of having two
MockProcessorContext objects:

   1. Leave the current MockProcessorContext in the o.a.k.test package but
   rename it to InternalMockProcessorContext and add some documentation as to
   why it's there.
   2. Create a new package under o.a.k.test, called internals and move the
   existing MockProcessorContext there, but that would require a change to the
   visibility of the MockProcessorContext#allStateStores() method to public.

Just wanted to throw in my 2 cents.

Thanks,
Bill

On Thu, Mar 8, 2018 at 11:51 PM, John Roesler  wrote:

> I think what you're suggesting is to:
> 1. compile the main streams code, but not the tests
> 2. compile test-utils (and compile and run the test-utils tests)
> 3. compile and run the streams tests
>
> This works in theory, since the test-utils depends on the main streams
> code, but not the streams tests. and the streams tests depend on test-utils
> while the main streams code does not.
>
> But after poking around a bit and reading up on it, I think this is not
> possible, or at least not mainstream.
>
> The issue is that dependencies are formed between projects, in this case
> streams and streams:test-utils. The upstream project must be built before
> the dependant one, regardless of whether the dependency is for compiling
> the main code or the test code. This means we do have a circular dependency
> on our hands if we want the tests in streams to use the test-utils, since
> they'd both have to be built before the other.
>
> Gradle seems to be quite scriptable, so there may be some way to achieve
> this, but increasing the complexity of the build also introduces a project
> maintenance concern.
>
> The MockProcessorContext itself is pretty simple, so I'm tempted to argue
> that we should just have one for internal unit tests and another for
> test-utils, however this situation also afflicts KAFKA-6474
> , and the
> TopologyTestDriver is not so trivial.
>
> I think the best thing at this point is to go ahead and fold the test-utils
> into the streams project. We can put it into a separate "testutils" package
> to make it easy to identify which code is for test support and which code
> is Kafka Streams. The biggest bummer about this suggestion is that it we
> *just* introduced the test-utils artifact, so folks would to add that
> artifact in 1.1 to write their tests and then have to drop it again in 1.2.
>
> The other major solution is to create a new gradle project for the streams
> unit tests, which depends on streams and test-utils and move all the
> streams unit tests there. I'm pretty sure we can configure gradle just to
> include this project for running tests and not actually package any
> artifacts. This structure basically expresses your observation that the
> test code is essentially a separate module from the main streams code.
>
> Of course, I'm open to alternatives, especially if someone with more
> experience in Gradle is aware of a solution.
>
> Thanks,
> -John
>
>
> On Thu, Mar 8, 2018 at 3:39 PM, Matthias J. Sax 
> wrote:
>
> > Isn't MockProcessorContext in o.a.k.test part of the unit-test package
> > but not the main package?
> >
> > This should resolve the dependency issue.
> >
> > -Matthias
> >
> > On 3/8/18 3:32 PM, John Roesler wrote:
> > > Actually, replacing the MockProcessorContext in o.a.k.test could be a
> bit
> > > tricky, since it would make the "streams" module depend on
> > > "streams:test-utils", but "streams:test-utils" already depends on
> > "streams".
> > >
> > > At first glance, it seems like the options are:
> > > 1. leave the two separate implementations in place. This shouldn't be
> > > underestimated, especially since our internal tests may need different
> > > things from a mocked P.C. than our API users.
> > > 2. move the public testing artifacts into the regular streams module
> > > 3. move the unit tests for Streams into a third module that depends on
> > both
> > > streams and test-utils. Yuck!
> > 

kafka streams with TimeWindows, incorrect result

2018-03-09 Thread 杰 杨
Hi:
I used a TimeWindow to aggregate data in Kafka.

This is the code snippet:

view.flatMap(new MultipleKeyValueMapper(client))
    .groupByKey(Serialized.with(Serdes.String(),
        Serdes.serdeFrom(new CountInfoSerializer(), new CountInfoDeserializer())))
    .windowedBy(TimeWindows.of(6)).reduce(new Reducer<CountInfo>() {
        @Override
        public CountInfo apply(CountInfo value1, CountInfo value2) {
            return new CountInfo(value1.start + value2.start,
                value1.active + value2.active, value1.fresh + value2.fresh);
        }
    }).toStream(new KeyValueMapper<Windowed<String>, CountInfo, String>() {
        @Override
        public String apply(Windowed<String> key, CountInfo value) {
            return key.key();
        }
    }).print(Printed.toSysOut());

KafkaStreams streams = new KafkaStreams(builder.build(), KStreamReducer.getConf());
streams.start();

and I tested 3 data sets in Kafka,
and I printed the key and value.


[KTABLE-TOSTREAM-07]: 
[9_9_2018-03-09_hour_21@152060130/152060136], 
CountInfo{start=12179, active=12179, fresh=12179}
[KTABLE-TOSTREAM-07]: 
[9_9_2018-03-09@152060130/152060136], CountInfo{start=12179, 
active=12179, fresh=12179}
[KTABLE-TOSTREAM-07]: 
[9_9_2018-03-09_hour_21@152060130/152060136], 
CountInfo{start=3, active=3, fresh=3}
[KTABLE-TOSTREAM-07]: 
[9_9_2018-03-09@152060130/152060136], CountInfo{start=3, 
active=3, fresh=3}
Why are two results printed within one window duration instead of one?


funk...@live.com


Re: WELCOME to users@kafka.apache.org

2018-03-09 Thread Venkateswara Chandragiri
Hello,


I would like to report a website access issue from our organization's LAN
public IP: 182.74.52.154.


It looks like public IP 182.74.52.154 has been blocked at your end, causing
access issues for users on our network.


Could you please help me to unblock it?


Thanks in advance!


--

Regards

Venkat Chandragiri

Development Manager – Enterprise Technologies

OSI Consulting Pvt. Ltd

Office:(+91) 40 30410117

(+1) 213 408 3703 Extn: 5308 (VOIP)

Mobile: (+91) 9908252460

--


www.osius.com


From: users-h...@kafka.apache.org 
Sent: Friday, March 9, 2018 5:59:56 PM
To: Venkateswara Chandragiri
Subject: WELCOME to users@kafka.apache.org

Hi! This is the ezmlm program. I'm managing the
users@kafka.apache.org mailing list.

I'm working for my owner, who can be reached
at users-ow...@kafka.apache.org.

Acknowledgment: I have added the address

   vchandrag...@osius.com

to the users mailing list.

Welcome to users@kafka.apache.org!

Please save this message so that you know the address you are
subscribed under, in case you later want to unsubscribe or change your
subscription address.


--- Administrative commands for the users list ---

I can handle administrative requests automatically. Please
do not send them to the list address! Instead, send
your message to the correct command address:

To subscribe to the list, send a message to:
   

To remove your address from the list, send a message to:
   

Send mail to the following for info and FAQ for this list:
   
   

Similar addresses exist for the digest list:
   
   

To get messages 123 through 145 (a maximum of 100 per request), mail:
   

To get an index with subject and author for messages 123-456 , mail:
   

They are always returned as sets of 100, max 2000 per request,
so you'll actually get 100-499.

To receive all messages with the same subject as message 12345,
send a short message to:
   

The messages should contain one line or word of text to avoid being
treated as sp@m, but I will ignore their content.
Only the ADDRESS you send to is important.

You can start a subscription for an alternate address,
for example "john@host.domain", just add a hyphen and your
address (with '=' instead of '@') after the command word:

Re: [VOTE] 1.1.0 RC1

2018-03-09 Thread Damian Guy
Hi Jeff,

Thanks, we will look into this.

Regards,
Damian

On Thu, 8 Mar 2018 at 18:27 Jeff Chao  wrote:

> Hello,
>
> We at Heroku have run 1.1.0 RC1 through our normal performance and
> regression test suite and have found performance to be comparable to 1.0.0.
>
> That said, we're however -1 (non-binding) since this release includes
> Zookeeper 3.4.11  which
> is affected by the critical regression ZOOKEEPER-2960
> . As 3.4.12 isn't
> released yet, it might be better to have 3.4.10 included instead.
>
> Jeff
> Heroku
>
>
> On Tue, Mar 6, 2018 at 1:19 PM, Ted Yu  wrote:
>
> > +1
> >
> > Checked signature
> > Ran test suite - apart from flaky testMetricsLeak, other tests passed.
> >
> > On Tue, Mar 6, 2018 at 2:45 AM, Damian Guy  wrote:
> >
> > > Hello Kafka users, developers and client-developers,
> > >
> > > This is the second candidate for release of Apache Kafka 1.1.0.
> > >
> > > This is minor version release of Apache Kakfa. It Includes 29 new KIPs.
> > > Please see the release plan for more details:
> > >
> > > https://cwiki.apache.org/confluence/pages/viewpage.
> > action?pageId=71764913
> > >
> > > A few highlights:
> > >
> > > * Significant Controller improvements (much faster and session
> expiration
> > > edge cases fixed)
> > > * Data balancing across log directories (JBOD)
> > > * More efficient replication when the number of partitions is large
> > > * Dynamic Broker Configs
> > > * Delegation tokens (KIP-48)
> > > * Kafka Streams API improvements (KIP-205 / 210 / 220 / 224 / 239)
> > >
> > > Release notes for the 1.1.0 release:
> > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/RELEASE_NOTES.html
> > >
> > > *** Please download, test and vote by Friday, March 9th, 5pm PT
> > >
> > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > http://kafka.apache.org/KEYS
> > >
> > > * Release artifacts to be voted upon (source and binary):
> > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/
> > >
> > > * Maven artifacts to be voted upon:
> > > https://repository.apache.org/content/groups/staging/
> > >
> > > * Javadoc:
> > > http://home.apache.org/~damianguy/kafka-1.1.0-rc1/javadoc/
> > >
> > > * Tag to be voted upon (off 1.1 branch) is the 1.1.0 tag:
> > > https://github.com/apache/kafka/tree/1.1.0-rc1
> > >
> > >
> > > * Documentation:
> > > http://kafka.apache.org/11/documentation.html
> > >
> > > * Protocol:
> > > http://kafka.apache.org/11/protocol.html
> > >
> > > * Successful Jenkins builds for the 1.1 branch:
> > > Unit/integration tests:
> https://builds.apache.org/job/kafka-1.1-jdk7/68
> > > System tests: https://jenkins.confluent.io/
> > job/system-test-kafka/job/1.1/
> > > 30/
> > >
> > > /**
> > >
> > > Thanks,
> > > Damian Guy
> > >
> >
>