Re: [DISCUSS] KIP-455 Create an Admin API for Replica Reassignments

2019-07-12 Thread Jan Filipiak


On 12.07.2019 12:05, Stanislav Kozlovski wrote:
> We will
> not support fetching the progress of any particular reassignment call from
> the server-side. If the client is interested in a particular reassignment's
> progress, it should know what that reassignment consisted of and query
> those partitions only.

That's enough, I'd say. Neat.


Re: [DISCUSS] KIP-455 Create an Admin API for Replica Reassignments

2019-07-12 Thread Jan Filipiak
Great KIP,

A pure-Java cruise-control would be a nice thing to have <3

I just want to ask what the opinions are on a way to get the Futures of
the assignment back, say across JVMs.
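
To make the question concrete: another JVM typically only knows *which*
partitions were submitted, so "getting the Future back" would boil down to
polling the listPartitionReassignments() call the KIP proposes for exactly
those partitions until they disappear. Just a sketch, with method names taken
from the KIP draft rather than from released code:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignmentWatcher {

    // Blocks until none of the given partitions show up as an ongoing
    // reassignment any more. The watching JVM only needs to know which
    // partitions were submitted, not the original futures.
    public static void awaitCompletion(AdminClient admin,
                                       Set<TopicPartition> partitions,
                                       long pollIntervalMs)
            throws InterruptedException, ExecutionException {
        while (true) {
            Map<TopicPartition, PartitionReassignment> ongoing =
                    admin.listPartitionReassignments(partitions).reassignments().get();
            if (ongoing.isEmpty()) {
                return;   // all target replicas joined the ISR, reassignment is done
            }
            Thread.sleep(pollIntervalMs);
        }
    }
}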

Best Jan

On 02.07.2019 19:47, Stanislav Kozlovski wrote:
> Hey there, I need to start a new thread on KIP-455. I think there might be
> an issue with the mailing server. For some reason, my replies to the
> previous discussion thread could not be seen by others. After numerous
> attempts, Colin suggested I start a new thread.
> 
> Original Discussion Thread:
> https://sematext.com/opensee/m/Kafka/uyzND1Yl7Er128CQu1?subj=+DISCUSS+KIP+455+Create+an+Administrative+API+for+Replica+Reassignment
> KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> Last Reply of Previous Thread:
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201906.mbox/%3C679a4c5b-3da6-4556-bb89-e680d8cbb705%40www.fastmail.com%3E
> 
> The following is my reply:
> 
> Hi again,
> 
> This has been a great discussion on a tricky KIP. I appreciate everybody's
> involvement in improving this crucial API.
> That being said, I wanted to apologize for my first comment, it was a bit
> rushed and not thought out.
> 
> I've got a few questions now that I dove into this better:
> 
> 1. Does it make sense to have an easy way to cancel all ongoing
> reassignments? To cancel all ongoing reassignments, users had the crude
> option of deleting the znode, bouncing the controller and running the
> rollback JSON assignment that kafka-reassign-partitions.sh gave them
> (KAFKA-6304).
> Now that we support multiple reassignment requests, users may execute
> them incrementally. Suppose something goes horribly wrong and they want to
> revert as quickly as possible - they would need to run the tool with
> multiple rollback JSONs.  I think that it would be useful to have an easy
> way to stop all ongoing reassignments for emergency situations.
> 
> -
> 
> 2. Our kafka-reassign-partitions.sh tool doesn't seem to currently let you
> figure out the ongoing assignments - I guess we expect people to use
> kafka-topics.sh for that. I am not sure how well that would continue to
> work now that we update the replica set only after the new replica joins
> the ISR.
> Do you think it makes sense to add an option for listing the current
> reassignments to the reassign tool as part of this KIP?
> 
> We might want to think whether we want to show the TargetReplicas
> information in the kafka-topics command for completeness as well. That
> might involve the need to update the DescribeTopicsResponse. Personally I
> can't see a downside but I haven't given it too much thought. I fully agree
> that we don't want to add the target replicas to the full replica set and
> nothing useful comes out of telling users they have a replica that might
> not have copied a single byte. Yet, telling them that we have the intention
> of copying bytes sounds useful so maybe having a separate column in
> kafka-topics.sh would provide better clarity?
> 
> -
> 
> 3. What happens if we do another reassignment to a partition while one is
> in progress? Do we overwrite the TargetReplicas?
> In the example sequence you gave:
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> What would the behavior be if a new reassign request came with
> TargetReplicas of [7, 8, 9] for that partition?
> 
> To avoid complexity and potential race conditions, would it make sense to
> reject a reassign request once one is in progress for the specific
> partition, essentially forcing the user to cancel it first?
> Forcing the user to cancel has the benefit of being explicit and guarding
> against human mistakes. The downside I can think of is that in some
> scenarios it might be inefficient, e.g
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> Cancel request sent out. Followed by a new reassign request with
> TargetReplicas of [5, 6, 7] (note that 5 and 6 already fully copied the
> partition). Becomes a bit of a race condition of whether we deleted the
> partitions in between requests or not - I assume in practice this won't be
> an issue. I still feel like I prefer the explicit cancellation step
> 
> -
> 
> 4. My biggest concern - I want to better touch on the interaction between
> the new API and the current admin/reassign_partitions znode, the
> compatibility and our strategy there.
> The KIP says:
> 
>> For compatibility purposes, we will continue to allow assignments to be
>> submitted through the /admin/reassign_partitions node. Just as with the
>> current code, this will only be possible if there are no current
>> assignments. In other words, the znode has two states: empty and waiting
>> for a write, and non-empty because there are assignments in progress. Once
>> the znode is non-empty, further writes to it will be ignored.
> 
> Given the current proposal, I can think of 4 scenarios I want to get a
> better 

Re: [jira] [Created] (KAFKA-8326) Add List Serde

2019-07-11 Thread Jan Filipiak
I think this encourages bad decisions.
Let's just have people define repeated fields in Thrift, Avro, JSON,
Protobuf. It's going to look nasty once you get to your 11th layer of lists.

If you really want to add lists, please do Map as well in one shot.
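
For reference, this is roughly the kind of wrapper being asked for — a sketch
of a length-prefixed List<T> serializer delegating to an inner serializer; the
class name and wire format here are mine, not taken from the KIP:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;

public class ListSerializer<T> implements Serializer<List<T>> {

    private final Serializer<T> inner;

    public ListSerializer(Serializer<T> inner) {
        this.inner = inner;
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, List<T> data) {
        if (data == null) {
            return null;
        }
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
             DataOutputStream out = new DataOutputStream(baos)) {
            out.writeInt(data.size());              // element count
            for (T element : data) {
                byte[] bytes = inner.serialize(topic, element);
                out.writeInt(bytes.length);         // per-element length prefix
                out.write(bytes);
            }
            out.flush();
            return baos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException("Failed to serialize List", e);
        }
    }

    @Override
    public void close() {
        inner.close();
    }
}

The 11th layer of nesting would then be a ListSerializer wrapped around a
ListSerializer ten more times, which is exactly the part that looks nasty.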

Best Jan

On 06.05.2019 17:59, Daniyar Yeralin (JIRA) wrote:
> Daniyar Yeralin created KAFKA-8326:
> --
> 
>  Summary: Add List Serde
>  Key: KAFKA-8326
>  URL: https://issues.apache.org/jira/browse/KAFKA-8326
>  Project: Kafka
>   Issue Type: Improvement
>   Components: clients, streams
> Reporter: Daniyar Yeralin
> 
> 
> I propose adding serializers and deserializers for the java.util.List class.
> 
> I have many use cases where I want to set the key of a Kafka message to be a 
> UUID. Currently, I need to turn UUIDs into strings or byte arrays and use 
> their associated Serdes, but it would be more convenient to serialize and 
> deserialize UUIDs directly.
> 
> I believe there are many use cases where one would want to have a List serde. 
> Ex. 
> [https://stackoverflow.com/questions/41427174/aggregate-java-objects-in-a-list-with-kafka-streams-dsl-windows],
>  
> [https://stackoverflow.com/questions/46365884/issue-with-arraylist-serde-in-kafka-streams-api]
> 
>  
> 
> KIP Link: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-466%3A+Add+support+for+List%3CT%3E+serialization+and+deserialization]
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
> 


Re: [DISCUSS] KIP-213: Second follow-up on Foreign Key Joins

2019-07-11 Thread Jan Filipiak


On 10.07.2019 06:25, Adam Bellemare wrote:
> In my experience (obviously empirical) it seems that many people just want
> the ability to join on foreign keys for the sake of handling all the
> relational data in their event streams and extra tombstones don't matter at
> all. This has been my own experience from our usage of our internal
> implementation at my company, and that of many others who have reached out
> to me.

backing this.



Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2019-03-05 Thread Jan Filipiak


On 04.03.2019 19:14, Matthias J. Sax wrote:
> Thanks Adam,
> 
> *> Q) Range scans work with caching enabled, too. Thus, there is no
> functional/correctness requirement to disable caching. I cannot 
> remember why Jan's proposal added this? It might be an 
> implementation detail though (maybe just remove it from the KIP?
> -- might be miss leading).

I don't know how to range scan over a caching store; probably one would have
to open two iterators and merge them.
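
Something along these lines is what I imagine — just a sketch of merging two
key-sorted iterators (the dirty cache on top of the underlying store), where
the cache entry wins on a key collision because it is the newer value:

import java.util.Comparator;
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.NoSuchElementException;

public class MergedRangeIterator<K, V> implements Iterator<Entry<K, V>> {

    private final Iterator<Entry<K, V>> cacheIter;
    private final Iterator<Entry<K, V>> storeIter;
    private final Comparator<K> comparator;
    private Entry<K, V> nextCache;
    private Entry<K, V> nextStore;

    public MergedRangeIterator(Iterator<Entry<K, V>> cacheIter,
                               Iterator<Entry<K, V>> storeIter,
                               Comparator<K> comparator) {
        this.cacheIter = cacheIter;
        this.storeIter = storeIter;
        this.comparator = comparator;
        this.nextCache = advance(cacheIter);
        this.nextStore = advance(storeIter);
    }

    private Entry<K, V> advance(Iterator<Entry<K, V>> iter) {
        return iter.hasNext() ? iter.next() : null;
    }

    @Override
    public boolean hasNext() {
        return nextCache != null || nextStore != null;
    }

    @Override
    public Entry<K, V> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        final Entry<K, V> result;
        if (nextStore == null) {                       // store exhausted
            result = nextCache;
            nextCache = advance(cacheIter);
        } else if (nextCache == null) {                // cache exhausted
            result = nextStore;
            nextStore = advance(storeIter);
        } else {
            int cmp = comparator.compare(nextCache.getKey(), nextStore.getKey());
            if (cmp < 0) {
                result = nextCache;
                nextCache = advance(cacheIter);
            } else if (cmp > 0) {
                result = nextStore;
                nextStore = advance(storeIter);
            } else {                                   // same key: cache shadows store
                result = nextCache;
                nextCache = advance(cacheIter);
                nextStore = advance(storeIter);
            }
        }
        return result;
    }
}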

Other than that, I still think even the regular join is broken with
caching enabled, right? I once filed a ticket because, with caching
enabled, it would return values that hadn't been published downstream yet.


Re: [VOTE] KIP-349 Priorities for Source Topics

2019-01-24 Thread Jan Filipiak


On 24.01.2019 15:51, Thomas Becker wrote:
> Yes, I think this type of strategy interface would be valuable.
> 

Thank you for leaving this here!


Re: [VOTE] KIP-349 Priorities for Source Topics

2019-01-16 Thread Jan Filipiak


On 16.01.2019 14:05, Thomas Becker wrote:
> I'm going to bow out of this discussion since it's been made clear that
> the feature is not targeted at streams. But for the record, my desire is
> to have an alternative to the timestamp based message choosing strategy
> streams currently imposes, and I thought topic prioritization in the
> consumer could potentially enable that. See
> https://issues.apache.org/jira/browse/KAFKA-4113
>
> -Tommy
>

Would you be so kind as to leave an impression of a MessageChooser
interface? It might be important for an extra KIP later.

Best Jan


Re: [DISCUSS] KIP-402: Improve fairness in SocketServer processors

2019-01-15 Thread Jan Filipiak
On 15.01.2019 13:27, Rajini Sivaram wrote:
> Hi Jan,
>
> If the queue of one Processor is full, we move to the next Processor
> immediately without blocking. So as long as the queue of any Processor is
> not full, we accept the connection immediately. If the queue of all
> Processors are full, we assign a Processor and block until the connection
> can be added. There is currently no timeout for this. The PR is here:
> https://github.com/apache/kafka/pull/6022
>
> Thanks,
>
> Rajini
>
> On Tue, Jan 15, 2019 at 12:02 PM Jan Filipiak 
> wrote:
>

Thank you, the code makes it obvious! From the KIP this wasn't too obvious
to me. Any concerns about logging a warning when we hit the mayBlock case?
I would argue that the broker is pretty stressed out under these conditions
and clients may run into connect timeouts.


Best jan

@Colin, could you please try using threads?



Re: [DISCUSS] KIP-402: Improve fairness in SocketServer processors

2019-01-15 Thread Jan Filipiak


> The connection queue for Processors will be changed to ArrayBlockingQueue 
> with a fixed size of 20. Acceptor will use round-robin allocation to allocate 
> each new connection to the next available Processor to which the connection 
> can be added without blocking. If a Processor's queue is full, the next 
> Processor will be chosen. If the connection queue on all Processors are full, 
> Acceptor blocks until the connection can be added to the selected Processor. 
> No new connections will be accepted during this period. The amount of time 
> Acceptor is blocked can be monitored using the new AcceptorIdlePercent metric.

So if the queue of one Processor is full, what is the strategy for moving
on to the next queue? Are we using offer() with a timeout here? How else can
we make sure that a single slow Processor will not block the entire
processing? I assume we do not allow ourselves to get stuck during put(),
since you mention that all queues being full is a scenario. I think there is
quite some uncertainty here. Is there any code one could check out?
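
To make the question concrete, the following is the kind of acceptor behaviour
I am wondering about — just a sketch, not the actual SocketServer code: try
each queue with a non-blocking offer() and only fall back to a blocking put()
when every queue is full:

import java.nio.channels.SocketChannel;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

public class AcceptorSketch {

    private final List<ArrayBlockingQueue<SocketChannel>> processorQueues;
    private int nextProcessor = 0;

    public AcceptorSketch(List<ArrayBlockingQueue<SocketChannel>> processorQueues) {
        this.processorQueues = processorQueues;
    }

    public void assign(SocketChannel connection) throws InterruptedException {
        int start = nextProcessor;
        for (int i = 0; i < processorQueues.size(); i++) {
            int candidate = (start + i) % processorQueues.size();
            // Non-blocking attempt: a full queue is skipped immediately.
            if (processorQueues.get(candidate).offer(connection)) {
                nextProcessor = (candidate + 1) % processorQueues.size();
                return;
            }
        }
        // Every queue is full: block on the originally selected processor until
        // it drains. No new connections are accepted while we are stuck here.
        processorQueues.get(start).put(connection);
        nextProcessor = (start + 1) % processorQueues.size();
    }
}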

Best Jan


For Fun

2019-01-13 Thread Jan Filipiak
I made this for fun :) maybe you like it
https://imgflip.com/i/2r2y15


Re: [VOTE] KIP-349 Priorities for Source Topics

2019-01-13 Thread Jan Filipiak
On 14.01.2019 02:48, n...@afshartous.com wrote:

>
> On reflection, it would be hard to describe the semantics of an API that 
> tried to address starvation by temporarily disabling prioritization, and then 
> oscillating back and forth.
> Thus I agree that it makes sense not to try and address starvation to 
> Mathias’ point that this is intended by design.  The KIP has been updated to 
> reflect this by removing the second method.
>

The semantics of almost everything are hard to describe with only those
two tools at hand. Just here to remind y'all that Samza already shows
us the interface of a powerful-enough abstraction to get stuff done :)

https://samza.apache.org/learn/documentation/0.12/api/javadocs/org/apache/samza/system/chooser/MessageChooser.html
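
Translated to the consumer side, the shape of that abstraction would be
roughly the following — a hypothetical interface, not an existing Kafka API,
with the names modelled on Samza's MessageChooser:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

public interface MessageChooser<K, V> {

    // The consumer feeds every fetched record in via update()...
    void update(TopicPartition partition, ConsumerRecord<K, V> record);

    // ...and asks choose() which record to hand to the application next.
    // Priorities, round-robin, timestamp ordering etc. all become pluggable
    // strategies behind this single method; null means "emit nothing yet".
    ConsumerRecord<K, V> choose();
}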

welcome :)


Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2019-01-13 Thread Jan Filipiak

On 11.01.2019 21:29, John Roesler wrote:
> Hi Jan,
>
> Thanks for the reply.
>
> It sounds like your larger point is that if we provide a building block
> instead of the whole operation, then it's not too hard for users to
> implement the whole operation, and maybe the building block is
> independently useful.

exactly
>
> This is a very fair point. In fact, it's not exclusive with the current
> plan,
> in that we can always add the "building block" version in addition to,
> rather than instead of, the full operation. It very well might be a mistake,
> but I still prefer to begin by introducing the fully encapsulated operation
> and subsequently consider adding the "building block" version if it turns
> out that the encapsulated version is insufficient.

Raising my hand here: I won't be using the new API unless the scattered
table is there. I am going to stick with my PAPI solution.

>
> IMHO, one of Streams's strengths over other processing frameworks
> is a simple API, so simplicity as a design goal seems to suggest that:
>> a.tomanyJoin(B)
> is preferable to
>> a.map(retain(key and FK)).tomanyJoin(B).groupBy(a.key()).join(A)
> at least to start with.
>
> To answer your question about my latter potential optimization,
> no I don't have any code to look at. But, yes, the implementation
> would bring B into A's tasks and keep them in a state store for joining.
> Thanks for that reference, it does indeed sound similar to what
> MapJoin does in Hive.

always a pleasure with you John.

>
> Thanks again,
> -John


Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2019-01-07 Thread Jan Filipiak


On 02.01.2019 23:44, John Roesler wrote:
> However, you seem to have a strong intuition that the scatter/gather
> approach is better.
> Is this informed by your actual applications at work? Perhaps you can
> provide an example
> data set and sequence of operations so we can all do the math and agree
> with you.
> It seems like we should have a convincing efficiency argument before
> choosing a more
> complicated API over a simpler one.

The way I see this is simple: if we only provide the basic implementation
of the 1:n join (repartition by FK, range scan on foreign-table update),
then this is such a fundamental building block.

I do A join B.

a.map(retain(key and FK)).tomanyJoin(B).groupBy(a.key()).join(A). This
pretty much performs all your "wire saving optimisations". To be honest, I
don't know if someone ever put in that ContextAwareMapper() that was
discussed at some point. With it I could actually do the high-watermark
thing: a.contextMap(retain(key, fk and offset)).tomanyJoin(B).aggregate(a.key(), oldest offset wins).join(A).
I can't find the KIP though. I guess it didn't make it.

After the repartition and the range read, the abstraction just becomes too
weak. I just showed that your implementation is my implementation with
stuff around it.

I don't know if your scatter/gather thing is in code somewhere. If the
join will only be applied after the gather phase, I really wonder where we
get the other record from. Do you also persist the foreign table on the
original side? Is that put into code somewhere already?

This would essentially bring B to each of A's tasks. The factors for this
in my case are rather easy and dramatic. Nevertheless, it is an approach I
would appreciate. In Hive this would be closely related to the concept of a
MapJoin, something I wish we had in Streams. I often stated that at some
point we need an unbounded amount of offsets per topic partition and group
:D So good.

Long story short: I hope you can follow my line of thought, and I hope you
can clear up my misunderstanding of how the join is performed on the A side
without materializing B there.

I would love it if Streams got this right. The basic rule I always state
is: do what Hive does. Done.


>
> Last thought:
>> Regarding what will be observed. I consider it a plus that all events
>> that are in the inputs have an respective output. Whereas your solution
>> might "swallow" events.
>
> I didn't follow this. Following Adam's example, we have two join results: a
> "dead" one and
> a "live" one. If we get the dead one first, both solutions emit it,
> followed by the live result.

There might be multiple dead ones in flight, right? But it doesn't really
matter; I never did anything with the extra benefit I mentioned.


Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-12-10 Thread Jan Filipiak


On 10.12.2018 07:42, Guozhang Wang wrote:
> Hello Adam / Jan / John,
>
> Sorry for being late on this thread! I've finally got some time this
> weekend to cleanup a load of tasks on my queue (actually I've also realized
> there are a bunch of other things I need to enqueue while cleaning them up
> --- sth I need to improve on my side). So here are my thoughts:
>
> Regarding the APIs: I like the current written API in the KIP. More
> generally I'd prefer to keep the 1) one-to-many join functionalities as
> well as 2) other join types than inner as separate KIPs since 1) may worth
> a general API refactoring that can benefit not only foreignkey joins but
> collocate joins as well (e.g. an extended proposal of
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-150+-+Kafka-Streams+Cogroup),
> and I'm not sure if other join types would actually be needed (maybe left
> join still makes sense), so it's better to wait-for-people-to-ask-and-add
> than add-sth-that-no-one-uses.
>
> Regarding whether we enforce step 3) / 4) v.s. introducing a
> KScatteredTable for users to inject their own optimization: I'd prefer to
> do the current option as-is, and my main rationale is for optimization
> rooms inside the Streams internals and the API succinctness. For advanced
> users who may indeed prefer KScatteredTable and do their own optimization,
> while it is too much of the work to use Processor API directly, I think we
> can still extend the current API to support it in the future if it becomes
> necessary.

No internal optimization potential. It's a myth.

¯\_(ツ)_/¯

:-)

>
> Another note about step 4) resolving out-of-ordering data, as I mentioned
> before I think with KIP-258 (embedded timestamp with key-value store) we
> can actually make this step simpler than the current proposal. In fact, we
> can just keep a single final-result store with timestamps and reject values
> that have a smaller timestamp, is that right?

Which output is the correct one should at least be decided based on the
offset of the original message.
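
Roughly what I mean — a sketch where every candidate result carries the offset
of the left-hand-side record it was derived from, and anything derived from an
older offset is dropped (in a real topology the map would of course be a
persistent state store, and the offset would be carried in the value through
the repartition):

import java.util.HashMap;
import java.util.Map;

public class OffsetBasedResolver<K, V> {

    private final Map<K, Long> highestSeenOffset = new HashMap<>();

    // Returns the result to forward, or null if the candidate is out of date.
    public V resolve(K key, long sourceOffset, V joinResult) {
        Long seen = highestSeenOffset.get(key);
        if (seen != null && sourceOffset < seen) {
            return null;                 // derived from an older LHS record: drop it
        }
        highestSeenOffset.put(key, sourceOffset);
        return joinResult;
    }
}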

>
>
> That's all I have in mind now. Again, great appreciation to Adam to make
> such HUGE progress on this KIP!
>
>
> Guozhang
>
> On Wed, Dec 5, 2018 at 2:40 PM Jan Filipiak 
> wrote:
>
>> If they don't find the time:
>> They usually take the opposite path from me :D
>> so the answer would be clear.
>>
>> hence my suggestion to vote.
>>
>>
>> On 04.12.2018 21:06, Adam Bellemare wrote:
>>> Hi Guozhang and Matthias
>>>
>>> I know both of you are quite busy, but we've gotten this KIP to a point
>>> where we need more guidance on the API (perhaps a bit of a tie-breaker,
>> if
>>> you will). If you have anyone else you may think should look at this,
>>> please tag them accordingly.
>>>
>>> The scenario is as such:
>>>
>>> Current Option:
>>> API:
>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable#KIP-213Supportnon-keyjoininginKTable-PublicInterfaces
>>> 1) Rekey the data to CombinedKey, and shuffles it to the partition with
>> the
>>> foreignKey (repartition 1)
>>> 2) Join the data
>>> 3) Shuffle the data back to the original node (repartition 2)
>>> 4) Resolve out-of-order arrival / race condition due to foreign-key
>> changes.
>>>
>>> Alternate Option:
>>> Perform #1 and #2 above, and return a KScatteredTable.
>>> - It would be keyed on a wrapped key function: <CombinedKey<KO, K>, VR> (KO
>>> = Other Table Key, K = This Table Key, VR = Joined Result)
>>> - KScatteredTable.resolve() would perform #3 and #4 but otherwise a user
>>> would be able to perform additional functions directly from the
>>> KScatteredTable (TBD - currently out of scope).
>>> - John's analysis 2-emails up is accurate as to the tradeoffs.
>>>
>>> Current Option is coded as-is. Alternate option is possible, but will
>>> require for implementation details to be made in the API and some
>> exposure
>>> of new data structures into the API (ie: CombinedKey).
>>>
>>> I appreciate any insight into this.
>>>
>>> Thanks.
>>>
>>> On Tue, Dec 4, 2018 at 2:59 PM Adam Bellemare 
>>> wrote:
>>>
>>>> Hi John
>>>>
>>>> Thanks for your feedback and assistance. I think your summary is
>> accurate
>>>> from my perspective. Additionally, I would like to add that there is a
>> risk
>>>> of inconsistent final states without performing the resolution. This is
>> a
>>>> major concern for m

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-12-05 Thread Jan Filipiak
vs. value vectors), etc.). If we
>>> publish something like a KScatteredTable with the right-partitioned joined
>>> data, then the API pretty much locks in the implementation as well.
>>> * The API seems simpler to understand and use. I do mean "seems"; if
>>> anyone
>>> wants to make the case that KScatteredTable is actually simpler, I think
>>> hypothetical usage code would help. From a relational algebra perspective,
>>> it seems like KTable.join(KTable) should produce a new KTable in all
>>> cases.
>>> * That said, there might still be room in the API for a different
>>> operation
>>> like what Jan has proposed to scatter a KTable, and then do things like
>>> join, re-group, etc from there... I'm not sure; I haven't thought through
>>> all the consequences yet.
>>>
>>> This is all just my opinion after thinking over the discussion so far...
>>> -John
>>>
>>> On Mon, Dec 3, 2018 at 2:56 PM Adam Bellemare 
>>> wrote:
>>>
>>>> Updated the PR to take into account John's feedback.
>>>>
>>>> I did some preliminary testing for the performance of the prefixScan. I
>>>> have attached the file, but I will also include the text in the body
>>> here
>>>> for archival purposes (I am not sure what happens to attached files). I
>>>> also updated the PR and the KIP accordingly.
>>>>
>>>> Summary: It scales exceptionally well for scanning large values of
>>>> records. As Jan mentioned previously, the real issue would be more
>>> around
>>>> processing the resulting records after obtaining them. For instance, it
>>>> takes approximately ~80-120 mS to flush the buffer and a further
>>> ~35-85mS
>>>> to scan 27.5M records, obtaining matches for 2.5M of them. Iterating
>>>> through the records just to generate a simple count takes ~ 40 times
>>> longer
>>>> than the flush + scan combined.
>>>>
>>>>
>>> 
>>>> Setup:
>>>>
>>>>
>>> 
>>>> Java 9 with default settings aside from a 512 MB heap (Xmx512m, Xms512m)
>>>> CPU: i7 2.2 Ghz.
>>>>
>>>> Note: I am using a slightly-modified, directly-accessible Kafka Streams
>>>> RocksDB
>>>> implementation (RocksDB.java, basically just avoiding the
>>>> ProcessorContext).
>>>> There are no modifications to the default RocksDB values provided in the
>>>> 2.1/trunk release.
>>>>
>>>>
>>>> keysize = 128 bytes
>>>> valsize = 512 bytes
>>>>
>>>> Step 1:
>>>> Write X positive matching events: (key = prefix + left-padded
>>>> auto-incrementing integer)
>>>> Step 2:
>>>> Write 10X negative matching events (key = left-padded auto-incrementing
>>>> integer)
>>>> Step 3:
>>>> Perform flush
>>>> Step 4:
>>>> Perform prefixScan
>>>> Step 5:
>>>> Iterate through return Iterator and validate the count of expected
>>> events.
>>>>
>>>>
>>>>
>>> 
>>>> Results:
>>>>
>>>>
>>> 
>>>> X = 1k (11k events total)
>>>> Flush Time = 39 mS
>>>> Scan Time = 7 mS
>>>> 6.9 MB disk
>>>>
>>>>
>>> 
>>>> X = 10k (110k events total)
>>>> Flush Time = 45 mS
>>>> Scan Time = 8 mS
>>>> 127 MB
>>>>
>>>>
>>> 
>>>> X = 100k (1.1M events total)
>>>> Test1:
>>>> Flush Time = 60 mS
>>>> Scan Time = 12 mS
>>>> 678 MB
>>>>
>>>> Test2:
>>>> Flush Time = 45 mS
>>>> Scan Time = 7 mS
>>>> 576 MB
>>>>
>>>>
>>> 
>>>> X = 1MB

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-12-03 Thread Jan Filipiak
 the event, since it must be outdated.
>

> In your KIP, you gave an example in which (1: A, xyz) gets updated to (1:
> B, xyz), ultimately yielding a conundrum about whether the final state
> should be (1: null) or (1: joined-on-B). With the algorithm above, you
> would be considering (1: (B, xyz), (A, null)) vs (1: (B, xyz), (B,
> EntityB)). It seems like this does give you enough information to make the
> right choice, regardless of disordering.

Will check Adam's patch, but this should work. As mentioned often, I am
not convinced about partitioning back for the user automatically. I think
this is the real performance eater ;)

>
>
> 7. Last thought... I'm a little concerned about the performance of the
> range scans when records change in the right table. You've said that you've
> been using the algorithm you presented in production for a while. Can you
> give us a sense of the performance characteristics you've observed?
>

Make it work, make it fast, make it beautiful. The topmost thing here is /
was correctness. In practice I do not measure the performance of the range
scan. The usual cases I run this with emit 500k - 1kk rows on a
left-hand-side change. The range scan is just the work you have to do; also,
when you pack your data into different formats, the RocksDB performance is
usually tightly coupled to the size of the data and we can't really change
that. It is more important for users to prevent useless updates to begin
with. My left-hand side is guarded to drop changes that are not going to
change my join output.

usually it's:

drop unused fields and then don't forward if old.equals(new)
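
As a sketch of that guard (the store name and the projection are
application-specific; one would wire it in via transformValues() on the
left-hand side and filter out the nulls before the join):

import java.util.Objects;
import java.util.function.Function;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class NoOpUpdateGuard<K, V, P> implements ValueTransformerWithKey<K, V, P> {

    private final String storeName;
    private final Function<V, P> projection;   // drops the unused fields
    private KeyValueStore<K, P> lastSeen;

    public NoOpUpdateGuard(String storeName, Function<V, P> projection) {
        this.storeName = storeName;
        this.projection = projection;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        lastSeen = (KeyValueStore<K, P>) context.getStateStore(storeName);
    }

    @Override
    public P transform(K key, V value) {
        P projected = value == null ? null : projection.apply(value);
        P previous = lastSeen.get(key);
        if (Objects.equals(previous, projected)) {
            return null;     // nothing the join cares about changed: do not fan out
        }
        lastSeen.put(key, projected);
        return projected;
    }

    @Override
    public void close() { }
}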

Regarding the performance of creating an iterator for smaller fan-outs,
users can still just do a groupBy first anyway.



> I could only think of one alternative, but I'm not sure if it's better or
> worse... If the first re-key only needs to preserve the original key, as I
> proposed in #6, then we could store a vector of keys in the value:
>
> Left table:
> 1: A,...
> 2: B,...
> 3: A,...
>
> Gets re-keyed:
> A: [1, 3]
> B: [2]
>
> Then, the rhs part of the join would only need a regular single-key lookup.
> Of course we have to deal with the problem of large values, as there's no
> bound on the number of lhs records that can reference rhs records. Offhand,
> I'd say we could page the values, so when one row is past the threshold, we
> append the key for the next page. Then in most cases, it would be a single
> key lookup, but for large fan-out updates, it would be one per (max value
> size)/(avg lhs key size).
>
> This seems more complex, though... Plus, I think there's some extra
> tracking we'd need to do to know when to emit a retraction. For example,
> when record 1 is deleted, the re-key table would just have (A: [3]). Some
> kind of tombstone is needed so that the join result for 1 can also be
> retracted.
>
> That's all!
>
> Thanks so much to both Adam and Jan for the thoughtful KIP. Sorry the
> discussion has been slow.
> -John
>
>
> On Fri, Oct 12, 2018 at 2:20 AM Jan Filipiak 
> wrote:
>
>> Id say you can just call the vote.
>>
>> that happens all the time, and if something comes up, it just goes back
>> to discuss.
>>
>> would not expect to much attention with another another email in this
>> thread.
>>
>> best Jan
>>
>> On 09.10.2018 13:56, Adam Bellemare wrote:
>>> Hello Contributors
>>>
>>> I know that 2.1 is about to be released, but I do need to bump this to
>> keep
>>> visibility up. I am still intending to push this through once contributor
>>> feedback is given.
>>>
>>> Main points that need addressing:
>>> 1) Any way (or benefit) in structuring the current singular graph node
>> into
>>> multiple nodes? It has a whopping 25 parameters right now. I am a bit
>> fuzzy
>>> on how the optimizations are supposed to work, so I would appreciate any
>>> help on this aspect.
>>>
>>> 2) Overall strategy for joining + resolving. This thread has much
>> discourse
>>> between Jan and I between the current highwater mark proposal and a
>> groupBy
>>> + reduce proposal. I am of the opinion that we need to strictly handle
>> any
>>> chance of out-of-order data and leave none of it up to the consumer. Any
>>> comments or suggestions here would also help.
>>>
>>> 3) Anything else that you see that would prevent this from moving to a
>> vote?
>>>
>>> Thanks
>>>
>>> Adam
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 30, 2018 at 10:23 AM Adam Bellemare <
>> adam.b

Re: [VOTE] - KIP-213 Support non-key joining in KTable

2018-11-26 Thread Jan Filipiak


On 07.11.2018 22:24, Adam Bellemare wrote:
> Bumping this thread, as per convention - 1
>
> On Fri, Nov 2, 2018 at 8:22 AM Adam Bellemare 
> wrote:
>
>> As expected :) But still, thanks none-the-less!
>>
>> On Fri, Nov 2, 2018 at 3:36 AM Jan Filipiak 
>> wrote:
>>
>>> reminder
>>>
>>> On 30.10.2018 15:47, Adam Bellemare wrote:
>>>> Hi All
>>>>
>>>> I would like to call a vote on
>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable
>>> .
>>>> This allows a Kafka Streams DSL user to perform KTable to KTable
>>>> foreign-key joins on their data. I have been using this in production
>>> for
>>>> some time and I have composed a PR that enables this. It is a fairly
>>>> extensive PR, but I believe it will add considerable value to the Kafka
>>>> Streams DSL.
>>>>
>>>> The PR can be found here:
>>>> https://github.com/apache/kafka/pull/5527
>>>>
>>>> See
>>> http://mail-archives.apache.org/mod_mbox/kafka-dev/201810.mbox/browser
>>>> for previous discussion thread.
>>>>
>>>> I would also like to give a shout-out to Jan Filipiak who helped me out
>>>> greatly in this project, and who led the initial work into this problem.
>>>> Without Jan's help and insight I do not think this would have been
>>> possible
>>>> to get to this point.
>>>>
>>>> Adam
>>>>
>>>
>>
>


Re: [VOTE] - KIP-213 Support non-key joining in KTable

2018-11-02 Thread Jan Filipiak
reminder

On 30.10.2018 15:47, Adam Bellemare wrote:
> Hi All
>
> I would like to call a vote on
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable.
> This allows a Kafka Streams DSL user to perform KTable to KTable
> foreign-key joins on their data. I have been using this in production for
> some time and I have composed a PR that enables this. It is a fairly
> extensive PR, but I believe it will add considerable value to the Kafka
> Streams DSL.
>
> The PR can be found here:
> https://github.com/apache/kafka/pull/5527
>
> See http://mail-archives.apache.org/mod_mbox/kafka-dev/201810.mbox/browser
> for previous discussion thread.
>
> I would also like to give a shout-out to Jan Filipiak who helped me out
> greatly in this project, and who led the initial work into this problem.
> Without Jan's help and insight I do not think this would have been possible
> to get to this point.
>
> Adam
>


Re: [VOTE] - KIP-213 Support non-key joining in KTable

2018-11-02 Thread Jan Filipiak
Hi Adam,

Congrats for pulling it off! As I mentioned, it's not something I can /
would use in production. So I am throwing a non-binding minus one in here.

I don't expect it to do any harm to the vote.

Thanks for the credits :)

Best Jan



On 30.10.2018 15:47, Adam Bellemare wrote:
> Hi All
>
> I would like to call a vote on
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable.
> This allows a Kafka Streams DSL user to perform KTable to KTable
> foreign-key joins on their data. I have been using this in production for
> some time and I have composed a PR that enables this. It is a fairly
> extensive PR, but I believe it will add considerable value to the Kafka
> Streams DSL.
>
> The PR can be found here:
> https://github.com/apache/kafka/pull/5527
>
> See http://mail-archives.apache.org/mod_mbox/kafka-dev/201810.mbox/browser
> for previous discussion thread.
>
> I would also like to give a shout-out to Jan Filipiak who helped me out
> greatly in this project, and who led the initial work into this problem.
> Without Jan's help and insight I do not think this would have been possible
> to get to this point.
>
> Adam
>


Re: Throwing away prefetched records optimisation.

2018-10-18 Thread Jan Filipiak
The idea for you would be that MessageChooser could hang on to the
prefetched messages.
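
On the client side that could look roughly like this — a sketch only, not an
existing consumer feature: park everything poll() returns in per-partition
buffers and drain only the partitions that currently have demand (some flow
control, e.g. pausing a partition once its buffer grows too large, would
still be needed):

import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

public class BufferingChooser<K, V> {

    private final Consumer<K, V> consumer;
    private final Map<TopicPartition, Deque<ConsumerRecord<K, V>>> buffers = new HashMap<>();

    public BufferingChooser(Consumer<K, V> consumer) {
        this.consumer = consumer;
    }

    // Poll and park every returned record instead of letting any fetch go to waste.
    public void fetch(Duration timeout) {
        ConsumerRecords<K, V> records = consumer.poll(timeout);
        for (TopicPartition tp : records.partitions()) {
            buffers.computeIfAbsent(tp, p -> new ArrayDeque<>())
                   .addAll(records.records(tp));
        }
    }

    // Hand out the next buffered record for a partition that currently has demand.
    public ConsumerRecord<K, V> choose(Collection<TopicPartition> partitionsWithDemand) {
        for (TopicPartition tp : partitionsWithDemand) {
            Deque<ConsumerRecord<K, V>> buffer = buffers.get(tp);
            if (buffer != null && !buffer.isEmpty()) {
                return buffer.poll();
            }
        }
        return null;   // nothing buffered yet for the partitions with demand
    }
}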

ccing cmcc...@apache.org

@Colin
just for you to see that MessageChooser is a powerful abstraction.

:)

Best jan

On 18.10.2018 13:59, Zahari Dichev wrote:
> Jan,
>
> Quite insightful indeed. I think your propositions are valid.
>
> Ryanne,
>
> I understand that consumers are using a pull model... And yes, indeed if a
> consumer is not ready for more records it surely should not call poll.
> Except that it needs to do so periodically in order to indicate that its
> live. Forget about the "backpressure", I guess I was wrong with phrasing
> this so lets not get caught up on it.
>
> You say pause/resume can be used to prioritise certain topics/partitions
> over others. And indeed this is the case. So instead of thinking about it
> in terms of backpressure, lets put it in a different way. The Akka streams
> connector would like to prioritise certain topics over others, using once
> consumer instance. On top of that, add the detail that the priorities
> change quite frequently (which translates to calling pause/resume
> frequently). So all that being said, what would be a proper way to handle
> the situation without throwing the pre-fetched records away when calling
> poll on a consumer that happens to have a topic that was recently paused
> (and that might be un-paused soon )? Am I the only one who considers that
> an actual problem with the use os pause/resume ? Not sure how to explain
> the situation in a better way..
>
> Zahari
>
>
> On Thu, Oct 18, 2018 at 9:46 AM Zahari Dichev 
> wrote:
>
>> Thanks a lot Jan,
>>
>> I will read it.
>>
>> Zahari
>>
>> On Thu, Oct 18, 2018 at 9:31 AM Jan Filipiak 
>> wrote:
>>
>>> especially my suggestions ;)
>>>
>>> On 18.10.2018 08:30, Jan Filipiak wrote:
>>>> Hi Zahari,
>>>>
>>>> would you be willing to scan through the KIP-349 discussion a little?
>>>> I think it has suggestions that could be interesting for you
>>>>
>>>> Best Jan
>>>>
>>>> On 16.10.2018 09:29, Zahari Dichev wrote:
>>>>> Hi there Kafka developers,
>>>>>
>>>>> I am currently trying to find a solution to an issue that has been
>>>>> manifesting itself in the Akka streams implementation of the Kafka
>>>>> connector. When it comes to consuming messages, the implementation
>>> relies
>>>>> heavily on the fact that we can pause and resume partitions. In some
>>>>> situations when a single consumer instance is shared among several
>>>>> streams,
>>>>> we might end up with frequently pausing and unpausing a set of topic
>>>>> partitions, which is the main facility that allows us to implement back
>>>>> pressure. This however has certain disadvantages, especially when
>>>>> there are
>>>>> two consumers that differ in terms of processing speed.
>>>>>
>>>>> To articulate the issue more clearly, imagine that a consumer maintains
>>>>> assignments for two topic partitions *TP1* and *TP2*. This consumer is
>>>>> shared by two streams - S1 and S2. So effectively when we have demand
>>>>> from
>>>>> only one of the streams - *S1*, we will pause one of the topic
>>> partitions
>>>>> *TP2* and call *poll()* on the consumer to only retrieve the records
>>> for
>>>>> the demanded topic partition - *TP1*. The result of that is all the
>>>>> records
>>>>> that have been prefetched for *TP2* are now thrown away by the fetcher
>>>>> ("*Not
>>>>> returning fetched records for assigned partition TP2 since it is no
>>>>> longer
>>>>> fetchable"*). If we extrapolate that to multiple streams sharing the
>>> same
>>>>> consumer, we might quickly end up in a situation where we throw
>>>>> prefetched
>>>>> data quite often. This does not seem like the most efficient approach
>>> and
>>>>> in fact produces quite a lot of overlapping fetch requests as
>>> illustrated
>>>>> in the following issue:
>>>>>
>>>>> https://github.com/akka/alpakka-kafka/issues/549
>>>>>
>>>>> I am writing this email to get some initial opinion on a KIP I was
>>>>> thinking
>>>>> about. What if we give the clients of the Consumer API a bit more
>>> control
>&g

Re: [DISCUSS] KIP-382: MirrorMaker 2.0

2018-10-18 Thread Jan Filipiak
Then I just hope that in the midst of all these new features I can still
at least use it to copy my messages from A to B later.


Another hint you should be aware of:

https://cwiki.apache.org/confluence/display/KAFKA/Hierarchical+Topics

That was always a design I admired, with active/active replication and
such. It feels like we are moving another step away from this.


On 17.10.2018 17:34, Ryanne Dolan wrote:

Jan, these are two separate issues.

1) consumer coordination should not, ideally, involve unreliable or slow
connections. Naively, a KafkaSourceConnector would coordinate via the
source cluster. We can do better than this, but I'm deferring this
optimization for now.

2) exactly-once between two clusters is mind-bending. But keep in mind
that transactions are managed by the producer, not the consumer. In
fact, it's the producer that requests that offsets be committed for the
current transaction. Obviously, these offsets are committed in whatever
cluster the producer is sending to.

These two issues are closely related. They are both resolved by not
coordinating or committing via the source cluster. And in fact, this is
the general model of SourceConnectors anyway, since most
SourceConnectors _only_ have a destination cluster.

If there is a lot of interest here, I can expound further on this aspect
of MM2, but again I think this is premature until this first KIP is
approved. I intend to address each of these in separate KIPs following
this one.

Ryanne

On Wed, Oct 17, 2018 at 7:09 AM Jan Filipiak <jan.filip...@trivago.com> wrote:

This is not a performance optimisation. Its a fundamental design choice.


I never really took a look how streams does exactly once. (its a trap
anyways and you usually can deal with at least once donwstream pretty
easy). But I am very certain its not gonna get somewhere if offset
commit and record produce cluster are not the same.

Pretty sure without this _design choice_ you can skip on that exactly
once already

Best Jan

On 16.10.2018 18:16, Ryanne Dolan wrote:
 >  >  But one big obstacle in this was
 > always that group coordination happened on the source cluster.
 >
 > Jan, thank you for bringing up this issue with legacy MirrorMaker. I
 > totally agree with you. This is one of several problems with
MirrorMaker
 > I intend to solve in MM2, and I already have a design and
prototype that
 > solves this and related issues. But as you pointed out, this KIP is
 > already rather complex, and I want to focus on the core feature set
 > rather than performance optimizations for now. If we can agree on
what
 > MM2 looks like, it will be very easy to agree to improve its
performance
 > and reliability.
 >
 > That said, I look forward to your support on a subsequent KIP that
 > addresses consumer coordination and rebalance issues. Stay tuned!
 >
 > Ryanne
 >
 > On Tue, Oct 16, 2018 at 6:58 AM Jan Filipiak <jan.filip...@trivago.com> wrote:
 >
 > Hi,
 >
 > Currently MirrorMaker is usually run collocated with the target
 > cluster.
 > This is all nice and good. But one big obstacle in this was
 > always that group coordination happened on the source
cluster. So when
 > then network was congested, you sometimes loose group
membership and
 > have to rebalance and all this.
 >
 > So one big request from we would be the support of having
coordination
 > cluster != source cluster.
 >
 > I would generally say a LAN is better than a WAN for doing group
 > coordinaton and there is no reason we couldn't have a group
consuming
 > topics from a different cluster and committing offsets to another
 > one right?
 >
 > Other than that. It feels like the KIP has too much features
where many
 > of them are not really wanted and counter productive but I
will just
 > wait and see how the discussion goes.
 >
 > Best Jan
 >
 >
 > On 15.10.2018 18:16, Ryanne Dolan wrote:
 >  > Hey y'all!
 >  >
 >  > Please take a look at KIP-382:
 >  >
 >  >
 >
https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
 >  >
 >  > Thanks for your feedback and support.
 >  >
 >  > Ryanne
 >  >
 >



Re: Throwing away prefetched records optimisation.

2018-10-18 Thread Jan Filipiak

especially my suggestions ;)

On 18.10.2018 08:30, Jan Filipiak wrote:

Hi Zahari,

would you be willing to scan through the KIP-349 discussion a little?
I think it has suggestions that could be interesting for you

Best Jan

On 16.10.2018 09:29, Zahari Dichev wrote:

Hi there Kafka developers,

I am currently trying to find a solution to an issue that has been
manifesting itself in the Akka streams implementation of the Kafka
connector. When it comes to consuming messages, the implementation relies
heavily on the fact that we can pause and resume partitions. In some
situations when a single consumer instance is shared among several
streams,
we might end up with frequently pausing and unpausing a set of topic
partitions, which is the main facility that allows us to implement back
pressure. This however has certain disadvantages, especially when
there are
two consumers that differ in terms of processing speed.

To articulate the issue more clearly, imagine that a consumer maintains
assignments for two topic partitions *TP1* and *TP2*. This consumer is
shared by two streams - S1 and S2. So effectively when we have demand
from
only one of the streams - *S1*, we will pause one of the topic partitions
*TP2* and call *poll()* on the consumer to only retrieve the records for
the demanded topic partition - *TP1*. The result of that is all the
records
that have been prefetched for *TP2* are now thrown away by the fetcher
("*Not
returning fetched records for assigned partition TP2 since it is no
longer
fetchable"*). If we extrapolate that to multiple streams sharing the same
consumer, we might quickly end up in a situation where we throw
prefetched
data quite often. This does not seem like the most efficient approach and
in fact produces quite a lot of overlapping fetch requests as illustrated
in the following issue:

https://github.com/akka/alpakka-kafka/issues/549

I am writing this email to get some initial opinion on a KIP I was
thinking
about. What if we give the clients of the Consumer API a bit more control
of what to do with this prefetched data. Two options I am wondering
about:

1. Introduce a configuration setting, such as*
"return-prefetched-data-for-paused-topic-partitions = false"* (have to
think of a better name), which when set to true will return what is
prefetched instead of throwing it away on calling *poll()*. Since this is
amount of data that is bounded by the maximum size of the prefetch, we
can
control what is the most amount of records returned. The client of the
consumer API can then be responsible for keeping that data around and use
it when appropriate (i.e. when demand is present)

2. Introduce a facility to pass in a buffer into which the prefetched
records are drained when poll is called and paused partitions have some
prefetched records.

Any opinions on the matter are welcome. Thanks a lot !

Zahari Dichev



Re: Throwing away prefetched records optimisation.

2018-10-18 Thread Jan Filipiak

Hi Zahari,

would you be willing to scan through the KIP-349 discussion a little?
I think it has suggestions that could be interesting for you

Best Jan

On 16.10.2018 09:29, Zahari Dichev wrote:

Hi there Kafka developers,

I am currently trying to find a solution to an issue that has been
manifesting itself in the Akka streams implementation of the Kafka
connector. When it comes to consuming messages, the implementation relies
heavily on the fact that we can pause and resume partitions. In some
situations when a single consumer instance is shared among several streams,
we might end up with frequently pausing and unpausing a set of topic
partitions, which is the main facility that allows us to implement back
pressure. This however has certain disadvantages, especially when there are
two consumers that differ in terms of processing speed.

To articulate the issue more clearly, imagine that a consumer maintains
assignments for two topic partitions *TP1* and *TP2*. This consumer is
shared by two streams - S1 and S2. So effectively when we have demand from
only one of the streams - *S1*, we will pause one of the topic partitions
*TP2* and call *poll()* on the consumer to only retrieve the records for
the demanded topic partition - *TP1*. The result of that is all the records
that have been prefetched for *TP2* are now thrown away by the fetcher ("*Not
returning fetched records for assigned partition TP2 since it is no longer
fetchable"*). If we extrapolate that to multiple streams sharing the same
consumer, we might quickly end up in a situation where we throw prefetched
data quite often. This does not seem like the most efficient approach and
in fact produces quite a lot of overlapping fetch requests as illustrated
in the following issue:

https://github.com/akka/alpakka-kafka/issues/549

I am writing this email to get some initial opinion on a KIP I was thinking
about. What if we give the clients of the Consumer API a bit more control
of what to do with this prefetched data. Two options I am wondering about:

1. Introduce a configuration setting, such as*
"return-prefetched-data-for-paused-topic-partitions = false"* (have to
think of a better name), which when set to true will return what is
prefetched instead of throwing it away on calling *poll()*. Since this is
amount of data that is bounded by the maximum size of the prefetch, we can
control what is the most amount of records returned. The client of the
consumer API can then be responsible for keeping that data around and use
it when appropriate (i.e. when demand is present)

2. Introduce a facility to pass in a buffer into which the prefetched
records are drained when poll is called and paused partitions have some
prefetched records.

Any opinions on the matter are welcome. Thanks a lot !

Zahari Dichev



Re: [DISCUSS] KIP-382: MirrorMaker 2.0

2018-10-17 Thread Jan Filipiak
This is not a performance optimisation. It's a fundamental design choice.


I never really took a look at how Streams does exactly-once (it's a trap
anyway, and you can usually deal with at-least-once downstream pretty
easily). But I am very certain it's not going to get anywhere if the
offset-commit and record-produce clusters are not the same.

Pretty sure that without this _design choice_ you can forget about
exactly-once already.
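
What I mean concretely: with a transactional producer the consumed offsets
are committed as part of the transaction, i.e. on whatever cluster the
producer writes to — a sketch with the standard producer API (topic name and
group id are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalCopyStep {

    public static void copy(KafkaProducer<byte[], byte[]> targetProducer,
                            ConsumerRecords<byte[], byte[]> records,
                            String targetTopic,
                            String sourceGroupId) {
        targetProducer.beginTransaction();
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<byte[], byte[]> record : records) {
            targetProducer.send(new ProducerRecord<>(targetTopic, record.key(), record.value()));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
        }
        // The offsets land on the producer's cluster, inside the same transaction
        // as the copied records. There is no transactional way to commit them back
        // to the source cluster, which is exactly the point above.
        targetProducer.sendOffsetsToTransaction(offsets, sourceGroupId);
        targetProducer.commitTransaction();
    }
}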

Best Jan

On 16.10.2018 18:16, Ryanne Dolan wrote:
>  >  But one big obstacle in this was
> always that group coordination happened on the source cluster.
>
> Jan, thank you for bringing up this issue with legacy MirrorMaker. I
> totally agree with you. This is one of several problems with MirrorMaker
> I intend to solve in MM2, and I already have a design and prototype that
> solves this and related issues. But as you pointed out, this KIP is
> already rather complex, and I want to focus on the core feature set
> rather than performance optimizations for now. If we can agree on what
> MM2 looks like, it will be very easy to agree to improve its performance
> and reliability.
>
> That said, I look forward to your support on a subsequent KIP that
> addresses consumer coordination and rebalance issues. Stay tuned!
>
> Ryanne
>
On Tue, Oct 16, 2018 at 6:58 AM Jan Filipiak <jan.filip...@trivago.com> wrote:
>
> Hi,
>
> Currently MirrorMaker is usually run collocated with the target
> cluster.
> This is all nice and good. But one big obstacle in this was
> always that group coordination happened on the source cluster. So when
> then network was congested, you sometimes loose group membership and
> have to rebalance and all this.
>
> So one big request from we would be the support of having coordination
> cluster != source cluster.
>
> I would generally say a LAN is better than a WAN for doing group
> coordinaton and there is no reason we couldn't have a group consuming
> topics from a different cluster and committing offsets to another
> one right?
>
> Other than that. It feels like the KIP has too much features where many
> of them are not really wanted and counter productive but I will just
> wait and see how the discussion goes.
>
> Best Jan
>
>
> On 15.10.2018 18:16, Ryanne Dolan wrote:
>  > Hey y'all!
>  >
>  > Please take a look at KIP-382:
>  >
>  >
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
>  >
>  > Thanks for your feedback and support.
>  >
>  > Ryanne
>  >
>


Re: [DISCUSS] KIP-382: MirrorMaker 2.0

2018-10-16 Thread Jan Filipiak
No worries,

glad I could clarify.

On 16.10.2018 15:14, Andrew Otto wrote:
> O ok apologies. Interesting!
>
> On Tue, Oct 16, 2018 at 9:06 AM Jan Filipiak 
> wrote:
>
>> Hi Andrew,
>>
>> thanks for your message, you missed my point.
>>
>> Mirrormaker collocation with target is for sure correct.
>> But then group coordination happens across WAN which is unnecessary.
>> And I request to be thought about again.
>> I made a PR back then for zk Consumer to allow having 2 zookeeper
>> connects. One for group coordination one for broker and topic discovery.
>>
>> I am requesting this to be added to the kip so that the target cluster
>> can become the group coordinator.
>>
>>
>>
>> On 16.10.2018 15:04, Andrew Otto wrote:
>>>> I would generally say a LAN is better than a WAN for doing group
>>>> coordinaton
>>>
>>> For sure, but a LAN is better than a WAN for producing messages too.  If
>>> there is network congestion during network production, messages will be
>>> dropped.  With MirrorMaker currently, you can either skip these dropped
>>> messages, or have the MirrorMaker processes themselves die on produce
>>> failure, which will also cause (a series) of MirrorMaker consumer
>>> rebalances.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Oct 16, 2018 at 7:58 AM Jan Filipiak 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Currently MirrorMaker is usually run collocated with the target cluster.
>>>> This is all nice and good. But one big obstacle in this was
>>>> always that group coordination happened on the source cluster. So when
>>>> then network was congested, you sometimes loose group membership and
>>>> have to rebalance and all this.
>>>>
>>>> So one big request from we would be the support of having coordination
>>>> cluster != source cluster.
>>>>
>>>> I would generally say a LAN is better than a WAN for doing group
>>>> coordinaton and there is no reason we couldn't have a group consuming
>>>> topics from a different cluster and committing offsets to another one
>>>> right?
>>>>
>>>> Other than that. It feels like the KIP has too much features where many
>>>> of them are not really wanted and counter productive but I will just
>>>> wait and see how the discussion goes.
>>>>
>>>> Best Jan
>>>>
>>>>
>>>> On 15.10.2018 18:16, Ryanne Dolan wrote:
>>>>> Hey y'all!
>>>>>
>>>>> Please take a look at KIP-382:
>>>>>
>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
>>>>>
>>>>> Thanks for your feedback and support.
>>>>>
>>>>> Ryanne
>>>>>
>>>>
>>>
>>
>


Re: [DISCUSS] KIP-382: MirrorMaker 2.0

2018-10-16 Thread Jan Filipiak

Hi Andrew,

Thanks for your message, but you missed my point.

MirrorMaker collocation with the target is for sure correct.
But then group coordination happens across the WAN, which is unnecessary,
and I request that this be thought about again.
I made a PR back then for the ZK consumer to allow having two ZooKeeper
connects: one for group coordination, one for broker and topic discovery.


I am requesting this to be added to the KIP so that the target cluster
can become the group coordinator.



On 16.10.2018 15:04, Andrew Otto wrote:

I would generally say a LAN is better than a WAN for doing group
coordinaton


For sure, but a LAN is better than a WAN for producing messages too.  If
there is network congestion during network production, messages will be
dropped.  With MirrorMaker currently, you can either skip these dropped
messages, or have the MirrorMaker processes themselves die on produce
failure, which will also cause (a series) of MirrorMaker consumer
rebalances.






On Tue, Oct 16, 2018 at 7:58 AM Jan Filipiak 
wrote:


Hi,

Currently MirrorMaker is usually run collocated with the target cluster.
This is all nice and good. But one big obstacle in this was
always that group coordination happened on the source cluster. So when
then network was congested, you sometimes loose group membership and
have to rebalance and all this.

So one big request from we would be the support of having coordination
cluster != source cluster.

I would generally say a LAN is better than a WAN for doing group
coordinaton and there is no reason we couldn't have a group consuming
topics from a different cluster and committing offsets to another one
right?

Other than that. It feels like the KIP has too much features where many
of them are not really wanted and counter productive but I will just
wait and see how the discussion goes.

Best Jan


On 15.10.2018 18:16, Ryanne Dolan wrote:

Hey y'all!

Please take a look at KIP-382:



https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0


Thanks for your feedback and support.

Ryanne







Re: [DISCUSS] KIP-382: MirrorMaker 2.0

2018-10-16 Thread Jan Filipiak

Hi,

Currently MirrorMaker is usually run collocated with the target cluster.
This is all nice and good. But one big obstacle with this was always that
group coordination happened on the source cluster. So when the network was
congested, you sometimes lose group membership and have to rebalance and
all this.


So one big request from us would be support for having the coordination
cluster != the source cluster.


I would generally say a LAN is better than a WAN for doing group
coordination, and there is no reason we couldn't have a group consuming
topics from one cluster and committing offsets to another one, right?


Other than that, it feels like the KIP has too many features, many of
which are not really wanted and are counterproductive, but I will just
wait and see how the discussion goes.


Best Jan


On 15.10.2018 18:16, Ryanne Dolan wrote:

Hey y'all!

Please take a look at KIP-382:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0

Thanks for your feedback and support.

Ryanne



Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-10-12 Thread Jan Filipiak

I'd say you can just call the vote.

That happens all the time, and if something comes up, it just goes back
to discuss.


I would not expect too much attention with yet another email in this
thread.


best Jan

On 09.10.2018 13:56, Adam Bellemare wrote:

Hello Contributors

I know that 2.1 is about to be released, but I do need to bump this to keep
visibility up. I am still intending to push this through once contributor
feedback is given.

Main points that need addressing:
1) Any way (or benefit) in structuring the current singular graph node into
multiple nodes? It has a whopping 25 parameters right now. I am a bit fuzzy
on how the optimizations are supposed to work, so I would appreciate any
help on this aspect.

2) Overall strategy for joining + resolving. This thread has much discourse
between Jan and I between the current highwater mark proposal and a groupBy
+ reduce proposal. I am of the opinion that we need to strictly handle any
chance of out-of-order data and leave none of it up to the consumer. Any
comments or suggestions here would also help.

3) Anything else that you see that would prevent this from moving to a vote?

Thanks

Adam







On Sun, Sep 30, 2018 at 10:23 AM Adam Bellemare 
wrote:


Hi Jan

With the Stores.windowStoreBuilder and Stores.persistentWindowStore, you
actually only need to specify the number of segments you want and how large
they are. To the best of my understanding, what happens is that the
segments are automatically rolled over as new data with new timestamps is
created. We use this exact functionality in some of the work done
internally at my company. For reference, this is the hopping windowed store.

https://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#id21

In the code that I have provided, there are going to be two 24h segments.
When a record is put into the windowStore, it will be inserted at time T in
both segments. The two segments will always overlap by 12h. As time goes on
and new records are added (say at time T+12h+), the oldest segment will be
automatically deleted and a new segment created. The records are by default
inserted with the context.timestamp(), such that it is the record time, not
the clock time, which is used.

To the best of my understanding, the timestamps are retained when
restoring from the changelog.

Basically, this is a heavy-handed way to deal with TTL at a segment level,
instead of at an individual record level.
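
(For reference, a minimal sketch of what such a windowed store setup could look
like with the 2.0-era Streams API; the store name and the sizes below are
illustrative, not taken from the actual PR:)

import java.util.concurrent.TimeUnit;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;

public class HighwaterWindowStoreSketch {

    // Coarse segments instead of per-record TTL: records older than the
    // retention period are dropped when their segment is rolled over.
    public static StoreBuilder<WindowStore<String, Long>> highwaterStore() {
        final long retentionMs = TimeUnit.HOURS.toMillis(24);
        final long windowSizeMs = TimeUnit.HOURS.toMillis(24);
        final int numSegments = 2;
        return Stores.windowStoreBuilder(
                Stores.persistentWindowStore(
                        "highwater-mark-store",   // hypothetical store name
                        retentionMs,
                        numSegments,
                        windowSizeMs,
                        false),                   // retainDuplicates
                Serdes.String(),
                Serdes.Long());
    }
}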

On Tue, Sep 25, 2018 at 5:18 PM Jan Filipiak 
wrote:


Will that work? I expected it to blow up with ClassCastException or
similar.

You either would have to specify the window you fetch/put or iterate
across all windows the key was found in right?

I just hope the window-store doesn't check stream-time under the hood;
that would be a questionable interface.

If it does: did you see my comment on checking all the windows earlier?
That would be needed to actually give reasonable time guarantees.

Best



On 25.09.2018 13:18, Adam Bellemare wrote:

Hi Jan

Check for  " highwaterMat " in the PR. I only changed the state store,

not

the ProcessorSupplier.

Thanks,
Adam

On Mon, Sep 24, 2018 at 2:47 PM, Jan Filipiak 


On 24.09.2018 16:26, Adam Bellemare wrote:


@Guozhang

Thanks for the information. This is indeed something that will be
extremely
useful for this KIP.

@Jan
Thanks for your explanations. That being said, I will not be moving ahead
with an implementation using reshuffle/groupBy solution as you propose.
That being said, if you wish to implement it yourself off of my current PR
and submit it as a competitive alternative, I would be more than happy to
help vet that as an alternate solution. As it stands right now, I do not
really have more time to invest into alternatives without there being a
strong indication from the binding voters which they would prefer.



Hey, total no worries. I think I personally gave up on the streams DSL for
some time already, otherwise I would have pulled this KIP through already.
I am currently reimplementing my own DSL based on PAPI.



I will look at finishing up my PR with the windowed state store in the next
week or so, exercising it via tests, and then I will come back for final
discussions. In the meantime, I hope that any of the binding voters could
take a look at the KIP in the wiki. I have updated it according to the
latest plan:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable

I have also updated the KIP PR to use a windowed store. This could be
replaced by the results of KIP-258 whenever they are completed.
https://github.com/apache/kafka/pull/5527

Thanks,

Adam



Is the HighWatermarkResolverProccessorsupplier already updated in the PR?
expected it to change to Windowed,Long Missing something?






On Fri, Sep 14, 2018 at 2:24 PM, Guozhang Wang 
wrote:

Correction on my previous email: KAFKA-5533 is the wrong link, as it is for
corresponding changelog 

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-25 Thread Jan Filipiak

Will that work? I expected it to blow up with ClassCastException or similar.

You either would have to specify the window you fetch/put or iterate 
across all windows the key was found in right?


I just hope the window-store doesn't check stream-time under the hood;
that would be a questionable interface.


If it does: did you see my comment on checking all the windows earlier?
That would be needed to actually give reasonable time guarantees.

Best



On 25.09.2018 13:18, Adam Bellemare wrote:

Hi Jan

Check for  " highwaterMat " in the PR. I only changed the state store, not
the ProcessorSupplier.

Thanks,
Adam

On Mon, Sep 24, 2018 at 2:47 PM, Jan Filipiak 
wrote:




On 24.09.2018 16:26, Adam Bellemare wrote:


@Guozhang

Thanks for the information. This is indeed something that will be
extremely
useful for this KIP.

@Jan
Thanks for your explanations. That being said, I will not be moving ahead
with an implementation using reshuffle/groupBy solution as you propose.
That being said, if you wish to implement it yourself off of my current PR
and submit it as a competitive alternative, I would be more than happy to
help vet that as an alternate solution. As it stands right now, I do not
really have more time to invest into alternatives without there being a
strong indication from the binding voters which they would prefer.



Hey, total no worries. I think I personally gave up on the streams DSL for
some time already, otherwise I would have pulled this KIP through already.
I am currently reimplementing my own DSL based on PAPI.



I will look at finishing up my PR with the windowed state store in the next
week or so, exercising it via tests, and then I will come back for final
discussions. In the meantime, I hope that any of the binding voters could
take a look at the KIP in the wiki. I have updated it according to the
latest plan:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable

I have also updated the KIP PR to use a windowed store. This could be
replaced by the results of KIP-258 whenever they are completed.
https://github.com/apache/kafka/pull/5527

Thanks,

Adam



Is the HighWatermarkResolverProccessorsupplier already updated in the PR?
expected it to change to Windowed,Long Missing something?






On Fri, Sep 14, 2018 at 2:24 PM, Guozhang Wang 
wrote:

Correction on my previous email: KAFKA-5533 is the wrong link, as it is for
corresponding changelog mechanisms. But as part of KIP-258 we do want to
have "handling out-of-order data for source KTable" such that instead of
blindly applying the updates to the materialized store, i.e. following offset
ordering, we will reject updates that are older than the current key's
timestamps, i.e. following timestamp ordering.
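
(A self-contained sketch of the timestamp-ordering idea described above; this
is illustrative only, not KIP-258's actual code, and the class and field names
are made up:)

import java.util.HashMap;
import java.util.Map;

public class TimestampOrderingSketch {

    // Keep the latest timestamp per key and drop updates that are older than
    // what is stored, i.e. timestamp ordering instead of offset ordering.
    static class ValueAndTimestamp {
        final String value;
        final long timestamp;
        ValueAndTimestamp(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, ValueAndTimestamp> store = new HashMap<>();

    public boolean maybeUpdate(String key, String value, long timestamp) {
        ValueAndTimestamp current = store.get(key);
        if (current != null && timestamp < current.timestamp) {
            return false;   // out-of-order update: reject it
        }
        store.put(key, new ValueAndTimestamp(value, timestamp));
        return true;
    }
}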


Guozhang

On Fri, Sep 14, 2018 at 11:21 AM, Guozhang Wang 
wrote:

Hello Adam,


Thanks for the explanation. Regarding the final step (i.e. the high
watermark store, now altered to be replaced with a window store), I think
another current on-going KIP may actually help:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB


This is for adding the timestamp into a key-value store (i.e. only for
non-windowed KTable), and then one of its usage, as described in
https://issues.apache.org/jira/browse/KAFKA-5533, is that we can then
"reject" updates from the source topics if its timestamp is smaller than
the current key's latest update timestamp. I think it is very similar to
what you have in mind for high watermark based filtering, while you only
need to make sure that the timestamps of the joining records are correctly
inherited through the whole topology to the final stage.

Note that this KIP is for key-value store and hence non-windowed KTables
only, but for windowed KTables we do not really have a good support for
their joins anyways (https://issues.apache.org/jira/browse/KAFKA-7107) I
think we can just consider non-windowed KTable-KTable non-key joins for
now. In which case, KIP-258 should help.



Guozhang



On Wed, Sep 12, 2018 at 9:20 PM, Jan Filipiak 


wrote:



On 11.09.2018 18:00, Adam Bellemare wrote:

Hi Guozhang


Current highwater mark implementation would grow endlessly based on
primary key of original event. It is a pair of (key>, ). This is used to
differentiate between late arrivals and new updates. My newest proposal
would be to replace it with a Windowed state store of Duration N. This
would allow the same behaviour, but cap the size based on time. This
should allow for all late-arriving events to be processed, and should be
customizable by the user to tailor to their own needs (ie: perhaps just
10 minutes of window, or perhaps 7 days...).


Hi Adam, using time based retention can do the trick here. Even if I
would still like to see the automatic repartitioning optional since I
would just reshuffle again. With windowed store I am a little bit
sceptical abo

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-24 Thread Jan Filipiak




On 24.09.2018 16:26, Adam Bellemare wrote:

@Guozhang

Thanks for the information. This is indeed something that will be extremely
useful for this KIP.

@Jan
Thanks for your explanations. That being said, I will not be moving ahead
with an implementation using reshuffle/groupBy solution as you propose.
That being said, if you wish to implement it yourself off of my current PR
and submit it as a competitive alternative, I would be more than happy to
help vet that as an alternate solution. As it stands right now, I do not
really have more time to invest into alternatives without there being a
strong indication from the binding voters which they would prefer.



Hey, total no worries. I think I personally gave up on the streams DSL 
for some time already, otherwise I would have pulled this KIP through 
already. I am currently reimplementing my own DSL based on PAPI.




I will look at finishing up my PR with the windowed state store in the next
week or so, exercising it via tests, and then I will come back for final
discussions. In the meantime, I hope that any of the binding voters could
take a look at the KIP in the wiki. I have updated it according to the
latest plan:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable

I have also updated the KIP PR to use a windowed store. This could be
replaced by the results of KIP-258 whenever they are completed.
https://github.com/apache/kafka/pull/5527

Thanks,

Adam


Is the HighWatermarkResolverProccessorsupplier already updated in the 
PR? expected it to change to Windowed,Long Missing something?






On Fri, Sep 14, 2018 at 2:24 PM, Guozhang Wang  wrote:


Correction on my previous email: KAFKA-5533 is the wrong link, as it is for
corresponding changelog mechanisms. But as part of KIP-258 we do want to
have "handling out-of-order data for source KTable" such that instead of
blindly apply the updates to the materialized store, i.e. following offset
ordering, we will reject updates that are older than the current key's
timestamps, i.e. following timestamp ordering.


Guozhang

On Fri, Sep 14, 2018 at 11:21 AM, Guozhang Wang 
wrote:


Hello Adam,

Thanks for the explanation. Regarding the final step (i.e. the high
watermark store, now altered to be replaced with a window store), I think
another current on-going KIP may actually help:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB


This is for adding the timestamp into a key-value store (i.e. only for
non-windowed KTable), and then one of its usage, as described in
https://issues.apache.org/jira/browse/KAFKA-5533, is that we can then
"reject" updates from the source topics if its timestamp is smaller than
the current key's latest update timestamp. I think it is very similar to
what you have in mind for high watermark based filtering, while you only
need to make sure that the timestamps of the joining records are correctly
inherited through the whole topology to the final stage.

Note that this KIP is for key-value store and hence non-windowed KTables
only, but for windowed KTables we do not really have a good support for
their joins anyways (https://issues.apache.org/jira/browse/KAFKA-7107) I
think we can just consider non-windowed KTable-KTable non-key joins for
now. In which case, KIP-258 should help.



Guozhang



On Wed, Sep 12, 2018 at 9:20 PM, Jan Filipiak 
wrote:



On 11.09.2018 18:00, Adam Bellemare wrote:


Hi Guozhang

Current highwater mark implementation would grow endlessly based on
primary key of original event. It is a pair of (key>, ). This is used to
differentiate between late arrivals and new updates. My newest proposal
would be to replace it with a Windowed state store of Duration N. This
would allow the same behaviour, but cap the size based on time. This
should allow for all late-arriving events to be processed, and should be
customizable by the user to tailor to their own needs (ie: perhaps just
10 minutes of window, or perhaps 7 days...).


Hi Adam, using time based retention can do the trick here. Even if I
would still like to see the automatic repartitioning optional since I
would just reshuffle again. With windowed store I am a little bit
sceptical about how to determine the window. So essentially one could run
into problems when the rapid change happens near a window border. I will
check your implementation in detail; if it's problematic, we could still
check _all_ windows on read with not too bad a performance impact, I guess.
Will let you know if the implementation would be correct as is. I
would not like to assume that: offset(A) < offset(B) => timestamp(A) <
timestamp(B). I think we can't expect that.




@Jan
I believe I understand what you mean now - thanks for the diagram, it
did really help. You are correct that I do not have the original primary
key available, and I can see that if it was available then you would be
ab

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-12 Thread Jan Filipiak


On 11.09.2018 18:00, Adam Bellemare wrote:

Hi Guozhang

Current highwater mark implementation would grow endlessly based on 
primary key of original event. It is a pair of (key>, ). This is used to 
differentiate between late arrivals and new updates. My newest 
proposal would be to replace it with a Windowed state store of 
Duration N. This would allow the same behaviour, but cap the size 
based on time. This should allow for all late-arriving events to be 
processed, and should be customizable by the user to tailor to their 
own needs (ie: perhaps just 10 minutes of window, or perhaps 7 days...).
Hi Adam, using time based retention can do the trick here. Even if I 
would still like to see the automatic repartitioning optional since I 
would just reshuffle again. With windowed store I am a little bit 
sceptical about how to determine the window. So essentially one could run
into problems when the rapid change happens near a window border. I will
check your implementation in detail; if it's problematic, we could still
check _all_ windows on read with not too bad a performance impact, I guess.
Will let you know if the implementation would be correct as is. I
would not like to assume that: offset(A) < offset(B) => timestamp(A)
< timestamp(B). I think we can't expect that.



@Jan
I believe I understand what you mean now - thanks for the diagram, it 
did really help. You are correct that I do not have the original 
primary key available, and I can see that if it was available then you 
would be able to add and remove events from the Map. That being said, 
I encourage you to finish your diagrams / charts just for clarity for 
everyone else.


Yeah 100%, this giphy thing is just really hard work. But I understand
the benefits for the rest. Sorry about the original primary key; we have
join and groupBy implemented on our own in PAPI and basically are not using
any DSL (just the abstraction). Completely missed that in the original DSL
it's not there and just assumed it. Total brain mess-up on my end. Will
finish the chart as soon as I get a quiet evening this week.


My follow up question for you is, won't the Map stay inside the State 
Store indefinitely after all of the changes have propagated? Isn't 
this effectively the same as a highwater mark state store?
Thing is that if the map is empty, the subtractor is going to return `null`
and the key is removed from the keyspace. But there is going to be a
store 100%; the good thing is that I can use this store directly for
materialize() / enableSendingOldValues(). It is a regular store, satisfying
all guarantees needed for further groupBy / join. The Windowed store is
not keeping the values, so for the next stateful operation we would
need to instantiate an extra store, or we would have the window store also
keep the values then.


Long story short: if we can flip in a custom groupBy before
repartitioning to the original primary key, I think it would help the
users big time in building efficient apps. Given the original primary
key issue I understand that we do not have a solid foundation to build on.
Leaving the primary key carry-along to the user is very unfortunate. I can
understand the decision going like that; I do not think it's a good decision.







Thanks
Adam






On Tue, Sep 11, 2018 at 10:07 AM, Prajakta Dumbre 
mailto:dumbreprajakta...@gmail.com>> wrote:


please remove me from this group

On Tue, Sep 11, 2018 at 1:29 PM Jan Filipiak
mailto:jan.filip...@trivago.com>>
wrote:

> Hi Adam,
>
> give me some time, will make such a chart. last time i didn't
get along
> well with giphy and ruined all your charts.
> Hopefully i can get it done today
>
> On 08.09.2018 16:00, Adam Bellemare wrote:
> > Hi Jan
> >
> > I have included a diagram of what I attempted on the KIP.
> >
>

https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable#KIP-213Supportnon-keyjoininginKTable-GroupBy+Reduce/Aggregate
> >
> > I attempted this back at the start of my own implementation of
this
> > solution, and since I could not get it to work I have since
discarded the
> > code. At this point in time, if you wish to continue pursuing
for your
> > groupBy solution, I ask that you please create a diagram on
the KIP
> > carefully explaining your solution. Please feel free to use
the image I
> > just posted as a starting point. I am having trouble
understanding your
> > explanations but I think that a carefully constructed diagram
will clear
> up
> > any misunderstandings. Alternately, please post a
comprehensive PR with
>

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-11 Thread Jan Filipiak

Hi Adam,

give me some time, will make such a chart. Last time I didn't get along
well with giphy and ruined all your charts.

Hopefully I can get it done today

On 08.09.2018 16:00, Adam Bellemare wrote:

Hi Jan

I have included a diagram of what I attempted on the KIP.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable#KIP-213Supportnon-keyjoininginKTable-GroupBy+Reduce/Aggregate

I attempted this back at the start of my own implementation of this
solution, and since I could not get it to work I have since discarded the
code. At this point in time, if you wish to continue pursuing for your
groupBy solution, I ask that you please create a diagram on the KIP
carefully explaining your solution. Please feel free to use the image I
just posted as a starting point. I am having trouble understanding your
explanations but I think that a carefully constructed diagram will clear up
any misunderstandings. Alternately, please post a comprehensive PR with
your solution. I can only guess at what you mean, and since I value my own
time as much as you value yours, I believe it is your responsibility to
provide an implementation instead of me trying to guess.

Adam









On Sat, Sep 8, 2018 at 8:00 AM, Jan Filipiak 
wrote:


Hi James,

Nice to see you being interested. Kafka Streams at this point supports
all sorts of joins as long as both streams have the same key.
Adam is currently implementing a join where a KTable and a KTable can have
a one-to-many relationship (1:n). We exploit that rocksdb is a
datastore that keeps data sorted (At least exposes an API to access the
stored data in a sorted fashion).

I think the technical caveats are well understood now and we are basically
down to philosophy and API Design ( when Adam sees my newest message).
I have a lengthy track record of losing those kinds of arguments within the
streams community and I have no clue why. So I literally can't wait for you
to churn through this thread and give your opinion on how we should design
the return type of the oneToManyJoin and how much power we want to give to
the user vs "simplicity" (where simplicity isn't really that, as users still
need to understand it, I argue)

waiting for you to join in on the discussion

Best Jan



On 07.09.2018 15:49, James Kwan wrote:


I am new to this group and I found this subject interesting.  Sounds like
you guys want to implement a join table of two streams? Is there somewhere
I can see the original requirement or proposal?

On Sep 7, 2018, at 8:13 AM, Jan Filipiak 

wrote:


On 05.09.2018 22:17, Adam Bellemare wrote:


I'm currently testing using a Windowed Store to store the highwater
mark.
By all indications this should work fine, with the caveat being that it
can
only resolve out-of-order arrival for up to the size of the window (ie:
24h, 72h, etc). This would remove the possibility of it being unbounded
in
size.

With regards to Jan's suggestion, I believe this is where we will have
to
remain in disagreement. While I do not disagree with your statement
about
there likely to be additional joins done in a real-world workflow, I do
not
see how you can conclusively deal with out-of-order arrival of
foreign-key
changes and subsequent joins. I have attempted what I think you have
proposed (without a high-water, using groupBy and reduce) and found
that if
the foreign key changes too quickly, or the load on a stream thread is
too
high, the joined messages will arrive out-of-order and be incorrectly
propagated, such that an intermediate event is represented as the final
event.


Can you shed some light on your groupBy implementation. There must be
some sort of flaw in it.
I have a suspicion where it is, I would just like to confirm. The idea
is bullet proof and it must be
an implementation mess up. I would like to clarify before we draw a
conclusion.

   Repartitioning the scattered events back to their original

partitions is the only way I know how to conclusively deal with
out-of-order events in a given time frame, and to ensure that the data
is
eventually consistent with the input events.

If you have some code to share that illustrates your approach, I would
be
very grateful as it would remove any misunderstandings that I may have.


ah okay you were looking for my code. I don't have something easily
readable here as its bloated with OO-patterns.

its anyhow trivial:

@Override
 public T apply(K aggKey, V value, T aggregate)
 {
 Map currentStateAsMap = asMap(aggregate); << imaginary
 U toModifyKey = mapper.apply(value);
 << this is the place where people actually gonna have issues
and why you probably couldn't do it. we would need to find a solution here.
I didn't realize that yet.
 << we propagate the field in the joiner, so that we can pick
it up in an aggregate. Probably you have not thought of this in your
approach right?
 << I am very open to find a generic s

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-08 Thread Jan Filipiak

Hi James,

Nice to see you being interested. Kafka Streams at this point supports
all sorts of joins as long as both streams have the same key.
Adam is currently implementing a join where a KTable and a KTable can
have a one-to-many relationship (1:n). We exploit that rocksdb is a
datastore that keeps data sorted (At least exposes an API to access the 
stored data in a sorted fashion).


I think the technical caveats are well understood now and we are 
basically down to philosophy and API Design ( when Adam sees my newest 
message).
I have a lengthy track record of losing those kinds of arguments within
the streams community and I have no clue why. So I literally can't wait
for you to churn through this thread and give your opinion on how we
should design the return type of the oneToManyJoin and how much power we
want to give to the user vs "simplicity" (where simplicity isn't really
that, as users still need to understand it, I argue)


waiting for you to join in on the discussion

Best Jan


On 07.09.2018 15:49, James Kwan wrote:

I am new to this group and I found this subject interesting.  Sounds like you 
guys want to implement a join table of two streams? Is there somewhere I can 
see the original requirement or proposal?


On Sep 7, 2018, at 8:13 AM, Jan Filipiak  wrote:


On 05.09.2018 22:17, Adam Bellemare wrote:

I'm currently testing using a Windowed Store to store the highwater mark.
By all indications this should work fine, with the caveat being that it can
only resolve out-of-order arrival for up to the size of the window (ie:
24h, 72h, etc). This would remove the possibility of it being unbounded in
size.

With regards to Jan's suggestion, I believe this is where we will have to
remain in disagreement. While I do not disagree with your statement about
there likely to be additional joins done in a real-world workflow, I do not
see how you can conclusively deal with out-of-order arrival of foreign-key
changes and subsequent joins. I have attempted what I think you have
proposed (without a high-water, using groupBy and reduce) and found that if
the foreign key changes too quickly, or the load on a stream thread is too
high, the joined messages will arrive out-of-order and be incorrectly
propagated, such that an intermediate event is represented as the final
event.

Can you shed some light on your groupBy implementation. There must be some sort 
of flaw in it.
I have a suspicion where it is, I would just like to confirm. The idea is 
bullet proof and it must be
an implementation mess up. I would like to clarify before we draw a conclusion.


  Repartitioning the scattered events back to their original
partitions is the only way I know how to conclusively deal with
out-of-order events in a given time frame, and to ensure that the data is
eventually consistent with the input events.

If you have some code to share that illustrates your approach, I would be
very grateful as it would remove any misunderstandings that I may have.

ah okay you were looking for my code. I don't have something easily readable 
here as its bloated with OO-patterns.

its anyhow trivial:

@Override
public T apply(K aggKey, V value, T aggregate)
{
Map currentStateAsMap = asMap(aggregate); << imaginary
U toModifyKey = mapper.apply(value);
<< this is the place where people actually gonna have issues and 
why you probably couldn't do it. we would need to find a solution here. I didn't 
realize that yet.
<< we propagate the field in the joiner, so that we can pick it up 
in an aggregate. Probably you have not thought of this in your approach right?
<< I am very open to find a generic solution here. In my honest 
opinion this is broken in KTableImpl.GroupBy that it looses the keys and only 
maintains the aggregate key.
<< I abstracted it away back then way before i was thinking of 
oneToMany join. That is why I didn't realize its significance here.
<< Opinions?

for (V m : current)
{
currentStateAsMap.put(mapper.apply(m), m);
}
if (isAdder)
{
currentStateAsMap.put(toModifyKey, value);
}
else
{
currentStateAsMap.remove(toModifyKey);
if(currentStateAsMap.isEmpty()){
return null;
}
}
return asAggregateType(currentStateAsMap);
}






Thanks,

Adam





On Wed, Sep 5, 2018 at 3:35 PM, Jan Filipiak 
wrote:


Thanks Adam for bringing Matthias to speed!

about the differences. I think re-keying back should be optional at best.
I would say we return a KScatteredTable with reshuffle() returning
KTable to make the backwards repartitioning optional.
I am also in a big favour of doing the out of order processing using group
by instead high water mark tracking.
Just because unbounded growth is just scary + It saves us the header stuff.

I think the abs

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-07 Thread Jan Filipiak



On 05.09.2018 22:17, Adam Bellemare wrote:

I'm currently testing using a Windowed Store to store the highwater mark.
By all indications this should work fine, with the caveat being that it can
only resolve out-of-order arrival for up to the size of the window (ie:
24h, 72h, etc). This would remove the possibility of it being unbounded in
size.

With regards to Jan's suggestion, I believe this is where we will have to
remain in disagreement. While I do not disagree with your statement about
there likely to be additional joins done in a real-world workflow, I do not
see how you can conclusively deal with out-of-order arrival of foreign-key
changes and subsequent joins. I have attempted what I think you have
proposed (without a high-water, using groupBy and reduce) and found that if
the foreign key changes too quickly, or the load on a stream thread is too
high, the joined messages will arrive out-of-order and be incorrectly
propagated, such that an intermediate event is represented as the final
event.
Can you shed some light on your groupBy implementation. There must be 
some sort of flaw in it.
I have a suspicion where it is, I would just like to confirm. The idea 
is bullet proof and it must be
an implementation mess up. I would like to clarify before we draw a 
conclusion.



  Repartitioning the scattered events back to their original
partitions is the only way I know how to conclusively deal with
out-of-order events in a given time frame, and to ensure that the data is
eventually consistent with the input events.

If you have some code to share that illustrates your approach, I would be
very grateful as it would remove any misunderstandings that I may have.


ah okay you were looking for my code. I don't have something easily 
readable here as its bloated with OO-patterns.


its anyhow trivial:

@Override
public T apply(K aggKey, V value, T aggregate)
{
Map currentStateAsMap = asMap(aggregate); << imaginary
U toModifyKey = mapper.apply(value);
<< this is the place where people actually gonna have 
issues and why you probably couldn't do it. we would need to find a 
solution here. I didn't realize that yet.
<< we propagate the field in the joiner, so that we can 
pick it up in an aggregate. Probably you have not thought of this in 
your approach right?
<< I am very open to find a generic solution here. In my 
honest opinion this is broken in KTableImpl.GroupBy that it looses the 
keys and only maintains the aggregate key.
<< I abstracted it away back then way before i was thinking 
of oneToMany join. That is why I didn't realize its significance here.

<< Opinions?

for (V m : current)
{
currentStateAsMap.put(mapper.apply(m), m);
}
if (isAdder)
{
currentStateAsMap.put(toModifyKey, value);
}
else
{
currentStateAsMap.remove(toModifyKey);
if(currentStateAsMap.isEmpty()){
return null;
}
}
return asAggregateType(currentStateAsMap);
}







Thanks,

Adam





On Wed, Sep 5, 2018 at 3:35 PM, Jan Filipiak 
wrote:


Thanks Adam for bringing Matthias to speed!

about the differences. I think re-keying back should be optional at best.
I would say we return a KScatteredTable with reshuffle() returning
KTable to make the backwards repartitioning optional.
I am also in a big favour of doing the out of order processing using group
by instead high water mark tracking.
Just because unbounded growth is just scary + It saves us the header stuff.

I think the abstraction of always repartitioning back is just not so
strong. Like the work has been done before we partition back and grouping
by something else afterwards is really common.






On 05.09.2018 13:49, Adam Bellemare wrote:


Hi Matthias

Thank you for your feedback, I do appreciate it!

While name spacing would be possible, it would require to deserialize

user headers what implies a runtime overhead. I would suggest to no
namespace for now to avoid the overhead. If this becomes a problem in
the future, we can still add name spacing later on.


Agreed. I will go with using a reserved string and document it.



My main concern about the design is the type of the result KTable: If I
understood the proposal correctly,


In your example, you have table1 and table2 swapped. Here is how it works
currently:

1) table1 has the records that contain the foreign key within their value.
table1 input stream: , , 
table2 input stream: , 

2) A Value mapper is required to extract the foreign key.
table1 foreign key mapper: ( value => value.fk )

The mapper is applied to each element in table1, and a new combined key is
made:
table1 mapped: , , 

3) The rekeyed events are copartitioned with table2:

a) Stream Thread with Partition 0:
RepartitionedTable1: , 
Table2: 

b) Stream Thread with Partiti

Re: [DISCUSS] KIP-349 Priorities for Source Topics

2018-09-07 Thread Jan Filipiak



On 07.09.2018 05:21, Matthias J. Sax wrote:

I am still not sure how Samza's MessageChooser actually works and how
this would align with KafkaConsumer fetch requests.


Maybe I can give some background (conceptually); @Colin, please correct
me if I say anything wrong:


When a fetch request is send, all assigned topic partitions of the
consumers are ordered in a list and the broker will return data starting
with the first partition in the list and returning as many messages as
possible (either until partition end-offset or fetch size is reached).
If end-offset is reached but not fetch size, the next partition in the
list is considered. This repeats until fetch size is reached. If a
partition in the list has no data available, it's skipped.

When data is return to the consumer, the consumer moves all partitions
for which data was returned to the end of the list. Thus, in the next
fetch request, data from other partitions is returned (this avoid
starving of partitions). Note, that partitions that do not return data
(even if they are in the head of the list), stay in the head of the list.

(Note, that this topic list is actually also maintained broker side to
allow for incremental fetch request).
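
(Illustrative-only model of the re-ordering described above, not the actual
consumer or broker code; the class below is a toy that only tracks the order
of partition names:)

import java.util.ArrayList;
import java.util.List;

public class FetchOrderModel {

    private final List<String> partitions = new ArrayList<>();

    FetchOrderModel(List<String> assignedPartitions) {
        partitions.addAll(assignedPartitions);
    }

    // Partitions that returned data are moved to the tail of the list so that
    // other partitions get a turn in the next fetch; partitions that returned
    // nothing keep their position near the head.
    void onFetchResponse(List<String> partitionsThatReturnedData) {
        for (String p : partitionsThatReturnedData) {
            if (partitions.remove(p)) {
                partitions.add(p);
            }
        }
    }
}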

Because different partitions are hosted on different brokers, the
consumer will send fetch requests to different brokers (@Colin: how does
this work in detail? Does the consumer just do a round robin over all
brokers it needs to fetch from?)


Given the original proposal about topic priorities, it would be possible
to have multiple lists, one per priority. If topic partitions return
data, they would be moved to the end of their priority list. The list
would be consider in priority order. Thus, higher priority topic
partitions stay at the head and are consider first.


If I understand MessageChooser correctly (and consider a client side
implementation only), the MessageChooser can only pick from the data
that was returned in a fetch request -- but it cannot alter what the
fetch request returns.

It seems that for each fetched message, update() would be called and the
MessageChooser buffers the message. When a message should be processed
(ie, when Iterator.next() is called on the iterator returned from
poll()), choose() is called to return a message from the buffer (based
on whatever strategy the MessageChooser implements).
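
(To make the update()/choose() contract concrete, here is a hypothetical
client-side chooser in the spirit of Samza's MessageChooser, using
ConsumerRecord instead of IncomingMessageEnvelope; this is not an existing
Kafka API, only a sketch:)

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.kafka.clients.consumer.ConsumerRecord;

interface RecordChooser<K, V> {
    void update(ConsumerRecord<K, V> record);   // a newly fetched record is buffered
    ConsumerRecord<K, V> choose();              // next record to process, or null
}

// Trivial FIFO strategy, just to show the shape of an implementation.
class FifoRecordChooser<K, V> implements RecordChooser<K, V> {

    private final Deque<ConsumerRecord<K, V>> buffer = new ArrayDeque<>();

    @Override
    public void update(ConsumerRecord<K, V> record) {
        buffer.addLast(record);
    }

    @Override
    public ConsumerRecord<K, V> choose() {
        return buffer.pollFirst();
    }
}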

Thus, MessageChooser can actually not prioritize some topics over other,
because the prioritization depends on the fetch requests that the
MessageChooser cannot influence (MessageChooser can only prioritize
records from different partitions that are already fetched). Thus,
MessageChooser interface seems not to achieve what was proposed.

@Jan: please correct me, if I my understanding of MessageChooser is wrong.

Hi Matthias,

It is. I explicitly said that if I were going for it, the Message
Chooser could get the chance to pause and resume partitions. I
mentioned that in the mail on ~4.9 (pausing / resuming capabilities).
Probably easily missed.


Main point is: I didn't say we should adopt exactly the Samza interface; that
would be nonsense. Just get some inspiration and see if there isn't a more
powerful abstraction, since we are tackling the problem twice anyway.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-349%3A+Priorities+for+Source+Topics
https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization

I am definitely against simple priority list
I am definitely against broker support.
I am happy to sacrifice incremental fetch support (mainly for 
MirrorMakers and replicas with tons of partitions).

I would love to see a Message Chooser.

With this said, I wish y'all best of luck and fun finding a good solution.
*mic drop*


Bye




If my understanding is correct, I am not sure how the MessageChooser
interface could be used to prioritize topics in fetch requests.


Overall, I get the impression that topic prioritization and
MessageChosser are orthogonal (or complementary) to each other.



-Matthias



On 9/6/18 5:24 AM, Jan Filipiak wrote:

On 05.09.2018 17:18, Colin McCabe wrote:

Hi all,

I agree that DISCUSS is more appropriate than VOTE at this point,
since I don't remember the last discussion coming to a definite
conclusion.

I guess my concern is that this will add complexity and memory
consumption on the server side.  In the case of incremental fetch
requests, we will have to track at least two extra bytes per
partition, to know what the priority of each partition is within each
active fetch session.

It would be nice to hear more about the use-cases for this feature.  I
think Gwen asked about this earlier, and I don't remember reading a
response.  The fact that we're now talking about Samza interfaces is a
bit of a red flag.  After all, Samza didn't need partition priorities
to do what it did.  You can do a lot with muting partitions and using
appropriate threading in your code.

to show a usecase, I linked 353, especially since

Re: [DISCUSS] KIP-349 Priorities for Source Topics

2018-09-06 Thread Jan Filipiak



On 05.09.2018 17:18, Colin McCabe wrote:

Hi all,

I agree that DISCUSS is more appropriate than VOTE at this point, since I don't 
remember the last discussion coming to a definite conclusion.

I guess my concern is that this will add complexity and memory consumption on 
the server side.  In the case of incremental fetch requests, we will have to 
track at least two extra bytes per partition, to know what the priority of each 
partition is within each active fetch session.

It would be nice to hear more about the use-cases for this feature.  I think 
Gwen asked about this earlier, and I don't remember reading a response.  The 
fact that we're now talking about Samza interfaces is a bit of a red flag.  
After all, Samza didn't need partition priorities to do what it did.  You can 
do a lot with muting partitions and using appropriate threading in your code.
To show a use case, I linked 353, especially since the threading model is
pretty fixed there.


No clue why Samza should be a red flag. They handle it purely on the
consumer side, which I think is reasonable. I would not try to implement
any broker-side support for this if I were to do it. Just don't support
incremental fetch then.
In the end, if you have broker-side support, you would need to ship the
logic of the message chooser to the broker. I don't think that would
allow for the flexibility I had in mind in a purely consumer-based
implementation.





For example, you can hand data from a partition off to a work queue with a 
fixed size, which is handled by a separate service thread.  If the queue gets 
full, you can mute the partition until some of the buffered data is processed.  
Kafka Streams uses a similar approach to avoid reading partition data that 
isn't immediately needed.

There might be some use-cases that need priorities eventually, but I'm 
concerned that we're jumping the gun by trying to implement this before we know 
what they are.

best,
Colin


On Wed, Sep 5, 2018, at 01:06, Jan Filipiak wrote:

On 05.09.2018 02:38, n...@afshartous.com wrote:

On Sep 4, 2018, at 4:20 PM, Jan Filipiak  wrote:

what I meant is literally this interface:

https://samza.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/system/chooser/MessageChooser.html

Hi Jan,

Thanks for the reply and I have a few questions.  This Samza doc

https://samza.apache.org/learn/documentation/0.14/container/streams.html

indicates that the chooser is set via configuration.  Are you suggesting adding 
a new configuration for Kafka ?  Seems like we could also have a method on 
KafkaConsumer

  public void register(MessageChooser messageChooser)

I don't have strong opinions regarding this. I like configs; I also
don't think it would be a problem to have both.


to make it more dynamic.

Also, the Samza MessageChooser interface has method

/* Notify the chooser that a new envelope is available for a processing. */
void update(IncomingMessageEnvelope envelope)

and I’m wondering how this method would be translated to Kafka API.  In 
particular what corresponds to IncomingMessageEnvelope.

I think Samza uses the envelope abstraction as they support other sources
besides Kafka as well. They are more
on the Spark end of things when it comes to different input types. I
don't have strong opinions but it feels like
we wouldn't need such a thing in the Kafka consumer but would just use a
regular ConsumerRecord or so.

Best,
--
Nick








Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-05 Thread Jan Filipiak

Thanks Adam for bringing Matthias up to speed!

About the differences: I think re-keying back should be optional at best.
I would say we return a KScatteredTable, with reshuffle() returning a
KTable, to make the backwards repartitioning optional.
I am also in big favour of doing the out-of-order processing using
groupBy instead of high-water-mark tracking.

Just because unbounded growth is just scary, plus it saves us the header stuff.

I think the abstraction of always repartitioning back is just not so
strong. Like, the work has been done before we partition back, and
grouping by something else afterwards is really common.






On 05.09.2018 13:49, Adam Bellemare wrote:

Hi Matthias

Thank you for your feedback, I do appreciate it!


While name spacing would be possible, it would require to deserialize
user headers what implies a runtime overhead. I would suggest to no
namespace for now to avoid the overhead. If this becomes a problem in
the future, we can still add name spacing later on.

Agreed. I will go with using a reserved string and document it.




My main concern about the design is the type of the result KTable: If I

understood the proposal correctly,


In your example, you have table1 and table2 swapped. Here is how it works
currently:

1) table1 has the records that contain the foreign key within their value.
table1 input stream: , , 
table2 input stream: , 

2) A Value mapper is required to extract the foreign key.
table1 foreign key mapper: ( value => value.fk )

The mapper is applied to each element in table1, and a new combined key is
made:
table1 mapped: , , 

3) The rekeyed events are copartitioned with table2:

a) Stream Thread with Partition 0:
RepartitionedTable1: , 
Table2: 

b) Stream Thread with Partition 1:
RepartitionedTable1: 
Table2: 

4) From here, they can be joined together locally by applying the joiner
function.
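
(A purely illustrative sketch of steps 2-3 above; Table1Value and CombinedKey
are made-up names, not the KIP's actual classes:)

import java.util.function.Function;

public class ForeignKeyRekeySketch {

    static class Table1Value {
        final String foreignKey;   // points at a table2 key
        final String payload;
        Table1Value(String foreignKey, String payload) {
            this.foreignKey = foreignKey;
            this.payload = payload;
        }
    }

    static class CombinedKey {
        final String foreignKey;   // drives partitioning, co-located with table2
        final String primaryKey;   // original table1 key, carried along
        CombinedKey(String foreignKey, String primaryKey) {
            this.foreignKey = foreignKey;
            this.primaryKey = primaryKey;
        }
    }

    // Step 2: the user-supplied mapper extracts the foreign key from the value.
    static final Function<Table1Value, String> FK_MAPPER = value -> value.foreignKey;

    // Step 3: each table1 record is re-keyed to (foreign key, original key) so
    // that it is co-partitioned with table2 before the local join of step 4.
    static CombinedKey rekey(String table1Key, Table1Value value) {
        return new CombinedKey(FK_MAPPER.apply(value), table1Key);
    }
}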



At this point, Jan's design and my design deviate. My design goes on to
repartition the data post-join and resolve out-of-order arrival of records,
finally returning the data keyed just the original key. I do not expose the
CombinedKey or any of the internals outside of the joinOnForeignKey
function. This does make for larger footprint, but it removes all agency
for resolving out-of-order arrivals and handling CombinedKeys from the
user. I believe that this makes the function much easier to use.

Let me know if this helps resolve your questions, and please feel free to
add anything else on your mind.

Thanks again,
Adam




On Tue, Sep 4, 2018 at 8:36 PM, Matthias J. Sax 
wrote:


Hi,

I am just catching up on this thread. I did not read everything so far,
but want to share couple of initial thoughts:



Headers: I think there is a fundamental difference between header usage
in this KIP and KP-258. For 258, we add headers to changelog topic that
are owned by Kafka Streams and nobody else is supposed to write into
them. In fact, no user header are written into the changelog topic and
thus, there are not conflicts.

Nevertheless, I don't see a big issue with using headers within Streams.
As long as we document it, we can have some "reserved" header keys and
users are not allowed to use when processing data with Kafka Streams.
IMHO, this should be ok.


I think there is a safe way to avoid conflicts, since these headers are
only needed in internal topics (I think):
For internal and changelog topics, we can namespace all headers:
* user-defined headers are namespaced as "external." + headerKey
* internal headers are namespaced as "internal." + headerKey

While name spacing would be possible, it would require to deserialize
user headers what implies a runtime overhead. I would suggest to no
namespace for now to avoid the overhead. If this becomes a problem in
the future, we can still add name spacing later on.
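
(As an aside, a rough illustration of the name-spacing idea using the public
Headers API; this is only a sketch, not something Streams actually does:)

import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.header.internals.RecordHeaders;

public class HeaderNamespacingSketch {

    // Prefix user-supplied headers with "external." and the internal one with
    // "internal." before writing to an internal/changelog topic.
    static Headers namespaced(Headers userHeaders, String internalKey, byte[] internalValue) {
        Headers out = new RecordHeaders();
        userHeaders.forEach(h -> out.add(new RecordHeader("external." + h.key(), h.value())));
        out.add(new RecordHeader("internal." + internalKey, internalValue));
        return out;
    }
}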



My main concern about the design is the type of the result KTable: If I
understood the proposal correctly,

KTable table1 = ...
KTable table2 = ...

KTable joinedTable = table1.join(table2,...);

implies that the `joinedTable` has the same key as the left input table.
IMHO, this does not work because if table2 contains multiple rows that
join with a record in table1 (what is the main purpose of a foreign key
join), the result table would only contain a single join result, but not
multiple.

Example:

table1 input stream: 
table2 input stream: , 

We use the table2 value as a foreign key to the table1 key (ie, "A" joins). If the
result key is the same key as the key of table1, this implies that the
result can either be  or  but not both.
Because they share the same key, whatever result record we emit later
overwrites the previous result.

This is the reason why Jan originally proposed to use a combination of
both primary keys of the input tables as key of the output table. This
makes the keys of the output table unique and we can store both in the
output table:

Result would be , 


Thoughts?


-Matthias




On 9/4/18 1:36 PM, Jan Fil

Re: [DISCUSS] KIP-349 Priorities for Source Topics

2018-09-05 Thread Jan Filipiak



On 05.09.2018 02:38, n...@afshartous.com wrote:



On Sep 4, 2018, at 4:20 PM, Jan Filipiak  wrote:

what I meant is literally this interface:

https://samza.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/system/chooser/MessageChooser.html

Hi Jan,

Thanks for the reply and I have a few questions.  This Samza doc

   https://samza.apache.org/learn/documentation/0.14/container/streams.html

indicates that the chooser is set via configuration.  Are you suggesting adding 
a new configuration for Kafka ?  Seems like we could also have a method on 
KafkaConsumer

 public void register(MessageChooser messageChooser)
I don't have strong opinions regarding this. I like configs; I also
don't think it would be a problem to have both.




to make it more dynamic.

Also, the Samza MessageChooser interface has method

   /* Notify the chooser that a new envelope is available for a processing. */
void update(IncomingMessageEnvelope envelope)

and I’m wondering how this method would be translated to Kafka API.  In 
particular what corresponds to IncomingMessageEnvelope.
I think Samza uses the envelope abstraction as they support other sources
besides Kafka as well. They are more
on the Spark end of things when it comes to different input types. I
don't have strong opinions but it feels like
we wouldn't need such a thing in the Kafka consumer but would just use a
regular ConsumerRecord or so.


Best,
--
   Nick








Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-04 Thread Jan Filipiak

Just one remark here.
The high-watermark could be disregarded. The decision about the forward
depends on the size of the aggregated map.
Only maps with exactly one element would be unpacked and forwarded. Empty
maps would be published as a delete. Any other count
of map entries is in the "waiting for correct deletes to arrive" state.
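
(A minimal sketch of that forwarding rule, illustrative only; the method and
parameter names are made up:)

import java.util.Map;
import java.util.function.BiConsumer;

public class ForwardBySizeSketch {

    // Decide what to emit purely from the size of the aggregated map:
    // empty map -> publish a delete; exactly one entry -> unpack and forward;
    // more than one entry -> still waiting for the corresponding deletes.
    static <K, V> void maybeForward(K key, Map<?, V> aggregated, BiConsumer<K, V> forward) {
        if (aggregated.isEmpty()) {
            forward.accept(key, null);
        } else if (aggregated.size() == 1) {
            forward.accept(key, aggregated.values().iterator().next());
        }
    }
}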

On 04.09.2018 21:29, Adam Bellemare wrote:

It does look like I could replace the second repartition store and
highwater store with a groupBy and reduce.  However, it looks like I would
still need to store the highwater value within the materialized store, to
compare the arrival of out-of-order records (assuming my understanding of
THIS is correct...). This in effect is the same as the design I have now,
just with the two tables merged together.




Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-04 Thread Jan Filipiak

I was completely brain-dead in my mail before.
Completely missed that you already repartition back for the user and only
apply the high water mark filtering after the second repartition source.

I missed the sink / source bounce while poking, sorry for the confusion.

Yes, a groupBy can give the same semantics w/o the ever-growing
highwatermark store. I made it optional
so that when users wanna group by something different after the join
they still can, and it will shuffle one time less.


I felt that was / is useful and we sometimes exploit this fact in
some of our joins; sometimes we do aggregate back to
the "original key" as well.

I have a hard time estimating the implications of the ever-growing
highwatermark store. From the top of my head
there wouldn't be a current use case where it would be a concern for us.
That might be different for other users though.


Sorry for the confusion!

Best Jan

PS planning to put the comments regarding java stuff on the PR.




On 04.09.2018 21:29, Adam Bellemare wrote:

Yep, I definitely misunderstood some of the groupBy and groupByKey
functionality. I would say disregard what I said in my previous email
w.r.t. my assumptions about record size. I was looking into the code more
today and I did not understand it correctly the first time I read it.

It does look like I could replace the second repartition store and
highwater store with a groupBy and reduce.  However, it looks like I would
still need to store the highwater value within the materialized store, to
compare the arrival of out-of-order records (assuming my understanding of
THIS is correct...). This in effect is the same as the design I have now,
just with the two tables merged together.

I will keep looking at this but I am not seeing a great simplification.
Advice and comments are welcomed as always.

On Tue, Sep 4, 2018 at 9:38 AM, Adam Bellemare 
wrote:


As I was looking more into RocksDB TTL, I see that we currently do not
support it in Kafka Streams due to a number of technical reasons. As I
don't think that I will be tackling that JIRA at the moment, the current
implementation is indeed unbounded in the highwater table growth.

An alternate option may be to replace the highwater mark table with a
groupBy and then perform a reduce/aggregate. My main concern here is that
technically we could have an unbounded amount of data to group together by
key, and the grouped size could exceed the Kafka maximum record size. When
I built the highwater mark table my intent was to work around this
possibility, as each record is evaluated independently and record sizing
issues do not come into play. If I am incorrect in this assumption, please
correct me, because I am a bit fuzzy on exactly how the groupBy currently
works.

Any thoughts on this are appreciated. I will revisit it again when I have
a bit more time.

Thanks



On Mon, Sep 3, 2018 at 4:55 PM, Adam Bellemare 
wrote:


Hi Jan

Thank you for taking the time to look into my PR. I have updated it
accordingly along with the suggestions from John. Please note that I am by
no means an expert on Java, so I really do appreciate any Java-specific
feedback you may have. Do not worry about being overly verbose on it.

You are correct with regards to the highwater mark growing unbounded. One
option would be to implement the rocksDB TTL to expire records. I am open
to other options as well.

I have tried to detail the reasoning behind it in the KIP - I have added
additional comments and I hope that it is clearer now.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable#KIP-213Supportnon-keyjoininginKTable-MultipleRapidForeign-KeyValueChanges-Whyahighwatertableisrequired.

Please keep in mind that there may be something about ordering guarantees
that I am not aware of. As far as I know, once you begin to operate on
events in parallel across different nodes within the processor API, there
are no ordering guarantees and everything is simple first-come,
first-served(processed). If this is not the case then I am unaware of that
fact.



Thanks

Adam





On Mon, Sep 3, 2018 at 8:38 AM, Jan Filipiak 
wrote:


Finished my deeper scan on your approach.
Most of the comments I put at the PR are minor code style things.
One forward call seems to be a bug though, would be great if you could
double check.

the one problem I see is that the high watermark store grows unbounded.
A key being deleted from the source table does not lead to deletion in
the watermark store.

I also don't quite grasp the concept why it's needed.  I think the whole
offset part can go away?
It seems to deal with node failures of some kind but everything should
turn out okay without it?

Best Jan


On 01.09.2018 20:44, Guozhang Wang wrote:


Yes Adam, that makes sense.

I think it may be better to have a working PR to review before we
complete
the VOTE thread. In my previous experience a large feature like this are
mostly definitely going to miss s

Re: [DISCUSS] KIP-349 Priorities for Source Topics

2018-09-04 Thread Jan Filipiak

Hi Nick,

sorry for not being so helpful. I don't quite understand what _this_
would be in your email.


Is this the part in question?

interface TopicPrioritizer {
  List prioritize(List topicPriorities);
}

public void registerTopicPrioritizer(TopicPrioritizer topicPrioritizer);

this is basically the same as

public void subscribe(java.util.List topicPriorities);

what I meant is literally this interface:

https://samza.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/system/chooser/MessageChooser.html

On top, I would change choose() to return a wrapper that would allow for
pausing / resuming a topic partition (Connect kinda style) and gets
called immediately after again.


With that, one could prevent OOM when update() only gives messages that you
don't need at the moment.


Was that more helpful?

Best Jan


On 04.09.2018 15:06, n...@afshartous.com wrote:

@Jan - can you comment on whether or not this is what you had in mind ?
--
   Nick


On Aug 30, 2018, at 10:18 AM, n...@afshartous.com wrote:


Just clarifying that the API below would be in addition to the API specified in 
KIP-349

   
https://cwiki.apache.org/confluence/display/KAFKA/KIP-349%3A+Priorities+for+Source+Topics
 

   --
   Nick



On Aug 30, 2018, at 9:57 AM, n...@afshartous.com  
wrote:

Here’s an attempt at incorporating a Samza MessageChooser type interface.
--
  Nick


New interface TopicPrioritizer allows one to create a method implementation 
that prioritizes topics. The topic priorities that were assigned with method 
KafkaConsumer.subscribe may or may not be used.  The input is the list of
subscribed topics, and the output is an ordered list of topics. The ordering 
represents the priority that the TopicPrioritizer implementation has assigned.  
Calls to KafkaConsumer.poll will use the TopicPrioritizer to determine the 
priority of topics.

interface TopicPrioritizer {
   List prioritize(List topicPriorities);
}


New method KafkaConsumer.registerTopicPrioritizer is used to register the 
TopicPrioritizer

  public void registerTopicPrioritizer(TopicPrioritizer topicPrioritizer);















Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-09-03 Thread Jan Filipiak

Finished my deeper scan on your approach.
Most of the comments I put at the PR are minor code style things.
One forward call seems to be a bug though, would be great if you could 
double check.


the one problem I see is that the high watermark store grows unbounded.
A key being deleted from the source table does not lead to deletion in 
the watermark store.


I also don't quite grasp why it's needed.  I think the whole 
offset part can go away?
It seems to deal with node failures of some kind but everything should 
turn out okay without it?


Best Jan


On 01.09.2018 20:44, Guozhang Wang wrote:

Yes Adam, that makes sense.

I think it may be better to have a working PR to review before we complete
the VOTE thread. In my previous experience, a large feature like this is
most definitely going to miss some devils in the details in the design
and wiki discussion phases.

That would unfortunately mean that your implementations may need to be
modified / updated along with the review and further KIP discussion. I can
understand this can be painful, but that may be the best option we can do
to avoid as much work to be wasted as possible.


Guozhang


On Wed, Aug 29, 2018 at 10:06 AM, Adam Bellemare 
wrote:


Hi Guozhang

By workflow I mean just the overall process of how the KIP is implemented:
any ideas on ways to reduce the topic count or the materializations, whether there
is a better way to resolve out-of-order results than a highwater mark table, whether the
design philosophy of “keep everything encapsulated within the join
function” is appropriate, etc. I can implement the changes that John
suggested, but if my overall workflow is not acceptable I would rather
address that before making minor changes.

If this requires a full candidate PR ready to go to prod then I can make
those changes. Hope that clears things up.

Thanks

Adam


On Aug 29, 2018, at 12:42 PM, Guozhang Wang  wrote:

Hi Adam,

What do you mean by "additional comments on the workflow"? Do you mean to
let others review your PR https://github.com/apache/kafka/pull/5527 ? Is it
ready for reviews?


Guozhang

On Tue, Aug 28, 2018 at 5:00 AM, Adam Bellemare <

adam.bellem...@gmail.com>

wrote:


Okay, I will implement John's suggestion of namespacing the external
headers prior to processing, and then removing the namespacing prior to
emitting. A potential future KIP could be to provide this namespacing
automatically.

I would also appreciate any other additional comments on the workflow. My
goal is to suss out agreement prior to moving to a vote.


On Mon, Aug 27, 2018 at 3:19 PM, Guozhang Wang 

wrote:

I like John's idea as well: for this KIP specifically as we do not expect
any other consumers to read the repartition topics externally, we can
slightly prefix the header to be safe, while keeping the additional cost
(note the header field is per-record, so any additional byte is per-record
as well) low.


Guozhang

On Tue, Aug 21, 2018 at 11:58 AM, Adam Bellemare <

adam.bellem...@gmail.com

wrote:


Hi John

That is an excellent idea. The header usage I propose would be limited
entirely to internal topics, and this could very well be the solution to
potential conflicts. If we do not officially reserve a prefix "__" then I
think this would be the safest idea, as it would entirely avoid any
accidents (perhaps if a company is using its own "__" prefix for other
reasons).

Thanks

Adam


On Tue, Aug 21, 2018 at 2:24 PM, John Roesler 

wrote:

Just a quick thought regarding headers:

I think there is no absolute-safe ways to avoid conflicts, but we can still
consider using some name patterns to reduce the likelihood as much as
possible.. e.g. consider sth. like the internal topics naming: e.g.
"__internal_[name]"?

I think there is a safe way to avoid conflicts, since these headers are
only needed in internal topics (I think):
For internal and changelog topics, we can namespace all headers:
* user-defined headers are namespaced as "external." + headerKey
* internal headers are namespaced as "internal." + headerKey

This is a lot of characters, so we could use a sigil instead (e.g., "_" for
internal, "~" for external)

We simply apply the namespacing when we read user headers from external
topics into the topology and then de-namespace them before we emit them to
an external topic (via "to" or "through").
Now, it is not possible to collide with user-defined headers.

That said, I'd also be fine with just reserving "__" as a header prefix and
not worrying about collisions.

Thanks for the KIP,
-John
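
As a concrete illustration of the namespacing idea (not part of any KIP), a
small helper along these lines could rewrite header keys on the way into and
out of the topology. The class name and the "external."/"internal." prefixes
are assumptions taken from the wording above.

// Hypothetical sketch of the header-namespacing idea; not an existing Kafka Streams API.
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.header.internals.RecordHeaders;

public final class HeaderNamespacer {

    private static final String EXTERNAL = "external.";
    private static final String INTERNAL = "internal.";

    // Called when a record enters the topology from an external topic.
    public static Headers namespaceUserHeaders(Headers original) {
        Headers namespaced = new RecordHeaders();
        for (Header h : original) {
            namespaced.add(new RecordHeader(EXTERNAL + h.key(), h.value()));
        }
        return namespaced;
    }

    // Called when the topology attaches its own metadata header.
    public static void addInternalHeader(Headers headers, String key, byte[] value) {
        headers.add(new RecordHeader(INTERNAL + key, value));
    }

    // Called before a record is written back out via to()/through().
    public static Headers denamespace(Headers internal) {
        Headers restored = new RecordHeaders();
        for (Header h : internal) {
            if (h.key().startsWith(EXTERNAL)) {
                // Restore the user's original header key; internal headers never leave the topology.
                restored.add(new RecordHeader(h.key().substring(EXTERNAL.length()), h.value()));
            }
        }
        return restored;
    }
}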

On Tue, Aug 21, 2018 at 9:48 AM Jan Filipiak <

jan.filip...@trivago.com

wrote:


Still haven't completely grasped it.
Sorry, will read more


On 17.08.2018 21:23, Jan Filipiak wrote:
Cool stuff.

I made some random remarks. Did not touch the core of the algorithm yet.

Will do Monday 100%

I don't see In

Re: [DISCUSS] KIP-349 Priorities for Source Topics

2018-09-03 Thread Jan Filipiak




On 30.08.2018 15:17, Matthias J. Sax wrote:

Nick,

Would be good to understand the difference between the current approach
and what Jan suggested. If we believe that the current proposal is too
limited in functionality and also hard to extend later on, it might make
sense to work on a more generic solution from the beginning on. On the
other hand, if we can extend the current proposal easily, I see no
reason to build this incrementally.

The difference is that, from my POV, topic priorities are a special
implementation of a MessageChooser.




@Jan: Can you summarize the differences from you point of view? You also
linked to KIP-353 in the VOTE thread -- how does it relate to this KIP?

The relationship is rather obvious to me. KIP-353 can be implemented as
a MessageChooser.


-Matthias

To add more confusion: why not also make "pause partition" (as in Connect)
part of the MessageChooser interface?
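
To illustrate why topic priorities can be seen as just one MessageChooser
strategy, here is a rough sketch of a priority-based chooser. The update/choose
shape mirrors Samza's MessageChooser; the class and all names are illustrative
assumptions, not an existing Kafka API.

// Hypothetical default chooser that always hands back a buffered record from the
// highest-priority topic that currently has data.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class TopicPriorityChooser<K, V> {

    private final Map<String, Integer> topicPriority;  // higher value = served first
    private final Map<String, Deque<ConsumerRecord<K, V>>> buffered = new HashMap<>();

    public TopicPriorityChooser(Map<String, Integer> topicPriority) {
        this.topicPriority = topicPriority;
    }

    // Buffer a fetched record per topic (the "update" step).
    public void update(ConsumerRecord<K, V> record) {
        buffered.computeIfAbsent(record.topic(), t -> new ArrayDeque<>()).add(record);
    }

    // Return the next record from the non-empty topic with the highest priority.
    public ConsumerRecord<K, V> choose() {
        String best = null;
        int bestPriority = Integer.MIN_VALUE;
        for (Map.Entry<String, Deque<ConsumerRecord<K, V>>> e : buffered.entrySet()) {
            int p = topicPriority.getOrDefault(e.getKey(), 0);
            if (!e.getValue().isEmpty() && p > bestPriority) {
                best = e.getKey();
                bestPriority = p;
            }
        }
        return best == null ? null : buffered.get(best).poll(); // null = nothing buffered yet
    }
}

A KIP-349-style implementation would simply be this chooser with the priorities
taken from subscribe(); a KIP-353-style one would instead pick by record timestamp.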





On 8/12/18 11:15 PM, Matt Farmer wrote:

Ah, sorry, yes it does.

On Sun, Aug 12, 2018 at 2:58 PM  wrote:


Does this clarify ?
--
   Nick

On Aug 9, 2018, at 7:44 PM, n...@afshartous.com wrote:

Since there are questions I changed the heading from VOTE to DISCUSS

On Aug 8, 2018, at 9:09 PM, Matt Farmer  wrote:

Is it worth spelling out explicitly what the behavior is when two topics
have the same priority? I'm a bit fuzzy on how we choose what topics to
consume from right now, if I'm being honest, so it could be useful to
outline the current behavior in the background and to spell out how that
would change (or if it would change) when two topics are given the same
priority.


I added an additional note in the KIP’s Compatibility section to clarify
that current behavior would not change in order to preserve backwards
compatibility.

Also, how does this play with max.poll.records? Does the consumer read from

all the topics in priority order until we've hit the number of records or
the poll timeout? Or does it immediately return the high priority records
without pulling low priority records?


My own interpretation would be to read from all the topics in priority
order as the consumer is subscribed to multiple topics.
--
   Nick











Re: [VOTE] KIP-349 Priorities for Source Topics

2018-08-23 Thread Jan Filipiak

also:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization


On 20.08.2018 15:01, Thomas Becker wrote:

I agree with Jan. A strategy interface for choosing processing order is nice, 
and would hopefully be a step towards getting this in streams.

-Tommy

On Mon, 2018-08-20 at 12:52 +0200, Jan Filipiak wrote:

On 20.08.2018 00:19, Matthias J. Sax wrote:

@Nick: A KIP is only accepted if it got 3 binding votes, ie, votes from

committers. If you close the vote before that, the KIP would not be

accepted. Note that committers need to pay attention to a lot of KIPs

and it can take a while until people can look into it. Thanks for your

understanding.


@Jan: Can you give a little bit more context on your concerns? It's

unclear why you mean atm.

Just saying that we should peek at the Samza approach, it's a much more

powerful abstraction. We can ship a default MessageChooser

that looks at the topics priority.

@Adam: anyone can vote :)




-Matthias


On 8/19/18 9:58 AM, Adam Bellemare wrote:

While I am not sure if I can or can’t vote, my question re: Jan’s comment is, 
“should we be implementing it as Samza does?”


I am not familiar with the drawbacks of the current approach vs how samza does 
it.


On Aug 18, 2018, at 5:06 PM, n...@afshartous.com<mailto:n...@afshartous.com> 
wrote:



I only saw one vote on KIP-349, just checking to see if anyone else would like 
to vote before closing this out.

--

   Nick



On Aug 13, 2018, at 9:19 PM, n...@afshartous.com<mailto:n...@afshartous.com> 
wrote:



Hi All,


Calling for a vote on KIP-349


https://cwiki.apache.org/confluence/display/KAFKA/KIP-349%3A+Priorities+for+Source+Topics


--

  Nick















Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-08-21 Thread Jan Filipiak

Still haven't completely grasped it.
Sorry, will read more

On 17.08.2018 21:23, Jan Filipiak wrote:

Cool stuff.

I made some random remarks. Did not touch the core of the algorithm yet.

Will do Monday 100%

I don't see Interactive Queries :) like that!




On 17.08.2018 20:28, Adam Bellemare wrote:

I have submitted a PR with my code against trunk:
https://github.com/apache/kafka/pull/5527

Do I continue on this thread or do we begin a new one for discussion?

On Thu, Aug 16, 2018 at 1:40 AM, Jan Filipiak 
wrote:

even before message headers, the option for me always existed to just wrap
the messages into my own custom envelop.
So I of course thought this through. One sentence in your last email
triggered all the thought process I put in the back then
again to design it in the, what i think is the "kafka-way". It ended up
ranting a little about what happened in the past.

I see plenty of colleagues of mine falling into traps in the API, that I
did warn about in the 1.0 DSL rewrite. I have the same
feeling again. So I hope it gives you some insights into my though
process. I am aware that since i never ported 213 to higher
streams version, I don't really have a steak here and initially I didn't
feel like actually sending it. But maybe you can pull
something good from it.

  Best jan



On 15.08.2018 04:44, Adam Bellemare wrote:


@Jan
Thanks Jan. I take it you mean "key-widening" somehow includes information
about which record is processed first? I understand about a CombinedKey
with both the Foreign and Primary key, but I don't see how you track
ordering metadata in there unless you actually included a metadata field in
the key type as well.

@Guozhang
As Jan mentioned earlier, is Record Headers mean to strictly be used in
just the user-space? It seems that it is possible that a collision on the
(key,value) tuple I wish to add to it could occur. For instance, if I
wanted to add a ("foreignKeyOffset",10) to the Headers but the user already
specified their own header with the same key name, then it appears there
would be a collision. (This is one of the issues I brought up in the KIP).


I will be posting a prototype PR against trunk within the next day or two.
One thing I need to point out is that my design very strictly wraps the
entire foreignKeyJoin process entirely within the DSL function. There is no
exposure of CombinedKeys or widened keys, nothing to resolve with regards
to out-of-order processing and no need for the DSL user to even know what's
going on inside of the function. The code simply returns the results of the
join, keyed by the original key. Currently my API mirrors identically the
format of the data returned by the regular join function, and I believe
that this is very useful to many users of the DSL. It is my understanding
that one of the main design goals of the DSL is to provide higher level
functionality without requiring the users to know exactly what's going on
under the hood. With this in mind, I thought it best to solve ordering and
partitioning problems within the function and eliminate the requirement for
users to do additional work after the fact to resolve the results of their
join. Basically, I am assuming that most users of the DSL just "want it to
work" and want it to be easy. I did this operating under the assumption
that if a user truly wants to optimize their own workflow down to the
finest details then they will break from strictly using the DSL and move
down to the processors API.


I think. The abstraction is not powerful enough
to not have kafka specifics leak up The leak I currently think this has is
that you can not reliable prevent the delete coming out first,
before you emit the correct new record. As it is an abstraction entirely
around kafka.
I can only recommend to not to. Honesty and simplicity should always be
first prio
trying to hide this just makes it more complex, less understandable and
will lead to mistakes
in usage.

Exactly why I am also in big disfavour of GraphNodes and later
optimization stages.
Can someone give me an example of an optimisation that really can't be
handled by the user
constructing his topology differently?
Having reusable Processor API components accessible by the DSL and
composable as
one likes is exactly where DSL should max out and KSQL should do the next
step.
I find it very unprofessional from a software engineering approach to run
software where
you can not at least senseful reason about the inner workings of the
libraries used.
Gives this people have to read and understand in anyway, why try to hide
it?

It really miss the beauty of 0.10 version DSL.
Apparently not a thing I can influence but just warn about.

@gouzhang
you can't imagine how many extra IQ-Statestores I constantly prune from
stream app's
because people just keep passing Materialized's into all the operations.
:D :'-(
I regre

Re: [ANNOUNCE] New Kafka PMC member: Dong Lin

2018-08-21 Thread Jan Filipiak

Congrats Dong!




On 20.08.2018 16:35, Attila Sasvári wrote:

Congratulations Dong! Well deserved.

Regards,
Attila
Paolo Patierno  ezt írta (időpont: 2018. aug. 20., H
15:09):


Congratulations Dong!

Paolo Patierno
Principal Software Engineer (IoT) @ Red Hat
Microsoft MVP on Azure & IoT
Microsoft Azure Advisor

Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience


From: Dongjin Lee 
Sent: Monday, August 20, 2018 1:00 PM
To: dev@kafka.apache.org
Subject: Re: [ANNOUNCE] New Kafka PMC member: Dong Lin

Congratulations!!

On Mon, Aug 20, 2018 at 9:22 PM Mickael Maison 
wrote:


Congratulations Dong!
On Mon, Aug 20, 2018 at 12:46 PM Manikumar 
wrote:

Congrats,  Dong Lin!

On Mon, Aug 20, 2018 at 4:25 PM Ismael Juma  wrote:


Hi everyone,

Dong Lin became a committer in March 2018. Since then, he has

remained

active in the community and contributed a number of patches, reviewed
several pull requests and participated in numerous KIP discussions. I

am

happy to announce that Dong is now a member of the
Apache Kafka PMC.

Congratulation Dong! Looking forward to your future contributions.

Ismael, on behalf of the Apache Kafka PMC


--
*Dongjin Lee*

*A hitchhiker in the mathematical world.*

*github:  github.com/dongjinleekr
linkedin:

kr.linkedin.com/in/dongjinleekr

slideshare:

www.slideshare.net/dongjinleekr

*





Re: [VOTE] KIP-349 Priorities for Source Topics

2018-08-20 Thread Jan Filipiak



On 20.08.2018 00:19, Matthias J. Sax wrote:

@Nick: A KIP is only accepted if it got 3 binding votes, ie, votes from
committers. If you close the vote before that, the KIP would not be
accepted. Note that committers need to pay attention to a lot of KIPs
and it can take a while until people can look into it. Thanks for your
understanding.

@Jan: Can you give a little bit more context on your concerns? It's
unclear why you mean atm.
Just saying that we should peek at the Samza approach, it's a much more 
powerful abstraction. We can ship a default MessageChooser

that looks at the topics priority.

@Adam: anyone can vote :)



-Matthias

On 8/19/18 9:58 AM, Adam Bellemare wrote:

While I am not sure if I can or can’t vote, my question re: Jan’s comment is, 
“should we be implementing it as Samza does?”

I am not familiar with the drawbacks of the current approach vs how samza does 
it.


On Aug 18, 2018, at 5:06 PM, n...@afshartous.com wrote:


I only saw one vote on KIP-349, just checking to see if anyone else would like 
to vote before closing this out.
--
  Nick



On Aug 13, 2018, at 9:19 PM, n...@afshartous.com wrote:


Hi All,

Calling for a vote on KIP-349

https://cwiki.apache.org/confluence/display/KAFKA/KIP-349%3A+Priorities+for+Source+Topics

--
 Nick











Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-08-17 Thread Jan Filipiak

Cool stuff.

I made some random remarks. Did not touch the core of the algorithm yet.

Will do Monday 100%

I don't see Interactive Queries :) like that!




On 17.08.2018 20:28, Adam Bellemare wrote:

I have submitted a PR with my code against trunk:
https://github.com/apache/kafka/pull/5527

Do I continue on this thread or do we begin a new one for discussion?

On Thu, Aug 16, 2018 at 1:40 AM, Jan Filipiak 
wrote:


even before message headers, the option for me always existed to just wrap
the messages into my own custom envelope.
So I of course thought this through. One sentence in your last email
triggered all the thought process I put in back then
again to design it in what I think is the "kafka-way". It ended up
ranting a little about what happened in the past.

I see plenty of colleagues of mine falling into traps in the API, that I
did warn about in the 1.0 DSL rewrite. I have the same
feeling again. So I hope it gives you some insights into my thought
process. I am aware that since I never ported 213 to a higher
streams version, I don't really have a stake here and initially I didn't
feel like actually sending it. But maybe you can pull
something good from it.

  Best jan



On 15.08.2018 04:44, Adam Bellemare wrote:


@Jan
Thanks Jan. I take it you mean "key-widening" somehow includes information
about which record is processed first? I understand about a CombinedKey
with both the Foreign and Primary key, but I don't see how you track
ordering metadata in there unless you actually included a metadata field
in
the key type as well.

@Guozhang
As Jan mentioned earlier, are Record Headers meant to be used strictly in
just the user-space? It seems that it is possible that a collision on the
(key,value) tuple I wish to add to it could occur. For instance, if I
wanted to add a ("foreignKeyOffset",10) to the Headers but the user
already
specified their own header with the same key name, then it appears there
would be a collision. (This is one of the issues I brought up in the KIP).



I will be posting a prototype PR against trunk within the next day or two.
One thing I need to point out is that my design very strictly wraps the
entire foreignKeyJoin process entirely within the DSL function. There is
no
exposure of CombinedKeys or widened keys, nothing to resolve with regards
to out-of-order processing and no need for the DSL user to even know
what's
going on inside of the function. The code simply returns the results of
the
join, keyed by the original key. Currently my API mirrors identically the
format of the data returned by the regular join function, and I believe
that this is very useful to many users of the DSL. It is my understanding
that one of the main design goals of the DSL is to provide higher level
functionality without requiring the users to know exactly what's going on
under the hood. With this in mind, I thought it best to solve ordering and
partitioning problems within the function and eliminate the requirement
for
users to do additional work after the fact to resolve the results of their
join. Basically, I am assuming that most users of the DSL just "want it to
work" and want it to be easy. I did this operating under the assumption
that if a user truly wants to optimize their own workflow down to the
finest details then they will break from strictly using the DSL and move
down to the processors API.


I think the abstraction is not powerful enough
to not have kafka specifics leak up. The leak I currently think this has is
that you cannot reliably prevent the delete coming out first,
before you emit the correct new record, as it is an abstraction entirely
around kafka.
I can only recommend not to. Honesty and simplicity should always be
first priority;
trying to hide this just makes it more complex, less understandable and
will lead to mistakes
in usage.

Exactly why I am also in big disfavour of GraphNodes and later
optimization stages.
Can someone give me an example of an optimisation that really can't be
handled by the user
constructing his topology differently?
Having reusable Processor API components accessible by the DSL and
composable as
one likes is exactly where DSL should max out and KSQL should do the next
step.
I find it very unprofessional, from a software engineering standpoint, to run
software where
you cannot at least sensibly reason about the inner workings of the
libraries used.
Given this is something people have to read and understand anyway, why try to hide
it?

It really misses the beauty of the 0.10 version DSL.
Apparently not a thing I can influence, but just warn about.

@gouzhang
you can't imagine how many extra IQ-Statestores I constantly prune from
stream app's
because people just keep passing Materialized's into all the operations.
:D :'-(
I regret that I couldn't convince you guys back then. Plus this whole
entire topology as a floating
interface chain, never seen it anywhere :-/ :'(

I don't know. I 

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-08-15 Thread Jan Filipiak
, Aug 14, 2018 at 5:22 PM, Guozhang Wang  wrote:


Hello Adam,

As for your question regarding GraphNodes, it is for extending Streams
optimization framework. You can find more details on
https://issues.apache.org/jira/browse/KAFKA-6761.

The main idea is that instead of directly building up the "physical
topology" (represented as Topology in the public package, and internally
built as the ProcessorTopology class) while users are specifying the
transformation operators, we first keep it as a "logical topology"
(represented as GraphNode inside InternalStreamsBuilder). And then only
execute the optimization and the construction of the "physical" Topology
when StreamsBuilder.build() is called.

Back to your question, I think it makes more sense to add a new type of
StreamsGraphNode (maybe you can consider inheriting from the
BaseJoinProcessorNode). Note that although in the Topology we will have
multiple connected ProcessorNodes to represent a (foreign-key) join, we
still want to keep it as a single StreamsGraphNode, or just a couple of
them in the logical representation so that in the future we can construct
the physical topology differently (e.g. having another way than the current
distributed hash-join).

---

Back to your questions to KIP-213, I think Jan has summarized it
pretty well. Note that back then we did not have headers support, so we had
to use such a "key-widening" approach to ensure ordering.


Guozhang



On Mon, Aug 13, 2018 at 11:39 PM, Jan Filipiak
wrote:


Hi Adam,

I love how you are on to this already! I resolve this by "key-widening" I
treat the result of FKA,and FKB differently.
As you can see the output of my join has a Combined Key and therefore I
can resolve the "race condition" in a group by
if I so desire.

I think this reflects more what happens under the hood and makes it more
clear to the user what is going on. The Idea
of hiding this behind metadata and handle it in the DSL is from my POV
unideal.

To write into your example:

key + A, null)
(key +B, )

is what my output would look like.


Hope that makes sense :D

Best Jan





On 13.08.2018 18:16, Adam Bellemare wrote:


Hi Jan

If you do not use headers or other metadata, how do you ensure that
changes
to the foreign-key value are not resolved out-of-order?
ie: If an event has FK = A, but you change it to FK = B, you need to
propagate both a delete (FK=A -> null) and an addition (FK=B). In my
solution, without maintaining any metadata, it is possible for the final
output to be in either order - the correctly updated joined value, or the
null for the delete.

(key, null)
(key, )

or

(key, )
(key, null)

I looked back through your code and through the discussion threads, and
didn't see any information on how you resolved this. I have a version of my
code working for 2.0, I am just adding more integration tests and will
update the KIP accordingly. Any insight you could provide on resolving
out-of-order semantics without metadata would be helpful.

Thanks
Adam


On Mon, Aug 13, 2018 at 3:34 AM, Jan Filipiak wrote:

Happy to see that you want to make an effort here.

Regarding the ProcessSuppliers I couldn't find a way to not rewrite the
joiners + the merger.
The re-partitioners can be reused in theory. I don't know if repartition
is optimized in 2.0 now.

I made this
https://cwiki.apache.org/confluence/display/KAFKA/KIP-241+KTable+repartition+with+compacted+Topics
back then and we are running KIP-213 with KIP-241 in combination.

For us it is vital as it minimized the size we had in our repartition
topics plus it removed the factor of 2 in events on every message.
I know about this new  "delete once consumer has read it".  I don't think
241 is vital for all usecases, for ours it is. I wanted
to use 213 to sneak in the foundations for 241 aswell.

I don't quite understand what a PropagationWrapper is, but I am certain
that you do not need RecordHeaders
for 213 and I would try to leave them out. They either belong to the DSL
or to the user, having a mixed use is
to be avoided. We run the join with 0.8 logformat and I don't think one
needs more.

This KIP will be very valuable for the streams project! I couldn't never
convince myself to invest into the 1.0+ DSL
as I used almost all my energy to fight against it. Maybe this can also
help me see the good sides a little bit more.

If there is anything unclear with all the text that has been written,
feel
free to just directly cc me so I don't miss it on
the mailing list.

Best Jan





On 08.08.2018 15:26, Adam Bellemare wrote:

More followup, and +dev as Guozhang replied to me directly previously.

I am currently porting the code over to trunk. One of the major changes
since 1.0 is the usage of GraphNodes. I have a question about this:

For a foreignKey joiner, should it have its own dedicated node type? Or
would it be advisable to construct it from existing

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-08-14 Thread Jan Filipiak

Hi Adam,

I love how you are on to this already! I resolve this by "key-widening":
I treat the results of FKA and FKB differently.
As you can see, the output of my join has a CombinedKey and therefore I
can resolve the "race condition" in a group-by,

if I so desire.

I think this reflects more what happens under the hood and makes it more 
clear to the user what is going on. The idea
of hiding this behind metadata and handling it in the DSL is, from my POV,
not ideal.


To write into your example:

(key + A, null)
(key + B, )

is what my output would look like.


Hope that makes sense :D

Best Jan
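
To make the "key-widening" idea concrete, the sketch below shows a hypothetical
CombinedKey carrying both keys, so downstream group-by logic can tell the FK=A
and FK=B results apart. The class and field names are illustrative, not the
KIP's actual types.

// Hypothetical "widened" key combining the foreign key and the primary key,
// so the FK=A delete and the FK=B update stay distinguishable downstream.
import java.util.Objects;

public final class CombinedKey<KF, KP> {

    private final KF foreignKey;  // e.g. the old FK "A" or the new FK "B"
    private final KP primaryKey;  // the original record key

    public CombinedKey(KF foreignKey, KP primaryKey) {
        this.foreignKey = foreignKey;
        this.primaryKey = primaryKey;
    }

    public KF foreignKey() { return foreignKey; }
    public KP primaryKey() { return primaryKey; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CombinedKey)) return false;
        CombinedKey<?, ?> other = (CombinedKey<?, ?>) o;
        return Objects.equals(foreignKey, other.foreignKey)
                && Objects.equals(primaryKey, other.primaryKey);
    }

    @Override
    public int hashCode() {
        return Objects.hash(foreignKey, primaryKey);
    }

    @Override
    public String toString() {
        // Matches the shape of the example output above: (key + A, ...), (key + B, ...)
        return "(" + primaryKey + " + " + foreignKey + ")";
    }
}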




On 13.08.2018 18:16, Adam Bellemare wrote:

Hi Jan

If you do not use headers or other metadata, how do you ensure that changes
to the foreign-key value are not resolved out-of-order?
ie: If an event has FK = A, but you change it to FK = B, you need to
propagate both a delete (FK=A -> null) and an addition (FK=B). In my
solution, without maintaining any metadata, it is possible for the final
output to be in either order - the correctly updated joined value, or the
null for the delete.

(key, null)
(key, )

or

(key, )
(key, null)

I looked back through your code and through the discussion threads, and
didn't see any information on how you resolved this. I have a version of my
code working for 2.0, I am just adding more integration tests and will
update the KIP accordingly. Any insight you could provide on resolving
out-of-order semantics without metadata would be helpful.

Thanks
Adam


On Mon, Aug 13, 2018 at 3:34 AM, Jan Filipiak 
wrote:


Hi,

Happy to see that you want to make an effort here.

Regarding the ProcessSuppliers I couldn't find a way to not rewrite the
joiners + the merger.
The re-partitioners can be reused in theory. I don't know if repartition
is optimized in 2.0 now.

I made this
https://cwiki.apache.org/confluence/display/KAFKA/KIP-241+
KTable+repartition+with+compacted+Topics
back then and we are running KIP-213 with KIP-241 in combination.

For us it is vital as it minimized the size we had in our repartition
topics plus it removed the factor of 2 in events on every message.
I know about this new  "delete once consumer has read it".  I don't think
241 is vital for all usecases, for ours it is. I wanted
to use 213 to sneak in the foundations for 241 aswell.

I don't quite understand what a PropagationWrapper is, but I am certain
that you do not need RecordHeaders
for 213 and I would try to leave them out. They either belong to the DSL
or to the user, having a mixed use is
to be avoided. We run the join with 0.8 logformat and I don't think one
needs more.

This KIP will be very valuable for the streams project! I couldn't never
convince myself to invest into the 1.0+ DSL
as I used almost all my energy to fight against it. Maybe this can also
help me see the good sides a little bit more.

If there is anything unclear with all the text that has been written, feel
free to just directly cc me so I don't miss it on
the mailing list.

Best Jan





On 08.08.2018 15:26, Adam Bellemare wrote:


More followup, and +dev as Guozhang replied to me directly previously.

I am currently porting the code over to trunk. One of the major changes
since 1.0 is the usage of GraphNodes. I have a question about this:

For a foreignKey joiner, should it have its own dedicated node type? Or
would it be advisable to construct it from existing GraphNode components?
For instance, I believe I could construct it from several
OptimizableRepartitionNode, some SinkNode, some SourceNode, and several
StatefulProcessorNode. That being said, there is some underlying
complexity
to each approach.

I will be switching the KIP-213 to use the RecordHeaders in Kafka Streams
instead of the PropagationWrapper, but conceptually it should be the same.

Again, any feedback is welcomed...


On Mon, Jul 30, 2018 at 9:38 AM, Adam Bellemare wrote:

I was just reading the 2.0 release notes and noticed a section on Record
Headers.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-
244%3A+Add+Record+Header+support+to+Kafka+Streams+Processor+API

I am not yet sure if the contents of a RecordHeader is propagated all the
way through the Sinks and Sources, but if it is, and if it remains
attached
to the record (including null records) I may be able to ditch the
propagationWrapper for an implementation using RecordHeader. I am not yet
sure if this is doable, so if anyone understands RecordHeader impl better
than I, I would be happy to hear from you.

In the meantime, let me know of any questions. I believe this PR has a
lot
of potential to solve problems for other people, as I have encountered a
number of other companies in the wild all home-brewing their own
solutions
to come up with a method of handling relational data in streams.

Adam


On Fri, Jul 27, 2018 at 1:45 AM, Guozhang Wang 
wrote:

Hello Adam,

Thanks for rebooting the discussion of this KIP ! Let me finish my pass
on the wiki and get back to you soon. Sorry f

Re: [VOTE] KIP-349 Priorities for Source Topics

2018-08-13 Thread Jan Filipiak

Sorry for missing the discussion

-1 nonbinding

see

https://samza.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/system/chooser/MessageChooser.html

Best Jan


On 14.08.2018 03:19, n...@afshartous.com wrote:

Hi All,

Calling for a vote on KIP-349

https://cwiki.apache.org/confluence/display/KAFKA/KIP-349%3A+Priorities+for+Source+Topics

--
   Nick







Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

2018-08-13 Thread Jan Filipiak

Hi,

Happy to see that you want to make an effort here.

Regarding the ProcessSuppliers I couldn't find a way to not rewrite the 
joiners + the merger.
The re-partitioners can be reused in theory. I don't know if repartition 
is optimized in 2.0 now.


I made this
https://cwiki.apache.org/confluence/display/KAFKA/KIP-241+KTable+repartition+with+compacted+Topics
back then and we are running KIP-213 with KIP-241 in combination.

For us it is vital as it minimized the size we had in our repartition 
topics plus it removed the factor of 2 in events on every message.
I know about this new  "delete once consumer has read it".  I don't 
think 241 is vital for all use cases, for ours it is. I wanted

to use 213 to sneak in the foundations for 241 as well.

I don't quite understand what a PropagationWrapper is, but I am certain 
that you do not need RecordHeaders
for 213 and I would try to leave them out. They either belong to the DSL 
or to the user, having a mixed use is
to be avoided. We run the join with 0.8 logformat and I don't think one 
needs more.


This KIP will be very valuable for the streams project! I could never 
convince myself to invest into the 1.0+ DSL
as I used almost all my energy to fight against it. Maybe this can also 
help me see the good sides a little bit more.


If there is anything unclear with all the text that has been written, 
feel free to just directly cc me so I don't miss it on

the mailing list.

Best Jan




On 08.08.2018 15:26, Adam Bellemare wrote:

More followup, and +dev as Guozhang replied to me directly previously.

I am currently porting the code over to trunk. One of the major changes
since 1.0 is the usage of GraphNodes. I have a question about this:

For a foreignKey joiner, should it have its own dedicated node type? Or
would it be advisable to construct it from existing GraphNode components?
For instance, I believe I could construct it from several
OptimizableRepartitionNode, some SinkNode, some SourceNode, and several
StatefulProcessorNode. That being said, there is some underlying complexity
to each approach.

I will be switching the KIP-213 to use the RecordHeaders in Kafka Streams
instead of the PropagationWrapper, but conceptually it should be the same.

Again, any feedback is welcomed...


On Mon, Jul 30, 2018 at 9:38 AM, Adam Bellemare 
wrote:


Hi Guozhang et al

I was just reading the 2.0 release notes and noticed a section on Record
Headers.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-
244%3A+Add+Record+Header+support+to+Kafka+Streams+Processor+API

I am not yet sure if the contents of a RecordHeader is propagated all the
way through the Sinks and Sources, but if it is, and if it remains attached
to the record (including null records) I may be able to ditch the
propagationWrapper for an implementation using RecordHeader. I am not yet
sure if this is doable, so if anyone understands RecordHeader impl better
than I, I would be happy to hear from you.

In the meantime, let me know of any questions. I believe this PR has a lot
of potential to solve problems for other people, as I have encountered a
number of other companies in the wild all home-brewing their own solutions
to come up with a method of handling relational data in streams.

Adam
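
For reference on the KIP-244 headers question above: in the Processor API a
processor can attach metadata via the Headers exposed on the ProcessorContext.
Below is a minimal, hypothetical sketch; the "foreignKeyOffset" key simply
mirrors the example name used in this thread.

// Minimal sketch of attaching per-record metadata via KIP-244 record headers.
import java.nio.ByteBuffer;
import org.apache.kafka.streams.processor.AbstractProcessor;

public class HeaderTaggingProcessor<K, V> extends AbstractProcessor<K, V> {

    @Override
    public void process(final K key, final V value) {
        // Record the offset of the record being processed as a header on the forwarded record.
        byte[] offsetBytes = ByteBuffer.allocate(Long.BYTES).putLong(context().offset()).array();
        context().headers().add("foreignKeyOffset", offsetBytes);

        // Whether such headers survive through repartition (sink/source) topics is exactly
        // the open question raised above; this sketch only shows how they are set.
        context().forward(key, value);
    }
}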


On Fri, Jul 27, 2018 at 1:45 AM, Guozhang Wang  wrote:


Hello Adam,

Thanks for rebooting the discussion of this KIP ! Let me finish my pass
on the wiki and get back to you soon. Sorry for the delays..

Guozhang

On Tue, Jul 24, 2018 at 6:08 AM, Adam Bellemare 
wrote:
Let me kick this off with a few starting points that I would like to
generate some discussion on.

1) It seems to me that I will need to repartition the data twice - once
on
the foreign key, and once back to the primary key. Is there anything I am
missing here?

2) I believe I will also need to materialize 3 state stores: the
prefixScan
SS, the highwater mark SS (for out-of-order resolution) and the final
state
store, due to the workflow I have laid out. I have not thought of a
better
way yet, but would appreciate any input on this matter. I have gone back
through the mailing list for the previous discussions on this KIP, and I
did not see anything relating to resolving out-of-order compute. I cannot
see a way around the current three-SS structure that I have.

3) Caching is disabled on the prefixScan SS, as I do not know how to
resolve the iterator obtained from rocksDB with that of the cache. In
addition, I must ensure everything is flushed before scanning. Since the
materialized prefixScan SS is under "control" of the function, I do not
anticipate this to be a problem. Performance throughput will need to be
tested, but as Jan observed in his initial overview of this issue, it is
generally a surge of output events which affect performance moreso than
the
flush or prefixScan itself.

Thoughts on any of these are greatly appreciated, since these elements
are
really the cornerstone of the whole design. I can put up the code I have
written 

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-04-05 Thread Jan Filipiak

Yes I can be there.

On 04.04.2018 22:34, Jun Rao wrote:

Hi, Jan, Dong, John, Guozhang,

Perhaps it will be useful to have a KIP meeting to discuss this together as
a group. Would Apr. 9 (Monday) at 9:00am PDT work? If so, I will send out
an invite to the mailing list.

Thanks,

Jun


On Wed, Apr 4, 2018 at 1:25 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Want to quickly step in here again because it is going places again.

The last part of the discussion is just a pain to read and completely
diverged from what I suggested without making the reasons clear to me.

I don't know why this happens; here are my comments anyway.

@Guozhang: That Streams is working on automatically creating
copartition-usable topics: great for Streams, but it has literally nothing to do
with the KIP, as we want to grow the
input topic. Everyone can reshuffle relatively easily, but that is not what we
need to do; we need to grow the topic in question. After Streams
automatically reshuffled, the input topic
still has the same size and it didn't help a bit. I fail to see why this
is relevant. What am I missing here?

@Dong
I am still of the position that the current proposal takes us in the
wrong direction, especially introducing PartitionKeyRebalanceListener.
From this point we can never move away to proper stateful handling
without completely deprecating this creature from hell again.
Linear hashing is not the optimising step we have to do here. An interface
where a topic stays the same topic, even after it has
grown or shrunk, is important. So from my POV I have major concerns about whether
this KIP is beneficial in its current state.

What is it that makes everyone so addicted to the idea of linear hashing?
It is not attractive at all for me.
And with stateful consumers it is still a complete mess. Why not stick with the
Kappa architecture???





On 03.04.2018 17:38, Dong Lin wrote:


Hey John,

Thanks much for your comments!!

I have yet to go through the emails of John/Jun/Guozhang in detail. But
let
me present my idea for how to minimize the delay for state loading for
stream use-case.

For ease of understanding, let's assume that the initial partition number
of input topics and the change log topic are both 10, and the initial number of
stream processors is also 10. If we only increase the partition number
of input topics to 15 without changing the number of stream processors, the
current KIP already guarantees in-order delivery and no state needs to be
moved between consumers for the stream use-case. Next, let's say we want to
increase the number of processors to expand the processing capacity for
the stream use-case. This requires us to move state between processors, which
will take time. Our goal is to minimize the impact (i.e. delay) for
processing while we increase the number of processors.

Note that a stream processor generally includes both a consumer and a producer.
In addition to consuming from the input topic, the consumer may also need to
consume from the change log topic on startup for recovery. And the producer may
produce state to the change log topic.



The solution will include the following steps:

1) Increase partition number of the input topic from 10 to 15. Since the
messages with the same key will still go to the same consumer before and
after the partition expansion, this step can be done without having to
move
state between processors.

2) Increase partition number of the change log topic from 10 to 15. Note
that this step can also be done without impacting existing workflow. After
we increase partition number of the change log topic, key space may split
and some keys will be produced to the newly-added partitions. But the same
key will still go to the same processor (i.e. consumer) before and after
the partition expansion. Thus this step can also be done without having to move
state
between processors.

3) Now, let's add 5 new consumers whose groupId is different from the
existing processor's groupId. Thus these new consumers will not impact
existing workflow. Each of these new consumers should consume two
partitions from the earliest offset, where these two partitions are the
same partitions that will be consumed if the consumers have the same
groupId as the existing processor's groupId. For example, the first of the
five consumers will consume partition 0 and partition 10. The purpose of
these consumers is to rebuild the state (e.g. RocksDB) for the processors
in advance. Also note that, by design of the current KIP, each consume
will
consume the existing partition of the change log topic up to the offset
before the partition expansion. Then they will only need to consume the
state of the new partition of the change log topic.

4) After consumers have caught up in step 3), we should stop these
consumers and add 5 new processors to the stream processing job. These 5
new processors should run in the same location as the previous 5 consumers
to re-use the state (e.g. RocksDB). And these processors' consumers should
consume partitions of the change log topi
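
As a rough illustration of what the warm-up consumers described in step 3) could
look like, here is a hypothetical sketch that manually assigns the two change log
partitions and replays them from the beginning. The topic name, group id, broker
address and the state-rebuilding callback are all assumptions for the example.

// Hypothetical warm-up consumer for step 3: a separate group id, manual assignment
// of the two changelog partitions, replayed from the earliest offset to pre-build state.
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ChangelogWarmupConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("group.id", "warmup-group-1");             // different from the live job's groupId
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // e.g. the first warm-up consumer owns changelog partitions 0 and 10.
        List<TopicPartition> assigned = Arrays.asList(
                new TopicPartition("my-changelog", 0),        // assumed topic name
                new TopicPartition("my-changelog", 10));

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(assigned);
            consumer.seekToBeginning(assigned);

            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    // Apply the changelog record to the local store (e.g. RocksDB) being pre-built.
                    applyToLocalStore(record);
                }
                // A real implementation would stop once it reaches the end offsets and then
                // hand the warmed-up state over to the newly added processor (step 4).
            }
        }
    }

    private static void applyToLocalStore(ConsumerRecord<byte[], byte[]> record) {
        // Placeholder: restore state from the changelog record.
    }
}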

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-04-04 Thread Jan Filipiak
milarly, the approach you proposed does not seem to
ensure that the messages can be delivered in order, even if we can make
sure that each consumer instance is assigned the set of new partitions
covering the same set of keys.


Let me correct this comment. The approach of copying data to a new topic
can ensure in-order message delivery suppose we properly migrate offsets
from old topic to new topic.



- I am trying to understand why it is better to copy the data instead of
copying the change log topic for streaming use-case. For core Kafka
use-case, and for the stream use-case that does not need to increase
consumers, the current KIP already supports in-order delivery without the
overhead of copying the data. For stream use-case that needs to increase
consumer number, the existing consumer can backfill the existing data in
the change log topic to the same change log topic with the new partition
number, before the new set of consumers bootstrap state from the new
partitions of the change log topic. If this solution works, then could you
summarize the advantage of copying the data of input topic as compared to
copying the change log topic? For example, does it enable more use-case,
simplify the implementation of Kafka library, or reduce the operation
overhead etc?

Thanks,
Dong


On Wed, Mar 21, 2018 at 6:57 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Jun,

I was really seeing progress in our conversation but your latest reply is
just devastating.
I though we were getting close being on the same page now it feels like
we are in different libraries.

I just quickly slam my answers in here. If they are to brief I am sorry
give me a ping and try to go into details more.
Just want to show that your pro/cons listing is broken.

Best Jan

and want to get rid of this horrible compromise


On 19.03.2018 05:48, Jun Rao wrote:


Hi, Jan,

Thanks for the discussion. Great points.

Let me try to summarize the approach that you are proposing. On the broker
side, we reshuffle the existing data in a topic from current partitions to
the new partitions. Once the reshuffle fully catches up, switch the
consumers to start consuming from the new partitions. If a consumer needs
to rebuild its local state (due to partition changes), let the consumer
rebuild its state by reading all existing data from the new partitions.
Once all consumers have switches over, cut over the producer to the new
partitions.

The pros for this approach are that :
1. There is just one way to rebuild the local state, which is simpler.

true thanks


The cons for this approach are:
1. Need to copy existing data.


Very unfair and not correct. It does not require you to copy over
existing data. It _allows_ you to copy all existing data.

2. The cutover of the producer is a bit complicated since it needs to
coordinate with all consumer groups.


Also not true. I explicitly tried to make clear that there is only one
special consumer (in the case of actually copying data) coordination is
required.


3. The rebuilding of the state in the consumer is from the input topic,
which can be more expensive than rebuilding from the existing state.

true, but rebuilding state is only required if you want to increase
processing power, so we assume this is at hand.


4. The broker potentially has to know the partitioning function. If this
needs to be customized at the topic level, it can be a bit messy.

I would argue against having the operation being performed by the broker.
This was not discussed yet but if you see my original email i suggested
otherwise from the beginning.


Here is an alternative approach by applying your idea not in the broker,
but in the consumer. When new partitions are added, we don't move existing
data. In KStreams, we first reshuffle the new input data to a new topic T1
with the old number of partitions and feed T1's data to the rest of the
pipeline. In the meantime, KStreams reshuffles all existing data of the
change capture topic to another topic C1 with the new number of partitions.
We can then build the state of the new tasks from C1. Once the new states
have been fully built, we can cut over the consumption to the input topic
and delete T1. This approach works with compacted topic too. If an
application reads from the beginning of a compacted topic, the consumer
will reshuffle the portion of the input when the number of partitions
doesn't match the number of tasks.


We all wipe this idea from our heads instantly. Mixing Ideas from an
argument is not a resolution strategy
just leads to horrible horrible software.



The pros of this approach are:
1. No need to copy existing data.
2. Each consumer group can cut over to the new partitions independently.
3. The state is rebuilt from the change capture topic, which is cheaper
than rebuilding from the input topic

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-29 Thread Jan Filipiak
ed topics, regardless of
whether that topic is an "input" or a "changelog" or anything else for that
matter.


Point 6:
On the concern about the performance overhead of copying data between the
brokers, I think it's actually a bit overestimated. Splitting a topic's
partition is probably rare, certainly rarer in general than bootstrapping
new consumers on that topic. If "bootstrapping new consumers" means that
they have to re-shuffle the data before they consume it, then you wind up
copying the same record multiple times:

(broker: input topic) -> (initial consumer) -> (broker: repartition topic)
-> (real consumer)

That's 3x, and it's also 3x for every new record after the split as well,
since you don't get to stop repartitioning/reshuffling once you start.

Whereas if you do a backfill in something like the procedure I outlined,
you only copy the prefix of the partition before the split, and you send it
once to the producer and then once to the new generation partition. Plus,
assuming we're splitting the partition for the benefit of consumers,
there's no reason we can't co-locate the post-split partitions on the same
host as the pre-split partition, making the second copy a local filesystem
operation.

Even if you follow these two copies up with bootstrapping a new consumer,
it's still rare for this to occur, so you get to amortize these copies over
the lifetime of the topic, whereas a reshuffle just keeps making copies for
every new event.

And finally, I really do think that regardless of any performance concerns
about this operation, if it preserves loose organizational coupling, it is
certainly worth it.


In conclusion:
It might actually be a good idea for us to clarify the scope of KIP-253. If
we're all agreed that it's a good algorithm for allowing in-order message
delivery during partition expansion, then we can continue this discussion
as a new KIP, something like "backfill with partition expansion". This
would let Dong proceed with KIP-253. On the other hand, if it seems like
this conversation may alter the design of KIP-253, then maybe we *should*
just finish working it out.

For my part, my only concern about KIP-253 is the one I raised earlier.

Thanks again, all, for considering these points,
-John


On Tue, Mar 27, 2018 at 2:10 AM, Dong Lin <lindon...@gmail.com> wrote:

On Tue, Mar 27, 2018 at 12:04 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jan,

Thanks for the enthusiasm in improving Kafka's design. Now that I have
read through your discussion with Jun, here are my thoughts:

- The latest proposal should with log compacted topics by properly
deleting old messages after a new message with the same key is produced. So
it is probably not a concern anymore. Could you comment if there is still
issue?

- I wrote the SEP-5 and I am pretty familiar with the motivation and the
design of SEP-5. SEP-5 is probably orthornal to the motivation of this KIP.
The goal of SEP-5 is to allow user to increase task number of an existing
Samza job. But if we increase the partition number of input topics,
messages may still be consumed out-of-order by tasks in Samza which cause
incorrect result. Similarly, the approach you proposed does not seem to
ensure that the messages can be delivered in order, even if we can make
sure that each consumer instance is assigned the set of new partitions
covering the same set of keys.


Let me correct this comment. The approach of copying data to a new topic
can ensure in-order message delivery suppose we properly migrate offsets
from old topic to new topic.



- I am trying to understand why it is better to copy the data instead of
copying the change log topic for streaming use-case. For core Kafka
use-case, and for the stream use-case that does not need to increase
consumers, the current KIP already supports in-order delivery without the
overhead of copying the data. For stream use-case that needs to increase
consumer number, the existing consumer can backfill the existing data in
the change log topic to the same change log topic with the new partition
number, before the new set of consumers bootstrap state from the new
partitions of the change log topic. If this solution works, then could you
summarize the advantage of copying the data of input topic as compared to
copying the change log topic? For example, does it enable more use-case,
simplify the implementation of Kafka library, or reduce the operation
overhead etc?

Thanks,
Dong


On Wed, Mar 21, 2018 at 6:57 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:

Hi Jun,

I was really seeing progress in our conversation but your latest reply is
just devastating.
I though we were getting close being on the same page now it feels like
we are in different libraries.

I just quickly slam my answers in here. If they are to brief I am sorry
giv

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-27 Thread Jan Filipiak
5. SEP-5 is probably orthogonal to the motivation of this KIP.

The goal of SEP-5 is to allow user to increase task number of an existing
Samza job. But if we increase the partition number of input topics,
messages may still be consumed out-of-order by tasks in Samza which cause
incorrect result. Similarly, the approach you proposed does not seem to
ensure that the messages can be delivered in order, even if we can make
sure that each consumer instance is assigned the set of new partitions
covering the same set of keys.


Let me correct this comment. The approach of copying data to a new topic
can ensure in-order message delivery, provided we properly migrate offsets
from old topic to new topic.



- I am trying to understand why it is better to copy the data instead of
copying the change log topic for streaming use-case. For core Kafka
use-case, and for the stream use-case that does not need to increase
consumers, the current KIP already supports in-order delivery without the
overhead of copying the data. For stream use-case that needs to increase
consumer number, the existing consumer can backfill the existing data in
the change log topic to the same change log topic with the new partition
number, before the new set of consumers bootstrap state from the new
partitions of the change log topic. If this solution works, then could

you

summarize the advantage of copying the data of input topic as compared to
copying the change log topic? For example, does it enable more use-case,
simplify the implementation of Kafka library, or reduce the operation
overhead etc?

Thanks,
Dong


On Wed, Mar 21, 2018 at 6:57 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Jun,

I was really seeing progress in our conversation but your latest reply is
just devastating.
I thought we were getting close to being on the same page; now it feels like
we are in different libraries.

I just quickly slam my answers in here. If they are too brief I am sorry,
give me a ping and try to go into details more.
Just want to show that your pro/cons listing is broken.

Best Jan

and want to get rid of this horrible compromise


On 19.03.2018 05:48, Jun Rao wrote:


Hi, Jan,

Thanks for the discussion. Great points.

Let me try to summarize the approach that you are proposing. On the
broker
side, we reshuffle the existing data in a topic from current partitions
to
the new partitions. Once the reshuffle fully catches up, switch the
consumers to start consuming from the new partitions. If a consumer

needs

to rebuild its local state (due to partition changes), let the consumer
rebuild its state by reading all existing data from the new partitions.
Once all consumers have switched over, cut over the producer to the new
partitions.

The pros for this approach are that :
1. There is just one way to rebuild the local state, which is simpler.


true thanks


The cons for this approach are:
1. Need to copy existing data.


Very unfair and not correct. It does not require you to copy over
existing data. It _allows_ you to copy all existing data.

2. The cutover of the producer is a bit complicated since it needs to

coordinate with all consumer groups.


Also not true. I explicitly tried to make clear that there is only one
special consumer (in the case of actually copying data) coordination is
required.


3. The rebuilding of the state in the consumer is from the input topic,
which can be more expensive than rebuilding from the existing state.


true, but rebuilding state is only required if you want to increase
processing power, so we assume this is at hand.


4. The broker potentially has to know the partitioning function. If this
needs to be customized at the topic level, it can be a bit messy.


I would argue against having the operation performed by the broker.
This was not discussed yet, but if you see my original email I suggested
otherwise from the beginning.


Here is an alternative approach by applying your idea not in the broker,
but in the consumer. When new partitions are added, we don't move existing
data. In KStreams, we first reshuffle the new input data to a new topic T1
with the old number of partitions and feed T1's data to the rest of the
pipeline. In the meantime, KStreams reshuffles all existing data of the
change capture topic to another topic C1 with the new number of partitions.
We can then build the state of the new tasks from C1. Once the new states
have been fully built, we can cut over the consumption to the input topic
and delete T1. This approach works with compacted topic too. If an
application reads from the beginning of a compacted topic, the consumer
will reshuffle the portion of the input when the number of partitions
doesn't match the number of tasks.


We all wipe this idea from our heads instantly. Mixing ideas from an
argument is not a resolution strategy;
it just leads to horrible, horrible software.



The pros of this approach are:
1. No need to copy existing data.
2. Each consumer gr

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-23 Thread Jan Filipiak
21, 2018 at 6:57 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Jun,

I was really seeing progress in our conversation, but your latest reply is
just devastating.
I thought we were getting close to being on the same page; now it feels like we
are in different libraries.

I just quickly slam my answers in here. If they are too brief, I am sorry;
give me a ping and I will try to go into more detail.
Just want to show that your pro/cons listing is broken.

Best Jan

and want to get rid of this horrible compromise


On 19.03.2018 05:48, Jun Rao wrote:


Hi, Jan,

Thanks for the discussion. Great points.

Let me try to summarize the approach that you are proposing. On the broker
side, we reshuffle the existing data in a topic from current partitions to
the new partitions. Once the reshuffle fully catches up, switch the
consumers to start consuming from the new partitions. If a consumer needs
to rebuild its local state (due to partition changes), let the consumer
rebuild its state by reading all existing data from the new partitions.
Once all consumers have switched over, cut over the producer to the new
partitions.

The pros for this approach are that :
1. There is just one way to rebuild the local state, which is simpler.


true thanks


The cons for this approach are:
1. Need to copy existing data.


Very unfair and not correct. It does not require you to copy over existing
data. It _allows_ you to copy all existing data.

2. The cutover of the producer is a bit complicated since it needs to

coordinate with all consumer groups.


Also not true. I explicitly tried to make clear that there is only one
special consumer; coordination is only required with that one (in the case of
actually copying data).


3. The rebuilding of the state in the consumer is from the input topic,
which can be more expensive than rebuilding from the existing state.


true, but rebuilding state is only required if you want to increase
processing power, so we assume this is at hand.


4. The broker potentially has to know the partitioning function. If this
needs to be customized at the topic level, it can be a bit messy.


I would argue against having the operation performed by the broker.
This was not discussed yet, but if you see my original email I suggested
otherwise from the beginning.


Here is an alternative approach by applying your idea not in the broker,
but in the consumer. When new partitions are added, we don't move existing
data. In KStreams, we first reshuffle the new input data to a new topic T1
with the old number of partitions and feed T1's data to the rest of the
pipeline. In the meantime, KStreams reshuffles all existing data of the
change capture topic to another topic C1 with the new number of
partitions.
We can then build the state of the new tasks from C1. Once the new states
have been fully built, we can cut over the consumption to the input topic
and delete T1. This approach works with compacted topic too. If an
application reads from the beginning of a compacted topic, the consumer
will reshuffle the portion of the input when the number of partitions
doesn't match the number of tasks.


We all wipe this idea from our heads instantly. Mixing ideas from an
argument is not a resolution strategy;
it just leads to horrible, horrible software.



The pros of this approach are:
1. No need to copy existing data.
2. Each consumer group can cut over to the new partitions independently.
3. The state is rebuilt from the change capture topic, which is cheaper
than rebuilding from the input topic.
4. Only the KStreams job needs to know the partitioning function.

The cons of this approach are:
1. Potentially the same input topic needs to be reshuffled more than once
in different consumer groups during the transition phase.

What do you think?

Thanks,

Jun



On Thu, Mar 15, 2018 at 1:04 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:

Hi Jun,

thank you for following me on these thoughts. It was important to me to
feel that kind of understanding for my arguments.

What I was hoping for (I mentioned this earlier) is that we can model the
case where we do not want to copy the data the exact same way as the case
when we do copy the data. Maybe you can peek into the mails before to see
more details for this.

This means we have the same mechanism to transfer consumer groups to
switch topics. The offset mapping that would be generated would even be
simpler: end offset of the old topic => offset 0 of all the partitions of
the new topic. Then we could model the transition of a non-copy expansion
the exact same way as a copy-expansion.

I know this only works when the topic grows by a factor. But the benefits of
only growing by a factor are too strong anyway. See Clemens's hint and
remember that state reshuffling is entirely not needed if one doesn't want
to grow processing power.

I think these benefits should be clear, and that there is basically no
downside to what is currently at hand but just makes everything easy.

One thing you ne

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-21 Thread Jan Filipiak

Hi Jun,

I was really seeing progress in our conversation, but your latest reply
is just devastating.
I thought we were getting close to being on the same page; now it feels like
we are in different libraries.


I just quickly slam my answers in here. If they are too brief, I am sorry;
give me a ping and I will try to go into more detail.

Just want to show that your pro/cons listing is broken.

Best Jan

and want to get rid of this horrible compromise


On 19.03.2018 05:48, Jun Rao wrote:

Hi, Jan,

Thanks for the discussion. Great points.

Let me try to summarize the approach that you are proposing. On the broker
side, we reshuffle the existing data in a topic from current partitions to
the new partitions. Once the reshuffle fully catches up, switch the
consumers to start consuming from the new partitions. If a consumer needs
to rebuild its local state (due to partition changes), let the consumer
rebuild its state by reading all existing data from the new partitions.
Once all consumers have switched over, cut over the producer to the new
partitions.

The pros for this approach are that :
1. There is just one way to rebuild the local state, which is simpler.

true thanks


The cons for this approach are:
1. Need to copy existing data.
Very unfair and not correct. It does not require you to copy over 
existing data. It _allows_ you to copy all existing data.



2. The cutover of the producer is a bit complicated since it needs to
coordinate with all consumer groups.
Also not true. I explicitly tried to make clear that there is only one
special consumer; coordination is only required with that one (in the case of
actually copying data).

3. The rebuilding of the state in the consumer is from the input topic,
which can be more expensive than rebuilding from the existing state.
true, but rebuilding state is only required if you want to increase 
processing power, so we assume this is at hand.

4. The broker potentially has to know the partitioning function. If this
needs to be customized at the topic level, it can be a bit messy.
I would argue against having the operation performed by the
broker. This was not discussed yet, but if you see my original email I
suggested otherwise from the beginning.


Here is an alternative approach by applying your idea not in the broker,
but in the consumer. When new partitions are added, we don't move existing
data. In KStreams, we first reshuffle the new input data to a new topic T1
with the old number of partitions and feed T1's data to the rest of the
pipeline. In the meantime, KStreams reshuffles all existing data of the
change capture topic to another topic C1 with the new number of partitions.
We can then build the state of the new tasks from C1. Once the new states
have been fully built, we can cut over the consumption to the input topic
and delete T1. This approach works with compacted topic too. If an
application reads from the beginning of a compacted topic, the consumer
will reshuffle the portion of the input when the number of partitions
doesn't match the number of tasks.
We all wipe this idea from our heads instantly. Mixing ideas from an
argument is not a resolution strategy;

it just leads to horrible, horrible software.


The pros of this approach are:
1. No need to copy existing data.
2. Each consumer group can cut over to the new partitions independently.
3. The state is rebuilt from the change capture topic, which is cheaper
than rebuilding from the input topic.
4. Only the KStreams job needs to know the partitioning function.

The cons of this approach are:
1. Potentially the same input topic needs to be reshuffled more than once
in different consumer groups during the transition phase.

What do you think?

Thanks,

Jun



On Thu, Mar 15, 2018 at 1:04 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Jun,

thank you for following me on these thoughts. It was important to me to
feel that kind of understanding for my arguments.

What I was hoping for (I mentioned this earlier) is that we can model the
case where we do not want to copy the data the exact same way as the case
when we do copy the data. Maybe you can peek into the mails before to see
more details for this.

This means we have the same mechanism to transfer consumer groups to
switch topics. The offset mapping that would be generated would even be
simpler: end offset of the old topic => offset 0 of all the partitions of
the new topic. Then we could model the transition of a non-copy expansion
the exact same way as a copy-expansion.

I know this only works when the topic grows by a factor. But the benefits of
only growing by a factor are too strong anyway. See Clemens's hint and
remember that state reshuffling is entirely not needed if one doesn't want
to grow processing power.

I think these benefits should be clear, and that there is basically no
downside to what is currently at hand but just makes everything easy.

One thing you need to know is that if you do not offer rebuilding a log
compacted topic like 

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-15 Thread Jan Filipiak

Hi Jun,

thank you for following me on these thoughts. It was important to me to 
feel that kind of understanding for my arguments.


What I was hoping for (I mentioned this earlier) is that we can model 
the case where we do not want to copy the data the exact same way as the 
case when we do copy the data. Maybe you can peek into the mails before 
to see more details for this.


This means we have the same mechanism to transfer consumer groups to
switch topics. The offset mapping that would be generated would even be
simpler: end offset of the old topic => offset 0 of all the partitions
of the new topic. Then we could model the transition of a non-copy
expansion the exact same way as a copy-expansion.
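To illustrate the degenerate mapping just described, here is a minimal hedged sketch (hypothetical names, not a Kafka API): in the non-copy case the "mapping" collapses to starting every new partition at offset 0 once the old end offset has been reached.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical illustration of "end offset of the old topic => offset 0 of all new partitions".
    public class NonCopyExpansionOffsets {

        /** Offsets at which a consumer group resumes on the new partitions, once it has reached the old end offset. */
        static Map<Integer, Long> resumeOffsets(long committedOffset, long oldEndOffset, int newPartitionCount) {
            if (committedOffset < oldEndOffset) {
                throw new IllegalStateException("consumer must reach the old end offset before switching");
            }
            Map<Integer, Long> resume = new HashMap<>();
            for (int p = 0; p < newPartitionCount; p++) {
                resume.put(p, 0L); // nothing was copied, so every new partition is read from the beginning
            }
            return resume;
        }

        public static void main(String[] args) {
            System.out.println(resumeOffsets(42_000L, 42_000L, 12));
        }
    }

The copy-expansion case would use a mapping of the same shape, just with non-trivial offsets, which is the point of modelling both transitions the same way.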


I know this only works when the topic grows by a factor. But the benefits
of only growing by a factor are too strong anyway. See Clemens's hint
and remember that state reshuffling is entirely not needed if one
doesn't want to grow processing power.


I think these benefits should be clear, and that there is basically no 
downside to what is currently at hand but just makes everything easy.


One thing you need to know is that if you do not offer rebuilding a log
compacted topic like I suggest, then even if you have consumer state
reshuffling, the topic is broken and cannot be used to bootstrap new
consumers. They don't know if they need to apply a key from an old
partition or not. This is a horrible downside I haven't seen a solution
for in the email conversation.


I argue to:

Only grow the topic by a factor, always.
Have the "no copy consumer" transition as the trivial case of the "copy
consumer" transition.
If processors need to be scaled, let them rebuild from the new topic
and leave the old one running in the meantime.

Do not implement key shuffling in streams.

I hope I can convince you, especially with how I want to handle the
consumer transition. I think
you didn't quite understand me there before. I think the term "new
topic" intimidated you a little.
How we solve this on disk doesn't really matter: whether the data goes into
the same dir or a different dir or anything. I do think that it needs to
involve at least rolling a new segment for the existing partitions.
But most of the transitions should work without restarting consumers
(newer consumers with support for this). But with "new topic" I just meant
the topic that now has a different partition count. Plenty of ways to
handle that (versions, aliases)


Hope I can further get my idea across.

Best Jan





On 14.03.2018 02:45, Jun Rao wrote:

Hi, Jan,

Thanks for sharing your view.

I agree with you that recopying the data potentially makes the state
management easier since the consumer can just rebuild its state from
scratch (i.e., no need for state reshuffling).

On the flip side, I saw a few disadvantages of the approach that you
suggested. (1) Building the state from the input topic from scratch is in
general less efficient than state reshuffling. Let's say one computes a
count per key from an input topic. The former requires reading all existing
records in the input topic whereas the latter only requires reading data
proportional to the number of unique keys. (2) The switching of the topic
needs modification to the application. If there are many applications on a
topic, coordinating such an effort may not be easy. Also, it's not clear
how to enforce exactly-once semantic during the switch. (3) If a topic
doesn't need any state management, recopying the data seems wasteful. In
that case, in place partition expansion seems more desirable.

I understand your concern about adding complexity in KStreams. But, perhaps
we could iterate on that a bit more to see if it can be simplified.

Jun


On Mon, Mar 12, 2018 at 11:21 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Jun,

I will focus on point 61 as I think it's _the_ fundamental part that I can't
get across at the moment.

Kafka is the platform to have state materialized multiple times from one
input. I emphasize this: It is the building block in architectures that
allow you to
have your state maintained multiple times. You put a message in once, and
you have it pop out as often as you like. I believe you understand this.

Now! The path of thinking goes the following: I am using Apache Kafka and
I _want_ my state multiple times. What am I going to do?

A) Am I going to take my state that I build up, plunge some sort of RPC
layer on top of it, and use that RPC layer to throw my records across instances?
B) Am I just going to read the damn message twice?

Approach A is fundamentally flawed and a violation of all that is good and
holy in Kafka deployments. I cannot understand how this idea can come up in
the first place.
(I do understand: IQ in streams; they polluted the Kafka Streams codebase
really badly already. It is not funny! I think they are equally flawed as A)

I say, we do what Kafka is good at. We repartition the topic once. We
swit

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-13 Thread Jan Filipiak

Hi Jun,

I will focus on point 61 as I think it's _the_ fundamental part that I
can't get across at the moment.


Kafka is the platform to have state materialized multiple times from one 
input. I emphasize this: It is the building block in architectures that 
allow you to
have your state maintained multiple times. You put a message in once, 
and you have it pop out as often as you like. I believe you understand this.


Now! The path of thinking goes the following: I am using Apache Kafka
and I _want_ my state multiple times. What am I going to do?


A) Am I going to take my state that I build up, plunge some sort of RPC
layer on top of it, and use that RPC layer to throw my records across instances?

B) Am I just going to read the damn message twice?

Approach A is fundamentally flawed and a violation of all that is good
and holy in Kafka deployments. I cannot understand how this idea can
come up in the first place.
(I do understand: IQ in streams; they polluted the Kafka Streams
codebase really badly already. It is not funny! I think they are equally
flawed as A)


I say, we do what Kafka is good at. We repartition the topic once. We
switch the consumers.
(Those that need more partitions are going to rebuild their state across
multiple partitions by reading the new topic; those that don't just
assign the new partitions properly.)

We switch producers. Done!

The best thing! It is trivial; a hipster stream processor will have an
easy time with that as well. It's so super simple. And simple IS good!
It is what Kafka was built to do. It is how we do it today. All I am
saying is that a little broker help with the producer swap is super useful.


For everyone interested in why Kafka is so powerful with approach B,
please watch https://youtu.be/bEbeZPVo98c?t=1633
I already looked up a good point in time; I think after 5 minutes the
"state" topic is handled and you should be able to understand me

an inch better.

Please do not do A to the project, it deserves better!

Best Jan



On 13.03.2018 02:40, Jun Rao wrote:

Hi, Jan,

Thanks for the reply. A few more comments below.

50. Ok, we can think a bit harder for supporting compacted topics.

51. This is a fundamental design question. In the more common case, the
reason why someone wants to increase the number of partitions is that the
consumer application is slow and one wants to run more consumer instances
to increase the degree of parallelism. So, fixing the number of running
consumer instances when expanding the partitions won't help this case. If
we do need to increase the number of consumer instances, we need to somehow
reshuffle the state of the consumer across instances. What we have been
discussing in this KIP is whether we can do this more effectively through
the KStream library (e.g. through a 2-phase partition expansion). This will
add some complexity, but it's probably better than everyone doing this in
the application space. The recopying approach that you mentioned doesn't
seem to address the consumer state management issue when the consumer
switches from an old to a new topic.

52. As for your example, it depends on whether the join key is the same
between (A,B) and (B,C). If the join key is the same, we can do a 2-phase
partition expansion of A, B, and C together. If the join keys are
different, one would need to repartition the data on a different key for
the second join, then the partition expansion can be done independently
between (A,B) and (B,C).

53. If you always fix the number of consumer instances, what you described
works. However, as I mentioned in #51, I am not sure how your proposal
deals with consumer states when the number of consumer instances grows.
Also, it just seems that it's better to avoid re-copying the existing data.

60. "just want to throw in my question from the longer email in the other
Thread here. How will the bloom filter help a new consumer to decide to
apply the key or not?" Not sure that I fully understood your question. The
consumer just reads whatever key is in the log. The bloom filter just helps
clean up the old keys.

61. "Why can we afford having a topic where its apparently not possible to
start a new application on? I think this is an overall flaw of the
discussed idea here. Not playing attention to the overall architecture."
Could you explain a bit more when one can't start a new application?

Jun



On Sat, Mar 10, 2018 at 1:40 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Jun, thanks for your mail.

Thank you for your questions!
I think they are really good and tackle the core of the problem I see.

I will answer inline, mostly but still want to set the tone here.

The core strength of Kafka is what Martin once called the
kappa architecture. How does this work?
You have everything as a log, as in Kafka. When you need to change
something, you create the new version of your application and leave it
running in parallel.
Once the new version is good you switch y

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-12 Thread Jan Filipiak
rs that everyone prefers linear hashing because it reduces the amount of state that needs to be moved between consumers (for stream processing). The KIP has been updated to use linear hashing.

Regarding the migration endeavor: it seems that migrating the producer library to use linear hashing should be pretty straightforward without much operational endeavor. If we don't upgrade the client library to use this KIP, we can not support in-order delivery after the partition number is changed anyway. Suppose we upgrade the client library to use this KIP: if the partition number is not changed, the key -> partition mapping will be exactly the same as it is now because it is still determined using murmur_hash(key) % original_partition_num. In other words, this change is backward compatible.

Regarding the load distribution: if we use linear hashing, the load may be unevenly distributed because those partitions which are not split may receive twice as much traffic as other partitions that are split. This issue can be mitigated by creating the topic with partitions that are several times the number of consumers. And there will be no imbalance if the partition number is always doubled. So this imbalance seems acceptable.

Regarding storing the partition strategy as per-topic config: It seems not necessary since we can still use murmur_hash as the default hash function and additionally apply the linear hashing algorithm if the partition number has increased. Not sure if there is any use-case for a producer to use a different hash function. Jason, can you check if there is some use-case that I missed for using the per-topic partition strategy?

Regarding how to reduce latency (due to state store/load) in the stream processing consumer when the partition number changes: I need to read the Kafka Streams code to understand how Kafka Streams currently migrates state between consumers when the application is added/removed for a given job. I will reply after I finish reading the documentation and code.


Thanks,
Dong


On Mon, Mar 5, 2018 at 10:43 AM, Jason Gustafson <ja...@confluent.io> wrote:


Great discussion. I think I'm wondering whether we can continue to leave Kafka agnostic to the partitioning strategy. The challenge is communicating the partitioning logic from producers to consumers so that the dependencies between each epoch can be determined. For the sake of discussion, imagine you did something like the following:

1. The name (and perhaps version) of a partitioning strategy is stored in topic configuration when a topic is created.
2. The producer looks up the partitioning strategy before writing to a topic and includes it in the produce request (for fencing). If it doesn't have an implementation for the configured strategy, it fails.
3. The consumer also looks up the partitioning strategy and uses it to determine dependencies when reading a new epoch. It could either fail or make the most conservative dependency assumptions if it doesn't know how to implement the partitioning strategy. For the consumer, the new interface might look something like this:

// Return the partition dependencies following an epoch bump
Map<Integer, List<Integer>> dependencies(int numPartitionsBeforeEpochBump, int numPartitionsAfterEpochBump)

The unordered case then is just a particular implementation which never has any epoch dependencies. To implement this, we would need some way for the consumer to find out how many partitions there were in each epoch, but maybe that's not too unreasonable.

Thanks,
Jason


On Mon, Mar 5, 2018 at 4:51 AM, Jan Filipiak <jan.filip...@trivago.com> wrote:


Hi Dong

thank you very much for your questions.

regarding the time spent copying data across:
It is correct that copying data from a topic with one partition mapping to a topic with a different partition mapping takes way longer than we can stop producers. Tens of minutes is a very optimistic estimate here. Many people cannot afford to copy at full steam and therefore will have some rate limiting in place; this can bump the timespan into days. The good part is that the vast majority of the data can be copied while the producers are still going. One can then piggyback the consumers on top of this timeframe, by the method mentioned (provide them a mapping from their old offsets to new offsets in their repartitioned topics). In that way we separate migration of consumers from migration of producers (decoupling these is what Kafka is strongest at). The time to actually swap over the producers should be kept minimal by ensuring that when a swap attempt is started the consumer copying over should be very close to the log end and is expected to finish within the next fetch. The operation should have a time-out and should be "re-attemptable".

Importance of log compaction:
If a producer produces key A to partition 0, it's forever go

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-10 Thread Jan Filipiak
ions to the 2
topics one at a time but without publishing to the new partitions. Then, we
can add new consumer instances to pick up the new partitions. In this
transition phase, no reshuffling is needed since no data is coming from the
new partitions. Finally, we can enable the publishing to the new
partitions.
I think it's even worse than you think. I would like to introduce the
term "transitive copartitioning". Imagine
2 streams applications: one joins (A,B), the other (B,C); then there is a
transitive copartition requirement for
(A,C) to be copartitioned as well. This can spread significantly and
require many consumers to adapt at the same time.


It is also not entirely clear to me how you would not need reshuffling in this
case. If A has a record that never gets updated after the expansion and
the corresponding B record moves to a new partition, how shall they meet
without a shuffle?


53. "Migrating consumer is a step that might be made completly unnecessary
if - for example streams - takes the gcd as partitioning scheme instead of
enforcing 1 to 1." Not sure that I fully understand this. I think you mean
that a consumer application can run more instances than the number of
partitions. In that case, the consumer can just repartition the input
data according to the number of instances. This is possible, but just has
the overhead of reshuffling the data.
No, what I meant is (that is also your question I think, Mathias) that if
you grow a topic by a factor,
then even if your processor is stateful you can just assign all the
multiples of the previous partition to
this consumer, and the state needed to keep processing correctly will be present
without any shuffling.


Say you have an assignment
Stateful consumer => partitions
0 => 0
1 => 1
2 => 2

and you grow your topic by 4, you get:

0 => 0,3,6,9
1 => 1,4,7,10
2 => 2,5,8,11

Say your hashcode is 8. 8 % 3 => 2 before, so the consumer for partition 2 has it.
Now you have 12 partitions, so 8 % 12 => 8, and it goes into partition 8,
which is assigned to the same consumer

who had 2 before and therefore knows the key.
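A small hedged sketch of the invariant this example relies on (plain modulo partitioning assumed; class and helper names are hypothetical): growing by an integer factor and assigning every multiple of the old partition to its old owner preserves key ownership for every possible hash.

    import java.util.stream.IntStream;

    public class FactorGrowthCheck {

        // Plain modulo partitioner, as in the example above.
        static int partition(int keyHash, int numPartitions) {
            return Math.floorMod(keyHash, numPartitions);
        }

        // The consumer that owned old partition p also takes p, p+n, p+2n, ... after growing n -> n*factor.
        static int owningConsumer(int partition, int oldPartitionCount) {
            return partition % oldPartitionCount;
        }

        public static void main(String[] args) {
            int oldCount = 3;
            int newCount = oldCount * 4; // grow 3 -> 12 partitions
            IntStream.range(0, 1_000_000).forEach(keyHash -> {
                int before = owningConsumer(partition(keyHash, oldCount), oldCount);
                int after = owningConsumer(partition(keyHash, newCount), oldCount);
                if (before != after) {
                    throw new AssertionError("ownership changed for hash " + keyHash);
                }
            });
            // e.g. hash 8: 8 % 3 = 2 before, 8 % 12 = 8 after, and 8 is in consumer 2's set {2,5,8,11}
            System.out.println("key ownership preserved for every hash");
        }
    }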

Userland reshuffling is there as an option. And it does exactly what I
suggest. And I think it's the perfect strategy. All I am suggesting is
broker-side support to switch the producers to the newly partitioned
topic. Then the old topic (with too few partitions) can go away. Remember the
list of steps at the beginning of this thread: if one has broker support
where it is required and streams support for the steps that don't necessarily
need it, then one has solved the problem.
I repeat it because I think it's important. I am really happy that you
brought that up, because it's 100% what I want, just with the difference of
having an option to discard the too-small topic later (after all
consumers adapted), and to have ordering correct there. I need broker
support managing the copy process plus the producers, and fencing them against
each other. I also repeat: the copy process can run for weeks in the
worst case. Copying the data is not the longest task; migrating consumers
might very well be.
Once all consumers have switched and copying is really up to date (think ISR-like
up to date), only then do we stop the producer, wait for the copy to
finish, and use the new topic for producing.


After this the topic is in perfect shape, and no one needs to worry
about complicated stuff (old keys hanging around might arrive in some
other topic later). I can only imagine how many tricky bugs are going to
arrive after someone has grown and shrunk his topic 10 times.








54. "The other thing I wanted to mention is that I believe the current
suggestion (without copying data over) can be implemented in pure userland
with a custom partitioner and a small feedbackloop from ProduceResponse =>
Partitionier in coorporation with a change management system." I am not
sure a customized partitioner itself solves the problem. We probably need
some broker side support to enforce when the new partitions can be used. We
also need some support on the consumer/kstream side to preserve the per key
ordering and potentially migrate the processing state. This is not trivial
and I am not sure if it's ideal to fully push to the application space.
Broker support is definitely the preferred way here. I have nothing
against broker support.
I tried to say that for what I would prefer - copying the data over, at
least for log compacted topics -

I would require more broker support than the KIP currently offers.



Jun


On Tue, Mar 6, 2018 at 10:33 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Dong,

are you actually reading my emails, or are you just using the thread I
started for general announcements regarding the KIP?

I tried to argue really hard against linear hashing. Growing the topic by
an integer factor does not require any state redistribution at all. I fail
to see completely where linear hashing helps on log compacted topics.

If you are not willi

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-08 Thread Jan Filipiak

Hi Dong,

are you actually reading my emails, or are you just using the thread I 
started for general announcements regarding the KIP?


I tried to argue really hard against linear hashing. Growing the topic 
by an integer factor does not require any state redistribution at all. I 
fail to see completely where linear hashing helps on log compacted topics.


If you are not willing to explain to me what I might be overlooking: 
that is fine.
But I ask you to not reply to my emails then. Please understand my 
frustration with this.


Best Jan


On 06.03.2018 19:38, Dong Lin wrote:

Hi everyone,

Thanks for all the comments! It appears that everyone prefers linear
hashing because it reduces the amount of state that needs to be moved
between consumers (for stream processing). The KIP has been updated to use
linear hashing.

Regarding the migration endeavor: it seems that migrating producer library
to use linear hashing should be pretty straightforward without
much operational endeavor. If we don't upgrade client library to use this
KIP, we can not support in-order delivery after partition is changed
anyway. Suppose we upgrade client library to use this KIP, if partition
number is not changed, the key -> partition mapping will be exactly the
same as it is now because it is still determined using murmur_hash(key) %
original_partition_num. In other words, this change is backward compatible.
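For readers who have not seen linear hashing before, here is a hedged sketch of the textbook bucket function (the KIP's exact scheme and hash function may differ); it shows both the backward compatibility just described and why unsplit partitions keep roughly twice the traffic, which the next paragraph discusses.

    public class LinearHashingPartitioner {

        /**
         * Textbook linear-hashing address function (illustration only): with originalCount = n
         * and currentCount = N (n <= N), pick the level L with L <= N < 2L, hash into 2L buckets,
         * and fall back to L buckets when the target bucket does not exist yet.
         */
        static int partition(int keyHash, int originalCount, int currentCount) {
            int levelSize = originalCount;
            while (levelSize * 2 <= currentCount) {
                levelSize *= 2;
            }
            int p = Math.floorMod(keyHash, levelSize * 2);
            return p < currentCount ? p : Math.floorMod(keyHash, levelSize);
        }

        public static void main(String[] args) {
            // Growing from 4 to 6 partitions: old partitions 0 and 1 are split into 4 and 5,
            // while 2 and 3 stay unsplit (and therefore keep about twice the traffic).
            for (int h = 0; h < 12; h++) {
                System.out.printf("hash %2d : old p%d -> new p%d%n", h, Math.floorMod(h, 4), partition(h, 4, 6));
            }
            // With currentCount == originalCount the function degenerates to hash % originalCount,
            // which is the backward-compatibility property described above.
        }
    }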

Regarding the load distribution: if we use linear hashing, the load may be
unevenly distributed because those partitions which are not split may
receive twice as much traffic as other partitions that are split. This
issue can be mitigated by creating topic with partitions that are several
times the number of consumers. And there will be no imbalance if the
partition number is always doubled. So this imbalance seems acceptable.

Regarding storing the partition strategy as per-topic config: It seems not
necessary since we can still use murmur_hash as the default hash function
and additionally apply the linear hashing algorithm if the partition number
has increased. Not sure if there is any use-case for producer to use a
different hash function. Jason, can you check if there is some use-case
that I missed for using the per-topic partition strategy?

Regarding how to reduce latency (due to state store/load) in stream
processing consumer when partition number changes: I need to read the Kafka
Stream code to understand how Kafka Stream currently migrate state between
consumers when the application is added/removed for a given job. I will
reply after I finish reading the documentation and code.


Thanks,
Dong


On Mon, Mar 5, 2018 at 10:43 AM, Jason Gustafson <ja...@confluent.io> wrote:


Great discussion. I think I'm wondering whether we can continue to leave
Kafka agnostic to the partitioning strategy. The challenge is communicating
the partitioning logic from producers to consumers so that the dependencies
between each epoch can be determined. For the sake of discussion, imagine
you did something like the following:

1. The name (and perhaps version) of a partitioning strategy is stored in
topic configuration when a topic is created.
2. The producer looks up the partitioning strategy before writing to a
topic and includes it in the produce request (for fencing). If it doesn't
have an implementation for the configured strategy, it fails.
3. The consumer also looks up the partitioning strategy and uses it to
determine dependencies when reading a new epoch. It could either fail or
make the most conservative dependency assumptions if it doesn't know how to
implement the partitioning strategy. For the consumer, the new interface
might look something like this:

// Return the partition dependencies following an epoch bump
Map<Integer, List<Integer>> dependencies(int numPartitionsBeforeEpochBump,
int numPartitionsAfterEpochBump)

The unordered case then is just a particular implementation which never has
any epoch dependencies. To implement this, we would need some way for the
consumer to find out how many partitions there were in each epoch, but
maybe that's not too unreasonable.
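As a hedged illustration of this interface (the generics in the signature are reconstructed from context, and the exact dependency semantics are only a discussion sketch), here is what an implementation could return for the integer-factor growth case debated elsewhere in this thread: each new partition depends on the single old partition whose keys it inherits.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FactorGrowthStrategy {

        // Under plain modulo partitioning and integer-factor growth, every key in new partition q
        // previously lived in old partition q % oldCount, so only that partition needs to be drained
        // up to the epoch boundary before q is read. Illustration only, not a Kafka API.
        static Map<Integer, List<Integer>> dependencies(int numPartitionsBeforeEpochBump,
                                                        int numPartitionsAfterEpochBump) {
            if (numPartitionsAfterEpochBump % numPartitionsBeforeEpochBump != 0) {
                throw new IllegalArgumentException("this sketch only handles integer-factor growth");
            }
            Map<Integer, List<Integer>> deps = new HashMap<>();
            for (int q = 0; q < numPartitionsAfterEpochBump; q++) {
                deps.put(q, Collections.singletonList(q % numPartitionsBeforeEpochBump));
            }
            return deps;
        }

        public static void main(String[] args) {
            System.out.println(dependencies(3, 12).get(8)); // prints [2]
        }
    }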

Thanks,
Jason


On Mon, Mar 5, 2018 at 4:51 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi Dong

thank you very much for your questions.

regarding the time spent copying data across:
It is correct that copying data from a topic with one partition mapping to
a topic with a different partition mapping takes way longer than we can
stop producers. Tens of minutes is a very optimistic estimate here. Many
people cannot afford to copy at full steam and therefore will have some rate
limiting in place; this can bump the timespan into days. The good part
is that the vast majority of the data can be copied while the producers are
still going. One can then piggyback the consumers on top of this timeframe,
by the method mentioned (provide them a mapping from their old offsets to
new offsets in their repartitioned topics). In that way 

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-03-05 Thread Jan Filipiak

Hi Dong

thank you very much for your questions.

regarding the time spent copying data across:
It is correct that copying data from a topic with one partition mapping
to a topic with a different partition mapping takes way longer than we
can stop producers. Tens of minutes is a very optimistic estimate here.
Many people cannot afford to copy at full steam and therefore will have some
rate limiting in place; this can bump the timespan into days. The
good part is that the vast majority of the data can be copied while the
producers are still going. One can then piggyback the consumers on top
of this timeframe, by the method mentioned (provide them a mapping from
their old offsets to new offsets in their repartitioned topics). In that
way we separate migration of consumers from migration of producers
(decoupling these is what Kafka is strongest at). The time to actually
swap over the producers should be kept minimal by ensuring that when a
swap attempt is started, the consumer copying over should be very close
to the log end and expected to finish within the next fetch. The
operation should have a time-out and should be re-attemptable.


Importance of log compaction:
If a producer produces key A to partition 0, it's forever going to be there
unless it gets deleted. The record might sit in there for years. A new
producer started with the new partitions will fail to delete the record
in the correct partition. The record will be there forever and one
cannot reliably bootstrap new consumers. I cannot see how linear hashing
can solve this.


Regarding your skipping of userland copying:
100%, copying the data across in userland is, as far as I can see, only
a use-case for log compacted topics. Even for log compaction + retention
it should only be opt-in. Why did I bring it up? I think log compaction
is a very important feature to really embrace Kafka as a "data
platform". The point I also want to make is that copying data this way
is completely in line with the Kafka architecture: it only consists of
reading from and writing to topics.


I hope this clarifies more why I think we should aim for more than the
current KIP. I fear that once the KIP is done, not much more effort will
be taken.




On 04.03.2018 02:28, Dong Lin wrote:

Hey Jan,

In the current proposal, the consumer will be blocked on waiting for other
consumers of the group to consume up to a given offset. In most cases, all
consumers should be close to the LEO of the partitions when the partition
expansion happens. Thus the time waiting should not be long e.g. on the
order of seconds. On the other hand, it may take a long time to wait for
the entire partition to be copied -- the amount of time is proportional to
the amount of existing data in the partition, which can take tens of
minutes. So the amount of time that we stop consumers may not be on the
same order of magnitude.

If we can implement this suggestion without copying data over in pure
userland, it will be much more valuable. Do you have ideas on how this can
be done?

Not sure why the current KIP not help people who depend on log compaction.
Could you elaborate more on this point?

Thanks,
Dong

On Wed, Feb 28, 2018 at 10:55 PM, Jan Filipiak<jan.filip...@trivago.com>
wrote:


Hi Dong,

I tried to focus on what steps one can currently perform to expand
or shrink a keyed topic while maintaining top-notch semantics.
I can understand that there might be confusion about "stopping the
consumer". It is exactly the same as proposed in the KIP: there needs to be
a time at which the producers agree on the new partitioning. The extra semantics I
want to put in there is that we have a possibility to wait until all the
existing data
is copied over into the new partitioning scheme. When I say stopping, I
think more of having a memory barrier that ensures the ordering. I am still
aiming for latencies on the scale of leader failovers.

Consumers have to explicitly adapt the new partitioning scheme in the
above scenario. The reason is that in these cases where you are dependent
on a particular partitioning scheme, you also have other topics that have
co-partition enforcements or the kind -frequently. Therefore all your other
input topics might need to grow accordingly.


What I was suggesting was to streamline all these operations as best as
possible to have "real" partition growth and shrinkage going on. Migrating
the producers to a new partitioning scheme can be much more streamlined
with proper broker support for this. Migrating consumers is a step that
might be made completely unnecessary if - for example streams - takes the
gcd as partitioning scheme instead of enforcing 1 to 1. Connect consumers
and other consumers should be fine anyway.

I hope this makes it clearer where I was aiming. The rest needs to be
figured out. The only danger I see is that when we are introducing this
feature as proposed in the KIP, it won't help any people de

Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-02-28 Thread Jan Filipiak

Hi Dong,

I tried to focus on what steps one can currently perform to
expand or shrink a keyed topic while maintaining top-notch semantics.
I can understand that there might be confusion about "stopping the
consumer". It is exactly the same as proposed in the KIP: there needs to be
a time at which the producers agree on the new partitioning. The extra semantics
I want to put in there is that we have a possibility to wait until all
the existing data
is copied over into the new partitioning scheme. When I say stopping, I
think more of having a memory barrier that ensures the ordering. I am
still aiming for latencies on the scale of leader failovers.


Consumers have to explicitly adapt the new partitioning scheme in the 
above scenario. The reason is that in these cases where you are 
dependent on a particular partitioning scheme, you also have other 
topics that have co-partition enforcements or the kind -frequently. 
Therefore all your other input topics might need to grow accordingly.



What I was suggesting was to streamline all these operations as best as
possible to have "real" partition growth and shrinkage going on. Migrating
the producers to a new partitioning scheme can be much more streamlined
with proper broker support for this. Migrating consumers is a step that
might be made completely unnecessary if - for example streams - takes the
gcd as partitioning scheme instead of enforcing 1 to 1. Connect
consumers and other consumers should be fine anyway.


I hope this makes it clearer where I was aiming. The rest needs to be
figured out. The only danger I see is that when we are introducing this
feature as proposed in the KIP, it won't help any people depending on
log compaction.


The other thing I wanted to mention is that I believe the current
suggestion (without copying data over) can be implemented in pure
userland with a custom partitioner and a small feedback loop from
ProduceResponse => Partitioner, in cooperation with a change management
system.


Best Jan







On 28.02.2018 07:13, Dong Lin wrote:

Hey Jan,

I am not sure if it is acceptable for producer to be stopped for a while,
particularly for online application which requires low latency. I am also
not sure how consumers can switch to a new topic. Does user application
needs to explicitly specify a different topic for producer/consumer to
subscribe to? It will be helpful for discussion if you can provide more
detail on the interface change for this solution.

Thanks,
Dong

On Mon, Feb 26, 2018 at 12:48 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi,

just want to throw my thoughts in. In general the functionality is very
useful; we should, though, not try too hard to find the architecture while
implementing.

The manual steps would be to

create a new topic
then mirrormake from the old topic to the new topic
wait for mirror making to catch up.
then put the consumers onto the new topic
 (having mirrormaker spit out a mapping from old offsets to new offsets:
 if topic is increased by factor X there is gonna be a clean
mapping from 1 offset in the old topic to X offsets in the new topic,
 if there is no factor then there is no chance to generate a
mapping that can be reasonably used for continuing)
 make consumers stop at appropriate points and continue consumption
with offsets from the mapping.
have the producers stop for a minimal time.
wait for mirrormaker to finish
let producer produce with the new metadata.


Instead of implementing the approach suggested in the KIP, which will leave
log compacted topics completely crumbled and unusable,
I would much rather try to build infrastructure to support the
operations mentioned above more smoothly.
Especially having producers stop and use another topic is difficult, and
it would be nice if one could trigger "invalid metadata" exceptions for them
and
if one could give topics aliases so that their produce requests to the old topic
will arrive in the new topic.

The downsides are obvious I guess (having the same data twice for the
transition period, but Kafka tends to scale well with data size). So it's a
nicer fit into the architecture.

I further want to argue that the functionality of the KIP can
be implemented completely in "userland" with a custom partitioner that
handles the transition as needed. I would appreciate it if someone could point
out what a custom partitioner couldn't handle in this case.

With the above approach, shrinking a topic becomes the same steps, without
losing keys in the discontinued partitions.

Would love to hear what everyone thinks.

Best Jan


















On 11.02.2018 00:35, Dong Lin wrote:


Hi all,

I have created KIP-253: Support in-order message delivery with partition
expansion. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-253%
3A+Support+in-order+message+delivery+with+partition+expansion
.

This KIP provides a way to allow messages of the same key 

[jira] [Created] (KAFKA-6599) KTable KTable join semantics violated when caching enabled

2018-02-27 Thread Jan Filipiak (JIRA)
Jan Filipiak created KAFKA-6599:
---

 Summary: KTable KTable join semantics violated when caching enabled
 Key: KAFKA-6599
 URL: https://issues.apache.org/jira/browse/KAFKA-6599
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Jan Filipiak


Say a tuple A,B got emitted after joining and the delete for A goes into the
cache. After that, the B record would be deleted as well. B's join processor
would look up A and see `null` while computing the old and new value (at this
point we can execute the joiner with A being null and still emit something, but
it's not going to represent the actual oldValue). Then A's cache flushes; it doesn't
see B, so it's also not going to put a proper oldValue. The output can then not be
used for, say, any aggregate, as a delete would not reliably find the old
aggregate it needs to be removed from. Filter will also break, as it stops
null,null changes from propagating. So for me it looks pretty clear that
caching with join breaks KTable semantics, be it my new join or the currently
existing ones.

 

this if branch here

[https://github.com/apache/kafka/blob/1.0/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java#L155]

is not useful. I think it's there because if one were to delegate the true case
to the underlying store, one would get proper semantics for streams, but the
weirdest cache I've seen.
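A hedged workaround sketch (standard Streams configuration; application id and bootstrap servers are placeholders): setting the record cache to zero takes caching out of the picture entirely, which sidesteps the behaviour described above but does not fix the underlying semantics issue.

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class NoCacheStreamsConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-join-app");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // With a zero-byte record cache every update, including the deletes discussed above,
            // is forwarded downstream immediately instead of being absorbed by the cache.
            props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
            // new KafkaStreams(topology, props) would then run the join without caching.
        }
    }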

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion

2018-02-26 Thread Jan Filipiak

Hi,

just want to throw my thoughts in. In general the functionality is very
useful; we should, though, not try too hard to find the architecture while
implementing.


The manual steps would be to

create a new topic
then mirrormake from the old topic to the new topic
wait for mirror making to catch up.
then put the consumers onto the new topic
(having mirrormaker spit out a mapping from old offsets to new offsets:
if the topic is increased by factor X there is gonna be a clean
mapping from 1 offset in the old topic to X offsets in the new topic;
if there is no factor then there is no chance to generate a
mapping that can be reasonably used for continuing - see the sketch after this list)
make consumers stop at appropriate points and continue consumption
with offsets from the mapping.

have the producers stop for a minimal time.
wait for mirrormaker to finish
let producer produce with the new metadata.
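To make the factor-X offset mapping in the parenthetical above concrete, here is a minimal hedged sketch (hypothetical class and method names, plain modulo partitioning assumed; nothing like this exists in mirrormaker today):

    import java.util.Arrays;

    // Hypothetical sketch of the per-offset mapping a mirroring job could emit when one
    // old partition is copied into X new partitions. Illustration only, not a Kafka API.
    public class MirrorOffsetMapper {

        private final int factor;                 // X: how many new partitions one old partition fans out into
        private final long[] nextOffsetPerTarget; // records written so far to each of the X targets

        MirrorOffsetMapper(int factor) {
            this.factor = factor;
            this.nextOffsetPerTarget = new long[factor];
        }

        /** Called for every record copied from the old partition, in old-offset order. */
        void onRecordCopied(int keyHash, int oldPartitionCount) {
            int newPartition = Math.floorMod(keyHash, oldPartitionCount * factor);
            int target = newPartition / oldPartitionCount; // which of this old partition's X children received it
            nextOffsetPerTarget[target]++;
        }

        /** The X new-partition offsets at which a consumer positioned at the current old offset can resume. */
        long[] mappingForCurrentOldOffset() {
            return Arrays.copyOf(nextOffsetPerTarget, factor);
        }
    }

If there is no clean integer factor, a record's target partition is no longer tied to its old partition in this simple way, which is exactly why the list above says no usable mapping can be generated in that case.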


Instead of implementing the approach suggested in the KIP, which will leave
log compacted topics completely crumbled and unusable,
I would much rather try to build infrastructure to support the
operations mentioned above more smoothly.

Especially having producers stop and use another topic is difficult, and
it would be nice if one could trigger "invalid metadata" exceptions for
them and
if one could give topics aliases so that their produce requests to the old
topic will arrive in the new topic.


The downsides are obvious I guess (having the same data twice for the
transition period, but Kafka tends to scale well with data size). So it's
a nicer fit into the architecture.


I further want to argue that the functionality of the KIP can
be implemented completely in "userland" with a custom partitioner that
handles the transition as needed. I would appreciate it if someone could
point out what a custom partitioner couldn't handle in this case.


With the above approach, shrinking a topic becomes the same steps,
without losing keys in the discontinued partitions.


Would love to hear what everyone thinks.

Best Jan

















On 11.02.2018 00:35, Dong Lin wrote:

Hi all,

I have created KIP-253: Support in-order message delivery with partition
expansion. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-253%3A+Support+in-order+message+delivery+with+partition+expansion
.

This KIP provides a way to allow messages of the same key from the same
producer to be consumed in the same order they are produced even if we
expand partition of the topic.

Thanks,
Dong





Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2018-02-16 Thread Jan Filipiak

Update:

I want to give a quick update on what I found porting the 0.10 version 
towards 1.0.


1. It is difficult to provide a stock CombinedKey Serde.
We effectively wrap 2 serdes for the key. We do not have good topic 
names to feed into the Avro Serde for K1 and K2 for the same topic.
We can also not carry along the Serdes from the creation of the 
table and remember the topic name because of whitelist subscriptions.

2. We should drop the idea of key splitter and combiner.
I cannot seem to find a good place to have a single layer to handle
this. It seems to spread everywhere throughout the codebase. I think
that's due to the fact that it is an oddity and a break in the
architecture to have something like this. Maybe one introduces that in a
later step, but it's
very messy to have it in the first step, and it is really consuming 80%
of the effort put into the KIP.
3. Caching is messing with my head very heavily at the moment. I have
full control over the RocksDB store holding the right side (B), so I can make
it not cache. Which is good. I do inherit the store of the left side
(A) and I have no control over its caching behaviour.

Let me elaborate:

Say a tuple A,B got emitted after joining and the delete for A goes into
the cache.

After that the B record would be deleted as well.
B's join processor would look up A and see `null` while computing the
old and new value
(at this point we can execute the joiner with A being null and still emit
something, but it's not going to represent the actual oldValue).

Then A's cache flushes;
it doesn't see B, so it's also not going to put a proper oldValue.

The output can then not be used for, say, any aggregate, as a delete
would not reliably find the old aggregate it needs to be removed from.
Filter will also break, as it stops null,null changes from
propagating. So for me it looks pretty clear that caching with join
breaks KTable semantics, be it my new join or the
currently existing ones. (A small code sketch for taking caching out of the
picture follows after point 4 below.)


4. I further want to propose that I leave out IQ support in the first
step. Copy-pasting the if(storeName == null) that is in almost any
processor is not ideal. I want to lift it to the topology level in
the next step (adding a new processor that will maintain the
user-provided store as a downstream processor).
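Regarding point 3 above, a hedged sketch of how the side one does control can be materialized without caching in the 1.0 DSL (topic name, store name and serdes are placeholders):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class RightSideWithoutCache {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Materialize the table backing the right side (B) with caching turned off,
            // so deletes are forwarded immediately instead of being absorbed by the cache.
            KTable<String, String> b = builder.table("topic-b",
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("b-store")
                            .withKeySerde(Serdes.String())
                            .withValueSerde(Serdes.String())
                            .withCachingDisabled());
            // The left side (A) is usually inherited from upstream, which is exactly the part
            // whose caching behaviour the update above says it has no control over.
        }
    }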


That is where I stand currently. I would appreciate feedback on all the 
points


Best Jan







On 27.10.2017 06:38, Jan Filipiak wrote:

Hello everyone,

this is the new discussion thread after the ID-clash.

Best
Jan

__


Hello Kafka-users,

I want to continue with the development of KAFKA-3705, which allows
the Streams DSL to perform KTable-KTable joins when the KTables have a
one-to-many relationship.
To make sure we cover the requirements of as many users as possible
and have a good solution afterwards I invite everyone to read through
the KIP I put together and discuss it here in this Thread.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable

https://issues.apache.org/jira/browse/KAFKA-3705
https://github.com/apache/kafka/pull/3720

I think a public discussion and vote on a solution is exactly what is
needed to bring this feature into kafka-streams. I am looking forward
to everyones opinion!

Please keep the discussion on the mailing list rather than commenting
on the wiki (wiki discussions get unwieldy fast).

Best
Jan







Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-19 Thread Jan Filipiak

Sorry for coming back at this so late.



On 11.12.2017 07:12, Colin McCabe wrote:

On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote:

On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:

Hi,

sorry for the late reply, busy times :-/

I would ask you one thing maybe. Since the timeout
argument seems to be settled, I have no further argument
from your side except the "I don't want to".

Can you see that connections.max.idle.ms is the exact time
that expresses "We expect the client to be away for this long,
and come back and continue"?

Hi Jan,

Sure, connections.max.idle.ms is the exact time that we want to keep
around a TCP session.  TCP sessions are relatively cheap, so we can
afford to keep them around for 10 minutes by default.  Incremental fetch
state is less cheap, so we want to set a shorter timeout for it.  We
also want new TCP sessions to be able to reuse an existing incremental
fetch session rather than creating a new one and waiting for the old one
to time out.
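For reference, the two lifetimes being contrasted here are governed by separate broker settings; a hedged sketch (the incremental-fetch name follows the KIP proposal and may differ from what finally ships):

    # How long an idle TCP connection is kept open; TCP sessions are cheap, so the default is generous.
    connections.max.idle.ms=600000
    # Incremental fetch session state is more expensive, so the number of cached sessions is bounded
    # separately rather than being tied to open connections.
    max.incremental.fetch.session.cache.slots=1000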


also clarified some stuff inline

Best Jan




On 05.12.2017 23:14, Colin McCabe wrote:

On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:

Hi Colin

Addressing the topic of how to manage slots from the other thread.
With tcp connections all this comes for free essentially.

Hi Jan,

I don't think that it's accurate to say that cache management "comes for
free" by coupling the incremental fetch session with the TCP session.
When a new TCP session is started by a fetch request, you still have to
decide whether to grant that request an incremental fetch session or
not.  If your answer is that you always grant the request, I would argue
that you do not have cache management.

First I would say, the client has a big say in this. If the client
is not going to issue incremental fetches, it shouldn't ask for a cache;
when the client asks for the cache, we still have all options to deny it.

To put it simply, we have to have some cache management above and beyond
just giving out an incremental fetch session to anyone who has a TCP
session.  Therefore, caching does not become simpler if you couple the
fetch session to the TCP session.
Simply giving out a fetch session to everyone with a connection is too 
simple,
but I think it plays well into the idea of consumers choosing to use the 
feature,
and therefore only enabling it where it brings maximum gains 
(replicas, MirrorMakers).



I guess you could argue that timeouts are cache management, but I don't
find that argument persuasive.  Anyone could just create a lot of TCP
sessions and use a lot of resources, in that case.  So there is
essentially no limit on memory use.  In any case, TCP sessions don't
help us implement fetch session timeouts.

We still have all the options of denying the request to keep the state.
What you want seems like a max connections / per-IP safeguard.
I can currently take down a broker with too many connections easily.



I would still argue we disable it by default, add a flag in the
broker to ask the leader to maintain the cache while replicating, and also only
have it optional in consumers (default to off) so one can turn it on
where it really hurts: MirrorMaker and audit consumers prominently.

I agree with Jason's point from earlier in the thread.  Adding extra
configuration knobs that aren't really necessary can harm usability.
Certainly asking people to manually turn on a feature "where it really
hurts" seems to fall in that category, when we could easily enable it
automatically for them.

This doesn't make much sense to me.

There are no tradeoffs to think about from the client's point of view:
it always wants an incremental fetch session.  So there is no benefit to
making the clients configure an extra setting.  Updating and managing
client configurations is also more difficult than managing broker
configurations for most users.


You also wanted to implement
a "turn off in case of bug"-knob. Having the client indicate whether the
feature will be used seems reasonable to me.

True.  However, if there is a bug, we could also roll back the client,
so having this configuration knob is not strictly required.


Otherwise I left a few remarks in-line, which should help to understand
my view of the situation better

Best Jan


On 05.12.2017 08:06, Colin McCabe wrote:

On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:

On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correla

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-12-18 Thread Jan Filipiak

Hi Guozhang,

 thanks for the update.

On 15.12.2017 22:54, Guozhang Wang wrote:

Jan,

Thanks for the updated KIP, and the raised questions. Here are my thoughts
around the "back and forth mapper" approach on your wiki:

1) regarding the key-value types of KTableValueGetter, we do not
necessarily enforce its template K, V to be the same as its calling
Processor<K, V>, although today in all implementations we happen to do so.
So I think it is ok to extend this internal implementation to allow getter
and forwarding with different types.
I am not entirely sure how you mean this. The dependency is only there 
because the downstream processor is going to invoke the ValueGetter with the 
downstream key; if this key is of a different type we would run into a 
RuntimeException. We would have to introduce a fourth generic type, the 
"new key type", which would also be the access type of the 
ValueGetter. I like introducing this a lot, but I don't think it works 
without the fourth generic type

and then we have

KTableProcessorSupplier<KEY_IN, KEY_OUT, VALUE_IN, VALUE_OUT>
        extends ProcessorSupplier<KEY_IN, VALUE_IN> {

    ValueGetterSupplier<KEY_OUT, VALUE_OUT> view()
    ProcessorSupplier<KEY_IN, VALUE_IN> processor()
}

This would conveniently allow for some flatMap() on KTable, which is a 
neat thing IMO.
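To make the shape above concrete, a compilable sketch with simplified
stand-in declarations (these are not the real Streams signatures, just
enough to show where the fourth generic parameter goes):

interface Processor<K, V> { void process(K key, V value); }
interface ProcessorSupplier<K, V> { Processor<K, V> get(); }
interface KTableValueGetterSupplier<K, V> { }

interface KTableProcessorSupplier<KIn, KOut, VIn, VOut>
        extends ProcessorSupplier<KIn, VIn> {

    // the processing side stays keyed on the input types (inherited get()),
    // while the view is keyed on the *output* key type, so a downstream
    // processor can call the ValueGetter with the key it actually received
    KTableValueGetterSupplier<KOut, VOut> view();
}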


2) regarding the KTableProcessorSupplier enforcing to return the same
key/value types of its "KTableValueGetterSupplier<K, T> view();" the key
observation to note is that "ProcessorSupplier" inside "KTableImpl<K,
V>" does not enforce to have the same key-value types of the KTable, i.e.
we can use a "ProcessorSupplier<K1, V1>" inside the impl of a `KTable<K,
V>`. I think that should help getting around the issue.
I think it has nothing really to do with the places where the 
ProcessorSupplier is referenced; the quirk lives inside 
KTableProcessorSupplier. Regardless of the scope of usage I cannot 
come up with a KTableProcessorSupplier that changes keys while 
maintaining all invariants (being queryable). One
can jump across with a plain ProcessorSupplier, where it's obvious that you 
can't have a ValueGetterSupplier, but this is

rather a hack.

Why is it only inside KTableProcessorSupplier? We process key K, and 
then we forward K1, but our ValueGetterSupplier
can only have K as its generic type and therefore it will crash if we invoke 
the ValueGetter.




3) About the alternative KTable::mapKeys(), I think the major issue is that
this mapKeys() cannot enforce users to always call it to get the
"non-combined" Key, and hence users may still need to consider the serde of
"CombinedKey" if they do not call mapKeys and then directly pipe it to the
output, while this approach enforce them to always "map" it before trying
to write it to anywhere externally exposable.

It cannot force them, but folks who want this can do it. People that are
fine with any CombinedKey type could just let it be forwarded as such.

A new aspect that I had not thought of yet is that of course in a
to() call they could pass in a CombinedKeySerde on their own. I think
this flexibility is a plus rather than a minus. What do you think?


4) A very minor comment on the wiki page itself, about the "back and forth
mapper" section: the parameter names "customCombinedKey" and "combinedKey"
seem a bit hard to understand for normal users; should we consider renaming
them to something more understandable? For example, "outputKeyCombiner" and
"outputKeySplitter"?

Yes, your naming is superior.




Guozhang


On Thu, Dec 7, 2017 at 3:58 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


On 05.12.2017 00:42, Matthias J. Sax wrote:


Jan,

The KTableValueGetter thing is a valid point. I think we would need a
backwards mapper (or merge both into one and sacrifices lambdas?).
Another alternative would be, to drop the optimization and materialize
the KTable.operator() result... (not a great solution either). I am
personally fine with a backwards mapper (we should call it KeySplitter).

2. I am not sure if we can pull it of w/o said forth generic type in

KTable (that I am in favour of btw)


Not sure if I can follow here. I am personally not worried about the

number of generic types -- it's just to have a clear definition what
each passed parameter does.


I need to double-check this again. It's good that we are open to introducing
a new one.
I think it will not work currently: a KTableProcessorSupplier, when asked
for a ValueGetterSupplier, can only return a ValueGetterSupplier that has the
same key type as the key it receives in the process method, even though it
would forward a different key type; therefore the KTable's key type can't
change. I am thinking about how to pull this off, but I see little chance.

But I am always in big favour of introducing the fourth type OutputKey; it
would become
straightforward then.

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-12-12 Thread Jan Filipiak

Hi, I updated the KIP

I would be open for this:

We mark the "less intrusive" and the "back and forth mapper" approaches as 
rejected alternatives,

and implement the two remaining methods.

Any thoughts?

Best Jan


On 07.12.2017 12:58, Jan Filipiak wrote:


On 05.12.2017 00:42, Matthias J. Sax wrote:

Jan,

The KTableValueGetter thing is a valid point. I think we would need a
backwards mapper (or merge both into one and sacrifices lambdas?).
Another alternative would be, to drop the optimization and materialize
the KTable.operator() result... (not a great solution either). I am
personally fine with a backwards mapper (we should call it KeySplitter).


2. I am not sure if we can pull it of w/o said forth generic type in
KTable (that I am in favour of btw)

Not sure if I can follow here. I am personally not worried about the
number of generic types -- it's just to have a clear definition what
each passed parameter does.
I need to double-check this again. It's good that we are open to 
introducing a new one.
I think it will not work currently: a KTableProcessorSupplier, when 
asked for a ValueGetterSupplier, can only return a ValueGetterSupplier 
that has the same key type as the key it receives in the process method, 
even though it would forward a different key type; therefore the KTable's 
key type can't change. I am thinking about how to pull this off, but I 
see little chance.


But I am always in big favour of introducing the fourth type OutputKey; 
it would become

straightforward then. I hope you can follow.

+ It won't solve people's problem of having CombinedKey on the wire and 
not being able to inspect the topic with, say, their default tools.

I see your point, but do we not have this issue always? To make range
scan work, we need to serialize the prefix (K1) and suffix (K)
independently from each other. IMHO, it would be too much of a burden to
the user to provide a single serializer for K0 that guarantees the
ordering we need. Still, advanced users can provide a custom Serde for the
changelog topic via `Joined` -- and they can serialize as they wish (i.e.,
get CombinedKey<K1,K>, convert internally to K0 and serialize -- but
this is an opt-in).

I think this actually aligns with what you are saying. However, I think
the #prefix() call is not the best idea. We can just use the Serde for
this (if users overwrite the CombinedKey Serde, it must overwrite the Serde
too and can return the proper prefix (or do I miss something?).

I can't follow. With the stock implementation users would get by default,
they wouldn't need prefix(); users wouldn't have to define it, we can
implement it ourselves by just using the K1 Serde.

But to override with a custom Serde, that prefix method is needed as an
indicator of whether only the prefix or the full key is to be rendered.
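To make that a bit more tangible, here is a minimal sketch of what such a
stock serializer could look like. CombinedKey and the class names are
hypothetical illustrations of the proposal, not existing Kafka classes; only
the Serializer interface is the standard one:

import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;

// hypothetical key holder from the KIP, not an existing Kafka class
class CombinedKey<K1, K2> {
    final K1 prefix;   // key of the "one" side
    final K2 suffix;   // key of the "many" side; null means "render the prefix only"
    CombinedKey(K1 prefix, K2 suffix) { this.prefix = prefix; this.suffix = suffix; }
}

class CombinedKeySerializer<K1, K2> implements Serializer<CombinedKey<K1, K2>> {
    private final Serializer<K1> prefixSerializer;
    private final Serializer<K2> suffixSerializer;

    CombinedKeySerializer(Serializer<K1> prefixSerializer, Serializer<K2> suffixSerializer) {
        this.prefixSerializer = prefixSerializer;
        this.suffixSerializer = suffixSerializer;
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, CombinedKey<K1, K2> data) {
        byte[] prefix = prefixSerializer.serialize(topic, data.prefix);
        byte[] suffix = data.suffix == null
                ? new byte[0]                  // prefix-only rendering, used for range scans
                : suffixSerializer.serialize(topic, data.suffix);
        // length-prefix K1 so that all keys sharing the same K1 stay contiguous
        ByteBuffer buf = ByteBuffer.allocate(4 + prefix.length + suffix.length);
        buf.putInt(prefix.length).put(prefix).put(suffix);
        return buf.array();
    }

    @Override
    public void close() { }
}

A custom implementation could do the same with the user's own K0 format
instead, as long as the prefix-only form stays a byte prefix of the full form.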



  - I'd rather introduce KTable::mapKeys() or something (4th generic 
in KTable?) than overloading. It is better SoC-wise.

What overload are you talking about? From my understanding, we want to
add one single method (or maybe one each for inner, left, outer), but I
don't see any overloads atm?

The back and forth mapper would get an overload


Also, `KTable.mapKeys()` would have the issue that one could create an
invalid KTable with key collisions. I would rather shield users from shooting
themselves in the foot.
This mapKeys would not be used to remove the actual values but to get 
rid of the CombinedKey type.
Users can shoot themselves with the proposed back and forth mapper you 
suggested.






Side remark:

In the KIP, in the Step-by-Step table (that I really like a lot!) I
think in line 5 (input A with key A2 arrives), the columns "state B
materialized" and "state B other task" should not be empty but the same
as in line 4?

Will double-check tonight. Totally plausible I messed this up!

best Jan




-Matthias


On 11/25/17 8:56 PM, Jan Filipiak wrote:

Hi Matthias,

2 things that pop into my mind Sunday morning. Can we provide a
KTableValueGetter when the key in the store is different from the key
forwarded?
1. we would need a backwards mapper
2. I am not sure if we can pull it off w/o said fourth generic type in
KTable (that I am in favour of btw)

+ It won't solve people's problem of having CombinedKey on the wire and
not being able to inspect the topic with, say, their default tools.
  - I'd rather introduce KTable::mapKeys() or something (4th generic in
KTable?) than overloading. It is better SoC-wise.

I am thinking more of an overload where we replace the CombinedKey
Serde. So people can use a default CombinedKey Serde
but could provide their own implementation that would internally use K0
for serialisation and deserialisation. One could implement
a ##prefix() into this call to make explicit that we only want the
prefix rendered. This would take CombinedKey logic out of publicly visible
data. A stock CombinedKey Serde that would be used by default could also
handle the JSON users correctly.

Users would still get CombinedKey back. The downside of ge

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-08 Thread Jan Filipiak


On 08.12.2017 10:43, Ismael Juma wrote:

One correction below.

On Fri, Dec 8, 2017 at 11:16 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


We only check max.message.bytes too late to guard against consumer stalling.
We don't have a notion of max.networkpacket.size before we allocate the
bytebuffer to read it into.


We do: socket.request.max.bytes.

Ismael



Perfect, I didn't know we have this in the meantime. :) Good that we have it.

It's a very good safeguard, and a nice fail-fast for dodgy clients or 
network interfaces.




Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-08 Thread Jan Filipiak

Hi,

sorry for the late reply, busy times :-/

I would ask you one thing maybe. Since the timeout
argument seems to be settled, I have no further argument
from your side except the "I don't want to".

Can you see that connections.max.idle.ms is the exact time
that expresses "We expect the client to be away for this long,
and come back and continue"?

also clarified some stuff inline

Best Jan




On 05.12.2017 23:14, Colin McCabe wrote:

On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:

Hi Colin

Addressing the topic of how to manage slots from the other thread.
With tcp connections all this comes for free essentially.

Hi Jan,

I don't think that it's accurate to say that cache management "comes for
free" by coupling the incremental fetch session with the TCP session.
When a new TCP session is started by a fetch request, you still have to
decide whether to grant that request an incremental fetch session or
not.  If your answer is that you always grant the request, I would argue
that you do not have cache management.

First I would say, the client has a big say in this. If the client
is not going to issue incremental fetches, it shouldn't ask for a cache;
when the client asks for the cache, we still have every option to deny it.



I guess you could argue that timeouts are cache management, but I don't
find that argument persuasive.  Anyone could just create a lot of TCP
sessions and use a lot of resources, in that case.  So there is
essentially no limit on memory use.  In any case, TCP sessions don't
help us implement fetch session timeouts.

We still have all the options of denying the request to keep the state.
What you want seems like a max connections / per-IP safeguard.
I can currently take down a broker with too many connections easily.



I would still argue we disable it by default, add a flag in the
broker to ask the leader to maintain the cache while replicating, and also only
have it optional in consumers (default to off) so one can turn it on
where it really hurts: MirrorMaker and audit consumers prominently.

I agree with Jason's point from earlier in the thread.  Adding extra
configuration knobs that aren't really necessary can harm usability.
Certainly asking people to manually turn on a feature "where it really
hurts" seems to fall in that category, when we could easily enable it
automatically for them.

This doesn't make much sense to me. You also wanted to implement
a "turn off in case of bug"-knob. Having the client indicate whether the feature
will be used seems reasonable to me.



Otherwise I left a few remarks in-line, which should help to understand
my view of the situation better

Best Jan


On 05.12.2017 08:06, Colin McCabe wrote:

On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:

On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside the context of that single TCP session.

Keep in mind, NetworkClient is in charge of TCP sessions, and generally
tries to hide that information from the upper layers of the code.  So
when you submit a request to NetworkClient, you don't know if that
request creates a TCP session, or reuses an existing one.

Unfortunately, this doesn't work.  Imagine the client misses an
increment fetch response about a partition.  And then the partition is
never updated after that.  The client has no way to know about the
partition, since it won't be included in any future incremental fetch
responses.  And there are no offsets to compare, since the partition is
simply omitted from the response.

I am curious about in which situation would the follower miss a response
of a partition. If the entire FetchResponse is lost (e.g. timeout), the
follower would disconnect and retry. That will result in sending a full
FetchRequest.

Basically, you are proposing that we rely on TCP for reliable delivery
in a distributed system.  That isn't a good idea for a bunch of
different reasons.  First of all, TCP timeouts tend to be very long.  So
if the TCP session timing out is your error detection mechanism, you
have to wait minutes for messages to timeout.  Of course, we add a
timeout on top of that after which we declare the connection bad and
manually close it.  But just because the session is closed on one end
doesn't mean that the other end knows that it is closed.  So the leader
may have to wait quite a long time before TC

Re: [DISCUSS]: KIP-159: Introducing Rich functions to Streams

2017-12-07 Thread Jan Filipiak

Thank you Bill,

I think this is reasonable. Do you have any suggestion
for handling oldValues in cases like

builder.table().filter(RichPredicate).join()

where we process a Change with old and new value and don't have a record 
context for the old one?


my suggestion would be that instead of

SOURCE -> KTABLESOURCE -> KTABLEFILTER -> JOIN -> SINK

we build

SOURCE  -> KTABLEFILTER ->  KTABLESOURCE -> JOIN -> SINK

We should build a topology like this from the beginning and not have
an optimisation phase afterwards.

Any opinions?

Best Jan




On 05.12.2017 17:34, Bill Bejeck wrote:

Matthias,

Overall I agree with what you've presented here.

Initially, I was hesitant to remove information from the context of the
result records (Joins or Aggregations) with the thought that when there are
unexpected results, the source information would be useful for tracing back
where the error could have occurred.  But in the case of Joins and
Aggregations, the amount of data needed to do meaningful analysis could be
too much. For example, a join result could come from two topics so you'd
need to keep both original topic names, offsets, etc. (plus the broker
could have deleted the records in the interim so even having offset could
provide nothing).

I'm bit long winded here, but I've come full circle to your original
proposal that since Joins and Aggregations produce fundamentally new types,
we drop the corresponding information from the context even in the case of
single topic aggregations.

Thanks,
Bill

On Mon, Dec 4, 2017 at 7:02 PM, Matthias J. Sax <matth...@confluent.io>
wrote:


I agree with Guozhang that just exposing meta data at the source level
might not provide too much value. Furthermore, for timestamps we do
already have a well defined contract and we should exploit it:
timestamps can always be provided in a meaningful way.

Also, for simple operations like KStream-filter/map the contract is
simple and we can just use it. Same for KTable-filter/map (for new values).

For aggregations, join, and oldValue, I could just drop some information
and return `null`/-1, if the result records has no semantically
meaningful meta data.

For example, for aggregations, we could preserve the partition (as all
agg-input-records have the same partition). For single input topic
aggregation (what I guess is the most prominent case), we can also carry
over the topic name (would be a internal repartitioning topic name
often). Offsets don't have any semantic interpretation IMHO and we could
return -1.

For joins, we could keep the partition information. Topic and offset are
both unknown/invalid for the output record IMHO.

For the oldValue case, we can keep partition and for single input topic
case topic name. Timestamp might be -1 for now, but after we added
timestamps to KTable (what we plan to do anyway), we can also return a
valid timestamp. Offset would be -1 again (if we store offset in KTable
too, we could provide all offset as well -- but I don't see too much
value in doing this compared to the storage overhead this implies).
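A minimal sketch of these rules as code, purely illustrative (RecordContext
here is a stand-in class, not the existing Streams API):

final class RecordContext {
    final String topic;     // null where no topic is semantically meaningful
    final int partition;    // -1 where no partition is semantically meaningful
    final long offset;      // -1: offsets carry no meaning for derived records
    final long timestamp;   // -1 until KTable carries timestamps

    RecordContext(String topic, int partition, long offset, long timestamp) {
        this.topic = topic; this.partition = partition;
        this.offset = offset; this.timestamp = timestamp;
    }

    // single-input-topic aggregation: keep the (repartition) topic and partition
    static RecordContext forAggregation(String inputTopic, int partition, long timestamp) {
        return new RecordContext(inputTopic, partition, -1L, timestamp);
    }

    // join: only the partition survives; topic and offset are undefined
    static RecordContext forJoin(int partition, long timestamp) {
        return new RecordContext(null, partition, -1L, timestamp);
    }
}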


WDYT?


-Matthias

On 11/29/17 4:14 AM, Jan Filipiak wrote:

Hi,

thank you for the summary and thanks for acknowledging that I do have a
point here.

I don't like the second idea at all. Hence I started off this discussion.

I am just disappointed; back then, when we had the discussion about how
to refactor the store overloads and IQ handling, I knew the path we were
taking is wrong. Having problems implementing these kinds of features
(which are really simple) is just a symptom of a messed-up IQ
implementation. I really wish I could have convinced you guys back then.
To be honest, with IQ we can continue here as we materialize but would not
send oldValue, but with join you're out of luck with the current setup.

I of course recommend not introducing any optimizations here. I'd
recommend going towards what I recommended already back then. So I wouldn't
say we need to optimize anything later; we need to build
the topology better in the first place.




On 28.11.2017 21:00, Guozhang Wang wrote:

Jan,

Thanks for your input, I can understand now that the oldValue is also
exposed in user-customized `filter` functions, and hence what record context
we should expose is a problem. And I think it does bring a good point
to consider for KIP-159. The discussion may be a bit confusing to readers
though, and hence I'd like to summarize the status quo along with a
proposal:

In today's Streams DSL, when a KTable is created either from a source
topic or from a stateful operator, we will materialize the KTable with a
backing state store; on the other hand, KTables created from a non-stateful
operator like filter will not be backed by a state store by default unless
users indicate so (e.g. using the overloaded function with the queryable
name or store supplier).

For example:

KTable table1 = builder.table("topic");   // a state store created for table1
KTable table2 = table1.filter(.

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-12-07 Thread Jan Filipiak


On 05.12.2017 00:42, Matthias J. Sax wrote:

Jan,

The KTableValueGetter thing is a valid point. I think we would need a
backwards mapper (or merge both into one and sacrifices lambdas?).
Another alternative would be, to drop the optimization and materialize
the KTable.operator() result... (not a great solution either). I am
personally fine with a backwards mapper (we should call it KeySplitter).


2. I am not sure if we can pull it of w/o said forth generic type in
KTable (that I am in favour of btw)

Not sure if I can follow here. I am personally not worried about the
number of generic types -- it's just to have a clear definition what
each passed parameter does.
I need to double-check this again. It's good that we are open to 
introducing a new one.
I think it will not work currently: a KTableProcessorSupplier, when 
asked for a ValueGetterSupplier, can only return a ValueGetterSupplier 
that has the same key type as the key it receives in the process method, 
even though it would forward a different key type; therefore the KTable's 
key type can't change. I am thinking about how to pull this off, but I 
see little chance.


But I am always in big favour of introducing the fourth type OutputKey; 
it would become

straightforward then. I hope you can follow.


+ It won't solve people's problem of having CombinedKey on the wire and not being 
able to inspect the topic with, say, their default tools.

I see your point, but do we not have this issue always? To make range
scan work, we need to serialize the prefix (K1) and suffix (K)
independently from each other. IMHO, it would be too much of a burden to
the user to provide a single serializer for K0 that guarantees the
ordering we need. Still, advanced users can provide a custom Serde for the
changelog topic via `Joined` -- and they can serialize as they wish (i.e.,
get CombinedKey<K1,K>, convert internally to K0 and serialize -- but
this is an opt-in).

I think this actually aligns with what you are saying. However, I think
the #prefix() call is not the best idea. We can just use the Serde for
this (if users overwrite the CombinedKey Serde, it must overwrite the Serde
too and can return the proper prefix (or do I miss something?).

I can't follow. With the stock implementation users would get by default,
they wouldn't need prefix(); users wouldn't have to define it, we can
implement it ourselves by just using the K1 Serde.

But to override with a custom Serde, that prefix method is needed as an
indicator of whether only the prefix or the full key is to be rendered.




  - I'd rather introduce KTable::mapKeys() or something (4th generic in KTable?) 
than overloading. It is better SoC-wise.

What overload are you talking about? From my understanding, we want to
add one single method (or maybe one each for inner, left, outer), but I
don't see any overloads atm?

The back and forth mapper would get an overload


Also, `KTable.mapKeys()` would have the issue that one could create an
invalid KTable with key collisions. I would rather shield users from shooting
themselves in the foot.
This mapKeys would not be used to remove the actual values but to get 
rid of the CombinedKey type.
Users can shoot themselves with the proposed back and forth mapper you 
suggested.






Side remark:

In the KIP, in the Step-by-Step table (that I really like a lot!) I
think in line 5 (input A with key A2 arrives), the columns "state B
materialized" and "state B other task" should not be empty but the same
as in line 4?

Will double-check tonight. Totally plausible I messed this up!

best Jan




-Matthias


On 11/25/17 8:56 PM, Jan Filipiak wrote:

Hi Matthias,

2 things that pop into my mind Sunday morning. Can we provide a
KTableValueGetter when the key in the store is different from the key
forwarded?
1. we would need a backwards mapper
2. I am not sure if we can pull it off w/o said fourth generic type in
KTable (that I am in favour of btw)

+ It won't solve people's problem of having CombinedKey on the wire and not
being able to inspect the topic with, say, their default tools.
  - I'd rather introduce KTable::mapKeys() or something (4th generic in
KTable?) than overloading. It is better SoC-wise.

I am thinking more of an overload where we replace the CombinedKey
Serde. So people can use a default CombinedKey Serde
but could provide their own implementation that would internally use K0 for
serialisation and deserialisation. One could implement
a ##prefix() into this call to make explicit that we only want the
prefix rendered. This would take CombinedKey logic out of publicly visible
data. A stock CombinedKey Serde that would be used by default could also
handle the JSON users correctly.

Users would still get CombinedKey back. The downside of getting these
nested deeply is probably mitigated by users doing a group-by
in the very next step to get rid of A's key again.

That is what I was able to come up with so far.
Let me know. what you think




On 22.11.2017 00:14, Matthias J. Sax wrote:

Jan,

Thanks for explaining the Ser

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Jan Filipiak

Hi Colin

Addressing the topic of how to manage slots from the other thread.
With tcp connections all this comes for free essentially.
I would still argue we disable it by default, add a flag in the broker
to ask the leader to maintain the cache while replicating, and also only
have it optional in consumers (default to off) so one can turn it on
where it really hurts:

MirrorMaker and audit consumers prominently.

Otherwise I left a few remarks in-line, which should help to understand
my view of the situation better

Best Jan


On 05.12.2017 08:06, Colin McCabe wrote:

On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:


On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside the context of that single TCP session.

Keep in mind, NetworkClient is in charge of TCP sessions, and generally
tries to hide that information from the upper layers of the code.  So
when you submit a request to NetworkClient, you don't know if that
request creates a TCP session, or reuses an existing one.

Unfortunately, this doesn't work.  Imagine the client misses an
increment fetch response about a partition.  And then the partition is
never updated after that.  The client has no way to know about the
partition, since it won't be included in any future incremental fetch
responses.  And there are no offsets to compare, since the partition is
simply omitted from the response.

I am curious about in which situation would the follower miss a response
of a partition. If the entire FetchResponse is lost (e.g. timeout), the
follower would disconnect and retry. That will result in sending a full
FetchRequest.

Basically, you are proposing that we rely on TCP for reliable delivery
in a distributed system.  That isn't a good idea for a bunch of
different reasons.  First of all, TCP timeouts tend to be very long.  So
if the TCP session timing out is your error detection mechanism, you
have to wait minutes for messages to timeout.  Of course, we add a
timeout on top of that after which we declare the connection bad and
manually close it.  But just because the session is closed on one end
doesn't mean that the other end knows that it is closed.  So the leader
may have to wait quite a long time before TCP decides that yes,
connection X from the follower is dead and not coming back, even though
gremlins ate the FIN packet which the follower attempted to translate.
If the cache state is tied to that TCP session, we have to keep that
cache around for a much longer time than we should.

Hi,

I see this from a different perspective. The cache expiry time
has the same semantics as the idle connection time in this scenario.
It is the time range within which we expect the client to come back and reuse
its broker-side state. I would argue that on close we would get an
extra shot at cleaning up the session state early, as opposed to
always waiting for that duration for expiry to happen.

Hi Jan,

The idea here is that the incremental fetch cache expiry time can be
much shorter than the TCP session timeout.  In general the TCP session
timeout is common to all TCP connections, and very long.  To make these
numbers a little more concrete, the TCP session timeout is often
configured to be 2 hours on Linux.  (See
https://www.cyberciti.biz/tips/linux-increasing-or-decreasing-tcp-sockets-timeouts.html
)  The timeout I was proposing for incremental fetch sessions was one or
two minutes at most.

Currently this is taken care of by
connections.max.idle.ms on the broker, which defaults to a few minutes.

Also something we could let the client change if we really wanted to.
So there is no need to worry about coupling our implementation to some
timeouts given by the OS: with TCP one always has full control over the worst
times, plus one gets the extra shot at cleaning up early when the close comes
through, which is the majority of the cases.




Secondly, from a software engineering perspective, it's not a good idea
to try to tightly tie together TCP and our code.  We would have to
rework how we interact with NetworkClient so that we are aware of things
like TCP sessions closing or opening.  We would have to be careful
preserve the ordering of incoming messages when doing things like
putting incoming requests on to a queue to be processed by multiple
threads.  It's just a lot of complexity to add, and there's no upside.

I see the point here. And I had

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-04 Thread Jan Filipiak
 deletes messages, and this changes the log start offset
of the partition on the leader, or


Ah I see. I think I didn't notice this because the statement assumes that the
LogStartOffset on the leader only changes due to the LogCleaner. In fact the
LogStartOffset can change on the leader due to either log retention or a
DeleteRecordsRequest. I haven't verified whether the LogCleaner can change
the LogStartOffset though. It may be a bit better to just say that a
partition is considered dirty if its LogStartOffset changes.

I agree.  It should be straightforward to just resend the partition if
logStartOffset changes.


4. In "Fetch Session Caching" section, it is said that each broker

has a

limited number of slots. How is this number determined? Does this

require

a new broker config for this number?

Good point.  I added two broker configuration parameters to control this
number.


I am curious to see whether we can avoid some of these new configs. For
example, incremental.fetch.session.cache.slots.per.broker is probably not
necessary because if a leader knows that a FetchRequest comes from a
follower, we probably want the leader to always cache the information
from that follower. Does this make sense?

Yeah, maybe we can avoid having
incremental.fetch.session.cache.slots.per.broker.


Maybe we can discuss the config later after there is agreement on how the
protocol would look like.



What is the error code if broker does
not have new log for the incoming FetchRequest?

Hmm, is there a typo in this question?  Maybe you meant to ask what
happens if there is no new cache slot for the incoming FetchRequest?
That's not an error-- the incremental fetch session ID just gets set to
0, indicating no incremental fetch session was created.


Yeah there is a typo. You have answered my question.



5. Can you clarify what happens if the follower adds a partition to the
ReplicaFetcherThread after receiving a LeaderAndIsrRequest? Does the leader
need to generate a new session for this ReplicaFetcherThread, or does it
re-use the existing session?  If it uses a new session, is the old session
actively deleted from the slot?

The basic idea is that you can't make changes, except by sending a full
fetch request.  However, perhaps we can allow the client to re-use its
existing session ID.  If the client sets sessionId = id, epoch = 0, it
could re-initialize the session.


Yeah I agree with the basic idea. We probably want to understand more
detail about how this works later.

Sounds good.  I updated the KIP with this information.  A
re-initialization should be exactly the same as an initialization,
except that it reuses an existing ID.
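A minimal sketch (names are illustrative only, not the actual broker code)
of how a leader could interpret the sessionId/epoch pair described here:
sessionId 0 means "no incremental session", epoch 0 on an existing id
re-initializes it, and any other epoch has to match exactly.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FetchSessionCache {
    static final int NO_SESSION = 0;

    private final Map<Integer, FetchSession> sessions = new ConcurrentHashMap<>();

    // returns the session to use, or null if the request is served as a plain full fetch
    FetchSession lookup(int sessionId, int epoch) {
        if (sessionId == NO_SESSION) {
            return tryCreate();              // full fetch; maybe hand out a new session
        }
        FetchSession session = sessions.get(sessionId);
        if (session == null) {
            return null;                     // unknown id; client has to fall back to a full fetch
        }
        if (epoch == 0) {
            session.reset();                 // re-initialization, keeps the existing id
            return session;
        }
        if (epoch != session.expectedEpoch()) {
            return null;                     // out-of-order request; reject the increment
        }
        session.advance();
        return session;                      // incremental fetch against the cached state
    }

    private FetchSession tryCreate() {
        return null;                         // cache-slot policy would go here
    }
}

class FetchSession {
    private int epoch = 1;
    int expectedEpoch() { return epoch; }
    void advance()      { epoch++; }
    void reset()        { epoch = 1; }
}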

best,
Colin



BTW, I think it may be useful if the KIP can include the example workflow
of how this feature will be used in case of partition change and so on.

Yeah, that might help.

best,
Colin


Thanks,
Dong


On Wed, Nov 29, 2017 at 12:13 PM, Colin McCabe<cmcc...@apache.org>
wrote:


I updated the KIP with the ideas we've been discussing.

best,
Colin

On Tue, Nov 28, 2017, at 08:38, Colin McCabe wrote:

On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote:

Hi Colin, thank you for this KIP, it can become a really useful thing.

I just scanned through the discussion so far and wanted to start a
thread to make a decision about keeping the cache with the
Connection / Session or having some sort of UUID-indexed global Map.

Sorry if that has been settled already and I missed it. In this case
could anyone point me to the discussion?

Hi Jan,

I don't think anyone has discussed the idea of tying the cache to an
individual TCP session yet.  I agree that since the cache is intended to
be used only by a single follower or client, it's an interesting thing
to think about.

I guess the obvious disadvantage is that whenever your TCP session
drops, you have to make a full fetch request rather than an incremental
one.  It's not clear to me how often this happens in practice -- it
probably depends a lot on the quality of the network.  From a code
perspective, it might also be a bit difficult to access data associated
with the Session from classes like KafkaApis (although we could refactor
it to make this easier).

It's also clear that even if we tie the cache to the session, we still
have to have limits on the number of caches we're willing to create.
And probably we should reserve some cache slots for each follower, so
that clients don't take all of them.

I'd rather see a protocol in which the client is hinting to the broker
that it is going to use the feature, instead of a client
realizing that the broker just offered the feature (regardless of
protocol version, which should only indicate that the feature
would be usable).

Hmm.  I'm not sure what you mean by "hinting."  I do think that the
server should have the option of not accepting incremental requests from
specific clients, in order to save memory space.


This seems to work better with 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-03 Thread Jan Filipiak


On 02.12.2017 23:34, Colin McCabe wrote:

On Thu, Nov 30, 2017, at 23:29, Jan Filipiak wrote:

Hi,

this discussion is going a little bit far from what I intended this
thread for.
I can see all of this being related.

To let you guys know, what I am currently thinking is the following:

I do think the handling of IDs and epochs is rather complicated. I think
the complexity comes from aiming for too much.

1. Currently all the work is towards making the FetchRequest
completely empty. This brings all sorts of pain, because the broker then
actually needs to know what it sent, even though it tries to use sendfile
as much as possible.
2. Currently all the work is also towards making empty fetch requests
across TCP sessions.

In this thread I aimed to relax our goals with regards to point 2.
Connection resets for us are really the exception, and I would argue that
trying to introduce complexity for sparing one full request on connection
reset is not worth it. Therefore I argued to keep the server-side
information with the session instead of somewhere global. It's not
going to bring in the results.

As the discussion unfolds I also want to challenge our approach for
point 1.
I do not see a reason to introduce complexity (and
especially on the fetch answer path). Did we consider that from the
client we just send the offsets we want to fetch, skip the topic/partition
description, and just use the order to match up the information
on the broker side again? This would also reduce the fetch sizes a lot
while skipping a ton of complexity.

Hi Jan,

We need to solve the problem of the fetch request taking
O(num_partitions) space and time to process.  A solution that keeps the
O(num_partitions) behavior, but improves it by a constant factor,
doesn't really solve the problem.  And omitting some partition
information, but leaving other partition information in place,
definitely falls in that category, wouldn't you agree?  Also, as others
have noted, if you omit the partition IDs, you run into a lot of
problems surrounding changes in the partition membership.

best,
Colin

Hi Colin,

I agree that a fetch request sending only offsets still grows with the 
number of partitions.
Processing time, I can only follow as far as parsing goes, but I don't 
see a difference in the
work a broker has to do for received offsets versus cached offsets.

Given we still have the 100,000-partition case, a fetch request as I 
suggest would safely
get below 1 MB. How much of an improvement this really is depends on 
your setup.


Say you have all of these in 1 topic: you are saving effectively maybe 
50% already.

As you increase topics, and depending on how long your topic names are, you get
extra savings.
In my playground cluster this is 160 topics, 10 partitions on average, 2 
brokers;
average topic name length is 54 and the replication factor is 2. This would 
result in
a saving of 5.5 bytes per topic-partition fetched. So from currently 21.5 
bytes per topic-partition
it would go down to basically 8, almost a 2/3 saving. On our production 
cluster, which has
a higher broker to replication factor ratio, the savings are bigger. The 
average number of replicated partitions per
topic there is ~3. This is roughly a 75% saving in fetch request 
size. For us,
since we have many slowly changing smaller topics, varint encoding of 
offsets would give another big boost
as many fit into 2-3 bytes.
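To spell the arithmetic out, roughly (the per-partition field breakdown --
partition id, fetch offset, max bytes -- is my own assumption for
illustration, not a measurement of the exact wire format):

public class FetchSizeEstimate {
    public static void main(String[] args) {
        double topicNameBytes = 54;          // average topic name length
        double partitionsPerTopic = 10;      // average partitions fetched per topic
        double amortizedTopicName = topicNameBytes / partitionsPerTopic;   // ~5.4 bytes
        double currentPerPartition = amortizedTopicName + 4 + 8 + 4;       // ~21.4 bytes
        double proposedPerPartition = 8;     // offset only; ordering carries the rest
        System.out.printf("current ~%.1f bytes, proposed ~%.1f bytes per partition%n",
                currentPerPartition, proposedPerPartition);
    }
}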


I do not quite understand what it means to omit partition IDs and 
change ownership. The partition ID can be retrieved
by ordinal position from the broker's cache. The broker serving the fetch 
request
should not care whether this consumer owns the partition in terms of its 
group membership. If the broker is no longer the leader
of the partition it can return "not leader for partition" as usual. 
Maybe you can point me to where this has been explained, as I
couldn't really find a place where it became clear to me.

I think a 75% saving and more is realistic, and even though it is linear in 
the number of partitions fetched, it is a very practical approach that fits
the design principle "the consumer decides" a lot better. I am still 
trying to fully understand how the plan is to update the offsets on the 
broker side. No need to explain that here as I think I know where to look it 
up; I guess that introduces a lot of complexity with sendfile and
an additional index lookup, and I have a hard time believing it will pay 
off, both in source code complexity and efficiency.


I intend to send you an answer on the other threads as soon as I get to 
it.  Hope this explains my view of

the size trade-off well enough. Would very much appreciate your opinion.

Best Jan




Hope these ideas are interesting

best Jan


On 01.12.2017 01:47, Becket Qin wrote:

Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems introducing some complexity. It would be good if
we don't have to maintain the epoch.
2. If all the partitions has data returned (even a few me

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-01 Thread Jan Filipiak

BTW:

the shuffle problem would exist in all our solutions. An empty fetch 
request has the same issue about
which order to serve the topics and partitions in. So my suggestion is not 
introducing this problem.


Best Jan

On 01.12.2017 08:29, Jan Filipiak wrote:

Hi,

this discussion is going a little bit far from what I intended this 
thread for.

I can see all of this beeing related.

To let you guys know what I am currently thinking is the following:

I do think the handling of Id's and epoch is rather complicated. I 
think the complexity

comes from aiming for to much.

1. Currently all the work is towards making fetchRequest
completely empty. This brings all sorts of pain with regards to the 
broker actually needs
to know what he send even though it tries to use sendfile as much as 
possible.
2. Currently all the work is towards also making empty fetch request 
across TCP sessions.


In this thread I aimed to relax our goals with regards to point 2. 
Connection resets for us
are really the exceptions and I would argue, trying to introduce 
complexity for sparing
1 full request on connection reset is not worth it. Therefore I argued 
to keep the Server
side information with the Session instead somewhere global. Its not 
going to bring in the

results.

As the discussion unvields I also want to challenge our approach for 
point 1.

I do not see a reason to introduce complexity (and
 especially on the fetch answer path). Did we consider that from the 
client we just send the offsets
we want to fetch and skip the topic partition description and just use 
the order to match the information
on the broker side again? This would also reduce the fetch sizes a lot 
while skipping a ton of complexity.


Hope these ideas are interesting

best Jan


On 01.12.2017 01:47, Becket Qin wrote:

Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems introducing some complexity. It would be 
good if

we don't have to maintain the epoch.
2. If all the partitions has data returned (even a few messages), the 
next
fetch would be equivalent to a full request. This means the clusters 
with
continuously small throughput may not save much from the incremental 
fetch.


I am wondering if we can avoid session epoch maintenance and address the
fetch efficiency in general with some modifications to the solution. Not
sure if the following would work, but just want to give my ideas.

To solve 1, the basic idea is to let the leader return the partition 
data

with its expected client's position for each partition. If the client
disagree with the leader's expectation, a full FetchRequest is then 
sent to

ask the leader to update the client's position.
To solve 2, when possible, we just let the leader to infer the clients
position instead of asking the clients to provide the position, so the
incremental fetch can be empty in most cases.

More specifically, the protocol will have the following change.
1. Add a new flag called FullFetch to the FetchRequest.
1) A full FetchRequest is the same as the current FetchRequest with
FullFetch=true.
2) An incremental FetchRequest is always empty with FullFetch=false.
2. Add a new field called ExpectedPosition(INT64) to each partition 
data in

the FetchResponse.

The leader logic:
1. The leader keeps a map from client-id (client-uuid) to the interested
partitions of that client. For each interested partition, the leader 
keeps

the client's position for that client.
2. When the leader receives a full fetch request (FullFetch=true), the
leader
 1) replaces the interested partitions for the client id with the
partitions in that full fetch request.
 2) updates the client position with the offset specified in that 
full

fetch request.
 3) if the client is a follower, update the high watermark, etc.
3. When the leader receives an incremental fetch request (typically 
empty),

the leader returns the data from all the interested partitions (if any)
according to the position in the interested partitions map.
4. In the FetchResponse, the leader will include an 
ExpectedFetchingOffset
that the leader thinks the client is fetching at. The value is the 
client
position of the partition in the interested partition map. This is 
just to
confirm with the client that the client position in the leader is 
correct.
5. After sending back the FetchResponse, the leader updates the 
position of

the client's interested partitions. (There may be some overhead for the
leader to know of offsets, but I think the trick of returning at index
entry boundary or log end will work efficiently).
6. The leader will expire the client interested partitions if the client
hasn't fetch for some time. And if an incremental request is received 
when
the map does not contain the client info, an error will be returned 
to the

client to ask for a FullFetch.

The clients logic:
1. Start with sending a full FetchRequest, including partitions and 
offsets.
2. When get a response, check

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-01 Thread Jan Filipiak

Hi,

Good catch about the rotation.
This is probably not too big a blocker. Plenty of ideas spring to my mind
of how this can be done. Maybe one can offer different algorithms here
(nothing, random shuffle, client sends a bitmask of what it wants to fetch 
first, broker logic... many more).


Thank you for considering my ideas. I am pretty convinced we don't need
to aim for the 100% empty fetch request across TCP sessions. Maybe my ideas
offer decent tradeoffs.

Best Jan





On 01.12.2017 08:43, Becket Qin wrote:

Hi Jan,

I agree that we probably don't want to make the protocol too complicated
just for exception cases.

The current FetchRequest contains an ordered list of partitions that may
rotate based on the priority. Therefore it is kind of difficult to do the
order matching. But you brought a good point about order, we may want to
migrate the rotation logic from the clients to the server. Not sure if this
will introduce some complexity to the broker. Intuitively it seems fine.
The logic would basically be similar to the draining logic in the
RecordAccumulator of the producer.

Thanks,

Jiangjie (Becket) Qin

On Thu, Nov 30, 2017 at 11:29 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi,

this discussion is going a little bit far from what I intended this thread
for.
I can see all of this beeing related.

To let you guys know what I am currently thinking is the following:

I do think the handling of Id's and epoch is rather complicated. I think
the complexity
comes from aiming for to much.

1. Currently all the work is towards making fetchRequest
completely empty. This brings all sorts of pain with regards to the broker
actually needs
to know what he send even though it tries to use sendfile as much as
possible.
2. Currently all the work is towards also making empty fetch request
across TCP sessions.

In this thread I aimed to relax our goals with regards to point 2.
Connection resets for us
are really the exceptions and I would argue, trying to introduce
complexity for sparing
1 full request on connection reset is not worth it. Therefore I argued to
keep the Server
side information with the Session instead somewhere global. Its not going
to bring in the
results.

As the discussion unvields I also want to challenge our approach for point
1.
I do not see a reason to introduce complexity (and
  especially on the fetch answer path). Did we consider that from the
client we just send the offsets
we want to fetch and skip the topic partition description and just use the
order to match the information
on the broker side again? This would also reduce the fetch sizes a lot
while skipping a ton of complexity.

Hope these ideas are interesting

best Jan



On 01.12.2017 01:47, Becket Qin wrote:


Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems introducing some complexity. It would be good
if
we don't have to maintain the epoch.
2. If all the partitions has data returned (even a few messages), the next
fetch would be equivalent to a full request. This means the clusters with
continuously small throughput may not save much from the incremental
fetch.

I am wondering if we can avoid session epoch maintenance and address the
fetch efficiency in general with some modifications to the solution. Not
sure if the following would work, but just want to give my ideas.

To solve 1, the basic idea is to let the leader return the partition data
with its expected client's position for each partition. If the client
disagree with the leader's expectation, a full FetchRequest is then sent
to
ask the leader to update the client's position.
To solve 2, when possible, we just let the leader to infer the clients
position instead of asking the clients to provide the position, so the
incremental fetch can be empty in most cases.

More specifically, the protocol will have the following change.
1. Add a new flag called FullFetch to the FetchRequest.
 1) A full FetchRequest is the same as the current FetchRequest with
FullFetch=true.
 2) An incremental FetchRequest is always empty with FullFetch=false.
2. Add a new field called ExpectedPosition(INT64) to each partition data
in
the FetchResponse.

The leader logic:
1. The leader keeps a map from client-id (client-uuid) to the interested
partitions of that client. For each interested partition, the leader keeps
the client's position for that client.
2. When the leader receives a full fetch request (FullFetch=true), the
leader
  1) replaces the interested partitions for the client id with the
partitions in that full fetch request.
  2) updates the client position with the offset specified in that full
fetch request.
  3) if the client is a follower, update the high watermark, etc.
3. When the leader receives an incremental fetch request (typically
empty),
the leader returns the data from all the interested partitions (if any)
according to the position in the interested partitions map.
4. In the FetchResponse, the 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-30 Thread Jan Filipiak
.

Yeah, that might help.

best,
Colin


Thanks,
Dong


On Wed, Nov 29, 2017 at 12:13 PM, Colin McCabe <cmcc...@apache.org>
wrote:


I updated the KIP with the ideas we've been discussing.

best,
Colin

On Tue, Nov 28, 2017, at 08:38, Colin McCabe wrote:

On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote:

Hi Colin, thank you for this KIP, it can become a really useful thing.

I just scanned through the discussion so far and wanted to start a
thread to make a decision about keeping the cache with the
Connection / Session or having some sort of UUID-indexed global Map.

Sorry if that has been settled already and I missed it. In this case
could anyone point me to the discussion?

Hi Jan,

I don't think anyone has discussed the idea of tying the cache to an
individual TCP session yet.  I agree that since the cache is intended to
be used only by a single follower or client, it's an interesting thing
to think about.

I guess the obvious disadvantage is that whenever your TCP session
drops, you have to make a full fetch request rather than an incremental
one.  It's not clear to me how often this happens in practice -- it
probably depends a lot on the quality of the network.  From a code
perspective, it might also be a bit difficult to access data associated
with the Session from classes like KafkaApis (although we could refactor
it to make this easier).

It's also clear that even if we tie the cache to the session, we still
have to have limits on the number of caches we're willing to create.
And probably we should reserve some cache slots for each follower, so
that clients don't take all of them.

I'd rather see a protocol in which the client is hinting to the broker
that it is going to use the feature, instead of a client
realizing that the broker just offered the feature (regardless of
protocol version, which should only indicate that the feature
would be usable).

Hmm.  I'm not sure what you mean by "hinting."  I do think that the
server should have the option of not accepting incremental requests from
specific clients, in order to save memory space.


This seems to work better with per-connection/session attached metadata
than with a Map, and could allow for easier client implementations.
It would also make client-side code easier, as there wouldn't be any
cache-miss error messages to handle.

It is nice not to have to handle cache-miss responses, I agree.
However, TCP sessions aren't exposed to most of our client-side code.
For example, when the Producer creates a message and hands it off to the
NetworkClient, the NC will transparently re-connect and re-send a
message if the first send failed.  The higher-level code will not be
informed about whether the TCP session was re-established, whether an
existing TCP session was used, and so on.  So overall I would still lean
towards not coupling this to the TCP session...

best,
Colin


   Thank you again for the KIP. And again, if this was clarified already,
please drop me a hint where I could read about it.

Best Jan





On 21.11.2017 22:02, Colin McCabe wrote:

Hi all,

I created a KIP to improve the scalability and latency of FetchRequest:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability

Please take a look.

cheers,
Colin




Re: [DISCUSS]: KIP-159: Introducing Rich functions to Streams

2017-11-29 Thread Jan Filipiak

Hi,

thank you for the summary and thanks for acknowledging that I do have a 
point here.


I don't like the second idea at all; hence I started this discussion.

I am just disappointed. Back when we had the discussion about how to
refactor the store overloads and IQ handling, I knew the path we were taking
was wrong. Having problems implementing these kinds of features (which are
really simple) is just a symptom of a messed-up IQ implementation. I really
wish I could have convinced you back then. To be honest, with IQ we can
continue here, as we materialize but would not send the oldValue; but with
join you're out of luck with the current setup.


I of course recommend not introducing any optimizations here. I'd recommend
going towards what I recommended back then. So I wouldn't say we need to
optimize anything later; we need to build the topology better in the first
place.




On 28.11.2017 21:00, Guozhang Wang wrote:

Jan,

Thanks for your input. I can understand now that the oldValue is also
exposed in the user-customized `filter` function, and hence what record
context we should expose is a problem. I think it does bring up a good point
to consider for KIP-159. The discussion may be a bit confusing to readers,
though, so I'd like to summarize the status quo and make a proposal:

In today's Streams DSL, when a KTable is created either from a source
topic or from a stateful operator, we will materialize the KTable with a
backing state store; on the other hand, KTables created from a non-stateful
operator like filter will not be backed by a state store by default, unless
users indicate so (e.g. using the overloaded function with the queryable
name or store supplier).

For example:

KTable table1 = builder.table("topic");            // a state store created for table1
KTable table2 = table1.filter(..);                 // no state store created for table2
KTable table3 = table1.filter(.., "storeName");    // a state store created for table3
KTable table4 = table1.groupBy(..).aggregate(..);  // a state store created for table4

Because of that, the filter() operator above on table1 will always be
exposed to oldValue and newValue; Damian's point is that we may optimize
the first case such that table1 is only materialized if users ask for it
(e.g. using the overloaded function with a store supplier), in which
case we do not need to pass newValue / oldValue pairs (I think this is
what Jan suggests as well, i.e. do the filtering before materializing, so
that we can have a smaller backing state store as well). But this optimization
does not eliminate the possibility that we may still need to do the filtering
if users do specify "yes, I do want the source KTable itself to be
materialized, please". So the concern about how to expose the record
context in such cases still persists.
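
For readers less familiar with the internals, the following is a simplified
sketch, loosely modeled on the internal KTableFilter processor, of why the
old value reaches the filter at all. The class names and logic here are
simplified illustrations, not the actual implementation.

// The KTable change stream carries <newValue, oldValue> pairs, and the
// predicate is evaluated on both sides before forwarding downstream.
class Change<V> {
    final V newValue;
    final V oldValue;
    Change(V newValue, V oldValue) { this.newValue = newValue; this.oldValue = oldValue; }
}

class SimplifiedKTableFilter<K, V> {
    interface Predicate<K, V> { boolean test(K key, V value); }

    private final Predicate<K, V> predicate;
    private final boolean sendOldValues;

    SimplifiedKTableFilter(Predicate<K, V> predicate, boolean sendOldValues) {
        this.predicate = predicate;
        this.sendOldValues = sendOldValues;
    }

    // The filter sees the old value as well, which is exactly the record-context
    // concern above: oldValue belongs to an earlier record than the current context.
    Change<V> process(K key, Change<V> change) {
        V newVal = (change.newValue != null && predicate.test(key, change.newValue))
                ? change.newValue : null;
        V oldVal = (sendOldValues && change.oldValue != null && predicate.test(key, change.oldValue))
                ? change.oldValue : null;
        if (newVal == null && oldVal == null) {
            return null; // nothing to forward downstream
        }
        return new Change<>(newVal, oldVal);
    }
}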


With that, regarding to KIP-159 itself, here are my thoughts:

1) if we restrict the scope of exposing the record context only to source
KTables / KStreams, I feel the KIP itself does not bring much value given
its required API changes, because only the SourceKStream can safely maintain
its record context; and for a SourceKTable, if it is materialized, then even
non-stateful operators like Join may still have a concern about exposing
the record context.

2) an alternative idea is that we provide the semantics of how the record
context is inherited across the operators for KTable / KStream and expose it
in all operators (similarly, in the PAPI we would expose a much simpler
contract), and make it a public contract that the Streams library will
guarantee moving forward even as we optimize our topology builder; it may not
align perfectly with the linear algebraic semantics but is practically
applicable for most cases; if users' semantics do not fit the provided
contract, then they may need to handle this themselves (embed such information
in the value payload, for example).

If people do not like the second idea, I'd suggest we hold off on pursuing the
first direction, since to me its beneficial scope is too limited compared to
its cost.


Guozhang



On Fri, Nov 24, 2017 at 1:39 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Clearly we show the oldValue to the user. We have to, because we filter
after the store:
https://github.com/axbaretto/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableFilter.java#L96

I cannot help you follow this further. It is really obvious and I am running
out of tools for explaining.

Thanks for understanding my point about putting the filter first. Not only
would it make the store smaller, it would make this feature reasonably
possible and the framework easier. Interestingly, it would also help move IQ
in a more reasonable direction. And it might help us understand that we do not
need any intermediate representation of the topology,

KIP-182 I have no clue what everyone has with their "bytestores" so
broken. But

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-27 Thread Jan Filipiak

Hi Colin, thank you  for this KIP, it can become a really useful thing.

I just scanned through the discussion so far and wanted to start a
thread to make a decision about keeping the cache with the
Connection / Session or having some sort of UUID-indexed global Map.


Sorry if that has been settled already and I missed it. In this case,
could anyone point me to the discussion?


I'd rather see a protocol in which the client hints to the broker that
it is going to use the feature, instead of the client realizing that the
broker just offered the feature (regardless of protocol version, which
should only indicate that the feature would be usable). This seems to work
better with per-connection/session attached metadata than with a Map, and
could allow for easier client implementations.
It would also make client-side code easier, as there wouldn't be any
cache-miss error messages to handle.


 Thank you again for the KIP. And again, if this was clarified already 
please drop me a hint where I could read about it.


Best Jan





On 21.11.2017 22:02, Colin McCabe wrote:

Hi all,

I created a KIP to improve the scalability and latency of FetchRequest:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability

Please take a look.

cheers,
Colin




Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-11-25 Thread Jan Filipiak

Hi Matthias,

Two things that pop into my mind on a Sunday morning. Can we provide a
KTableValueGetter when the key in the store is different from the key
forwarded?

1. we would need a backwards mapper
2. I am not sure if we can pull it off without said fourth generic type in
KTable (which I am in favour of, btw)


+ It won't solve people's problem of having CombinedKey on the wire and not
being able to inspect the topic with, say, their default tools.
 - I'd rather introduce KTable::mapKeys() or something (a 4th generic in
KTable?) than overloading. It is better SoC-wise.


I am thinking more of an overload where we replace the CombinedKey
Serde. So people can use a default CombinedKey Serde,
but could provide their own implementation that would internally use K0 for
serialisation and deserialisation. One could implement
a ##prefix() into this call to make it explicit that we only want the
prefix rendered. This would take CombinedKey logic out of publicly visible
data. A stock CombinedKey Serde used by default could also
handle the JSON users correctly.


Users would still get CombinedKey back. The downside of getting these
nested deeply is probably mitigated by users doing a group-by
in the very next step to get rid of A's key again.

That is what I was able to come up with so far.
Let me know what you think.




On 22.11.2017 00:14, Matthias J. Sax wrote:

Jan,

Thanks for explaining the Serde issue! This makes a lot of sense.

I discussed with Guozhang about this issue and came up with the
following idea that bridges both APIs:

We still introduce CombinedKey as a public interface and exploit it to
manage the key in the store and the changelog topic. For this case we
can construct a suitable Serde internally based on the Serdes of both
keys that are combined.

However, the type of the result table is user defined and can be
anything. To bridge between the CombinedKey and the user defined result
type, users need to hand in a `ValueMapper<CombinedKey, KO>` that
convert the CombinedKey into the desired result type.

Thus, the method signature would be something like


<KO, VO, K1, V1> KTable<KO, VO> oneToManyJoin(
    KTable<K1, V1> other,
    ValueMapper<V1, K> keyExtractor,
    ValueJoiner<V, V1, VO> joiner,
    ValueMapper<CombinedKey<K, K1>, KO> resultKeyMapper);

The interface parameters are still easy to understand and don't leak
implementation details IMHO.
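
For illustration only, a call against that proposed signature might look
roughly like the following. The oneToManyJoin() method and CombinedKey are
only proposed in this KIP, and the topics, value formats, and the toString()
flattening of the combined key are all invented for the example.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

// Hypothetical usage sketch of the proposed API.
StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> customers = builder.table("customers"); // key: customerId
KTable<String, String> orders = builder.table("orders");       // key: orderId, value: "customerId|amount"

KTable<String, String> ordersWithCustomer = customers.oneToManyJoin(
    orders,
    orderValue -> orderValue.split("\\|")[0],                            // keyExtractor: customerId from the order value
    (customerValue, orderValue) -> customerValue + " -> " + orderValue,  // joiner
    combinedKey -> combinedKey.toString());                              // resultKeyMapper: flatten CombinedKey to the result key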

WDYT about this idea?


-Matthias


On 11/19/17 11:28 AM, Guozhang Wang wrote:

Hello Jan,

I think I get your point about the cumbersomeness that CombinedKey would
introduce for serialization and for tooling based on serdes. What I'm still
wondering about is the underlying joinPrefixFaker mapper: from your latest
comment it seems this mapper will be a one-time mapper: we use it to map
the original resulting KTable<CombinedKey<K1, K2>, V0> to KTable<K0, V0>, and
then that mapper can be thrown away and forgotten. Is that true? My
original thought was that you proposed to carry this mapper all the way along
the rest of the topology to "abstract" the underlying combined keys.

If it is the other way (i.e. the former approach), then the diagram of
these two approaches would be different: for the less intrusive approach we
would add one more step in this diagram to always do a mapping after the
"task perform join" block.

Another minor comment on the internal topic: I think many readers may
not get the schema of this topic, so it is better to indicate what the key
of this internal topic used for compaction would be, and what would
be used as the partition key.

Guozhang


On Sat, Nov 18, 2017 at 2:30 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


-> it think the relationships between the different used types, K0,K1,KO
should be explains explicitly (all information is there implicitly, but
one need to think hard to figure it out)


I'm probably blind for this. can you help me here? how would you formulate
this?

Thanks,

Jan


On 16.11.2017 23:18, Matthias J. Sax wrote:


Hi,

I am just catching up on this discussion and did re-read the KIP and
discussion thread.

In contrast to you, I prefer the second approach with CombinedKey as
return type for the following reasons:

   1) the oneToManyJoin() method had less parameter
   2) those parameters are easy to understand
   3) we hide implementation details (joinPrefixFaker, leftKeyExtractor,
and the return type KO leaks internal implementation details from my
point of view)
   4) user can get their own KO type by extending CombinedKey interface
(this would also address the nesting issue Trevor pointed out)

That's unclear to me is, why you care about JSON serdes? What is the
problem with regard to prefix? It seems I am missing something here.

I also don't understand the argument about "the user can stick with his
default serde or his standard way of serializing"? If we have
`CombinedKey` a

Re: [DISCUSS]: KIP-159: Introducing Rich functions to Streams

2017-11-24 Thread Jan Filipiak
Clearly we show the oldValue to the user. We have to, because we filter
after the store:

https://github.com/axbaretto/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableFilter.java#L96

I cannot help you follow this further. It is really obvious and I am running
out of tools for explaining.


Thanks for understanding my point about putting the filter first. Not only
would it make the store smaller, it would make this feature reasonably
possible and the framework easier. Interestingly, it would also help move
IQ in a more reasonable direction. And it might help us understand that we
do not need any intermediate representation of the topology.


Regarding KIP-182: I have no clue what everyone's deal is with their
"bytestores"; it seems so broken. But putting another store after doesn't
help when the store before is the problem.
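
One way to approximate the "filter before the store" layout with the public
DSL of that time is to consume the topic as a stream, filter it, and only
then build a table, so that only the surviving records are materialized.
The following is a rough sketch under that assumption (topic and store names
and the predicate are invented); it is not what the DSL does internally for
builder.table().filter(), and it also drops tombstones, which plain table
semantics would not.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class FilterBeforeMaterialize {
    // Drop ~99% of the records up front, then materialize only the survivors.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> filtered = builder
            .<String, String>stream("input-topic")
            .filter((key, value) -> value != null && value.startsWith("keep"))
            .groupByKey()
            .reduce((oldValue, newValue) -> newValue,      // keep the latest value per key
                    Materialized.as("filtered-store"));    // small store: only filtered records
        return builder.build();
    }
}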





On 24.11.2017 05:08, Matthias J. Sax wrote:

 From a DSL point of view, users only see the new value on a
KTable#filter anyway. So why should it be an issue that we use a
<newValue, oldValue> pair under the hood?

The user sees the newValue and gets the corresponding RecordContext. I can't
see any issue here?

I cannot follow here:


Even when we have a statefull operation last. We move it to the very
first processor (KtableSource)
and therefore cant present a proper RecordContext.



With regard to `builder.table().filter()`:

I see your point that it would be good to be able to apply the filter()
first to reduce the state store size of the table. But how is this
related to KIP-159?

Btw: with KIP-182, I am wondering if this would not be possible by
putting a custom dummy store into the table and materializing the filter
result afterwards? It's not a nice way to do it, but it seems to be possible.


-Matthias

On 11/23/17 4:56 AM, Jan Filipiak wrote:

The comment is valid. It falls exactly into this topic, it has exactly
todo with this!
Even when we have a statefull operation last. We move it to the very
first processor (KtableSource)
and therefore cant present a proper RecordContext.

Regarding the other Jiras you are referring to. They harm the project
more than they do good!
There is no need for this kind of optimizer and meta representation and
what not. I hope they
never get implemented.

Best Jan


On 22.11.2017 14:44, Damian Guy wrote:

Jan, i think you comment with respect to filtering is valid, though
not for
this KIP. We have separate JIRAs for topology optimization of which this
falls into.

Thanks,
Damian

On Wed, 22 Nov 2017 at 02:25 Guozhang Wang <wangg...@gmail.com> wrote:


Jan,

Not sure I understand your argument that "we still going to present
change.oldValue to the filter even though the record context() is for
change.newValue". Are you referring to `KTableFilter#process()`? If yes
could you point to me which LOC are you concerning about?


Guozhang


On Mon, Nov 20, 2017 at 9:29 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


a remark of mine that got missed during migration:

There is this problem that even though we have source.table.filter.join
the state-fullness happens at the table step not a the join step. In a
filter
we still going to present change.oldValue to the filter even though the
record context() is for change.newValue. I would go as far as applying
the filter before the table processor. Not to just get KIP-159, but

because

I think its a side effect of a non ideal topology layout. If i can
filter
99% of my
records. my state could be way smaller. Also widely escalates the
context
of the KIP

I can only see upsides of executing the filter first.

Best Jan



On 20.11.2017 22:22, Matthias J. Sax wrote:


I am moving this back to the DISCUSS thread... Last 10 emails were
sent
to VOTE thread.

Copying Guozhang's last summary below. Thanks for this summary. Very
comprehensive!

It seems we all agree that the current implementation of the context
at the PAPI level is ok, but we should not leak it into the DSL.

Thus, we can go with (2) or (3), where (3) is an extension to (2)
carrying the context to more operators than just sources. It also seems
that we all agree that many-to-one operations void the context.

I still think that just going with plain (2) is too restrictive -- but
I am also fine if we don't go with the full proposal of (3).

Also note that the two operators filter() and filterNot() don't modify
the record, and thus for both it would be absolutely valid to keep the
context.

I personally would keep the context for at least all one-to-one
operators. One-to-many is debatable and I am fine with not carrying the
context further: at least the offset information is questionable for
this case -- note though that, semantically, the timestamp is inherited
via one-to-many, and I also think this applies to "topic" and
"partition". Thus, I think it's still valuable information we can carry
downstream.


-Matthias
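
As a purely hypothetical illustration of what carrying the context through
one-to-one operators could look like for users: KIP-159's rich functions were
only a proposal, so the interface shapes below are invented for this sketch
and are not an existing Kafka Streams API.

// Hypothetical sketch only.
interface RecordContext {
    String topic();
    int partition();
    long offset();
    long timestamp();
}

interface RichValueMapper<V, VR> {
    VR apply(V value, RecordContext context);
}

class RichFunctionExample {
    // A rich mapper that tags each value with where it came from. If the context
    // is carried through one-to-one operators, this stays meaningful after
    // filter() or mapValues(); after a many-to-one operator it would not.
    static final RichValueMapper<String, String> TAG_WITH_ORIGIN =
        (value, ctx) -> ctx.topic() + "/" + ctx.partition() + "@" + ctx.offset() + ": " + value;
}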

Jan: which approach are you referring to as "the approach that is
on the

table wou

Re: [DISCUSS]: KIP-159: Introducing Rich functions to Streams

2017-11-23 Thread Jan Filipiak
The comment is valid. It falls exactly into this topic; it has everything
to do with this!
Even when we have a stateful operation last, we move it to the very
first processor (KTableSource)
and therefore can't present a proper RecordContext.

Regarding the other JIRAs you are referring to: they harm the project
more than they do good!
There is no need for this kind of optimizer and meta representation and
whatnot. I hope they
never get implemented.

Best Jan


On 22.11.2017 14:44, Damian Guy wrote:

Jan, i think you comment with respect to filtering is valid, though not for
this KIP. We have separate JIRAs for topology optimization of which this
falls into.

Thanks,
Damian

On Wed, 22 Nov 2017 at 02:25 Guozhang Wang <wangg...@gmail.com> wrote:


Jan,

Not sure I understand your argument that "we still going to present
change.oldValue to the filter even though the record context() is for
change.newValue". Are you referring to `KTableFilter#process()`? If yes
could you point to me which LOC are you concerning about?


Guozhang


On Mon, Nov 20, 2017 at 9:29 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


a remark of mine that got missed during migration:

There is this problem that even though we have source.table.filter.join
the state-fullness happens at the table step not a the join step. In a
filter
we still going to present change.oldValue to the filter even though the
record context() is for change.newValue. I would go as far as applying
the filter before the table processor. Not to just get KIP-159, but

because

I think its a side effect of a non ideal topology layout. If i can filter
99% of my
records. my state could be way smaller. Also widely escalates the context
of the KIP

I can only see upsides of executing the filter first.

Best Jan



On 20.11.2017 22:22, Matthias J. Sax wrote:


I am moving this back to the DISCUSS thread... Last 10 emails were sent
to VOTE thread.

Copying Guozhang's last summary below. Thanks for this summary. Very
comprehensive!

It seems, we all agree, that the current implementation of the context
at PAPI level is ok, but we should not leak it into DSL.

Thus, we can go with (2) or (3), were (3) is an extension to (2)
carrying the context to more operators than just sources. It also seems,
that we all agree, that many-to-one operations void the context.

I still think, that just going with plain (2) is too restrictive -- but
I am also fine if we don't go with the full proposal of (3).

Also note, that the two operators filter() and filterNot() don't modify
the record and thus for both, it would be absolutely valid to keep the
context.

I personally would keep the context for at least all one-to-one
operators. One-to-many is debatable and I am fine to not carry the
context further: at least the offset information is questionable for
this case -- note thought, that semantically, the timestamp is inherited
via one-to-many, and I also think this applies to "topic" and
"partition". Thus, I think it's still valuable information we can carry
downstreams.


-Matthias

Jan: which approach are you referring to as "the approach that is on the

table would be perfect"?

Note that in today's PAPI layer we are already effectively exposing the
record context which has the issues that we have been discussing right now,
and its semantics is always referring to the "processing record" at hand.

More specifically, we can think of processing a record a bit different:

1) the record traversed the topology from source to sink, it may be
transformed into new object or even generate multiple new objects (think:
branch) along the traversal. And the record context is referring to this
processing record. Here the "lifetime" of the record lasts for the entire
topology traversal and any new records of this traversal is treated as
different transformed values of this record (this applies to join and
aggregations as well).

2) the record being processed is wiped out in the first operator after the
source, and NEW records are forwarded to downstream operators. I.e. each
record only lives between two adjacent operators, once it reached the new
operator it's lifetime has ended and new records are generated.

I think in the past we have talked about Streams under both context, and we
do not have a clear agreement. I agree that 2) is logically more
understandable for users as it does not leak any internal implementation
details (e.g. for stream-table joins, table record's traversal ends at the
join operator as it is only be materialized, while stream record's
traversal goes through the join operator to further down until sinks).
However if we are going to interpret following 2) above then even for
non-stateful operators we would not inherit record context. What we're
discussing now, seems to infer a third semantics:

3) a record would traverse "through" one-to-one (non-stateful) opera

Re: [DISCUSS]: KIP-159: Introducing Rich functions to Streams

2017-11-20 Thread Jan Filipiak

a remark of mine that got missed during migration:

There is this problem that even though we have source.table.filter.join,
the statefulness happens at the table step, not at the join step. In a
filter,
we are still going to present change.oldValue to the filter even though the
record context() is for change.newValue. I would go as far as applying
the filter before the table processor. Not just to get KIP-159, but because
I think it's a side effect of a non-ideal topology layout. If I can
filter 99% of my
records, my state could be way smaller. It also widely escalates the context
of the KIP.

I can only see upsides to executing the filter first.

Best Jan



On 20.11.2017 22:22, Matthias J. Sax wrote:

I am moving this back to the DISCUSS thread... Last 10 emails were sent
to VOTE thread.

Copying Guozhang's last summary below. Thanks for this summary. Very
comprehensive!

It seems, we all agree, that the current implementation of the context
at PAPI level is ok, but we should not leak it into DSL.

Thus, we can go with (2) or (3), were (3) is an extension to (2)
carrying the context to more operators than just sources. It also seems,
that we all agree, that many-to-one operations void the context.

I still think, that just going with plain (2) is too restrictive -- but
I am also fine if we don't go with the full proposal of (3).

Also note, that the two operators filter() and filterNot() don't modify
the record and thus for both, it would be absolutely valid to keep the
context.

I personally would keep the context for at least all one-to-one
operators. One-to-many is debatable and I am fine to not carry the
context further: at least the offset information is questionable for
this case -- note thought, that semantically, the timestamp is inherited
via one-to-many, and I also think this applies to "topic" and
"partition". Thus, I think it's still valuable information we can carry
downstreams.


-Matthias


Jan: which approach are you referring to as "the approach that is on the
table would be perfect"?

Note that in today's PAPI layer we are already effectively exposing the
record context which has the issues that we have been discussing right now,
and its semantics is always referring to the "processing record" at hand.
More specifically, we can think of processing a record a bit different:

1) the record traversed the topology from source to sink, it may be
transformed into new object or even generate multiple new objects (think:
branch) along the traversal. And the record context is referring to this
processing record. Here the "lifetime" of the record lasts for the entire
topology traversal and any new records of this traversal is treated as
different transformed values of this record (this applies to join and
aggregations as well).

2) the record being processed is wiped out in the first operator after the
source, and NEW records are forwarded to downstream operators. I.e. each
record only lives between two adjacent operators, once it reached the new
operator it's lifetime has ended and new records are generated.

I think in the past we have talked about Streams under both context, and we
do not have a clear agreement. I agree that 2) is logically more
understandable for users as it does not leak any internal implementation
details (e.g. for stream-table joins, table record's traversal ends at the
join operator as it is only be materialized, while stream record's
traversal goes through the join operator to further down until sinks).
However if we are going to interpret following 2) above then even for
non-stateful operators we would not inherit record context. What we're
discussing now, seems to infer a third semantics:

3) a record would traverse "through" one-to-one (non-stateful) operators,
will "replicate" at one-to-many (non-stateful) operators (think: "mapValues"
  ) and will "end" at many-to-one (stateful) operators where NEW records
will be generated and forwarded to the downstream operators.

Just wanted to lay the ground for discussions so we are all on the same
page before chatting more.


Guozhang



On 11/6/17 1:41 PM, Jeyhun Karimov wrote:

Hi Matthias,

Thanks a lot for correcting. It is a leftover from the past designs when
punctuate() was not deprecated.
I corrected.

Cheers,
Jeyhun

On Mon, Nov 6, 2017 at 5:30 PM Matthias J. Sax 
wrote:


I just re-read the KIP.

One minor comment: we don't need to introduce any deprecated methods.
Thus, RichValueTransformer#punctuate can be removed completely instead
of introducing it as deprecated.

Otherwise looks good to me.

Thanks for being so patient!


-Matthias

On 11/1/17 9:16 PM, Guozhang Wang wrote:

Jeyhun,

I think I'm convinced to not do KAFKA-3907 in this KIP. We should think
carefully if we should add this functionality to the DSL layer moving
forward since from what we discovered working on it the conclusion is

that

it would require revamping the public APIs quite a lot, and it's not

clear

if it is a good 

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-20 Thread Jan Filipiak



On 19.11.2017 21:12, Guozhang Wang wrote:

Jan: which approach are you referring to as "the approach that is on the
table would be perfect"?

The SourceKStream/Table suggestion.


Note that in today's PAPI layer we are already effectively exposing the
record context which has the issues that we have been discussing right now,
and its semantics is always referring to the "processing record" at hand.
More specifically, we can think of processing a record a bit different:

1) the record traversed the topology from source to sink, it may be
transformed into new object or even generate multiple new objects (think:
branch) along the traversal. And the record context is referring to this
processing record. Here the "lifetime" of the record lasts for the entire
topology traversal and any new records of this traversal is treated as
different transformed values of this record (this applies to join and
aggregations as well).

2) the record being processed is wiped out in the first operator after the
source, and NEW records are forwarded to downstream operators. I.e. each
record only lives between two adjacent operators, once it reached the new
operator it's lifetime has ended and new records are generated.

I think in the past we have talked about Streams under both context, and we
do not have a clear agreement. I agree that 2) is logically more
understandable for users as it does not leak any internal implementation
details (e.g. for stream-table joins, table record's traversal ends at the
join operator as it is only be materialized, while stream record's
traversal goes through the join operator to further down until sinks).
However if we are going to interpret following 2) above then even for
non-stateful operators we would not inherit record context. What we're
discussing now, seems to infer a third semantics:

3) a record would traverse "through" one-to-one (non-stateful) operators,
will "replicate" at one-to-many (non-stateful) operators (think: "mapValues"
  ) and will "end" at many-to-one (stateful) operators where NEW records
will be generated and forwarded to the downstream operators.

There is this problem that even though we have source.table.filter.join
the state-fullness happens at the table step not a the join step. In a 
filter

we still going to present change.oldValue to the filter even though the
record context() is for change.newValue. I would go as far as applying
the filter before the table processor. Not to just get KIP-159, but because
I think its a side effect of a non ideal topology layout. If i can 
filter 99% of my

records. my state could be way smaller. Also widely escalates the context
of the KIP


Just wanted to lay the ground for discussions so we are all on the same
page before chatting more.


Guozhang


On Sat, Nov 18, 2017 at 3:10 AM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


Hi,

  not an issue at all. IMO
the approach that is on the table would be perfect


On 18.11.2017 10:58, Jeyhun Karimov wrote:


Hi,

I did not expected that Context will be this much an issue. Instead of
applying different semantics for different operators, I think we should
remove this feature completely.


Cheers,
Jeyhun
On Sat 18. Nov 2017 at 07:49, Jan Filipiak <jan.filip...@trivago.com>
wrote:

Yes, the mail said only join so I wanted to clarify.



On 17.11.2017 19:05, Matthias J. Sax wrote:


Yes. But I think an aggregation is a many-to-one operation, too.

For the stripping-off part: internally, we can just keep some record
context, but just not allow users to access it (because the context
does not make sense for them) by hiding the corresponding APIs.


-Matthias

On 11/16/17 10:05 PM, Guozhang Wang wrote:


Matthias,

For this idea, are your proposing that for any many-to-one mapping
operations (for now only Join operators), we will strip off the record
context in the resulted records and claim "we cannot infer its traced
context anymore"?


Guozhang


On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <
matth...@confluent.io
wrote:

Any thoughts about my latest proposal?

-Matthias

On 11/10/17 10:02 PM, Jan Filipiak wrote:


Hi,

i think this is the better way. Naming is always tricky Source is


kinda

taken

I had TopicBackedK[Source|Table] in mind
but for the user its way better already IMHO

Thank you for reconsideration

Best Jan


On 10.11.2017 22:48, Matthias J. Sax wrote:


I was thinking about the source stream/table idea once more and it seems
it would not be too hard to implement:

We add two new classes

  SourceKStream extends KStream

and

  SourceKTable extends KTable

and return both from StreamsBuilder#stream and StreamsBuilder#table

As both are sub-classes, this change is backward compatible. We change
the return type for any single-record transform to these new types, too,
and use KStream/KTable as return type for any multi-record operation.

Th

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-11-18 Thread Jan Filipiak

-> it think the relationships between the different used types, K0,K1,KO
should be explains explicitly (all information is there implicitly, but
one need to think hard to figure it out)


I'm probably blind for this. can you help me here? how would you 
formulate this?


Thanks,

Jan


On 16.11.2017 23:18, Matthias J. Sax wrote:

Hi,

I am just catching up on this discussion and did re-read the KIP and
discussion thread.

In contrast to you, I prefer the second approach with CombinedKey as
return type for the following reasons:

  1) the oneToManyJoin() method had less parameter
  2) those parameters are easy to understand
  3) we hide implementation details (joinPrefixFaker, leftKeyExtractor,
and the return type KO leaks internal implementation details from my
point of view)
  4) user can get their own KO type by extending CombinedKey interface
(this would also address the nesting issue Trevor pointed out)

That's unclear to me is, why you care about JSON serdes? What is the
problem with regard to prefix? It seems I am missing something here.

I also don't understand the argument about "the user can stick with his
default serde or his standard way of serializing"? If we have
`CombinedKey` as output, the use just provide the serdes for both input
combined-key types individually, and we can reuse both internally to do
the rest. This seems to be a way simpler API. With the KO output type
approach, users need to write an entirely new serde for KO in contrast.

Finally, @Jan, there are still some open comments you did not address
and the KIP wiki page needs some updates. Would be great if you could do
this.

Can you also explicitly describe the data layout of the store that is
used to do the range scans?

Additionally:

-> some arrows in the algorithm diagram are missing
-> was are those XXX in the diagram
-> can you finish the "Step by Step" example
-> it think the relationships between the different used types, K0,K1,KO
should be explains explicitly (all information is there implicitly, but
one need to think hard to figure it out)


Last but not least:


But noone is really interested.

Don't understand this statement...



-Matthias


On 11/16/17 9:05 AM, Jan Filipiak wrote:

We are running this perfectly fine. For us, the smaller table changes
rather infrequently, say only a few times per day. The performance cost of
the flush is way lower than the computing power you need to bring to the
table to account for all the records being emitted after the one single
update.

On 16.11.2017 18:02, Trevor Huey wrote:

Ah, I think I see the problem now. Thanks for the explanation. That is
tricky. As you said, it seems the easiest solution would just be to
flush the cache. I wonder how big of a performance hit that'd be...

On Thu, Nov 16, 2017 at 9:07 AM Jan Filipiak <jan.filip...@trivago.com
<mailto:jan.filip...@trivago.com>> wrote:

 Hi Trevor,

 I am leaning towards the less intrusive approach myself. Infact
 that is how we implemented our Internal API for this and how we
 run it in production.
 getting more voices towards this solution makes me really happy.
 The reason its a problem for Prefix and not for Range is the
 following. Imagine the intrusive approach. They key of the RockDB
 would be CombinedKey<A,B> and the prefix scan would take an A, and
 the range scan would take an CombinedKey<A,B> still. As you can
 see with the intrusive approach the keys are actually different
 types for different queries. With the less intrusive apporach we
 use the same type and rely on Serde Invariances. For us this works
 nice (protobuf) might bite some JSON users.

 Hope it makes it clear

 Best Jan


 On 16.11.2017 16:39, Trevor Huey wrote:

 1. Going over KIP-213, I am leaning toward the "less intrusive"
 approach. In my use case, I am planning on performing a sequence
 of several oneToMany joins, From my understanding, the more
 intrusive approach would result in several nested levels of
 CombinedKey's. For example, consider Tables A, B, C, D with
 corresponding keys KA, KB, KC. Joining A and B would produce
 CombinedKey<KA, KB>. Then joining that result on C would produce
 CombinedKey<KC, CombinedKey<KA, KB>>. My "keyOtherSerde" in this
 case would need to be capable of deserializing CombinedKey<KA,
 KB>. This would just get worse the more tables I join. I realize
 that it's easier to shoot yourself in the foot with the less
 intrusive approach, but as you said, " the user can stick with
 his default serde or his standard way of serializing". In the
 simplest case where the keys are just strings, they can do simple
 string concatenation and Serdes.String(). It also allows the user
 to create and use their own version of CombinedKey if they feel
 so inclined.

 2
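
To spell out the nesting Trevor describes, here is a toy type-level sketch.
CombinedKey's real shape is defined by the KIP; the KA/KB/KC types and the
constructor here are placeholders invented for illustration.

// Toy sketch only.
class CombinedKey<L, R> {
    final L left;
    final R right;
    CombinedKey(L left, R right) { this.left = left; this.right = right; }
}

class KA {}
class KB {}
class KC {}

class NestingExample {
    static void example() {
        // After joining A with B, the result key is a CombinedKey<KA, KB> ...
        CombinedKey<KA, KB> abKey = new CombinedKey<>(new KA(), new KB());
        // ... and joining that result with C nests one level deeper, so the serde
        // for the outer key must also understand the inner CombinedKey<KA, KB>.
        CombinedKey<KC, CombinedKey<KA, KB>> abcKey = new CombinedKey<>(new KC(), abKey);
    }
}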

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-11-18 Thread Jan Filipiak


On 17.11.2017 06:59, Guozhang Wang wrote:

Thanks for the explanation Jan. On top of my head I'm leaning towards the
"more intrusive" approach to resolve the race condition issue we discussed
above. Matthias has some arguments for this approach already, so I would
not re-iterate them here. To me I find the "ValueMapper<K, KO>
joinPrefixFaker" is actually leaking the same amount of internal
implementation details information as the more intrusive approach, but in a
less clear way. So I'd rather just clarify to users than trying to abstract
in an awkward way.

Again: the benefits of __not__ introducing a new "Wrapper" type are huge!
We spent a lot of effort to get rid of Changes<> in our topics. We also
will not want CombinedKeys.
Let me make one suggestion: let's keep thinking about how to make this more
precise without introducing new Kafka-Streams-only types.
As you can see, currently the vote is 2 / 2. People that use Kafka Streams
like the less intrusive approach; people that develop it like the more
intrusive one.

The prettiest thing might not be the thing that gives the most bang for the
buck out there.

Best Jan


Also I'm not clear what do you mean by "CombinedKey would require an
additional mapping to what the less intrusive method has". If you meant
that users are enforced to provide a new serde for this combo key, could
that be avoided with the library automatically generate a serde for it
until the user changed this key later in the topology (e.g. via a map()
function) in which they can "flatten" this combo key into a flat key.

*@Trevor: *for your case for concatenating multiple joins, I think a better
way is to call `oneToManyJoin().map().oneToManyJoin().map()...` than
specifying a sequence of joinPrefixFakers as they will also be chained up
together (remember we have to keep this object along the rest of the
topology) which will make serde even harder?

Hi,

that was the map I was talking about. Last time I checked, KTable only
had 3 generic types.
For this, I think it would require 4 types: KeyIn, KeyOut, ValueIn, ValueOut.
I have been very much in favour of adding this basically forever; maybe this
opens up some discussion, but without it, mapping the keys of a
KTable is not possible. I once again recommend peeking over to the
MapReduce/Tez guys, who have had the concept of these 4 generics basically
forever.



Similar to Matthias's question, the "XXX" markers are a bit confusing to me.

Sorry!!!


Guozhang


On Thu, Nov 16, 2017 at 2:18 PM, Matthias J. Sax <matth...@confluent.io>
wrote:


Hi,

I am just catching up on this discussion and did re-read the KIP and
discussion thread.

In contrast to you, I prefer the second approach with CombinedKey as
return type for the following reasons:

  1) the oneToManyJoin() method had less parameter
  2) those parameters are easy to understand
  3) we hide implementation details (joinPrefixFaker, leftKeyExtractor,
and the return type KO leaks internal implementation details from my
point of view)
  4) user can get their own KO type by extending CombinedKey interface
(this would also address the nesting issue Trevor pointed out)

That's unclear to me is, why you care about JSON serdes? What is the
problem with regard to prefix? It seems I am missing something here.

I also don't understand the argument about "the user can stick with his
default serde or his standard way of serializing"? If we have
`CombinedKey` as output, the use just provide the serdes for both input
combined-key types individually, and we can reuse both internally to do
the rest. This seems to be a way simpler API. With the KO output type
approach, users need to write an entirely new serde for KO in contrast.

Finally, @Jan, there are still some open comments you did not address
and the KIP wiki page needs some updates. Would be great if you could do
this.

Can you also explicitly describe the data layout of the store that is
used to do the range scans?

Additionally:

-> some arrows in the algorithm diagram are missing
-> was are those XXX in the diagram
-> can you finish the "Step by Step" example
-> it think the relationships between the different used types, K0,K1,KO
should be explains explicitly (all information is there implicitly, but
one need to think hard to figure it out)


Last but not least:


But noone is really interested.

Don't understand this statement...



-Matthias


On 11/16/17 9:05 AM, Jan Filipiak wrote:

We are running this perfectly fine. for us the smaller table changes
rather infrequent say. only a few times per day. The performance of the
flush is way lower than the computing power you need to bring to the
table to account for all the records beeing emmited after the one single
update.

On 16.11.2017 18:02, Trevor Huey wrote:

Ah, I think I see the problem now. Thanks for the explanation. That is
tricky. As you said, it seems the easiest solution would just be to
flush the cach

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-11-18 Thread Jan Filipiak

Hi Matthias

answers to the questions inline.

On 16.11.2017 23:18, Matthias J. Sax wrote:

Hi,

I am just catching up on this discussion and did re-read the KIP and
discussion thread.

In contrast to you, I prefer the second approach with CombinedKey as
return type for the following reasons:

  1) the oneToManyJoin() method had less parameter

yeah I like that!

  2) those parameters are easy to understand

The big benefit really!

  3) we hide implementation details (joinPrefixFaker, leftKeyExtractor,
and the return type KO leaks internal implementation details from my
point of view)
It does, and so does CombinedKey. I am a firm believer in the principle of
leaky abstractions.
I think this is okay given the non-triviality of what it tries to
abstract away.



  4) user can get their own KO type by extending CombinedKey interface
(this would also address the nesting issue Trevor pointed out)
They cannot, easily. Say you use protobuf (as we do) and your classes
get generated by a compiler such as protoc: you cannot easily
have it generate subclasses of CombinedKey. I think it's great that we
have Trevor's opinion here as a second user's perspective.
To get stuff going, it is sometimes easier to deal with the implications of
your API (which are there anyway) instead of fighting your currently
established toolset to adapt to some new scheme (like CombinedKeys). In
the end, there is a reason we run it in production with the
less intrusive approach: it is way less intrusive into our
current tool chain and does not require us to adapt to some Kafka Streams
specifics. We have tools to inspect topics; if the key were suddenly
a CombinedKey of two protobuf messages, we couldn't use our default
toolchain. This argument is very relevant for us, to give it some gravitas:
we also rewrote the KTable::groupBy that comes with stock
0.10.0.1 to repartition without serializing Change<> and with log
compaction enabled, so as to not treat Streams topics differently than any
other. For us this is very important. We want to upstream this to be able to
use it instead of our reflection-based PAPI setup. We would not take the
stock one into production.

That's unclear to me is, why you care about JSON serdes? What is the
problem with regard to prefix? It seems I am missing something here.

Say you have a CombinedKey of A, B with values "a" and "b".
If you take our protobuf serde, it's going to be
"a""b"
and, without the "b" field set,
"a"
As you can see, that's a perfect prefix.
But with JSON you would get
{ "A" => "a", "B" => "b" }
and, without the B field,
{ "A" => "a" }
As you can see, it will not be a prefix.
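
Jan's prefix argument, spelled out as bytes in a toy example. This is not a
real serde; real protobuf encoding is tag/length delimited, but the prefix
property holds the same way when the trailing field is simply omitted.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy illustration of the prefix argument above.
public class PrefixDemo {
    // "Protobuf-like": just concatenate the serialized fields in order.
    static byte[] concatStyle(String a, String b) {
        String s = a + (b == null ? "" : b);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // JSON-like: a textual envelope around the fields.
    static byte[] jsonStyle(String a, String b) {
        String s = (b == null)
            ? "{\"A\":\"" + a + "\"}"
            : "{\"A\":\"" + a + "\",\"B\":\"" + b + "\"}";
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static boolean isPrefix(byte[] prefix, byte[] full) {
        return full.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(full, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // Concatenation: serializing only A yields a byte prefix of serializing (A, B).
        System.out.println(isPrefix(concatStyle("a", null), concatStyle("a", "b"))); // true
        // JSON: the closing brace after "a" breaks the prefix property.
        System.out.println(isPrefix(jsonStyle("a", null), jsonStyle("a", "b")));     // false
    }
}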

I also don't understand the argument about "the user can stick with his
default serde or his standard way of serializing"? If we have
`CombinedKey` as output, the use just provide the serdes for both input
combined-key types individually, and we can reuse both internally to do
the rest. This seems to be a way simpler API. With the KO output type
approach, users need to write an entirely new serde for KO in contrast.

Finally, @Jan, there are still some open comments you did not address
and the KIP wiki page needs some updates. Would be great if you could do
this.

Can you also explicitly describe the data layout of the store that is
used to do the range scans?

Additionally:

-> some arrows in the algorithm diagram are missing
-> was are those XXX in the diagram
-> can you finish the "Step by Step" example
Can't find the missing arrows. Those XX's are not really relevant, will 
focus on the step by step example

-> it think the relationships between the different used types, K0,K1,KO
should be explains explicitly (all information is there implicitly, but
one need to think hard to figure it out)

will add this

Last but not least:


But noone is really interested.
This was the first time someone took the effort to address the most
pressing issue in moving this forward.

I counted this as not being interested before.

Don't understand this statement...



-Matthias


On 11/16/17 9:05 AM, Jan Filipiak wrote:

We are running this perfectly fine. for us the smaller table changes
rather infrequent say. only a few times per day. The performance of the
flush is way lower than the computing power you need to bring to the
table to account for all the records beeing emmited after the one single
update.

On 16.11.2017 18:02, Trevor Huey wrote:

Ah, I think I see the problem now. Thanks for the explanation. That is
tricky. As you said, it seems the easiest solution would just be to
flush the cache. I wonder how big of a performance hit that'd be...

On Thu, Nov 16, 2017 at 9:07 AM Jan Filipiak <jan.filip...@trivago.com
<mailto:jan.filip...@trivago.com>> wrote:

 Hi Trevor,

 I am leaning towards the less intrusive approach myself. Infact
 that is how 

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-18 Thread Jan Filipiak

Hi,

 not an issue at all. IMO
the approach that is on the table would be perfect

On 18.11.2017 10:58, Jeyhun Karimov wrote:

Hi,

I did not expected that Context will be this much an issue. Instead of
applying different semantics for different operators, I think we should
remove this feature completely.


Cheers,
Jeyhun
On Sat 18. Nov 2017 at 07:49, Jan Filipiak <jan.filip...@trivago.com> wrote:


Yes, the mail said only join so I wanted to clarify.



On 17.11.2017 19:05, Matthias J. Sax wrote:

Yes. But I think an aggregation is an many-to-one operation, too.

For the stripping off part: internally, we can just keep some record
context, but just do not allow users to access it (because the context
context does not make sense for them) by hiding the corresponding APIs.


-Matthias

On 11/16/17 10:05 PM, Guozhang Wang wrote:

Matthias,

For this idea, are your proposing that for any many-to-one mapping
operations (for now only Join operators), we will strip off the record
context in the resulted records and claim "we cannot infer its traced
context anymore"?


Guozhang


On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <matth...@confluent.io
wrote:


Any thoughts about my latest proposal?

-Matthias

On 11/10/17 10:02 PM, Jan Filipiak wrote:

Hi,

i think this is the better way. Naming is always tricky Source is

kinda

taken
I had TopicBackedK[Source|Table] in mind
but for the user its way better already IMHO

Thank you for reconsideration

Best Jan


On 10.11.2017 22:48, Matthias J. Sax wrote:

I was thinking about the source stream/table idea once more and it

seems

it would not be too hard to implement:

We add two new classes

 SourceKStream extends KStream

and

 SourceKTable extend KTable

and return both from StreamsBuilder#stream and StreamsBuilder#table

As both are sub-classes, this change is backward compatible. We

change

the return type for any single-record transform to this new types,

too,

and use KStream/KTable as return type for any multi-record operation.

The new RecordContext API is added to both new classes. For old

classes,

we only implement KIP-149 to get access to the key.


WDYT?


-Matthias

On 11/9/17 9:13 PM, Jan Filipiak wrote:

Okay,

looks like it would _at least work_ for Cached KTableSources .
But we make it harder to the user to make mistakes by putting
features into places where they don't make sense and don't
help anyone.

I once again think that my suggestion is easier to implement and
more correct. I will use this email to express my disagreement with the
proposed KIP (-1, non-binding of course) and state that I am open for any
questions regarding this. I will also do the usual thing and point out that
the friends over at Hive got it correct as well.
One cannot use their
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
in any place where it is not read from the Sources.

With KSQl in mind it makes me sad how this is evolving here.

Best Jan





On 10.11.2017 01:06, Guozhang Wang wrote:

Hello Jan,

Regarding your question about caching: today we keep the record context
with the cached entry already so when we flush the cache which may generate
new records forwarding we will set the record context appropriately; and
then after the flush is completed we will reset the context to the record
before the flush happens. But I think when Jeyhun did the PR it is a good
time to double check on such stages to make sure we are not introducing any
regressions.


Guozhang


On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <

jan.filip...@trivago.com>

wrote:


I Aggree completely.

Exposing this information in a place where it has no _natural_
belonging
might really be a bad blocker in the long run.

Concerning your first point. I would argue its not to hard to have a user
keep track of these. If we still don't want the user
to keep track of these I would argue that all > projection only <
transformations on a Source-backed KTable/KStream
could also return a Ktable/KStream instance of the type we return from the
topology builder.
Only after any operation that exceeds projection or filter one would
return a KTable not granting access to this any longer.

Even then its difficult already: I never ran a topology with caching but I
am not even 100% sure what the record Context means behind
a materialized KTable with Caching? Topic and Partition are probably with
some reasoning but offset is probably only the offset causing the flush?
So one might aswell think to drop offsets from this RecordContext.

Best Jan







On 07.11.2017 03:18, Guozhang Wang wrote:


Regarding the API design (the proposed set of overloads v.s. one
overload
on #map to enrich the record), I think what we have represents a

good

trade-off between API succinctness and user convenience: on one
hand we
definitely want to keep as fewer overloaded functions as

possible.

But on
the other hand if we only do th

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-17 Thread Jan Filipiak

Yes, the mail said only join so I wanted to clarify.



On 17.11.2017 19:05, Matthias J. Sax wrote:

Yes. But I think an aggregation is an many-to-one operation, too.

For the stripping off part: internally, we can just keep some record
context, but just do not allow users to access it (because the context
context does not make sense for them) by hiding the corresponding APIs.


-Matthias

On 11/16/17 10:05 PM, Guozhang Wang wrote:

Matthias,

For this idea, are your proposing that for any many-to-one mapping
operations (for now only Join operators), we will strip off the record
context in the resulted records and claim "we cannot infer its traced
context anymore"?


Guozhang


On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <matth...@confluent.io>
wrote:


Any thoughts about my latest proposal?

-Matthias

On 11/10/17 10:02 PM, Jan Filipiak wrote:

Hi,

i think this is the better way. Naming is always tricky Source is kinda
taken
I had TopicBackedK[Source|Table] in mind
but for the user its way better already IMHO

Thank you for reconsideration

Best Jan


On 10.11.2017 22:48, Matthias J. Sax wrote:

I was thinking about the source stream/table idea once more and it seems
it would not be too hard to implement:

We add two new classes

SourceKStream extends KStream

and

SourceKTable extend KTable

and return both from StreamsBuilder#stream and StreamsBuilder#table

As both are sub-classes, this change is backward compatible. We change
the return type for any single-record transform to this new types, too,
and use KStream/KTable as return type for any multi-record operation.

The new RecordContext API is added to both new classes. For old classes,
we only implement KIP-149 to get access to the key.
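
A rough sketch of the class shape being proposed here: the Source* types do
not exist in Kafka Streams, and the bodies below are invented only to
illustrate the covariant return types the proposal relies on.

import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Predicate;

// Single-record transforms keep returning the Source* type (so the record
// context stays accessible), while multi-record operations inherited from
// KStream/KTable keep returning the plain types and thereby drop the context.
interface SourceKStream<K, V> extends KStream<K, V> {
    @Override
    SourceKStream<K, V> filter(Predicate<? super K, ? super V> predicate);
    // ... other single-record transforms would be narrowed the same way,
    // plus whatever record-context accessors the proposal ends up defining.
}

interface SourceKTable<K, V> extends KTable<K, V> {
    @Override
    SourceKTable<K, V> filter(Predicate<? super K, ? super V> predicate);
}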


WDYT?


-Matthias

On 11/9/17 9:13 PM, Jan Filipiak wrote:

Okay,

looks like it would _at least work_ for Cached KTableSources .
But we make it harder to the user to make mistakes by putting
features into places where they don't make sense and don't
help anyone.

I once again think that my suggestion is easier to implement and
more correct. I will use this email to express my disagreement with the
proposed KIP (-1 non binding of course) state that I am open for any
questions
regarding this. I will also do the usual thing and point out that the
friends
over at Hive got it correct aswell.
One can not user their
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns


in any place where its not read from the Sources.

With KSQl in mind it makes me sad how this is evolving here.

Best Jan





On 10.11.2017 01:06, Guozhang Wang wrote:

Hello Jan,

Regarding your question about caching: today we keep the record

context

with the cached entry already so when we flush the cache which may
generate
new records forwarding we will set the record context appropriately;
and
then after the flush is completed we will reset the context to the
record
before the flush happens. But I think when Jeyhun did the PR it is a
good
time to double check on such stages to make sure we are not
introducing any
regressions.


Guozhang


On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <

jan.filip...@trivago.com>

wrote:


I Aggree completely.

Exposing this information in a place where it has no _natural_
belonging
might really be a bad blocker in the long run.

Concerning your first point. I would argue its not to hard to have a
user
keep track of these. If we still don't want the user
to keep track of these I would argue that all > projection only <
transformations on a Source-backed KTable/KStream
could also return a Ktable/KStream instance of the type we return
from the
topology builder.
Only after any operation that exceeds projection or filter one would
return a KTable not granting access to this any longer.

Even then its difficult already: I never ran a topology with caching
but I
am not even 100% sure what the record Context means behind
a materialized KTable with Caching? Topic and Partition are probably
with
some reasoning but offset is probably only the offset causing the
flush?
So one might aswell think to drop offsets from this RecordContext.

Best Jan







On 07.11.2017 03:18, Guozhang Wang wrote:


Regarding the API design (the proposed set of overloads v.s. one overload
on #map to enrich the record), I think what we have represents a good
trade-off between API succinctness and user convenience: on one hand we
definitely want to keep as few overloaded functions as possible. But on
the other hand, if we only do that in, say, the #map() function, then this
enrichment could be an overkill: think of a topology that has 7 operators
in a chain, where users want to access the record context on operators #2
and #6 only; with the "enrichment" manner they need to do the enrichment
on operator #2 and keep it that way until #6. In addition, the RecordContext
fields (topic, offset, etc.) are really orthogonal to the key-value payloads
themselves, so I think separating them into this object is a cleaner way.
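
For reference, this is roughly what the "enrichment" workaround looks like with the existing Processor API hooks in a recent Kafka Streams version: a transformValues() step reads the metadata from ProcessorContext and carries it along in a wrapper value. The wrapper type EnrichedValue and the class names below are made up for this sketch.

    import org.apache.kafka.streams.kstream.ValueTransformer;
    import org.apache.kafka.streams.processor.ProcessorContext;

    // Sketch only: carry topic/partition/offset alongside the value so a later
    // operator in the chain (e.g. #6) can still see where the record came from.
    class EnrichedValue<V> {
        final V value;
        final String topic;
        final int partition;
        final long offset;

        EnrichedValue(final V value, final String topic, final int partition, final long offset) {
            this.value = value;
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
        }
    }

    class MetadataEnricher<V> implements ValueTransformer<V, EnrichedValue<V>> {
        private ProcessorContext context;

        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
        }

        @Override
        public EnrichedValue<V> transform(final V value) {
            // ProcessorContext exposes the metadata of the record currently processed.
            return new EnrichedValue<>(value, context.topic(), context.partition(), context.offset());
        }

        @Override
        public void close() { }
    }

Applying it before operator #2, e.g. stream.transformValues(MetadataEnricher::new), is roughly the kind of manual plumbing the overload-based proposal tries to avoid.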

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-17 Thread Jan Filipiak

Hey,

... and group-by. And yes, there is no logical context we can present.
The context present has nothing to do with the record currently processed.
It just doesn't come out of the algebra:
https://en.wikipedia.org/wiki/Relational_algebra#Aggregation


I am all in on this approach.

Best Jan

On 17.11.2017 07:05, Guozhang Wang wrote:

Matthias,

For this idea, are you proposing that for any many-to-one mapping
operations (for now only join operators), we will strip off the record
context in the resulting records and claim "we cannot infer its traced
context anymore"?


Guozhang


On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <matth...@confluent.io>
wrote:


Any thoughts about my latest proposal?

-Matthias

On 11/10/17 10:02 PM, Jan Filipiak wrote:

Hi,

I think this is the better way. Naming is always tricky; "Source" is kinda taken.
I had TopicBackedK[Source|Table] in mind,
but for the user it's way better already, IMHO.

Thank you for reconsideration

Best Jan


On 10.11.2017 22:48, Matthias J. Sax wrote:

I was thinking about the source stream/table idea once more and it seems
it would not be too hard to implement:

We add two new classes

SourceKStream extends KStream

and

SourceKTable extend KTable

and return both from StreamsBuilder#stream and StreamsBuilder#table

As both are sub-classes, this change is backward compatible. We change
the return type of any single-record transform to these new types, too,
and use KStream/KTable as the return type for any multi-record operation.

The new RecordContext API is added to both new classes. For old classes,
we only implement KIP-149 to get access to the key.


WDYT?


-Matthias

On 11/9/17 9:13 PM, Jan Filipiak wrote:

Okay,

looks like it would _at least work_ for cached KTableSources.
But we make it easier for the user to make mistakes by putting
features into places where they don't make sense and don't
help anyone.

I once again think that my suggestion is easier to implement and
more correct. I will use this email to express my disagreement with the
proposed KIP (-1, non-binding of course) and state that I am open to any
questions regarding this. I will also do the usual thing and point out
that our friends over at Hive got this correct as well.
One cannot use their
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
in any place where the data is not read from the sources.

With KSQL in mind it makes me sad how this is evolving here.

Best Jan





On 10.11.2017 01:06, Guozhang Wang wrote:

Hello Jan,

Regarding your question about caching: today we already keep the record
context with the cached entry, so when we flush the cache, which may
generate new records to forward, we set the record context appropriately;
and then, after the flush is completed, we reset the context to the
record from before the flush. But I think when Jeyhun does the PR it is a
good time to double-check such stages to make sure we are not introducing
any regressions.


Guozhang


On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <

jan.filip...@trivago.com>

wrote:


I agree completely.

Exposing this information in a place where it has no _natural_ belonging
might really be a bad blocker in the long run.

Concerning your first point: I would argue it's not too hard to have a
user keep track of these. If we still don't want the user to keep track
of these, I would argue that all >projection only< transformations on a
source-backed KTable/KStream could also return a KTable/KStream instance
of the type we return from the topology builder.
Only after an operation that exceeds projection or filtering would one
return a KTable that no longer grants access to this.

Even then it's difficult already: I never ran a topology with caching,
but I am not even 100% sure what the record context means behind a
materialized KTable with caching. Topic and partition can probably be
reasoned about, but the offset is probably only the offset that caused
the flush? So one might as well think about dropping offsets from this
RecordContext.

Best Jan







On 07.11.2017 03:18, Guozhang Wang wrote:


Regarding the API design (the proposed set of overloads v.s. one overload
on #map to enrich the record), I think what we have represents a good
trade-off between API succinctness and user convenience: on one hand we
definitely want to keep as few overloaded functions as possible. But on
the other hand, if we only do that in, say, the #map() function, then this
enrichment could be an overkill: think of a topology that has 7 operators
in a chain, where users want to access the record context on operators #2
and #6 only; with the "enrichment" manner they need to do the enrichment
on operator #2 and keep it that way until #6. In addition, the RecordContext
fields (topic, offset, etc.) are really orthogonal to the key-value payloads
themselves, so I think separating them into this object is a cleaner way.

Regarding the RecordContext inheritance, this is actually a good point that
has not been discussed thoroughly before.

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-11-16 Thread Jan Filipiak
We are running this perfectly fine. For us the smaller table changes
rather infrequently, say only a few times per day. The cost of the
flush is way lower than the computing power you need to bring to the
table to account for all the records being emitted after the one single
update.


On 16.11.2017 18:02, Trevor Huey wrote:
Ah, I think I see the problem now. Thanks for the explanation. That is 
tricky. As you said, it seems the easiest solution would just be to 
flush the cache. I wonder how big of a performance hit that'd be...


On Thu, Nov 16, 2017 at 9:07 AM Jan Filipiak <jan.filip...@trivago.com 
<mailto:jan.filip...@trivago.com>> wrote:


Hi Trevor,

I am leaning towards the less intrusive approach myself. In fact
that is how we implemented our internal API for this and how we
run it in production.
Getting more voices towards this solution makes me really happy.
The reason it's a problem for prefix and not for range is the
following. Imagine the intrusive approach. The key in RocksDB
would be CombinedKey<A,B>, the prefix scan would take an A, and
the range scan would still take a CombinedKey<A,B>. As you can
see, with the intrusive approach the keys are actually different
types for different queries. With the less intrusive approach we
use the same type and rely on serde invariants. For us this works
nicely (protobuf); it might bite some JSON users.
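
For illustration, a rough sketch of the distinction being made here; the names CombinedKey and OneToManyStore are invented for this sketch and are not proposed API:

    // Sketch only: composite key for the one-to-many join, "A" is the prefix part.
    final class CombinedKey<A, B> {
        final A prefix;
        final B suffix;

        CombinedKey(final A prefix, final B suffix) {
            this.prefix = prefix;
            this.suffix = suffix;
        }
    }

    // With the intrusive approach, the two query types take different key types;
    // with the less intrusive approach, both take the plain key type and rely on
    // the serde writing the prefix bytes first.
    interface OneToManyStore<A, B, V> {
        // Intrusive approach: a prefix scan takes just the "A" part ...
        Iterable<V> prefixScan(A prefix);

        // ... while a range scan still takes full composite keys.
        Iterable<V> rangeScan(CombinedKey<A, B> from, CombinedKey<A, B> to);
    }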

Hope it makes it clear

Best Jan


On 16.11.2017 16:39, Trevor Huey wrote:

1. Going over KIP-213, I am leaning toward the "less intrusive"
approach. In my use case, I am planning on performing a sequence
of several oneToMany joins. From my understanding, the more
intrusive approach would result in several nested levels of
CombinedKeys. For example, consider tables A, B, C with
corresponding keys KA, KB, KC. Joining A and B would produce
CombinedKey<KA, KB>. Then joining that result on C would produce
CombinedKey<KC, CombinedKey<KA, KB>>. My "keyOtherSerde" in this
case would need to be capable of deserializing CombinedKey<KA,
KB>. This would just get worse the more tables I join. I realize
that it's easier to shoot yourself in the foot with the less
intrusive approach, but as you said, " the user can stick with
his default serde or his standard way of serializing". In the
simplest case where the keys are just strings, they can do simple
string concatenation and Serdes.String(). It also allows the user
to create and use their own version of CombinedKey if they feel
so inclined.

2. Why is there a problem for prefix, but not for range?

https://github.com/apache/kafka/pull/3720/files#diff-8f863b74c3c5a0b989e89d00c149aef1L162


On Thu, Nov 16, 2017 at 2:57 AM Jan Filipiak
<jan.filip...@trivago.com <mailto:jan.filip...@trivago.com>> wrote:

Hi Trevor,

thank you very much for your interest. To keep the discussion
on the mailing list and not in Jira or Confluence, I decided to
reply here.

1. It's tricky; activity is indeed very low. In KIP-213
there are 2 proposals about the return type of the join. I
would like to settle on one.
Unfortunately it's controversial and I don't want to have the
discussion after I have settled on one way and implemented it. But
no one is really interested.
So discussing with YOU what your preferred return type would
look like would be very helpful already.

2.
The most difficult part is implementing
this

https://github.com/apache/kafka/pull/3720/files#diff-ac41b4dfb9fc6bb707d966477317783cR68
here

https://github.com/apache/kafka/pull/3720/files#diff-8f863b74c3c5a0b989e89d00c149aef1R244
and here

https://github.com/apache/kafka/pull/3720/files#diff-b1a1281dce5219fd0cb5afad380d9438R207
One can get an easy shot by just flushing the underlying
RocksDB store and using RocksDB for the range scan.
But as you can see, the implementation depends on the API. Depending on
which way the API discussion goes,
I would implement this differently.

3.
I only have so much time to work on this. I filed the
KIP because I want to pull it through and I am pretty
confident that I can do it.
But I am still waiting for the full discussion to happen on
this. To move the discussion forward it seems that I
need to fill out the table in
the KIP entirely (the one describing the events, change
modifications and output). Feel free to continue the
discussion w/o the table. I want
to finish the table during next week.

    Best Jan thank you for your interest!

_ Jira Quote __

Jan Filipiak
<https://issues.apache.or

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

2017-11-16 Thread Jan Filipiak

Hi Trevor,

thank you very much for your interest. To keep the discussion on the mailing
list and not in Jira or Confluence, I decided to reply here.


1. It's tricky; activity is indeed very low. In KIP-213 there are 2
proposals about the return type of the join. I would like to settle on one.
Unfortunately it's controversial and I don't want to have the discussion
after I have settled on one way and implemented it. But no one is really
interested.
So discussing with YOU what your preferred return type would look like would
be very helpful already.


2.
The most difficult part is implementing
this 
https://github.com/apache/kafka/pull/3720/files#diff-ac41b4dfb9fc6bb707d966477317783cR68
here 
https://github.com/apache/kafka/pull/3720/files#diff-8f863b74c3c5a0b989e89d00c149aef1R244
and here 
https://github.com/apache/kafka/pull/3720/files#diff-b1a1281dce5219fd0cb5afad380d9438R207
One can get an easy shot by just flushing the underlying RocksDB store and
using RocksDB for the range scan, as sketched below.
But as you can see, the implementation depends on the API. Depending on which
way the API discussion goes,

I would implement this differently.
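
As a concrete sketch of that "easy shot": flush the RocksDB-backed store so cached entries are written down, then answer the prefix query with an ordinary range scan. Store and key names here are illustrative; this is not code from the KIP-213 pull request:

    import java.util.function.BiConsumer;

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;

    // Sketch only: prefix lookup implemented as flush + range scan.
    final class PrefixScanSketch {

        @SuppressWarnings("unchecked")
        static <V> void forEachInRange(final ProcessorContext context,
                                       final String storeName,
                                       final String fromKey,
                                       final String toKey,
                                       final BiConsumer<String, V> action) {
            final KeyValueStore<String, V> store =
                    (KeyValueStore<String, V>) context.getStateStore(storeName);

            store.flush(); // push cached records down to RocksDB first

            try (KeyValueIterator<String, V> iter = store.range(fromKey, toKey)) {
                while (iter.hasNext()) {
                    final KeyValue<String, V> kv = iter.next();
                    action.accept(kv.key, kv.value);
                }
            }
        }
    }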

3.
I only have so much time to work on this. I filed the KIP because
I want to pull it through and I am pretty confident that I can do it.
But I am still waiting for the full discussion to happen on this. To move
the discussion forward it seems that I need to fill out the table in
the KIP entirely (the one describing the events, change modifications and
output). Feel free to continue the discussion w/o the table. I want
to finish the table during next week.

Best Jan thank you for your interest!

_ Jira Quote __

Jan Filipiak 
<https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jfilipiak> 
Please bear with me while I try to get caught up. I'm not yet familiar 
with the Kafka code base. I have a few questions to try to figure out 
how I can get involved:
1. It seems like we need to get buy-in on your KIP-213? It doesn't seem 
like there's been much activity on it besides yourself in a while. 
What's your current plan of attack for getting that approved?
2. I know you said that the most difficult part is yet to be done. Is 
there some code you can point me toward so I can start digging in and 
better understand why this is so difficult?
3. This issue has been open since May '16. How far out do you think we 
are from getting this implemented?


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

2017-11-12 Thread Jan Filipiak

Hi Guozhang,

it felt like these questions were supposed to be answered by me.
I do not understand the first one. I don't understand why the user
shouldn't be able to specify a suffix for the topic name.

For the third question, I am not 100% sure whether the Produced class
came into existence
at all. I remember proposing it somewhere in our DSL redo discussion that
I dropped out of later. Finally, any call that does:

1. create the internal topic
2. register the sink
3. register the source

will always get the work done. If we have a Produced-like class, putting
all the parameters
in there makes sense (partitioner, serde, partition hint, internal flag, name
... ), as sketched below.
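
A rough sketch of such a parameter object; RepartitionSpec and all of its fields are invented here for illustration and are neither the Produced class nor any committed API:

    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.streams.processor.StreamPartitioner;

    // Sketch only: bundle everything a repartition step might need.
    public final class RepartitionSpec<K, V> {
        private Serde<K> keySerde;
        private Serde<V> valueSerde;
        private StreamPartitioner<K, V> partitioner;
        private Integer numberOfPartitions; // null = use the default
        private boolean internal = true;    // internal topic managed by Streams
        private String nameSuffix;          // optional suffix for the topic name

        public static <K, V> RepartitionSpec<K, V> with(final Serde<K> keySerde,
                                                        final Serde<V> valueSerde) {
            final RepartitionSpec<K, V> spec = new RepartitionSpec<>();
            spec.keySerde = keySerde;
            spec.valueSerde = valueSerde;
            return spec;
        }

        public RepartitionSpec<K, V> withNumberOfPartitions(final int partitions) {
            this.numberOfPartitions = partitions;
            return this;
        }

        public RepartitionSpec<K, V> withNameSuffix(final String suffix) {
            this.nameSuffix = suffix;
            return this;
        }

        public RepartitionSpec<K, V> withPartitioner(final StreamPartitioner<K, V> partitioner) {
            this.partitioner = partitioner;
            return this;
        }
    }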


Hope this helps?


On 10.11.2017 07:54, Guozhang Wang wrote:

A few clarification questions on the proposal details.

1. API: although the repartition only happens at the final stateful
operations like agg / join, the repartition flag info was actually passed
from an earlier operator like map / groupBy. So what should the new API
look like? For example, if we do

stream.groupBy().through("topic-name", Produced..).aggregate

This would add a bunch of APIs to GroupedKStream/KTable

2. Semantics: as Matthias mentioned, today any topic passed to the
"through()" call is considered a user topic, and hence users are
responsible for managing it, including the topic name. For this KIP's
purpose, though, users would not care about the topic name. I.e. as a user
I still want it to be an internal topic so that I do not need to worry
about it at all, but only specify num.partitions.

3. Details: in Produced we do not have specs for specifying
num.partitions or whether we should repartition or not. So it is still not clear to
me how we would make use of that to achieve what's in the old
proposal's RepartitionHint class.



Guozhang


On Mon, Nov 6, 2017 at 1:21 PM, Ted Yu <yuzhih...@gmail.com> wrote:


bq. enlarge the score of through()

I guess you meant scope.

On Mon, Nov 6, 2017 at 1:15 PM, Jeyhun Karimov <je.kari...@gmail.com>
wrote:


Hi,

Sorry for the late reply. I am convinced that we should enlarge the score
of through() (add more overloads) instead of introducing a separate set of
overloads to other methods.
I will update the KIP soon based on the discussion and inform.


Cheers,
Jeyhun

On Mon, Nov 6, 2017 at 9:18 PM Jan Filipiak <jan.filip...@trivago.com>
wrote:


Sorry for not being 100% up to date.
Back then we had the discussion that when an operation puts a >Sink<
into the topology, a >Produced<
parameter is added. This Produced parameter could be marked internal or
external. If internal, I think the name would still make
a great suffix for the topic name.

Is this plan still around? Otherwise, having the name as a suffix is
probably always good; it can help the user identify more quickly which hot
topics need more
partitions if he has many of these internal repartitions.

Best Jan


On 06.11.2017 20:13, Matthias J. Sax wrote:

I absolutely agree with what you say. It's not a requirement to specify a
topic name -- and this was the idea -- if the user does specify a name, we
treat it as is -- if the user does not specify a name, Streams creates an
internal topic.

The goal of the Jira is to allow a simplified way to control
repartitioning (atm, the user needs to manually create a topic and use it
via through()).

Thus, the idea is to make the topic name parameter of through() optional.

It's of course just an idea. Happy to have another API design. The goal
was to avoid too many new overloads.


Could you clarify exactly what you mean by keeping the current distinction?

The current distinction is: user topics are created manually and the user
specifies the name -- internal topics are created by Kafka Streams and
a name is generated automatically.

-> through("user-topic")
-> through(TopicConfig.withNumberOfPartitions(5)) // Streams creates an internal topic


-Matthias


On 11/6/17 6:56 PM, Thomas Becker wrote:

Could you clarify exactly what you mean by keeping the current distinction?

Actually, re-reading the KIP and JIRA, it's not clear that being able
to specify a custom name is actually a requirement. If the goal is to
control repartitioning and tune parallelism, maybe we can just sidestep
this issue altogether by removing the ability to set a different name.

On Mon, 2017-11-06 at 16:51 +0100, Matthias J. Sax wrote:

That's a good point. In the current design, we strictly distinguish both.

For example, the reset tool deletes internal topics (starting with the
application ID as prefix and ending with either `-repartition` or
`-changelog`).

Thus, from my point of view, it would make sense to keep the current
distinction.

-Matthias

On 11/6/17 4:45 PM, Thomas Becker wrote:


I think this sounds good as well. It's worth clarifying whether topics
that are named by the user but created by streams are considered "internal"
topics also.

On Sun, 2017-11-05 at 23:02 +0100, Matthias J. Sax wrote:

My idea

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-10 Thread Jan Filipiak

Hi,

I think this is the better way. Naming is always tricky; "Source" is kinda taken.

I had TopicBackedK[Source|Table] in mind,
but for the user it's way better already, IMHO.

Thank you for reconsideration

Best Jan


On 10.11.2017 22:48, Matthias J. Sax wrote:

I was thinking about the source stream/table idea once more and it seems
it would not be too hard to implement:

We add two new classes

   SourceKStream extends KStream

and

   SourceKTable extend KTable

and return both from StreamsBuilder#stream and StreamsBuilder#table

As both are sub-classes, this change is backward compatible. We change
the return type of any single-record transform to these new types, too,
and use KStream/KTable as the return type for any multi-record operation.

The new RecordContext API is added to both new classes. For old classes,
we only implement KIP-149 to get access to the key.


WDYT?


-Matthias

On 11/9/17 9:13 PM, Jan Filipiak wrote:

Okay,

looks like it would _at least work_ for cached KTableSources.
But we make it easier for the user to make mistakes by putting
features into places where they don't make sense and don't
help anyone.

I once again think that my suggestion is easier to implement and
more correct. I will use this email to express my disagreement with the
proposed KIP (-1, non-binding of course) and state that I am open to any
questions regarding this. I will also do the usual thing and point out
that our friends over at Hive got this correct as well.
One cannot use their
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
in any place where the data is not read from the sources.

With KSQL in mind it makes me sad how this is evolving here.

Best Jan





On 10.11.2017 01:06, Guozhang Wang wrote:

Hello Jan,

Regarding your question about caching: today we already keep the record context
with the cached entry, so when we flush the cache, which may generate
new records to forward, we set the record context appropriately; and
then, after the flush is completed, we reset the context to the record
from before the flush. But I think when Jeyhun does the PR it is a good
time to double-check such stages to make sure we are not introducing any
regressions.


Guozhang


On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


I agree completely.

Exposing this information in a place where it has no _natural_ belonging
might really be a bad blocker in the long run.

Concerning your first point: I would argue it's not too hard to have a
user keep track of these. If we still don't want the user to keep track
of these, I would argue that all >projection only< transformations on a
source-backed KTable/KStream could also return a KTable/KStream instance
of the type we return from the topology builder.
Only after an operation that exceeds projection or filtering would one
return a KTable that no longer grants access to this.

Even then it's difficult already: I never ran a topology with caching,
but I am not even 100% sure what the record context means behind a
materialized KTable with caching. Topic and partition can probably be
reasoned about, but the offset is probably only the offset that caused
the flush? So one might as well think about dropping offsets from this
RecordContext.

Best Jan







On 07.11.2017 03:18, Guozhang Wang wrote:


Regarding the API design (the proposed set of overloads v.s. one overload
on #map to enrich the record), I think what we have represents a good
trade-off between API succinctness and user convenience: on one hand we
definitely want to keep as few overloaded functions as possible. But on
the other hand, if we only do that in, say, the #map() function, then this
enrichment could be an overkill: think of a topology that has 7 operators
in a chain, where users want to access the record context on operators #2
and #6 only; with the "enrichment" manner they need to do the enrichment
on operator #2 and keep it that way until #6. In addition, the RecordContext
fields (topic, offset, etc.) are really orthogonal to the key-value payloads
themselves, so I think separating them into this object is a cleaner way.

Regarding the RecordContext inheritance, this is actually a good point that
has not been discussed thoroughly before. Here are my two cents: one
natural way would be to inherit the record context from the "triggering"
record; for example, in a join operator, if the record from stream A
triggers the join then the record context is inherited from that
record. This is also aligned with the lower-level PAPI interface. A
counter argument, though, would be that this is sort of leaking the internal
implementation of the DSL, so that moving forward, if we did some
refactoring of our join implementations so that the triggering record can
change, the RecordContext would also be different. I do not know how much
it would really affect end users, but would like to hear your opinions.


Agreed to 100% exposing this information

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-09 Thread Jan Filipiak

Okay,

looks like it would _at least work_ for cached KTableSources.
But we make it easier for the user to make mistakes by putting
features into places where they don't make sense and don't
help anyone.

I once again think that my suggestion is easier to implement and
more correct. I will use this email to express my disagreement with the
proposed KIP (-1, non-binding of course) and state that I am open to any
questions regarding this. I will also do the usual thing and point out
that our friends over at Hive got this correct as well.
One cannot use their
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
in any place where the data is not read from the sources.

With KSQL in mind it makes me sad how this is evolving here.

Best Jan





On 10.11.2017 01:06, Guozhang Wang wrote:

Hello Jan,

Regarding your question about caching: today we already keep the record context
with the cached entry, so when we flush the cache, which may generate
new records to forward, we set the record context appropriately; and
then, after the flush is completed, we reset the context to the record
from before the flush. But I think when Jeyhun does the PR it is a good
time to double-check such stages to make sure we are not introducing any
regressions.


Guozhang


On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <jan.filip...@trivago.com>
wrote:


I agree completely.

Exposing this information in a place where it has no _natural_ belonging
might really be a bad blocker in the long run.

Concerning your first point: I would argue it's not too hard to have a
user keep track of these. If we still don't want the user to keep track
of these, I would argue that all >projection only< transformations on a
source-backed KTable/KStream could also return a KTable/KStream instance
of the type we return from the topology builder.
Only after an operation that exceeds projection or filtering would one
return a KTable that no longer grants access to this.

Even then it's difficult already: I never ran a topology with caching,
but I am not even 100% sure what the record context means behind a
materialized KTable with caching. Topic and partition can probably be
reasoned about, but the offset is probably only the offset that caused
the flush? So one might as well think about dropping offsets from this
RecordContext.

Best Jan







On 07.11.2017 03:18, Guozhang Wang wrote:


Regarding the API design (the proposed set of overloads v.s. one overload
on #map to enrich the record), I think what we have represents a good
trade-off between API succinctness and user convenience: on one hand we
definitely want to keep as few overloaded functions as possible. But on
the other hand, if we only do that in, say, the #map() function, then this
enrichment could be an overkill: think of a topology that has 7 operators
in a chain, where users want to access the record context on operators #2
and #6 only; with the "enrichment" manner they need to do the enrichment
on operator #2 and keep it that way until #6. In addition, the RecordContext
fields (topic, offset, etc.) are really orthogonal to the key-value payloads
themselves, so I think separating them into this object is a cleaner way.

Regarding the RecordContext inheritance, this is actually a good point that
has not been discussed thoroughly before. Here are my two cents: one
natural way would be to inherit the record context from the "triggering"
record; for example, in a join operator, if the record from stream A
triggers the join then the record context is inherited from that
record. This is also aligned with the lower-level PAPI interface. A
counter argument, though, would be that this is sort of leaking the internal
implementation of the DSL, so that moving forward, if we did some
refactoring of our join implementations so that the triggering record can
change, the RecordContext would also be different. I do not know how much
it would really affect end users, but would like to hear your opinions.


Agreed to 100% exposing this information



Guozhang


On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <je.kari...@gmail.com>
wrote:

Hi Jan,

Sorry for the late reply.


The API Design doesn't look appealing


In terms of API design we tried to preserve the Java functional interfaces.
We applied the same set of rich methods for KTable to make it compatible
with the rest of the overloaded APIs.

It should be 100% sufficient to offer a KTable + KStream that is directly
fed from a topic with 1 additional overload for the #map() methods to
cover every use case while keeping the API in a way better state.

- IMO this seems like a workaround rather than a direct solution.

Perhaps we should continue this discussion in the DISCUSS thread.


Cheers,
Jeyhun


On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak <jan.filip...@trivago.com>
wrote:

Hi.

I do understand that it might come in handy.
From my POV, in any relational algebra this is only a projection.
Currently we 

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

2017-11-07 Thread Jan Filipiak


On 07.11.2017 12:59, Jan Filipiak wrote:


On 07.11.2017 11:20, Matthias J. Sax wrote:

About implementation if we do the KIP as proposed: I agree with Guozhang
that we would need to use the currently processed record's metadata in
the context. This does leak some implementation details, but I
personally don't see a big issue here (at the same time, I am also fine
to remove the RecordContext for joins if people think it's an issue).

About the API: while I agree with Jan, that having two APIs for input
streams/tables and "derived" streams/table (ie, result of
KStream-KStream join or an aggregation) would be a way to avoid some
semantic issue, I am not sure if it is worth the effort. IMHO, it would
make the API more convoluted and if users access the RecordContext on a
derived stream/table it's a "user error"
Why make it easy for users to make mistakes in order to save some effort?

(Effort that I don't quite think is that big, actually.)

-- it's not really wrong as
users still get the current record's context, but of course, we would
leak implementation details (as above, I don't see a big issue here
though).


At the same time, I disagree with Jan that "its not to hard to have a
user keeping track" -- if we apply this argument, we could even argue
that it's not too hard to use a Transformer instead of a map/filter etc.
We want to add "syntactic sugar" with this change and thus should really
provide value and not introduce a half-baked solution for which users
still need to do manual customizing.


-Matthias


On 11/7/17 5:54 AM, Jan Filipiak wrote:

I agree completely.

Exposing this information in a place where it has no _natural_ belonging
might really be a bad blocker in the long run.

Concerning your first point: I would argue it's not too hard to have a
user keep track of these. If we still don't want the user to keep track
of these, I would argue that all >projection only< transformations on a
source-backed KTable/KStream could also return a KTable/KStream instance
of the type we return from the topology builder.
Only after an operation that exceeds projection or filtering would one
return a KTable that no longer grants access to this.

Even then it's difficult already: I never ran a topology with caching,
but I am not even 100% sure what the record context means behind a
materialized KTable with caching. Topic and partition can probably be
reasoned about, but the offset is probably only the offset that caused
the flush? So one might as well think about dropping offsets from this
RecordContext.

Best Jan







On 07.11.2017 03:18, Guozhang Wang wrote:
Regarding the API design (the proposed set of overloads v.s. one overload
on #map to enrich the record), I think what we have represents a good
trade-off between API succinctness and user convenience: on one hand we
definitely want to keep as few overloaded functions as possible. But on
the other hand, if we only do that in, say, the #map() function, then this
enrichment could be an overkill: think of a topology that has 7 operators
in a chain, where users want to access the record context on operators #2
and #6 only; with the "enrichment" manner they need to do the enrichment
on operator #2 and keep it that way until #6. In addition, the RecordContext
fields (topic, offset, etc.) are really orthogonal to the key-value payloads
themselves, so I think separating them into this object is a cleaner way.


Regarding the RecordContext inheritance, this is actually a good point that
has not been discussed thoroughly before. Here are my two cents: one
natural way would be to inherit the record context from the "triggering"
record; for example, in a join operator, if the record from stream A
triggers the join then the record context is inherited from that
record. This is also aligned with the lower-level PAPI interface. A
counter argument, though, would be that this is sort of leaking the internal
implementation of the DSL, so that moving forward, if we did some
refactoring of our join implementations so that the triggering record can
change, the RecordContext would also be different. I do not know how much
it would really affect end users, but would like to hear your opinions.

Agreed to 100% exposing this information


Guozhang


On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <je.kari...@gmail.com>
wrote:


Hi Jan,

Sorry for the late reply.


The API Design doesn't look appealing


In terms of API design we tried to preserve the Java functional interfaces.
We applied the same set of rich methods for KTable to make it compatible
with the rest of the overloaded APIs.

It should be 100% sufficient to offer a KTable + KStream that is directly
fed from a topic with 1 additional overload for the #map() methods to
cover every use case while keeping the API in a way better state.

- IMO this seems like a workaround rather than a direct solution.

Perhaps we should continue this discussion in the DISCUSS thread.
