[jira] [Commented] (KAFKA-9455) Consider using TreeMap for In-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019191#comment-17019191 ]

Ted Yu commented on KAFKA-9455:
---

Maybe we can also look at (profile) Maps from fastutil such as:
http://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/objects/Object2ObjectSortedMap.html

> Consider using TreeMap for In-memory stores of Streams
> --
>
> Key: KAFKA-9455
> URL: https://issues.apache.org/jira/browse/KAFKA-9455
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Guozhang Wang
> Priority: Major
>
> From [~ableegoldman]: It's worth noting that it might be a good idea to
> switch to TreeMap for different reasons. Right now the ConcurrentSkipListMap
> allows us to safely perform range queries without copying over the entire
> keyset, but the performance on point queries seems to scale noticeably worse
> with the number of unique keys. Point queries are used by aggregations while
> range queries are used by windowed joins, but of course both are available
> within the PAPI and for interactive queries so it's hard to say which we
> should prefer. Maybe rather than make that tradeoff we should have one
> version for efficient range queries (a "JoinWindowStore") and one for
> efficient point queries ("AggWindowStore") - or something. I know we've had
> similar thoughts for a different RocksDB store layout for Joins (although I
> can't find that ticket anywhere..), it seems like the in-memory stores could
> benefit from a special "Join" version as well cc/ Guozhang Wang
>
> Here are some random thoughts:
>
> 1. For Kafka Streams processing logic (i.e. without IQ), it's better to make
> all processing logic rely on point queries rather than range queries.
> Right now the only processors that use range queries are, as mentioned above,
> windowed stream-stream joins. I think we should consider using a different
> window implementation for this (and as a result also get rid of the
> retainDuplicate flags) to refactor the windowed stream-stream join operation.
>
> 2. With 1), range queries would only be exposed as IQ. Depending on its usage
> frequency I think it makes a lot of sense to optimize for single-point queries.
> Of course, even without step 1) we should still consider using tree-map for
> windowed in-memory stores to have a better scaling effect.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
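
To make the point-query versus range-query distinction in the ticket description concrete, here is a minimal sketch using only the JDK. It is not the Kafka Streams store code: the class name `SortedMapQueries`, the helper methods, and the `Long`/`String` key and value types are hypothetical choices for illustration. Both `TreeMap` and `ConcurrentSkipListMap` implement `NavigableMap`, so the query code is identical; the sketch says nothing about relative performance, which would have to be profiled.

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration class, not part of Kafka Streams.
class SortedMapQueries {

    // Point query: look up exactly one key.
    static String pointQuery(NavigableMap<Long, String> store, long key) {
        return store.get(key);
    }

    // Range query: copy every entry with from <= key <= to into a new map.
    // The copy is an assumption made here so the result is safe to read
    // even when the backing map (e.g. a plain TreeMap) is modified later.
    static NavigableMap<Long, String> rangeQuery(NavigableMap<Long, String> store,
                                                 long from, long to) {
        return new TreeMap<>(store.subMap(from, true, to, true));
    }

    // Fill a store with keys 0..9 and values "v0".."v9".
    static NavigableMap<Long, String> fill(NavigableMap<Long, String> store) {
        for (long k = 0; k < 10; k++) {
            store.put(k, "v" + k);
        }
        return store;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> tree = fill(new TreeMap<>());
        NavigableMap<Long, String> skipList = fill(new ConcurrentSkipListMap<>());

        // The same calls work against either implementation.
        System.out.println(pointQuery(tree, 3L));        // v3
        System.out.println(pointQuery(skipList, 3L));    // v3
        System.out.println(rangeQuery(tree, 2L, 4L));     // {2=v2, 3=v3, 4=v4}
        System.out.println(rangeQuery(skipList, 2L, 4L)); // {2=v2, 3=v3, 4=v4}
    }
}
```

Note that `TreeMap` is not thread-safe, so a store built on it would need external synchronization or a copy (as in `rangeQuery` above) wherever a range view can be read while another thread writes.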
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019772#comment-17019772 ]

Sophie Blee-Goldman commented on KAFKA-9455:

[~guozhang] I agree with your added thoughts; it makes more sense to optimize for point queries and shouldn't necessarily be blocked on further work/optimization for the windowed join. And scaling well is particularly important for in-memory stores which do not have a bounded size/number of elements (beyond what's actually available per instance of course).

> Consider using TreeMap for in-memory stores of Streams
> --
>
> Key: KAFKA-9455
> URL: https://issues.apache.org/jira/browse/KAFKA-9455
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Guozhang Wang
> Priority: Major
> Labels: newbie++
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027988#comment-17027988 ]

highluck commented on KAFKA-9455:
-

[~guozhang] Do I need a KIP?
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029231#comment-17029231 ]

Sophie Blee-Goldman commented on KAFKA-9455:

This ticket just involves an implementation detail, so no KIP is necessary.
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029848#comment-17029848 ]

highluck commented on KAFKA-9455:
-

[~ableegoldman] Thanks for the explanation!

[~guozhang] Do you mind if I try? Thanks!
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030057#comment-17030057 ]

Sophie Blee-Goldman commented on KAFKA-9455:

[~high.lee] I'm pretty certain Guozhang isn't planning to work on this anytime soon, so feel free to pick it up.
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030153#comment-17030153 ]

Guozhang Wang commented on KAFKA-9455:
--

[~high.lee] Sure, please feel free to pick it up!

> Consider using TreeMap for in-memory stores of Streams
> --
>
> Key: KAFKA-9455
> URL: https://issues.apache.org/jira/browse/KAFKA-9455
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Guozhang Wang
> Assignee: highluck
> Priority: Major
> Labels: newbie++
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034478#comment-17034478 ]

highluck commented on KAFKA-9455:
-

[~guozhang] I have a question. Are you referring to the following form of point queries?

"WindowStoreIterator fetch(final Bytes key)"

Thank you!
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034889#comment-17034889 ]

Guozhang Wang commented on KAFKA-9455:
--

From the ReadOnlyWindowStore interface:

{code}
V fetch(K key, long time);                                                              // single-point query
WindowStoreIterator<V> fetch(K key, Instant from, Instant to);                          // range-query
KeyValueIterator<Windowed<K>, V> fetch(K from, K to, Instant fromTime, Instant toTime); // range-query
KeyValueIterator<Windowed<K>, V> all();                                                 // range-query
KeyValueIterator<Windowed<K>, V> fetchAll(Instant from, Instant to);                    // range-query
{code}

The other APIs are deprecated. Hope this helps?
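
The split between the single-point `fetch(key, time)` and the range `fetch(key, from, to)` in the interface listing above can be sketched with a toy store backed by nested `TreeMap`s. This is only an illustration under stated assumptions: `ToyWindowStore`, its `String` keys and values, the use of plain `long` timestamps instead of `Instant`, and the `List<String>` return type are all hypothetical, and the class is neither the actual `InMemoryWindowStore` nor thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy store for illustration only; not a Kafka Streams class.
class ToyWindowStore {

    // window start timestamp -> (record key -> value)
    private final TreeMap<Long, TreeMap<String, String>> windows = new TreeMap<>();

    void put(String key, String value, long windowStart) {
        windows.computeIfAbsent(windowStart, t -> new TreeMap<>()).put(key, value);
    }

    // Single-point query: one key in one window, two map lookups.
    String fetch(String key, long windowStart) {
        TreeMap<String, String> window = windows.get(windowStart);
        return window == null ? null : window.get(key);
    }

    // Range query: one key across all windows with from <= start <= to.
    // Results are returned as "windowStart=value" strings in time order.
    List<String> fetch(String key, long from, long to) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, TreeMap<String, String>> e
                : windows.subMap(from, true, to, true).entrySet()) {
            String value = e.getValue().get(key);
            if (value != null) {
                result.add(e.getKey() + "=" + value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ToyWindowStore store = new ToyWindowStore();
        store.put("a", "1", 100L);
        store.put("a", "2", 200L);
        store.put("b", "9", 200L);
        store.put("a", "3", 300L);

        System.out.println(store.fetch("a", 200L));       // 2
        System.out.println(store.fetch("a", 100L, 200L)); // [100=1, 200=2]
    }
}
```

The range query returns a materialized list rather than a live iterator; that is a simplification chosen here so the sketch does not have to address concurrent modification of the underlying `TreeMap`.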
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034913#comment-17034913 ]

highluck commented on KAFKA-9455:
-

@Guozhang Wang Thanks! Do you want to add a new point API?
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034923#comment-17034923 ]

Guozhang Wang commented on KAFKA-9455:
--

No, we do not need a new point API; the existing one should be good enough.

-- Guozhang
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034927#comment-17034927 ]

highluck commented on KAFKA-9455:
-

@Guozhang Thanks! It was helpful!
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035117#comment-17035117 ]

highluck commented on KAFKA-9455:
-

[~guozhang] I have one more question: what do you think about splitting a WindowStore into two stores?
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036300#comment-17036300 ]

highluck commented on KAFKA-9455:
---------------------------------

[~guozhang] I'm not sure I understand. I'm trying to replace JoinWindowStore and the existing InMemoryWindowStore with TreeMap. What do you think?

single-point query -> WindowStore with TreeMap
range-query -> JoinWindowStore

If that's not what you have in mind, please give me a hint.
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043797#comment-17043797 ]

Guozhang Wang commented on KAFKA-9455:
--------------------------------------

I'm actually considering that we should use tree-map for all in-memory time-windowed stores, independent of what queries they may be accessed for.

What do you mean by `JoinWindowStore`? I think we do not have a specific store-type just for windowed stream-stream join?
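For readers following the thread, the two access patterns under discussion can be sketched roughly as below. This is only an illustration, not the actual InMemoryWindowStore code: the class name `WindowedMapSketch`, the layout (window timestamp -> key -> value) and all method names are assumptions made up for this example. Both `TreeMap` and `ConcurrentSkipListMap` implement `NavigableMap`, so the same point and range lookups work against either; note that `TreeMap` is not thread-safe, so concurrent readers (e.g. interactive queries) would need external synchronization.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch; names and layout are illustrative assumptions only.
final class WindowedMapSketch {

    // Window timestamp -> (record key -> value), backed by either map type.
    static NavigableMap<Long, Map<String, String>> newSegments(boolean concurrent) {
        if (concurrent) {
            return new ConcurrentSkipListMap<>();
        }
        return new TreeMap<>();
    }

    static void put(NavigableMap<Long, Map<String, String>> segments,
                    long timestamp, String key, String value) {
        segments.computeIfAbsent(timestamp, t -> new HashMap<>()).put(key, value);
    }

    // Point query: one key at one exact window timestamp (the aggregation case).
    static String fetch(NavigableMap<Long, Map<String, String>> segments,
                        String key, long timestamp) {
        Map<String, String> window = segments.get(timestamp);
        return window == null ? null : window.get(key);
    }

    // Range query: one key over [from, to], both inclusive (the windowed join case).
    static List<String> fetchRange(NavigableMap<Long, Map<String, String>> segments,
                                   String key, long from, long to) {
        List<String> values = new ArrayList<>();
        for (Map<String, String> window : segments.subMap(from, true, to, true).values()) {
            String value = window.get(key);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Map<String, String>> segments = newSegments(false);
        put(segments, 10L, "a", "v10");
        put(segments, 20L, "a", "v20");
        put(segments, 30L, "b", "w30");
        System.out.println(fetch(segments, "a", 20L));
        System.out.println(fetchRange(segments, "a", 0L, 25L));
    }
}
```

Swapping `newSegments(false)` for `newSegments(true)` leaves the query code unchanged, which is why the choice of backing map can be profiled separately from the store interface.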
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044126#comment-17044126 ]

highluck commented on KAFKA-9455:
---------------------------------

[~guozhang] thank you! `JoinWindowStore` is my mistake
[jira] [Commented] (KAFKA-9455) Consider using TreeMap for in-memory stores of Streams
[ https://issues.apache.org/jira/browse/KAFKA-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044128#comment-17044128 ]

ASF GitHub Bot commented on KAFKA-9455:
---------------------------------------

highluck commented on pull request #8163: KAFKA-9455; Consider using TreeMap for in-memory stores of Streams
URL: https://github.com/apache/kafka/pull/8163

https://issues.apache.org/jira/browse/KAFKA-9455

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org