[jira] [Commented] (KAFKA-8094) Iterating over cache with get(key) is inefficient

2019-03-11 Thread Sophie Blee-Goldman (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790070#comment-16790070
 ] 

Sophie Blee-Goldman commented on KAFKA-8094:


After some investigation, the solution is not as straightforward as simply 
replacing TreeMap -> ConcurrentSkipListMap and using a subMap iterator. 
Iterators are only weakly consistent, and changes to the underlying map are not 
guaranteed to be reflected in the iterator. Implementing it this way may cause, 
for example, evicted entries to still be returned by the iterator (see 
ThreadCacheTest#shouldSkipEntriesWhereValueHasBeenEvictedFromCache)

> Iterating over cache with get(key) is inefficient 
> --
>
> Key: KAFKA-8094
> URL: https://issues.apache.org/jira/browse/KAFKA-8094
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: streams
>
> Currently, range queries in the caching layer are implemented by creating an 
> iterator over the subset of keys in the range, and calling get() on the 
> underlying TreeMap for each key. While this protects against 
> ConcurrentModificationException, we can improve performance by replacing the 
> TreeMap with a concurrent data structure such as ConcurrentSkipListMap and 
> then just iterating over a subMap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8094) Iterating over cache with get(key) is inefficient

2019-03-11 Thread Sophie Blee-Goldman (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790080#comment-16790080
 ] 

Sophie Blee-Goldman commented on KAFKA-8094:


That said, maybe this is another occasion to question whether a returned 
iterator *should* reflect the current state of the cache/store, or the state at 
the time it was created (ie when it was queried) as a snapshot.

 

Personally I still believe the snapshot is more appropriate, and if it allows 
us to make this improvement am all the more in favor of it (of course this 
might not be a *huge* improvement as it only saves us a factor of log(N) ) . 
WDYT [~guozhang]

> Iterating over cache with get(key) is inefficient 
> --
>
> Key: KAFKA-8094
> URL: https://issues.apache.org/jira/browse/KAFKA-8094
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: streams
>
> Currently, range queries in the caching layer are implemented by creating an 
> iterator over the subset of keys in the range, and calling get() on the 
> underlying TreeMap for each key. While this protects against 
> ConcurrentModificationException, we can improve performance by replacing the 
> TreeMap with a concurrent data structure such as ConcurrentSkipListMap and 
> then just iterating over a subMap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8094) Iterating over cache with get(key) is inefficient

2019-03-12 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790766#comment-16790766
 ] 

Guozhang Wang commented on KAFKA-8094:
--

[~ableegoldman] I'd agree with you:

1) the underlying store's iterator from built-in rocksDB is also on a snapshot 
of the store, so if more updates are applied by the stream thread, while the IQ 
caller thread is iterating the store, it will not be reflected either.
2) hence for the caching store, if some entries are added after the iterator's 
snapshot, it should be fine since as 1) dictates, the underlying store does not 
reflect latest image either; if some entries are evicted after the iterator's 
snapshot, then the merge-iterator of the cache / underlying should be able to 
cover it as well (it favors the entry in the cache iterator for the same keys).

> Iterating over cache with get(key) is inefficient 
> --
>
> Key: KAFKA-8094
> URL: https://issues.apache.org/jira/browse/KAFKA-8094
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: streams
>
> Currently, range queries in the caching layer are implemented by creating an 
> iterator over the subset of keys in the range, and calling get() on the 
> underlying TreeMap for each key. While this protects against 
> ConcurrentModificationException, we can improve performance by replacing the 
> TreeMap with a concurrent data structure such as ConcurrentSkipListMap and 
> then just iterating over a subMap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8094) Iterating over cache with get(key) is inefficient

2019-03-12 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790767#comment-16790767
 ] 

Guozhang Wang commented on KAFKA-8094:
--

I've not carefully looked into the unit tests that would be affected by this 
change, if you can have a quick PR to reflect the impact we can discuss more.

> Iterating over cache with get(key) is inefficient 
> --
>
> Key: KAFKA-8094
> URL: https://issues.apache.org/jira/browse/KAFKA-8094
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: streams
>
> Currently, range queries in the caching layer are implemented by creating an 
> iterator over the subset of keys in the range, and calling get() on the 
> underlying TreeMap for each key. While this protects against 
> ConcurrentModificationException, we can improve performance by replacing the 
> TreeMap with a concurrent data structure such as ConcurrentSkipListMap and 
> then just iterating over a subMap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8094) Iterating over cache with get(key) is inefficient

2019-03-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790849#comment-16790849
 ] 

ASF GitHub Bot commented on KAFKA-8094:
---

ableegoldman commented on pull request #6433: KAFKA-8094: Iterating over cache 
with get(key) is inefficient
URL: https://github.com/apache/kafka/pull/6433
 
 
   Use concurrent data structure for the underlying cache in NamedCache, and 
iterate over it with subMap instead of many calls to get()
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Iterating over cache with get(key) is inefficient 
> --
>
> Key: KAFKA-8094
> URL: https://issues.apache.org/jira/browse/KAFKA-8094
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: streams
>
> Currently, range queries in the caching layer are implemented by creating an 
> iterator over the subset of keys in the range, and calling get() on the 
> underlying TreeMap for each key. While this protects against 
> ConcurrentModificationException, we can improve performance by replacing the 
> TreeMap with a concurrent data structure such as ConcurrentSkipListMap and 
> then just iterating over a subMap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8094) Iterating over cache with get(key) is inefficient

2019-03-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796215#comment-16796215
 ] 

ASF GitHub Bot commented on KAFKA-8094:
---

bbejeck commented on pull request #6433: KAFKA-8094: Iterating over cache with 
get(key) is inefficient
URL: https://github.com/apache/kafka/pull/6433
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Iterating over cache with get(key) is inefficient 
> --
>
> Key: KAFKA-8094
> URL: https://issues.apache.org/jira/browse/KAFKA-8094
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: streams
>
> Currently, range queries in the caching layer are implemented by creating an 
> iterator over the subset of keys in the range, and calling get() on the 
> underlying TreeMap for each key. While this protects against 
> ConcurrentModificationException, we can improve performance by replacing the 
> TreeMap with a concurrent data structure such as ConcurrentSkipListMap and 
> then just iterating over a subMap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)