[ https://issues.apache.org/jira/browse/IGNITE-20610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grigory Domozhirov updated IGNITE-20610:
----------------------------------------
Description:

While the intention of [IGNITE-3828|https://issues.apache.org/jira/browse/IGNITE-3828] (Data streamer: use identity comparison for "activeKeys" in DataStreamerImpl.load0 method) is clear, it does not seem to work as expected if allowOverwrite == true and the same keys are added to a DataStreamer.

With each DataStreamer.addData() call, a new UserKeyCacheObjectImpl is created ([code|https://github.com/apache/ignite/blob/ceb22d20cab407b038570c81be022d7233a6e12d/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/binary/CacheObjectBinaryProcessorImpl.java#L1316]) for the key object and added to a GridConcurrentHashSet, wrapped in a DataStreamerImpl.KeyCacheObjectWrapper ([code|https://github.com/apache/ignite/blob/fd504159bf5bc1603dfd5eb149ab5d998d3bffb4/modules/core/src/main/java/org/apache/ignite/internal/processors/datastreamer/DataStreamerImpl.java#L729]). Since the wrapper's equals is overridden with an identity check, the `activeKeys` set ends up containing multiple wrappers that hold equal `UserKeyCacheObjectImpl`s.

1) Is that OK in general?
2) If yes, does using GridConcurrentHashSet for activeKeys make any sense, given that all its entries are always non-equal?
3) Since `KeyCacheObjectWrapper.hashCode` returns the actual key object's hashCode, the more often keys are repeated, the lower the performance, due to hash collisions between non-equal objects.

Here is a corner case:
{code:java}
try (Ignite ignite = Ignition.start(new IgniteConfiguration());
     IgniteCache<Integer, Long> cache = ignite.createCache("test");
     IgniteDataStreamer<Integer, String> dataStreamer = ignite.dataStreamer(cache.getName())
) {
    dataStreamer.allowOverwrite(true); // doesn't matter
    long start = System.currentTimeMillis();
    for (int i = 0; i < 5_000_000; i++) {
        dataStreamer.addData(i, ""); //unique keys
    }
    System.out.println(System.currentTimeMillis() - start);
}{code}
runs in 6029 ms.
{code:java}
try (Ignite ignite = Ignition.start(new IgniteConfiguration());
     IgniteCache<Integer, Long> cache = ignite.createCache("test");
     IgniteDataStreamer<Integer, String> dataStreamer = ignite.dataStreamer(cache.getName())
) {
    dataStreamer.allowOverwrite(true); // doesn't matter
    long start = System.currentTimeMillis();
    for (int i = 0; i < 5_000_000; i++) {
        dataStreamer.addData(0, ""); //equal key
    }
    System.out.println(System.currentTimeMillis() - start);
}{code}
runs in 29025 ms.

was:

While the intention of https://issues.apache.org/jira/browse/IGNITE-3828 (Data streamer: use identity comparison for "activeKeys" in DataStreamerImpl.load0 method) is clear, it does not seem to work as expected if allowOverwrite == true and the same keys are added to a DataStreamer.

With each DataStreamer.addData() call, a new UserKeyCacheObjectImpl is created ([code|https://github.com/apache/ignite/blob/ceb22d20cab407b038570c81be022d7233a6e12d/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/binary/CacheObjectBinaryProcessorImpl.java#L1316]) for the key object and added to a GridConcurrentHashSet, wrapped in a DataStreamerImpl.KeyCacheObjectWrapper ([code|https://github.com/apache/ignite/blob/fd504159bf5bc1603dfd5eb149ab5d998d3bffb4/modules/core/src/main/java/org/apache/ignite/internal/processors/datastreamer/DataStreamerImpl.java#L729]). Since the wrapper's equals is overridden with an identity check, the `activeKeys` set ends up containing multiple wrappers that hold equal `UserKeyCacheObjectImpl`s.

1) Is that OK in general?
2) If yes, does using GridConcurrentHashSet for activeKeys make any sense, given that all its entries are always non-equal?
3) Since `KeyCacheObjectWrapper.hashCode` returns the actual key object's hashCode, the more often keys are repeated, the lower the performance, due to hash collisions between non-equal objects.

Here is a corner case:
{code:java}
try (Ignite ignite = Ignition.start(new IgniteConfiguration());
     IgniteCache<Integer, Long> cache = ignite.createCache("test");
     IgniteDataStreamer<Integer, String> dataStreamer = ignite.dataStreamer(cache.getName())
) {
    dataStreamer.allowOverwrite(true); // doesn't matter
    long start = System.currentTimeMillis();
    for (int i = 0; i < 5_000_000; i++) {
        dataStreamer.addData(i, ""); //unique keys
    }
    System.out.println(System.currentTimeMillis() - start);
}{code}
runs in 6029 ms.
{code:java}
try (Ignite ignite = Ignition.start(new IgniteConfiguration());
     IgniteCache<Integer, Long> cache = ignite.createCache("test");
     IgniteDataStreamer<Integer, String> dataStreamer = ignite.dataStreamer(cache.getName())
) {
    dataStreamer.allowOverwrite(true); // doesn't matter
    long start = System.currentTimeMillis();
    for (int i = 0; i < 5_000_000; i++) {
        dataStreamer.addData(0, ""); //equal key
    }
    System.out.println(System.currentTimeMillis() - start);
}{code}
runs in 29025 ms.

> DataStreamer/KeyCacheObjectWrapper inefficiency for non-unique keys
> -------------------------------------------------------------------
>
>                 Key: IGNITE-20610
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20610
>             Project: Ignite
>          Issue Type: Task
>          Components: streaming
>    Affects Versions: 2.15
>            Reporter: Grigory Domozhirov
>            Priority: Minor
>
> While the intention of
> [IGNITE-3828|https://issues.apache.org/jira/browse/IGNITE-3828] (Data
> streamer: use identity comparison for "activeKeys" in DataStreamerImpl.load0
> method) is clear, it does not seem to work as expected if allowOverwrite ==
> true and the same keys are added to a DataStreamer.
> With each DataStreamer.addData() call, a new UserKeyCacheObjectImpl is
> created
> ([code|https://github.com/apache/ignite/blob/ceb22d20cab407b038570c81be022d7233a6e12d/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/binary/CacheObjectBinaryProcessorImpl.java#L1316])
> for the key object and added to a GridConcurrentHashSet, wrapped in a
> DataStreamerImpl.KeyCacheObjectWrapper
> ([code|https://github.com/apache/ignite/blob/fd504159bf5bc1603dfd5eb149ab5d998d3bffb4/modules/core/src/main/java/org/apache/ignite/internal/processors/datastreamer/DataStreamerImpl.java#L729]).
> Since the wrapper's equals is overridden with an identity check, the
> `activeKeys` set ends up containing multiple wrappers that hold equal
> `UserKeyCacheObjectImpl`s.
>
> 1) Is that OK in general?
> 2) If yes, does using GridConcurrentHashSet for activeKeys make any sense,
> given that all its entries are always non-equal?
> 3) Since `KeyCacheObjectWrapper.hashCode` returns the actual key object's
> hashCode, the more often keys are repeated, the lower the performance, due
> to hash collisions between non-equal objects.
>
> Here is a corner case:
> {code:java}
> try (Ignite ignite = Ignition.start(new IgniteConfiguration());
>      IgniteCache<Integer, Long> cache = ignite.createCache("test");
>      IgniteDataStreamer<Integer, String> dataStreamer =
>          ignite.dataStreamer(cache.getName())
> ) {
>     dataStreamer.allowOverwrite(true); // doesn't matter
>     long start = System.currentTimeMillis();
>     for (int i = 0; i < 5_000_000; i++) {
>         dataStreamer.addData(i, ""); //unique keys
>     }
>     System.out.println(System.currentTimeMillis() - start);
> }{code}
> runs in 6029 ms.
> {code:java}
> try (Ignite ignite = Ignition.start(new IgniteConfiguration());
>      IgniteCache<Integer, Long> cache = ignite.createCache("test");
>      IgniteDataStreamer<Integer, String> dataStreamer =
>          ignite.dataStreamer(cache.getName())
> ) {
>     dataStreamer.allowOverwrite(true); // doesn't matter
>     long start = System.currentTimeMillis();
>     for (int i = 0; i < 5_000_000; i++) {
>         dataStreamer.addData(0, ""); //equal key
>     }
>     System.out.println(System.currentTimeMillis() - start);
> }{code}
> runs in 29025 ms.
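
The effect described in points 1) to 3) can probably be reproduced without Ignite. The following is only a minimal sketch under stated assumptions: `Key` and `IdentityKeyWrapper` are hypothetical stand-ins for `UserKeyCacheObjectImpl` and `KeyCacheObjectWrapper` (modelled on the behaviour described in this issue, not copied from the Ignite sources), and `ConcurrentHashMap.newKeySet()` is used in place of `GridConcurrentHashSet`. It only counts set entries and distinct hash codes; it does not measure time.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Stand-alone sketch, not Ignite code: all class names here are hypothetical. */
final class IdentityWrapperDemo {
    /** Assumed stand-in for UserKeyCacheObjectImpl: value-based equals() and hashCode(). */
    static final class Key {
        final int val;

        Key(int val) { this.val = val; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).val == val;
        }

        @Override public int hashCode() { return Integer.hashCode(val); }
    }

    /** Assumed stand-in for KeyCacheObjectWrapper: identity equals(), hashCode() taken from the key. */
    static final class IdentityKeyWrapper {
        final Key key;

        IdentityKeyWrapper(Key key) { this.key = key; }

        @Override public boolean equals(Object o) {
            return o instanceof IdentityKeyWrapper && ((IdentityKeyWrapper) o).key == key;
        }

        @Override public int hashCode() { return key.hashCode(); }
    }

    /** Wraps n freshly created keys (one new object per add) and returns the resulting set size. */
    static int entriesAfterAdding(int n, boolean uniqueKeys) {
        Set<IdentityKeyWrapper> activeKeys = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < n; i++)
            activeKeys.add(new IdentityKeyWrapper(new Key(uniqueKeys ? i : 0)));
        return activeKeys.size();
    }

    /** Counts how many different hash codes the same n wrappers produce. */
    static int distinctHashCodes(int n, boolean uniqueKeys) {
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i < n; i++)
            hashes.add(new IdentityKeyWrapper(new Key(uniqueKeys ? i : 0)).hashCode());
        return hashes.size();
    }

    public static void main(String[] args) {
        int n = 2_000;
        System.out.println("unique keys:  entries=" + entriesAfterAdding(n, true)
            + ", hash codes=" + distinctHashCodes(n, true));
        System.out.println("repeated key: entries=" + entriesAfterAdding(n, false)
            + ", hash codes=" + distinctHashCodes(n, false));
    }
}
```

In this sketch the set holds n entries in both cases, because no two wrappers are ever equal. With the repeated key, however, all n wrappers share a single hash code, so they all collide in the same bin of the hash table, which is consistent with the slowdown reported above.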