[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171536#comment-16171536 ]
Paulo Motta commented on CASSANDRA-12245: ----------------------------------------- Finally getting to this after a while, sorry for the delay. Thanks for the update! Had another look at the patch and this is looking much better, see some follow-up comments below: bq. I have moved the methods to split the ranges to the Splitter, reusing its valueForToken method. Tests here. Awesome, looks much better now! It seems like the way the number of tokens in a range was computed by {{abs(range.right - range.left)}} may not work correctly for some wrap-around cases, as shown by [this test case|https://github.com/pauloricardomg/cassandra/blob/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237/test/unit/org/apache/cassandra/dht/SplitterTest.java#L184]. Even though this shouldn't break when local ranges are used , I fixed it on [this commit|https://github.com/pauloricardomg/cassandra/commit/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237] to make sure split works correctly for wrap-around ranges. Can you confirm this is correct? Other than that, it seems like you added unit tests only for {{Murmur3Partitioner}}, would you mind extending {{testSplit()}} to {{RandomPartitioner}}? bq. Agree. I have added a new dedicated executor in the CompactionManager, similar to the executors used for validation and cache cleanup. The concurrency of this executor is determined by the new config property concurrent_materialized_view_builders, which defaults to a perhaps too much conservative value of 1. This property can be modified through both JMX and the new setconcurrentviewbuilders and getconcurrentviewbuilders nodetool commands. These commands are tested here. I think having a dedicated executor will ensure view building doesn't compete with compactions for the compaction executor, good job! One problem I see though is that if the user is finding its view building slow it will try to increase the number of concurrent view builders via nodetool, but it will have no effect since the range was split in the previously number of concurrent view builders. Given this will be a pretty common scenario for large datasets, how about splitting the range in multiple smaller tasks, so that if the user increases {{concurrent_view_builders}} the other tasks immediately start executing? We could use a simple approach of splitting the local range in let's say 1000 hard-coded parts, or be smarter and make each split have ~100MB or so. In this way we can keep {{concurrent_materialized_view_builders=1}} by default, and users with large base tables are able to increase it and see immediate effect via nodetool. WDYT? bq. I would prefer to do this in another ticket. Agreed. bq. I have moved the marking of system tables (and the retries in case of failure) from the ViewBuilderTask to the ViewBuilder, using a callback to do the marking. I think the code is clearer this way. Great, looks much cleaner indeed! One minor thing is that if there's a failure after some {{ViewBuildTasks}} were completed, it will resume that subtask from its last token while it already finished. Could we maybe set the last_token = end_token when the task is finished to flag it was already finished and avoid resuming the task when that is the case? bq. Updated here. It also uses a byteman script to make sure that the MV build isn't finished before stopping the cluster, which is more likely to happen. The dtest looks mostly good, except for the following nits: * {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can you set the configuration value during cluster setup phase (instead of setting via nodetool) to make sure the restarted view builds will be parallel? * can probably use {{self._wait_for_view("ks", "t_by_v")}} [here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L982] * We cannot ensure key 10000 was not built [here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L980] which may cause flakiness, so it's probably better to check for {{self.assertNotEqual(len(list(session.execute("SELECT count\(*\) FROM t_by_v;"))), 10000)}} or something like that. * It would be nice to check that the view build was actually removed on restart, by checking for the log entry {{Resuming view build for range}} bq. I'm not sure about if it still makes sense for the builder task to extend CompactionInfo.Holder. If so, I'm neither sure about how to use prevToken.size(range.right) (that returns a double) to create CompationInfo objects. WDYT? I think it's still useful to show view build progress via {{nodetool compactionstats}}, so I created a new method {{Splitter.positionInRange(Token token, Range<Token> range)}} which gives the position of a token relative to a range and used that to show view build progress when a splitter is present. When it's not (such as the case of {{ByterOrderedPartitioner}}, we fallback to the progress based on the keys estimate. This is implemented on [this commit|https://github.com/pauloricardomg/cassandra/commit/bcdbca614e4e049c3b0c965ae0985a2dea2939d1]. Please let me know what do you think about this approach. In addition to the suggestion above, I also made the following improvements: * Select only sstables that fall into the {{ViewBuildTask}} range ([commit|https://github.com/pauloricardomg/cassandra/commit/3694c63678f6872477531825914a89a072f8f67d]) * Simplify ViewBuilderTask loop ([commit|https://github.com/pauloricardomg/cassandra/commit/3c4e2ac65b7a3ca050450400392735fb3738ecd7]) * It's pretty rare but in the case of collisions it's possible for multiple keys to share the same token, so I updated the {{ViewBuilderTask}} loop to build all the keys sharing the same token ([commit|https://github.com/pauloricardomg/cassandra/commit/4ad0deb1de9d5dbc414c8d83d87256481e307d9e]) I created a [PR|https://github.com/adelapena/cassandra/pull/1] on your branch with the above suggestions. Even though the patch is looking good and has some dtest coverage, I feel that we are still missing some unit testing to have confidence this is working as desired and catch any subtle regression, given this is critical for correct MV functioning. With that said, it would be nice if we could test that {{ViewBuilderTask}} is correctly building a specific range and maybe extend {{ViewTest.testViewBuilderResume}} to test view building/resume with different number of concurrent view builders. What do you think? > initial view build can be parallel > ---------------------------------- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views > Reporter: Tom van der Woerdt > Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org