[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel

JIRA Sat, 18 Nov 2017 04:48:48 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258054#comment-16258054
 ]


Andrés de la Peña commented on CASSANDRA-12245:
-----------------------------------------------

Thanks for the comments, this is almost finished :)

[Here|https://github.com/apache/cassandra/compare/trunk...adelapena:12245-trunk]
 is the new version of the patch, rebased and squashed. The updated dtests can 
be found 
[here|https://github.com/apache/cassandra-dtest/compare/master...adelapena:12245].

bq. One minor thing is that we should probably only split the view build tasks 
at all if the base table is larger than a given size (let's say 500MB or so?), 
to avoid 4 * num_processor flushes for base tables with negligible size, WDYT?

As discussed, I have moved the base table flush from {{ViewBuilderTask}} to 
{{ViewBuilder}} 
[here|https://github.com/adelapena/cassandra/commit/478ed88b490378caf4f8ddc82c8e3aa3f90e5264],
 to do a single flush at the beginning of the view build. Subsequent writes 
will be written to the MV through the regular path, so it seems that they won't 
need any further flushes. I think that with this we don't need to check the 
table size and give special treatment to small ones, what do you think?

bq. I noticed we don't stop in-progress view builds when a view is removed, 
would you mind adding that?

Right, good catch. Done 
[here|https://github.com/adelapena/cassandra/commit/e1ace2f47be71d48ab1987d0e2c7a07cc9486e97].
 I have also added [this 
dtest|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L1025-L1067]
 to verify that the build is properly stopped.

bq. ViewBuildExecutor is being constructed with minThreads=1 and 
maxPoolSize=concurrent_materialized_view_builders, but according to the 
{{DebuggableThreadPoolExecutor}}'s 
[javadoc|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/concurrent/DebuggableThreadPoolExecutor.java#L33],
 this will actually make the executor with size 1 since maxPoolSize is not 
supported by {{DebuggableThreadPoolExecutor}} - and even if it were, new 
threads would only be created after the queue of the initial threads were full 
(which is quite unintuitive), but we actually want the pool to have 
concurrent_materialized_view_builders concurrent threads at most, so we should 
use the {{threadCount}} constructor instead - at some point we should actually 
remove the maximumPoolSize

Done 
[here|https://github.com/adelapena/cassandra/commit/fc14b034bb5d36c23435f313541445dc5adb0078].
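
For illustration, the queueing behaviour described in the quoted comment can be reproduced with the plain JDK {{ThreadPoolExecutor}}. This is only a minimal sketch, not the patch code: it does not use Cassandra's {{DebuggableThreadPoolExecutor}}, and the class and method names ({{PoolSizeDemo}}, {{threadsStarted}}) are hypothetical. With an unbounded queue, a pool created with a core size of 1 never grows towards its maximum size, whereas a pool whose core and maximum sizes are equal starts the expected number of threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, plain JDK only (assumption: not part of the patch).
public class PoolSizeDemo
{
    // Submits `tasks` blocking tasks to a pool backed by an unbounded queue
    // and returns how many worker threads the pool actually started.
    public static int threadsStarted(int corePoolSize, int maximumPoolSize, int tasks)
    throws InterruptedException
    {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(corePoolSize, maximumPoolSize,
                                                             60, TimeUnit.SECONDS,
                                                             new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++)
        {
            executor.execute(() -> {
                try
                {
                    // Keep the worker busy so later tasks must queue or spawn a thread.
                    release.await();
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int started = executor.getPoolSize();

        // Let the tasks finish and shut the pool down cleanly.
        release.countDown();
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        return started;
    }

    public static void main(String[] args) throws InterruptedException
    {
        // core=1, max=4: extra tasks are queued, so only one thread is started.
        System.out.println(threadsStarted(1, 4, 8));
        // core=4, max=4: four threads are started before tasks begin to queue.
        System.out.println(threadsStarted(4, 4, 8));
    }
}
```

The second call mirrors the idea behind using a fixed thread count: the pool size equals the configured concurrency instead of depending on the queue filling up.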

bq. I think we could take a {{buildAllViews}} parameter on reload, and set that 
to false during Keyspace initialization, since views will be built during 
daemon initialization and keyspace changes anyway, WDYT?

Makes sense, done 
[here|https://github.com/adelapena/cassandra/commit/c4f19a5461434c0d5ca5e1301d92da26cca5083e].

bq. One last thing, can you please add the new yaml option 
{{concurrent_materialized_view_builders}} to the configuration section of the 
doc?

It seems that [the configuration 
section|https://github.com/apache/cassandra/blob/trunk/doc/source/configuration/index.rst]
 of the doc is currently empty. I think that writing this section (structure, 
introduction, etc.) is probably out of the scope of this ticket and it might be 
done in a separate, dedicated ticket. Instead, I have 
[updated|https://github.com/adelapena/cassandra/commit/30293f852584189a5b46c2dce5ae4042ae62d3e4]
 the NEWS.txt file with more detailed info and I have added [a 
note|https://github.com/adelapena/cassandra/commit/82c446398d0b6b4b1b13b35b3502489fc71fe703]
 to the doc about the {{CREATE MATERIALIZED VIEW}} statement. WDYT?

I have updated the dtest {{interrupt_build_process_test}} to make sure that the 
build is really interrupted also in 3.x through [new byteman 
scripts|https://github.com/adelapena/cassandra-dtest/blob/f7aac39ee5d0c661b2f7f5b1db2a7347635f85c5/materialized_views_test.py#L962-L963].
 Without that, the build could finish before the cluster stops.

The CI results look good, at least for MVs:
||[utest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-testall/]||[dtest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-dtest/]||

> initial view build can be parallel
> ----------------------------------
>
>                 Key: CASSANDRA-12245
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Tom van der Woerdt
>            Assignee: Andrés de la Peña
>             Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several 
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized :
>  * do vnodes in parallel, instead of going through the entire range in one 
> thread
>  * just iterate through sstables, not worrying about duplicates, and include 
> the timestamp of the original write in the MV mutation. since this doesn't 
> exclude duplicates it does increase the amount of work and could temporarily 
> surface ghost rows (yikes) but I guess that's why they call it eventual 
> consistency. doing it this way can avoid holding references to all tables on 
> disk, allows parallelization, and removes the need to check other sstables 
> for existing data. this is essentially the 'do a full repair' path



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
