[jira] [Commented] (CASSANDRA-12201) Burst Hour Compaction Strategy

2017-06-14 Thread Pedro Gordo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049671#comment-16049671
 ] 

Pedro Gordo commented on CASSANDRA-12201:
-

I've squashed several commits and added everything to a fork of the official 
Cassandra repository. You can find it here: 
https://github.com/sedulam/cassandra/tree/12201
You should now be able to compare my changes against the code base easily. 
Please let me know if this should be done differently.

> Burst Hour Compaction Strategy
> --
>
> Key: CASSANDRA-12201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
> Project: Cassandra
> Issue Type: New Feature
> Components: Compaction
> Reporter: Pedro Gordo
> Attachments: BHCS outline.pdf
>
> Original Estimate: 1,008h
> Remaining Estimate: 1,008h
>
> The motivation for this strategy is to take advantage of periods of the 
> day when there is less I/O on the cluster. This period will be called 
> “Burst Hour” (BH), and hence the strategy will be named “Burst Hour 
> Compaction Strategy” (BHCS). 
> The following process would run during BH:
> 1. Read all the SSTables and detect which partition keys are present in more 
> SSTables than the minimum compaction threshold.
> 2. Gather all the SSTables whose keys are also present in other SSTables, 
> requiring the number of copies of a key to be at least the minimum 
> compaction threshold. 
> 3. Repeat step 2 until the bucket of gathered SSTables reaches the maximum 
> compaction threshold (32 by default), or until we've searched all the keys.
> 4. The compaction itself will be done by MaxSSTableSizeWriter. The 
> compacted SSTables will have a maximum size equal to the configurable value 
> of max_sstable_size (100MB by default). 
> The maximal compaction task (triggered by the nodetool compact command) 
> performs exactly the same operation as the background compaction task, 
> except that it can be triggered outside of the Burst Hour.
> This strategy tries to address three issues of the existing compaction 
> strategies:
> - Due to max_sstable_size_limit, there's no need to reserve disk space for a 
> huge compaction.
> - The number of SSTables that we need to read from to answer a read query 
> will be kept consistently low and is controllable through the 
> referenced_sstable_limit property.
> - It removes the dependency on continuous high I/O.
> Possible future improvements:
> - Continuously evaluate the number of pending compactions and the I/O 
> status, and decide based on that whether to start the compaction.
> - If, during the day, the total size of all the SSTables in a family set 
> reaches a certain maximum, background compaction can occur anyway. This 
> maximum should be set high because of the high CPU usage of BHCS.
> - Make it possible to set several compaction time intervals, instead of just 
> one.
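
As a rough illustration of steps 1 to 3 of the description above, the 
bucket-gathering pass might be sketched in Python as follows. This is not the 
code from the fork: the function name build_bucket, the dict-of-sets stand-in 
for SSTables, and the reading of the qualifying condition as "at least 
min_threshold SSTables" are all assumptions made for the example. The default 
of 4 mirrors Cassandra's usual min_threshold default; 32 is the maximum 
threshold quoted in the description.

```python
from collections import defaultdict


def build_bucket(sstables, min_threshold=4, max_threshold=32):
    """Pick SSTables to compact together (illustrative sketch only).

    sstables: dict mapping a hypothetical SSTable name to the set of
    partition keys it contains.
    Returns a list of SSTable names, at most max_threshold long.
    """
    # Step 1: index every partition key by the SSTables that contain it.
    key_to_tables = defaultdict(set)
    for name, keys in sstables.items():
        for key in keys:
            key_to_tables[key].add(name)

    bucket = []
    seen = set()
    # Steps 2-3: walk the keys, gathering the SSTables of every key that is
    # spread over enough SSTables, until the bucket is full or keys run out.
    for key in sorted(key_to_tables):
        tables = key_to_tables[key]
        if len(tables) < min_threshold:  # assumption: ">=" qualifies
            continue
        for name in sorted(tables):
            if name in seen:
                continue
            if len(bucket) == max_threshold:
                return bucket
            seen.add(name)
            bucket.append(name)
    return bucket
```

Calling build_bucket with a mapping such as {"a": {"k1"}, "b": {"k1"}} and 
min_threshold=2 would return both SSTable names, since k1 appears in two 
SSTables. Keys and names are sorted only to make the sketch deterministic.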



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12201) Burst Hour Compaction Strategy

2017-05-05 Thread Carlos Rolo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997942#comment-15997942
 ] 

Carlos Rolo commented on CASSANDRA-12201:
-

Hello Pedro,

See whether this could help you with testing: 
https://lists.apache.org/thread.html/05d9a7dcaa29b44608d9ddc818db4d23d9cf441634ee1f2110274ecd@%3Cdev.cassandra.apache.org%3E

> Burst Hour Compaction Strategy
> --
>
> Key: CASSANDRA-12201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
> Project: Cassandra
> Issue Type: New Feature
> Components: Compaction
> Reporter: Pedro Gordo
> Original Estimate: 1,008h
> Remaining Estimate: 1,008h
>
> Although it may be subject to change, for the moment I plan to create a 
> strategy that revolves around taking advantage of periods of the day 
> when there's less I/O on the cluster. This period will be called 
> “Burst Hour” (BH), and hence the strategy will be named “Burst Hour 
> Compaction Strategy” (BHCS). 
> The following process would run during BH:
> 1. Read all the SSTables and detect which partition keys are present in more 
> SSTables than a configurable value, which I'll call 
> referenced_sstable_limit. This value will be three by default.
> 2. Group all the repeated keys with a reference to the SSTables containing 
> them.
> 3. Calculate the total size of the SSTables which will be merged for the 
> first partition key on the list created in step 2. If the calculated size is 
> bigger than a property which I'll call max_sstable_size (also configurable), 
> more than one SSTable will be created in step 4.
> 4. During the merge, the data will be streamed from the SSTables until the 
> output has a size close to max_sstable_size. At that point the stream is 
> paused and the new SSTable is closed, becoming immutable. Repeat the 
> streaming process until we've merged all SSTables for the partition key that 
> we're iterating over.
> 5. Cycle through the rest of the collection created in step 2 and remove any 
> SSTables which don't exist anymore because they were merged in step 4. An 
> alternative course of action here would be, instead of removing the 
> SSTable from the collection, to change its reference to the SSTable(s) which 
> were created in step 4. 
> 6. Repeat steps 3 to 5 until we have traversed the entirety of the 
> collection created in step 2.
> This strategy addresses three issues of the existing compaction strategies:
> - Due to max_sstable_size_limit, there's no need to reserve disk space for a 
> huge compaction, as can happen with STCS.
> - The number of SSTables that we need to read from to answer a read query 
> will be kept consistently low and is controllable through the 
> referenced_sstable_limit property. This addresses the scenario in STCS where 
> we might have to read from a lot of SSTables.
> - It removes the dependency on the continuous high I/O of LCS.
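
The size-bounded merge described in steps 3 and 4 above could look roughly like 
the following Python sketch. It is only an illustration under simplifying 
assumptions, not the proposed implementation: each SSTable is modelled as a 
list of hypothetical (key, timestamp, value) rows sorted by key, rows for the 
same key are reconciled by keeping the newest timestamp, and the size of an 
output SSTable is approximated by the total length of its values in bytes.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_with_size_cap(sstables, max_sstable_size):
    """Merge sorted row lists into outputs of roughly max_sstable_size bytes.

    sstables: list of lists of (key, timestamp, value) tuples, each list
    sorted by key; value is a bytes object.
    Returns a list of output "SSTables" (lists of rows).
    """
    # Stream the rows of all inputs in key order without loading them twice.
    merged = heapq.merge(*sstables, key=itemgetter(0))

    outputs, current, current_size = [], [], 0
    for _key, versions in groupby(merged, key=itemgetter(0)):
        # Assumption: last write wins for the whole row.
        newest = max(versions, key=itemgetter(1))
        current.append(newest)
        current_size += len(newest[2])
        # Once the output is close to the cap, close it and start a new one.
        if current_size >= max_sstable_size:
            outputs.append(current)
            current, current_size = [], 0
    if current:
        outputs.append(current)
    return outputs
```

Because an output is closed only after the row that crosses the cap has been 
written, each output can exceed max_sstable_size by at most one row, which 
matches the "size close to max_sstable_size" wording of step 4.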






[jira] [Commented] (CASSANDRA-12201) Burst Hour Compaction Strategy

2017-05-05 Thread Pedro Gordo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997921#comment-15997921
 ] 

Pedro Gordo commented on CASSANDRA-12201:
-

So far I've implemented the abstract methods from the superclass and did some 
testing at the beginning of this week; at least the background task 
generation seems to be working correctly. 
Now I need to figure out how to introduce the timers correctly in BHCS. You can 
find the implementation for BHCS here: 
https://github.com/sedulam/CASSANDRA-12201
This is all still untested. I'll try to start testing next week, but I still 
need to check what the guidelines are for testing Cassandra, the code style, 
and any other constraints that might exist.







[jira] [Commented] (CASSANDRA-12201) Burst Hour Compaction Strategy

2017-02-06 Thread Pedro Gordo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854024#comment-15854024
 ] 

Pedro Gordo commented on CASSANDRA-12201:
-

I was unable to start this last year because, just as I was about to, I 
suffered a wrist injury which prevented me from working for more than six 
months. I'm now resuming work on this, although I'll still spend a few days 
getting up to speed with C*.

I studied the data structures of Cassandra 2.0, but from what I know there 
were significant changes in 3.0, so I now need to decide which version to 
work on. Let me know your opinion on this, please.



