Re: How to free up space on disc after removing entries from IgniteCache with enabled PDS?
Denis, I like the idea that defragmentation is just an additional step on a node (re)start, like we perform PDS recovery now. We may just use a special key to specify that a node should defragment its persistence on (re)start. Defragmentation can be part of a Rolling Upgrade in this case :) It does not seem to be a problem to restart nodes one by one; this will "eat" only one backup guarantee. On Mon, Oct 7, 2019 at 8:28 PM Denis Magda wrote: > Alex, thanks for the summary and proposal. Anton, Ivan and others who took > part in this discussion, what're your thoughts? I see this > rolling-upgrades-based approach as a reasonable solution. Even though a > node shutdown is expected, the procedure doesn't lead to a cluster outage, > meaning it can be utilized for 24x7 production environments. > > - > Denis > > > On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk < > alexey.goncha...@gmail.com> > wrote: > > > Created a ticket for the first stage of this improvement. This can be a > > first change towards the online mode suggested by Sergey and Anton. > > https://issues.apache.org/jira/browse/IGNITE-12263 > > > > On Fri, Oct 4, 2019 at 19:38, Alexey Goncharuk >: > > > > > Maxim, > > > > > > Having a cluster-wide lock for a cache does not improve the availability of > > > the solution. A user cannot defragment a cache if the cache is involved > > in > > > a mission-critical operation, so having a lock on such a cache is > > > equivalent to a whole-cluster shutdown. > > > > > > We should decide between either a single offline node or a more complex > > > fully online solution. > > > > > > On Fri, Oct 4, 2019 at 11:55, Maxim Muzafarov : > > > > > >> Igniters, > > >> > > >> This thread seems to be endless, but what if some kind of cache group > > >> distributed write lock (exclusive to some of the internal Ignite > > >> processes) is introduced? I think it will help to solve a batch of > > >> problems, like: > > >> > > >> 1.
defragmentation of all cache group partitions on the local node > > >> without concurrent updates. > > >> 2. improve data loading with data streamer isolation mode [1]. It > > >> seems we should not allow concurrent updates to cache if we on `fast > > >> data load` step. > > >> 3. recovery from a snapshot without cache stop\start actions > > >> > > >> > > >> [1] https://issues.apache.org/jira/browse/IGNITE-11793 > > >> > > >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov > > wrote: > > >> > > > >> > Hi > > >> > > > >> > I'm not sure that node offline is a best way to do that. > > >> > Cons: > > >> > - different caches may have different defragmentation but we force > to > > >> stop > > >> > whole node > > >> > - offline node is a maintenance operation will require to add +1 > > >> backup to > > >> > reduce the risk of data loss > > >> > - baseline auto adjustment? > > >> > - impact to index rebuild? > > >> > - cache configuration changes (or destroy) during node offline > > >> > > > >> > What about other ways without node stop? E.g. make cache group on a > > node > > >> > offline? Add *defrag *command to control.sh to force > > start > > >> > rebalance internally in the node with expected impact to > performance. > > >> > > > >> > > > >> > > > >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov > > wrote: > > >> > > > >> > > Alexey, > > >> > > As for me, it does not matter will it be IEP, umbrella or a single > > >> issue. > > >> > > The most important thing is Assignee :) > > >> > > > > >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk < > > >> > > alexey.goncha...@gmail.com> > > >> > > wrote: > > >> > > > > >> > > > Anton, do you think we should file a single ticket for this or > > >> should we > > >> > > go > > >> > > > with an IEP? As of now, the change does not look big enough for > an > > >> IEP > > >> > > for > > >> > > > me. > > >> > > > > > >> > > > чт, 3 окт. 2019 г. 
в 11:18, Anton Vinogradov : > > >> > > > > > >> > > > > Alexey, > > >> > > > > > > >> > > > > Sounds good to me. > > >> > > > > > > >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk < > > >> > > > > alexey.goncha...@gmail.com> > > >> > > > > wrote: > > >> > > > > > > >> > > > > > Anton, > > >> > > > > > > > >> > > > > > Switching a partition to and from the SHRINKING state will > > >> require > > >> > > > > > intricate synchronizations in order to properly determine > the > > >> start > > >> > > > > > position for historical rebalance without PME. > > >> > > > > > > > >> > > > > > I would still go with an offline-node approach, but instead > of > > >> > > cleaning > > >> > > > > the > > >> > > > > > persistence, we can do effective defragmentation when the > node > > >> is > > >> > > > offline > > >> > > > > > because we are sure that there is no concurrent load. After > > the > > >> > > > > > defragmentation completes, we bring the node back to the > > >> cluster and > > >> > > > > > historical rebalance will kick in automatically. It will > still > > >> > > require > > >> > > > > > manual node restarts, but since th
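The restart-and-defragment flow discussed in this thread — taking nodes offline one at a time, so that only one backup guarantee is "eaten" at any moment — can be sketched in a few lines. The model below is purely illustrative: the node and cluster operations are hypothetical stand-ins, not the real Ignite API, and the offline defragmentation step is represented by a comment.

```python
# Sketch of the rolling defragmentation idea: visit nodes one by one,
# so at most one backup copy is ever unavailable at a time.

def rolling_defragment(nodes, backups=1):
    """Defragment each node in turn, keeping at most one node offline at once."""
    if backups < 1:
        raise ValueError("need at least one backup to take a node offline safely")
    offline = set()
    max_offline = 0
    order = []
    for node in nodes:
        offline.add(node)                  # stop the node for maintenance
        max_offline = max(max_offline, len(offline))
        order.append(node)                 # ...offline defragmentation runs here...
        offline.remove(node)               # restart; historical rebalance catches up
    assert max_offline <= backups          # never exceed the backup guarantee we "eat"
    return order
```

Restarting in this one-by-one fashion is what makes the procedure compatible with a rolling upgrade: the cluster stays available throughout, at the price of temporarily losing one backup copy.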
Re: Enabling the checkstyle profile on Build Apache Ignite suite (test period)
Let's give it a try. On Mon, Oct 7, 2019 at 13:21, Nikolay Izhikov : > > +1 > > On Mon, 07/10/2019 at 13:18 +0300, Maxim Muzafarov wrote: > > Igniters, > > > > > > I'm planning to enable the `checkstyle` profile > > on the [Build > > Apache Ignite] suite on October 11 (Friday) at 22:00 MSK, by the end of the next weekend (one-week test > > period). Such an option was discussed many times before (e.g. [1]). > > > > Here are the reasons: > > > > - any code style violations in a PR lead to source code fixes, which in > > turn require a re-run of other test suites, so it is better to fail > > fast; > > - each new Run:All suite (e.g. for a new module) must contain a > > checkstyle suite to check code style by default, so it is better to include > > mandatory checks in the Build Apache Ignite procedure; > > - the `fail fast` paradigm will eliminate all the checkstyle violations, > > which currently happen from time to time; > > > > The ability to create a prototype PR without code style checks still > > exists. You can disable the `checkstyle` profile for such PRs in your > > local branches. > > > > > > Any objections? > > > > > > [1] > > http://apache-ignite-developers.2346864.n4.nabble.com/Code-inspection-td27709i80.html#a41297 -- Best regards, Ivan Pavlukhin
Re: Adding experimental support for Intel Optane DC Persistent Memory
Igniters, I would like to resurrect this discussion and will review the change again shortly. If anyone wants to join the review — you are welcome! On Wed, Aug 22, 2018 at 18:49, Denis Magda : > Hi Dmitry, > > That's a BSD-3-Clause license if we believe this statement > "SPDX-License-Identifier: BSD-3-Clause": > https://github.com/pmem/llpl/blob/master/LICENSE > > This license can be used with ASF software: > https://www.apache.org/legal/resolved.html#category-a > > -- > Denis > > On Wed, Aug 22, 2018 at 9:28 AM Dmitriy Pavlov > wrote: > > > Hi Denis, > > > > Could you please double-check whether we can refer to a library licensed to > > Intel. Can we develop a code-only version of this support (without shipping > > it in a release)? > > > > https://github.com/apache/ignite/pull/4381 is quite a huge change, > > including 128 files changed; the patch review will require resources from > > community members. I would like to be sure we can include this > > patch from the legal point of view. > > > > Sincerely, > > Dmitriy Pavlov > > > > On Fri, Aug 3, 2018 at 19:23, Dmitriy Pavlov : > > > >> Hi Mulugeta, > >> > >> I apologise, I missed that the license is already there. So I guess it is > >> not a standard open-source license; it seems it is not listed in > >> https://www.apache.org/legal/resolved.html#category-a > >> > >> So there can be a legal concern related to including this lib as a dependency > >> in an Apache product. It should not block the review; we can later > >> consult Secretary/Legal to find out how we can correctly include a reference > >> to the lib. > >> > >> Sincerely, > >> Dmitriy Pavlov > >> > >> On Thu, Aug 2, 2018 at 0:24, Mammo, Mulugeta : > >> > >>> Hi Dmitriy, > >>> > >>> Do you mean our LLPL library?
It has a license, please look here: > >>> https://github.com/pmem/llpl > >>> > >>> Regarding the changes made to Ignite, you may refer to the pull request > >>> here: https://github.com/apache/ignite/pull/4381 > >>> > >>> Thanks, > >>> Mulugeta > >>> > >>> -Original Message- > >>> From: Dmitriy Pavlov [mailto:dpavlov@gmail.com] > >>> Sent: Wednesday, August 1, 2018 10:49 AM > >>> To: dev@ignite.apache.org > >>> Subject: Re: Adding experimental support for Intel Optane DC Persistent > >>> Memory > >>> > >>> Hi Mulugeta Mammo, > >>> > >>> I've just noticed that the repository you refer to is a full fork of Ignite. > >>> How can I see the differences from the original Ignite? > >>> > >>> One more thing: the library you're referencing seems not to contain a > >>> license; at least GitHub cannot parse it. Apache products have limitations on > >>> which libraries may be used (see > >>> https://www.apache.org/legal/resolved.html#category-a and > >>> https://www.apache.org/legal/resolved.html#category-b) > >>> > >>> Could you please comment on whether there is some legal risk? > >>> > >>> Sincerely, > >>> Dmitriy Pavlov > >>> > >>> On Wed, Aug 1, 2018 at 20:43, Dmitriy Pavlov : > >>> > >>> > Hi, > >>> > > >>> > This link works for me > >>> > > >>> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-26%3A+Adding+Ex > >>> > perimental+Support+for+Intel+Optane+DC+Persistent+Memory > >>> > > >>> > Sincerely, > >>> > Dmitriy Pavlov > >>> > > >>> > On Thu, Jul 26, 2018 at 15:31, Stanislav Lukyanov < > >>> stanlukya...@gmail.com>: > >>> > > >>> >> Ah, ok, it’s just the ‘.’ at the end of the link. Removed it and it’s > >>> >> fine. > >>> >> > >>> >> From: Stanislav Lukyanov > >>> >> Sent: July 26, 2018, 15:12 > >>> >> To: dev@ignite.apache.org > >>> >> Subject: RE: Adding experimental support for Intel Optane DC > >>> >> Persistent Memory > >>> >> > >>> >> Hi, > >>> >> > >>> >> The link you’ve shared gives me a 404. > >>> >> Perhaps you need to add a permission for everyone to access the page?
> >>> >> > >>> >> Thanks, > >>> >> Stan > >>> >> > >>> >> From: Mammo, Mulugeta > >>> >> Sent: 26 июля 2018 г. 2:44 > >>> >> To: dev@ignite.apache.org > >>> >> Subject: Adding experimental support for Intel Optane DC Persistent > >>> >> Memory > >>> >> > >>> >> Hi, > >>> >> > >>> >> I have added a new proposal to support Intel Optane DC Persistent > >>> >> Memory for Ignite here: > >>> >> > https://cwiki.apache.org/confluence/display/IGNITE/Adding+Experimenta > >>> >> l+Support+for+Intel+Optane+DC+Persistent+Memory > >>> >> . > >>> >> > >>> >> I'm looking forward to your feedback and collaboration on this. > >>> >> > >>> >> Thanks, > >>> >> Mulugeta > >>> >> > >>> >> > >>> >> > >>> >> > >>> > >> >
Re: How to free up space on disc after removing entries from IgniteCache with enabled PDS?
Alex, thanks for the summary and proposal. Anton, Ivan and others who took part in this discussion, what're your thoughts? I see this rolling-upgrades-based approach as a reasonable solution. Even though a node shutdown is expected, the procedure doesn't lead to the cluster outage meaning it can be utilized for 24x7 production environments. - Denis On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk wrote: > Created a ticket for the first stage of this improvement. This can be a > first change towards the online mode suggested by Sergey and Anton. > https://issues.apache.org/jira/browse/IGNITE-12263 > > пт, 4 окт. 2019 г. в 19:38, Alexey Goncharuk : > > > Maxim, > > > > Having a cluster-wide lock for a cache does not improve availability of > > the solution. A user cannot defragment a cache if the cache is involved > in > > a mission-critical operation, so having a lock on such a cache is > > equivalent to the whole cluster shutdown. > > > > We should decide between either a single offline node or a more complex > > fully online solution. > > > > пт, 4 окт. 2019 г. в 11:55, Maxim Muzafarov : > > > >> Igniters, > >> > >> This thread seems to be endless, but we if some kind of cache group > >> distributed write lock (exclusive for some of the internal Ignite > >> process) will be introduced? I think it will help to solve a batch of > >> problems, like: > >> > >> 1. defragmentation of all cache group partitions on the local node > >> without concurrent updates. > >> 2. improve data loading with data streamer isolation mode [1]. It > >> seems we should not allow concurrent updates to cache if we on `fast > >> data load` step. > >> 3. recovery from a snapshot without cache stop\start actions > >> > >> > >> [1] https://issues.apache.org/jira/browse/IGNITE-11793 > >> > >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov > wrote: > >> > > >> > Hi > >> > > >> > I'm not sure that node offline is a best way to do that. 
> >> > Cons: > >> > - different caches may have different defragmentation but we force to > >> stop > >> > whole node > >> > - offline node is a maintenance operation will require to add +1 > >> backup to > >> > reduce the risk of data loss > >> > - baseline auto adjustment? > >> > - impact to index rebuild? > >> > - cache configuration changes (or destroy) during node offline > >> > > >> > What about other ways without node stop? E.g. make cache group on a > node > >> > offline? Add *defrag *command to control.sh to force > start > >> > rebalance internally in the node with expected impact to performance. > >> > > >> > > >> > > >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov > wrote: > >> > > >> > > Alexey, > >> > > As for me, it does not matter will it be IEP, umbrella or a single > >> issue. > >> > > The most important thing is Assignee :) > >> > > > >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk < > >> > > alexey.goncha...@gmail.com> > >> > > wrote: > >> > > > >> > > > Anton, do you think we should file a single ticket for this or > >> should we > >> > > go > >> > > > with an IEP? As of now, the change does not look big enough for an > >> IEP > >> > > for > >> > > > me. > >> > > > > >> > > > чт, 3 окт. 2019 г. в 11:18, Anton Vinogradov : > >> > > > > >> > > > > Alexey, > >> > > > > > >> > > > > Sounds good to me. > >> > > > > > >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk < > >> > > > > alexey.goncha...@gmail.com> > >> > > > > wrote: > >> > > > > > >> > > > > > Anton, > >> > > > > > > >> > > > > > Switching a partition to and from the SHRINKING state will > >> require > >> > > > > > intricate synchronizations in order to properly determine the > >> start > >> > > > > > position for historical rebalance without PME. 
> >> > > > > > > >> > > > > > I would still go with an offline-node approach, but instead of > >> > > cleaning > >> > > > > the > >> > > > > > persistence, we can do effective defragmentation when the node > >> is > >> > > > offline > >> > > > > > because we are sure that there is no concurrent load. After > the > >> > > > > > defragmentation completes, we bring the node back to the > >> cluster and > >> > > > > > historical rebalance will kick in automatically. It will still > >> > > require > >> > > > > > manual node restarts, but since the data is not removed, there > >> are no > >> > > > > > additional risks. Also, this will be an excellent solution for > >> those > >> > > > who > >> > > > > > can afford downtime and execute the defragment command on all > >> nodes > >> > > in > >> > > > > the > >> > > > > > cluster simultaneously - this will be the fastest way > possible. > >> > > > > > > >> > > > > > --AG > >> > > > > > > >> > > > > > пн, 30 сент. 2019 г. в 09:29, Anton Vinogradov >: > >> > > > > > > >> > > > > > > Alexei, > >> > > > > > > >> stopping fragmented node and removing partition data, > then > >> > > > starting > >> > > > > it > >> > > > > > > again > >> > > > > > > > >> > > > > > > That's exactly what we're doing to solve the fragmentat
[jira] [Created] (IGNITE-12267) ClassCastException after change column type (drop, add)
Kirill Tkalenko created IGNITE-12267: Summary: ClassCastException after change column type (drop, add) Key: IGNITE-12267 URL: https://issues.apache.org/jira/browse/IGNITE-12267 Project: Ignite Issue Type: Improvement Reporter: Kirill Tkalenko Assignee: Kirill Tkalenko Fix For: 2.8 Changing a SQL column type is not supported, but it is possible to drop a column and re-create it with a new type. Applying such a migration script passes without errors, but a ClassCastException occurs whenever the column is subsequently accessed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Metric showing how many nodes may safely leave the cluster
Denis, Alex, Sure, the new metric will be integrated into the new metrics framework. Let's not expose its value to control.sh right now. I'll create an issue for the aggregated "getMinimumNumberOfPartitionCopies" if everyone agrees. Best Regards, Ivan Rakov On 04.10.2019 20:06, Denis Magda wrote: I'm for the proposal to add new JMX metrics and enhance the existing tooling. But I would encourage us to integrate this into the new metrics framework Nikolay has been working on. Otherwise, we will be deprecating these JMX metrics in a short time frame in favor of the new monitoring APIs. - Denis On Fri, Oct 4, 2019 at 9:33 AM Alexey Goncharuk wrote: I agree that we should have the ability to read any metric using simple Ignite tooling. I am not sure if visor.sh is a good fit - if I remember correctly, it will start a daemon node which will bump the topology version, with all related consequences. I believe that in the long term it will be beneficial to migrate all visor.sh functionality to a more lightweight protocol, such as the one used in control.sh. As for the metrics, the metric suggested by Ivan totally makes sense to me - it is a simple and, actually, quite critical metric. It would be completely impractical to select the minimum of some metric across all cache groups manually. A monitoring system, on the other hand, might not be available when the metric is needed, or may not support aggregation. --AG On Fri, Oct 4, 2019 at 18:58, Ivan Rakov : Nikolay, Many users start to use Ignite with a small project without production-level monitoring. When the proof of concept appears to be viable, they tend to expand Ignite usage by growing the cluster and adding the needed environment (including monitoring systems). The inability to find out such a basic thing as survival in case of the next node crash may affect the overall product impression. We all want Ignite to be successful and widespread. Can you clarify, what do you mean, exactly? Right now a user can access the metric mentioned by Alex and choose the minimum over all cache groups.
I want to highlight that not every user understands Ignite and its internals well enough to figure out that exactly this sequence of actions will bring them to the desired answer. Can you clarify, what do you mean, exactly? We have a ticket [1] to support metrics output via visor.sh. My understanding: we should have an easy way to output metric values for each node in the cluster. [1] https://issues.apache.org/jira/browse/IGNITE-12191 I propose to add a metric method for the aggregated "getMinimumNumberOfPartitionCopies" and expose it via control.sh. My understanding: its result is critical enough to be accessible via a short path. I've started this topic due to a request from the user list, and I've heard many similar complaints before. Best Regards, Ivan Rakov On 04.10.2019 17:18, Nikolay Izhikov wrote: Ivan. We shouldn't force users to configure external tools and write extra code for basic things. Actually, I don't agree with you. Having an external monitoring system for any production cluster is a *basic* thing. Can you, please, define "basic things"? single method for the whole cluster Can you clarify, what do you mean, exactly? We have a ticket [1] to support metrics output via visor.sh. My understanding: we should have an easy way to output metric values for each node in the cluster. [1] https://issues.apache.org/jira/browse/IGNITE-12191 On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote: Max, What if a user simply doesn't have a configured monitoring system? Knowing whether the cluster will survive a node shutdown is critical for any administrator who performs any manipulations with the cluster topology. Essential information should be easily accessible. We shouldn't force users to configure external tools and write extra code for basic things. Alex, Thanks, that's the exact metric we need. My point is that we should make it more accessible: via a control.sh command and a single method for the whole cluster.
Best Regards, Ivan Rakov On 04.10.2019 16:34, Alex Plehanov wrote: Ivan, there already exists a metric, CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the current redundancy level for the cache group. We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data loss in this cache group. On Fri, Oct 4, 2019 at 16:17, Ivan Rakov : Igniters, I've seen numerous requests for an easy way to check whether it is safe to turn off a cluster node. As we know, in Ignite protection from a sudden node shutdown is implemented by keeping several backup copies of each partition. However, this guarantee can be weakened for a while in case the cluster has recently experienced a node restart and the rebalancing process is still in progress. An example scenario is restarting nodes one by one in order to update a local configuration parameter. The user restarts one node and rebalancing starts: when it is completed, it will be safe to proceed (backup count = 1). However, there's no transparent way to determine whether rebalanci
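The aggregate discussed in this thread — a single cluster-wide value derived from the per-cache-group `getMinimumNumberOfPartitionCopies` — is just a minimum over all groups, with "nodes that may safely leave" being that minimum minus one. A minimal sketch of the aggregation, where a plain dict stands in for reading the real `CacheGroupMetricsMXBean` values (this is not the actual Ignite API):

```python
def min_partition_copies(per_group):
    """per_group: cache group name -> getMinimumNumberOfPartitionCopies value."""
    return min(per_group.values())

def nodes_safe_to_stop(per_group):
    # Up to (minimum copies - 1) nodes can leave without data loss;
    # never report a negative number for a group with a single copy.
    return max(min_partition_copies(per_group) - 1, 0)

copies = {"group-a": 2, "group-b": 3, "group-c": 2}
print(nodes_safe_to_stop(copies))  # -> 1
```

This is why a manual per-group lookup is impractical: the interesting value is the minimum across every cache group, and a single aggregated metric (or control.sh command) computes it in one step.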
[jira] [Created] (IGNITE-12266) Add limit parameter to Platforms for processing TextQuery
Yuriy Shuliha created IGNITE-12266: --- Summary: Add limit parameter to Platforms for processing TextQuery Key: IGNITE-12266 URL: https://issues.apache.org/jira/browse/IGNITE-12266 Project: Ignite Issue Type: Improvement Components: platforms Reporter: Yuriy Shuliha Assignee: Yuriy Shuliha Fix For: 2.8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12265) JavaDoc doesn't have documentation for the org.apache.ignite.client package
Denis Mekhanikov created IGNITE-12265: - Summary: JavaDoc doesn't have documentation for the org.apache.ignite.client package Key: IGNITE-12265 URL: https://issues.apache.org/jira/browse/IGNITE-12265 Project: Ignite Issue Type: Bug Reporter: Denis Mekhanikov The JavaDoc published on the website doesn't have documentation for the {{org.apache.ignite.client}} package. Link to the website: [https://ignite.apache.org/releases/2.7.6/javadoc/] The lack of a {{package-info.java}} file, or an exclusion in the {{maven-javadoc-plugin}} configuration in the root {{pom.xml}}, may be the reason. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12264) Private application data should not be exposed in logs, exceptions, ERROR, WARN messages, etc.
Pushenko Kirill created IGNITE-12264: Summary: Private application data should not be exposed in logs, exceptions, ERROR, WARN messages, etc. Key: IGNITE-12264 URL: https://issues.apache.org/jira/browse/IGNITE-12264 Project: Ignite Issue Type: Improvement Affects Versions: 2.7.6 Reporter: Pushenko Kirill Private application data should not be exposed in logs, exceptions, ERROR and WARN messages, etc. The reported exceptions contained a value that included card numbers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Enabling the checkstyle profile on Build Apache Ignite suite (test period)
+1 On Mon, 07/10/2019 at 13:18 +0300, Maxim Muzafarov wrote: > Igniters, > > > I'm planning to enable the `checkstyle` profile > on the [Build > Apache Ignite] suite on October 11 (Friday) at 22:00 MSK, by the end of the next weekend (one-week test > period). Such an option was discussed many times before (e.g. [1]). > > Here are the reasons: > > - any code style violations in a PR lead to source code fixes, which in > turn require a re-run of other test suites, so it is better to fail > fast; > - each new Run:All suite (e.g. for a new module) must contain a > checkstyle suite to check code style by default, so it is better to include > mandatory checks in the Build Apache Ignite procedure; > - the `fail fast` paradigm will eliminate all the checkstyle violations, > which currently happen from time to time; > > The ability to create a prototype PR without code style checks still > exists. You can disable the `checkstyle` profile for such PRs in your > local branches. > > > Any objections? > > > [1] > http://apache-ignite-developers.2346864.n4.nabble.com/Code-inspection-td27709i80.html#a41297
Enabling the checkstyle profile on Build Apache Ignite suite (test period)
Igniters, I'm planning to enable the `checkstyle` profile on the [Build Apache Ignite] suite on October 11 (Friday) at 22:00 MSK, by the end of the next weekend (one-week test period). Such an option was discussed many times before (e.g. [1]). Here are the reasons: - any code style violations in a PR lead to source code fixes, which in turn require a re-run of other test suites, so it is better to fail fast; - each new Run:All suite (e.g. for a new module) must contain a checkstyle suite to check code style by default, so it is better to include mandatory checks in the Build Apache Ignite procedure; - the `fail fast` paradigm will eliminate all the checkstyle violations, which currently happen from time to time; The ability to create a prototype PR without code style checks still exists. You can disable the `checkstyle` profile for such PRs in your local branches. Any objections? [1] http://apache-ignite-developers.2346864.n4.nabble.com/Code-inspection-td27709i80.html#a41297
Re: How to free up space on disc after removing entries from IgniteCache with enabled PDS?
Created a ticket for the first stage of this improvement. This can be a first change towards the online mode suggested by Sergey and Anton. https://issues.apache.org/jira/browse/IGNITE-12263 пт, 4 окт. 2019 г. в 19:38, Alexey Goncharuk : > Maxim, > > Having a cluster-wide lock for a cache does not improve availability of > the solution. A user cannot defragment a cache if the cache is involved in > a mission-critical operation, so having a lock on such a cache is > equivalent to the whole cluster shutdown. > > We should decide between either a single offline node or a more complex > fully online solution. > > пт, 4 окт. 2019 г. в 11:55, Maxim Muzafarov : > >> Igniters, >> >> This thread seems to be endless, but we if some kind of cache group >> distributed write lock (exclusive for some of the internal Ignite >> process) will be introduced? I think it will help to solve a batch of >> problems, like: >> >> 1. defragmentation of all cache group partitions on the local node >> without concurrent updates. >> 2. improve data loading with data streamer isolation mode [1]. It >> seems we should not allow concurrent updates to cache if we on `fast >> data load` step. >> 3. recovery from a snapshot without cache stop\start actions >> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-11793 >> >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov wrote: >> > >> > Hi >> > >> > I'm not sure that node offline is a best way to do that. >> > Cons: >> > - different caches may have different defragmentation but we force to >> stop >> > whole node >> > - offline node is a maintenance operation will require to add +1 >> backup to >> > reduce the risk of data loss >> > - baseline auto adjustment? >> > - impact to index rebuild? >> > - cache configuration changes (or destroy) during node offline >> > >> > What about other ways without node stop? E.g. make cache group on a node >> > offline? 
Add *defrag *command to control.sh to force start >> > rebalance internally in the node with expected impact to performance. >> > >> > >> > >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov wrote: >> > >> > > Alexey, >> > > As for me, it does not matter will it be IEP, umbrella or a single >> issue. >> > > The most important thing is Assignee :) >> > > >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk < >> > > alexey.goncha...@gmail.com> >> > > wrote: >> > > >> > > > Anton, do you think we should file a single ticket for this or >> should we >> > > go >> > > > with an IEP? As of now, the change does not look big enough for an >> IEP >> > > for >> > > > me. >> > > > >> > > > чт, 3 окт. 2019 г. в 11:18, Anton Vinogradov : >> > > > >> > > > > Alexey, >> > > > > >> > > > > Sounds good to me. >> > > > > >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk < >> > > > > alexey.goncha...@gmail.com> >> > > > > wrote: >> > > > > >> > > > > > Anton, >> > > > > > >> > > > > > Switching a partition to and from the SHRINKING state will >> require >> > > > > > intricate synchronizations in order to properly determine the >> start >> > > > > > position for historical rebalance without PME. >> > > > > > >> > > > > > I would still go with an offline-node approach, but instead of >> > > cleaning >> > > > > the >> > > > > > persistence, we can do effective defragmentation when the node >> is >> > > > offline >> > > > > > because we are sure that there is no concurrent load. After the >> > > > > > defragmentation completes, we bring the node back to the >> cluster and >> > > > > > historical rebalance will kick in automatically. It will still >> > > require >> > > > > > manual node restarts, but since the data is not removed, there >> are no >> > > > > > additional risks. 
Also, this will be an excellent solution for >> those >> > > > who >> > > > > > can afford downtime and execute the defragment command on all >> nodes >> > > in >> > > > > the >> > > > > > cluster simultaneously - this will be the fastest way possible. >> > > > > > >> > > > > > --AG >> > > > > > >> > > > > > пн, 30 сент. 2019 г. в 09:29, Anton Vinogradov : >> > > > > > >> > > > > > > Alexei, >> > > > > > > >> stopping fragmented node and removing partition data, then >> > > > starting >> > > > > it >> > > > > > > again >> > > > > > > >> > > > > > > That's exactly what we're doing to solve the fragmentation >> issue. >> > > > > > > The problem here is that we have to perform N/B >> restart-rebalance >> > > > > > > operations (N - cluster size, B - backups count) and it takes >> a lot >> > > > of >> > > > > > time >> > > > > > > with risks to lose the data. >> > > > > > > >> > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov < >> > > > > > > alexey.scherbak...@gmail.com> wrote: >> > > > > > > >> > > > > > > > Probably this should be allowed to do using public API, >> actually >> > > > this >> > > > > > is >> > > > > > > > same as manual rebalancing. >> > > > > > > > >> > > > > > > > пт, 27 сент. 2019 г. в 17:40, Alexei Scherbakov < >> > > > > > > > alexey.scherb
[jira] [Created] (IGNITE-12263) Introduce native persistence compaction operation
Alexey Goncharuk created IGNITE-12263:
-----------------------------------------

             Summary: Introduce native persistence compaction operation
                 Key: IGNITE-12263
                 URL: https://issues.apache.org/jira/browse/IGNITE-12263
             Project: Ignite
          Issue Type: Improvement
            Reporter: Alexey Goncharuk

Currently, Ignite native persistence does not shrink storage files after key-value pairs are removed. The causes of this behavior are:

* The absence of a mechanism that allows Ignite to track the highest non-empty page position in a partition file
* The absence of a mechanism that allows Ignite to select the page closest to the beginning of the file for a write
* The absence of a mechanism that allows Ignite to move a key-value pair from page to page during defragmentation

As an initial change, I suggest introducing a new node startup mode that runs a defragmentation procedure, allowing the node to shrink its storage files. The procedure will not mutate the logical state of a partition, so a subsequent historical rebalance can quickly catch the node up. Since the procedure runs during node startup (during the final stages of recovery), there is no concurrent load, and entries can be freely moved from page to page with no tricky synchronization. If the procedure is applied during a whole-cluster restart, all nodes are defragmented simultaneously, allowing for quicker parallel defragmentation at the cost of downtime.

The procedure should accept an optional list of cache groups to defragment, so that an arbitrary set of cache groups can be selected for defragmentation.

An idea of the actions taken during the run for each partition selected for defragmentation:

* Partition pages are preloaded to memory, if possible, to avoid excessive page replacement. During the scan, an HWM (high-water mark) of the written data is detected (empty pages are skipped)
* Page references in the free list are sorted so that pages closest to the file start are picked first
* The partition is scanned in reverse order; key-value pairs are moved closer to the file start and the HWM is updated accordingly. This step is particularly open to various optimizations, because different strategies will work well for different fragmentation patterns.
* After the scan iteration completes, the file size can be trimmed down to the HWM

As a further improvement, this partition defragmentation procedure can later be run in online mode, once proper cache update protocol changes are designed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
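The per-partition steps in the ticket can be sketched with a toy model. This is plain Java operating on an int array standing in for a partition file (0 = empty page), not Ignite's actual page memory or free-list structures; the class and method names are illustrative assumptions:

```java
// Toy model of the proposed defragmentation pass: a partition file is an
// array of "pages" where 0 marks an empty page. The file is scanned in
// reverse order and occupied pages are moved into the free slots closest to
// the file start; afterwards the file can be truncated at the HWM.
final class DefragSketch {
    /** Compacts occupied pages toward the file start; returns the HWM (new file length in pages). */
    static int defragment(int[] pages) {
        int free = 0; // free slot closest to the file start (free list sorted by position)
        for (int tail = pages.length - 1; tail >= 0; tail--) { // reverse scan, as in the ticket
            if (pages[tail] == 0)
                continue; // empty pages are skipped
            while (free < tail && pages[free] != 0)
                free++; // advance to the first empty slot left of 'tail'
            if (free >= tail)
                break; // no empty slot left of 'tail': the file is compact up to here
            pages[free] = pages[tail]; // move the page closer to the file start
            pages[tail] = 0;
        }
        int hwm = pages.length;
        while (hwm > 0 && pages[hwm - 1] == 0)
            hwm--; // the file size can be cut down to the HWM
        return hwm;
    }

    public static void main(String[] args) {
        int[] file = {7, 0, 0, 5, 0, 9}; // fragmented partition: 3 occupied pages out of 6
        System.out.println("HWM after defrag: " + DefragSketch.defragment(file)); // prints 3
    }
}
```

In the real storage a moved key-value pair also requires updating the pointers that reference it (indexes, free list), which is exactly why the ticket restricts the pass to startup, when no concurrent load exists.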
Re: How to free up space on disc after removing entries from IgniteCache with enabled PDS?
Hello!

I think that a good, robust approach is to start a background thread which will try to compact pages and remove unneeded ones. It should only be active when the system is reasonably idle, or if there is a severe fragmentation problem. However, I am aware that implementing such a heuristic cleaner is a challenging task.

Regards,
--
Ilya Kasnacheev

пт, 4 окт. 2019 г. в 19:38, Alexey Goncharuk : > Maxim, > > Having a cluster-wide lock for a cache does not improve availability of the > solution. A user cannot defragment a cache if the cache is involved in a > mission-critical operation, so having a lock on such a cache is equivalent > to the whole cluster shutdown. > > We should decide between either a single offline node or a more complex > fully online solution. > > пт, 4 окт. 2019 г. в 11:55, Maxim Muzafarov : > > > Igniters, > > > > This thread seems to be endless, but we if some kind of cache group > > distributed write lock (exclusive for some of the internal Ignite > > process) will be introduced? I think it will help to solve a batch of > > problems, like: > > > > 1. defragmentation of all cache group partitions on the local node > > without concurrent updates. > > 2. improve data loading with data streamer isolation mode [1]. It > > seems we should not allow concurrent updates to cache if we on `fast > > data load` step. > > 3. recovery from a snapshot without cache stop\start actions > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11793 > > > > On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov wrote: > > > > > > Hi > > > > > > I'm not sure that node offline is a best way to do that. > > > Cons: > > > - different caches may have different defragmentation but we force to > > stop > > > whole node > > > - offline node is a maintenance operation will require to add +1 > backup > > to > > > reduce the risk of data loss > > > - baseline auto adjustment? > > > - impact to index rebuild?
> > > - cache configuration changes (or destroy) during node offline > > > > > > What about other ways without node stop? E.g. make cache group on a > node > > > offline? Add *defrag *command to control.sh to force > start > > > rebalance internally in the node with expected impact to performance. > > > > > > > > > > > > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov > wrote: > > > > > > > Alexey, > > > > As for me, it does not matter will it be IEP, umbrella or a single > > issue. > > > > The most important thing is Assignee :) > > > > > > > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk < > > > > alexey.goncha...@gmail.com> > > > > wrote: > > > > > > > > > Anton, do you think we should file a single ticket for this or > > should we > > > > go > > > > > with an IEP? As of now, the change does not look big enough for an > > IEP > > > > for > > > > > me. > > > > > > > > > > чт, 3 окт. 2019 г. в 11:18, Anton Vinogradov : > > > > > > > > > > > Alexey, > > > > > > > > > > > > Sounds good to me. > > > > > > > > > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk < > > > > > > alexey.goncha...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Anton, > > > > > > > > > > > > > > Switching a partition to and from the SHRINKING state will > > require > > > > > > > intricate synchronizations in order to properly determine the > > start > > > > > > > position for historical rebalance without PME. > > > > > > > > > > > > > > I would still go with an offline-node approach, but instead of > > > > cleaning > > > > > > the > > > > > > > persistence, we can do effective defragmentation when the node > is > > > > > offline > > > > > > > because we are sure that there is no concurrent load. After the > > > > > > > defragmentation completes, we bring the node back to the > cluster > > and > > > > > > > historical rebalance will kick in automatically. 
It will still > > > > require > > > > > > > manual node restarts, but since the data is not removed, there > > are no > > > > > > > additional risks. Also, this will be an excellent solution for > > those > > > > > who > > > > > > > can afford downtime and execute the defragment command on all > > nodes > > > > in > > > > > > the > > > > > > > cluster simultaneously - this will be the fastest way possible. > > > > > > > > > > > > > > --AG > > > > > > > > > > > > > > пн, 30 сент. 2019 г. в 09:29, Anton Vinogradov >: > > > > > > > > > > > > > > > Alexei, > > > > > > > > >> stopping fragmented node and removing partition data, then > > > > > starting > > > > > > it > > > > > > > > again > > > > > > > > > > > > > > > > That's exactly what we're doing to solve the fragmentation > > issue. > > > > > > > > The problem here is that we have to perform N/B > > restart-rebalance > > > > > > > > operations (N - cluster size, B - backups count) and it takes > > a lot > > > > > of > > > > > > > time > > > > > > > > with risks to lose the data. > > > > > > > > > > > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov < > > > > > > > > alexey.scherbak...@gmail.com> wrote: >
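A back-of-the-envelope model makes the trade-off in the thread concrete: restarting nodes one by one "eats" only one backup guarantee, while restarting the whole cluster at once is the fastest option but costs downtime. This is a hypothetical sketch (the class and method names are illustrative), assuming each of the B+1 copies of a partition lives on a distinct node:

```java
// With B backups a partition has B+1 copies on B+1 distinct nodes. Taking k
// nodes offline at once can remove at most k of those copies, so in the worst
// case B+1-k copies stay online; data stays available while that number is > 0.
final class RollingDefrag {
    /** Worst-case number of live copies of a partition while 'offlineNodes' nodes are down. */
    static int minLiveCopies(int backups, int offlineNodes) {
        return Math.max(0, backups + 1 - offlineNodes);
    }

    public static void main(String[] args) {
        // Rolling defragmentation, one node at a time, with 1 backup:
        System.out.println(minLiveCopies(1, 1)); // one copy stays online: no outage, one backup "eaten"
        // Whole-cluster restart (say 4 nodes offline at once):
        System.out.println(minLiveCopies(1, 4)); // 0 copies online: downtime, but fastest defragmentation
    }
}
```

This also shows why Anton's N/B figure matters: with B backups, at most one node per backup-disjoint group can be defragmented at a time without dropping below one live copy.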