Re: ExecutorServices hide assertions without logging and node stop

2020-05-06 Thread Nikolay Izhikov
Hello, Maxim.

I can confirm this issue.
It also happens when some custom Security plugin throws an exception while 
processing a communication message.

```
UUID newSecSubjId = secSubjId != null ? secSubjId : nodeId;

try (OperationSecurityContext s = ctx.security().withContext(newSecSubjId)) {
    lsnr.onMessage(nodeId, msg, plc);
}
finally {
    if (change)
        CUR_PLC.set(oldPlc);
}
```

If an exception is thrown from `withContext`, it is hidden from the user.
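
To make the proposal below more concrete, here is a minimal sketch of a wrapper that would at least log such uncaught Throwables (including AssertionError) and hand them to a failure callback before rethrowing. The class name `LoggingTask` and the `failureHnd` callback are hypothetical, not existing Ignite API:

```java
import java.util.function.Consumer;

import org.apache.ignite.IgniteLogger;

/** Hypothetical helper: wraps a task so uncaught errors are logged and reported. */
final class LoggingTask implements Runnable {
    private final Runnable delegate;
    private final IgniteLogger log;
    private final Consumer<Throwable> failureHnd; // E.g. forwards to the Failure Handler.

    LoggingTask(Runnable delegate, IgniteLogger log, Consumer<Throwable> failureHnd) {
        this.delegate = delegate;
        this.log = log;
        this.failureHnd = failureHnd;
    }

    @Override public void run() {
        try {
            delegate.run();
        }
        catch (Throwable t) {
            // Without this, an AssertionError thrown inside the task is silently swallowed.
            log.error("Uncaught error in executor task", t);

            failureHnd.accept(t);

            throw t;
        }
    }
}
```

With something like this, submitting `new LoggingTask(() -> sendPartitions(newCrd), log, hnd)` instead of a bare Runnable would leave a trace instead of swallowing the error.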

> On 4 May 2020, at 19:15, Maxim Muzafarov wrote:
> 
> Igniters,
> 
> 
> I've spent a couple of days in debug mode and found that some of the
> configured ExecutorServices of an Ignite instance completely hide
> assertion errors without any logging and node killing. This may
> violate some internal guarantees and hide serious errors.
> 
> Let's consider, for instance, GridDhtPartitionsExchangeFuture [1]. It
> has three places of submitting some Runnable on system executor
> service. If an assertion error (or even any uncaught exception) occurs
> in the code block below, it will be swallowed without logging, exchange
> future completion, or node stopping.
> 
> cctx.kernalContext().getSystemExecutorService().submit(new Runnable() {
>@Override public void run() {
>sendPartitions(newCrd);
>}
> });
> 
> I've checked these executor services: most of them are configured
> to catch only OutOfMemoryError.
> 
> Should we consider catching AssertionErrors as well and treating them as
> CRITICAL_ERRORS for the Failure Handler?
> Should we log uncaught errors on each of them?
> 
> 
> The most important list of executor services configured with OOM handler only:
> execSvc
> svcExecSvc
> sysExecSvc
> p2pExecSvc
> restExecSvc
> utilityCacheExecSvc
> affExecSvc
> qryExecSvc
> 
> [1] 
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/GridDhtPartitionsExchangeFuture.java#L4848



[RESULT][VOTE] Release Apache Ignite Spring Boot extensions 1.0.0 RC2

2020-05-06 Thread Nikolay Izhikov
The vote for the new release candidate is now closed.

Vote result: Vote passes with 4 votes +1 (4 binding +1 votes), no 0 and no -1.

+1 votes:

- Nikolay Izhikov (binding)
- Maxim Muzafarov (binding)
- Saikat Maitra (binding)
- Denis Magda (binding)

Vote thread
https://lists.apache.org/thread.html/r5427082e51e5eaf051703afcfda2bdd9812b69ad3e29f714ddbb4f4e%40%3Cdev.ignite.apache.org%3E

[jira] [Created] (IGNITE-12983) Logging exceptions inside IgniteSecurityProcessor#withContext(java.util.UUID)

2020-05-06 Thread Denis Garus (Jira)
Denis Garus created IGNITE-12983:


 Summary: Logging exceptions inside 
IgniteSecurityProcessor#withContext(java.util.UUID)
 Key: IGNITE-12983
 URL: https://issues.apache.org/jira/browse/IGNITE-12983
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Garus
Assignee: Denis Garus
 Fix For: 2.9


We should log all exceptions thrown inside 
IgniteSecurityProcessor#withContext(java.util.UUID).
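
A possible shape of the fix, shown only as a hypothetical simplification of the method (the delegate call is a placeholder):

{code}
// Hypothetical sketch: log before rethrowing so the error is not silently
// swallowed by callers such as the communication listener.
@Override public OperationSecurityContext withContext(UUID subjId) {
    try {
        return delegateWithContext(subjId); // Placeholder for the real logic.
    }
    catch (Exception e) {
        log.error("Failed to switch security context [subjId=" + subjId + ']', e);

        throw e;
    }
}
{code}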





[DISCUSS] Data loss handling improvements

2020-05-06 Thread Alexei Scherbakov
Folks,

I've almost finished a patch bringing some improvements to the data loss
handling code, and I wish to discuss proposed changes with the community
before submitting.

*The issue*

During the grid's lifetime, it's possible to get into a situation where some
data nodes have failed or were mistakenly stopped. If the number of stopped
nodes exceeds a certain threshold depending on the configured backups count,
data loss will occur. For example, a grid having one backup (meaning at least
two copies of each data partition exist at the same time) can tolerate only
one node loss at a time. Generally, data loss is guaranteed to occur if
backups + 1 or more nodes have failed simultaneously when the default affinity
function is used.

For in-memory caches, data is lost forever. For persistent caches, data is
not physically lost and becomes accessible again after the failed nodes are
returned to the topology.

Possible data loss should be taken into consideration while designing an
application.



*Consider an example: money is transferred from one deposit to another, and
all nodes holding data for one of the deposits are gone. In such a case, the
transaction temporarily cannot be completed until the cluster is recovered
from the data loss state. Ignoring this can cause data inconsistency.*
It is necessary to have an API telling us whether an operation is safe to
complete from the perspective of data loss.

Such an API has existed for some time [1] [2] [3]. In short, a grid can be
configured to switch caches to the partial availability mode if data loss
is detected.

Let's give two definitions according to the Javadoc for
*PartitionLossPolicy*:

·   *Safe* (data loss handling) *policy* - cache operations are only
available for non-lost partitions (PartitionLossPolicy != IGNORE).

·   *Unsafe policy* - cache operations are always possible
(PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
partitions are automatically re-created on the remaining nodes if needed, or
immediately owned if the last supplier has left during rebalancing.

*What needs to be fixed*

1. The default loss policy is unsafe, even for persistent caches in the
current implementation. It can result in unintentional data loss and
broken business invariants.

2. Node restarts in the persistent grid with detected data loss will cause
automatic resetting of LOST state after the restart, even if the safe
policy is configured. It can result in data loss or partition desync if not
all nodes are returned to the topology or returned in the wrong order.


*An example: a grid has three nodes and one backup. The grid is under load.
First, node2 leaves; soon after, node3 leaves. If node2 is returned to
the topology first, it will have stale data for some keys. The most recent
data is on node3, which is not in the topology yet. Because the lost state
was reset, all caches are fully available and will most probably become
inconsistent even in safe mode.*
3. The configured loss policy doesn't provide the guarantees described in the
Javadoc, depending on the cluster configuration [4]. In particular, the unsafe
policy (IGNORE) cannot be guaranteed if the baseline is fixed (not
automatically readjusted on node left), because partitions do not
automatically get reassigned on topology change, and no nodes exist
to fulfill a read/write request. The same applies to READ_ONLY_ALL and READ_WRITE_ALL.

4. Calling resetLostPartitions doesn't guarantee full cache operation
availability if the topology doesn't have at least one owner for
each lost partition.

The ultimate goal of the patch is to fix API inconsistencies and fix the
most crucial bugs related to data loss handling.

*The planned changes are:*

1. The safe policy is used by default, except for in-memory grids with
baseline auto-adjust [5] enabled with a zero timeout [6]. In the latter case,
the unsafe policy is used by default. This protects against unintentional data
loss.

2. The lost state is never reset in the case of grid node restarts (even a
full restart). This makes real data loss impossible in persistent grids if
the recovery instruction is followed.

3. The lost state cannot be reset if the topology doesn't have at least
one owner for each lost partition. If nodes are physically dead, they
should be removed from the baseline first before calling resetLostPartitions.

4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because their
guarantees are impossible to fulfill when the baseline is not complete.

5. Any operation that failed due to data loss contains
CacheInvalidStateException as the root cause.

In addition to code fixes, I plan to write a tutorial for safe data loss
recovery in the persistent mode in the Ignite wiki.
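
For context, the APIs referenced in [1]-[3] can already be combined along these lines; a minimal sketch (the cache name and values are illustrative, READ_WRITE_SAFE is used as the safe policy):

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class LossPolicyExample {
    /** Cache configured with a safe loss policy: operations on lost partitions fail fast. */
    public static CacheConfiguration<Integer, String> accountsCacheCfg() {
        return new CacheConfiguration<Integer, String>("accounts")
            .setBackups(1)
            .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
    }

    /** Clears the LOST state once owners of all lost partitions are back in the baseline. */
    public static void recover(Ignite ignite) {
        // Inspect which partitions are currently marked as lost.
        System.out.println(ignite.cache("accounts").lostPartitions());

        // Resume full cache availability.
        ignite.resetLostPartitions(Collections.singleton("accounts"));
    }
}
```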

Any comments for the proposed changes are welcome.

[1]
org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
partLossPlc)
[2] org.apache.ignite.Ignite#resetLostPartitions(caches)
[3] org.apache.ignite.IgniteCache#lostPartitions
[4]  https://issues.apache.org/jira/browse/IGNITE-10041
[5] org.apac

[jira] [Created] (IGNITE-12984) Distributed join incorrectly processed when batched:unicast on primary key is used

2020-05-06 Thread Ilya Kasnacheev (Jira)
Ilya Kasnacheev created IGNITE-12984:


 Summary: Distributed join incorrectly processed when 
batched:unicast on primary key is used
 Key: IGNITE-12984
 URL: https://issues.apache.org/jira/browse/IGNITE-12984
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.8
Reporter: Ilya Kasnacheev
Assignee: Taras Ledkov


Please see attached SQL script and userlist discussion.

Summary :
CASE-1 Results: Correct and as expected
{code}
SELECT
__Z0.ID AS __C0_0,
__Z0.NAME AS __C0_1,
__Z1.BLOOD_GROUP AS __C0_2,
__Z2.UNIVERSAL_DONOR AS __C0_3
FROM PUBLIC.PERSON __Z0
/* PUBLIC.PERSON_NAME_ASC_IDX_proxy */
LEFT OUTER JOIN PUBLIC.MEDICAL_INFO __Z1
/* batched:broadcast PUBLIC.MEDICAL_INFO_NAME_ASC_IDX: NAME = __Z0.NAME */
ON __Z0.NAME = __Z1.NAME
LEFT OUTER JOIN PUBLIC.BLOOD_GROUP_INFO_PJ __Z2
/* batched:broadcast PUBLIC.BLOOD_GROUP_INFO_PJ_BLOOD_GROUP_ASC_IDX: 
BLOOD_GROUP =
__Z1.BLOOD_GROUP */
ON __Z1.BLOOD_GROUP = __Z2.BLOOD_GROUP
{code}

Summary :
CASE-2 Results: Incorrect
{code}
SELECT
__Z0.ID AS __C0_0,
__Z0.NAME AS __C0_1,
__Z1.BLOOD_GROUP AS __C0_2,
__Z2.UNIVERSAL_DONOR AS __C0_3
FROM PUBLIC.PERSON __Z0
/* PUBLIC.PERSON_ID_ASC_IDX_proxy */
LEFT OUTER JOIN PUBLIC.MEDICAL_INFO __Z1
/* batched:broadcast PUBLIC.MEDICAL_INFO_NAME_ASC_IDX: NAME = __Z0.NAME */
ON __Z0.NAME = __Z1.NAME
LEFT OUTER JOIN PUBLIC.BLOOD_GROUP_INFO_P __Z2
/* batched:unicast PUBLIC._key_PK_proxy: BLOOD_GROUP = __Z1.BLOOD_GROUP */
ON __Z1.BLOOD_GROUP = __Z2.BLOOD_GROUP
{code}





Re: [DISCUSS] Data loss handling improvements

2020-05-06 Thread Anton Vinogradov
Alexei,

1,2,4,5 - looks good to me, no objections here.

>> 3. Lost state is impossible to reset if a topology doesn't have at least
>> one owner for each lost partition.

Do you mean that, according to your example, where
>> a node2 has left, soon a node3 has left. If the node2 is returned to
>> the topology first, it would have stale data for some keys.
do we have to have node2 in the cluster to be able to reset "lost" for node2's data?

>> at least one owner for each lost partition.
What is the reason to have owners for all lost partitions when we want to
reset only some of them (the available ones)?
Will it be possible to perform operations on non-lost partitions when the
cluster has at least one lost partition?

On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
alexey.scherbak...@gmail.com> wrote:

> Folks,
>
> I've almost finished a patch bringing some improvements to the data loss
> handling code, and I wish to discuss proposed changes with the community
> before submitting.
>
> *The issue*
>
> During the grid's lifetime, it's possible to get into a situation when some
> data nodes have failed or mistakenly stopped. If a number of stopped nodes
> exceeds a certain threshold depending on configured backups, count a data
> loss will occur. For example, a grid having one backup (meaning at least
> two copies of each data partition exist at the same time) can tolerate only
> one node loss at the time. Generally, data loss is guaranteed to occur if
> backups + 1 or more nodes have failed simultaneously using default affinity
> function.
>
> For in-memory caches, data is lost forever. For persistent caches, data is
> not physically lost and accessible again after failed nodes are returned to
> the topology.
>
> Possible data loss should be taken into consideration while designing an
> application.
>
>
>
> *Consider an example: money is transferred from one deposit to another, and
> all nodes holding data for one of the deposits are gone.In such a case, a
> transaction temporary cannot be completed until a cluster is recovered from
> the data loss state. Ignoring this can cause data inconsistency.*
> It is necessary to have an API telling us if an operation is safe to
> complete from the perspective of data loss.
>
> Such an API exists for some time [1] [2] [3]. In short, a grid can be
> configured to switch caches to the partial availability mode if data loss
> is detected.
>
> Let's give two definitions according to the Javadoc for
> *PartitionLossPolicy*:
>
> ·   *Safe* (data loss handling) *policy* - cache operations are only
> available for non-lost partitions (PartitionLossPolicy != IGNORE).
>
> ·   *Unsafe policy* - cache operations are always possible
> (PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
> partitions automatically re-created on the remaining nodes if needed or
> immediately owned if a last supplier has left during rebalancing.
>
> *That needs to be fixed*
>
> 1. The default loss policy is unsafe, even for persistent caches in the
> current implementation. It can result in unintentional data loss and
> business invariants' failure.
>
> 2. Node restarts in the persistent grid with detected data loss will cause
> automatic resetting of LOST state after the restart, even if the safe
> policy is configured. It can result in data loss or partition desync if not
> all nodes are returned to the topology or returned in the wrong order.
>
>
> *An example: a grid has three nodes, one backup. The grid is under load.
> First, a node2 has left, soon a node3 has left. If the node2 is returned to
> the topology first, it would have stale data for some keys. Most recent
> data are on node3, which is not in the topology yet. Because a lost state
> was reset, all caches are fully available, and most probably will become
> inconsistent even in safe mode.*
> 3. Configured loss policy doesn't provide guarantees described in the
> Javadoc depending on the cluster configuration[4]. In particular, unsafe
> policy (IGNORE) cannot be guaranteed if a baseline is fixed (not
> automatically readjusted on node left), because partitions are not
> automatically get reassigned on topology change, and no nodes are existing
> to fulfill a read/write request. Same for READ_ONLY_ALL and READ_WRITE_ALL.
>
> 4. Calling resetLostPartitions doesn't provide a guarantee for full cache
> operations availability if a topology doesn't have at least one owner for
> each lost partition.
>
> The ultimate goal of the patch is to fix API inconsistencies and fix the
> most crucial bugs related to data loss handling.
>
> *The planned changes are:*
>
> 1. The safe policy is used by default, except for in-memory grids with
> enabled baseline auto-adjust [5] with zero timeout [6]. In the latter case,
> the unsafe policy is used by default. It protects from unintentional data
> loss.
>
> 2. Lost state is never reset in the case of grid nodes restart (despite
> full restart). It makes real data loss impossible in persistent grids if
> followin

Crash recovery speed-up #3, Cellular Switch

2020-05-06 Thread Anton Vinogradov
Igniters,

PME-free switch [1] (since 2.8) skips PME on node left when possible
(baseline + fully rebalanced cluster).
This means we already wait for nothing (except recovery) to perform the
switch.
This optimization allows continuing already started operations during or
after the switch if they are not affected by the failed primary.
But upcoming operations still can't be started until the switch is finished
cluster-wide.

Let me propose an additional optimization - Cellular switch.
Cellular Affinity [2] means that nodes are combined into virtual cells where,
for each partition, backups are located in the same cell as primaries.
The simplest way to gain Cellular Affinity is to use backup filters [3].
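
For readers who have not used backup filters, here is a minimal sketch of a cell-aware cache configuration, assuming every node is started with a custom "CELL" attribute (the attribute name and partition count are illustrative):

```java
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

public class CellularAffinityExample {
    /** Keeps the primary and backups of every partition inside the same "CELL". */
    public static CacheConfiguration<Integer, String> cellAwareCacheCfg() {
        RendezvousAffinityFunction aff = new RendezvousAffinityFunction(false, 1024);

        // The first node in the already-selected list is the primary; accept a backup
        // candidate only if its CELL attribute matches the primary's one.
        aff.setAffinityBackupFilter((candidate, chosen) ->
            candidate.attribute("CELL").equals(chosen.get(0).attribute("CELL")));

        return new CacheConfiguration<Integer, String>("cellular")
            .setBackups(1)
            .setAffinity(aff);
    }
}
```

If I recall correctly, recent Ignite versions also ship a ready-made ClusterNodeAttributeAffinityBackupFilter that implements essentially the same check.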

Cellular Affinity allows finishing the switch outside the affected cell
instantly, with the following assumptions:
- Replicated caches should be recovered first, since every node is affected
(as a backup) by any failed primary.
  But it is expected that replicated caches are effectively read-only (have
extremely rare updates), so there is nothing to wait for here.
- Upcoming replicated transactions (with non-failed primaries) can be
started but can't be committed until the switch is finished cluster-wide.
- Upcoming transactions related to the broken cell will wait for cell
recovery (cluster-wide switch finish).

... and this means:
In addition to the PME-free switch, where we are able to continue already
started operations during or after the switch, we are now also able to
perform most of the upcoming operations during the switch.

In other words, the Cellular switch has little effect on an operation's
latency when the operation is not related to the failed cell.

According to the benchmark [4], which checks "how fast upcoming transactions
(started after the switch start) can be committed when we have thousands of
prepared transactions (prepared before the switch start)", we get 5326 ms [5]
operation latency on master and 65 ms [6] with the proposed fix, which is
~80 times faster.

Fix [7] (as a part of IEP-45 [8]) is ready to be reviewed.
Waiting for your review!


[1]
http://apache-ignite-developers.2346864.n4.nabble.com/Non-blocking-PME-Phase-One-Node-fail-tp43531p44586.html
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up#IEP-45:CrashRecoverySpeed-Up-Cellularswitch
[3]
https://gist.github.com/anton-vinogradov/c50f9d0ce3e3e2997646f84ba7eba5f5#file-bench-java-L417
[4]
https://gist.github.com/anton-vinogradov/c50f9d0ce3e3e2997646f84ba7eba5f5
[5]
https://gist.github.com/anton-vinogradov/a35a3a8151b7494aa84b83f58cb75889#file-master-txt-L15
[6]
https://gist.github.com/anton-vinogradov/a35a3a8151b7494aa84b83f58cb75889#file-fix-txt-L15
[7] https://issues.apache.org/jira/browse/IGNITE-12617
[8]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up


Re: Discovery-based services deployment guarantees question

2020-05-06 Thread Mikhail Petrov

Hello, Igniters.

I am working on IGNITE-12894 - [1]. It seems to have a root cause
similar to the problem described in this thread.


To solve these problems, I propose to change the behavior of
IgniteServiceProcessor#serviceTopology when the timeout argument is 0.
At the moment, IgniteServiceProcessor#serviceTopology returns the
topology immediately in this case, regardless of whether it was initialized
or not. I propose to wait for the service topology to be initialized
if the requested service is already registered on the local node, but the
full message has not been received from the coordinator yet.


So the final behavior of IgniteServices#serviceProxy() will be (see the
sketch after this list):
1. If the timeout is specified - it waits for the topology for up to the
specified timeout even if the requested service has not been registered
yet, as in the current implementation.


2. If the timeout is not specified - if the service has not been registered,
it fails immediately; otherwise, it waits for the topology initialization
(full message from the coordinator) if needed.
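
As a reminder of how the two call forms look from the user's side (a minimal sketch; the service name and interface are made up):

```java
import org.apache.ignite.Ignite;

public class ServiceProxyExample {
    interface CounterService {
        int increment();
    }

    public static void useProxies(Ignite ignite) {
        // 1. With an explicit timeout: waits up to 5 seconds for the service topology.
        CounterService withTimeout =
            ignite.services().serviceProxy("counter", CounterService.class, false, 5_000);

        // 2. Without a timeout: fails immediately if the service is not registered,
        //    otherwise (per the proposal) waits for topology initialization if needed.
        CounterService noTimeout =
            ignite.services().serviceProxy("counter", CounterService.class, false);

        withTimeout.increment();
        noTimeout.increment();
    }
}
```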


Here is PR with the implementation of the described proposal - [2].

WDYT?

[1] - https://issues.apache.org/jira/browse/IGNITE-12894
[2] - https://github.com/apache/ignite/pull/7771

On 30.12.2019 13:03, Alexey Goncharuk wrote:

Agree, sounds like a plan, thanks for taking over!

пн, 30 дек. 2019 г. в 13:00, Vyacheslav Daradur :


Alexey,

I would not make it default in the current implementation.

Waiting of proxies on non-deployment-initiator nodes should be
improved - additional checks are required:
1) We should not wait if requested service has not been submitted to
deploy (when there is no info about such service)
2) If service deployment failed - getting proxy should be failed or
interrupted as well (do not wait for all available timeout)

Let's schedule this improvement to next release, I'll try to find a
time to implement it.

What do you think?

On Mon, Dec 30, 2019 at 12:05 PM Alexey Goncharuk
 wrote:

Vyacheslav, thanks for the explanation, makes sense to me.

I was thinking though, should we make the behavior with the timeout

default

for all proxies?

Just my opinion - I think for a user it would be hard to control which

node

deploys the service, especially if multiple nodes deploy it concurrently.
Most likely users will end up always calling the second option of the

proxy

(with the timeout), so, perhaps, make it default?

вс, 29 дек. 2019 г. в 21:05, Vyacheslav Daradur :


Alexey,

I've prepared pr [1] to show our proxy invocation guarantees and to
avoid misunderstanding.

Please, let me know if you think that we should improve our guaranties
in some cases.

[1] https://github.com/apache/ignite/pull/7213

On Tue, Dec 24, 2019 at 7:27 PM Vyacheslav Daradur <

daradu...@gmail.com>

wrote:

even the local deployment looks broken: if a compute job
is sent to a remote node after the service deployment

This is a different case and covered by retries:
* If you deploy a service from node A to node B, then take a proxy
from node A (deployment initiator) it should NOT fail even if node B
has not received yet a message that deployment finished successfully,
because of proxy invocation retries.

Look like It's better to describe all these cases on the wiki.


Should we schedule this ticket for the further work on Services

IEP?

If it is a frequent use-case we definitely should implement it.


On Tue, Dec 24, 2019 at 6:55 PM Alexey Goncharuk
 wrote:

Ok, got it.

I agree that this is consistent with the old behavior, but this is

the

kind

of errors we wanted to get rid of when we started the IEP. From the
user perspective, even the local deployment looks broken: if a

compute

job

is sent to a remote node after the service deployment, the job

execution

may fail due to this error.

Should we schedule this ticket for the further work on Services

IEP?

вт, 24 дек. 2019 г. в 18:49, Vyacheslav Daradur <

daradu...@gmail.com>:

Not sure that "user fallback" is the right definition, it is not

new

behaviour in comparison with legacy implementation.

Our synchronous deployment provides guaranties for a deployment
initiator to be able to start work with service immediately after
deployment finished successfully.
For not the deployment initiator we can't provide such guarantees

now,

because of unknown deployment result and possibly fail.

In this case, a reasonable timeout might be an acceptable

solution.

We can improve guaranties in future releases, but there is an

open

question:
- how long taking of proxy should wait? - deployment of "heavy"
service may take a while

On Tue, Dec 24, 2019 at 6:19 PM Alexey Goncharuk
 wrote:

What should be the user fallback in this case? Retry

infinitely? Is

there a

way to wait for the proper deployment?

вт, 24 дек. 2019 г. в 12:41, Vyacheslav Daradur <

daradu...@gmail.com>:

I’ll take a look at the end of the week.

There is one more use-case:
* if you initiate deployment from node A, but getting proxy

on

node B

(wh

Re: [DISCUSSION] Ignite WebConsole Deprecation

2020-05-06 Thread Nikolay Izhikov
Hello.

+1 to remove any graphical utilities from the Ignite core.
I think we should maintain and support only cmd, JMX, and other «core» tools.

We shouldn't waste our resources on supporting pretty-looking UI tools.


> On 2 May 2020, at 04:05, Saikat Maitra wrote:
> 
> Hello Denis,
> 
> I am thinking if we should move web-console to a separate repo like
> ignite-web-console. In my opinion the tech stack we have in web-console
> like npm, nodejs and Mondodb etc are little different and can be hosted as
> a separate project in a git repo.
> 
> I had used web-console for streamers project and found it helpful in
> creating table schema and execute sql queries and also to do cache lookup.
> 
> https://github.com/samaitra/streamers
> 
> If we continue to find that usage of ignite-web-console is limited then we
> can plan moving ignite-web-console in Apache Attic
> 
> https://attic.apache.org/
> 
> Please let me know your thoughts.
> 
> Regards,
> Saikat
> 
> 
> 
> On Fri, May 1, 2020 at 12:43 PM Denis Magda  wrote:
> 
>> Igniters,
>> 
>> I would like to hear your opinion on what we should do with Ignite
>> WebConsole.
>> 
>> To my knowledge, we don't have active maintainers of the component, and a
>> list of issues is piling up [1]. Users even report that the docker images
>> have not been updated for more than a year [2].
>> 
>> Personally, I share the opinion of those who believe the community needs to
>> provide and support all the essential tooling (metrics and tracing API,
>> command-line tool) while the UI tools are not our business. There is a
>> myriad of UI tools Ignite can be monitored and traced with. Users already
>> have plenty of choices.
>> 
>> What are your thoughts? Probably, some of you want to become maintainers of
>> the component. Otherwise, we should let the tool go.
>> 
>> [1]
>> 
>> https://issues.apache.org/jira/browse/IGNITE-12923?jql=project%20%3D%20IGNITE%20AND%20text%20~%20%22web%20console%22%20AND%20status%20%3D%20Open%20and%20type%20%3D%20Bug%20ORDER%20BY%20createdDate%20
>> [2] https://issues.apache.org/jira/browse/IGNITE-12923
>> 
>> -
>> Denis
>> 



[jira] [Created] (IGNITE-12985) Fix unguarded log.info/log.debug/log.trace usages

2020-05-06 Thread Sergey Antonov (Jira)
Sergey Antonov created IGNITE-12985:
---

 Summary: Fix unguarded log.info/log.debug/log.trace usages
 Key: IGNITE-12985
 URL: https://issues.apache.org/jira/browse/IGNITE-12985
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Antonov
Assignee: Sergey Antonov


There are multiple places in the code where {{log.info()/log.debug()/log.trace()}} 
is called without checking {{if (log.isInfoEnabled())}}. This leads to unnecessary 
message construction (and potentially polluted logs) when INFO/DEBUG/TRACE is disabled.
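
The guarded pattern the ticket refers to looks like this (illustrative snippet; {{grpName}} and {{parts}} are placeholder variables):

{code}
// Unguarded: the message string is built even when INFO output is disabled.
log.info("Rebalancing finished [grp=" + grpName + ", parts=" + parts + ']');

// Guarded: the check skips building the message when INFO output is disabled.
if (log.isInfoEnabled())
    log.info("Rebalancing finished [grp=" + grpName + ", parts=" + parts + ']');
{code}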





Re: [DISCUSSION] Ignite WebConsole Deprecation

2020-05-06 Thread Вячеслав Коптилин
Hello,

+1 to remove this component or move it to a separate repository if someone
wants to maintain it.
In case the web console provides useful features, we should consider how to
add them to our command-line utilities, if possible.

Thanks,
Slava.

Wed, 6 May 2020 at 16:10, Nikolay Izhikov:

> Hello.
>
> +1 to remove any graphical utilities from the Ignite core.
> I think we should maintain and support only cmd, JMX, and other «core»
> tools.
>
> We shouldn't waste our resources on supporting pretty looking UI tools.
>
>
> > 2 мая 2020 г., в 04:05, Saikat Maitra 
> написал(а):
> >
> > Hello Denis,
> >
> > I am thinking if we should move web-console to a separate repo like
> > ignite-web-console. In my opinion the tech stack we have in web-console
> > like npm, nodejs and Mondodb etc are little different and can be hosted
> as
> > a separate project in a git repo.
> >
> > I had used web-console for streamers project and found it helpful in
> > creating table schema and execute sql queries and also to do cache
> lookup.
> >
> > https://github.com/samaitra/streamers
> >
> > If we continue to find that usage of ignite-web-console is limited then
> we
> > can plan moving ignite-web-console in Apache Attic
> >
> > https://attic.apache.org/
> >
> > Please let me know your thoughts.
> >
> > Regards,
> > Saikat
> >
> >
> >
> > On Fri, May 1, 2020 at 12:43 PM Denis Magda  wrote:
> >
> >> Igniters,
> >>
> >> I would like to hear your opinion on what we should do with Ignite
> >> WebConsole.
> >>
> >> To my knowledge, we don't have active maintainers of the component, and
> a
> >> list of issues is piling up [1]. Users even report that the docker
> images
> >> have not been updated for more than a year [2].
> >>
> >> Personally, I share the opinion of those who believe the community
> needs to
> >> provide and support all the essential tooling (metrics and tracing API,
> >> command-line tool) while the UI tools are not our business. There is a
> >> myriad of UI tools Ignite can be monitored and traced with. Users
> already
> >> have plenty of choices.
> >>
> >> What are your thoughts? Probably, some of you want to become
> maintainers of
> >> the component. Otherwise, we should let the tool go.
> >>
> >> [1]
> >>
> >>
> https://issues.apache.org/jira/browse/IGNITE-12923?jql=project%20%3D%20IGNITE%20AND%20text%20~%20%22web%20console%22%20AND%20status%20%3D%20Open%20and%20type%20%3D%20Bug%20ORDER%20BY%20createdDate%20
> >> [2] https://issues.apache.org/jira/browse/IGNITE-12923
> >>
> >> -
> >> Denis
> >>
>
>


Re: IEP-44 Thin Client Discovery

2020-05-06 Thread Pavel Tupitsyn
Igniters, let's discuss the following issue:

For partition awareness, and now for cluster discovery, we use a response
flag to detect topology changes.
The problem is: if the client does not do anything (user code does not
perform operations),
then we'll never know about topology changes and may even lose the cluster
(all known nodes leave).

Should we introduce some keep-alive mechanism, so that thin clients send
periodic ping requests?
Maybe do this as a separate feature.

Thoughts?
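
To make the idea more concrete, here is a rough client-side sketch of what a keep-alive could look like today, using an existing cheap operation as a stand-in ping (the interval and the choice of operation are arbitrary; a dedicated ping request would be a protocol addition):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.client.IgniteClient;

public class ThinClientKeepAlive {
    /** Periodically touches the cluster so topology updates keep flowing to the client. */
    public static ScheduledExecutorService start(IgniteClient client) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        timer.scheduleAtFixedRate(() -> {
            try {
                // Any lightweight request works as a heartbeat: the response carries
                // the topology-change flag used by partition awareness/discovery.
                client.cacheNames();
            }
            catch (Exception e) {
                // Connection problems surface here instead of on the next user call.
                e.printStackTrace();
            }
        }, 30, 30, TimeUnit.SECONDS);

        return timer;
    }
}
```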

On Tue, Apr 28, 2020 at 8:14 PM Pavel Tupitsyn  wrote:

> Ok, I've updated IEP and POC accordingly:
> * Config flag removed
> * IPs and host names retrieval simplified - use existing node properties
> and attributes instead of Compute
>
> On Tue, Apr 28, 2020 at 7:57 PM Igor Sapego  wrote:
>
>> I guess it makes sense. If anyone needs more control over connection
>> we would need to implement a new feature anyway (like node filter we
>> discussed earlier)
>>
>> Best Regards,
>> Igor
>>
>>
>> On Tue, Apr 28, 2020 at 12:29 PM Pavel Tupitsyn 
>> wrote:
>>
>> > > enable the capability if the best effort affinity is on
>> > I agree, makes sense.
>> >
>> > Igor, what do you think?
>> >
>> > On Tue, Apr 28, 2020 at 8:25 AM Denis Magda  wrote:
>> >
>> > > Pavel,
>> > >
>> > > That would be a tremendous improvement for the recently release best
>> > effort
>> > > affinity feature. Without this capability, we force application
>> > developers
>> > > to reopen thin client connections every type a cluster is scaled out.
>> I
>> > > believe that once the folks start using the best effort affinity,
>> we'll
>> > be
>> > > hearing more of a feature request for what you're proposing in this
>> > thread.
>> > > So, thanks for taking care of this proactively!
>> > >
>> > > As for the public API changes, do we really need any extra flag? I
>> would
>> > > enable the capability if the best effort affinity is on. For me, it's
>> > just
>> > > a natural improvement of the latter and it sounds reasonable to reuse
>> the
>> > > best effort affinity's flag.
>> > >
>> > > -
>> > > Denis
>> > >
>> > >
>> > > On Mon, Apr 27, 2020 at 2:58 AM Pavel Tupitsyn 
>> > > wrote:
>> > >
>> > > > Igniters,
>> > > >
>> > > > I've prepared an IEP [1] and a POC [2] for Thin Client Discovery
>> > feature.
>> > > > Let's discuss it here.
>> > > >
>> > > > In particular, I'd like to address the following points:
>> > > >
>> > > > 1. Value: do you think this would be a good feature to have?
>> > > > 2. Public API changes: is a boolean property enough? Should we have
>> > > > something more complex, so users can plug in custom logic to filter
>> > > and/or
>> > > > translate IPs and host names?
>> > > > 3. Server-side implementation details: should we use Compute, Node
>> > > > Attributes, or something else to retrieve client endpoints from all
>> > nodes
>> > > > in cluster?
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-44%3A+Thin+client+cluster+discovery
>> > > > [2] https://github.com/apache/ignite/pull/7744
>> > > > [3] https://issues.apache.org/jira/browse/IGNITE-12932
>> > > >
>> > >
>> >
>>
>


Re: Extended logging for rebalance performance analysis

2020-05-06 Thread Ivan Rakov
Hi,

> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> duration rebalance of cache group after which partitions distribution is
> output, set in milliseconds, default value is 10 minutes.

 Does it mean that if the rebalancing process took less than 10 minutes,
only a short version of the message (with supplier statistics) will show up?

In general, I have no objections.


On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл 
wrote:

> Hi, Igniters!
>
> I'd like to share a new small feature in AI [1].
>
> Current rebalance logging does not allow you to quickly answer following
> questions:
> 1)How long was the balance(divided by supplier)?
> 2)How many records and bytes per supplier were rebalanced?
> 3)How many times did rebalance restart?
> 4)Which partitions were rebalanced and from which nodes did they receive
> them?
> 5)When did rebalance for all cache groups end?
>
> What you can see in logs now:
>
> 1)Starting rebalance with order of cache groups.
> Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
>
> 2)Start rebalance of cache group from a specific supplier, specifying
> partition ids and mode - historical or full.
> Starting rebalance routine [ignite-sys-cache,
> topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> supplier=8c525892-703b-4fc4-b28b-b2f13970, fullPartitions=[0-99],
> histPartitions=[]]
>
> 3)Getting partial or complete partitions of cache group.
> Completed rebalancing [grp=ignite-sys-cache,
> supplier=8c525892-703b-4fc4-b28b-b2f13970,
> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
> Completed (final) rebalancing [grp=ignite-sys-cache,
> supplier=c2146a04-dc23-4bc9-870d-dfbb55c1,
> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>
> 4)End rebalance of cache group.
> Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
> [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
> lastCancelledTime=-1]
>
> Rebalance statistics:
>
> To speed up rebalance analysis, statistics will be output for each cache
> group and total for all cache groups.
> If duration rebalance for cache group is greater than threshold value,
> partition distribution is output.
> Statistics will you to analyze duration of the balance for each supplier
> to understand which of them has been transmitting data for longest time.
>
> System properties are used to output statistics:
>
> IGNITE_QUIET - to output statistics, value must be false;
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> duration rebalance of cache group after which partitions distribution is
> output, set in milliseconds, default value is 10 minutes.
>
> Statistics examples:
>
> Successful full and historical rebalance of group cache, without
> partitions distribution.
> Rebalance information per cache group (successful rebalance): [id=3181548,
> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
> d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
> bytes, d - duration, h - historical, nodeId mapping
> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> [1=rebalancing.RebalanceStatisticsTest0]
> Rebalance information per cache group (successful rebalance): [id=3181547,
> name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
> he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
> bytes, d - duration, h - historical, nodeId mapping
> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>
> Successful full and historical rebalance of group cache, with partitions
> distribution.
> Rebalance information per cache group (successful rebalance): [id=3181548,
> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
> d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
> bytes, d - duration, h - historical, nodeId mapping
> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
> than 5 ms, printing detailed information about partitions distribution
> (threshold can be changed by setting number of milliseconds into
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su]

Re: Using GraalVM instead of standard JVM

2020-05-06 Thread Denis Magda
I'll leave this reference here so that we have a better understanding of
why it's worthwhile to support GraalVM:
https://blogs.oracle.com/graalvm/apache-spark—lightning-fast-on-graalvm-enterprise

Spark benefits from running on GraalVM, and so should we. Apart from memory
usage and performance advantages, this JVM can execute Python code. With
that, we could enable compute API support for Python.
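
For those who have not tried it, GraalVM's polyglot API keeps the embedding code quite small; a minimal sketch, assuming the GraalVM SDK is on the classpath and, for the Python part, the optional graalpython component is installed (`gu install python`):

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class PolyglotExample {
    public static void main(String[] args) {
        try (Context ctx = Context.create()) {
            // JavaScript ships with GraalVM out of the box.
            Value js = ctx.eval("js", "21 * 2");
            System.out.println("js: " + js.asInt());
        }

        // Python requires the optional graalpython component.
        try (Context ctx = Context.newBuilder("python").allowAllAccess(true).build()) {
            Value py = ctx.eval("python", "sum(range(10))");
            System.out.println("python: " + py.asInt());
        }
    }
}
```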

-
Denis


On Sun, May 13, 2018 at 12:23 PM Sven Beauprez 
wrote:

> Thnx all for the feedback.
>
> Looking forward to the results of such a test run.
>
> Regards,
>
> Sven
>
>
>
> SVEN BEAUPREZ
>
> L e a d   A r c h i t e c t
>
>
>
> De Kleetlaan 5, B-1831 Diegem
>
> www.theglue.com 
> On 10/05/2018, 17:44, "Petr Ivanov"  wrote:
>
> File the ticket and specify priority — and I will start researching.
>
> For test runs — we can have a copy of current test project and run
> some tests in different VMs (as you rightly remarked — right after JDK9
> task is complete).
>
>
>
>
> > On 10 May 2018, at 18:34, Dmitry Pavlov 
> wrote:
> >
> > Hi Peter,
> >
> > It seems it is one more argument to implement selectable VM for
> existing run-all chain instead of creating one more.
> >
> > Would it be easy to add one more option once JDK 9 run is ready?
> >
> > Sincerely,
> > Dmitriy Pavlov
> >
> > чт, 10 мая 2018 г. в 15:58, Dmitriy Setrakyan  >:
> > Would be nice to have a TC run on Graal, just to have an
> understanding
> > whether we support it or not.
> >
> > D.
> >
> > On Wed, May 9, 2018 at 4:28 PM, Denis Magda  > wrote:
> >
> > > The performance might become better just by replacing HotSpot with
> Graal,
> > > but something suggests me that Ignite has to be adopted for this
> JVM (as
> > > well as for Azul VM) to get more benefits. Probably, someone will
> get
> > > interested and pick this task up.
> > >
> > > What stands out is that the Graal folks also see this VM as an
> opportunity
> > > to run custom code on a database side like Oracle or MySQL:
> > > https://oracle.github.io/oracle-db-mle/ <
> https://oracle.github.io/oracle-db-mle/> It's a sort of their response to
> > > compute grid functionality of data grids and Hadoop ecosystem.
> > >
> > > --
> > > Denis
> > >
> > > On Wed, May 9, 2018 at 5:23 AM, sbeaupre <
> sven.beaup...@theglue.com >
> > > wrote:
> > >
> > > > This is just a thought that came out of a discussion with
> Dimitry this
> > > > morning. Recently Oracle has released GraalVM 1.0 after many
> years of
> > > > research and development, as a replacement for standard JVM.
> > > >
> > > > It should come with huge improvements on several areas
> (interesting for
> > > > ignite: AOT, native compilation, remove object allocation in
> many cases,
> > > > ...)
> > > >
> > > > Any interest from GG in this? Do you guys think it would give
> ignite a
> > > > performance boost (haven't tested it myself, just checking if it
> is
> > > > worthwhile in the first place, probably low on our prio list).
> > > >
> > > > More info:
> > > > - GraalVM for Java:
> > > > http://www.graalvm.org/docs/why-graal/#for-java-programs
> 
> > > > - Twitter is running GraalVM in production for a while now:
> > > > https://www.youtube.com/watch?v=pR5NDkIZBOA <
> https://www.youtube.com/watch?v=pR5NDkIZBOA>
> > > > - Getting started:
> > > > http://www.graalvm.org/docs/getting-started/ <
> http://www.graalvm.org/docs/getting-started/>
> > > >
> > > > regards,
> > > >
> > > > Sven
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from:
> http://apache-ignite-developers.2346864.n4.nabble.com/ <
> http://apache-ignite-developers.2346864.n4.nabble.com/>
> > > >
> > >
>
>
>
>


Re: [DISCUSS] Data loss handling improvements

2020-05-06 Thread Alexei Scherbakov
Wed, 6 May 2020 at 12:54, Anton Vinogradov:

> Alexei,
>
> 1,2,4,5 - looks good to me, no objections here.
>
> >> 3. Lost state is impossible to reset if a topology doesn't have at least
> >> one owner for each lost partition.
>
> Do you mean that, according to your example, where
> >> a node2 has left, soon a node3 has left. If the node2 is returned to
> >> the topology first, it would have stale data for some keys.
> we have to have node2 at cluster to be able to reset "lost" to node2's
> data?
>

Not sure if I understand the question, but I'll try to answer using an example:
Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled, partition p is
owned by n2 and n3.
1. Topology is activated.
2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
3. n2 has failed.
4. cache.put(p, 1) // n3 has p->1, updateCounter=2
5. n3 has failed, partition loss has happened.
6. n2 joins a topology, it has stale data (p->0)

We actually have 2 issues:
7. cache.put(p, 2) will succeed, n2 has p->2, n3 has p->0, data is diverged
and will not be adjusted by counters rebalancing if n3 later joins the
topology.
or
8. n3 joins the topology; it has the actual data (p->1), but rebalancing will
not work because the joining node has the highest counter (it can only be a
demander in this scenario).

In both cases, rebalancing by counters will not work, causing data divergence
between copies.


>
> >> at least one owner for each lost partition.
> What the reason to have owners for all lost partitions when we want to
> reset only some (available)?
>

It was never possible to reset only a subset of lost partitions. The
reason is to keep the guarantee of the resetLostPartitions method: all cache
operations are resumed and the data is correct.


> Will it be possible to perform operations on non-lost partitions when the
> cluster has at least one lost partition?
>

Yes it will be.


>
> On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
> alexey.scherbak...@gmail.com> wrote:
>
> > Folks,
> >
> > I've almost finished a patch bringing some improvements to the data loss
> > handling code, and I wish to discuss proposed changes with the community
> > before submitting.
> >
> > *The issue*
> >
> > During the grid's lifetime, it's possible to get into a situation when
> some
> > data nodes have failed or mistakenly stopped. If a number of stopped
> nodes
> > exceeds a certain threshold depending on configured backups, count a data
> > loss will occur. For example, a grid having one backup (meaning at least
> > two copies of each data partition exist at the same time) can tolerate
> only
> > one node loss at the time. Generally, data loss is guaranteed to occur if
> > backups + 1 or more nodes have failed simultaneously using default
> affinity
> > function.
> >
> > For in-memory caches, data is lost forever. For persistent caches, data
> is
> > not physically lost and accessible again after failed nodes are returned
> to
> > the topology.
> >
> > Possible data loss should be taken into consideration while designing an
> > application.
> >
> >
> >
> > *Consider an example: money is transferred from one deposit to another,
> and
> > all nodes holding data for one of the deposits are gone.In such a case, a
> > transaction temporary cannot be completed until a cluster is recovered
> from
> > the data loss state. Ignoring this can cause data inconsistency.*
> > It is necessary to have an API telling us if an operation is safe to
> > complete from the perspective of data loss.
> >
> > Such an API exists for some time [1] [2] [3]. In short, a grid can be
> > configured to switch caches to the partial availability mode if data loss
> > is detected.
> >
> > Let's give two definitions according to the Javadoc for
> > *PartitionLossPolicy*:
> >
> > ·   *Safe* (data loss handling) *policy* - cache operations are only
> > available for non-lost partitions (PartitionLossPolicy != IGNORE).
> >
> > ·   *Unsafe policy* - cache operations are always possible
> > (PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
> > partitions automatically re-created on the remaining nodes if needed or
> > immediately owned if a last supplier has left during rebalancing.
> >
> > *That needs to be fixed*
> >
> > 1. The default loss policy is unsafe, even for persistent caches in the
> > current implementation. It can result in unintentional data loss and
> > business invariants' failure.
> >
> > 2. Node restarts in the persistent grid with detected data loss will
> cause
> > automatic resetting of LOST state after the restart, even if the safe
> > policy is configured. It can result in data loss or partition desync if
> not
> > all nodes are returned to the topology or returned in the wrong order.
> >
> >
> > *An example: a grid has three nodes, one backup. The grid is under load.
> > First, a node2 has left, soon a node3 has left. If the node2 is returned
> to
> > the topology first, it would have stale data for some keys. Most recent
> > data are on node3, which is not in the topology ye

Re: Using GraalVM instead of standard JVM

2020-05-06 Thread Stephen Darlington
I've been playing around with it. I was really impressed that I could run
JavaScript on Ignite with comparatively little code:

https://github.com/sdarlington/ignite-graalvm 


I’ve not been looking at performance, though.

Regards,
Stephen

> On 6 May 2020, at 17:52, Denis Magda  wrote:
> 
> I'll leave this reference here so that we have a better understanding of
> why it's worthwhile to support GraalVM:
> https://blogs.oracle.com/graalvm/apache-spark
> —lightning-fast-on-graalvm-enterprise
> 
> Spark benefits from running on GraalVM, so should we. Apart from memory
> usage and performance advantages, this JVM can execute Python code. With
> that, we can enable compute APIs support for Python.
> 
> -
> Denis
> 
> 
> On Sun, May 13, 2018 at 12:23 PM Sven Beauprez 
> wrote:
> 
>> Thnx all for the feedback.
>> 
>> Looking forward to the results of such a test run.
>> 
>> Regards,
>> 
>> Sven
>> 
>> 
>> 
>> SVEN BEAUPREZ
>> 
>> L e a d   A r c h i t e c t
>> 
>> 
>> 
>> De Kleetlaan 5, B-1831 Diegem
>> 
>> www.theglue.com 
>> On 10/05/2018, 17:44, "Petr Ivanov"  wrote:
>> 
>>File the ticket and specify priority — and I will start researching.
>> 
>>For test runs — we can have a copy of current test project and run
>> some tests in different VMs (as you rightly remarked — right after JDK9
>> task is complete).
>> 
>> 
>> 
>> 
>>> On 10 May 2018, at 18:34, Dmitry Pavlov 
>> wrote:
>>> 
>>> Hi Peter,
>>> 
>>> It seems it is one more argument to implement selectable VM for
>> existing run-all chain instead of creating one more.
>>> 
>>> Would it be easy to add one more option once JDK 9 run is ready?
>>> 
>>> Sincerely,
>>> Dmitriy Pavlov
>>> 
>>> чт, 10 мая 2018 г. в 15:58, Dmitriy Setrakyan > >:
>>> Would be nice to have a TC run on Graal, just to have an
>> understanding
>>> whether we support it or not.
>>> 
>>> D.
>>> 
>>> On Wed, May 9, 2018 at 4:28 PM, Denis Magda > > wrote:
>>> 
 The performance might become better just by replacing HotSpot with
>> Graal,
 but something suggests me that Ignite has to be adopted for this
>> JVM (as
 well as for Azul VM) to get more benefits. Probably, someone will
>> get
 interested and pick this task up.
 
 What stands out is that the Graal folks also see this VM as an
>> opportunity
 to run custom code on a database side like Oracle or MySQL:
 https://oracle.github.io/oracle-db-mle/ <
>> https://oracle.github.io/oracle-db-mle/> It's a sort of their response to
 compute grid functionality of data grids and Hadoop ecosystem.
 
 --
 Denis
 
 On Wed, May 9, 2018 at 5:23 AM, sbeaupre <
>> sven.beaup...@theglue.com >
 wrote:
 
> This is just a thought that came out of a discussion with
>> Dimitry this
> morning. Recently Oracle has released GraalVM 1.0 after many
>> years of
> research and development, as a replacement for standard JVM.
> 
> It should come with huge improvements on several areas
>> (interesting for
> ignite: AOT, native compilation, remove object allocation in
>> many cases,
> ...)
> 
> Any interest from GG in this? Do you guys think it would give
>> ignite a
> performance boost (haven't tested it myself, just checking if it
>> is
> worthwhile in the first place, probably low on our prio list).
> 
> More info:
> - GraalVM for Java:
>http://www.graalvm.org/docs/why-graal/#for-java-programs
>> 
> - Twitter is running GraalVM in production for a while now:
>https://www.youtube.com/watch?v=pR5NDkIZBOA <
>> https://www.youtube.com/watch?v=pR5NDkIZBOA>
> - Getting started:
>http://www.graalvm.org/docs/getting-started/ <
>> http://www.graalvm.org/docs/getting-started/>
> 
> regards,
> 
> Sven
> 
> 
> 
> 
> 
> --
> Sent from:
>> http://apache-ignite-developers.2346864.n4.nabble.com/ <
>> http://apache-ignite-developers.2346864.n4.nabble.com/>
> 
 
>> 
>> 
>> 
>> 




Re: Extended logging for rebalance performance analysis

2020-05-06 Thread Maxim Muzafarov
Kirill,


Thank you for raising this topic. It's true that the rebalance process
still requires additional information for analyzing issues. Please,
don't think that I'm against your changes :-)

* My short answer. *

We won't do performance analysis on the production environment. Each
time we need performance analysis, it will be done on a test
environment with verbose logging enabled. Thus I suggest moving these
changes to a separate `profiling` module and extending the logging much
more without any size limitations, the same as these [2] [3]
activities do.

Let's keep the `core` module as simple as possible.
Let's design the right API for accessing rebalance internals for
profiling tools.
Can you please remove all changes from your PR [6] that are not
related to the proposed topic (e.g. variable renamings)?


* The long answer. *

Here are my thoughts on this. There are two different types of issues
in the rebalance process. The first case must be covered by daily
monitoring subsystems, the second case by additional
profiling tools:
1. errors during the rebalancing (e.g. rebalance does not happen when required)
2. rebalancing performance issues (e.g. rebalance is slow)

Daily monitoring tools (JMX, logging) are always turned on and
shouldn't require additional system resources themselves. Since these
metrics must be lightweight, no internal aggregation machinery is
used on them. All these metrics are collected from each node
independently. Please take a look at this issue [1], which covers most
of your questions mentioned above.

For all available metrics, we can configure LogExporterSpi, so they
will be available in logs.
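
For reference, enabling that exporter is a couple of lines in the node configuration (a minimal sketch; the export period is arbitrary):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.metric.log.LogExporterSpi;

public class MetricsLogExportExample {
    /** Node configuration that periodically dumps all registered metrics to the log. */
    public static IgniteConfiguration withLogExporter() {
        LogExporterSpi logExporter = new LogExporterSpi();
        logExporter.setPeriod(60_000); // Export every minute; the value is arbitrary.

        return new IgniteConfiguration().setMetricExporterSpi(logExporter);
    }
}
```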

> 1)How long was the balance(divided by supplier)?
rebalancingStartTime, rebalancingEndTime already exists for cache groups [4].
We can add the same for suppliers.

> 2)How many records and bytes per supplier were rebalanced?
We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
We will have rebalancingExpectedKeys [5].
We can add a new metric per cache keys, per supplier.

> 3)How many times did the rebalance restart?
rebalancingLastCancelledTime [4] metric exists.
Do we need to keep historical data on it?

> 4)Which partitions were rebalanced and from which nodes did they receive them?
Let's print this information to log prior to the rebalance process
starts. This will be helpful information and do not require a lot of
changes.

> 5)When did rebalance for all cache groups end?
This metric may be aggregated from rebalancingEndTime [4] by pulling
it from all nodes for all caches.


Rebalancing performance issues are related to profiling tools. Such tools
may require additional system resources and definitely require a
dedicated environment for tests. We can't do performance analysis on
production environments due to the performance impact.
I see some disadvantages of adding such tools to production code:
- verbose logging may affect performance;
- the problematic process may become even worse if an automatic
threshold suddenly turns on;
- new code changes will require additional effort to keep logging up-to-date.


[1] https://issues.apache.org/jira/browse/IGNITE-12183
[2] https://issues.apache.org/jira/browse/IGNITE-12666
[3] 
https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
[4] https://issues.apache.org/jira/browse/IGNITE-12193
[5] https://issues.apache.org/jira/browse/IGNITE-12194
[6] https://github.com/apache/ignite/pull/7705/files

On Wed, 6 May 2020 at 19:50, Ivan Rakov  wrote:
>
> Hi,
>
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> > duration rebalance of cache group after which partitions distribution is
> > output, set in milliseconds, default value is 10 minutes.
>
>  Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.
>
>
> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл 
> wrote:
>
> > Hi, Igniters!
> >
> > I'd like to share a new small feature in AI [1].
> >
> > Current rebalance logging does not allow you to quickly answer following
> > questions:
> > 1)How long was the balance(divided by supplier)?
> > 2)How many records and bytes per supplier were rebalanced?
> > 3)How many times did rebalance restart?
> > 4)Which partitions were rebalanced and from which nodes did they receive
> > them?
> > 5)When did rebalance for all cache groups end?
> >
> > What you can see in logs now:
> >
> > 1)Starting rebalance with order of cache groups.
> > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
> >
> > 2)Start rebalance of cache group from a specific supplier, specifying
> > partition ids and mode - historical or full.
> > Starting rebalance routine [ignite-sys-cache,
> > topVer=AffinityTopologyVersion [topVer=2, m

Re: Extended logging for rebalance performance analysis

2020-05-06 Thread Alexei Scherbakov
Hello.

Let's look at existing rebalancing log for a single group:

[2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
[order=[ignite-sys-cache, cache1, cache2, default],
top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
...
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9,
11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
[2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10,
16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
[2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
[grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
[2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
RebalanceFuture [grp=CacheGroupContext [grp=cache1],
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
routines=2]

From these logs I can already get answers to 1 and 4.
The logs look concise and easy to read and understand, and should
remain that way.

But I think some proposed improvements can be done here without harm.

2. OK, let's add it to supplier info per cache with additional info:

[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

3. This information is already printed to log.

5. OK, let's add a summary line which is printed after all groups in the chain
are rebalanced or cancelled:

[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
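
Purely to illustrate what such a summary implies (the class and field names below
are hypothetical, not real Ignite internals), the per-supplier counters could look
roughly like this:

```java
// Illustrative sketch only: the counters a per-supplier completion message and
// the chain summary would aggregate. Names are hypothetical, not Ignite code.
class SupplierRebalanceStats {
    long entries;    // entries received from this supplier
    long bytesRcvd;  // bytes received from this supplier
    final long startTime = System.currentTimeMillis();

    void onSupplyMessage(int msgEntries, long msgBytes) {
        entries += msgEntries;
        bytesRcvd += msgBytes;
    }

    long durationMs() {
        return System.currentTimeMillis() - startTime;
    }
}
```

The chain summary is then just these counters summed over all suppliers of all
groups in the chain.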

Another thing I would suggest: add rebalanceId to all messages for ease of
grepping and to distinguish multiple rebalances for the same topology version.

Regarding the "detailed" message - I wasn't able to figure out any case
where it is useful. Not sure if we need it at all.
Kirill Tkalenko, can you give me an example?

ср, 6 мая 2020 г. в 21:01, Maxim Muzafarov :

> Kirill,
>
>
> Thank you for raising this topic. It's true that the rebalance process
> still requires additional information for analyzing issues. Please,
> don't think that I'm against your changes :-)
>
> * My short answer. *
>
> We won't do performance analysis on the production environment. Each
> time we need performance analysis it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extending the logging much
> more without any size limitations, the same as these [2] [3]
> activities do.
>
> Let's keep the `core` module as simple as possible.
> Let's design the right API for accessing rebalance internals for
> profiling tools.
> Can you, please, remove all changes from your PR [6] which are not
> related to the proposed topic? (e.g. variable renamings).
>
>
> * The long answer. *
>
> Here are my thoughts on this. There are two different types of issues
> in the rebalance process. The first case must be covered by daily
> monitoring subsystems, the second case must be covered by additional
> profiling tools.
> 1. errors during the rebalancing (e.g. rebalance does not happen when required)
> 2. performance rebalancing issues (e.g. rebalance is slow)
>
> Daily monitoring tools (JMX, logging) are always turned on and
> shouldn't require additional system resources themselves. Since these
> metrics must be lightweight, no internal aggregation machinery is
> used on them. All these metrics are collected from each node
> independently. Please, take a look at this issue [1] which covers most
> of your questions mentioned above.
>
> For all available metrics, we can configure LogExporterSpi, so they
> will be available in logs.
>
> > 1)How long was the rebalance (divided by supplier)?
> rebalancingStartTime and rebalancingEndTime already exist for cache groups
> [4].
> We can add the same for suppliers.
>
> > 2)How many records and bytes per supplier were rebalanced?
> We already

Re: Using GraalVM instead of standard JVM

2020-05-06 Thread Ivan Pavlukhin
Hi,

My thoughts:
* Does anyone have an idea what prevents us from running Ignite on GraalVM?
* Regarding Python. I suppose every JVM can leverage Jython [1]. Also
I think we should support all significant features on HotSpot as well.

[1] https://www.jython.org/

Best regards,
Ivan Pavlukhin

ср, 6 мая 2020 г. в 20:10, Stephen Darlington :
>
> I’ve been playing around with it. I was really impressed that I could run 
> JavaScript on Ignite with comparatively little code:
>
> https://github.com/sdarlington/ignite-graalvm 
> 
>
> I’ve not been looking at performance, though.
>
> Regards,
> Stephen
>
> > On 6 May 2020, at 17:52, Denis Magda  wrote:
> >
> > I'll leave this reference here so that we have a better understanding of
> > why it's worthwhile to support GraalVM:
> > https://blogs.oracle.com/graalvm/apache-spark
> > —lightning-fast-on-graalvm-enterprise
> >
> > Spark benefits from running on GraalVM, so should we. Apart from memory
> > usage and performance advantages, this JVM can execute Python code. With
> > that, we can enable compute APIs support for Python.
> >
> > -
> > Denis
> >
> >
> > On Sun, May 13, 2018 at 12:23 PM Sven Beauprez 
> > wrote:
> >
> >> Thnx all for the feedback.
> >>
> >> Looking forward to the results of such a test run.
> >>
> >> Regards,
> >>
> >> Sven
> >>
> >>
> >>
> >> SVEN BEAUPREZ
> >>
> >> L e a d   A r c h i t e c t
> >>
> >>
> >>
> >> De Kleetlaan 5, B-1831 Diegem
> >>
> >> www.theglue.com 
> >> On 10/05/2018, 17:44, "Petr Ivanov"  wrote:
> >>
> >>File the ticket and specify priority — and I will start researching.
> >>
> >>For test runs — we can have a copy of current test project and run
> >> some tests in different VMs (as you rightly remarked — right after JDK9
> >> task is complete).
> >>
> >>
> >>
> >>
> >>> On 10 May 2018, at 18:34, Dmitry Pavlov 
> >> wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> It seems it is one more argument to implement selectable VM for
> >> existing run-all chain instead of creating one more.
> >>>
> >>> Would it be easy to add one more option once JDK 9 run is ready?
> >>>
> >>> Sincerely,
> >>> Dmitriy Pavlov
> >>>
> >>> чт, 10 мая 2018 г. в 15:58, Dmitriy Setrakyan  >> >:
> >>> Would be nice to have a TC run on Graal, just to have an
> >> understanding
> >>> whether we support it or not.
> >>>
> >>> D.
> >>>
> >>> On Wed, May 9, 2018 at 4:28 PM, Denis Magda  >> > wrote:
> >>>
>  The performance might become better just by replacing HotSpot with
> >> Graal,
>  but something suggests me that Ignite has to be adopted for this
> >> JVM (as
>  well as for Azul VM) to get more benefits. Probably, someone will
> >> get
>  interested and pick this task up.
> 
>  What stands out is that the Graal folks also see this VM as an
> >> opportunity
>  to run custom code on a database side like Oracle or MySQL:
>  https://oracle.github.io/oracle-db-mle/ <
> >> https://oracle.github.io/oracle-db-mle/> It's a sort of their response to
>  compute grid functionality of data grids and Hadoop ecosystem.
> 
>  --
>  Denis
> 
>  On Wed, May 9, 2018 at 5:23 AM, sbeaupre <
> >> sven.beaup...@theglue.com >
>  wrote:
> 
> > This is just a thought that came out of a discussion with
> >> Dimitry this
> > morning. Recently Oracle has released GraalVM 1.0 after many
> >> years of
> > research and development, as a replacement for standard JVM.
> >
> > It should come with huge improvements on several areas
> >> (interesting for
> > ignite: AOT, native compilation, remove object allocation in
> >> many cases,
> > ...)
> >
> > Any interest from GG in this? Do you guys think it would give
> >> ignite a
> > performance boost (haven't tested it myself, just checking if it
> >> is
> > worthwhile in the first place, probably low on our prio list).
> >
> > More info:
> > - GraalVM for Java:
> >http://www.graalvm.org/docs/why-graal/#for-java-programs
> >> 
> > - Twitter is running GraalVM in production for a while now:
> >https://www.youtube.com/watch?v=pR5NDkIZBOA <
> >> https://www.youtube.com/watch?v=pR5NDkIZBOA>
> > - Getting started:
> >http://www.graalvm.org/docs/getting-started/ <
> >> http://www.graalvm.org/docs/getting-started/>
> >
> > regards,
> >
> > Sven
> >
> >
> >
> >
> >
> > --
> > Sent from:
> >> http://apache-ignite-developers.2346864.n4.nabble.com/ <
> >> http://apache-ignite-developers.2346864.n4.nabble.com/>
> >
> 
> >>
> >>
> >>
> >>
>
>


Re: [DISCUSSION] Ignite WebConsole Deprecation

2020-05-06 Thread Denis Magda
Folks,

Personally, I like the idea of Apache Attic [1] suggested by Saikat. We are
not discontinuing Ignite WebConsole because it became useless, as was the case
with Hadoop Accelerator. The tool has good capabilities such as the
configuration wizard, and the only problem is that we no longer want to
support or maintain this technology (while it has the right to exist).

Is anybody interested in figuring out with the ASF-mates if Ignite
WebConsole can be accepted to Attic? The Attic's guidelines talk about an
ASF project as a whole and not about individual components.

[1] https://attic.apache.org

-
Denis


On Wed, May 6, 2020 at 7:20 AM Вячеслав Коптилин 
wrote:

> Hello,
>
> +1 to remove this component or move it to a separate repository if someone
> wants to maintain it.
> In case the web console provides useful features, we should consider how to
> add them to our command-line utilities, if possible.
>
> Thanks,
> Slava.
>
> ср, 6 мая 2020 г. в 16:10, Nikolay Izhikov :
>
> > Hello.
> >
> > +1 to remove any graphical utilities from the Ignite core.
> > I think we should maintain and support only cmd, JMX, and other «core»
> > tools.
> >
> > We shouldn't waste our resources on supporting pretty looking UI tools.
> >
> >
> > > 2 мая 2020 г., в 04:05, Saikat Maitra 
> > написал(а):
> > >
> > > Hello Denis,
> > >
> > > I am thinking if we should move web-console to a separate repo like
> > > ignite-web-console. In my opinion the tech stack we have in web-console
> > > like npm, Node.js and MongoDB etc. is a little different and can be hosted
> > as
> > > a separate project in a git repo.
> > >
> > > I had used web-console for the streamers project and found it helpful in
> > > creating table schemas, executing SQL queries and also doing cache
> > lookups.
> > >
> > > https://github.com/samaitra/streamers
> > >
> > > If we continue to find that usage of ignite-web-console is limited then
> > we
> > > can plan moving ignite-web-console in Apache Attic
> > >
> > > https://attic.apache.org/
> > >
> > > Please let me know your thoughts.
> > >
> > > Regards,
> > > Saikat
> > >
> > >
> > >
> > > On Fri, May 1, 2020 at 12:43 PM Denis Magda  wrote:
> > >
> > >> Igniters,
> > >>
> > >> I would like to hear your opinion on what we should do with Ignite
> > >> WebConsole.
> > >>
> > >> To my knowledge, we don't have active maintainers of the component,
> and
> > a
> > >> list of issues is piling up [1]. Users even report that the docker
> > images
> > >> have not been updated for more than a year [2].
> > >>
> > >> Personally, I share the opinion of those who believe the community
> > needs to
> > >> provide and support all the essential tooling (metrics and tracing
> API,
> > >> command-line tool) while the UI tools are not our business. There is a
> > >> myriad of UI tools Ignite can be monitored and traced with. Users
> > already
> > >> have plenty of choices.
> > >>
> > >> What are your thoughts? Probably, some of you want to become
> > maintainers of
> > >> the component. Otherwise, we should let the tool go.
> > >>
> > >> [1]
> > >>
> > >>
> >
> https://issues.apache.org/jira/browse/IGNITE-12923?jql=project%20%3D%20IGNITE%20AND%20text%20~%20%22web%20console%22%20AND%20status%20%3D%20Open%20and%20type%20%3D%20Bug%20ORDER%20BY%20createdDate%20
> > >> [2] https://issues.apache.org/jira/browse/IGNITE-12923
> > >>
> > >> -
> > >> Denis
> > >>
> >
> >
>


Re: Using GraalVM instead of standard JVM

2020-05-06 Thread Denis Magda
Stephen, that's terrific! To Ivan's first question, did you just swap
HotSpot with GraalVM and get the thing working? Or did it require some
extra work?

-
Denis


On Wed, May 6, 2020 at 10:10 AM Stephen Darlington <
stephen.darling...@gridgain.com> wrote:

> I’ve been playing around with it. I was really impressed that I could
> run JavaScript on Ignite with comparatively little code:
>
> https://github.com/sdarlington/ignite-graalvm <
> https://github.com/sdarlington/ignite-graalvm>
>
> I’ve not been looking at performance, though.
>
> Regards,
> Stephen
>
> > On 6 May 2020, at 17:52, Denis Magda  wrote:
> >
> > I'll leave this reference here so that we have a better understanding of
> > why it's worthwhile to support GraalVM:
> > https://blogs.oracle.com/graalvm/apache-spark
> > —lightning-fast-on-graalvm-enterprise
> >
> > Spark benefits from running on GraalVM, so should we. Apart from memory
> > usage and performance advantages, this JVM can execute Python code. With
> > that, we can enable compute APIs support for Python.
> >
> > -
> > Denis
> >
> >
> > On Sun, May 13, 2018 at 12:23 PM Sven Beauprez <
> sven.beaup...@theglue.com>
> > wrote:
> >
> >> Thnx all for the feedback.
> >>
> >> Looking forward to the results of such a test run.
> >>
> >> Regards,
> >>
> >> Sven
> >>
> >>
> >>
> >> SVEN BEAUPREZ
> >>
> >> L e a d   A r c h i t e c t
> >>
> >>
> >>
> >> De Kleetlaan 5, B-1831 Diegem
> >>
> >> www.theglue.com 
> >> On 10/05/2018, 17:44, "Petr Ivanov"  wrote:
> >>
> >>File the ticket and specify priority — and I will start researching.
> >>
> >>For test runs — we can have a copy of current test project and run
> >> some tests in different VMs (as you rightly remarked — right after JDK9
> >> task is complete).
> >>
> >>
> >>
> >>
> >>> On 10 May 2018, at 18:34, Dmitry Pavlov 
> >> wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> It seems it is one more argument to implement selectable VM for
> >> existing run-all chain instead of creating one more.
> >>>
> >>> Would it be easy to add one more option once JDK 9 run is ready?
> >>>
> >>> Sincerely,
> >>> Dmitriy Pavlov
> >>>
> >>> чт, 10 мая 2018 г. в 15:58, Dmitriy Setrakyan  >> >:
> >>> Would be nice to have a TC run on Graal, just to have an
> >> understanding
> >>> whether we support it or not.
> >>>
> >>> D.
> >>>
> >>> On Wed, May 9, 2018 at 4:28 PM, Denis Magda  >> > wrote:
> >>>
>  The performance might become better just by replacing HotSpot with
> >> Graal,
>  but something suggests me that Ignite has to be adopted for this
> >> JVM (as
>  well as for Azul VM) to get more benefits. Probably, someone will
> >> get
>  interested and pick this task up.
> 
>  What stands out is that the Graal folks also see this VM as an
> >> opportunity
>  to run custom code on a database side like Oracle or MySQL:
>  https://oracle.github.io/oracle-db-mle/ <
> >> https://oracle.github.io/oracle-db-mle/> It's a sort of their response
> to
>  compute grid functionality of data grids and Hadoop ecosystem.
> 
>  --
>  Denis
> 
>  On Wed, May 9, 2018 at 5:23 AM, sbeaupre <
> >> sven.beaup...@theglue.com >
>  wrote:
> 
> > This is just a thought that came out of a discussion with
> >> Dimitry this
> > morning. Recently Oracle has released GraalVM 1.0 after many
> >> years of
> > research and development, as a replacement for standard JVM.
> >
> > It should come with huge improvements on several areas
> >> (interesting for
> > ignite: AOT, native compilation, remove object allocation in
> >> many cases,
> > ...)
> >
> > Any interest from GG in this? Do you guys think it would give
> >> ignite a
> > performance boost (haven't tested it myself, just checking if it
> >> is
> > worthwhile in the first place, probably low on our prio list).
> >
> > More info:
> > - GraalVM for Java:
> >http://www.graalvm.org/docs/why-graal/#for-java-programs
> >> 
> > - Twitter is running GraalVM in production for a while now:
> >https://www.youtube.com/watch?v=pR5NDkIZBOA <
> >> https://www.youtube.com/watch?v=pR5NDkIZBOA>
> > - Getting started:
> >http://www.graalvm.org/docs/getting-started/ <
> >> http://www.graalvm.org/docs/getting-started/>
> >
> > regards,
> >
> > Sven
> >
> >
> >
> >
> >
> > --
> > Sent from:
> >> http://apache-ignite-developers.2346864.n4.nabble.com/ <
> >> http://apache-ignite-developers.2346864.n4.nabble.com/>
> >
> 
> >>
> >>
> >>
> >>
>
>
>


Re: Using GraalVM instead of standard JVM

2020-05-06 Thread Stephen Darlington
I just switched out OpenJDK and used GraalVM instead. Everything seemed to work 
but I wasn’t looking terribly hard. We’d need to do some more QA but I think 
chances are good that it’ll work just fine.
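
For anyone curious, the general shape of it is roughly the sketch below (a
simplified illustration rather than the code from the repository linked earlier;
it assumes the node itself runs on GraalVM with the JavaScript language installed):

```java
// Sketch: evaluate a JavaScript snippet inside an Ignite compute job via
// GraalVM's polyglot API. This only works when the JVM is GraalVM with the
// "js" language available; otherwise Context.create("js") throws.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.graalvm.polyglot.Context;

public class JsOnIgnite {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Ship a job to the cluster; the receiving node evaluates the script.
            Integer res = ignite.compute().call(() -> {
                try (Context ctx = Context.create("js")) {
                    return ctx.eval("js", "6 * 7").asInt();
                }
            });

            System.out.println("JavaScript result computed on the grid: " + res);
        }
    }
}
```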

Regards,
Stephen

>> On 6 May 2020, at 20:31, Denis Magda  wrote:
> Stephen, that's terrific! To Ivan's first question, did you just swap
> HotSpot with GraalVM and got the thing working? Or did it require some
> extra work?
> 
> -
> Denis
> 
> 
>> On Wed, May 6, 2020 at 10:10 AM Stephen Darlington <
>> stephen.darling...@gridgain.com> wrote:
>> 
>> I’ve been playing around with it. I was really impressed that I could
>> run JavaScript on Ignite with comparatively little code:
>> 
>> https://github.com/sdarlington/ignite-graalvm <
>> https://github.com/sdarlington/ignite-graalvm>
>> 
>> I’ve not been looking at performance, though.
>> 
>> Regards,
>> Stephen
>> 
> On 6 May 2020, at 17:52, Denis Magda  wrote:
>>> I'll leave this reference here so that we have a better understanding of
>>> why it's worthwhile to support GraalVM:
>>> https://blogs.oracle.com/graalvm/apache-spark
>>> —lightning-fast-on-graalvm-enterprise
>>> Spark benefits from running on GraalVM, so should we. Apart from memory
>>> usage and performance advantages, this JVM can execute Python code. With
>>> that, we can enable compute APIs support for Python.
>>> -
>>> Denis
>>> On Sun, May 13, 2018 at 12:23 PM Sven Beauprez <
>> sven.beaup...@theglue.com>
>>> wrote:
 Thnx all for the feedback.
 Looking forward to the results of such a test run.
 Regards,
 Sven
 SVEN BEAUPREZ
 L e a d   A r c h i t e c t
 De Kleetlaan 5, B-1831 Diegem
 www.theglue.com 
 On 10/05/2018, 17:44, "Petr Ivanov"  wrote:
   File the ticket and specify priority — and I will start researching.
   For test runs — we can have a copy of current test project and run
 some tests in different VMs (as you rightly remarked — right after JDK9
 task is complete).
> On 10 May 2018, at 18:34, Dmitry Pavlov 
 wrote:
> Hi Peter,
> It seems it is one more argument to implement selectable VM for
 existing run-all chain instead of creating one more.
> Would it be easy to add one more option once JDK 9 run is ready?
> Sincerely,
> Dmitriy Pavlov
> чт, 10 мая 2018 г. в 15:58, Dmitriy Setrakyan >>> >:
> Would be nice to have a TC run on Graal, just to have an
 understanding
> whether we support it or not.
> D.
> On Wed, May 9, 2018 at 4:28 PM, Denis Magda >>> > wrote:
>> The performance might become better just by replacing HotSpot with
 Graal,
>> but something suggests me that Ignite has to be adopted for this
 JVM (as
>> well as for Azul VM) to get more benefits. Probably, someone will
 get
>> interested and pick this task up.
>> What stands out is that the Graal folks also see this VM as an
 opportunity
>> to run custom code on a database side like Oracle or MySQL:
>> https://oracle.github.io/oracle-db-mle/ <
 https://oracle.github.io/oracle-db-mle/> It's a sort of their response
>> to
>> compute grid functionality of data grids and Hadoop ecosystem.
>> --
>> Denis
>> On Wed, May 9, 2018 at 5:23 AM, sbeaupre <
 sven.beaup...@theglue.com >
>> wrote:
>>> This is just a thought that came out of a discussion with
 Dimitry this
>>> morning. Recently Oracle has released GraalVM 1.0 after many
 years of
>>> research and development, as a replacement for standard JVM.
>>> It should come with huge improvements on several areas
 (interesting for
>>> ignite: AOT, native compilation, remove object allocation in
 many cases,
>>> ...)
>>> Any interest from GG in this? Do you guys think it would give
 ignite a
>>> performance boost (haven't tested it myself, just checking if it
 is
>>> worthwhile in the first place, probably low on our prio list).
>>> More info:
>>> - GraalVM for Java:
>>>   http://www.graalvm.org/docs/why-graal/#for-java-programs
 
>>> - Twitter is running GraalVM in production for a while now:
>>>   https://www.youtube.com/watch?v=pR5NDkIZBOA <
 https://www.youtube.com/watch?v=pR5NDkIZBOA>
>>> - Getting started:
>>>   http://www.graalvm.org/docs/getting-started/ <
 http://www.graalvm.org/docs/getting-started/>
>>> regards,
>>> Sven
>>> --
>>> Sent from:
 http://apache-ignite-developers.2346864.n4.nabble.com/ <
 http://apache-ignite-developers.2346864.n4.nabble.com/>


Re: IEP-44 Thin Client Discovery

2020-05-06 Thread Alex Plehanov
Pavel,

Since we now have a notification mechanism for thin clients, we can
implement a subscription to some types of events, and this can be used
to inform a client about topology changes as well. I think it's a
more appropriate way to detect topology changes than ping requests. But
the approach with ping requests has another advantage: the client can
detect that the connection was lost earlier. With the subscription approach,
the client will detect connection problems only after the next request to
the server.
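
As a side note, until something built-in is designed, an application can already
emulate keep-alive on its own by periodically issuing any cheap request. A minimal
sketch (a user-level workaround only, not a proposed API; the partition awareness
flag name assumes a recent Ignite version):

```java
// User-level keep-alive sketch: any lightweight request forces a round trip,
// so the client both detects broken connections early and receives the
// "topology changed" flag in the response.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignition;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class ThinClientKeepAlive {
    public static void main(String[] args) {
        ClientConfiguration cfg = new ClientConfiguration()
            .setAddresses("127.0.0.1:10800")
            .setPartitionAwarenessEnabled(true);

        IgniteClient client = Ignition.startClient(cfg);

        ScheduledExecutorService ping = Executors.newSingleThreadScheduledExecutor();

        ping.scheduleAtFixedRate(() -> {
            try {
                client.cacheNames(); // cheap request used purely as a ping
            }
            catch (Exception e) {
                // Detected a connection problem before the next user operation.
                System.err.println("Keep-alive request failed: " + e.getMessage());
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```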


ср, 6 мая 2020 г. в 17:31, Pavel Tupitsyn :

> Igniters, let's discuss the following issue:
>
> For partition awareness, and now for cluster discovery, we use a response
> flag to detect topology changes.
> The problem is - if the client does not do anything (user code does not
> perform operations),
> then we'll never know about topology changes and may even lose the cluster
> (all known nodes leave).
>
> Should we introduce some keep-alive mechanism, so that thin clients send
> periodic ping requests?
> Maybe do this as a separate feature.
>
> Thoughts?
>
> On Tue, Apr 28, 2020 at 8:14 PM Pavel Tupitsyn 
> wrote:
>
> > Ok, I've updated IEP and POC accordingly:
> > * Config flag removed
> > * IPs and host names retrieval simplified - use existing node properties
> > and attributes instead of Compute
> >
> > On Tue, Apr 28, 2020 at 7:57 PM Igor Sapego  wrote:
> >
> >> I guess it makes sense. If anyone needs more control over connection
> >> we would need to implement a new feature anyway (like node filter we
> >> discussed earlier)
> >>
> >> Best Regards,
> >> Igor
> >>
> >>
> >> On Tue, Apr 28, 2020 at 12:29 PM Pavel Tupitsyn 
> >> wrote:
> >>
> >> > > enable the capability if the best effort affinity is on
> >> > I agree, makes sense.
> >> >
> >> > Igor, what do you think?
> >> >
> >> > On Tue, Apr 28, 2020 at 8:25 AM Denis Magda 
> wrote:
> >> >
> >> > > Pavel,
> >> > >
> >> > > That would be a tremendous improvement for the recently release best
> >> > effort
> >> > > affinity feature. Without this capability, we force application
> >> > developers
> >> > > to reopen thin client connections every type a cluster is scaled
> out.
> >> I
> >> > > believe that once the folks start using the best effort affinity,
> >> we'll
> >> > be
> >> > > hearing more of a feature request for what you're proposing in this
> >> > thread.
> >> > > So, thanks for taking care of this proactively!
> >> > >
> >> > > As for the public API changes, do we really need any extra flag? I
> >> would
> >> > > enable the capability if the best effort affinity is on. For me,
> it's
> >> > just
> >> > > a natural improvement of the latter and it sounds reasonable to
> reuse
> >> the
> >> > > best effort affinity's flag.
> >> > >
> >> > > -
> >> > > Denis
> >> > >
> >> > >
> >> > > On Mon, Apr 27, 2020 at 2:58 AM Pavel Tupitsyn <
> ptupit...@apache.org>
> >> > > wrote:
> >> > >
> >> > > > Igniters,
> >> > > >
> >> > > > I've prepared an IEP [1] and a POC [2] for Thin Client Discovery
> >> > feature.
> >> > > > Let's discuss it here.
> >> > > >
> >> > > > In particular, I'd like to address the following points:
> >> > > >
> >> > > > 1. Value: do you think this would be a good feature to have?
> >> > > > 2. Public API changes: is a boolean property enough? Should we
> have
> >> > > > something more complex, so users can plug in custom logic to
> filter
> >> > > and/or
> >> > > > translate IPs and host names?
> >> > > > 3. Server-side implementation details: should we use Compute, Node
> >> > > > Attributes, or something else to retrieve client endpoints from
> all
> >> > nodes
> >> > > > in cluster?
> >> > > >
> >> > > > [1]
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-44%3A+Thin+client+cluster+discovery
> >> > > > [2] https://github.com/apache/ignite/pull/7744
> >> > > > [3] https://issues.apache.org/jira/browse/IGNITE-12932
> >> > > >
> >> > >
> >> >
> >>
> >
>


[MTCGA]: new failures in builds [5278999] needs to be handled

2020-05-06 Thread dpavlov . tasks
Hi Igniters,

 I've detected some new issue on TeamCity to be handled. You are more than 
welcomed to help.

 *Test with high flaky rate in master 
GridCachePartitionedNodeRestartTest.testRestartWithTxFourNodesNoBackups 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-7149300244223660127&branch=%3Cdefault%3E&tab=testDetails
 No changes in the build

 - Here's a reminder of what contributors were agreed to do 
https://cwiki.apache.org/confluence/display/IGNITE/How+to+Contribute 
 - Should you have any questions please contact dev@ignite.apache.org 

Best Regards,
Apache Ignite TeamCity Bot 
https://github.com/apache/ignite-teamcity-bot
Notification generated at 02:16:34 07-05-2020 


[jira] [Created] (IGNITE-12986) Redis mget command is broken

2020-05-06 Thread Vishnu Bharathi (Jira)
Vishnu Bharathi created IGNITE-12986:


 Summary: Redis mget command is broken
 Key: IGNITE-12986
 URL: https://issues.apache.org/jira/browse/IGNITE-12986
 Project: Ignite
  Issue Type: Bug
Reporter: Vishnu Bharathi


When trying to use the Redis layer for Ignite, I noticed that the order of the
values returned by the mget command is inconsistent with the order of the
requested keys. Hence the mget command is broken. To demonstrate, here is an example:

{code}
127.0.0.1:11211> set a 1
OK
127.0.0.1:11211> set b 2
OK
127.0.0.1:11211> set c 3
OK
(0.98s)
127.0.0.1:11211> mget a b c 
1) "1"
2) "2"
3) "3"
127.0.0.1:11211> mget c b a
1) "1"
2) "2"
3) "3"
127.0.0.1:11211> mget a c b
1) "1"
2) "2"
3) "3"
{code}

If you notice, the order of the values returned does not match the order of the
keys requested.

In order to demonstrate the expected behaviour, I will run the same commands
against a real Redis instance and paste the output below.

{code}
127.0.0.1:6379> set a 1 
OK
127.0.0.1:6379> set b 2 
OK
127.0.0.1:6379> set c 3
OK
127.0.0.1:6379> mget a b c 
1) "1"
2) "2"
3) "3"
127.0.0.1:6379> mget c b a
1) "3"
2) "2"
3) "1"
127.0.0.1:6379> mget a c b
1) "1"
2) "3"
3) "2"
{code}

This is not happening only with redis-cli; it also happens when using
Redis client libraries.
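
For reference, a rough reproduction sketch using the Jedis client library
(illustrative only; Jedis is just one example of a client library, and the
expected vs. observed values are taken from the sessions above):

{code}
// Reproduction sketch (not from the reporter): the same commands issued through
// Jedis against Ignite's Redis endpoint on port 11211.
import java.util.List;
import redis.clients.jedis.Jedis;

public class MgetOrderCheck {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 11211)) {
            jedis.set("a", "1");
            jedis.set("b", "2");
            jedis.set("c", "3");

            List<String> res = jedis.mget("c", "b", "a");

            // A real Redis server returns [3, 2, 1] here; per the sessions above,
            // Ignite returns [1, 2, 3] regardless of the requested key order.
            System.out.println(res);
        }
    }
}
{code}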



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12987) Remove TC badge from README.md

2020-05-06 Thread Ivan Pavlukhin (Jira)
Ivan Pavlukhin created IGNITE-12987:
---

 Summary: Remove TC badge from README.md
 Key: IGNITE-12987
 URL: https://issues.apache.org/jira/browse/IGNITE-12987
 Project: Ignite
  Issue Type: Task
Reporter: Ivan Pavlukhin


Currently the TC badge in the main
[README|https://github.com/apache/ignite/blob/master/README.md] seems not very
useful because the build status is not shown directly (it says "no
permissions to get data"). Moreover, TC results are not very representative
because integration tests are not stable by nature, and we rely on TC Bot checks
instead of the plain TC status.

Let's remove this badge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Remove TC badge from README.md

2020-05-06 Thread Ivan Pavlukhin
Folks,

I created a ticket [1] and prepared a PR with a magic number =).

Please review.

[1] https://issues.apache.org/jira/browse/IGNITE-12987

Best regards,
Ivan Pavlukhin

пн, 4 мая 2020 г. в 10:01, Ivan Pavlukhin :
>
> Hi Igniters,
>
> Inspired by a neighboring thread about PR checks [1]. It brought to my
> attention that we now have a neat Travis badge and a strange TC badge
> which is not very useful at first glance.
>
> What do you think, should we completely remove the TC badge from the readme?
>
> [1] 
> https://lists.apache.org/thread.html/r1580e2fb23728e83afbe1bad2b4cf7e9cece0e19a1d305e5755d7dd1%40%3Cdev.ignite.apache.org%3E
>
> Best regards,
> Ivan Pavlukhin