Re: Ignite transactions

2024-02-14 Thread Zhenya Stanilovsky via user

Hi Andrey,
tx.start() — thread1
cache1.put|get — thread1
cache2.put|get — thread1
tx.commit() — thread1
log operation
I don't see a problem here.
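As a plain-Java illustration of that single-thread pattern (HashMaps stand in for Ignite caches so the sketch is self-contained and runnable; real code would use ignite.transactions().txStart() and IgniteCache.put on the same thread), including a log entry that survives a rollback:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java analogue of the single-thread transaction pattern above.
// cache1/cache2/logCache are hypothetical stand-ins for Ignite caches.
public class TxPattern {
    static final Map<String, Integer> cache1 = new HashMap<>();
    static final Map<String, Integer> cache2 = new HashMap<>();
    static final Map<String, String> logCache = new HashMap<>();

    // Runs the whole "transaction" on the calling thread: buffer the
    // writes, then either apply them (commit) or drop them (rollback).
    // The log entry is written either way, mirroring the idea of a
    // separate "protocol" cache that survives a rollback.
    static void runTx(String op, boolean commit) {
        Map<String, Integer> pending1 = new HashMap<>();
        Map<String, Integer> pending2 = new HashMap<>();
        pending1.put("a", 1);   // cache1.put — thread1
        pending2.put("b", 2);   // cache2.put — thread1
        if (commit) {           // tx.commit — thread1
            cache1.putAll(pending1);
            cache2.putAll(pending2);
        }
        logCache.put(op, "commit=" + commit); // log survives rollback
    }

    public static void main(String[] args) {
        runTx("op-1", false);   // rolled back: caches untouched, log kept
        runTx("op-2", true);    // committed: caches updated, log kept
        System.out.println(cache1.size() + " " + logCache.size());
    }
}
```

Flipping the commit flag makes the point: the two "caches" stay untouched on rollback, while the log entry is written in both cases.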
 
>Thanks Pavel!
> 
>*  According to business logic, I must transactionally change the values in 2 
>caches; in the course of my actions, I must log all these actions in the 3rd 
>cache (protocol of my actions). So, it doesn’t matter whether my changes in 
>these first two caches end up being a success (commit) or an error (rollback), 
>I want the protocol of my actions to be preserved anyway. Based on your 
>answers, I can assume that I can use either queues or a separate thread for 
>these purposes.
> 
>Нестрогаев Андрей
> 
>From: Pavel Tupitsyn < ptupit...@apache.org >
>Sent: Wednesday, February 14, 2024 12:13 PM
>To: user@ignite.apache.org
>Subject: Re: Ignite transactions
> 
>1. Not sure I understand
>2. Messaging is not transactional
>3. No
>4. No, transactions are tied to a specific thread
> 
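Answer 4 follows from Ignite keeping the active transaction bound to the starting thread (internally via a thread-local). A minimal plain-Java sketch, with no Ignite dependency and illustrative names, of why state bound to one thread is invisible to another:

```java
// Demonstrates thread-local binding: the "transaction" set on the main
// thread is not visible from a worker thread, which is why cache
// operations on another thread do not join the caller's transaction.
public class ThreadBoundTx {
    static final ThreadLocal<String> currentTx = new ThreadLocal<>();

    public static void main(String[] args) throws InterruptedException {
        currentTx.set("tx-1");                       // "tx.start()" on main thread
        final String[] seenByWorker = new String[1];
        Thread worker = new Thread(() -> seenByWorker[0] = currentTx.get());
        worker.start();
        worker.join();
        System.out.println("main sees: " + currentTx.get());     // tx-1
        System.out.println("worker sees: " + seenByWorker[0]);   // null
    }
}
```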
>On Wed, Feb 14, 2024 at 11:01AM Нестрогаев Андрей Викторович < 
>a.nestrog...@flexsoft.com > wrote:
>>Hi All, 
>> 
>>Maybe someone has already researched these questions: 
>>1. How can you organize nested/autonomous transactions in ignite? For 
>>example, for the purpose of writing a protocol to another cache, so that the 
>>protocol is saved regardless of the result of the main transaction. 
>>2. If you use Messaging in ignite within a transaction, does it take it into 
>>account, or is the message sent without taking into account the transaction? 
>>3. Does a transaction started on the current node extend to the code sent to 
>>another node (IgniteRunnable , IgniteClosure )?
>>4. Does a transaction span another thread started from current ?
>> 
>>Thanks for the help in advance.
>> 
>>Andrey
>>This message and any attachments to it may contain confidential information
>>and other information protected from disclosure and belonging to JSC
>>"FlexSoft". Its disclosure or any other use without the consent of JSC
>>"FlexSoft" violates the legislation of the Russian Federation. Any action
>>aimed at copying, transferring, or distributing, in any way and by any means,
>>either the message itself or the information contained in it (including
>>attachments) is prohibited. The sender of this message is not responsible for
>>the accuracy and completeness of the information contained in this message,
>>nor for the timeliness of its receipt. If you have received this message in
>>error, please notify the sender and then delete it and any copies from your
>>computer. This e-mail message and the information contained in it, or any
>>attachments to it, do not represent the official position of JSC "FlexSoft"
>>and do not entail financial or other obligations of JSC "FlexSoft"; they may
>>not be used as, and do not constitute, an offer of any kind, an acceptance of
>>an offer, or a proposal to make or accept offers, nor advertising,
>>professional advice, or a forecast of any events, unless expressly provided
>>otherwise in this message or any attachments to it. JSC "FlexSoft" is not
>>responsible for any direct or indirect losses from the use by the recipient
>>or any other person of the information in this message and/or its attachments.
>>Information transmitted over the Internet without technical means of
>>protection is not protected from unlawful actions of third parties and may
>>contain malicious software. JSC "FlexSoft" is not responsible for such
>>actions.

Re: Another replicated cache oddity

2023-11-21 Thread Zhenya Stanilovsky via user

 
>Hi,
 
Hi, this really looks very strange.
First of all, you need to check the consistency of your data: [1]
 
> Some time ago an element (E) was added to this cache (among many others)
And sometimes everything is fine there? Are you sure that this element was
properly written?
What kind of cache are you talking about? How was the data populated there?
What API is used to «load element E» on each node?
If you are talking about restarts, I assume you are dealing with a persistent
store, aren't you? Is it native Ignite persistence or a third-party DB?
Thanks!
 
[1]  
https://ignite.apache.org/docs/latest/tools/control-script#verifying-partition-checksums
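For reference, the check from [1] is run with the control script shipped in Ignite's bin directory, on an idle cluster (the exact path depends on your installation):

```shell
# Reports partition counter and hash conflicts across nodes.
./control.sh --cache idle_verify
```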
 
> 
>We have been triaging an odd issue we encountered in a system using Ignite 
>v2.15 and the C# client.
> 
>We have a replicated cache across four nodes, lets call them P0, P1, P2 & P3. 
>Because the cache is replicated every item added to the cache is present in 
>each of P0, P1, P2 and P3.
> 
>Some time ago an element (E) was added to this cache (among many others). A 
>number of system restarts have occurred since that time.
> 
>We started observing an issue where a query running across P0/P1/P2/P3 as a 
>cluster compute operation needed to load element E on each of the nodes to 
>perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all reported 
>that element E did not exist. 
> 
>This situation persisted until the cluster was restarted, after which the same 
>query that had been failing now succeeded as all four 'P' nodes were able to 
>read element E.
> 
>There were no Ignite errors reported in the context of these failing queries 
>to indicate unhappiness in the Ignite nodes.
> 
>This seems like very strange behaviour. Are there any suggestions as to what 
>could be causing this failure to read the replicated value on the three 
>failing nodes, especially as the element 'came back' after a cluster restart?
> 
>Thanks,
>Raymond.
> 
> 
> 
>  -- 
>
>Raymond Wilson
>Trimble Distinguished Engineer, Civil Construction Software (CCS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re: Ignite query taking too long time with IN operator

2023-05-31 Thread Zhenya Stanilovsky via user

Also, there are additional optimizations for 'IN' clauses in newer versions;
upgrading will probably help.

  
>Среда, 31 мая 2023, 11:24 +03:00 от Charlin S :
> 
>Hi,
>Field4 already has an index ([QuerySqlField(IsIndexed = true)]), and it is used
>in the WHERE clause. Do you mean an index on all fields? If so, how would that
>solve this problem?
> 
>Thanks,
>Charlin  
>On Wed, 31 May 2023 at 13:38, Zhenya Stanilovsky via user < 
>user@ignite.apache.org > wrote:
>>Hi, it seems you need to build an index on this field.
>>
>>
>>
>> 
>>>Hi All,
>>>I have a two-node cluster and a cache model with 2,126,239 records. An Ignite
>>>query with the IN operator takes too long: sometimes 17-18 seconds, other
>>>times around 40 seconds.
>>> 
>>>Expected result : 3 to 4 records.
>>> 
>>>ignite version: 2.10
>>>Ignite server hosted on linux box and C# .net 6 is the Ignite client
>>> 
>>>Ignite cache model class
>>> 
>>>public class TestModel : IBinarizable
>>>{
>>>
>>>[QuerySqlField()]
>>>public decimal? Field1 { get; set; }
>>>
>>>[QuerySqlField()]
>>>public string Field2 { get; set; }
>>>
>>>[QuerySqlField()]
>>>public string Field3 { get; set; }
>>>
>>>[QuerySqlField(IsIndexed = true)]
>>>public string Field4 { get; set; }
>>>
>>>[QuerySqlField()]
>>>public string Field5 { get; set; }
>>>
>>>public void ReadBinary(IBinaryReader reader)
>>>{
>>>if (reader != null)
>>>         {
>>>Field1 = reader.ReadDecimal("field1");
>>>Field2 = reader.ReadString("field2");
>>>Field3 = reader.ReadString("field3");
>>>Field4 = reader.ReadString("field4");
>>>Field5 = reader.ReadString("field5");
>>> }
>>>}
>>>
>>>public void WriteBinary(IBinaryWriter writer)
>>>{
>>>if (writer != null)
>>>         {
>>>writer.WriteDecimal("field1",Field1) ;
>>>writer.WriteString("field2",Field2) ;
>>>writer.WriteString("field3",Field3) ;
>>>writer.WriteString("field4",Field4) ;
>>>writer.WriteString("field5",Field5) ;
>>>}
>>>}
>>>
>>>Ignite query execution code
>>>string query = "select Field1,Field2,Field3,Field4 from TestModel where 
>>>Field4 in('1','2')";
>>>SqlFieldsQuery fieldsQuery = new SqlFieldsQuery(query);
>>>    ICache.Query(fieldsQuery);
>>> 
>>> 
>>>Regards,
>>>Charlin 
>> 
>> 
>> 
>>  
 
 
 
 

Re: Ignite query taking too long time with IN operator

2023-05-31 Thread Zhenya Stanilovsky via user


Oh, I missed that. Try this trick:
@QuerySqlField(index = true, inlineSize = 30)
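In context, that annotation goes on the model field; a hedged Java fragment (assumes ignite-core on the classpath; 30 is illustrative, the inline size should be large enough to cover the indexed string prefix):

```java
import org.apache.ignite.cache.query.annotations.QuerySqlField;

public class TestModel {
    // Inlines a prefix of the value into the index pages, so most
    // comparisons avoid an extra data-page lookup.
    @QuerySqlField(index = true, inlineSize = 30)
    private String field4;
}
```

For the C# client used in this thread, the equivalent knob is, if memory serves, the inline size on the index configuration (QueryIndex.InlineSize in the cache's QueryEntity); check the Ignite.NET docs for your version.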
 
 
 
>Hi,
>Field4 already has an index ([QuerySqlField(IsIndexed = true)]), and it is used
>in the WHERE clause. Do you mean an index on all fields? If so, how would that
>solve this problem?
> 
>Thanks,
>Charlin  
>On Wed, 31 May 2023 at 13:38, Zhenya Stanilovsky via user < 
>user@ignite.apache.org > wrote:
>>Hi, it seems you need to build an index on this field.
>>
>>
>>
>> 
>>>Hi All,
>>>I have a two-node cluster and a cache model with 2,126,239 records. An Ignite
>>>query with the IN operator takes too long: sometimes 17-18 seconds, other
>>>times around 40 seconds.
>>> 
>>>Expected result : 3 to 4 records.
>>> 
>>>ignite version: 2.10
>>>Ignite server hosted on linux box and C# .net 6 is the Ignite client
>>> 
>>>Ignite cache model class
>>> 
>>>public class TestModel : IBinarizable
>>>{
>>>
>>>[QuerySqlField()]
>>>public decimal? Field1 { get; set; }
>>>
>>>[QuerySqlField()]
>>>public string Field2 { get; set; }
>>>
>>>[QuerySqlField()]
>>>public string Field3 { get; set; }
>>>
>>>[QuerySqlField(IsIndexed = true)]
>>>public string Field4 { get; set; }
>>>
>>>[QuerySqlField()]
>>>public string Field5 { get; set; }
>>>
>>>public void ReadBinary(IBinaryReader reader)
>>>{
>>>if (reader != null)
>>>         {
>>>Field1 = reader.ReadDecimal("field1");
>>>Field2 = reader.ReadString("field2");
>>>Field3 = reader.ReadString("field3");
>>>Field4 = reader.ReadString("field4");
>>>Field5 = reader.ReadString("field5");
>>> }
>>>}
>>>
>>>public void WriteBinary(IBinaryWriter writer)
>>>{
>>>if (writer != null)
>>>         {
>>>writer.WriteDecimal("field1",Field1) ;
>>>writer.WriteString("field2",Field2) ;
>>>writer.WriteString("field3",Field3) ;
>>>writer.WriteString("field4",Field4) ;
>>>writer.WriteString("field5",Field5) ;
>>>}
>>>}
>>>
>>>Ignite query execution code
>>>string query = "select Field1,Field2,Field3,Field4 from TestModel where 
>>>Field4 in('1','2')";
>>>SqlFieldsQuery fieldsQuery = new SqlFieldsQuery(query);
>>>    ICache.Query(fieldsQuery);
>>> 
>>> 
>>>Regards,
>>>Charlin 
>> 
>> 
>> 
>>  
 
 
 
 

Re: Ignite query taking too long time with IN operator

2023-05-31 Thread Zhenya Stanilovsky via user

Hi, it seems you need to build an index on this field.
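A sketch of what that might look like in Ignite SQL, using the table from the message below (the index name is illustrative):

```sql
CREATE INDEX IF NOT EXISTS idx_testmodel_field4 ON TestModel (Field4);

-- The plan should then reference the new index instead of a full scan:
EXPLAIN SELECT Field1, Field2, Field3, Field4
FROM TestModel
WHERE Field4 IN ('1', '2');
```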



 
>Hi All,
>I have a two-node cluster and a cache model with 2,126,239 records. An Ignite
>query with the IN operator takes too long: sometimes 17-18 seconds, other
>times around 40 seconds.
> 
>Expected result : 3 to 4 records.
> 
>ignite version: 2.10
>Ignite server hosted on linux box and C# .net 6 is the Ignite client
> 
>Ignite cache model class
> 
>public class TestModel : IBinarizable
>{
>    [QuerySqlField()]
>    public decimal? Field1 { get; set; }
>
>    [QuerySqlField()]
>    public string Field2 { get; set; }
>
>    [QuerySqlField()]
>    public string Field3 { get; set; }
>
>    [QuerySqlField(IsIndexed = true)]
>    public string Field4 { get; set; }
>
>    [QuerySqlField()]
>    public string Field5 { get; set; }
>
>    public void ReadBinary(IBinaryReader reader)
>    {
>        if (reader != null)
>        {
>            Field1 = reader.ReadDecimal("field1");
>            Field2 = reader.ReadString("field2");
>            Field3 = reader.ReadString("field3");
>            Field4 = reader.ReadString("field4");
>            Field5 = reader.ReadString("field5");
>        }
>    }
>
>    public void WriteBinary(IBinaryWriter writer)
>    {
>        if (writer != null)
>        {
>            writer.WriteDecimal("field1", Field1);
>            writer.WriteString("field2", Field2);
>            writer.WriteString("field3", Field3);
>            writer.WriteString("field4", Field4);
>            writer.WriteString("field5", Field5);
>        }
>    }
>}
>
>Ignite query execution code:
>string query = "select Field1,Field2,Field3,Field4 from TestModel where Field4 in('1','2')";
>SqlFieldsQuery fieldsQuery = new SqlFieldsQuery(query);
>ICache.Query(fieldsQuery);
> 
> 
>Regards,
>Charlin 
 
 
 
 

Re: pyignite - performance issue

2023-03-14 Thread Zhenya Stanilovsky via user

Hi, please include your Ignite and Python client versions.
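One thing worth checking alongside the versions: older pyignite releases use a very small default SQL cursor page size, which makes row fetching extremely chatty. If your client version supports it, the `page_size` argument to `Client.sql` (name per the pyignite docs; verify against your release) can cut the number of server round trips dramatically. A hedged sketch, with host, credentials, and table name as placeholders:

```python
# Sketch: fetch rows with a larger page size to reduce round trips.
from pyignite import Client

client = Client(username="ignite", password="pass", use_ssl=False)
client.connect("localhost", 10800)

cursor = client.sql("SELECT * FROM TABLE_XYZ LIMIT 2", page_size=1024)
for row in cursor:
    pass
```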
 
>Hi,
>I made a speed comparison of retrieving data from Apache Ignite using several
>methods. All records are in one table; I did not use any WHERE condition, only
>a SELECT * FROM TABLE_XYZ LIMIT 2.
>Test results are:
>Apache Ignite
>*  Apache Ignite REST API - 0.52 seconds
>*  JDBC - 4 seconds
>*  Python pyignite - 40 seconds !!!
>pseudocode in Python using pyignite:
>client = Client(username="ignite", password="pass", use_ssl=False)
>client.connect('localhost', 10800)
>
>cursor=client.sql('SELECT * FROM TABLE_XYZ LIMIT 2')
>for row in cursor:
>    pass
>
>After that I made a speed comparison of retrieving data from PostgreSQL using
>JDBC and the psycopg2 Python package. The SQL select is the same: SELECT * FROM
>TABLE_XYZ LIMIT 2.
>PostgreSQL
>*  JDBC - 3 seconds
>*  Python psycopg2 using fetchall - 3 seconds
>*  Python psycopg2 using fetchone - 4 seconds
>pseudocode in Python using psycopg2:
>import psycopg2
>
>conn = psycopg2.connect(database=DB_NAME,
>user=DB_USER,
>password=DB_PASS,
>host=DB_HOST,
>port=DB_PORT)
>
>cur = conn.cursor()
>cur.execute("SELECT * FROM TABLE_XYZ LIMIT 2")
>rows = cur.fetchall()
>for data in rows:
>    pass
>
>I can conclude that the pyignite implementation has much worse performance 
>compared to psycopg2 tests. The performance difference on PostgreSQL between 
>Java JDBC and Python psycopg2 is negligible. 
>The performance difference on Apache Ignite between Java JDBC and Python 
>pyignite is very big.
>Please if someone can comment on the tests, did I do something wrong or are 
>these results expected? How can such large differences in execution times be 
>explained? Do you have any suggestions to get better results using pyignite?
>Thank you 
 
 
 
 

Re: How to handle missed indexes and affinity keys?

2022-12-20 Thread Zhenya Stanilovsky via user


I can't see your whole problem, but possibly this is your case: [1]? Please
check it.
 
[1]  https://issues.apache.org/jira/browse/IGNITE-18377
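On the recovery question in the quoted message below: a secondary index can be added in place, but AFFINITY_KEY can only be set in CREATE TABLE, so restoring it generally means recreating and refilling the table. A hedged SQL sketch using the DDL from this thread (index name is illustrative):

```sql
-- The missing index can be rebuilt in place:
CREATE INDEX IF NOT EXISTS idx_pf_sku ON PUBLIC.ProductFeatures (product_sku);

-- The affinity key cannot be altered afterwards; to restore it,
-- copy the data out, drop the table, re-run the original DDL, reload:
CREATE TABLE IF NOT EXISTS PUBLIC.ProductFeatures
(
    product_sku INT PRIMARY KEY,
    total_cnt_orders_with_sku INT
)
WITH "CACHE_NAME=PUBLIC_ProductFeatures, AFFINITY_KEY=product_sku,
TEMPLATE=PARTITIONED, BACKUPS=1";
```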
 
>Hi Maksim and Zhenya!
>I think that the invalid caches appeared because of the execution of this
>statement in the PyIgnite client:
> 
>print(ignite_client.get_or_create_cache("PUBLIC_ProductFeatures").get_size())
> 
>... while cache was not yet created by DDL SQL statement.
>There is “ IF NOT EXISTS ” in DDL so I guess that indexes and any properties 
>were not created because the cache was already created by Key-Value API.
>Now we can’t reproduce this :( because we can’t investigate when exactly this 
>python statement was executed.
>And this doesn’t explain why this happened to only 3 caches as statements were 
>executed for all caches.
> 
>Another hypothesis is that there was a non-graceful shutdown of the whole
>cluster.
> 
>Debug info: 
>1. Ignite version = 2.13.0
>2.  CacheConfiguration:  CACHE_NAME=PUBLIC_ProductFeatures, 
>KEY_TYPE=io.sbmt.ProductFeaturesKey, VALUE_TYPE=io.sbmt.ProductFeaturesValue, 
>AFFINITY_KEY=product_sku, TEMPLATE=PARTITIONED, BACKUPS=1 .
>3. Classes specified in KEY_TYPE and VALUE_TYPE are artificial - we build them 
>with Binary Builder (they don’t exist as class definition).
>4. The SQL query you run and the "explain" plan for it:  
>EXPLAIN
>SELECT ProductFeatures.product_sku,
>ProductFeatures.total_cnt_orders_with_sku
>FROM ProductFeatures
>WHERE ProductFeatures.product_sku = 52864
>;
>
>And the result: 
>
>SELECT
>__Z0.PRODUCT_SKU AS __C0_0,
>__Z0.TOTAL_CNT_ORDERS_WITH_SKU AS __C0_1
>FROM PUBLIC.PRODUCTFEATURES __Z0
>/* PUBLIC.PRODUCTFEATURES.__SCAN_ */
>WHERE __Z0.PRODUCT_SKU = 52864
> 
>As you can see primary key is not used :( 
> 
>   
>On 20 Dec 2022, at 10:46 AM, Zhenya Stanilovsky via user < 
>user@ignite.apache.org > wrote:
>
>Hi Roza, when did you observe such a problem after restart ? and your caches 
>with persistence mode ?
>
>
> Hi Maksim!
> The problem is that simple SELECT query runs in ~20min - this index does not 
>work.
> More over, other (not corrupted) tables with affinity key == primary key have 
>index by concrete column, not _KEY, and have specified affinity key - see my 
>first message with example. 
> We have hypothesis that somehow these corrupted caches were created by 
>Key-Value API, not SQL. Otherwise how specified indexes and affinity keys were 
>skipped in DDL while creating the caches? 
> The more important question - is there any way to rebuild index and add 
>affinity key back? 
> Thanks!
>  On 16 Dec 2022, at 4:30 PM, Maksim Timonin < timoninma...@apache.org > wrote:
> Hi Roza,
> In this ddl primary key (product_sku) equals the affinity key (product_sku). 
>In such cases Ignite skips creating an additional index because _key_PK index 
>already covers primary key.
> Thanks,
>Maksim
> On Fri, Dec 16, 2022 at 2:06 PM Айсина Роза Мунеровна < 
>roza.ays...@sbermarket.ru > wrote:
>Hello Stephen!
> This DDL we use: 
> CREATE TABLE IF NOT EXISTS PUBLIC.ProductFeatures
>(
>    product_sku INT PRIMARY KEY,
>    total_cnt_orders_with_sku INT
>)
>WITH "CACHE_NAME=PUBLIC_ProductFeatures, KEY_TYPE=io.sbmt.ProductFeaturesKey, 
>VALUE_TYPE=io.sbmt.ProductFeaturesValue, AFFINITY_KEY=product_sku, 
>TEMPLATE=PARTITIONED, BACKUPS=1
> And all tables are created similarly.
>  On 16 Dec 2022, at 1:03 PM, Stephen Darlington < 
>stephen.darling...@gridgain.com > wrote:
> Attention: external sender!
>If you do not know the sender, do not open attachments, do not follow links,
>and do not forward this message!
> What are the CREATE TABLE  commands for those tables?
>  On 16 Dec 2022, at 09:39, Айсина Роза Мунеровна < roza.ays...@sbermarket.ru 
>> wrote:
> Hola!
>
>We've discovered some strange behaviour in Ignite cluster and now we are 
>trying to understand how to recover from this state. 
> So we have 5 node cluster with persistence and all caches either replicated 
>or partitioned with affinity key.
>All caches are created via DDL with CREATE TABLE IF NOT EXISTS statements in 
>one regular job (once per day). 
> The problem is that we hit Query execution is too long warning. 
>After some debug we found out that some tables have missed indexes and 
>affinity keys.
>More precisely - corrupted tables have indexes not by exact column name but 
>for _KEY column.
>And no affinity key at all. 
> select 
>  TABLE_NAME,
>  INDEX_NAME,
>  COLUMNS
>from SYS.INDEXES
>where TABLE_NAME = ‘PRODUCTFEATURES’ — broken table
>  or TABLE_NAME = ‘USERFEATURESDISCOUNT’ — healthy table
>;
>
>Result: 

Re: How to handle missed indexes and affinity keys?

2022-12-19 Thread Zhenya Stanilovsky via user

Hi Roza, when did you observe this problem? Was it after a restart? And are
your caches in persistence mode?


 
>Hi Maksim!
> 
>The problem is that simple SELECT query runs in ~20min - this index does not 
>work.
> 
>More over, other (not corrupted) tables with affinity key == primary key have 
>index by concrete column, not  _KEY , and have specified affinity key - see my 
>first message with example. 
> 
>We have a hypothesis that these corrupted caches were somehow created via the
>Key-Value API, not SQL. Otherwise, how were the specified indexes and affinity
>keys skipped in the DDL while creating the caches?
> 
>The more important question - is there any way to rebuild index and add 
>affinity key back? 
> 
>Thanks!
> 
>>On 16 Dec 2022, at 4:30 PM, Maksim Timonin < timoninma...@apache.org > wrote: 
>> 
>>Hi Roza,
>> 
>>In this ddl primary key (product_sku) equals the affinity key (product_sku). 
>>In such cases Ignite skips creating an additional index because  _key_PK  
>>index already covers primary key.
>> 
>>Thanks,
>>Maksim  
>>On Fri, Dec 16, 2022 at 2:06 PM Айсина Роза Мунеровна < 
>>roza.ays...@sbermarket.ru > wrote:
>>>Hello Stephen!
>>> 
>>>This DDL we use: 
>>> 
>>>CREATE TABLE IF NOT EXISTS PUBLIC.ProductFeatures
>>>(
>>>    product_sku INT PRIMARY KEY,
>>>    total_cnt_orders_with_sku INT
>>>)
>>>WITH "CACHE_NAME=PUBLIC_ProductFeatures, 
>>>KEY_TYPE=io.sbmt.ProductFeaturesKey, 
>>>VALUE_TYPE=io.sbmt.ProductFeaturesValue, AFFINITY_KEY=product_sku, 
>>>TEMPLATE=PARTITIONED, BACKUPS=1
>>> 
>>>And all tables are created similarly.
>>> 
On 16 Dec 2022, at 1:03 PM, Stephen Darlington < 
stephen.darling...@gridgain.com > wrote:  
What are the CREATE TABLE  commands for those tables?
 
>On 16 Dec 2022, at 09:39, Айсина Роза Мунеровна < 
>roza.ays...@sbermarket.ru > wrote:  
>Hola!
>
>We've discovered some strange behaviour in Ignite cluster and now we are 
>trying to understand how to recover from this state. 
> 
>So we have 5 node cluster with persistence and all caches either 
>replicated or partitioned with affinity key.
>All caches are created via DDL with CREATE TABLE IF NOT EXISTS statements 
>in one regular job (once per day). 
> 
>The problem is that we hit  Query execution is too long  warning. 
>After some debug we found out that some tables have missed indexes and 
>affinity keys.
>More precisely - corrupted tables have indexes not by exact column name 
>but for _KEY column.
>And no affinity key at all. 
> 
>select 
>  TABLE_NAME,
>  INDEX_NAME,
>  COLUMNS
>from SYS.INDEXES
>where TABLE_NAME = ‘PRODUCTFEATURES’ — broken table
>  or TABLE_NAME = ‘USERFEATURESDISCOUNT’ — healthy table
>;
>
>Result: 
>++++
>|TABLE_NAME          |INDEX_NAME  |COLUMNS                     |
>++++
>|USERFEATURESDISCOUNT|_key_PK_hash|"USER_ID" ASC, "USER_ID" ASC|
>|USERFEATURESDISCOUNT|__SCAN_     |null                        |
>|USERFEATURESDISCOUNT|_key_PK     |"USER_ID" ASC               |
>|USERFEATURESDISCOUNT|AFFINITY_KEY|"USER_ID" ASC               |
>|PRODUCTFEATURES     |_key_PK_hash|"_KEY" ASC                  |
>|PRODUCTFEATURES     |__SCAN_     |null                        |
>|PRODUCTFEATURES     |_key_PK     |"_KEY" ASC                  |
>++++
> 
>Query execution even with simplest statements with filters on primary and 
>affinity keys takes ~20min in best case. 
>We have 8 tables, 5 out 8 are corrupted. 
> 
>So the questions are: 
>1. What can probably cause such state? 
>2. Is there any way to recover without full delete-refill tables? I see 
>that index can be created via CREATE INDEX, but affinity key can be 
>created only via CREATE TABLE statement? 
> 
>Thanks in advance!
> 
>--
> 
>Roza Aysina
>Senior Software Developer
>SberMarket | Delivery from your favorite stores
> 
>Email:  roza.ays...@sbermarket.ru
>Mob:
>Web:  sbermarket.ru
>App:  iOS and Android
> 
> 
> 
> 
> 
>CONFIDENTIALITY NOTICE: this e-mail message and any attachments to it contain
>confidential information. We hereby notify you that if this message is not
>intended for you, any use, copying, or distribution of the information
>contained in this message, as well as any actions based on this information,
>are strictly prohibited. If you have received this message in error, please
>notify the sender by e-mail and then delete

Re[4]: What is data-streamer-stripe thread?

2022-09-19 Thread Zhenya Stanilovsky via user


It's up to you: if it doesn't annoy you, leave it as it is, and file an issue
otherwise )


 
>Nah, it's fine just wanted to make sure what it was. Unless you think I should 
>log at least an issue?
>   
>On Wed, Sep 14, 2022 at 3:13 AM Zhenya Stanilovsky via user < 
>user@ignite.apache.org > wrote:
>>Yep, i already mention that you can`t disable this pool at all and 1 worker 
>>thread still be visible.
>>You can fill the issue but i can`t guarantee that it would be completed soon, 
>>or can do it yourself and present pull request.
>> 
>>best.
>>   
>>>Ok so just to understand on the client side. Set the pool size for data 
>>>streamer to 1.
>>> 
>>>But it will still look blocked?  
>>>On Mon., Sep. 12, 2022, 8:59 a.m. Zhenya Stanilovsky via user, < 
>>>user@ignite.apache.org > wrote:
>>>>John, seems all you can here is just to set this pool size into «1» , «0» — 
>>>>tends to error.
>>>> 
>>>>https://ignite.apache.org/docs/latest/data-streaming#configuring-data-streamer-thread-pool-size
>>>> 
>>>>1 thread will still be frozen in such a case. 
>>>> 
>>>>> 
>>>>>> 
>>>>>>>Hi I'm profiling my application through YourKit and it indicates that a 
>>>>>>>bunch of these threads (data-streamer-stripe) are "frozen" for 21 days. 
>>>>>>>This 
>>>>>>>
>>>>>>>I'm not using data streaming, is there a way to disable it or just 
>>>>>>>ignore the messages? The application is configured as thick client 
>>>>>>>(client = true) 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  
>> 
>> 
>> 
>>  
 
 
 
 

Re[2]: What is data-streamer-stripe thread?

2022-09-14 Thread Zhenya Stanilovsky via user

Yep, as I already mentioned: you can't disable this pool entirely, and one
worker thread will still be visible.
You can file an issue, but I can't guarantee it will be completed soon; or you
can do it yourself and submit a pull request.
 
best.
 
>Ok so just to understand on the client side. Set the pool size for data 
>streamer to 1.
> 
>But it will still look blocked?  
>On Mon., Sep. 12, 2022, 8:59 a.m. Zhenya Stanilovsky via user, < 
>user@ignite.apache.org > wrote:
>>John, seems all you can here is just to set this pool size into «1» , «0» — 
>>tends to error.
>> 
>>https://ignite.apache.org/docs/latest/data-streaming#configuring-data-streamer-thread-pool-size
>> 
>>1 thread will still be frozen in such a case. 
>> 
>>> 
>>>> 
>>>>>Hi I'm profiling my application through YourKit and it indicates that a 
>>>>>bunch of these threads (data-streamer-stripe) are "frozen" for 21 days. 
>>>>>This 
>>>>>
>>>>>I'm not using data streaming, is there a way to disable it or just ignore 
>>>>>the messages? The application is configured as thick client (client = 
>>>>>true) 
>>>> 
>>>> 
>>>> 
>>>>  
 
 
 
 

Re: What is data-streamer-stripe thread?

2022-09-12 Thread Zhenya Stanilovsky via user

John, it seems all you can do here is set this pool size to «1»; «0» leads to
an error.
 
https://ignite.apache.org/docs/latest/data-streaming#configuring-data-streamer-thread-pool-size
 
1 thread will still be frozen in such a case. 
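The setting from the link above, expressed as a Spring XML fragment (the minimum accepted value is 1):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Shrink the data-streamer stripe pool to a single thread;
         the pool cannot be disabled entirely (0 is rejected). -->
    <property name="dataStreamerThreadPoolSize" value="1"/>
</bean>
```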
 
> 
>> 
>>>Hi I'm profiling my application through YourKit and it indicates that a 
>>>bunch of these threads (data-streamer-stripe) are "frozen" for 21 days. This 
>>>
>>>I'm not using data streaming, is there a way to disable it or just ignore 
>>>the messages? The application is configured as thick client (client = true) 
>> 
>> 
>> 
>> 

Re[8]: Checkpointing threads

2022-09-12 Thread Zhenya Stanilovsky via user

Not throttling, but: «Thread dump is hidden due to throttling settings». There
is extensive documentation about persistence tuning in Apache Ignite.
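The checkpoint-related knobs from that documentation live in DataStorageConfiguration; a hedged Spring XML fragment with illustrative values (not recommendations, tune against your own workload):

```xml
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
    <!-- More checkpoint threads can raise checkpoint write throughput. -->
    <property name="checkpointThreads" value="8"/>
    <!-- Checkpoint interval in ms; more frequent checkpoints write
         fewer dirty pages each time. -->
    <property name="checkpointFrequency" value="60000"/>
    <!-- Smooth writes instead of freezing them when checkpointing lags. -->
    <property name="writeThrottlingEnabled" value="true"/>
</bean>
```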



 
>Hi,
>Throttling is disabled in the Ignite config, as mentioned in the previous
>reply. What do you suggest so that Ignite checkpointing can keep up with the
>SSD limits?
>On Mon, 12 Sept 2022, 11:32 Zhenya Stanilovsky via user, < 
>user@ignite.apache.org > wrote:
>>
>>
>>
>> 
>>>We have observed one interesting issue with checkpointing. We are using 64G 
>>>RAM 12 CPU with 3K iops/128mbps SSDs. Our application fills up the WAL 
>>>directory really fast and hence the RAM. We made the following observations
>>>
>>>0. Not so bad news first, it resumes processing after getting stuck for 
>>>several minutes.
>>>
>>>1. WAL and WAL archive writes are a lot faster than writes to the work
>>>directory through checkpointing. Very curious to know why this is the case:
>>>checkpointing writes never exceed 15 MB/s, while WAL and WAL archive writes
>>>go up to the SSD's maximum limits.
>> 
>>A very simple example: sequentially changing one key. The WAL records every
>>change, while the checkpoint (what you call "checkpointing") writes only the
>>final state of that one key.
>> 
>>>
>>>2. We observed that when offheap memory usage tends to zero, checkpointing
>>>takes minutes to complete, sometimes 30+ minutes, which stalls application
>>>writes completely on all nodes. It means the whole cluster freezes.
>> 
>>It seems Ignite enables write throttling in such a case; you need some system
>>and cluster tuning.
>> 
>>>
>>>3. Checkpointing thread get stuck at checkpointing page futures.get and 
>>>after several minutes, it logs this error and grid resumes processing
>>>
>>>"sys-stripe-0-#1" #19 prio=5 os_prio=0 cpu=86537.69ms elapsed=2166.63s 
>>>tid=0x7fa52a6f1000 nid=0x3b waiting on condition  [0x7fa4c58be000]
>>>   java.lang.Thread.State: WAITING (parking)
>>>at jdk.internal.misc.Unsafe.park( java.base@11.0.14.1/Native Method)
>>>at java.util.concurrent.locks.LockSupport.park( java.base@11.0.14.1/Unknown 
>>>Source)
>>>at 
>>>org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>>at 
>>>org.apache.ignite.internal.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
>>>at 
>>>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:144)
>>>at 
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1613)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3313)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:143)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:322)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:317)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1908)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1529)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1422)
>>>at 
>>>org.apache.ignite.internal.managers.co

Re[6]: Checkpointing threads

2022-09-12 Thread Zhenya Stanilovsky via user
t 
>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.failCheckpointReadLock(CheckpointTimeoutLock.java:210)
>at 
>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:108)
>at 
>org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1613)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3313)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:143)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:322)
>[2022-09-09 18:58:35,148][INFO ][sys-stripe-7-#8][FailureProcessor] Thread 
>dump is hidden due to throttling settings. Set 
>IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT property to 0 to see all 
>thread dumps.
>
>
>4. Other nodes printy below logs during the window problematic node is stuck 
>at checkpointing
>
>[2022-09-09 18:58:35,153][WARN ][push-metrics-exporter-#80][G] >>> Possible 
>starvation in striped pool.
>    Thread name: sys-stripe-5-#6
>    Queue: 
>[o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@eb9f832,
> Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, 
>ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, 
>arr=[351148], Message closure [msg=GridIoMessage [plc=2, 
>topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=2, 
>arr=[273841,273843], Message closure [msg=GridIoMessage [plc=2, 
>topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridNearSingleGetRequest [futId=1662749921887, key=BinaryObjectImpl [arr= 
>true, ctx=false, start=0], flags=1, topVer=AffinityTopologyVersion [topVer=14, 
>minorTopVer=0], subjId=12746da1-ac0d-4ba1-933e-5aa3f92d2f68, taskNameHash=0, 
>createTtl=-1, accessTtl=-1, txLbl=null, mvccSnapshot=null]]], Message closure 
>[msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, 
>timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse 
>[futIds=GridLongList [idx=1, arr=[351149], 
>o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@110ec0fa,
> Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, 
>ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=10, 
>arr=[414638,414655,414658,414661,414662,414663,414666,414668,414673,414678],
> 
>o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@63ae8204,
> 
>o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@2d3cc0b,
> Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, 
>ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, 
>arr=[414667], Message closure [msg=GridIoMessage [plc=2, 
>topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=4, 
>arr=[351159,351162,351163,351164], Message closure [msg=GridIoMessage 
>[plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, 
>skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse 
>[futIds=GridLongList [idx=1, arr=[290762], Message closure 
>[msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, 
>timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse 
>[futIds=GridLongList [idx=1, arr=[400357], 
>o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@71887193,
> Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, 
>ordered=false, timeout=0, skipOnTimeout=false, 
>msg=GridDhtAtomicSingleUpdateRequest [key=BinaryObjectImpl [arr= true, 
>ctx=false, start=0], val=BinaryObjectImpl [arr= true, ctx=false, start=0], 
>prevVal=null, super=GridDhtAtomicAbstractUpdateRequest [onRes=false, 
>nearNodeId=null, nearFutId=0, flags=, Message closure [msg=GridIoMessage 
>[plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, 
>skipOnTimeout=false, msg=GridNearAtomicSingleUpdateRequest 
>[key=BinaryObjectImpl [arr= true, ctx=false, start=0], 
>parent=GridNearAtomicAbstractSingleUpdateRequest [nodeId=null, futId=1324019, 
>topVer=Af

Re[4]: Checkpointing threads

2022-09-07 Thread Zhenya Stanilovsky via user

OK, Raymond, I understand. But it seems no one has a good answer here; it 
depends on the underlying file system and the nearby (probably cloud) storage 
layer implementation.
If you don't observe «throttling» messages (described in the previous link), 
everything seems fine, but of course you can benchmark your I/O yourself with 
a third-party tool.
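The «benchmark your I/O yourself» suggestion can start as simply as the sketch below; a dedicated tool such as fio gives far more detail (random I/O, queue depths), but dd is universally available. The target path is an assumption, so point it at the actual persistence mount (e.g. the EFS mount):

```shell
# Rough sequential-write benchmark of the volume backing the Ignite work
# directory. The /tmp path is a placeholder -- use the real persistence mount.
# conv=fsync makes dd flush to the storage layer before reporting, so the
# printed throughput figure includes the (EFS/NFS) write latency.
dd if=/dev/zero of=/tmp/ignite-io-test bs=1M count=64 conv=fsync
rm -f /tmp/ignite-io-test
```

dd prints the elapsed time and MB/s to stderr when it finishes; compare that figure against what the storage tier is provisioned for.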
 
>Thanks Zhenya.
> 
>I have seen the link you provide has a lot of good information on this system. 
>But it does not talk about the check point writers in any detail.
> 
>I appreciate this cannot be a bottleneck, my question is more related to: "If 
>I have more check pointing threads will check points take less time". In our 
>case we use AWS EFS so if each checkpoint thread is spending relatively long 
>times blocking on write I/O to the persistent store then more check points 
>allow more concurrent writes to take place. Of course, if the check point 
>threads themselves utilise async I/O tasks and interleave I/O activities on 
>that basis then there may not be an opportunity for performance improvement, 
>but I am not an expert in the Ignite code base :)
> 
>Raymond.
>   
>On Wed, Sep 7, 2022 at 7:51 PM Zhenya Stanilovsky via user < 
>user@ignite.apache.org > wrote:
>>
>>No, there are no log or metrics suggestions, and as I said earlier, this 
>>place can't become a bottleneck. If you have any performance problems, 
>>describe them in more detail; there is interesting reading here [1]
>> 
>>[1]  
>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>    
>>>Thanks Zhenya. 
>>> 
>>>Is there any logging or metrics that would indicate if there was value 
>>>increasing the size of this pool?
>>> 
>>> 
>>>On Fri, 2 Sep 2022 at 8:20 PM, Zhenya Stanilovsky via user < 
>>>user@ignite.apache.org > wrote:
>>>>Hi  Raymond
>>>> 
>>>>Checkpoint threads are responsible for dumping modified pages, so you may 
>>>>consider it a purely I/O-bound operation; the pool size is the number of 
>>>>disk-writing workers.
>>>>I think the default is enough and there is no need to raise it, but it is 
>>>>also up to you.
>>>>   
>>>>>Hi,
>>>>> 
>>>>>I am looking at our configuration of the Ignite checkpointing system to 
>>>>>ensure we have it tuned correctly.
>>>>> 
>>>>>There is a checkpointing thread pool defined, which defaults to 4 threads 
>>>>>in size. I have not been able to find much of a discussion on when/how 
>>>>>this pool size should be changed to reflect the node size Ignite is 
>>>>>running on.
>>>>> 
>>>>>In our case, we are running 16 core servers with 128 GB RAM with 
>>>>>persistence on an NFS storage layer.
>>>>> 
>>>>>Given the number of cores, and the relative latency of NFS compared to 
>>>>>local SSD, is 4 checkpointing threads appropriate, or are we likely to see 
>>>>>better performance if we increased it to 8 (or more)?
>>>>> 
>>>>>If there is a discussion related to this a pointer to it would be good 
>>>>>(it's not really covered in the performance tuning section).
>>>>> 
>>>>>Thanks,
>>>>>Raymond.
>>>>>  --
>>>>>
>>>>>Raymond Wilson
>>>>>Trimble Distinguished Engineer, Civil Construction Software (CCS)
>>>>>11 Birmingham Drive   |   Christchurch, New Zealand
>>>>>raymond_wil...@trimble.com
>>>>>         
>>>>> 
>>>> 
>>>> 
>>>> 
>>>>  
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Trimble Distinguished Engineer, Civil Construction Software (CCS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re[2]: Checkpointing threads

2022-09-07 Thread Zhenya Stanilovsky via user


No, there are no log or metrics suggestions, and as I said earlier, this 
place can't become a bottleneck. If you have any performance problems, 
describe them in more detail; there is interesting reading here [1]
 
[1]  
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
  
>Thanks Zhenya. 
> 
>Is there any logging or metrics that would indicate if there was value 
>increasing the size of this pool?
> 
> 
>On Fri, 2 Sep 2022 at 8:20 PM, Zhenya Stanilovsky via user < 
>user@ignite.apache.org > wrote:
>>Hi  Raymond
>> 
>>Checkpoint threads are responsible for dumping modified pages, so you may 
>>consider it a purely I/O-bound operation; the pool size is the number of 
>>disk-writing workers.
>>I think the default is enough and there is no need to raise it, but it is 
>>also up to you.
>>   
>>>Hi,
>>> 
>>>I am looking at our configuration of the Ignite checkpointing system to 
>>>ensure we have it tuned correctly.
>>> 
>>>There is a checkpointing thread pool defined, which defaults to 4 threads in 
>>>size. I have not been able to find much of a discussion on when/how this 
>>>pool size should be changed to reflect the node size Ignite is running on.
>>> 
>>>In our case, we are running 16 core servers with 128 GB RAM with persistence 
>>>on an NFS storage layer.
>>> 
>>>Given the number of cores, and the relative latency of NFS compared to local 
>>>SSD, is 4 checkpointing threads appropriate, or are we likely to see better 
>>>performance if we increased it to 8 (or more)?
>>> 
>>>If there is a discussion related to this a pointer to it would be good (it's 
>>>not really covered in the performance tuning section).
>>> 
>>>Thanks,
>>>Raymond.
>>>  --
>>>
>>>Raymond Wilson
>>>Trimble Distinguished Engineer, Civil Construction Software (CCS)
>>>11 Birmingham Drive   |   Christchurch, New Zealand
>>>raymond_wil...@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
 
 
 
 

Re: Checkpointing threads

2022-09-02 Thread Zhenya Stanilovsky via user

Hi  Raymond
 
Checkpoint threads are responsible for dumping modified pages, so you may 
consider it a purely I/O-bound operation; the pool size is the number of 
disk-writing workers.
I think the default is enough and there is no need to raise it, but it is 
also up to you.
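For reference, the pool size discussed above is controlled by `DataStorageConfiguration.setCheckpointThreads(...)`. A minimal Spring XML sketch follows; the value 8 and the bare-bones skeleton are illustrative, not a recommendation:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
      <!-- Number of checkpoint (disk-writing) worker threads; the default is 4. -->
      <property name="checkpointThreads" value="8"/>
    </bean>
  </property>
</bean>
```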
 
>Hi,
> 
>I am looking at our configuration of the Ignite checkpointing system to ensure 
>we have it tuned correctly.
> 
>There is a checkpointing thread pool defined, which defaults to 4 threads in 
>size. I have not been able to find much of a discussion on when/how this pool 
>size should be changed to reflect the node size Ignite is running on.
> 
>In our case, we are running 16 core servers with 128 GB RAM with persistence 
>on an NFS storage layer.
> 
>Given the number of cores, and the relative latency of NFS compared to local 
>SSD, is 4 checkpointing threads appropriate, or are we likely to see better 
>performance if we increased it to 8 (or more)?
> 
>If there is a discussion related to this a pointer to it would be good (it's 
>not really covered in the performance tuning section).
> 
>Thanks,
>Raymond.
>  --
>
>Raymond Wilson
>Trimble Distinguished Engineer, Civil Construction Software (CCS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re: Ignite Discovery worker blocked error when client node attempts to connect

2022-07-06 Thread Zhenya Stanilovsky

Hi, can you share the full log somehow?
The provided information is not enough for analysis.

 
>We have deployed an Ignite 2.11.1 cluster in Kubernetes.  When a client node 
>attempts to join the grid, we are getting the  tcp-disco-msg-worker blocked 
>below.  
> 
>We have used the same configuration successfully with another deployment so we 
>are not certain why we are getting this error. 
>The only change is that in the deployment that is failing we have added 
>resource limits to the configuration.
> 
>Link to same question of stack overflow. 
> 
>[20:07:24,201][SEVERE][tcp-disco-msg-worker-[crd]-#2-#48][G] Blocked 
>system-critical thread has been detected. This can lead to cluster-wide 
>undefined behaviour [workerName=db-checkpoint-thread, 
>threadName=db-checkpoint-thread-#78, blockedFor=4562s]
>[20:07:24] Possible failure suppressed accordingly to a configured handler 
>[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
>super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
>[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
>failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class 
>o.a.i.IgniteException: GridWorker [name=db-checkpoint-thread, 
>igniteInstanceName=null, finished=false, heartbeatTs=1657047082073]]]
>[20:07:27,073][SEVERE][tcp-disco-msg-worker-[crd]-#2-#48][G] Blocked 
>system-critical thread has been detected. This can lead to cluster-wide 
>undefined behaviour [workerName=sys-stripe-0, threadName=sys-stripe-0-#1, 
>blockedFor=4072s]
> 
>Thread [name="qtp2015455415-61", id=61, state=TIMED_WAITING, blockCnt=1, 
>waitCnt=702]
>    Lock 
>[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@78322a4,
> ownerName=null, ownerId=-1]
>        at sun.misc.Unsafe.park(Native Method)
>        at 
>java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>        at 
>java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>        at 
>org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
>        at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:973)
>        at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1023)
>        at java.lang.Thread.run(Thread.java:748)
> 
>Thread [name="qtp2015455415-60", id=60, state=RUNNABLE, blockCnt=3, 
>waitCnt=679]
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>        - locked sun.nio.ch.Util$3@7a7b4390
>        - locked java.util.Collections$UnmodifiableSet@220c68db
>        - locked sun.nio.ch.EPollSelectorImpl@4f569077
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
>        at 
>org.eclipse.jetty.io.ManagedSelector.nioSelect(ManagedSelector.java:183)
>        at 
>org.eclipse.jetty.io.ManagedSelector.select(ManagedSelector.java:190)
>        at 
>org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:606)
>        at 
>org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:543)
>        at 
>org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
>        at 
>org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
>        at 
>org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
>        at 
>org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
>        at 
>org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
>        at 
>org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
>        at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
>        at java.lang.Thread.run(Thread.java:748)
> 
>Thread [name="qtp2015455415-59", id=59, state=TIMED_WAITING, blockCnt=1, 
>waitCnt=684]
>    Lock 
>[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@78322a4,
> ownerName=null, ownerId=-1]
>        at sun.misc.Unsafe.park(Native Method)
>        at 
>java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>        at 
>java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>        at 
>org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
>        at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:973)
>        at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1023)
>        at 

Re[2]: gridgain ultimate edition snapshot error

2022-06-07 Thread Zhenya Stanilovsky

Hi, you need to change the limits [1]
 
[1]  
https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/general-perf-tips#ulimits
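As a concrete starting point for the limits change above, a sketch (the OS user name "ignite" is an assumption; use whichever account runs the GridGain process):

```shell
# Show the current soft and hard open-file limits for this shell/session.
ulimit -Sn
ulimit -Hn

# To raise the limit persistently for the user that runs GridGain/Ignite,
# add lines like these to /etc/security/limits.conf and re-login:
#   ignite  soft  nofile  65536
#   ignite  hard  nofile  65536
```

With persistence enabled, the node keeps a file handle per partition file per cache, so the required limit grows with the number of caches, partitions, and backups.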
  
>Tuesday, June 7, 2022, 8:35 +03:00 from Surinder Mehra :
> 
>Hi,
>I was going through this post on stackoverflow which is about the same issue. 
>The fact that the snapshot works in Apache Ignite but not in the Ultimate 
>edition indicates there is some bug in the latter. Could you please confirm? We have around 
>15 caches with 2 backups. I changed backups to zero but still see this issue. 
>Could you please advise further.
>
>https://stackoverflow.com/questions/72041292/is-there-a-fix-for-too-many-open-files-error-in-gridgain-cluster
>  
>On Mon, Jun 6, 2022 at 9:13 PM Surinder Mehra < redni...@gmail.com > wrote:
>>Hi,
>>I was experimenting with the GG ultimate edition to take snapshots and 
>>encountered the below error and cluster stops. Please note that this works in 
>>the ignite free version and we don't see too many files open error. Is this a 
>>bug or we are missing some configuration?
>> 
>>version:  gridgain-8.8.19
>>
>>/bin./snapshot-utility.sh snapshot -type=full
>>
>>[21:03:51,693][SEVERE][db-snapshot-executor-stripe-0-#35][] Critical system 
>>error detected. Will be handled accordingly to configured handler 
>>[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
>>super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
>>[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
>>failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
>>o.a.i.i.processors.cache.persistence.StorageException: Failed to initialize 
>>partition file: 
>>/home/usr/tools/gridgain-ultimate-8.8.19/work/db/node00-c221fe71-5d29-4cd7-ab0f-9fa8240711b2/cache-name/part-88.bin]
>>class 
>>org.apache.ignite.internal.processors.cache.persistence.StorageException: 
>>Failed to initialize partition file: 
>>/home/usr/tools/gridgain-ultimate-8.8.19/work/db/node00-c221fe71-5d29-4cd7-ab0f-9fa8240711b2/cache-name/part-88.bin
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:519)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.read(FilePageStore.java:405)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.pagemem.PageReadWriteManagerImpl.read(PageReadWriteManagerImpl.java:68)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:577)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:911)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:730)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:711)
>>at 
>>org.gridgain.grid.internal.processors.cache.database.snapshot.SnapshotCreateFuture.completeSavingAllocatedIndex(SnapshotCreateFuture.java:1304)
>>at 
>>org.gridgain.grid.internal.processors.cache.database.snapshot.SnapshotCreateFuture.completeSnapshotCreation(SnapshotCreateFuture.java:1486)
>>at 
>>org.gridgain.grid.internal.processors.cache.database.snapshot.SnapshotCreateFuture.doFinalStage(SnapshotCreateFuture.java:1171)
>>at 
>>org.gridgain.grid.internal.processors.cache.database.snapshot.SnapshotOperationFuture.completeStagesLocally(SnapshotOperationFuture.java:2352)
>>at 
>>org.gridgain.grid.internal.processors.cache.database.snapshot.SnapshotOperationFuture$10.run(SnapshotOperationFuture.java:2286)
>>at 
>>org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:567)
>>at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>>at java.base/java.lang.Thread.run(Thread.java:829)
>>Caused by: java.nio.file.FileSystemException: 
>>/home/usr/tools/gridgain-ultimate-8.8.19/work/db/node00-c221fe71-5d29-4cd7-ab0f-9fa8240711b2/cache-name/part-88.bin:
>> Too many open files
>>at 
>>java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
>>at 
>>java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>>at 
>>java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>>at 
>>java.base/sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:201)
>>at 
>>java.base/java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:253)
>>at 
>>java.base/java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:311)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.(AsyncFileIO.java:65)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:43)
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:491)
>>... 14 more

Re[2]: LEFT, RIGHT JOIN not working

2022-06-06 Thread Zhenya Stanilovsky


Hi! Thanks for the example; I hope there will be some updates here shortly.


 
>Hi,
>Just wondering if you had an opportunity to look into this.  
>On Thu, Jun 2, 2022 at 2:52 PM Surinder Mehra < redni...@gmail.com > wrote:
>>Hi,
>>Please find the attached java file which reproduces the issue. As you can 
>>see, the cache key is used as a join condition but LEFT join is still giving 
>>only common values.
>> 
>>output:
>>[2, Keyboard, 2]
>>Size of actual output 1
>>Expected size 3 is not equal to Actual size 1
>>   
>>On Thu, Jun 2, 2022 at 11:48 AM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>wrote:
>>>Hi, Surinder Mehra! I checked your SQL and it works correctly for me.
>>>*  You do not need to define AffinityKeyMapped on the key class; see [1], 
>>>and you can simply adapt [2] to your case.
>>>*  If the problem still exists somehow, please attach a code example.
>>>Thanks!
>>> 
>>>[1]  
>>>https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/affinity/AffinityKeyMapped.html
>>>[2]  
>>>https://github.com/apache/ignite/blob/master/modules/indexing/src/test/java/org/apache/ignite/internal/processors/cache/IgniteCacheJoinPartitionedAndReplicatedTest.java#L160
>>>   
>>>>Hi,
>>>>I have the following sample code to demo issue in SQL joins. I have created 
>>>>an affinity key and value as shown below and added some sample data to it. 
>>>>When I try LEFT self join on this table it always gives me common rows   
>>>>irrespective of LEFT or RIGHT JOIN
>>>>Could you please help me find what am I doing wrong here.
>>>> 
>>>>cache Key :
>>>>
>>>>public class OrderAffinityKey {
>>>>    Integer id;
>>>>    @AffinityKeyMapped
>>>>    Integer customerId;
>>>>}
>>>>
>>>>
>>>>cache value:
>>>>
>>>>public class Order implements Serializable {
>>>>    @QuerySqlField
>>>>    Integer id;
>>>>
>>>>    @AffinityKeyMapped
>>>>    @QuerySqlField Integer customerId;
>>>>    @QuerySqlField String product;
>>>>}
>>>>
>>>>
>>>>Table C: (select customerID, product FROM "orderCache"."ORDER" WHERE 
>>>>CUSTOMERID IN ( 1, 2))
>>>>
>>>>1 keyboard
>>>>2 Laptop
>>>>
>>>>
>>>>Table O: (select customerID, product FROM "orderCache"."ORDER" WHERE 
>>>>CUSTOMERID IN ( 3, 2))
>>>>
>>>>2 laptop
>>>>3 mouse
>>>>
>>>>
>>>>
>>>>JOIN:
>>>>
>>>>Query :
>>>>select DISTINCT C.customerID, C.product, O.customerID
>>>>FROM
>>>> (select customerID, product FROM "orderCache"."ORDER" WHERE CUSTOMERID IN 
>>>>( 1, 2)) C
>>>> LEFT JOIN
>>>>(select customerID, product FROM "orderCache"."ORDER" WHERE CUSTOMERID IN ( 
>>>>3, 2)) O
>>>>ON
>>>>C.customerId = O.customerId
>>>>
>>>>
>>>>Output:
>>>>
>>>>2 laptop   2
>>>>3 mouse   3
>>>>
>>>>Expected output:
>>>>
>>>>1 keyboard   null
>>>>2 laptop   2
>>>>3 mouse   3 
>>> 
>>> 
>>> 
>>>  
 
 
 
 

Re: LEFT, RIGHT JOIN not working

2022-06-02 Thread Zhenya Stanilovsky

Hi, Surinder Mehra! I checked your SQL and it works correctly for me.
*  You do not need to define AffinityKeyMapped on the key class; see [1], and 
you can simply adapt [2] to your case.
*  If the problem still exists somehow, please attach a code example.
Thanks!
 
[1]  
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/affinity/AffinityKeyMapped.html
[2]  
https://github.com/apache/ignite/blob/master/modules/indexing/src/test/java/org/apache/ignite/internal/processors/cache/IgniteCacheJoinPartitionedAndReplicatedTest.java#L160
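One common cause of a partitioned-cache LEFT JOIN silently losing rows is a non-collocated join; enabling distributed joins on the query is worth trying. A sketch, assuming an `Ignite` instance named `ignite` and the ignite-core/ignite-indexing jars on the classpath (the query text is taken from the original post):

```java
// Sketch: rerun the LEFT JOIN with non-collocated (distributed) joins enabled,
// so rows whose join counterpart lives on another node are not dropped.
SqlFieldsQuery qry = new SqlFieldsQuery(
        "SELECT DISTINCT C.customerID, C.product, O.customerID " +
        "FROM (SELECT customerID, product FROM \"orderCache\".\"ORDER\" WHERE customerID IN (1, 2)) C " +
        "LEFT JOIN (SELECT customerID, product FROM \"orderCache\".\"ORDER\" WHERE customerID IN (3, 2)) O " +
        "ON C.customerId = O.customerId")
    .setDistributedJoins(true);

try (FieldsQueryCursor<List<?>> cur = ignite.cache("orderCache").query(qry)) {
    cur.forEach(row -> System.out.println(row));
}
```

Distributed joins add network round-trips, so proper affinity collocation (as in [2]) remains the faster option when the data model allows it.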
 
>Hi,
>I have the following sample code to demo issue in SQL joins. I have created an 
>affinity key and value as shown below and added some sample data to it. When I 
>try LEFT self join on this table it always gives me common rows   irrespective 
>of LEFT or RIGHT JOIN
>Could you please help me find what am I doing wrong here.
> 
>cache Key :
>
>public class OrderAffinityKey {
>    Integer id;
>    @AffinityKeyMapped
>    Integer customerId;
>}
>
>
>cache value:
>
>public class Order implements Serializable {
>    @QuerySqlField
>    Integer id;
>
>    @AffinityKeyMapped
>    @QuerySqlField Integer customerId;
>    @QuerySqlField String product;
>}
>
>
>Table C: (select customerID, product FROM "orderCache"."ORDER" WHERE 
>CUSTOMERID IN ( 1, 2))
>
>1 keyboard
>2 Laptop
>
>
>Table O: (select customerID, product FROM "orderCache"."ORDER" WHERE 
>CUSTOMERID IN ( 3, 2))
>
>2 laptop
>3 mouse
>
>
>
>JOIN:
>
>Query :
>select DISTINCT C.customerID, C.product, O.customerID
>FROM
> (select customerID, product FROM "orderCache"."ORDER" WHERE CUSTOMERID IN ( 
>1, 2)) C
> LEFT JOIN
>(select customerID, product FROM "orderCache"."ORDER" WHERE CUSTOMERID IN ( 3, 
>2)) O
>ON
>C.customerId = O.customerId
>
>
>Output:
>
>2 laptop   2
>3 mouse   3
>
>Expected output:
>
>1 keyboard   null
>2 laptop   2
>3 mouse   3 
 
 
 
 

Re: Strange annoying message in Ignite 2.13 logs

2022-05-17 Thread Zhenya Stanilovsky

Hello, Noah. Yes, this behavior seems mistaken; just filter that log message 
out somehow.
I filed the issue [1]
 
[1]  https://issues.apache.org/jira/browse/IGNITE-16989
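Until the fix lands, one way to filter the message out is a logging-side filter. A log4j2 sketch follows; the logger name is an assumption about where the warning originates, so adjust it to the category shown in your own logs:

```xml
<!-- Drop only the noisy "Failed to ping node [nodeId=null]" warnings;
     everything else from discovery still gets through. -->
<Logger name="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi" level="WARN">
  <RegexFilter regex=".*Failed to ping node \[nodeId=null\].*"
               onMatch="DENY" onMismatch="NEUTRAL"/>
</Logger>
```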
 
> 
>> 
>>>Hi all, 
>>> 
>>>I recently upgraded Ignite from 2.8.1 to 2.13 and started to obtain the 
>>>following annoying warning messages.
>>> 
>>>2022-05-17 01:03:29,633 [100] WRN [MutableCacheComputeServer] Failed to ping 
>>>node [nodeId=null]. Reached the timeout 1ms. Cause: Connection refused 
>>>(Connection refused)  
>>>2022-05-17 01:03:29,633 [100] WRN [MutableCacheComputeServer] Failed to ping 
>>>node [nodeId=null]. Reached the timeout 1ms. Cause: Connection refused 
>>>(Connection refused)  
>>>2022-05-17 01:03:29,634 [100] WRN [MutableCacheComputeServer] Failed to ping 
>>>node [nodeId=null]. Reached the timeout 1ms. Cause: Connection refused 
>>>(Connection refused)  
>>>2022-05-17 01:03:29,636 [100] WRN [MutableCacheComputeServer] Failed to ping 
>>>node [nodeId=null]. Reached the timeout 1ms. Cause: Connection refused 
>>>(Connection refused)  
>>>2022-05-17 01:03:29,637 [100] WRN [MutableCacheComputeServer] Failed to ping 
>>>node [nodeId=null]. Reached the timeout 1ms. Cause: Connection refused 
>>>(Connection refused)  
>>>2022-05-17 01:03:29,639 [100] WRN [MutableCacheComputeServer] Failed to ping 
>>>node [nodeId=null]. Reached the timeout 1ms. Cause: Connection refused 
>>>(Connection refused)  
>>> 
>>>I tried to find any difference between 2.8.1 and 2.13 and found that the 
>>>newer version has added the following code which is to add those warning 
>>>messages.
>>>if (spi.failureDetectionTimeoutEnabled() && timeoutHelper.checkFailureTimeoutReached(e)) {
>>>    log.warning("Failed to ping node [nodeId=" + nodeId + "]. Reached the timeout " +
>>>        spi.failureDetectionTimeout() + "ms. Cause: " + e.getMessage());
>>>    break;
>>>}
>>> 
>>>I really wonder which cases the `nodeId` can be null and how I can fix this 
>>>warning message.
>>>Could anyone please help me avoid these messages and let me know which cases 
>>>the nodeId can null?
>>> 
>>>Kind regards,
>>> 
>>>  
>> 
>> 
>> 
>> 

Re[2]: Is apache ignite suitable for sql querying on ignite cache?

2022-05-05 Thread Zhenya Stanilovsky

Yes, it's suitable. It seems there is a typo here:
«Like memory store, persistence mode enabled»; I suspect you meant to say: 
persistence mode disabled.
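For the 5-minute expiry mentioned in the use case, the standard JCache expiry policy can be set on the cache configuration. A hedged Java sketch, assuming an `Ignite` instance named `ignite`; the cache name and value class are illustrative, not from the thread:

```java
// Sketch: entries expire 5 minutes after creation, matching the use case's
// requirement to keep report data in memory for 5 minutes.
CacheConfiguration<Long, ReportRow> ccfg = new CacheConfiguration<>("reportCache");
ccfg.setExpiryPolicyFactory(
    CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.MINUTES, 5)));
IgniteCache<Long, ReportRow> cache = ignite.getOrCreateCache(ccfg);
```

With CreatedExpiryPolicy the 5-minute clock starts at insertion; if the window should instead reset on every read (the «keep it until the user closes the report» behavior), AccessedExpiryPolicy is the closer fit.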
 
>You might also want to consider using third-party persistence ( 
>external-storage ) instead of “manually” reading/writing from Oracle.
> 
>>On 4 May 2022, at 07:11, Reshma Bochare < rboch...@csod.com > wrote:  
>>Hi Team,
>>    Could you please help me with the below use case.
>> 
>>Thanks,
>>Reshma.
>> 
>>From:   Reshma Bochare  
>>Sent:   Thursday, April 28, 2022 4:52 PM
>>To:   user@ignite.apache.org
>>Subject:   Is apache ignite suitable for sql querying on ignite cache?
>> 
>>Hi Team,
>>    We want to use apache ignite for below use case.
>>We provide reports to customer . We execute query on oracle and feed it into 
>>oracle. And on the top of report, we allow pagination, sorting, grouping and 
>>export etc. Right now for each and every action after report execution, we 
>>hit oracle to get data. So we want we will hit oracle once and get data in 
>>memory and execute query on memory data to get data of further operations 
>>like pagination, sorting, grouping and export .
>>To achieve this, I am thinking to use of apache ignite as below
>> 
>>1.      We execute query on oracle database and want to save that data 
>>into apache ignite as cache. (Like memory store, persistence mode enabled)
>>2.      And on ignite cache, execute some complicated queries and fetch 
>>data from ignite cache.
>>3.      Keep the expiry of this cache to 5 minutes.  
>>4.      So there will be lot of insertion of new data and at the same 
>>time get of cache also like User executes report and performs some grouping, 
>>and closes report. In this case we will create cache in ignite and keep in 
>>the memory till user closes it. Meanwhile perform sql queries on ignite cache.
>>5.      For cache, considering creating class with different data types 
>>fields to it. And perform query on this class
>> 
>>Is it apache ignite suitable for this use case.?
>> 
>>Thanks,
>>Reshma.  
>> 
>> 
>>  This message, together with any attachments, is intended only for the use 
>>of the individual or entity to which it is addressed and may contain 
>>confidential and/or privileged information. If you are not the intended 
>>recipient(s), or the employee or agent responsible for delivery of this 
>>message to the intended recipient(s), you are hereby notified that any 
>>dissemination, distribution or copying of this message, or any attachment, is 
>>strictly prohibited. If you have received this message in error, please 
>>immediately notify the sender and delete the message, together with any 
>>attachments, from your computer. Thank you for your cooperation. 
 
 
 
 

Re[2]: Apache Ignite H2 Vulnerabilities

2022-04-28 Thread Zhenya Stanilovsky

It seems this will be published with the new documentation. Nikita Amelchev, isn't 
that right? Check [1]
 
[1]  https://issues.apache.org/jira/browse/IGNITE-15189
 
>Thank you Stephen. 
>Is there also a writeup summarizing what is/isn't supported with this 
>'experimental' feature?  
>On Thu, Apr 28, 2022 at 4:30 PM Stephen Darlington < 
>stephen.darling...@gridgain.com > wrote:
>>https://github.com/apache/ignite/blob/2.13.0/modules/calcite/README.txt
>> 
>>>On 28 Apr 2022, at 11:46, Lokesh Bandaru < lokeshband...@gmail.com > wrote:  
>>>Thanks Ilya. 
>>> 
>>>Version 2.13 has come out but still seems to be shipping with the same 
>>>vulnerability-ridden version of h2 database. 
>>>The documentation doesn't mention if/how Calcite is turned on. 
>>>Can you advise on how it can be enabled?   
>>>On Wed, Apr 13, 2022 at 7:29 AM Ilya Korol < llivezk...@gmail.com > wrote:
Hi Lokesh,

Updates for running Ignite over Java 17 are already in master. Please
take a look:
https://github.com/apache/ignite/blob/master/bin/include/jvmdefaults.sh

On 2022/04/12 10:11:57 Lokesh Bandaru wrote:
 > You are fast. :) Was just typing a reply on top of the last one and yours
 > is already here.
 >
 > Ignore the last question, found this,
 >  https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.13 .
 > *Looking forward to this release. *
 >
 > *One slightly unrelated question, feel free to ignore. *
 > *I know there is no support(or certified) for any version of Java greater
 > than 11. *
 > *What would it take for 2.13 to be able to run on Java17?*
 >
 > On Tue, Apr 12, 2022 at 3:36 PM Stephen Darlington <
 >  stephen.darling...@gridgain.com > wrote:
 >
 > > Code freeze was yesterday. The target release date is 22 April.
 > >
 > > More here: Apache+Ignite+2.13
 > > < 
https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.13 >
 > >
 > > On 12 Apr 2022, at 11:03, Lokesh Bandaru < lo...@gmail.com > wrote:
 > >
 > > Thanks for getting back, Stephen.
 > > I am aware that Calcite is in the plans.
 > > Any tentative timeline as to when 2.13(beta/ga) is going to be made
 > > available?
 > >
 > > Regards.
 > >
 > > On Tue, Apr 12, 2022 at 2:15 PM Stephen Darlington <
 > >  stephen.darling...@gridgain.com > wrote:
 > >
 > >> The H2 project removed support for Ignite some time ago (
 > >>  https://github.com/h2database/h2database/pull/2227 ) which makes it
 > >> difficult to move to newer versions.
 > >>
 > >> The next version of Ignite (2.13) has an alternative SQL engine
(Apache
 > >> Calcite) so over time there will be no need for H2.
 > >>
 > >> On 11 Apr 2022, at 20:34, Lokesh Bandaru < lo...@gmail.com > wrote:
 > >>
 > >> Resending.
 > >>
 > >> On Mon, Apr 11, 2022 at 6:42 PM Lokesh Bandaru < lo...@gmail.com >
 > >> wrote:
 > >>
 > >>> Hello there, hi
 > >>>
 > >>> Writing to you with regards to the security
vulnerabilities(particularly
 > >>> the most recent ones, CVE-2022-xxx and CVE-2021-xxx) in the H2
database and
 > >>> the Apache Ignite's dependency on the flagged versions of H2.
 > >>> There is an open issue tracking this,
 > >>>  https://issues.apache.org/jira/browse/IGNITE-16542 , which doesn't
seem
 > >>> to have been fully addressed yet.
 > >>> Have these problems been overcome already? Can you please advise?
 > >>>
 > >>> Thanks.
 > >>>
 > >>
 > >>
 > >
 > 
 
 
 
 

Re[2]: Query performance

2022-04-28 Thread Zhenya Stanilovsky

Got it. Can you show both SQL queries (with the strict and the non-strict 
criteria) and the EXPLAIN output in both cases?
Do you have indexes?
 
>Hey, it's enabled already. Please check the console log in my email   
>On Thu, 28 Apr 2022, 13:16 Zhenya Stanilovsky, < arzamas...@mail.ru > wrote:
>>
>>Hi, can you check the same query with the lazy [1] flag? 
>>
>>[1]  
>>https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setLazy-boolean-
>>   
>>>Hi,
>>>We are running a sql field query to fetch 4million records from ignite 
>>>cache. We have created a group index for all fields used in where clause and 
>>>can see group index used. But the query takes 20 minutes to fetch all 
>>>records. If we provide more strict criteria to fetch only say 500 records, 
>>>it completes in less than 200 millis. This makes me wonder if I am missing 
>>>some configuration to make result fetching faster or I am not using ignite 
>>>correctly for this use case. Below log is printed during query execution.  
>>>Could you please advise me?
>>>
>>>[11:46:30,694][WARNING][query-#8568][GridMapQueryExecutor] Query produced 
>>>big result set.  [fetched=10, duration=120331ms, type=MAP, 
>>>distributedJoin=false, enforceJoinOrder=false, lazy=true, 
>> 
>> 
>> 
>>  
 
 
 
 

Re: Query performance

2022-04-28 Thread Zhenya Stanilovsky


Hi, can you check the same query with the lazy [1] flag? 

[1]  
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setLazy-boolean-
 
>Hi,
>We are running a sql field query to fetch 4million records from ignite cache. 
>We have created a group index for all fields used in where clause and can see 
>group index used. But the query takes 20 minutes to fetch all records. If we 
>provide more strict criteria to fetch only say 500 records, it completes in 
>less than 200 millis. This makes me wonder if I am missing some configuration 
>to make result fetching faster or I am not using ignite correctly for this use 
>case. Below log is printed during query execution.  Could you please advise me?
>
>[11:46:30,694][WARNING][query-#8568][GridMapQueryExecutor] Query produced big 
>result set.  [fetched=10, duration=120331ms, type=MAP, 
>distributedJoin=false, enforceJoinOrder=false, lazy=true, 
 
 
 
 

Re[2]: BinaryObject Data Can Not Mapping To SQL-Data

2022-04-19 Thread Zhenya Stanilovsky

hi !
BinaryObjectBuilder oldBuilder = 
igniteClient.binary().builder("com.inspur...PubPartitionKeys_1_7");
 
do you call:
 
oldBuilder.build(); // afterwards?
 
If so, what does "data is not mapped to SQL" mean? Is it an error in the log, on 
the client side, or something else?
 
thanks !
>Hi,
> 
>I have had the same experience without sql, using KV API only. My cluster 
>consists of several data nodes and self-written jar application that starts 
>the client node. When started, client node executes mapreduce tasks for data 
>load and processing.
> 
>The workaround is as follows:
>1. create POJO on the client node;
>2. convert it to the binary object;
>3. on the data node, get binary object over the network and get its builder 
>(obj.toBuilder());
>4. set some fields, build and put in the cache.
> 
>The builder on the step 3 seems to be the same as the one on the cluent node.
> 
>Hope that helps,
>Vladimir
>  13:06, April 18, 2022, y < hty1994...@163.com >:
>>Hi ,
>>When using binary to insert data, I need to  get  an exist 
>>BinaryObject/BinaryObjectBuilder   from the database, similar to the code 
>>below. 
>>[inline attachment (code screenshot) not preserved in the archive]
>>
>>If I create a BinaryObjectBuilder directly, inserting binary data does not 
>>map to table data. The following code will not throw error, but the data is 
>>not mapped to sql.  If there is  no data in my table at first , how can I 
>>insert data?
>>[inline attachment (code screenshot) not preserved in the archive]
>>
>>   
>> 
>
>--
>Sent from the Yandex Mail mobile app 
 
 
 
 

Re: Fwd: One of the six Ignite nodes stuck causing entire cluster not operational

2022-04-05 Thread Zhenya Stanilovsky

Hello. I checked your logs and found no issues; it seems we need more detailed 
logs and some infrastructure inspection.
A very quick fix in your case is simply to update to a newer version, since some 
major bugs have been fixed there.
 
>Hello,
> 
>Background
>We are using Ignite version 2.8.1-1 and running a cluster of 6 ignite server 
>nodes with the same configurations. Each ignite server node has 4 cores and 
>16GB host memory. The service configuration is attached. We use java clients 
>to connect to the ignite server nodes to write and read caches. We do not use 
>any of the SQL functionality.
> 
>The Issue
>In the past few months, it has occurred multiple times that one out of six 
>nodes got into the  SYSTEM_WORKER_BLOCKED state. When this happens, the 
>performance of the other 5 nodes also gets impacted.  Checking the metrics 
>printed in the log just before the issue has happened, it suggests the system 
>was not using too much resource. The CPU usage was low, there was plenty of 
>room on the heap, and the affected thread is not in deadlock mode status 
>either.  Without an explicit error or warning log regarding the issue, it is 
>hard to tell what went wrong.  Could you please take a look at the 
>configuration and log and give us some hints?
> 
> 
>In addition, we have configured the  RestartProcessFailureHandler to handle 
>the system error like  SYSTEM_WORKER_BLOCKED, the node is supposed to restart 
>itself but it never did, which is a separate issue. Maybe the   
>RestartProcessFailureHandler is not suitable for handling such failure?
> 
>Logs
> 
>[2022-04-05T04:14:34,104] [INFO ] 
>grid-timeout-worker-#23%ignite-jetstream-prd1% 
>[IgniteKernal%ignite-jetstream-prd1]
>Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>^-- Node  [id=d26bf020, name=ignite-jetstream-prd1, uptime=2 days, 
>08:44:17.623]
>^-- H/N/C  [hosts=89, nodes=218, CPUs=1352]
>^-- CPU  [cur=1.83%, avg=14.86%, GC=0%]
>^-- PageMemory  [pages=1576438]
>^-- Heap  [used=1015MB, free=57.37%, comm=2382MB]
>^-- Off-heap  [used=6230MB, free=6.35%, comm=6552MB]
>^-- sysMemPlc region  [used=0MB, free=99.99%, comm=100MB]
>^-- default region  [used=6229MB, free=1.93%, comm=6352MB]
>^-- metastoreMemPlc region  [used=0MB, free=99.62%, comm=0MB]
>^-- TxLog region  [used=0MB, free=100%, comm=100MB]
>^-- Ignite persistence  [used=55405MB]
>^-- sysMemPlc region  [used=0MB]
>^-- default region  [used=55404MB]
>^-- metastoreMemPlc region  [used=0MB]
>^-- TxLog region  [used=0MB]
>^-- Outbound messages queue  [size=0]
>^-- Public thread pool  [active=0, idle=2, qSize=0]
>^-- System thread pool  [active=0, idle=8, qSize=0]
>...
>[2022-04-05T04:15:13,406] [ERROR] 
>grid-timeout-worker-#23%ignite-jetstream-prd1% [G]  Blocked system-critical 
>thread has been detected. This can lead to cluster-wide undefined behaviour  
>workerName=sys-stripe-3, threadName=sys-stripe-3-#4%ignite-jetstream-prd1%, 
>blockedFor=12s
>[2022-04-05T04:15:13,417] [WARN ] 
>grid-timeout-worker-#23%ignite-jetstream-prd1% [G]  Thread  
>name="sys-stripe-3-#4%ignite-jetstream-prd1%", id=20, state=TIMED_WAITING, 
>blockCnt=15385, waitCnt=12226082
>Lock  
>[object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@70b28434,
> ownerName=null, ownerId=-1]
>[2022-04-05T04:15:13,419] [ERROR] 
>grid-timeout-worker-#23%ignite-jetstream-prd1% [] Critical system error 
>detected. Will be handled accordingly to configured handler 
>[hnd=RestartProcessFailureHandler [super=AbstractFailureHandler 
>[ignoredFailureTypes=UnmodifiableSet []]], failureCtx=FailureContext 
>[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker  
>[name=sys-stripe-3, igniteInstanceName=ignite-jetstream-prd1, finished=false, 
>heartbeatTs=1649132101219] ]]
>org.apache.ignite.IgniteException: GridWorker  [name=sys-stripe-3, 
>igniteInstanceName=ignite-jetstream-prd1, finished=false, 
>heartbeatTs=1649132101219]
>at 
>org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$3.apply(IgnitionEx.java:1810)
>  [ignite-core-2.8.1.jar:2.8.1]
>at 
>org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$3.apply(IgnitionEx.java:1805)
>  [ignite-core-2.8.1.jar:2.8.1]
>at 
>org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:234)
>  [ignite-core-2.8.1.jar:2.8.1]
>at 
>org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)  
>[ignite-core-2.8.1.jar:2.8.1]
>at 
>org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221)
>  [ignite-core-2.8.1.jar:2.8.1]
>at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)  
>[ignite-core-2.8.1.jar:2.8.1]
>at java.lang.Thread.run(Thread.java:748)  [?:1.8.0_312] ...
>[2022-04-05T04:15:15,970] [WARN ] 
>grid-timeout-worker-#23%ignite-jetstream-prd1% [G]  >>> Possible starvation in 
>striped pool.
>Thread name: sys-stripe-3-#4%ignite-jetstream-prd1%
>Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
>topicOrd=8, 

Re: ignite-shmem ?

2022-04-04 Thread Zhenya Stanilovsky

Hi! It has already been removed in master, and Ignite 2.13 (as far as I know) will 
be released without this lib. As for GridGain, I don't know, but it can easily 
work without this lib with no impact on functional correctness.


 
>hi Igniters,
> 
>Can anyone please elaborate what below is used for ? The inside 
>libigniteshmem.so is labelled as high security risk by compliance tool hence i 
>have to exclude it. but seems the fundermental funcationalities are not 
>impact.  Is it safe to exlude it ? or any advanced ignite usages could be 
>impacted by the removal of ignite-shmem ?
>org.gridgain:ignite-shmem
> 
>Server OS: redhat 7
> 
> 
>Thanks,
>MJ 
 
 
 
 

Re[2]: version of h2 Database Engine.

2022-02-08 Thread Zhenya Stanilovsky

Dany, possibly you can try to shade [1] H2, i.e. rebuild the Ignite project from 
scratch with a few modifications?
 
[1]  https://maven.apache.org/plugins/maven-shade-plugin/
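As a rough illustration of that suggestion, here is a hypothetical maven-shade-plugin fragment that relocates the bundled H2 classes during such a rebuild (the relocated package name and the plugin version are assumptions):

```xml
<!-- Hypothetical fragment: relocates org.h2 classes inside the rebuilt jar
     so the embedded copy is no longer matched against the public H2 artifact. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.h2</pattern>
            <shadedPattern>org.apache.ignite.shaded.h2</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Note that shading only renames and hides the classes from dependency scanners; the H2 code itself, including any vulnerabilities it carries, is still present.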
 
>Hi,
> 
>Regarding the H2 deadlock, do you have an expected ETA for post-release 2.13? 
>All our projects are currently block because H2 is such problematic we cannot 
>build project using it. 
> 
>Thank you 
> 
>>On Feb 7, 2022, at 1:08 AM, Alex Plehanov < plehanov.a...@gmail.com > wrote:  
>>Hello,
>> 
>>Unfortunately, update to the newest H2 version is not possible, since H2 
>>removed the Ignite support [1].
>>Currently, the new SQL engine for Ignite is under development. The first beta 
>>release of this engine is planned to Apache Ignite 2.13 version [2]. But it 
>>still requires the ignite-indexing module, which requires H2. This problem 
>>will be settled in one of the next versions after 2.13.
>> 
>>[1]:  https://github.com/h2database/h2database/pull/2227
>>[2]:  https://lists.apache.org/thread/yck17qhcgg2qmzp374q69xvlhb9ocwhh
>>   
>>Wed, Feb 2, 2022, 19:13, tore yang < torey...@yahoo.com >:
>>>Hi,
>>> 
>>>Today the library h2 1.4.197 was banned by my firm for security issue, as a 
>>>result my i gnite applications (running on version 2.12.0) can not be 
>>>released,  When I upgrade to version like 2.1.210, the clients failed to 
>>>connect to the cluster, complaining no class definition for 
>>>org.h2.index.BaseIndex.
>>>I guess Apache Ignite doesn't support h2 2.1.210 as of now? if so when it's 
>>>going to support?
>>> 
>>> 
>>>Regards,
>>> 
>>>Tao
>>> 
>>>  
 
 
 
 

Re: OOME on startup with 2.11.1 and 2.12

2022-02-03 Thread Zhenya Stanilovsky


It also seems you open a lot of JDBC connections, or there is some configuration 
issue. Can you attach your config and give more info about the startup? How many 
servers? Who connected to the grid, and how?


 
> 
>We are planning to upgrade from 2.9 to 2.11.1. But as soon as we start the 
>2.11.1 or 2.12 ignite instance crashes with OOME error.
> 
>[20:15:05,474][SEVERE][grid-nio-worker-client-listener-2-#34][] Critical 
>system error detected. Will be handled accordingly to configured handler 
>[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
>super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
>[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
>failureCtx=FailureContext [type=CRITICAL_ERROR, 
>err=java.lang.OutOfMemoryError: Direct buffer memory]]
>java.lang.OutOfMemoryError: Direct buffer memory
> 
>Some of the pointers
> 
>1)    2.9 has no problem.
>2)    2.11 and 2.12 crashes even with default supplied config of 
>ignite.sh. OOM is within a min of startup.
>3)    No other application is running on the server
>4)    Details about the environment
>[oracle@jmngd1s05ppv014 log]$ free -g
>  total    used    free  shared  buff/cache   available
>Mem: 46  18  21   0   5  23
>Swap:    47   0  47
> 
>[oracle@jmngd1s05ppv014 log]$ java -version
>java version "1.8.0_301"
>Java(TM) SE Runtime Environment (build 1.8.0_301-b25)
>Java HotSpot(TM) 64-Bit Server VM (build 25.301-b25, mixed mode)
> 
>[oracle@jmngd1s05ppv014 log]$ uname -a
>Linux jmngd1s05ppv014 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 9 16:09:48 
>UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> 
>[oracle@jmngd1s05ppv014 log]$ cat /etc/os-release
>NAME="Red Hat Enterprise Linux Server"
>VERSION="7.9 (Maipo)"
>ID="rhel"
>ID_LIKE="fedora"
>VARIANT="Server"
>VARIANT_ID="server"
>VERSION_ID="7.9"
>PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
>ANSI_COLOR="0;31"
>CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
>HOME_URL=" https://www.redhat.com/ "
>BUG_REPORT_URL=" https://bugzilla.redhat.com/ "
> 
>REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
>REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
>REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
>REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
>5)    I have also attached the ignite log.
>6)    O/S administrator confirmed there are no hardware failure.
> 
>If anyone can help how to further diagnosis the issue. We will be grateful.
> 
> 
>Thanks,
>Sachin.
>
>" Confidentiality Warning : This message and any attachments are intended only 
>for the use of the intended recipient(s), are confidential and may be 
>privileged. If you are not the intended recipient, you are hereby notified 
>that any review, re-transmission, conversion to hard copy, copying, 
>circulation or other use of this message and any attachments is strictly 
>prohibited. If you are not the intended recipient, please notify the sender 
>immediately by return email and delete this message and any attachments from 
>your system.
>Virus Warning: Although the company has taken reasonable precautions to ensure 
>no viruses are present in this email. The company cannot accept responsibility 
>for any loss or damage arising from the use of this email or attachment." 
 
 
 
 

Re: OOME on startup with 2.11.1 and 2.12

2022-02-03 Thread Zhenya Stanilovsky


Hello, it seems you really are using all the available memory. Of course, without 
a heap dump I have no clue why it happens; please increase -Xmx and check once 
more.

 
> 
>We are planning to upgrade from 2.9 to 2.11.1. But as soon as we start the 
>2.11.1 or 2.12 ignite instance crashes with OOME error.
> 
>[20:15:05,474][SEVERE][grid-nio-worker-client-listener-2-#34][] Critical 
>system error detected. Will be handled accordingly to configured handler 
>[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
>super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
>[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
>failureCtx=FailureContext [type=CRITICAL_ERROR, 
>err=java.lang.OutOfMemoryError: Direct buffer memory]]
>java.lang.OutOfMemoryError: Direct buffer memory
> 
>Some of the pointers
> 
>1)    2.9 has no problem.
>2)    2.11 and 2.12 crashes even with default supplied config of 
>ignite.sh. OOM is within a min of startup.
>3)    No other application is running on the server
>4)    Details about the environment
>[oracle@jmngd1s05ppv014 log]$ free -g
>  total    used    free  shared  buff/cache   available
>Mem: 46  18  21   0   5  23
>Swap:    47   0  47
> 
>[oracle@jmngd1s05ppv014 log]$ java -version
>java version "1.8.0_301"
>Java(TM) SE Runtime Environment (build 1.8.0_301-b25)
>Java HotSpot(TM) 64-Bit Server VM (build 25.301-b25, mixed mode)
> 
>[oracle@jmngd1s05ppv014 log]$ uname -a
>Linux jmngd1s05ppv014 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 9 16:09:48 
>UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> 
>[oracle@jmngd1s05ppv014 log]$ cat /etc/os-release
>NAME="Red Hat Enterprise Linux Server"
>VERSION="7.9 (Maipo)"
>ID="rhel"
>ID_LIKE="fedora"
>VARIANT="Server"
>VARIANT_ID="server"
>VERSION_ID="7.9"
>PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
>ANSI_COLOR="0;31"
>CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
>HOME_URL=" https://www.redhat.com/ "
>BUG_REPORT_URL=" https://bugzilla.redhat.com/ "
> 
>REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
>REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
>REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
>REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
>5)    I have also attached the ignite log.
>6)    O/S administrator confirmed there are no hardware failure.
> 
>If anyone can help how to further diagnosis the issue. We will be grateful.
> 
> 
>Thanks,
>Sachin.
>
 
 
 
 

Re: Ignite node crash

2022-01-27 Thread Zhenya Stanilovsky

Hi, at first glance you really do have network problems. Check 04c.log:
2022-01-25 18:32:53.858+ WARN 
[grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%] 
o.a.i.s.c.t.TcpCommunicationSpi          : Communication SPI session write 
timed out (consider increasing 'socketWriteTimeout' configuration property) 
[remoteAddr=/169.182.110.132:36364, writeTimeout=2000]
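Following the warning's own advice, socketWriteTimeout can be raised on the communication SPI in the Spring XML config. A sketch (the 10000 ms value is an arbitrary example; the warning above shows the 2000 ms default):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
      <!-- Default is 2000 ms (the writeTimeout seen in the warning);
           raise it for slow or congested links. -->
      <property name="socketWriteTimeout" value="10000"/>
    </bean>
  </property>
</bean>
```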
 
>Hi Ignite team,
> 
>We are using Ignite 2.10.0 and we have a 5-node Ignite cluster with persistent 
>enabled. The nodes have the following node id and consistent id:
>*  01p – node id=ee035a96, consistent id=lrdeqprmap01p
>*  02p – node id=81d7df57, consistent id=lrdeqprmap02p
>*  03p – node id=3a275472, consistent id=lrdeqprmap03p
>*  03c – node id=e8c54e6d, consistent id=lcgeqprmap03c
>*  04c – node id=de3959cf, consistent id=lcgeqprmap04c
> 
>One of the nodes, 03c, crashed one day. We would like to figure out the root 
>cause of the crash. I check the logs with the following findings:
> 
>*  From 03c log, 03c was trying to connect to 04c multiple times, starting 
>from 18:49:56 but all were unsuccessful. Eventually the node thought it’s 
>segmented and killed itself due to critical system error.
>*  From 04c log, 04c was rejecting all connections from 03c since 18:49:56, as 
>04c thought 03c was failed and regarded it as unknown node.
>*  In 04c, there were a lot of “Possible starvation in stripped pool” warning 
>since 18:35:15.
>*  In 04c, there were a lot of TCP client created, trying to connect to 02p 
>since 18:33:51. At the same time, in 02p there were a lot of “Received 
>incoming connection when already connected to this node, rejecting” 04p.
>*  I can confirm that there were no network outage between the nodes.
> 
>I have also attached the log for your information, and also our ignite xml 
>config. Can you please help to investigate? Thanks.
> 
>Regards,
>Marcus
>  
 
 
 
 

Re[2]: AW: [2.8.1] Having more backups make SQL queries slower.

2021-11-30 Thread Zhenya Stanilovsky

Hello Maximiliano Gazquez, good question! But there is no single strict answer to 
it.
 
>  I assumed that queries are distributed and each node answers the query only 
>with its primary partitions and adding backups wouldn’t affect performance.
Ok, but what about overall system performance degradation? Check the page 
replacement [1] algorithm: a more efficient one was introduced in newer versions, 
so upgrading to a new version is welcome. Second, the WAL: it is all about I/O 
usage. Do you have monitoring of your disk I/O activity? Four backups means five 
nodes with equal data. Is that really necessary, or are you just doing research? 
There is also additional work for index rebuilding, I expect.
 
[1]  
https://cwiki.apache.org/confluence/display/IGNITE/IEP-62+Page+replacement+improvements
 
>My problem is 100% with queries, not writes.
>It’s the same cluster, same hardware, but a LOT slower when using 4 backups 
>instead of 2.
>
>Is there any metric that I could check to find out what’s happening?
>
>Thanks!
>On 30 Nov 2021 15:42 -0300, Henrik < ho...@magenta.de >, wrote:
>>With more backups the cluster has worse write performance, since data is copied 
>>multiple times. But read performance should increase, since each node can answer 
>>requests from a local backup.
>>
>>Thanks
>>
>>
>>Gesendet mit der  Telekom Mail App
>>-Original-Nachricht-
>>Von: Maximiliano Gazquez < maximiliano@gmail.com >
>>Betreff: [2.8.1] Having more backups make SQL queries slower.
>>Datum: 01.12.2021, 00:12 Uhr
>>An: < user@ignite.apache.org >
>>Hello everyone.
>>
>>We are doing some testing in a 10 node cluster which we use as a distributed 
>>database with persistence enabled.
>>Each node has 6gb region size + 5gb heap.
>>All caches are partitioned, and I connect to the cluster using the thin 
>>client.
>>
>>I’ve found a performance issue:
>>*  With 2 backups, the performance is pretty great.
>>*  With 4 backups the performance is really bad.
>>So I wanted to ask why would this happen.
>>
>>I assumed that queries are distributed and each node answers the query only 
>>with its primary partitions and adding backups wouldn’t affect performance.
>>
>>Thanks everyone! 
 
 
 
 

Re: [2.11.0]: 'B+Tree is corrupted' exception in GridCacheTtlManager.expire() and PartitionsEvictManager$PartitionEvictionTask.run() on node start

2021-11-25 Thread Zhenya Stanilovsky

probably this is the case:  https://issues.apache.org/jira/browse/IGNITE-15990
 
   
>
>Hello Denis,
>Yes, as I said in the original message we do use the expiration on persistent 
>caches.
>The corruptedPages_2021-11-08_11-03-21_999.txt and 
>corruptedPages_2021-11-09_12-43-12_449.txt files were generated by Ignite on 
>crash. They show that two different caches were affected. The first one during 
>the expiration and the second (next day) during rebalance eviction.  Both 
>caches are persistent and use the expiration.
>I also run the diagnostic utility (IgniteWalConverter) the way it is 
>recommended in the error message (output attached as diag-* files). 
>Is there any usefull information in these diag-* files which can help to 
>understand what and how was corruped in particular?
>***
>Generally this was a test run of the new 2.11.0 version in a test environment. The 
>goal was to check that the new version works fine with our application and can 
>also be safely stopped/started for maintenance. We do that because we ran into a 
>similar 'B+Tree is corrupted' problem on production during rebalance eviction 
>(with 2.8.1). We see two similar issues fixed in 2.11.0:   
>https://issues.apache.org/jira/browse/IGNITE-12489 and  
>https://issues.apache.org/jira/browse/IGNITE-14093 and consider upgrade if it 
>would help.   By the way fix of the IGNITE-12489 ( 
>https://github.com/apache/ignite/pull/8358/commits ) contains a lot of changes 
>made over several attempts. Maybe it doesn't cover all situations?
>Before the deactivation cluster works fine under our usual load about 5 days. 
>Load is about 300 requests per second each consists of several reads and 
>single write to caches with the expiration turned on.  After that we stop / 
>start the cluster to emulate the situation we had on production with 2.8.1 
>(load from our application was stopped as well before the deactivation 
>request).
>***
>Caches configurations. The first one has an affinity key and interceptor
>  public CacheConfiguration 
>getEdenContactHistoryCacheConfiguration() {
>    CacheConfiguration cacheConfiguration = 
>new CacheConfiguration<>();
>    cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>    cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 
>1024));
>    cacheConfiguration.setBackups(1);
>    cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>    int expirationDays = appConfig.getContactHistoryEdenExpirationDays();
>    cacheConfiguration
>    .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new 
>Duration(TimeUnit.DAYS, expirationDays)));
>    cacheConfiguration.setInterceptor(new ContactHistoryInterceptor());
>    return cacheConfiguration;
>  }
>public class ContactHistoryKey {
>  String sOfferId;
>
>  @AffinityKeyMapped
>  String subsIdAffinityKey;
>}
>  CacheConfiguration getChannelOfferIdCache() {
>    CacheConfiguration cacheConfiguration = new 
>CacheConfiguration<>();
>    cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>    cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 
>1024));
>    cacheConfiguration.setBackups(1);
>    cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>    int expirationDays = appConfig.getChannelOfferIdCacheExpirationDays();
>    cacheConfiguration
>    .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new 
>Duration(TimeUnit.DAYS, expirationDays)));
>    return cacheConfiguration;
>  }
>***
>As for the other details.  Not sure is it relevant or not. Deactivation was 
>relatevily long and log contains a lot of warnings between  2021-11-08 
>10:54:44 and 2021-11-08 10:59:33.  Also there was a page locks dump at 
>10:56:47,567.  A lot of locks were logged for cache in the  
>Batch_Campaigns_Region (region without the persistence).
>[2021-11-08 
>10:56:47,560][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} 
>Threads hanged: [(sys-#99087-105322, WAITING)]
>[2021-11-08 
>10:56:47,567][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} Page 
>locks dump:
>
>Thread=[name=sys-#99087, id=105322], state=WAITING
>Log overflow, size:512, headIdx=512 [structureId=7, 
>pageIdpageId=281474976790149 [pageIdHex=000100013685, partId=0, 
>pageIdx=79493, flags=0001]]
>Locked pages = 
>[844420635168174[00020dae](r=0|w=1),844420635168175[00020daf](r=0|w=1)]
>Locked pages log: name=sys-#99087 time=(1636358127247, 2021-11-08 07:55:27.247)
>L=1 -> Write lock pageId=844420635168174, 
>structureId=batch_campaign_results-p-0##CacheData [pageIdHex=00020dae, 
>partId=65535, pageIdx=3502, flags=0002]
>L=2 -> Write lock pageId=844420635168175, 
>structureId=batch_campaign_results-p-0##CacheData [pageIdHex=00020daf, 
>partId=65535, pageIdx=3503, flags=0002]
>L=3 -> Write lock pageId=281474976716639, 
>structureId=Batch_Campaigns_Region##FreeList [pageIdHex=0001175f, 
>partId=0, pageIdx=5983, flags=0001]
>L=4 -> Write lock pageId=844420635243234, 

Re: Cluster rebalancing when one node dies

2021-11-17 Thread Zhenya Stanilovsky

 
>Hi All,
 
hi !
> 
>We are facing an issue , due to some misunderstanding may be on our part of 
>ignite.
> 
>We have a 2 node Ignite cluster configured with persistence.  We have a APi to 
>activate it, when the two nodes are up we hit the api to activate the cluster 
>and it works. 
 
I suppose the mentioned API is the standard baseline API?
> 
>However then one node went down , we patched the node and restarted it, but 
>when request goes to that node we get the following error :-  
>org.apache.ignite.IgniteException: Can not perform the operation because the 
>cluster is inactive. Note, that the cluster is considered inactive by default 
>if Ignite Persistent Store is used to let all the nodes join the cluster. To 
>activate the cluster call Ignite.active(true).
 
If the cluster uses persistence and the restarted node has the same persistence 
data (folders, etc.) as before the restart, then after restarting it automatically 
activates and becomes a topology member. If you set the baseline and then found 1 
node instead of the expected two, some of your actions were wrong.
> 
>We had set the baseline Topology to 2 and on checking I saw the current state 
>is  "NODES":1,"BT":2 
> 
>Can anyone please help on what we should do ? 
> 
> 
>  
 
 
 
 

Re[6]: Failed to perform cache operation (cache is stopped)

2021-10-18 Thread Zhenya Stanilovsky


Akash, can you attach the full logs with the failure here, not only a part?
thanks ! 
>Could someone please help me out here to find out the root cause of this 
>problem?
>This is now happening so frequently.  
>On Wed, Oct 13, 2021 at 3:56 PM Akash Shinde < akashshi...@gmail.com > wrote:
>>Yes, I have set  failureDetectionTimeout  = 6.
>>There is no long GC pause. 
>>Core1 GC report
>>Core 2 GC Report  
>>On Wed, Oct 13, 2021 at 1:16 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>wrote:
>>>
>>>Ok, additionally there is info about node segmentation:
>>>Node FAILED: TcpDiscoveryNode [id=fb67a5fd-f1ab-441d-a38e-bab975cd1037, 
>>>consistentId=0:0:0:0:0:0:0:1%lo,XX.XX.XX.XX, 127.0.0.1:47500 , 
>>>addrs=ArrayList [0:0:0:0:0:0:0:1%lo, XX.XX.XX.XX, 127.0.0.1], 
>>>sockAddrs=HashSet [ qagmscore02.xyz.com/XX.XX.XX.XX:47500 , 
>>>/0:0:0:0:0:0:0:1%lo:47500, / 127.0.0.1:47500 ], discPort=47500, order=25, 
>>>intOrder=16, lastExchangeTime=1633426750418, loc=false, 
>>>ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
>>> 
>>>Local node SEGMENTED: TcpDiscoveryNode 
>>>[id=7f357ca2-0ae2-4af0-bfa4-d18e7bcb3797
>>> 
>>>Possible too long JVM pause: 1052 milliseconds.
>>>
>>>Have you changed the default networking timeouts? If not, try rechecking the 
>>>failureDetectionTimeout setting.
>>>If you have a GC pause longer than 10 seconds, the node will be dropped from 
>>>the cluster (by default). 
>>>   
>>>>This is the codebase of AgmsCacheJdbcStoreSessionListner.java
>>>>This null pointer occurs due to a datasource bean not found. 
>>>>The cluster was working fine but what could be the reason for 
>>>>unavailability of datasource bean in between running cluster. 
>>>> 
>>>> 
>>>>public class AgmsCacheJdbcStoreSessionListener extends 
>>>>CacheJdbcStoreSessionListener {
>>>>
>>>>
>>>>  @SpringApplicationContextResource
>>>>  public void setupDataSourceFromSpringContext(Object appCtx) {
>>>>ApplicationContext appContext = (ApplicationContext) appCtx;
>>>>setDataSource((DataSource) appContext.getBean("dataSource"));
>>>>  }
>>>>}
>>>> 
>>>>I can see one log line that tells us about a problem on the network side. 
>>>>Is this the possible reason?
>>>> 
>>>>2021-10-07 16:28:22,889 197776202 [tcp-disco-msg-worker-[fb67a5fd 
>>>>XX.XX.XX.XX:47500 crd]-#2%springDataNode%-#69%springDataNode%] WARN  
>>>>o.a.i.s.d.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to 
>>>>short-time network problems).
>>>>   
>>>>On Mon, Oct 11, 2021 at 7:15 PM stanilovsky evgeny < 
>>>>estanilovs...@gridgain.com > wrote:
>>>>>may be this ? 
>>>>> 
>>>>>Caused by: java.lang.NullPointerException: null
>>>>>at 
>>>>>com.xyz.agms.grid.cache.loader.AgmsCacheJdbcStoreSessionListener.setupDataSourceFromSpringContext(AgmsCacheJdbcStoreSessionListener.java:14)
>>>>>... 23 common frames omitted
>>>>> 
>>>>> 
>>>>>>Hi Zhenya,
>>>>>>CacheStoppedException occurred again on our ignite cluster. I have 
>>>>>>captured logs with  IGNITE_QUIET = false.
>>>>>>There are four core nodes in the cluster and two nodes gone down. I am 
>>>>>>attaching the logs for two failed nodes.
>>>>>>Please let me know if you need any further details.
>>>>>> 
>>>>>>Thanks,
>>>>>>Akash   
>>>>>>On Tue, Sep 7, 2021 at 12:19 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>>>>>wrote:
>>>>>>>plz share somehow these logs, if you have no ideas how to share, you can 
>>>>>>>send it directly to  arzamas...@mail.ru
>>>>>>>   
>>>>>>>>Meanwhile I grep the logs with the next occurrence of cache stopped 
>>>>>>>>exception,can someone highlight if there is any known bug related to 
>>>>>>>>this?
>>>>>>>>I want to check the possible reason for this cache stop exception.  
>>>>>>>>On Mon, Sep 6, 2021 at 6:27 PM Akash Shinde < akashshi...@gmail.com > 
>>>>>>>>wrote:
>>>>>>>>>Hi Zhenya,
>>>>>>>>>Thanks for the quick response.
>>>

Re[4]: apache ignite 2.10.0 heap starvation

2021-10-13 Thread Zhenya Stanilovsky

The node is going down because taking the dump triggers a full GC; you can avoid 
that with [1], but then you get ALL objects, not only live ones.
Additionally, you can try attaching with async-profiler [2], or probably VisualVM.
 
[1]  
https://stackoverflow.com/questions/23393480/can-heap-dump-be-created-for-analyzing-memory-leak-without-garbage-collection
[2]  https://github.com/jvm-profiling-tools/async-profiler
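The approach in [1] amounts to omitting the `live` option of `jmap` (assuming a 
HotSpot JVM; `<pid>` is a placeholder for the Ignite process id):

```shell
# Dump ALL objects, dead ones included; no full GC is forced first:
jmap -dump:format=b,file=heap.hprof <pid>

# Dump only live objects; this forces a full GC first and can stall the node
# long enough for it to be dropped from the cluster:
jmap -dump:live,format=b,file=heap.hprof <pid>
```

The first form avoids the GC-induced pause, at the cost of a larger dump that 
includes garbage.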
 
>heap dump generation does not seem to be working.
>whenever I tried to generate the heap dump, the node went down, a bit strange;
>what else could we analyze  
>On Tue, Oct 12, 2021 at 7:35 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>hi, most likely the problem is in your code: CPU usage grows synchronously 
>>with the heap increase between 00:00 and 12:00.
>>You need to analyze a heap dump; no additional settings will help here.
>>   
>>>On the same subject, we have made the changes as suggested 
>>> 
>>>nodes are running on 8 CORE and 128 GB MEM VMs, i've added the following jvm 
>>>parameters
>>>
>>>-XX:ParallelGCThreads=4
>>>-XX:ConcGCThreads=2
>>>-XX:MaxGCPauseMillis=200
>>>-XX:InitiatingHeapOccupancyPercent=40
>>> 
>>>Not used any of these below, using the default values for all these, which 
>>>is 8 (as the number of cores)
>>> 
>>>        
>>>        
>>>        
>>>        
>>> 
>>>I could still see our heap is increasing,  but atleast I could see a pattern 
>>>now (not like earlier which is almost exponential)
>>> 
>>>Attaching the screenshots of heap, CPU, GC and start script with all the jvm 
>>>arguments used. 
>>>what do you think I should be changing to run to use heap effectively 
>>> 
>>>   
>>>On Wed, Sep 29, 2021 at 2:35 PM Ibrahim Altun < ibrahim.al...@segmentify.com 
>>>> wrote:
>>>>after many configuration changes and optimizations, i think i've solved the 
>>>>heap problem.
>>>>
>>>>here are the changes that i applied to the system;
>>>>JVM changes ->  
>>>>https://medium.com/@hoan.nguyen.it/how-did-g1gc-tuning-flags-affect-our-back-end-web-app-c121d38dfe56
>>>> helped a lot
>>>>
>>>>nodes are running on 12CORE and 64GB MEM servers, i've added the following 
>>>>jvm parameters
>>>>
>>>>-XX:ParallelGCThreads=6
>>>>-XX:ConcGCThreads=2
>>>>-XX:MaxGCPauseMillis=200
>>>>-XX:InitiatingHeapOccupancyPercent=40
>>>>
>>>>on ignite configuration i've changed all thread pool sizes, which were much 
>>>>more than these;
>>>>        
>>>>        
>>>>        
>>>>        
>>>>        
>>>>        
>>>>        
>>>>
>>>>Here is the 16 hours of GC report;
>>>>https://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMjEvMDkvMjkvLS1nYy5sb2cuMC5jdXJyZW50LS04LTU4LTMx=WEB
>>>>
>>>>
>>>>
>>>>On 2021/09/27 17:11:21, Ilya Korol < llivezk...@gmail.com > wrote:
>>>>> Actually Query interface doesn't define close() method, but QueryCursor
>>>>> does.
>>>>> In your snippets you're using try-with-resource construction for SELECT
>>>>> queries which is good, but when you run MERGE INTO query you would also
>>>>> get an QueryCursor as a result of
>>>>>
>>>>> igniteCacheService.getCache(ID, IgniteCacheType.LABEL).query(insertQuery);
>>>>>
>>>>> so maybe this QueryCursor objects still hold some resources/memory.
>>>>> Javadoc for QueryCursor states that you should always close cursors.
>>>>>
>>>>> To simplify cursor closing there is a cursor.getAll() method that will
>>>>> do this for you under the hood.
>>>>>
>>>>>
>>>>> On 2021/09/13 06:17:21, Ibrahim Altun < i...@segmentify.com > wrote:
>>>>>  > Hi Ilya,>
>>>>>  >
>>>>>  > since this is production environment i could not risk to take heap
>>>>> dump for now, but i will try to convince my superiors to get one and
>>>>> analyze it.>
>>>>>  >
>>>>>  > Queries are heavily used in our system but aren't they autoclosable
>>>>> objects? do we have to close them anyway?>
>>>>>  >
>>>>>  > here are some usage examples on our system;>
>>>>>  > --insert query is like thi

Re[4]: Failed to perform cache operation (cache is stopped)

2021-10-13 Thread Zhenya Stanilovsky


Ok, additionally there is info about node segmentation:
Node FAILED: TcpDiscoveryNode [id=fb67a5fd-f1ab-441d-a38e-bab975cd1037, 
consistentId=0:0:0:0:0:0:0:1%lo,XX.XX.XX.XX,127.0.0.1:47500, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, XX.XX.XX.XX, 127.0.0.1], sockAddrs=HashSet 
[qagmscore02.xyz.com/XX.XX.XX.XX:47500, /0:0:0:0:0:0:0:1%lo:47500, 
/127.0.0.1:47500], discPort=47500, order=25, intOrder=16, 
lastExchangeTime=1633426750418, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, 
isClient=false]
 
Local node SEGMENTED: TcpDiscoveryNode [id=7f357ca2-0ae2-4af0-bfa4-d18e7bcb3797
 
Possible too long JVM pause: 1052 milliseconds.

Have you changed the default networking timeouts? If not, try rechecking the 
failureDetectionTimeout setting.
If you have a GC pause longer than 10 seconds, the node will be dropped from the 
cluster (by default). 
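For reference, the timeout is set on IgniteConfiguration; a minimal Spring XML 
sketch (the 60-second value is only an illustration, the default for server 
nodes is 10 s):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Illustrative: raise failure detection from the 10 s default to 60 s. -->
    <property name="failureDetectionTimeout" value="60000"/>
</bean>
```

Raising this masks long GC pauses, so it is a stopgap; the pauses themselves 
still need to be fixed.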
 
>This is the codebase of AgmsCacheJdbcStoreSessionListner.java
>This null pointer occurs because a datasource bean was not found. 
>The cluster was working fine, so what could be the reason for the datasource 
>bean becoming unavailable in a running cluster? 
> 
> 
>public class AgmsCacheJdbcStoreSessionListener extends 
>CacheJdbcStoreSessionListener {
>
>  @SpringApplicationContextResource
>  public void setupDataSourceFromSpringContext(Object appCtx) {
>    ApplicationContext appContext = (ApplicationContext) appCtx;
>    setDataSource((DataSource) appContext.getBean("dataSource"));
>  }
>}
> 
>I can see one log line that tells us about a problem on the network side. Is 
>this the possible reason?
> 
>2021-10-07 16:28:22,889 197776202 [tcp-disco-msg-worker-[fb67a5fd 
>XX.XX.XX.XX:47500 crd]-#2%springDataNode%-#69%springDataNode%] WARN  
>o.a.i.s.d.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to 
>short-time network problems).
>   
>On Mon, Oct 11, 2021 at 7:15 PM stanilovsky evgeny < 
>estanilovs...@gridgain.com > wrote:
>>may be this ? 
>> 
>>Caused by: java.lang.NullPointerException: null
>>at 
>>com.xyz.agms.grid.cache.loader.AgmsCacheJdbcStoreSessionListener.setupDataSourceFromSpringContext(AgmsCacheJdbcStoreSessionListener.java:14)
>>... 23 common frames omitted
>> 
>> 
>>>Hi Zhenya,
>>>CacheStoppedException occurred again on our ignite cluster. I have captured 
>>>logs with  IGNITE_QUIET = false.
>>>There are four core nodes in the cluster and two nodes gone down. I am 
>>>attaching the logs for two failed nodes.
>>>Please let me know if you need any further details.
>>> 
>>>Thanks,
>>>Akash   
>>>On Tue, Sep 7, 2021 at 12:19 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>>wrote:
>>>>plz share somehow these logs, if you have no ideas how to share, you can 
>>>>send it directly to  arzamas...@mail.ru
>>>>   
>>>>>Meanwhile I grep the logs with the next occurrence of cache stopped 
>>>>>exception,can someone highlight if there is any known bug related to this?
>>>>>I want to check the possible reason for this cache stop exception.  
>>>>>On Mon, Sep 6, 2021 at 6:27 PM Akash Shinde < akashshi...@gmail.com > 
>>>>>wrote:
>>>>>>Hi Zhenya,
>>>>>>Thanks for the quick response.
>>>>>>I believe you are talking about ignite instances. There is single ignite 
>>>>>>using in application.
>>>>>>I also want to point out that I am not using destroyCache()  method 
>>>>>>anywhere in application.
>>>>>> 
>>>>>>I will set   IGNITE_QUIET = false  and try to grep the required logs.
>>>>>>This issue occurs by random and there is no way reproduce it.
>>>>>> 
>>>>>>Thanks,
>>>>>>Akash
>>>>>> 
>>>>>>   
>>>>>>On Mon, Sep 6, 2021 at 5:33 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>>>>>wrote:
>>>>>>>Hi, Akash
>>>>>>>You can obtain such a case, for example when you have several instances 
>>>>>>>and :
>>>>>>>inst1:
>>>>>>>cache = inst1.getOrCreateCache("cache1");
>>>>>>> 
>>>>>>>after inst2 destroy calling:
>>>>>>> 
>>>>>>>cache._some_method_call_
>>>>>>> 
>>>>>>>inst2:
>>>>>>> inst2.destroyCache("cache1");
>>>>>>> 
>>>>>>>or shorter: you still use instance that already destroyed, you can 
>>>>&

Re[2]: apache ignite 2.10.0 heap starvation

2021-10-12 Thread Zhenya Stanilovsky

hi, most likely the problem is in your code: CPU usage grows synchronously with 
the heap increase between 00:00 and 12:00.
You need to analyze a heap dump; no additional settings will help here.
 
>On the same subject, we have made the changes as suggested 
> 
>nodes are running on 8 CORE and 128 GB MEM VMs, i've added the following jvm 
>parameters
>
>-XX:ParallelGCThreads=4
>-XX:ConcGCThreads=2
>-XX:MaxGCPauseMillis=200
>-XX:InitiatingHeapOccupancyPercent=40
> 
>Not used any of these below, using the default values for all these, which is 
>8 (as the number of cores)
> 
>        
>        
>        
>        
> 
>I could still see our heap is increasing,  but atleast I could see a pattern 
>now (not like earlier which is almost exponential)
> 
>Attaching the screenshots of heap, CPU, GC and start script with all the jvm 
>arguments used. 
>what do you think I should be changing to run to use heap effectively 
> 
>   
>On Wed, Sep 29, 2021 at 2:35 PM Ibrahim Altun < ibrahim.al...@segmentify.com > 
>wrote:
>>after many configuration changes and optimizations, i think i've solved the 
>>heap problem.
>>
>>here are the changes that i applied to the system;
>>JVM changes ->  
>>https://medium.com/@hoan.nguyen.it/how-did-g1gc-tuning-flags-affect-our-back-end-web-app-c121d38dfe56
>> helped a lot
>>
>>nodes are running on 12CORE and 64GB MEM servers, i've added the following 
>>jvm parameters
>>
>>-XX:ParallelGCThreads=6
>>-XX:ConcGCThreads=2
>>-XX:MaxGCPauseMillis=200
>>-XX:InitiatingHeapOccupancyPercent=40
>>
>>on ignite configuration i've changed all thread pool sizes, which were much 
>>more than these;
>>        
>>        
>>        
>>        
>>        
>>        
>>        
>>
>>Here is the 16 hours of GC report;
>>https://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMjEvMDkvMjkvLS1nYy5sb2cuMC5jdXJyZW50LS04LTU4LTMx=WEB
>>
>>
>>
>>On 2021/09/27 17:11:21, Ilya Korol < llivezk...@gmail.com > wrote:
>>> Actually Query interface doesn't define close() method, but QueryCursor
>>> does.
>>> In your snippets you're using try-with-resource construction for SELECT
>>> queries which is good, but when you run MERGE INTO query you would also
>>> get an QueryCursor as a result of
>>>
>>> igniteCacheService.getCache(ID, IgniteCacheType.LABEL).query(insertQuery);
>>>
>>> so maybe this QueryCursor objects still hold some resources/memory.
>>> Javadoc for QueryCursor states that you should always close cursors.
>>>
>>> To simplify cursor closing there is a cursor.getAll() method that will
>>> do this for you under the hood.
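The two closing patterns described above can be sketched with a stand-in cursor 
class. A real QueryCursor needs a running cluster, so `FakeCursor` below is a 
hypothetical class that only mirrors the AutoCloseable contract:

```java
import java.util.Arrays;
import java.util.List;

public class CursorClosingDemo {
    // Hypothetical stand-in mirroring QueryCursor's AutoCloseable contract.
    static class FakeCursor implements AutoCloseable {
        boolean closed;

        // Like QueryCursor.getAll(): drains the cursor and closes it internally.
        List<String> getAll() {
            closed = true;
            return Arrays.asList("row1", "row2");
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        // Pattern 1: try-with-resources guarantees close(), even on exceptions.
        FakeCursor c1 = new FakeCursor();
        try (FakeCursor c = c1) {
            // iterate over c here...
        }

        // Pattern 2: getAll() drains and closes in one call.
        FakeCursor c2 = new FakeCursor();
        List<String> rows = c2.getAll();

        System.out.println(c1.closed + " " + c2.closed + " rows=" + rows.size());
        // prints "true true rows=2"
    }
}
```

The point either way: a cursor returned by query(...), including the one 
returned by a MERGE INTO, should not simply be dropped on the floor.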
>>>
>>>
>>> On 2021/09/13 06:17:21, Ibrahim Altun < i...@segmentify.com > wrote:
>>>  > Hi Ilya,>
>>>  >
>>>  > since this is production environment i could not risk to take heap
>>> dump for now, but i will try to convince my superiors to get one and
>>> analyze it.>
>>>  >
>>>  > Queries are heavily used in our system but aren't they autoclosable
>>> objects? do we have to close them anyway?>
>>>  >
>>>  > here are some usage examples on our system;>
>>>  > --insert query is like this; MERGE INTO "ProductLabel" ("productId",
>>> "label", "language") VALUES (?, ?, ?)>
>>>  > igniteCacheService.getCache(ID,
>>> IgniteCacheType.LABEL).query(insertQuery);>
>>>  >
>>>  > another usage example;>
>>>  > --sqlFieldsQuery is like this; >
>>>  > String sql = "SELECT _val FROM \"UserRecord\" WHERE \"email\" IN (?)";>
>>>  > SqlFieldsQuery sqlFieldsQuery = new SqlFieldsQuery(sql);>
>>>  > sqlFieldsQuery.setLazy(true);>
>>>  > sqlFieldsQuery.setArgs(emails.toArray());>
>>>  >
>>>  > try (QueryCursor> ignored = igniteCacheService.getCache(ID,
>>> IgniteCacheType.USER).query(sqlFieldsQuery)) {...}>
>>>  >
>>>  >
>>>  >
>>>  > On 2021/09/12 20:28:09, Shishkov Ilya < sh...@gmail.com > wrote: >
>>>  > > Hi, Ibrahim!>
>>>  > > Have you analyzed the heap dump of the server node JVMs?>
>>>  > > In case your application executes queries are their cursors closed?>
>>>  > > >
>>>  > > пт, 10 сент. 2021 г. в 11:54, Ibrahim Altun < ib...@segmentify.com >:>
>>>  > > >
>>>  > > > Igniters any comment on this issue, we are facing huge GC
>>> problems on>
>>>  > > > production environment, please advise.>
>>>  > > >>
>>>  > > > On 2021/09/07 14:11:09, Ibrahim Altun < ib...@segmentify.com >>
>>>  > > > wrote:>
>>>  > > > > Hi,>
>>>  > > > >>
>>>  > > > > totally 400 - 600K reads/writes/updates>
>>>  > > > > 12core>
>>>  > > > > 64GB RAM>
>>>  > > > > no iowait>
>>>  > > > > 10 nodes>
>>>  > > > >>
>>>  > > > > On 2021/09/07 12:51:28, Piotr Jagielski < pj...@touk.pl > wrote:>
>>>  > > > > > Hi,>
>>>  > > > > > Can you provide some information on how you use the cluster?
>>> How many>
>>>  > > > reads/writes/updates per second? Also CPU / RAM spec of cluster
>>> nodes?>
>>>  > > > > >>
>>>  > > > > > We observed full GC / CPU load / OOM killer when loading big
>>> amount of>
>>>  > > > data (15 mln records, data streamer + allowOverwrite=true). We've
>>> seen>
>>>  > > > 200-400k updates per sec on JMX metrics, but load up to 10 on

Re[2]: What does "First 10 long running cache futures" ?

2021-10-06 Thread Zhenya Stanilovsky


Ok, it seems something went wrong on the node with 
id=36edbfd5-4feb-417e-b965-bdc34a0a6f4f. If you still have the problem, can you 
send these logs here or directly to me?


 
>And finally this on the coordinator node
>
>[14:07:41,282][WARNING][exchange-worker-#42%xx%][GridDhtPartitionsExchangeFuture]
> Unable to await partitions release latch within timeout. Some nodes have not 
>sent acknowledgement for latch completion. It's possible due to unfinishined 
>atomic updates, transactions or not released explicit locks on that nodes. 
>Please check logs for errors on nodes with ids reported in latch `pendingAcks` 
>collection [latch=ServerLatch [permits=1, pendingAcks=HashSet 
>[36edbfd5-4feb-417e-b965-bdc34a0a6f4f], super=CompletableLatch 
>[id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion 
>[topVer=103, minorTopVer=0]  
>On Tue, 5 Oct 2021 at 10:07, John Smith < java.dev@gmail.com > wrote:
>>And I see this...
>>
>>[14:04:15,150][WARNING][exchange-worker-#43%raange%][GridDhtPartitionsExchangeFuture]
>> Unable to await partitions release latch within timeout. For more details 
>>please check coordinator node logs [crdNode=TcpDiscoveryNode 
>>[id=36ad785d-e344-43bb-b685-e79557572b54, 
>>consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList 
>>[127.0.0.1, xx.65], sockAddrs=HashSet [xx-0002/xx.65:47500, / 
>>127.0.0.1:47500 ], discPort=47500, order=1, intOrder=1, 
>>lastExchangeTime=1633370987399, loc=false, ver=2.8.1#20200521-sha1:86422096, 
>>isClient=false]] [latch=ClientLatch [coordinator=TcpDiscoveryNode 
>>[id=36ad785d-e344-43bb-b685-e79557572b54, 
>>consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList 
>>[127.0.0.1, xx.65], sockAddrs=HashSet [xx-0002/xx.65:47500, / 
>>127.0.0.1:47500 ], discPort=47500, order=1, intOrder=1, 
>>lastExchangeTime=1633370987399, loc=false, ver=2.8.1#20200521-sha1:86422096, 
>>isClient=false], ackSent=true, super=CompletableLatch [id=CompletableLatchUid 
>>[id=exchange, topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]  
>>On Tue, 5 Oct 2021 at 10:02, John Smith < java.dev@gmail.com > wrote:
>>>Actually to be more clear...
>>>
>>>http://xx-0001:8080/ignite?cmd=version responds immediately.
>>>
>>>http://xx-0001:8080/ignite?cmd=size=my-cache doesn't respond 
>>>at all.  
>>>On Tue, 5 Oct 2021 at 09:59, John Smith < java.dev@gmail.com > wrote:
>>>>Yeah, ever since I got this error, for example, the REST API won't return 
>>>>and the requests are slower. But when I connect with visor I can get stats, 
>>>>I can scan the cache, etc...
>>>>
>>>>Is it possible that these async futures/threads are not released?  
>>>>On Tue, 5 Oct 2021 at 04:11, Zhenya Stanilovsky < arzamas...@mail.ru > 
>>>>wrote:
>>>>>Hi, this is just a warning showing that something suspicious was observed.
>>>>>There is no simple answer to your question; in the common case all these 
>>>>>messages are due to cluster limitations (resources or settings).
>>>>>Check the documentation for performance tuning [1]
>>>>> 
>>>>>[1]  
>>>>>https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>>>>   
>>>>>>Hi, using 2.8.1 I understand the message as in my async TRX is taking 
>>>>>>longer but is there a way to prevent it?
>>>>>> 
>>>>>>When this happened I was pushing about 50, 000 get/puts per second from 
>>>>>>my API. 
>>>>> 
>>>>> 
>>>>> 
>>>>>  
 
 
 
 

Re: What does "First 10 long running cache futures" ?

2021-10-05 Thread Zhenya Stanilovsky

Hi, this is just a warning showing that something suspicious was observed.
There is no simple answer to your question; in the common case all these 
messages are due to cluster limitations (resources or settings).
Check the documentation for performance tuning [1]
 
[1] 
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
 
>Hi, using 2.8.1 I understand the message as in my async TRX is taking longer 
>but is there a way to prevent it?
> 
>When this happened I was pushing about 50, 000 get/puts per second from my 
>API. 
 
 
 
 

Re[2]: apache ignite 2.10.0 heap starvation

2021-09-29 Thread Zhenya Stanilovsky

Ok, I still can't understand the source of the 128 value.
Can you check the value returned by Runtime.getRuntime().availableProcessors() 
on your side?
 
 
> 
>> 
>>>Hi Naveen,
>>>
>>>my first change was to change jvm parameters, at first it seemed to be 
>>>resolved but changing jvm parameters only delayed the problem. Before that 
>>>heap problems occured after 14-16 hours after the start, but with jvm 
>>>changes it took up to 36 hours.
>>>
>>>while keeping the jvm changes i updated the threadpool configurations and 
>>>the heap problem was solved. we can now see a saw-tooth pattern in heap usage.
>>>
>>>before:  https://ibb.co/mqx4kYy
>>>after:  https://ibb.co/y8B0hzS
>>>
>>>
>>>On 2021/09/29 13:37:53, Naveen Kumar < naveen.band...@gmail.com > wrote:
 Good to hear from you , I have had the same issue for quite a long time and
 am still looking for a fix.

 What do you think has exactly resolved the heap starvation issue, is it the
 GC related configuration or the threadpool configuration. ?
 The default thread pool size is the number of cores of the server; if this is
 true, we don't need to specify any config for these thread pools.

 Thanks
 Naveen



 On Wed, Sep 29, 2021 at 2:35 PM Ibrahim Altun < 
 ibrahim.al...@segmentify.com >
 wrote:

 > after many configuration changes and optimizations, i think i've solved
 > the heap problem.
 >
 > here are the changes that i applied to the system;
 > JVM changes ->
 >  
 > https://medium.com/@hoan.nguyen.it/how-did-g1gc-tuning-flags-affect-our-back-end-web-app-c121d38dfe56
 > helped a lot
 >
 > nodes are running on 12CORE and 64GB MEM servers, i've added the 
 > following
 > jvm parameters
 >
 > -XX:ParallelGCThreads=6
 > -XX:ConcGCThreads=2
 > -XX:MaxGCPauseMillis=200
 > -XX:InitiatingHeapOccupancyPercent=40
 >
 > on ignite configuration i've changed all thread pool sizes, which were
 > much more than these;
 > 
 > 
 > 
 > 
 > 
 > 
 > 
 >
 > Here is the 16 hours of GC report;
 >
 >  
 > https://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMjEvMDkvMjkvLS1nYy5sb2cuMC5jdXJyZW50LS04LTU4LTMx=WEB
 >
 >
 >
 > On 2021/09/27 17:11:21, Ilya Korol < llivezk...@gmail.com > wrote:
 > > Actually Query interface doesn't define close() method, but QueryCursor
 > > does.
 > > In your snippets you're using try-with-resource construction for SELECT
 > > queries which is good, but when you run MERGE INTO query you would also
 > > get an QueryCursor as a result of
 > >
 > > igniteCacheService.getCache(ID,
 > IgniteCacheType.LABEL).query(insertQuery);
 > >
 > > so maybe this QueryCursor objects still hold some resources/memory.
 > > Javadoc for QueryCursor states that you should always close cursors.
 > >
 > > To simplify cursor closing there is a cursor.getAll() method that will
 > > do this for you under the hood.
 > >
 > >
 > > On 2021/09/13 06:17:21, Ibrahim Altun < i...@segmentify.com > wrote:
 > > > Hi Ilya,>
 > > >
 > > > since this is production environment i could not risk to take heap
 > > dump for now, but i will try to convince my superiors to get one and
 > > analyze it.>
 > > >
 > > > Queries are heavily used in our system but aren't they autoclosable
 > > objects? do we have to close them anyway?>
 > > >
 > > > here are some usage examples on our system;>
 > > > --insert query is like this; MERGE INTO "ProductLabel" ("productId",
 > > "label", "language") VALUES (?, ?, ?)>
 > > > igniteCacheService.getCache(ID,
 > > IgniteCacheType.LABEL).query(insertQuery);>
 > > >
 > > > another usage example;>
 > > > --sqlFieldsQuery is like this; >
 > > > String sql = "SELECT _val FROM \"UserRecord\" WHERE \"email\" IN
 > (?)";>
 > > > SqlFieldsQuery sqlFieldsQuery = new SqlFieldsQuery(sql);>
 > > > sqlFieldsQuery.setLazy(true);>
 > > > sqlFieldsQuery.setArgs(emails.toArray());>
 > > >
 > > > try (QueryCursor> ignored = igniteCacheService.getCache(ID,
 > > IgniteCacheType.USER).query(sqlFieldsQuery)) {...}>
 > > >
 > > >
 > > >
 > > > On 2021/09/12 20:28:09, Shishkov Ilya < sh...@gmail.com > wrote: >
 > > > > Hi, Ibrahim!>
 > > > > Have you analyzed the heap dump of the server node JVMs?>
 > > > > In case your application executes queries are their cursors 
 > > > > closed?>
 > > > > >
 > > > > пт, 10 сент. 2021 г. в 11:54, Ibrahim Altun < ib...@segmentify.com
 > >:>
 > > > > >
 > > > > > Igniters any comment on this issue, we are facing huge GC
 > > problems on>
 > > > > > production environment, please advise.>
 > > > > >>
 > > > > > On 2021/09/07 14:11:09, Ibrahim Altun < ib...@segmentify.com >>
 > > > > > wrote:>
 > > > > > > Hi,>
 > > > > > >>
 

Re[2]: apache ignite 2.10.0 heap starvation

2021-09-29 Thread Zhenya Stanilovsky


systemThreadPoolSize and the other pools are by default sized from
Runtime.getRuntime().availableProcessors(); if you somehow obtain 128, please 
file a ticket with all env info.
thanks !
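The relationship can be checked with a few lines of plain Java. The max(8, 
cores) floor for the default pool size is my reading of the Ignite 2.x 
defaults, so treat it as an assumption and check the IgniteConfiguration 
javadoc:

```java
public class PoolSizeDefaults {
    public static void main(String[] args) {
        // The CPU count the JVM reports; Ignite derives its default pool sizes from it.
        int cores = Runtime.getRuntime().availableProcessors();

        // Assumed Ignite 2.x default pool sizing: max(8, cores).
        int assumedDefault = Math.max(8, cores);

        System.out.println(cores + " cores -> assumed default pool size " + assumedDefault);
    }
}
```

If this prints 128 cores, the pools legitimately default to 128 threads, and 
the value comes from the container/VM CPU count, not from Ignite configuration.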


 
>after many configuration changes and optimizations, i think i've solved the 
>heap problem.
>
>here are the changes that i applied to the system;
>JVM changes ->  
>https://medium.com/@hoan.nguyen.it/how-did-g1gc-tuning-flags-affect-our-back-end-web-app-c121d38dfe56
> helped a lot
>
>nodes are running on 12CORE and 64GB MEM servers, i've added the following jvm 
>parameters
>
>-XX:ParallelGCThreads=6
>-XX:ConcGCThreads=2
>-XX:MaxGCPauseMillis=200
>-XX:InitiatingHeapOccupancyPercent=40
>
>on ignite configuration i've changed all thread pool sizes, which were much 
>more than these;
>
>
>
>
>
>
>
>
>Here is the 16 hours of GC report;
>https://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMjEvMDkvMjkvLS1nYy5sb2cuMC5jdXJyZW50LS04LTU4LTMx=WEB
>
>
>
>On 2021/09/27 17:11:21, Ilya Korol < llivezk...@gmail.com > wrote:
>> Actually Query interface doesn't define close() method, but QueryCursor
>> does.
>> In your snippets you're using try-with-resource construction for SELECT
>> queries which is good, but when you run MERGE INTO query you would also
>> get an QueryCursor as a result of
>>
>> igniteCacheService.getCache(ID, IgniteCacheType.LABEL).query(insertQuery);
>>
>> so maybe this QueryCursor objects still hold some resources/memory.
>> Javadoc for QueryCursor states that you should always close cursors.
>>
>> To simplify cursor closing there is a cursor.getAll() method that will
>> do this for you under the hood.
>>
>>
>> On 2021/09/13 06:17:21, Ibrahim Altun < i...@segmentify.com > wrote:
>> > Hi Ilya,>
>> >
>> > since this is production environment i could not risk to take heap
>> dump for now, but i will try to convince my superiors to get one and
>> analyze it.>
>> >
>> > Queries are heavily used in our system but aren't they autoclosable
>> objects? do we have to close them anyway?>
>> >
>> > here are some usage examples on our system;>
>> > --insert query is like this; MERGE INTO "ProductLabel" ("productId",
>> "label", "language") VALUES (?, ?, ?)>
>> > igniteCacheService.getCache(ID,
>> IgniteCacheType.LABEL).query(insertQuery);>
>> >
>> > another usage example;>
>> > --sqlFieldsQuery is like this; >
>> > String sql = "SELECT _val FROM \"UserRecord\" WHERE \"email\" IN (?)";>
>> > SqlFieldsQuery sqlFieldsQuery = new SqlFieldsQuery(sql);>
>> > sqlFieldsQuery.setLazy(true);>
>> > sqlFieldsQuery.setArgs(emails.toArray());>
>> >
>> > try (QueryCursor> ignored = igniteCacheService.getCache(ID,
>> IgniteCacheType.USER).query(sqlFieldsQuery)) {...}>
>> >
>> >
>> >
>> > On 2021/09/12 20:28:09, Shishkov Ilya < sh...@gmail.com > wrote: >
>> > > Hi, Ibrahim!>
>> > > Have you analyzed the heap dump of the server node JVMs?>
>> > > In case your application executes queries are their cursors closed?>
>> > > >
>> > > пт, 10 сент. 2021 г. в 11:54, Ibrahim Altun < ib...@segmentify.com >:>
>> > > >
>> > > > Igniters any comment on this issue, we are facing huge GC
>> problems on>
>> > > > production environment, please advise.>
>> > > >>
>> > > > On 2021/09/07 14:11:09, Ibrahim Altun < ib...@segmentify.com >>
>> > > > wrote:>
>> > > > > Hi,>
>> > > > >>
>> > > > > totally 400 - 600K reads/writes/updates>
>> > > > > 12core>
>> > > > > 64GB RAM>
>> > > > > no iowait>
>> > > > > 10 nodes>
>> > > > >>
>> > > > > On 2021/09/07 12:51:28, Piotr Jagielski < pj...@touk.pl > wrote:>
>> > > > > > Hi,>
>> > > > > > Can you provide some information on how you use the cluster?
>> How many>
>> > > > reads/writes/updates per second? Also CPU / RAM spec of cluster
>> nodes?>
>> > > > > >>
>> > > > > > We observed full GC / CPU load / OOM killer when loading big
>> amount of>
>> > > > data (15 mln records, data streamer + allowOverwrite=true). We've
>> seen>
>> > > > 200-400k updates per sec on JMX metrics, but load up to 10 on
>> nodes, iowait>
>> > > > to 30%. Our cluster is 3 x 4CPU, 16GB RAM (already upgradingto
>> 8CPU, 32GB>
>> > > > RAM). Ignite 2.10>
>> > > > > >>
>> > > > > > Regards,>
>> > > > > > Piotr>
>> > > > > >>
>> > > > > > On 2021/09/02 08:36:07, Ibrahim Altun < ib...@segmentify.com >>
>> > > > wrote:>
>> > > > > > > After upgrading from 2.7.1 version to 2.10.0 version ignite
>> nodes>
>> > > > facing>
>> > > > > > > huge full GC operations after 24-36 hours after node start.>
>> > > > > > >>
>> > > > > > > We try to increase heap size but no luck, here is the start>
>> > > > configuration>
>> > > > > > > for nodes;>
>> > > > > > >>
>> > > > > > > JVM_OPTS="$JVM_OPTS -Xms12g -Xmx12g -server>
>> > > > > > >>
>> > > >
>> -javaagent:/etc/prometheus/jmx_prometheus_javaagent-0.14.0.jar=8090:/etc/prometheus/jmx.yml>
>>
>> > > > > > > -Dcom.sun.management.jmxremote>
>> > > > > > > -Dcom.sun.management.jmxremote.authenticate=false>
>> > > > > > > 

Re[2]: Failed to perform cache operation (cache is stopped)

2021-09-07 Thread Zhenya Stanilovsky

please share these logs somehow; if you have no idea how to share them, you can 
send them directly to arzamas...@mail.ru
 
>Meanwhile I will grep the logs for the next occurrence of the cache stopped 
>exception; can someone highlight if there is any known bug related to this?
>I want to check the possible reason for this cache stop exception.  
>On Mon, Sep 6, 2021 at 6:27 PM Akash Shinde < akashshi...@gmail.com > wrote:
>>Hi Zhenya,
>>Thanks for the quick response.
>>I believe you are talking about Ignite instances. There is a single Ignite 
>>instance used in the application.
>>I also want to point out that I am not using destroyCache()  method anywhere 
>>in application.
>> 
>>I will set   IGNITE_QUIET = false  and try to grep the required logs.
>>This issue occurs by random and there is no way reproduce it.
>> 
>>Thanks,
>>Akash
>> 
>>   
>>On Mon, Sep 6, 2021 at 5:33 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>wrote:
>>>Hi, Akash
>>>You can obtain such a case, for example when you have several instances and :
>>>inst1:
>>>cache = inst1.getOrCreateCache("cache1");
>>> 
>>>after inst2 destroy calling:
>>> 
>>>cache._some_method_call_
>>> 
>>>inst2:
>>> inst2.destroyCache("cache1");
>>> 
>>>In short: you are still using a cache instance that has already been destroyed. 
>>>You can simply grep your logs to find the time when the cache was stopped.
>>>You will probably need to set  IGNITE_QUIET = false.
>>>[1]  https://ignite.apache.org/docs/latest/logging
>>> 
>>>> 
>>>>> 
>>>>>>Hi,
>>>>>>I have four server nodes and six client nodes on ignite cluster. I am 
>>>>>>using ignite 2.10 version.
>>>>>>Some operations are failing due to the CacheStoppedException exception on 
>>>>>>the server nodes. This has become a blocker issue. 
>>>>>>Could someone please help me to resolve this issue.
>>>>>> 
>>>>>>Cache Configuration
>>>>>>CacheConfiguration subscriptionCacheCfg = new 
>>>>>>CacheConfiguration<>(CacheName.SUBSCRIPTION_CACHE.name());
>>>>>>subscriptionCacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
>>>>>>subscriptionCacheCfg.setWriteThrough(false);
>>>>>>subscriptionCacheCfg.setReadThrough(true);
>>>>>>subscriptionCacheCfg.setRebalanceMode(CacheRebalanceMode.ASYNC);
>>>>>>subscriptionCacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
>>>>>>subscriptionCacheCfg.setBackups(2);
>>>>>>Factory storeFactory = 
>>>>>>FactoryBuilder.factoryOf(SubscriptionDataLoader.class);
>>>>>>subscriptionCacheCfg.setCacheStoreFactory(storeFactory);
>>>>>>subscriptionCacheCfg.setIndexedTypes(DefaultDataKey.class, 
>>>>>>SubscriptionData.class);
>>>>>>subscriptionCacheCfg.setSqlIndexMaxInlineSize(47);
>>>>>>RendezvousAffinityFunction affinityFunction = new 
>>>>>>RendezvousAffinityFunction();
>>>>>>affinityFunction.setExcludeNeighbors(true);
>>>>>>subscriptionCacheCfg.setAffinity(affinityFunction);
>>>>>>subscriptionCacheCfg.setStatisticsEnabled(true);
>>>>>>subscriptionCacheCfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
>>>>>> 
>>>>>>Exception stack trace
>>>>>> 
>>>>>>ERROR c.q.dgms.kafka.TaskRequestListener - Error occurred while consuming 
>>>>>>the object
>>>>>>com.baidu.unbiz.fluentvalidator.exception.RuntimeValidateException: 
>>>>>>java.lang.IllegalStateException: class 
>>>>>>org.apache.ignite.internal.processors.cache.CacheStoppedException: Failed 
>>>>>>to perform cache operation (cache is stopped): SUBSCRIPTION_CACHE
>>>>>>at 
>>>>>>com.baidu.unbiz.fluentvalidator.FluentValidator.doValidate(FluentValidator.java:506)
>>>>>>at 
>>>>>>com.baidu.unbiz.fluentvalidator.FluentValidator.doValidate(FluentValidator.java:461)
>>>>>>at 
>>>>>>com.xyz.dgms.service.UserManagementServiceImpl.deleteUser(UserManagementServiceImpl.java:710)
>>>>>>at 
>>>>>>com.xyz.dgms.kafka.TaskRequestListener.processRequest(TaskRequestListener.java:190)
>>>>>>at 
>>>>>

Re: Failed to perform cache operation (cache is stopped)

2021-09-06 Thread Zhenya Stanilovsky

Hi, Akash
You can run into such a case when, for example, you have several instances and:
inst1:
cache = inst1.getOrCreateCache("cache1");
 
after inst2 destroy calling:
 
cache._some_method_call_
 
inst2:
 inst2.destroyCache("cache1");
 
In short: you are still using a cache instance that has already been destroyed. You can 
simply grep your logs to find the time when the cache was stopped.
You will probably need to set  IGNITE_QUIET = false.
[1] https://ignite.apache.org/docs/latest/logging
 
> 
>> 
>>>Hi,
>>>I have four server nodes and six client nodes on ignite cluster. I am using 
>>>ignite 2.10 version.
>>>Some operations are failing due to the CacheStoppedException exception on 
>>>the server nodes. This has become a blocker issue. 
>>>Could someone please help me to resolve this issue.
>>> 
>>>Cache Configuration
>>>CacheConfiguration subscriptionCacheCfg = new 
>>>CacheConfiguration<>(CacheName.SUBSCRIPTION_CACHE.name());
>>>subscriptionCacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
>>>subscriptionCacheCfg.setWriteThrough(false);
>>>subscriptionCacheCfg.setReadThrough(true);
>>>subscriptionCacheCfg.setRebalanceMode(CacheRebalanceMode.ASYNC);
>>>subscriptionCacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
>>>subscriptionCacheCfg.setBackups(2);
>>>Factory storeFactory = 
>>>FactoryBuilder.factoryOf(SubscriptionDataLoader.class);
>>>subscriptionCacheCfg.setCacheStoreFactory(storeFactory);
>>>subscriptionCacheCfg.setIndexedTypes(DefaultDataKey.class, 
>>>SubscriptionData.class);
>>>subscriptionCacheCfg.setSqlIndexMaxInlineSize(47);
>>>RendezvousAffinityFunction affinityFunction = new 
>>>RendezvousAffinityFunction();
>>>affinityFunction.setExcludeNeighbors(true);
>>>subscriptionCacheCfg.setAffinity(affinityFunction);
>>>subscriptionCacheCfg.setStatisticsEnabled(true);
>>>subscriptionCacheCfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
>>> 
>>>Exception stack trace
>>> 
>>>ERROR c.q.dgms.kafka.TaskRequestListener - Error occurred while consuming 
>>>the object
>>>com.baidu.unbiz.fluentvalidator.exception.RuntimeValidateException: 
>>>java.lang.IllegalStateException: class 
>>>org.apache.ignite.internal.processors.cache.CacheStoppedException: Failed to 
>>>perform cache operation (cache is stopped): SUBSCRIPTION_CACHE
>>>at 
>>>com.baidu.unbiz.fluentvalidator.FluentValidator.doValidate(FluentValidator.java:506)
>>>at 
>>>com.baidu.unbiz.fluentvalidator.FluentValidator.doValidate(FluentValidator.java:461)
>>>at 
>>>com.xyz.dgms.service.UserManagementServiceImpl.deleteUser(UserManagementServiceImpl.java:710)
>>>at 
>>>com.xyz.dgms.kafka.TaskRequestListener.processRequest(TaskRequestListener.java:190)
>>>at 
>>>com.xyz.dgms.kafka.TaskRequestListener.process(TaskRequestListener.java:89)
>>>at 
>>>com.xyz.libraries.mom.kafka.consumer.TopicConsumer.lambda$run$3(TopicConsumer.java:162)
>>>at net.jodah.failsafe.Functions$12.call(Functions.java:274)
>>>at net.jodah.failsafe.SyncFailsafe.call(SyncFailsafe.java:145)
>>>at net.jodah.failsafe.SyncFailsafe.run(SyncFailsafe.java:93)
>>>at 
>>>com.xyz.libraries.mom.kafka.consumer.TopicConsumer.run(TopicConsumer.java:159)
>>>at 
>>>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>at 
>>>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>at java.lang.Thread.run(Thread.java:748)
>>>Caused by: java.lang.IllegalStateException: class 
>>>org.apache.ignite.internal.processors.cache.CacheStoppedException: Failed to 
>>>perform cache operation (cache is stopped): SUBSCRIPTION_CACHE
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:166)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.onEnter(GatewayProtectedCacheProxy.java:1625)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:673)
>>>at 
>>>com.xyz.dgms.grid.dao.AbstractDataGridDAO.getData(AbstractDataGridDAO.java:39)
>>>at 
>>>com.xyz.dgms.grid.dao.AbstractDataGridDAO.getData(AbstractDataGridDAO.java:28)
>>>at 
>>>com.xyz.dgms.grid.dataservice.DefaultDataGridService.getData(DefaultDataGridService.java:22)
>>>at 
>>>com.xyz.dgms.grid.dataservice.DefaultDataGridService.getData(DefaultDataGridService.java:10)
>>>at 
>>>com.xyz.dgms.validators.common.validators.UserDataValidator.validateSubscription(UserDataValidator.java:226)
>>>at 
>>>com.xyz.dgms.validators.common.validators.UserDataValidator.validateRequest(UserDataValidator.java:124)
>>>at 
>>>com.xyz.dgms.validators.common.validators.UserDataValidator.validate(UserDataValidator.java:346)
>>>at 
>>>com.xyz.dgms.validators.common.validators.UserDataValidator.validate(UserDataValidator.java:41)
>>>at 
>>>com.baidu.unbiz.fluentvalidator.FluentValidator.doValidate(FluentValidator.java:490)
>>>... 12 common frames omitted
>>>Caused by: 
>>>org.apache.ignite.internal.processors.cache.CacheStoppedException: Failed to 
>>>perform 

Re: Ignite High memory usage though very less disk size

2021-07-30 Thread Zhenya Stanilovsky

hi, Devakumar J
There is not enough information for analysis.
Do you have any monitoring? If not, please enable it and try to understand how the 
heavy CPU consumption and possible GC pauses correlate with your tasks.
Do you have enough heap (the -Xmx param)? What kind of processes consume the most 
heap?
Without all this information we can't move forward with the analysis.
 
thanks ! 
 
>Hi,
>
>We have 3server+2client cluster setup. Also we have 2 completely different 
>clusters for different regions.
>
>Both has similar set of integrations in terms of SQL queries/ CQ listeners/ 
>Client connections.
>
>Also the VM hardware/OS settings also same.
>
>In cluster 1 through we have disk of 20GB but the cluster performance is 
>really good and heap usage/CPU usage is optimal.
>
>In cluster 2 we do have less data only in disk but there is heavy fluctuations 
>in heap usage and lot FULL GC happening pausing JVM for 7 to 8 secs every 
>minute. Only restart helps in this case.
>
>
>Only difference noticed between machines is memory page cache utilization. We 
>have done page cache cleanup and restarted the cluster and page cache 
>utilization become 105 GB out of 126GB RAM with in a day.
>
>Please find the metrics below and suggest any debugging steps to carry 
>out/document to refer.
>
>
>Cluster 1:
>
>Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>    ^-- Node [id=27529ecd, name=server-node-3, uptime=2 days, 17:48:09.803]
>    ^-- H/N/C [hosts=3, nodes=5, CPUs=24]
>    ^-- CPU [cur=1.33%, avg=3.05%, GC=0%]
>    ^-- PageMemory [pages=3375870]
>    ^-- Heap [used=4372MB, free=73.31%, comm=5600MB]
>    ^-- Off-heap [used=13341MB, free=20.03%, comm=16584MB]
>    ^--   sysMemPlc region [used=0MB, free=99.99%, comm=100MB]
>    ^--   metastoreMemPlc region [used=0MB, free=99.85%, comm=0MB]
>    ^--   TxLog region [used=0MB, free=100%, comm=100MB]
>    ^--   DefaultRegion region [used=13341MB, free=18.57%, comm=16384MB]
>    ^-- Ignite persistence [used=20052MB]
>    ^--   sysMemPlc region [used=0MB]
>    ^--   metastoreMemPlc region [used=0MB]
>    ^--   TxLog region [used=0MB]
>    ^--   DefaultRegion region [used=20052MB]
>    ^-- Outbound messages queue [size=0]
>    ^-- Public thread pool [active=0, idle=0, qSize=0]
>    ^-- System thread pool [active=0, idle=7, qSize=0]
>
>Cluster 2:
>Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>    ^-- Node [id=5905afb7, name=server-node-1, uptime=2 days, 05:49:04.925]
>    ^-- H/N/C [hosts=3, nodes=5, CPUs=24]
>    ^-- CPU [cur=1.23%, avg=6.4%, GC=0%]
>    ^-- PageMemory [pages=1173731]
>    ^-- Heap [used=13043MB, free=20.39%, comm=16384MB]
>    ^-- Off-heap [used=4638MB, free=72.2%, comm=16584MB]
>    ^--   sysMemPlc region [used=0MB, free=99.99%, comm=100MB]
>    ^--   metastoreMemPlc region [used=0MB, free=99.91%, comm=0MB]
>    ^--   TxLog region [used=0MB, free=100%, comm=100MB]
>    ^--   DefaultRegion region [used=4638MB, free=71.69%, comm=16384MB]
>    ^-- Ignite persistence [used=5423MB]
>    ^--   sysMemPlc region [used=0MB]
>    ^--   metastoreMemPlc region [used=0MB]
>    ^--   TxLog region [used=0MB]
>    ^--   DefaultRegion region [used=5422MB]
>    ^-- Outbound messages queue [size=0]
>    ^-- Public thread pool [active=0, idle=0, qSize=0]
>    ^-- System thread pool [active=0, idle=5, qSize=0]
> 
>  Thanks & Regards ,
>Devakumar J
> 
 
 
 
 

Re[2]: Howto fix object lock in cache?

2021-07-01 Thread Zhenya Stanilovsky

You can try to kill them, but I suspect you can't, because of their specific state 
(MARKED_ROLLBACK); just try.
If that doesn't help, you need to grep all ignite.log(s) for these  xid=  
versions; you will probably find a pair of instances, after which you need to 
sequentially restart the grid instances that have such xids in their logs. If all is OK, 
just plan an upgrade to 2.10; there are some bugs in 2.8.1 close to your case.
 
thanks!
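The grep step above can be sketched as follows. The xid, log names, and directory 
are hypothetical stand-ins (a temporary directory is created so the sketch is 
self-contained); point it at your real Ignite log directory:

```shell
# Sketch: find which node logs mention a stuck transaction's xid.
# The xid and log layout are illustrative; substitute your own values.
XID="93724cf8971--0d9e-4326--0001"
LOGDIR=$(mktemp -d)                                  # stand-in for your Ignite log directory
echo "TX marked rollback [xid=$XID]" > "$LOGDIR/ignite-node1.log"
echo "no matching entries here"      > "$LOGDIR/ignite-node2.log"
# -l prints only the names of files that contain the xid
grep -l "xid=$XID" "$LOGDIR"/*.log                   # prints the path of ignite-node1.log
```

The nodes whose logs match are the ones to restart sequentially, as described above.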
>Hi,
>thanks for the hint using the control.sh script. It looks like the server node 
>with .111 IP address is having some very old transactions:
>Matching transactions:
>TcpDiscoveryNode [id=78b6df9f-7b77-46ea-9aef-a1b61918b258, 
>addrs=[110.10.123.111], order=1, ver=2.8.1#20200521-sha1:86422096, 
>isClient=false, consistentId=3d51530f-0dec-46da-98cb-eb550c860777]
>    Tx: [xid=93724cf8971--0d9e-4326--0001, label=null, 
>state=MARKED_ROLLBACK, startTime=2021-05-24 06:22:10.778, duration=3224128, 
>isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, 
>topVer=AffinityTopologyVersion [topVer=74, minorTopVer=0], timeout=9990, 
>size=0, dhtNodes=[], nearXid=f0724cf8971--0d9e-4326--004a, 
>parentNodeIds=[75431074]]
>    Tx: [xid=7219dcbf971--0d9e-452c--0001, label=null, 
>state=MARKED_ROLLBACK, startTime=2021-06-23 13:12:53.094, duration=607486, 
>isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, 
>topVer=AffinityTopologyVersion [topVer=592, minorTopVer=0], timeout=1, 
>size=0, dhtNodes=[], nearXid=f119dcbf971--0d9e-452c--01d9, 
>parentNodeIds=[458bca81]]
>    Tx: [xid=a1480fb5a71--0d9e-4593--0001, label=null, 
>state=PREPARING, startTime=2021-06-30 10:51:37.874, duration=11161, 
>isolation=READ_COMMITTED, concurrency=OPTIMISTIC, 
>topVer=AffinityTopologyVersion [topVer=695, minorTopVer=0], timeout=0, size=1, 
>dhtNodes=[78b6df9f], nearXid=a1480fb5a71--0d9e-4593--0001, 
>parentNodeIds=[78b6df9f]]
>TcpDiscoveryNode [id=5eb48408-a239-4a92-9fb2-7903407975ad, 
>addrs=[110.10.123.87], order=688, ver=2.8.1#20200521-sha1:86422096, 
>isClient=true, consistentId=5eb48408-a239-4a92-9fb2-7903407975ad]
>    Tx: [xid=e95d3fb5a71--0d9e-4593--02b0, label=null, 
>state=PREPARING, startTime=2021-06-30 13:57:39.511, duration=0, 
>isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, 
>topVer=AffinityTopologyVersion [topVer=695, minorTopVer=0], timeout=0, size=1, 
>dhtNodes=[78b6df9f], nearXid=e95d3fb5a71--0d9e-4593--02b0, 
>parentNodeIds=[5eb48408]]
>Command [TX] finished with code: 0
>What happens if I would kill these two transactions that are in 
>MARKED_ROLLBACK state but look like they never finished? Is it safe to kill 
>them?
>Thanks!
> 
>On 30.06.21 15:20, Zhenya Stanilovsky wrote:
>>
>>Hi, first of all you need to know who is holding a lock, check [1] command.
>> 
>>[1]  
>>https://ignite.apache.org/docs/latest/tools/control-script#transaction-management
>>   
>>>Hi all,
>>>
>>>I have the situation that at least one DB row of my persistent cache
>>>seems locked and I can't change it anymore. Everytime I want to change
>>>it using SQL a TransactionTimeoutException happens like this:
>>>
>>>class org.apache.ignite.transactions.TransactionTimeoutException: Failed
>>>to acquire lock within provided timeout for transaction [timeout=1,
>>>tx=GridNearTxLocal[xid=cc2d4db5a71--0d9e-4593--02b6,
>>>xidVersion=GridCacheVersion [topVer=228476307, order=1625038312140,
>>>nodeOrder=694], nearXidVersion=GridCacheVersion [topVer=228476307,
>>>order=1625038312140, nodeOrder=694], concurrency=PESSIMISTIC,
>>>isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, invalidate=false,
>>>rollbackOnly=true, nodeId=62e91173-e912-49d5-b238-e95f8fe38314,
>>>timeout=1, startTime=1625051540461, duration=10034, label=null]]
>>>
>>>However, the system continues to run and other objects in the cache can
>>>be added and changed.
>>>
>>>Is there a way to unlock this particular entry?
>>>
>>>Thanks!
>>>  
>> 
>> 
>> 
>>  
 
 
 
 

Re: Howto fix object lock in cache?

2021-06-30 Thread Zhenya Stanilovsky


Hi, first of all you need to know who is holding the lock; check the command in [1].
 
[1]  
https://ignite.apache.org/docs/latest/tools/control-script#transaction-management
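For reference, a minimal sketch of the transaction commands from [1]; the host 
address and the xid are assumptions for illustration, and the commands require a 
running cluster:

```shell
# List transactions that have been running longer than 60 seconds
./control.sh --host 127.0.0.1 --tx --min-duration 60

# Kill a specific transaction by its xid (the xid shown is illustrative)
./control.sh --host 127.0.0.1 --tx --xid cc2d4db5a71--0d9e-4593--02b6 --kill
```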
 
>Hi all,
>
>I have the situation that at least one DB row of my persistent cache
>seems locked and I can't change it anymore. Everytime I want to change
>it using SQL a TransactionTimeoutException happens like this:
>
>class org.apache.ignite.transactions.TransactionTimeoutException: Failed
>to acquire lock within provided timeout for transaction [timeout=1,
>tx=GridNearTxLocal[xid=cc2d4db5a71--0d9e-4593--02b6,
>xidVersion=GridCacheVersion [topVer=228476307, order=1625038312140,
>nodeOrder=694], nearXidVersion=GridCacheVersion [topVer=228476307,
>order=1625038312140, nodeOrder=694], concurrency=PESSIMISTIC,
>isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, invalidate=false,
>rollbackOnly=true, nodeId=62e91173-e912-49d5-b238-e95f8fe38314,
>timeout=1, startTime=1625051540461, duration=10034, label=null]]
>
>However, the system continues to run and other objects in the cache can
>be added and changed.
>
>Is there a way to unlock this particular entry?
>
>Thanks!
>  
 
 
 
 

Re[2]: Enable Native persistence only for one node of the cluster

2021-06-14 Thread Zhenya Stanilovsky

Yes, it's possible, but a single data region can't be mixed: persistent on one node 
and in-memory on another.
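For illustration, a configuration along these lines declares one persistent and one 
in-memory data region on a node; the region names and the surrounding bean layout 
are made up for this sketch, and the same region definitions must be applied 
consistently on every node that hosts them:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
      <property name="dataRegionConfigurations">
        <list>
          <!-- Persistent region, e.g. for the task cache -->
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
            <property name="name" value="persistent-region"/>
            <property name="persistenceEnabled" value="true"/>
          </bean>
          <!-- Pure in-memory region -->
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
            <property name="name" value="in-memory-region"/>
            <property name="persistenceEnabled" value="false"/>
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
```

A cache is then bound to one of the regions via CacheConfiguration.setDataRegionName(...).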



 
>Background on Use case: We have around 1 tasks that we want to run across
>ignite cluster. These tasks will be submitted by multiple independent client
>nodes. Each task will have a priority and task key. Two tasks with the same
>taskKey should not be running in parallel. Execution of all the tasks should
>be based on priority. As far as I know, ignite does not provide locking
>between two tasks and it does not provide a way to distribute tasks based on
>its priority *across *the cluster.
>
>To meet the above requirements, we have a cluster singleton monitoring
>ignite service which will create the compute tasks and distribute them
>across the cluster. It will take care of priority and locking. All clients
>will create a cache object(based on which service can create actual compute
>tasks) and store it in the cache store. Service will query the cache-store
>to find the next set of valid tasks based on priority and locking.
>
>We need native persistence because even if the cluster goes down, it should
>not lose the tasks that were sent by the clients. And only ignite service
>will be dealing with these task objects, we are thinking of enabling it only
>for the cluster node on which our ignite service is hosted.
>
>Let me know if this makes sense or if you see any red flags.
>
>Thanks,
>Krish
>
>
>
>
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
 
 
 
 

Re: Failing client node due to not receiving metrics updates-IGNITE-10354

2021-06-10 Thread Zhenya Stanilovsky

Hello Akash!
I checked, and the fix you mentioned looks fine.
Why do you think the network between your server and client is OK?
Can you add some network monitoring here?
thanks.
 
> 
>> 
>>>Hi, There is a cluster of four server nodes and six client nodes in 
>>>production. I was using ignite 2.6.0 version and all six client nodes were 
>>>failing with below error
>>> 
>>>WARN o.a.i.s.d.tcp.TcpDiscoverySpi -  Failing client node due to not 
>>>receiving metrics updates from client node within 
>>>'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing 
>>>configuration property) [timeout=9, node=TcpDiscoveryNode 
>>>[id=12f9809d-95be-47e3-81fe-d7ffcaab064c, 
>>>consistentId=12f9809d-95be-47e3-81fe-d7ffcaab064c, addrs=ArrayList 
>>>[0:0:0:0:0:0:0:1%lo, 127.0.0.1, ], sockAddrs=HashSet 
>>>[/0:0:0:0:0:0:0:1%lo:0, / 127.0.0.1:0 , /:0], 
>>>discPort=0, order=155, intOrder=82, lastExchangeTime=1623154238808, 
>>>loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]]
>>> 
>>>Then I have upgraded the ignite version to 2.10.0 to get the fix of known 
>>>issue  IGNITE-10354 .  But I am still facing the issue even after upgrading 
>>>to the 2.10.0 ignite version.
>>> 
>>>Could someone help here.
>>> 
>>>Thanks,
>>>Akash
>>>  
>> 
>> 
>> 
>> 

Re: Ignite node crashed

2021-05-13 Thread Zhenya Stanilovsky

Hi Marcus! The problem indeed seems to be caused by a long GC pause.
Have you applied all the suggestions from [1] and [2]?
[1]  https://apacheignite.readme.io/docs/jvm-and-system-tuning
[2]  https://apacheignite.readme.io/docs/jvm-and-system-tuning#memory-issues
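As a rough starting point, the tuning pages referenced above suggest GC settings 
along these lines; the heap size is an assumption to adjust for your own VMs:

```shell
# Baseline suggested by the Ignite JVM tuning guide; adjust -Xms/-Xmx to your hardware
JVM_OPTS="-server -Xms10g -Xmx10g \
  -XX:+AlwaysPreTouch \
  -XX:+UseG1GC \
  -XX:+ScavengeBeforeFullGC \
  -XX:+DisableExplicitGC"
```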
 
>Hi,
> 
>I have a 5 node Ignite cluster setup, and it seems that when I start to create 
>table in the cluster, one of the node would crash. All of the nodes are VM 
>with 8 CPUs and 128GB of memory. I have attached the log file, gc file and 
>also the xml config for the crashing node (with default data region of 90GB, 
>and heap size of 10GB). I can see the node having a long GC starting from 
>04:29:58, but unfortunately the gc log doesn’t show anything at that time. Can 
>you please shed some light on the issue? Thanks.
> 
>Regards,
>Marcus
>  
 
 
 
 

Re: Long transaction suspended

2021-03-31 Thread Zhenya Stanilovsky

Hi, fix [1] is already in master.

[1] https://issues.apache.org/jira/browse/IGNITE-14076
> 
>Hi !
>
>
> 
>>Hi,
>>
>>Because of the kind of product we have to develop, we currently have a set
>>of scenarios with this kind of transactions and we're evaluating several
>>datastores as RocksDB and, sadly, timings there are quite better than the
>>ones I've got in Ignite... :(
> 
>I believe tx.putAll will be fixed soon ) I have working prototype for now, 
>need a little bit time to fix all tests )
> 
>>
>>Data streamer is not available in C++ afaik...
> 
> 
> 
> 
>
>  
 
 
 
 

Re[2]: Failed to write class name to file

2021-03-31 Thread Zhenya Stanilovsky


What is the output of:  stat /tmp/ignite-workspace/marshaller/-961185899.classname0 ?
If everything is OK there, possibly there are some permission issues?
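If tmp cleanup turns out to be the culprit, one common mitigation (an assumption 
here, not a confirmed fix for this report) is to move the Ignite work directory out 
of /tmp; the path below is illustrative:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <!-- Keep marshaller mappings and other work files out of /tmp,
       where OS cleanup jobs may delete them -->
  <property name="workDirectory" value="/var/lib/ignite/work"/>
</bean>
```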
 
>Hello,
> 
>we are using CentOS Linux release 7.8.2003 (Core), which deletes files from 
>/tmp no sooner than 10 days after accessed, so this should not be an issue.
> 
> 
>On Tue, 2021-03-30 at 18:40 +0300, Zhenya Stanilovsky wrote:
>>hi,
>>https://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up ?
>>
>> 
>>>Hi, 
>>> 
>>>We use a cluster consisting of 2 Ignite nodes. Deployment is done without 
>>>downtime or cluster reset.
>>> 
>>>When attempting to deploy a new version of an application containing new caches 
>>>on node 1, Ignite fails to write the class file on node 2. Peer class 
>>>loading is enabled. 
>>>The directory does exist and already contains other serialized classes and 
>>>has proper permissions.
>>> 
>>>Somehow Ignite fails to open a file it has just created, with 
>>>FileNotFoundException:  No such file or directory
>>>Do you have any suggestions what could be the cause? 
>>> 
>>>Thank you,
>>>Matej
>>> 
>>> 
>>>[ERROR] [29.03.2021 14:34:09.686] [] [68.166.6:47500]-#2] 
>>>[i.i.MarshallerMappingFileStore]: Failed to write class name to file 
>>>[platformId=0id=-961185899, clsName=org.profile.AbTest, 
>>>file=/tmp/ignite-workspace/marshaller/-961185899.classname0]
>>>java.io.FileNotFoundException: 
>>>/tmp/ignite-workspace/marshaller/-961185899.classname0 (No such file or 
>>>directory)
>>>at java.base/java.io.FileOutputStream.open0(Native Method)
>>>at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>>>at java.base/java.io.FileOutputStream.(FileOutputStream.java:237)
>>>at java.base/java.io.FileOutputStream.(FileOutputStream.java:187)
>>>at 
>>>org.apache.ignite.internal.MarshallerMappingFileStore.writeMapping(MarshallerMappingFileStore.java:97)
>>>at 
>>>org.apache.ignite.internal.MarshallerMappingFileStore.mergeAndWriteMapping(MarshallerMappingFileStore.java:222)
>>>at 
>>>org.apache.ignite.internal.MarshallerContextImpl.onMappingDataReceived(MarshallerContextImpl.java:191)
>>>at 
>>>org.apache.ignite.internal.processors.marshaller.GridMarshallerMappingProcessor.processIncomingMappings(GridMarshallerMappingProcessor.java:356)
>>>at 
>>>org.apache.ignite.internal.processors.marshaller.GridMarshallerMappingProcessor.onJoiningNodeDataReceived(GridMarshallerMappingProcessor.java:336)
>>>at 
>>>org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:906)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:2090)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4816)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:3089)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2795)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7766)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2946)
>>>at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>>>at 
>>>org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7697)
>>>at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:61)
>>>[ERROR] [29.03.2021 14:34:09.687] [] [68.166.6:47500]-#2] 
>>>[i.i.MarshallerMappingFileStore]: Failed to write class name to file 
>>>[platformId=0id=-1710898632, clsName=org.profile.GridCard, 
>>>file=/tmp/ignite-customer2/marshaller/-1710898632.classname0]
>>>java.io.FileNotFoundException: 
>>>/tmp/ignite-customer2/marshaller/-1710898632.classname0 (No such file or 
>>>directory)
>>>at java.base/java.io.FileOutputStream.open0(Native Method)
>>>at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>>>at java.base/java.io.FileOutputStream.(FileOutputStream.java:237)
>>>at java.base/java.io.FileOutputStream.(FileOutputStream.java:187)
>>>at 
>>>org.apache.ignite.internal.MarshallerMappingFileStore.writeMapping(MarshallerMappingFileStore.java:97)
>>>at 
>>>org.apache.ignite.internal.MarshallerMappingFileStore.

Re: Failed to write class name to file

2021-03-30 Thread Zhenya Stanilovsky

hi,
https://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up ?

 
>Hi, 
> 
>We use a cluster consisting of 2 Ignite nodes. Deployment is done without 
>downtime or cluster reset.
> 
>When attempting to deploy a new version of an application containing new caches on 
>node 1, Ignite fails to write the class file on node 2. Peer class loading is 
>enabled. 
>The directory does exist and already contains other serialized classes and has 
>proper permissions.
> 
>Somehow Ignite fails to open a file it has just created, with 
>FileNotFoundException:  No such file or directory
>Do you have any suggestions what could be the cause? 
> 
>Thank you,
>Matej
> 
> 
>[ERROR] [29.03.2021 14:34:09.686] [] [68.166.6:47500]-#2] 
>[i.i.MarshallerMappingFileStore]: Failed to write class name to file 
>[platformId=0id=-961185899, clsName=org.profile.AbTest, 
>file=/tmp/ignite-workspace/marshaller/-961185899.classname0]
>java.io.FileNotFoundException: 
>/tmp/ignite-workspace/marshaller/-961185899.classname0 (No such file or 
>directory)
>at java.base/java.io.FileOutputStream.open0(Native Method)
>at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>at java.base/java.io.FileOutputStream.(FileOutputStream.java:237)
>at java.base/java.io.FileOutputStream.(FileOutputStream.java:187)
>at 
>org.apache.ignite.internal.MarshallerMappingFileStore.writeMapping(MarshallerMappingFileStore.java:97)
>at 
>org.apache.ignite.internal.MarshallerMappingFileStore.mergeAndWriteMapping(MarshallerMappingFileStore.java:222)
>at 
>org.apache.ignite.internal.MarshallerContextImpl.onMappingDataReceived(MarshallerContextImpl.java:191)
>at 
>org.apache.ignite.internal.processors.marshaller.GridMarshallerMappingProcessor.processIncomingMappings(GridMarshallerMappingProcessor.java:356)
>at 
>org.apache.ignite.internal.processors.marshaller.GridMarshallerMappingProcessor.onJoiningNodeDataReceived(GridMarshallerMappingProcessor.java:336)
>at 
>org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:906)
>at 
>org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:2090)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4816)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:3089)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2795)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7766)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2946)
>at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7697)
>at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:61)
>[ERROR] [29.03.2021 14:34:09.687] [] [68.166.6:47500]-#2] 
>[i.i.MarshallerMappingFileStore]: Failed to write class name to file 
>[platformId=0id=-1710898632, clsName=org.profile.GridCard, 
>file=/tmp/ignite-customer2/marshaller/-1710898632.classname0]
>java.io.FileNotFoundException: 
>/tmp/ignite-customer2/marshaller/-1710898632.classname0 (No such file or 
>directory)
>at java.base/java.io.FileOutputStream.open0(Native Method)
>at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>at java.base/java.io.FileOutputStream.(FileOutputStream.java:237)
>at java.base/java.io.FileOutputStream.(FileOutputStream.java:187)
>at 
>org.apache.ignite.internal.MarshallerMappingFileStore.writeMapping(MarshallerMappingFileStore.java:97)
>at 
>org.apache.ignite.internal.MarshallerMappingFileStore.mergeAndWriteMapping(MarshallerMappingFileStore.java:222)
>at 
>org.apache.ignite.internal.MarshallerContextImpl.onMappingDataReceived(MarshallerContextImpl.java:191)
>at 
>org.apache.ignite.internal.processors.marshaller.GridMarshallerMappingProcessor.processIncomingMappings(GridMarshallerMappingProcessor.java:356)
>at 
>org.apache.ignite.internal.processors.marshaller.GridMarshallerMappingProcessor.onJoiningNodeDataReceived(GridMarshallerMappingProcessor.java:336)
>at 
>org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:906)
>at 
>org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:2090)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4816)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:3089)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2795)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7766)
>at 
>org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2946)
>at 

Re[4]: Run sql query on key-value cache

2021-03-12 Thread Zhenya Stanilovsky


The schema for a persistent cache will be stored in the appropriate configuration.


 
>Hi Team,
>
>Thanks for the response.
>If we create schema(table) with indexes using existing cache , will this
>schema be created in memory?
>in our existing xml config , we are using persistence for ignite node, will
>apache use same persistence storage or create schema in-memory(RAM)?
>Is there a way to create schema on persistent storage rather than in-memory?
>
>regards,
>Rakshita Chaudhary
>
>
 
 
 
 

Re[2]: Run sql query on key-value cache

2021-03-12 Thread Zhenya Stanilovsky

Ilya, it seems you are mistaken; check [1].
There seems to be no additional documentation, but the API is simple; check example 
[2].
 
[1]  https://issues.apache.org/jira/browse/IGNITE-12808
[2]  
https://github.com/apache/ignite/pull/7627/files#diff-d0f1fdd4e070c92459cb4f1e600977bd4819216e658ff2114fe19bfeb2a93232R1000
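Since the nodes here are already configured through Spring XML, one way to declare 
the cache with a QueryEntity and an index up front is a fragment like the one below; 
the cache name, value type, and field names are hypothetical:

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
  <property name="name" value="personCache"/>
  <property name="queryEntities">
    <list>
      <bean class="org.apache.ignite.cache.QueryEntity">
        <property name="keyType" value="java.lang.Long"/>
        <property name="valueType" value="org.example.Person"/>
        <property name="fields">
          <map>
            <entry key="name" value="java.lang.String"/>
            <entry key="salary" value="java.lang.Double"/>
          </map>
        </property>
        <!-- Index is created together with the cache -->
        <property name="indexes">
          <list>
            <bean class="org.apache.ignite.cache.QueryIndex">
              <constructor-arg value="salary"/>
            </bean>
          </list>
        </property>
      </bean>
    </list>
  </property>
</bean>
```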
 
>Hello!
> 
>Once a cache is created, you can't add indexes. You will need to recreate 
>cache / restart cluster with updated configuration.
> 
>You can define caches with Query Entities in spring XML configuration, pass it 
>to IgniteConfguration instance. 
> 
>Regards,
>--
>Ilya Kasnacheev  
>пт, 12 мар. 2021 г. в 07:07, rakshita04 < rakshita.chaudh...@siemens.com >:
>>How can we create indexes on our existing  cache.
>>As far as i could see on your portal no API support is available for C++ for
>>creating indexes.
>>does Query Entity automatically takes care of creating indexes? or we need
>>to explicitly create indexes on our  cache?
>>if we need to explicitly create , can you please help us how to do that?
>>
>>regards,
>>Rakshita Chaudhary
>>
>>
>>
>>--
>>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
 
 
 
 

Re[2]: [External]Re: Ignite PME query

2021-03-08 Thread Zhenya Stanilovsky


Hi Joshi, the PME message is still logged, but it is very lightweight and involves no
partition exchange.


 
>Hi,
>
>I couldn't get the exact details what am looking for, that's why am writing 
>here again. Can anyone please enlighten me on below points ?
>
>Thanks and Regards,
>Kamlesh Joshi
>
>-Original Message-
>From: VeenaMithare < v.mith...@cmcmarkets.com >
>Sent: 01 March 2021 13:55
>To:  user@ignite.apache.org
>Subject: [External]Re: Ignite PME query
>
>The e-mail below is from an external source. Please do not open attachments or 
>click links from an unknown or suspicious origin.
>
>Hi Kamlesh,
>
>PME related questions have been answered in this forum earlier as well.
>Maybe if you search for these, you might get some of your questions answered.
>
>regards,
>Veena.
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/
>
>"Confidentiality Warning: This message and any attachments are intended only 
>for the use of the intended recipient(s).
>are confidential and may be privileged. If you are not the intended recipient. 
>you are hereby notified that any
>review. re-transmission. conversion to hard copy. copying. circulation or 
>other use of this message and any attachments is
>strictly prohibited. If you are not the intended recipient. please notify the 
>sender immediately by return email.
>and delete this message and any attachments from your system.
>
>Virus Warning: Although the company has taken reasonable precautions to ensure 
>no viruses are present in this email.
>The company cannot accept responsibility for any loss or damage arising from 
>the use of this email or attachment." 
 
 
 
 

Re[6]: Mixing persistent and in memory cache

2021-03-01 Thread Zhenya Stanilovsky


OK, I found it!
18:36:07 noringBase.info            INFO   Topology snapshot [ver=2, 
locNode=2bf85583, servers=2, clients=0, state=ACTIVE, CPUs=8, offheap=6.3GB, 
heap=3.5GB]
18:36:07 noringBase.info            INFO     ^-- Baseline [id=0, size=1, 
online=1, offline=0]
You called the baseline command after the first node was started, so you have two alive
nodes with only one of them in the baseline.
 
Clear your persistence directories and rewrite the code like:
startNode(...a..)
ignite = startNode(..b...)
ignite.cluster().state(ClusterState.ACTIVE);
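A fuller sketch of that startup order (Java; `startNode` here is a hypothetical stand-in for the helper used in the attached test, and persistence is assumed to be configured as in that test):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartupOrder {
    public static void main(String[] args) {
        // Start BOTH server nodes before activating, so that both of them
        // are included in the baseline topology when activation happens.
        Ignite a = startNode("a");
        Ignite b = startNode("b");

        // Activation fixes the baseline to the current server topology (2 nodes).
        b.cluster().state(ClusterState.ACTIVE);
    }

    // Hypothetical helper standing in for the user's own startNode(...).
    private static Ignite startNode(String name) {
        return Ignition.start(new IgniteConfiguration().setIgniteInstanceName(name));
    }
}
```

Activating after only the first node has started pins a one-node baseline, which is exactly the `Baseline [id=0, size=1]` seen in the log above.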
 
 
> 
>> 
>>>Hi Zhenya,
>>>To be on the safe side, I increased the sleep to 5 sec. I also removed the 
>>>setBackups(2) – no difference – the test is still failing!
>>> 
>>>Newest log file with corresponding source attached.
>>> 
>>>I added more specific logs:
>>> 
>>>18:36:07 lusterTest.testMem2    INFO   >>>>>> value has been written to 
>>>'a': aval
>>>18:36:12 lusterTest.testMem2    INFO   >>>>>> value retrieved from 'b': 
>>>null
>>> 
>>>---
>>>Mit freundlichen Grüßen
>>> 
>>>Stephan Hesse
>>>Geschäftsführer
>>> 
>>>DICOS GmbH Kommunikationssysteme
>>>Alsfelder Straße 11, 64289 Darmstadt
>>> 
>>>Telefon:  +49 6151 82787 27 , Mobil:  +49 1761 82787 27
>>> 
>>>www.dicos.de
>>> 
>>>DICOS GmbH Kommunikationssysteme, Darmstadt, Amtsgericht Darmstadt HRB 7024,
>>>Geschäftsführer: Dr. Winfried Geyer, Stephan Hesse, Waldemar Wiesner
>>> 
>>> 
>>> 
>>>From: Zhenya Stanilovsky < arzamas...@mail.ru >
>>>Sent: Monday, March 1, 2021 7:15 AM
>>>To: user@ignite.apache.org
>>>Subject: Re[4]: Mixing persistent and in memory cache
>>> 
>>>
>>>hi  Stephan,  due to logs rebalance  still in progress (probably slow 
>>>network ?) will test pass if you increase sleep interval ? 2 sec fro example 
>>>?
>>>Additionally no need to set .setBackups(2) in CacheMode.REPLICATED cache, 
>>>plz check documentation.
>>> 
>>> 
>>>>Hi Zhenya,
>>>> 
>>>>your 2nd point: yes, the cache itself has been propagated.
>>>> 
>>>>Please be aware that I have successfully used the same test with only the 
>>>>in-memory region as well as with only the persistent region. Only when I 
>>>>combine both, the synchronization stops working for the in memory region.
>>>> 
>>>> 
>>>>Please find attached the log file (both Ignite nodes run in the same 
>>>>process and contribute to this log file) as well as the current Junit test.
>>>> 
>>>>In the log file you will find:
>>>> 
>>>>The node startup:
>>>>>>>> starting node A
>>>>>>>> starting node B
>>>> 
>>>>The test case startup:
>>>> testMem2
>>>> 
>>>>The test stops with:
>>>>java.lang.AssertionError: expected: but was:
>>>>    …
>>>>    at 
>>>>de.dicos.cpcfe.ignite.IgniteClusterTest.testMem2(IgniteClusterTest.java:175)
>>>> 
>>>>---
>>>>Mit freundlichen Grüßen
>>>> 
>>>>Stephan Hesse
>>>>Geschäftsführer
>>>> 
>>>>DICOS GmbH Kommunikationssysteme
>>>>Alsfelder Straße 11, 64289 Darmstadt
>>>> 
>>>>Telefon:  +49 6151 82787 27 , Mobil:  +49 1761 82787 27
>>>> 
>>>>www.dicos.de
>>>> 
>>>>DICOS GmbH Kommunikationssysteme, Darmstadt, Amtsgericht Darmstadt HRB 7024,
>>>>Geschäftsführer: Dr. Winfried Geyer, Stephan Hesse, Waldemar Wiesner
>>>> 
>>>> 
>>>> 
>>>>From: Zhenya Stanilovsky < arzamas...@mail.ru >
>>>>Sent: Friday, February 26, 2021 6:57 AM
>>>>To: user@ignite.apache.org
>>>>Subject: Re[2]: Mixing persistent and in memory cache
>>>> 
>>>>
>>>>hi Stephan, something wrong with configuration probably … it`s not expected 
>>>>issue.
>>>>*  plz attach somehow or send me ignite.log from all server nodes ? 
>>>>*  If you change second call :
>>>>IgniteCache kva = getInMemoryKeyValue(igA);
>>>> 
>>>> IgniteCache kvb = getInMemoryKeyValue(igB); ← here
>>>>for something like : IgniteCache kvb = 
>>>>getInMemoryKeyValue2(igB);
>>>> 
>>>>private IgniteCache getInMemoryKeyValue2(Ignite ignite)
>>>> {
>>>> return ignite.cache(new CacheConfiguration() <--- 
>>>> 
>>>>just to check that cache has been already created.
>>>> 
>>>>Does ignite.cache will see the previously created cache ?
>>>> 
>>>>thanks !
>>>> 
>>>> 
>>>>>Hi Zhenya, thanks for this suggestion.
>>>>>
>>>>>However, neither setting CacheWriteSynchronizationMode to Full Sync nor
>>>>>setting it to FULL_ASYNC changes anything: memory cahce changes do not get
>>>>>propagated:
>>>>>
>>>>>private IgniteCache getInMemoryKeyValue(Ignite ignite)
>>>>>{
>>>>>return ignite.getOrCreateCache(new CacheConfiguration()
>>>>>.setName("memkv")
>>>>>.setCacheMode(CacheMode.REPLICATED)
>>>>>.setDataRegionName(NodeController.IN_MEMORY_REGION)
>>>>>.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
>>>>>.setBackups(2));
>>>>>}
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>>  
>> 
>> 
>> 
>> 

Re[4]: Mixing persistent and in memory cache

2021-02-28 Thread Zhenya Stanilovsky


Hi Stephan, according to the logs, rebalance is still in progress (probably a slow network?).
Will the test pass if you increase the sleep interval, for example to 2 seconds?
Additionally, there is no need to set .setBackups(2) on a CacheMode.REPLICATED cache; please
check the documentation.
 
>Hi Zhenya,
> 
>your 2nd point: yes, the cache itself has been propagated.
> 
>Please be aware that I have successfully used the same test with only the 
>in-memory region as well as with only the persistent region. Only when I 
>combine both, the synchronization stops working for the in memory region.
> 
> 
>Please find attached the log file (both Ignite nodes run in the same process 
>and contribute to this log file) as well as the current Junit test.
> 
>In the log file you will find:
> 
>The node startup:
>>>>> starting node A
>>>>> starting node B
> 
>The test case startup:
> testMem2
> 
>The test stops with:
>java.lang.AssertionError: expected: but was:
>    …
>    at 
>de.dicos.cpcfe.ignite.IgniteClusterTest.testMem2(IgniteClusterTest.java:175)
> 
>---
>Mit freundlichen Grüßen
> 
>Stephan Hesse
>Geschäftsführer
> 
>DICOS GmbH Kommunikationssysteme
>Alsfelder Straße 11, 64289 Darmstadt
> 
>Telefon:  +49 6151 82787 27 , Mobil:  +49 1761 82787 27
> 
>www.dicos.de
> 
>DICOS GmbH Kommunikationssysteme, Darmstadt, Amtsgericht Darmstadt HRB 7024,
>Geschäftsführer: Dr. Winfried Geyer, Stephan Hesse, Waldemar Wiesner
> 
> 
> 
>From: Zhenya Stanilovsky < arzamas...@mail.ru >
>Sent: Friday, February 26, 2021 6:57 AM
>To: user@ignite.apache.org
>Subject: Re[2]: Mixing persistent and in memory cache
> 
>
>hi Stephan, something wrong with configuration probably … it`s not expected 
>issue.
>*  plz attach somehow or send me ignite.log from all server nodes ? 
>*  If you change second call :
>IgniteCache kva = getInMemoryKeyValue(igA);
> 
> IgniteCache kvb = getInMemoryKeyValue(igB); ← here
>for something like : IgniteCache kvb = 
>getInMemoryKeyValue2(igB);
> 
>private IgniteCache getInMemoryKeyValue2(Ignite ignite)
> {
> return ignite.cache(new CacheConfiguration() <--- 
> 
>just to check that cache has been already created.
> 
>Does ignite.cache will see the previously created cache ?
> 
>thanks !
> 
> 
>>Hi Zhenya, thanks for this suggestion.
>>
>>However, neither setting CacheWriteSynchronizationMode to Full Sync nor
>>setting it to FULL_ASYNC changes anything: memory cahce changes do not get
>>propagated:
>>
>>private IgniteCache getInMemoryKeyValue(Ignite ignite)
>>{
>>return ignite.getOrCreateCache(new CacheConfiguration()
>>.setName("memkv")
>>.setCacheMode(CacheMode.REPLICATED)
>>.setDataRegionName(NodeController.IN_MEMORY_REGION)
>>.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
>>.setBackups(2));
>>}
>>
>>
>>
>>
>>--
>>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/
> 
> 
> 
>  
 
 
 
 

Re[2]: Mixing persistent and in memory cache

2021-02-25 Thread Zhenya Stanilovsky


Hi Stephan, something is probably wrong with the configuration; this is not an expected
issue.
*  Please attach, or send me, ignite.log from all server nodes.
*  If you change the second call:
IgniteCache kva = getInMemoryKeyValue(igA);
 
 IgniteCache kvb = getInMemoryKeyValue(igB); ← here
for something like : IgniteCache kvb = 
getInMemoryKeyValue2(igB);
 
private IgniteCache getInMemoryKeyValue2(Ignite ignite)
 {
 return ignite.cache(new CacheConfiguration() <--- 
 
just to check that the cache has already been created.
 
Will ignite.cache see the previously created cache?
 
thanks !
 
>Hi Zhenya, thanks for this suggestion.
>
>However, neither setting CacheWriteSynchronizationMode to Full Sync nor
>setting it to FULL_ASYNC changes anything: memory cahce changes do not get
>propagated:
>
>private IgniteCache getInMemoryKeyValue(Ignite ignite)
>{
>return ignite.getOrCreateCache(new CacheConfiguration()
>.setName("memkv")
>.setCacheMode(CacheMode.REPLICATED)
>.setDataRegionName(NodeController.IN_MEMORY_REGION)
>.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
>.setBackups(2));
>}
>
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
 
 
 
 

Re: Mixing persistent and in memory cache

2021-02-20 Thread Zhenya Stanilovsky


Sorry for the previous message; FULL_SYNC, of course [1].
 
[1]  
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/CacheWriteSynchronizationMode.html#FULL_SYNC
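A sketch of the suggested configuration (Java; the cache name matches the snippet quoted below, while the key/value types are illustrative):

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class MemKvConfig {
    public static CacheConfiguration<String, String> inMemoryKeyValue() {
        return new CacheConfiguration<String, String>("memkv")
            .setCacheMode(CacheMode.REPLICATED)
            // FULL_SYNC: a write completes only after all participating remote
            // nodes have confirmed it, so a read on another node immediately
            // after the put sees the new value.
            .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
    }
}
```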
 
>Hi,
> 
>I am trying to set up a system with some caches in a persistent and some in an 
>in-memory region.
> 
>However, it seems that the in-memory regions don’t get synchronized between 
>the nodes. This would work with a pure setup of either only an in-memory 
>region or only a persistent one.
> 
>Below is a complete JUnit test showing my issue. The “testMem“ will fail 
>because the update made on node “a” does not reach node “b”. Any suggestions?
> 
>Ignite version: 2.9.1
> 
> 
>package de.dicos.cpcfe.ignite;
> 
>import java.io.File;
>import java.util.Arrays;
>import java.util.List;
> 
>import org.apache.ignite.Ignite;
>import org.apache.ignite.IgniteCache;
>import org.apache.ignite.Ignition;
>import org.apache.ignite.cache.CacheMode;
>import org.apache.ignite.cluster.ClusterState;
>import org.apache.ignite.configuration.CacheConfiguration;
>import org.apache.ignite.configuration.ClientConnectorConfiguration;
>import org.apache.ignite.configuration.ConnectorConfiguration;
>import org.apache.ignite.configuration.DataRegionConfiguration;
>import org.apache.ignite.configuration.DataStorageConfiguration;
>import org.apache.ignite.configuration.IgniteConfiguration;
>import org.apache.ignite.internal.IgniteEx;
>import org.apache.ignite.logger.slf4j.Slf4jLogger;
>import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
>import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
>import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;
>import org.junit.After;
>import org.junit.Assert;
>import org.junit.Before;
>import org.junit.Test;
>import org.slf4j.Logger;
>import org.slf4j.LoggerFactory;
> 
>/**
>*
>* @author sth
>*/
>public class IgniteClusterTest
>{
>    // /
>    // Class Fields
>    // /
>    /** */
>    private static Logger log = 
>LoggerFactory.getLogger(IgniteClusterTest.class);
>    
>    /** */
>    private IgniteEx igA;
> 
>    /** */
>    private IgniteEx igB;
> 
>    /** */
>    public static final String PERSISTENT_REGION = "persistent";
> 
>    /** */
>    public static final String IN_MEMORY_REGION = "inmemory";
> 
>    // /
>    // Constructors
>    // /
>    /**
>    */
>    public IgniteClusterTest()
>    {
>    }
> 
>    // /
>    // Methods
>    // /
>    /**
>    * @throws java.lang.Exception
>    */
>    @Before
>    public void setUp()
>   throws Exception
>    {
>   try {
>   log.info(" starting node 
>A");
>   File da = new 
>File("target/idd-a");
>    rmrf(da);
>   igA = startNode("a", da, 47500, 
>47100, 11211, 10800);
>    log.info(" node A is 
>running");
> 
>   Thread.sleep(1000);
> 
>   log.info(" starting node 
>B");
>   File db = new 
>File("target/idd-b");
>    rmrf(db);
>   igB = startNode("b", db, 47501, 
>47101, 11212, 10801);
>    log.info(" node B is 
>running");
>   
>   } catch (Throwable x) {
>   log.error("unexpected 
>exception", x);
>   throw x;
>   }
>    }
> 
>    /**
>    * @throws java.lang.Exception
>    */
>    @After
>    public void tearDown()
>   throws Exception
>    {
>   log.info(" stopping all nodes");
>   Ignition.stopAll(true);
>    }
> 
> 
>    @Test
>    public void testPerm() throws InterruptedException
>  

Re: Mixing persistent and in memory cache

2021-02-20 Thread Zhenya Stanilovsky


Hi, it seems you need to add an additional sync option [1].
 
[1]  
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/CacheWriteSynchronizationMode.html#FULL_ASYNC
 
>Hi,
> 
>I am trying to set up a system with some caches in a persistent and some in an 
>in-memory region.
> 
>However, it seems that the in-memory regions don’t get synchronized between 
>the nodes. This would work with a pure setup of either only an in-memory 
>region or only a persistent one.
> 
>Below is a complete JUnit test showing my issue. The “testMem“ will fail 
>because the update made on node “a” does not reach node “b”. Any suggestions?
> 
>Ignite version: 2.9.1
> 
> 
>package de.dicos.cpcfe.ignite;
> 
>import java.io.File;
>import java.util.Arrays;
>import java.util.List;
> 
>import org.apache.ignite.Ignite;
>import org.apache.ignite.IgniteCache;
>import org.apache.ignite.Ignition;
>import org.apache.ignite.cache.CacheMode;
>import org.apache.ignite.cluster.ClusterState;
>import org.apache.ignite.configuration.CacheConfiguration;
>import org.apache.ignite.configuration.ClientConnectorConfiguration;
>import org.apache.ignite.configuration.ConnectorConfiguration;
>import org.apache.ignite.configuration.DataRegionConfiguration;
>import org.apache.ignite.configuration.DataStorageConfiguration;
>import org.apache.ignite.configuration.IgniteConfiguration;
>import org.apache.ignite.internal.IgniteEx;
>import org.apache.ignite.logger.slf4j.Slf4jLogger;
>import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
>import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
>import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;
>import org.junit.After;
>import org.junit.Assert;
>import org.junit.Before;
>import org.junit.Test;
>import org.slf4j.Logger;
>import org.slf4j.LoggerFactory;
> 
>/**
>*
>* @author sth
>*/
>public class IgniteClusterTest
>{
>    // /
>    // Class Fields
>    // /
>    /** */
>    private static Logger log = 
>LoggerFactory.getLogger(IgniteClusterTest.class);
>    
>    /** */
>    private IgniteEx igA;
> 
>    /** */
>    private IgniteEx igB;
> 
>    /** */
>    public static final String PERSISTENT_REGION = "persistent";
> 
>    /** */
>    public static final String IN_MEMORY_REGION = "inmemory";
> 
>    // /
>    // Constructors
>    // /
>    /**
>    */
>    public IgniteClusterTest()
>    {
>    }
> 
>    // /
>    // Methods
>    // /
>    /**
>    * @throws java.lang.Exception
>    */
>    @Before
>    public void setUp()
>   throws Exception
>    {
>   try {
>   log.info(" starting node 
>A");
>   File da = new 
>File("target/idd-a");
>    rmrf(da);
>   igA = startNode("a", da, 47500, 
>47100, 11211, 10800);
>    log.info(" node A is 
>running");
> 
>   Thread.sleep(1000);
> 
>   log.info(" starting node 
>B");
>   File db = new 
>File("target/idd-b");
>    rmrf(db);
>   igB = startNode("b", db, 47501, 
>47101, 11212, 10801);
>    log.info(" node B is 
>running");
>   
>   } catch (Throwable x) {
>   log.error("unexpected 
>exception", x);
>   throw x;
>   }
>    }
> 
>    /**
>    * @throws java.lang.Exception
>    */
>    @After
>    public void tearDown()
>   throws Exception
>    {
>   log.info(" stopping all nodes");
>   Ignition.stopAll(true);
>    }
> 
> 
>    @Test
>    public void testPerm() throws 

Re[2]: Long transaction suspended

2021-02-10 Thread Zhenya Stanilovsky

Hi !


 
>Hi,
>
>Because of the kind of product we have to develop, we currently have a set
>of scenarios with this kind of transactions and we're evaluating several
>datastores as RocksDB and, sadly, timings there are quite better than the
>ones I've got in Ignite... :(
 
I believe tx.putAll will be fixed soon. I have a working prototype now; I need a little
more time to fix all the tests.
 
>
>Data streamer is not available in C++ afaik...
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
 
 
 
 

Re: Failed to execute the cache operation (all partition owners are left the grid partition data has been lost))

2021-02-02 Thread Zhenya Stanilovsky


It seems you have run into https://issues.apache.org/jira/browse/IGNITE-14073
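If the data files are actually intact and the lost state was declared spuriously (as in that ticket), the lost-partition state can be reset once all owners are back online. A minimal sketch, assuming all baseline nodes have rejoined (the cache name is taken from the stack trace below):

```java
import java.util.Collections;

import org.apache.ignite.Ignite;

public class ResetLost {
    // Clears the "partition data has been lost" state for the cache so that
    // operations on the affected partitions are allowed again. Call this only
    // after all baseline nodes have rejoined the cluster.
    public static void resetCredentialsCache(Ignite ignite) {
        ignite.resetLostPartitions(Collections.singleton("CREDENTIALS"));
    }
}
```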


 
>Hi,
>I have .Net Ignite client and server app Ignite v2.9.1. we have a 2 node
>cluster,
>recently we are upgraded ignite 2.7.6 to ignite 2.9.1 after using this
>version we are facing issues frequently.
>
>[req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@bb5fb96]
>javax.cache.CacheException: class
>org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>Failed to execute the cache operation (all partition owners have left the
>grid, partition data has been lost) [cacheName=CREDENTIALS, partition=759,
>key=UserKeyCacheObjectImpl [part=759,
>val=SqlCredentialAdo-Althing-energyapp, hasValBytes=false]]
>at
>org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
>[ignite-core-2.9.1.jar:2.9.1]
>at
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>[?:1.8.0_181]
>at
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>[?:1.8.0_181]
>at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
>Caused by:
>org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>Failed to execute the cache operation (all partition owners have left the
>grid, partition data has been lost) [cacheName=CREDENTIALS, partition=759,
>key=UserKeyCacheObjectImpl [part=759,
>val=SqlCredentialAdo-Althing-energyapp, hasValBytes=false]]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.getAsync0(GridDhtAtomicCache.java:1471)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$1600(GridDhtAtomicCache.java:141)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$16.apply(GridDhtAtomicCache.java:477)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$16.apply(GridDhtAtomicCache.java:475)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.asyncOp(GridDhtAtomicCache.java:779)
>~[ignite-core-2.9.1.jar:2.9.1]
>at
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.getAsync(GridDhtAtomicCache.java:475)

Re[4]: Performance of Ignite as key-value datastore. C++ Thin Client.

2021-01-29 Thread Zhenya Stanilovsky


I suppose you can estimate the time to load 1.2 million keys with 2 Ignite nodes right now:
just load 1200 keys 1000 times; I believe the resulting time would be similar.
And I hope you understand the difference between an LSM tree and a B-tree, and between an
embedded and a distributed system.
I can't give you a fix date right now; maybe next week things will be clearer.
 
thanks ! 
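A rough sketch of the suggested estimate (Java; it assumes a TRANSACTIONAL cache named "test" already exists, and the 500-byte values match the ~600 MB / 1.2 M keys discussed in the thread):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

public class LoadEstimate {
    // Loads 1000 batches of 1200 keys (~1.2 million keys total) and returns
    // the elapsed wall-clock time; the sum over batches approximates the
    // cost of one big putAll.
    public static long estimateMillis(Ignite ignite) {
        IgniteCache<Long, byte[]> cache = ignite.cache("test");
        long start = System.nanoTime();
        for (int batch = 0; batch < 1000; batch++) {
            Map<Long, byte[]> chunk = new HashMap<>();
            for (int i = 0; i < 1200; i++)
                chunk.put((long) batch * 1200 + i, new byte[500]); // ~600 MB total
            cache.putAll(chunk); // implicit transaction per batch
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```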
 
>First off, thanks to you and Zhenya for your support
>
>I'm afraid that's not easy, as key's data are unrelated.
>
>As we are in an evaluation phase, we would prefer to wait for the fix and
>perform single node benchmarking in the meanwhile... Instead of grouping
>keys, we might split the big transaction into several smaller ones, as
>Zhenya suggested, but that will not give us the real timings anyway.
>
>According to Zhenya's mail, the problem has been identified and it seems you
>can solve it. I know it's difficult, but.. could you give a rough fix date?
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
 
 
 
 

Re[2]: Detecting checkpoints programmatically

2021-01-28 Thread Zhenya Stanilovsky

hi, check these links.
[1]  https://ignite.apache.org/docs/latest/monitoring-metrics/metrics
[2]  
https://www.zylk.net/en/web-2-0/blog/-/blogs/kibana-dashboard-for-monitoring-liferay-jvm-via-jmx
 
 
>Hi Zhenya,
> 
>It seems those events are not so useful as you suggest, [2]
>In the case of [1], how are these events accessible from a C# Ignite client?
> 
>We don't currently use zabbix, but do use fluentd/kibana.
> 
>Thanks,
>Raymond.
> 
>   
>On Fri, Jan 29, 2021 at 4:54 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>hi ! hope this would be helpful [1].
>>Mentioned events is other one [2] = not useful for you.
>>Also zabbix can parse the logs, you can obtain all cp info there.
>> 
>>[1]  https://issues.apache.org/jira/browse/IGNITE-13845
>>[2]  
>>https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/checkpoint/CheckpointSpi.html
>>   
>>>I want to make my system detect when a check point has occurred for 
>>>monitoring and control purposes.
>>> 
>>>I found the following events defined in the IA source:
>>> 
>>>/**
>>> * All checkpoint events. This array can be directly passed into
>>> * { @link   IgniteEvents # localListen(IgnitePredicate, int...) } 
>>>method to
>>> * subscribe to all checkpoint events.
>>> *
>>> *  @see  CheckpointEvent
>>> */
>>> public   static   final   int []  EVTS_CHECKPOINT  = {
>>>EVT_CHECKPOINT_SAVED,
>>>EVT_CHECKPOINT_LOADED,
>>> EVT_CHECKPOINT_REMOVED
>>>};
>>> 
>>>Are these the appropriate events to listen to?
>>> 
>>>Thanks,
>>>Raymond.
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>raymond_wil...@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re[2]: Performance of Ignite as key-value datastore. C++ Thin Client.

2021-01-28 Thread Zhenya Stanilovsky


I also confirm this issue; I will add more information to the ticket. It seems we can fix it.
jjimeno, thanks for highlighting it. The only workaround for now is to somehow decrease the
number of keys enlisted in the transaction.
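A sketch of that workaround: splitting one huge transactional putAll into several smaller transactions (Java; the batch size is illustrative, and note that this trades single-transaction atomicity for speed):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class BatchedPut {
    public static void putInBatches(Ignite ignite, IgniteCache<Long, byte[]> cache,
                                    Map<Long, byte[]> data, int batchSize) {
        Iterator<Map.Entry<Long, byte[]>> it = data.entrySet().iterator();
        while (it.hasNext()) {
            // Enlist at most batchSize keys per transaction instead of 1.2M.
            Map<Long, byte[]> chunk = new LinkedHashMap<>();
            while (it.hasNext() && chunk.size() < batchSize) {
                Map.Entry<Long, byte[]> e = it.next();
                chunk.put(e.getKey(), e.getValue());
            }
            // Same concurrency/isolation as in the original test.
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC,
                    TransactionIsolation.READ_COMMITTED)) {
                cache.putAll(chunk);
                tx.commit();
            }
        }
    }
}
```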
 
>I am able to reproduce the problem - it occurs with any client, thick or thin.
> 
>With one node the transaction completes in a reasonable time,
>but with two nodes it is orders of magnitude slower.
> 
>We'll investigate and get back to you.
>   
>On Thu, Jan 28, 2021 at 9:48 AM jjimeno < jjim...@omp.com > wrote:
>>Hi again,
>>
>>As a test, I just disabled persistence in both nodes.  The already mentioned
>>transaction of 1.2 million keys and 600MB in size takes 298sec.
>>
>>Remember that for one single node and persistence enabled it takes 70sec, so
>>just adding a second node makes the test more than 4 times slower.
>>
>>Is this, really, the performance that Ignite can offer? Please, don't take
>>me wrong, I'm just asking, not criticizing.  I want to be sure I'm not doing
>>anything wrong and the timings I get are the expected ones...
>>
>>Thanks!
>>
>>
>>
>>--
>>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
 
 
 
 

Re: Detecting checkpoints programmatically

2021-01-28 Thread Zhenya Stanilovsky


Hi! I hope this will be helpful [1].
The mentioned events belong to a different mechanism [2] and are not useful for you.
Also, Zabbix can parse the logs, so you can obtain all checkpoint info there.
 
[1]  https://issues.apache.org/jira/browse/IGNITE-13845
[2]  
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/checkpoint/CheckpointSpi.html
 
>I want to make my system detect when a check point has occurred for monitoring 
>and control purposes.
> 
>I found the following events defined in the IA source:
> 
>/**
> * All checkpoint events. This array can be directly passed into
> * { @link   IgniteEvents # localListen(IgnitePredicate, int...) } method 
>to
> * subscribe to all checkpoint events.
> *
> *  @see  CheckpointEvent
> */
> public   static   final   int []  EVTS_CHECKPOINT  = {
>EVT_CHECKPOINT_SAVED,
>EVT_CHECKPOINT_LOADED,
> EVT_CHECKPOINT_REMOVED
>};
> 
>Are these the appropriate events to listen to?
> 
>Thanks,
>Raymond.
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re[2]: Performance of Ignite as key-value datastore. C++ Thin Client.

2021-01-27 Thread Zhenya Stanilovsky

1. Yes, I know about this message; don't pay attention to it (just remove the setting).
Ilya, is this parameter safe to use?
2. Please rerun your nodes with the -DIGNITE_QUIET=false JVM parameter (the logs will be
more informative).
3. You have a long-running transaction in your logs (IGNITE_QUIET will help detect why it
hangs); you can configure a default timeout [1].
4. If the transaction hangs one more time, please attach the new logs and bump this thread.
 
[1] 
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/TransactionConfiguration.html#setDefaultTxTimeout-long-
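A sketch of setting the default transaction timeout mentioned in [1] (the 30-second value is illustrative):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class TxTimeoutConfig {
    public static IgniteConfiguration withTxTimeout() {
        return new IgniteConfiguration()
            .setTransactionConfiguration(new TransactionConfiguration()
                // Abort any transaction running longer than 30 s instead of
                // letting it hang indefinitely.
                .setDefaultTxTimeout(30_000));
    }
}
```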
> 
>> 
>>> 
Hi Zhenya,

Thanks for your quick response

1. If it is not set, the following message appears on node's startup:
[13:22:33] Message queue limit is set to 0 which may lead to potential OOMEs
when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to
message queues growth on sender and receiver sides.
2. No improvement.
3. I've got the same question :) Here are the cluster logs until it
crashes: ignite-9f92ab96.log
< http://apache-ignite-users.70518.x6.nabble.com/file/t3059/ignite-9f92ab96.log >
4. Yes, I'm aware about that since I reported it... but there is only one
transaction in the test
5. Yes, 4Gb is large enough. There is only one single transaction of 600MB
6. Yes, in fact, that's why I modified the page size



--
Sent from:  http://apache-ignite-users.70518.x6.nabble.com/ 
>>> 
>>> 
>>> 
>>> 

Re: Performance of Ignite as key-value datastore. C++ Thin Client.

2021-01-27 Thread Zhenya Stanilovsky



Hi jjimeno.
*  I doubt the « messageQueueLimit » setting is correct; please remove it and try once
more.
*  16k per page is questionable; I suggest trying the default.
*  Why is there no progress with 2 nodes? Can you attach the logs from the 2 nodes
showing the transaction degradation?
*  Take [1] into account.
*  You may be colliding with page replacement; is 4 GB enough for all your data?
*  Have you read the performance tips? [2]
 
[1] https://issues.apache.org/jira/browse/IGNITE-13997
[2] https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/general-perf-tips
>Hi everyone,
>
>For our project, we have the next requirements:
>
>- One single cache 
>- Ability to lock a list of cache entries.
>- Large transactions. A typical one is to commit 1.2 million keys (a single
>PutAll call) with a total size of around 600MB.
>- Persistence
>
>In our proof of concept, we've got everything implemented and running:
>
>- One server node 2.9.1. Native persistence is enabled for default data
>region.
>- One client application using one Ignite C++ Thin Client to connect to the
>server node.
>- Both, server and client, are in the same machine by now.
>
>With this scenario, we're currently evaluating Ignite vs RocksDB. We would
>really like to choose Ignite because of its scalability, but we are facing a
>problem related to its performance:
>
>In Ignite, one single transaction commit of 1.2 million keys and 600MB takes
>around 70 seconds to complete, while RocksDB takes no more than 12 seconds.
>Moreover, if a second local node is added to the cluster, the application is
>not even able to complete the transaction (it stops after 10 minutes)
>
>Default data region's page size has been modified up to 16KB. Persistence
>has been enabled.
>Cache is PARTITIONED with TRANSACTIONAL atomicity mode.
>Because of the requirement about locking keys, performed transaction is
>PESSIMISTIC + READ_COMMITTED.
>
>The rest of the configuration values are the default ones (No backup,
>PRIMARY_SYNC, no OnHeapCache, etc)
>
>So, my questions are:
>
>- Taking the requirements into account, is Ignite a good option?
>- Are those time values what one might expect?
>- If not, any advice to improve them?
>
>Configuration files for both server nodes have been attached. Thanks
>everyone in advance for your help and time,
>
>first-node.xml
>< http://apache-ignite-users.70518.x6.nabble.com/file/t3059/first-node.xml >
>second-node.xml
>< http://apache-ignite-users.70518.x6.nabble.com/file/t3059/second-node.xml >
>
>Josemari
>
>
>
 
 
 
 

Re[4]: Ever increasing startup times as data grow in persistent storage

2021-01-13 Thread Zhenya Stanilovsky




 
>Is there an API version of the cluster deactivation?
 
https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/Apache.Ignite.Core.Tests/Cache/PersistentStoreTestObsolete.cs#L131
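 
For Java, what the linked .NET test does can be sketched as follows. This is an illustrative sketch (the config path is made up); `cluster().state(ClusterState.INACTIVE)` is available since Ignite 2.9, while older versions use the deprecated `cluster().active(false)`:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;

public class GracefulStop {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("config/server.xml"); // path is illustrative

        // Deactivate the cluster before stopping the nodes, so the
        // subsequent start does not have to replay the WAL.
        ignite.cluster().state(ClusterState.INACTIVE);

        ignite.close();
    }
}
```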
 
>On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>
>> 
>>>Hi Zhenya,
>>> 
>>>Thanks for confirming performing checkpoints more often will help here.
>>Hi Raymond !
>>> 
>>>I have established this configuration so will experiment with settings 
>>>little.
>>> 
>>>On a related note, is there any way to automatically trigger a checkpoint, 
>>>for instance as a pre-shutdown activity?
>> 
>>If you shutdown your cluster gracefully = with deactivation [1] further start 
>>will not trigger wal readings.
>> 
>>[1]  
>>https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
>> 
>>>Checkpoints seem to be much faster than the process of applying WAL updates.
>>> 
>>>Raymond.  
>>>On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>>wrote:
>>>>
>>>>
>>>>
>>>> 
>>>>>We have noticed that startup time for our server nodes has been slowly 
>>>>>increasing in time as the amount of data stored in the persistent store 
>>>>>grows.
>>>>> 
>>>>>This appears to be closely related to recovery of WAL changes that were 
>>>>>not checkpointed at the time the node was stopped.
>>>>> 
>>>>>After enabling debug logging we see that the WAL file is scanned, and for 
>>>>>every cache, all partitions in the cache are examined, and if there are 
>>>>>any uncommitted changes in the WAL file then the partition is updated (I 
>>>>>assume this requires reading of the partition itself as a part of this 
>>>>>process).
>>>>> 
>>>>>We now have ~150Gb of data in our persistent store and we see WAL update 
>>>>>times between 5-10 minutes to complete, during which the node is 
>>>>>unavailable.
>>>>> 
>>>>>We use fairly large WAL files (512Mb) and use 10 segments, with WAL 
>>>>>archiving enabled.
>>>>> 
>>>>>We anticipate data in persistent storage to grow to Terabytes, and if the 
>>>>>startup time continues to grow as storage grows then this makes deploys 
>>>>>and restarts difficult.
>>>>> 
>>>>>Until now we have been using the default checkpoint time out of 3 minutes 
>>>>>which may mean we have significant uncheckpointed data in the WAL files. 
>>>>>We are moving to 1 minute checkpoint but don't yet know if this improve 
>>>>>startup times. We also use the default 1024 partitions per cache, though 
>>>>>some partitions may be large. 
>>>>> 
>>>>>Can anyone confirm this is expected behaviour and recommendations for 
>>>>>resolving it?
>>>>> 
>>>>>Will reducing checkpointing intervals help?
>>>> 
>>>>yes, it will help. Check  
>>>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>>>Is the entire content of a partition read while applying WAL changes?
>>>> 
>>>>don`t think so, may be someone else suggest here?
>>>>>Does anyone else have this issue?
>>>>> 
>>>>>Thanks,
>>>>>Raymond.
>>>>> 
>>>>>  --
>>>>>
>>>>>Raymond Wilson
>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>>>raymond_wil...@trimble.com
>>>>>         
>>>>> 
>>>> 
>>>> 
>>>> 
>>>>  
>>> 
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>raymond_wil...@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re[2]: Ever increasing startup times as data grow in persistent storage

2021-01-12 Thread Zhenya Stanilovsky



 
>Hi Zhenya,
> 
>Thanks for confirming performing checkpoints more often will help here.
Hi Raymond !
> 
>I have established this configuration so will experiment with settings little.
> 
>On a related note, is there any way to automatically trigger a checkpoint, for 
>instance as a pre-shutdown activity?
 
If you shut down your cluster gracefully (that is, with deactivation [1]), a 
subsequent start will not trigger WAL reading.
 
[1] 
https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
 
>Checkpoints seem to be much faster than the process of applying WAL updates.
> 
>Raymond.  
>On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>
>>
>> 
>>>We have noticed that startup time for our server nodes has been slowly 
>>>increasing in time as the amount of data stored in the persistent store 
>>>grows.
>>> 
>>>This appears to be closely related to recovery of WAL changes that were not 
>>>checkpointed at the time the node was stopped.
>>> 
>>>After enabling debug logging we see that the WAL file is scanned, and for 
>>>every cache, all partitions in the cache are examined, and if there are any 
>>>uncommitted changes in the WAL file then the partition is updated (I assume 
>>>this requires reading of the partition itself as a part of this process).
>>> 
>>>We now have ~150Gb of data in our persistent store and we see WAL update 
>>>times between 5-10 minutes to complete, during which the node is unavailable.
>>> 
>>>We use fairly large WAL files (512Mb) and use 10 segments, with WAL 
>>>archiving enabled.
>>> 
>>>We anticipate data in persistent storage to grow to Terabytes, and if the 
>>>startup time continues to grow as storage grows then this makes deploys and 
>>>restarts difficult.
>>> 
>>>Until now we have been using the default checkpoint time out of 3 minutes 
>>>which may mean we have significant uncheckpointed data in the WAL files. We 
>>>are moving to 1 minute checkpoint but don't yet know if this improve startup 
>>>times. We also use the default 1024 partitions per cache, though some 
>>>partitions may be large. 
>>> 
>>>Can anyone confirm this is expected behaviour and recommendations for 
>>>resolving it?
>>> 
>>Will reducing checkpointing intervals help?
>> 
>>yes, it will help. Check  
>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>Is the entire content of a partition read while applying WAL changes?
>> 
>>don`t think so, may be someone else suggest here?
>>>Does anyone else have this issue?
>>> 
>>>Thanks,
>>>Raymond.
>>> 
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>raymond_wil...@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re: Ever increasing startup times as data grow in persistent storage

2021-01-12 Thread Zhenya Stanilovsky




 
>We have noticed that startup time for our server nodes has been slowly 
>increasing in time as the amount of data stored in the persistent store grows.
> 
>This appears to be closely related to recovery of WAL changes that were not 
>checkpointed at the time the node was stopped.
> 
>After enabling debug logging we see that the WAL file is scanned, and for 
>every cache, all partitions in the cache are examined, and if there are any 
>uncommitted changes in the WAL file then the partition is updated (I assume 
>this requires reading of the partition itself as a part of this process).
> 
>We now have ~150Gb of data in our persistent store and we see WAL update times 
>between 5-10 minutes to complete, during which the node is unavailable.
> 
>We use fairly large WAL files (512Mb) and use 10 segments, with WAL archiving 
>enabled.
> 
>We anticipate data in persistent storage to grow to Terabytes, and if the 
>startup time continues to grow as storage grows then this makes deploys and 
>restarts difficult.
> 
>Until now we have been using the default checkpoint timeout of 3 minutes, 
>which may mean we have significant uncheckpointed data in the WAL files. We 
>are moving to a 1 minute checkpoint but don't yet know if this improves startup 
>times. We also use the default 1024 partitions per cache, though some 
>partitions may be large. 
> 
>Can anyone confirm this is expected behaviour and recommendations for 
>resolving it?
> 
>Will reducing checkpointing intervals help?
 
Yes, it will help. Check 
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
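
The checkpoint interval the poster mentions is set via `DataStorageConfiguration`. A minimal sketch of the change discussed (1 minute instead of the 3-minute default; the class name is mine):

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointTuning {
    /** More frequent checkpoints mean less un-checkpointed WAL to replay on start. */
    public static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.setCheckpointFrequency(60_000); // 60 s instead of the 180 s default

        return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
    }
}
```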
>Is the entire content of a partition read while applying WAL changes?
 
I don't think so; maybe someone else can suggest here?
>Does anyone else have this issue?
> 
>Thanks,
>Raymond.
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Re[2]: Questions related to check pointing

2021-01-12 Thread Zhenya Stanilovsky



 
>Hi Zhenya,
> 
>Thanks for the pointers - I will look into them.
> 
>I have been doing some additional reading into this and discovered we are 
>using a 4.0 NFS client, which seems to be the first 'no-no'; we will look at 
>updating to use the 4.1 NFS client.
> 
>We have modified our default timer cadence for checkpointing from 3 minutes to 
>1 minutes, which seems to be giving us better performance. We will continue to 
>measure the impact that has.
> 
>Lastly, I'm planning to merge our two data regions into a single region to 
>reduce 'too many dirty pages' checkpoints due to high write activity in a 
>small region.
> 
>Would using larger pages sizes (eg: 16kb) be useful with EFS?
Hi, Raymond.
I have no information about that; it would be helpful if you could share your findings.
Thanks!
> 
>Raymond.  
>On Tue, Jan 12, 2021 at 8:27 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>hope it would be helpful too:
>>https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs
>>https://docs.aws.amazon.com/efs/latest/ug/storage-classes.html
>>> 
>>>Hi Zhenya,
>>> 
>>>The matching checkpoint finished log is this:
>>> 
>>>2020-12-15 19:07:39,253 [106] INF [MutableCacheComputeServer]  Checkpoint 
>>>finished [cpId=e2c31b43-44df-43f1-b162-6b6cefa24e28, pages=33421, 
>>>markPos=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>>>walSegmentsCleared=0, walSegmentsCovered=[], markDuration=218ms, 
>>>pagesWrite=1150ms, fsync=37104ms, total=38571ms]  
>>> 
>>>Regards your comment that 3/4 of pages in whole data region need to be dirty 
>>>to trigger this, can you confirm this is 3/4 of the maximum size of the data 
>>>region, or of the currently used size (eg: if Min is 1Gb, and Max is 4Gb, 
>>>and used is 2Gb, would 1.5Gb of dirty pages trigger this?)
>>> 
>>>Are data regions independently checkpointed, or are they checkpointed as a 
>>>whole, so that a 'too many dirty pages' condition affects all data regions 
>>>in terms of write blocking?
>>> 
>>>Can you comment on my query regarding should we set Min and Max size of the 
>>>data region to be the same? Ie: Don't bother with growing the data region 
>>>memory use on demand, just allocate the maximum?  
>>> 
>>>In terms of the checkpoint lock hold time metric, of the checkpoints quoting 
>>>'too many dirty pages' there is one instance apart from the one I have 
>>>provided earlier violating this limit, ie:
>>> 
>>>2020-12-17 18:56:39,086 [104] INF [MutableCacheComputeServer] Checkpoint 
>>>started [checkpointId=e9ccf0ca-f813-4f91-ac93-5483350fdf66, 
>>>startPtr=FileWALPointer [idx=7164, fileOff=389224517, len=196573], 
>>>checkpointBeforeLockTime=276ms, checkpointLockWait=0ms, 
>>>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=39ms, 
>>>walCpRecordFsyncDuration=254ms, writeCheckpointEntryDuration=32ms, 
>>>splitAndSortCpPagesDuration=276ms, pages=77774, reason=' too many dirty 
>>>pages ']  
>>> 
>>>This is out of a population of 16 instances I can find. The remainder have 
>>>lock times of 16-17ms.
>>> 
>>>Regarding writes of pages to the persistent store, does the check pointing 
>>>system parallelise writes across partitions to maximise throughput? 
>>> 
>>>Thanks,
>>>Raymond.
>>> 
>>>   
>>>On Thu, Dec 31, 2020 at 1:17 AM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>>wrote:
>>>>
>>>>All write operations will be blocked for this timeout :  
>>>>checkpointLockHoldTime=32ms (Write Lock holding) If you observe huge amount 
>>>>of such messages :    reason=' too many dirty pages ' may be you need to 
>>>>store some data in not persisted regions for example or reduce indexes (if 
>>>>you use them). And please attach other part of cp message starting with : 
>>>>Checkpoint finished.
>>>>
>>>>
>>>> 
>>>>>In ( 
>>>>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>>> ), there is a mention of a dirty pages limit that is a factor that can 
>>>>>trigger check points.
>>>>> 
>>>>>I also found this issue:  
>>>>>http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html
>>>>> where "too many dirty pages" is a reason given for initiating a 
>>>>>checkpoint.
>>>>>

Re: Questions related to check pointing

2021-01-11 Thread Zhenya Stanilovsky

Hope these are helpful too:
https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs
https://docs.aws.amazon.com/efs/latest/ug/storage-classes.html
> 
>Hi Zhenya,
> 
>The matching checkpoint finished log is this:
> 
>2020-12-15 19:07:39,253 [106] INF [MutableCacheComputeServer]  Checkpoint 
>finished [cpId=e2c31b43-44df-43f1-b162-6b6cefa24e28, pages=33421, 
>markPos=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>walSegmentsCleared=0, walSegmentsCovered=[], markDuration=218ms, 
>pagesWrite=1150ms, fsync=37104ms, total=38571ms]  
> 
>Regards your comment that 3/4 of pages in whole data region need to be dirty 
>to trigger this, can you confirm this is 3/4 of the maximum size of the data 
>region, or of the currently used size (eg: if Min is 1Gb, and Max is 4Gb, and 
>used is 2Gb, would 1.5Gb of dirty pages trigger this?)
> 
>Are data regions independently checkpointed, or are they checkpointed as a 
>whole, so that a 'too many dirty pages' condition affects all data regions in 
>terms of write blocking?
> 
>Can you comment on my query regarding should we set Min and Max size of the 
>data region to be the same? Ie: Don't bother with growing the data region 
>memory use on demand, just allocate the maximum?  
> 
>In terms of the checkpoint lock hold time metric, of the checkpoints quoting 
>'too many dirty pages' there is one instance apart from the one I have 
>provided earlier violating this limit, ie:
> 
>2020-12-17 18:56:39,086 [104] INF [MutableCacheComputeServer] Checkpoint 
>started [checkpointId=e9ccf0ca-f813-4f91-ac93-5483350fdf66, 
>startPtr=FileWALPointer [idx=7164, fileOff=389224517, len=196573], 
>checkpointBeforeLockTime=276ms, checkpointLockWait=0ms, 
>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=39ms, 
>walCpRecordFsyncDuration=254ms, writeCheckpointEntryDuration=32ms, 
>splitAndSortCpPagesDuration=276ms, pages=77774, reason=' too many dirty pages 
>']  
> 
>This is out of a population of 16 instances I can find. The remainder have 
>lock times of 16-17ms.
> 
>Regarding writes of pages to the persistent store, does the check pointing 
>system parallelise writes across partitions to maximise throughput? 
> 
>Thanks,
>Raymond.
> 
>   
>On Thu, Dec 31, 2020 at 1:17 AM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>All write operations will be blocked for this timeout :  
>>checkpointLockHoldTime=32ms (Write Lock holding) If you observe huge amount 
>>of such messages :    reason=' too many dirty pages ' may be you need to 
>>store some data in not persisted regions for example or reduce indexes (if 
>>you use them). And please attach other part of cp message starting with : 
>>Checkpoint finished.
>>
>>
>> 
>>>In ( 
>>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>> ), there is a mention of a dirty pages limit that is a factor that can 
>>>trigger check points.
>>> 
>>>I also found this issue:  
>>>http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html
>>> where "too many dirty pages" is a reason given for initiating a checkpoint.
>>> 
>>>After reviewing our logs I found this: (one example)
>>> 
>>>2020-12-15 19:07:00,999 [106] INF [MutableCacheComputeServer] Checkpoint 
>>>started [checkpointId=e2c31b43-44df-43f1-b162-6b6cefa24e28, 
>>>startPtr=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>>>checkpointBeforeLockTime=99ms, checkpointLockWait=0ms, 
>>>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=32ms, 
>>>walCpRecordFsyncDuration=113ms, writeCheckpointEntryDuration=27ms, 
>>>splitAndSortCpPagesDuration=45ms, pages=33421, reason=' too many dirty pages 
>>>']   
>>> 
>>>Which suggests we may have the issue where writes are frozen until the check 
>>>point is completed.
>>> 
>>>Looking at the AI 2.8.1 source code, the dirty page limit fraction appears 
>>>to be 0.1 (10%), via this entry in GridCacheDatabaseSharedManager.java:
>>> 
>>>/**
>>> * Threshold to calculate limit for pages list on-heap caches.
>>> * 
>>> * Note: When a checkpoint is triggered, we need some amount of page 
>>>memory to store pages list on-heap cache.
>>> * If a checkpoint is triggered by "too many dirty pages" reason and 
>>>pages list cache is rather big, we can get
>>>* {@code IgniteOutOfMemoryException}. To prevent this, we can limit the 
>>>total amount of cached page list buckets,
>>> 

Re: Questions related to check pointing

2021-01-10 Thread Zhenya Stanilovsky


fsync=37104ms is too long for this page count (pages=33421); please check how 
you can improve fsync performance on your storage.
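
To see why that fsync time is flagged, here is a quick back-of-the-envelope calculation (self-contained sketch; it assumes the default 4 KB page size, and the method name is mine). With the figures from the log line quoted below, the effective fsync throughput comes out to roughly 3.5 MB/s, which is very slow for any local disk:

```java
public class FsyncThroughput {
    /** Effective fsync throughput in MB/s for a checkpoint. */
    public static double throughputMbPerSec(long pages, int pageSizeBytes, long fsyncMs) {
        double mb = pages * (double) pageSizeBytes / (1024 * 1024);
        return mb / (fsyncMs / 1000.0);
    }

    public static void main(String[] args) {
        // Figures from the checkpoint log: pages=33421, fsync=37104ms, 4 KB page.
        double mbps = throughputMbPerSec(33421, 4096, 37104);
        System.out.printf("~%.1f MB/s effective fsync throughput%n", mbps); // roughly 3.5 MB/s
    }
}
```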

 
>
>
>--- Forwarded message ---
>From: "Raymond Wilson" < raymond_wil...@trimble.com >
>To: user < user@ignite.apache.org >, "Zhenya Stanilovsky" < arzamas...@mail.ru >
>Cc:
>Subject: Re: Re[4]: Questions related to check pointing
>Date: Thu, 31 Dec 2020 01:46:20 +0300
> 
>Hi Zhenya,
> 
>The matching checkpoint finished log is this:
> 
>2020-12-15 19:07:39,253 [106] INF [MutableCacheComputeServer]  Checkpoint 
>finished [cpId=e2c31b43-44df-43f1-b162-6b6cefa24e28, pages=33421, 
>markPos=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>walSegmentsCleared=0, walSegmentsCovered=[], markDuration=218ms, 
>pagesWrite=1150ms, fsync=37104ms, total=38571ms]  
> 
>Regards your comment that 3/4 of pages in whole data region need to be dirty 
>to trigger this, can you confirm this is 3/4 of the maximum size of the data 
>region, or of the currently used size (eg: if Min is 1Gb, and Max is 4Gb, and 
>used is 2Gb, would 1.5Gb of dirty pages trigger this?)
> 
>Are data regions independently checkpointed, or are they checkpointed as a 
>whole, so that a 'too many dirty pages' condition affects all data regions in 
>terms of write blocking?
> 
>Can you comment on my query regarding should we set Min and Max size of the 
>data region to be the same? Ie: Don't bother with growing the data region 
>memory use on demand, just allocate the maximum?  
> 
>In terms of the checkpoint lock hold time metric, of the checkpoints quoting 
>'too many dirty pages' there is one instance apart from the one I have 
>provided earlier violating this limit, ie:
> 
>2020-12-17 18:56:39,086 [104] INF [MutableCacheComputeServer] Checkpoint 
>started [checkpointId=e9ccf0ca-f813-4f91-ac93-5483350fdf66, 
>startPtr=FileWALPointer [idx=7164, fileOff=389224517, len=196573], 
>checkpointBeforeLockTime=276ms, checkpointLockWait=0ms, 
>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=39ms, 
>walCpRecordFsyncDuration=254ms, writeCheckpointEntryDuration=32ms, 
>splitAndSortCpPagesDuration=276ms, pages=77774, reason=' too many dirty pages 
>']  
> 
>This is out of a population of 16 instances I can find. The remainder have 
>lock times of 16-17ms.
> 
>Regarding writes of pages to the persistent store, does the check pointing 
>system parallelise writes across partitions to maximise throughput? 
> 
>Thanks,
>Raymond.
> 
>   
>On Thu, Dec 31, 2020 at 1:17 AM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>All write operations will be blocked for this timeout :  
>>checkpointLockHoldTime=32ms (Write Lock holding) If you observe huge amount 
>>of such messages :    reason=' too many dirty pages ' may be you need to 
>>store some data in not persisted regions for example or reduce indexes (if 
>>you use them). And please attach other part of cp message starting with : 
>>Checkpoint finished.
>>
>>
>> 
>>>In ( 
>>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>> ), there is a mention of a dirty pages limit that is a factor that can 
>>>trigger check points.
>>> 
>>>I also found this issue:  
>>>http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html
>>> where "too many dirty pages" is a reason given for initiating a checkpoint.
>>> 
>>>After reviewing our logs I found this: (one example)
>>> 
>>>2020-12-15 19:07:00,999 [106] INF [MutableCacheComputeServer] Checkpoint 
>>>started [checkpointId=e2c31b43-44df-43f1-b162-6b6cefa24e28, 
>>>startPtr=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>>>checkpointBeforeLockTime=99ms, checkpointLockWait=0ms, 
>>>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=32ms, 
>>>walCpRecordFsyncDuration=113ms, writeCheckpointEntryDuration=27ms, 
>>>splitAndSortCpPagesDuration=45ms, pages=33421, reason=' too many dirty pages 
>>>']   
>>> 
>>>Which suggests we may have the issue where writes are frozen until the check 
>>>point is completed.
>>> 
>>>Looking at the AI 2.8.1 source code, the dirty page limit fraction appears 
>>>to be 0.1 (10%), via this entry in GridCacheDatabaseSharedManager.java:
>>> 
>>>/**
>>> * Threshold to calculate limit for pages list on-heap caches.
>>> * 
>>> * Note: When a checkpoint is triggered, we need some amount of page 
>>>memory to store pages list on-heap cache.
>>

Re[4]: Questions related to check pointing

2020-12-30 Thread Zhenya Stanilovsky


All write operations will be blocked for this duration: 
checkpointLockHoldTime=32ms (write lock holding). If you observe a huge number 
of such messages with reason=' too many dirty pages ', you may need to store 
some data in non-persistent regions, for example, or reduce indexes (if you use 
them). Please also attach the other part of the checkpoint message, starting 
with: Checkpoint finished.
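
The "store some data in non-persistent regions" suggestion can be sketched like this in Java (the region and cache names are illustrative, not from the poster's setup):

```java
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;

public class VolatileRegionSetup {
    public static DataStorageConfiguration storage() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        // Extra in-memory region: writes here never become dirty persistent pages,
        // so they cannot contribute to "too many dirty pages" checkpoints.
        storageCfg.setDataRegionConfigurations(
            new DataRegionConfiguration()
                .setName("volatileRegion")
                .setPersistenceEnabled(false));
        return storageCfg;
    }

    public static CacheConfiguration<Long, byte[]> scratchCache() {
        // Assign a cache holding reloadable/derived data to the in-memory region.
        return new CacheConfiguration<Long, byte[]>("scratch")
            .setDataRegionName("volatileRegion");
    }
}
```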


 
>In ( 
>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
> ), there is a mention of a dirty pages limit that is a factor that can 
>trigger check points.
> 
>I also found this issue:  
>http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html
> where "too many dirty pages" is a reason given for initiating a checkpoint.
> 
>After reviewing our logs I found this: (one example)
> 
>2020-12-15 19:07:00,999 [106] INF [MutableCacheComputeServer] Checkpoint 
>started [checkpointId=e2c31b43-44df-43f1-b162-6b6cefa24e28, 
>startPtr=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>checkpointBeforeLockTime=99ms, checkpointLockWait=0ms, 
>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=32ms, 
>walCpRecordFsyncDuration=113ms, writeCheckpointEntryDuration=27ms, 
>splitAndSortCpPagesDuration=45ms, pages=33421, reason=' too many dirty pages 
>']   
> 
>Which suggests we may have the issue where writes are frozen until the check 
>point is completed.
> 
>Looking at the AI 2.8.1 source code, the dirty page limit fraction appears to 
>be 0.1 (10%), via this entry in GridCacheDatabaseSharedManager.java:
> 
>/**
> * Threshold to calculate limit for pages list on-heap caches.
> * 
> * Note: When a checkpoint is triggered, we need some amount of page 
>memory to store pages list on-heap cache.
> * If a checkpoint is triggered by "too many dirty pages" reason and pages 
>list cache is rather big, we can get
>* {@code IgniteOutOfMemoryException}. To prevent this, we can limit the total 
>amount of cached page list buckets,
> * assuming that checkpoint will be triggered if no more then 3/4 of pages 
>will be marked as dirty (there will be
> * at least 1/4 of clean pages) and each cached page list bucket can be 
>stored to up to 2 pages (this value is not
> * static, but depends on PagesCache.MAX_SIZE, so if PagesCache.MAX_SIZE > 
>PagesListNodeIO#getCapacity it can take
> * more than 2 pages). Also some amount of page memory needed to store 
>page list metadata.
> */
> private static final double PAGE_LIST_CACHE_LIMIT_THRESHOLD = 0.1;
> 
>This raises two questions: 
> 
>1. The data region where most writes are occurring has 4Gb allocated to it, 
>though it is permitted to start at a much lower level. 4Gb should be 1,000,000 
>pages, 10% of which should be 100,000 dirty pages.
> 
>The 'limit holder' is calculated like this:
> 
>/**
> * @return Holder for page list cache limit for given data region.
> */
>public AtomicLong pageListCacheLimitHolder(DataRegion dataRegion) {
>    if (dataRegion.config().isPersistenceEnabled()) {
>        return pageListCacheLimits.computeIfAbsent(dataRegion.config().getName(), name -> new AtomicLong(
>            (long)(((PageMemoryEx)dataRegion.pageMemory()).totalPages() * PAGE_LIST_CACHE_LIMIT_THRESHOLD)));
>    }
>    return null;
>}
> 
>... but I am unsure if totalPages() is referring to the current size of the 
>data region, or the size it is permitted to grow to. ie: Could the 'dirty page 
>limit' be a sliding limit based on the growth of the data region? Is it better 
>to set the initial and maximum sizes of data regions to be the same number?
> 
>2. We have two data regions, one supporting inbound arrival of data (with low 
>numbers of writes), and one supporting storage of processed results from the 
>arriving data (with many more writes). 
> 
>The block on writes due to the number of dirty pages appears to affect all 
>data regions, not just the one which has violated the dirty page limit. Is 
>that correct? If so, is this something that can be improved?
> 
>Thanks,
>Raymond.
>   
>On Wed, Dec 30, 2020 at 9:17 PM Raymond Wilson < raymond_wil...@trimble.com > 
>wrote:
>>I'm working on getting automatic JVM thread stack dumping occurring if we 
>>detect long delays in put (PutIfAbsent) operations. Hopefully this will 
>>provide more information.  
>>On Wed, Dec 30, 2020 at 7:48 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>wrote:
>>>
>>>Don`t think so, checkpointing work perfectly well already before this fix.
>>>Need

Re[4]: Questions related to check pointing

2020-12-30 Thread Zhenya Stanilovsky

The relevant code runs from here:

if (checkpointReadWriteLock.getReadHoldCount() > 1 || safeToUpdatePageMemories() || checkpointer.runner() == null)
    break;
else {
    CheckpointProgress pages = checkpointer.scheduleCheckpoint(0, "too many dirty pages");

and nearby you can see this:

maxDirtyPages = throttlingPlc != ThrottlingPolicy.DISABLED
    ? pool.pages() * 3L / 4
    : Math.min(pool.pages() * 2L / 3, cpPoolPages);

Thus, if 3/4 of the pages in the whole data region are dirty, this checkpoint will be triggered.
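 
A small self-contained sketch of that arithmetic (the method name is mine; the 3/4 and 2/3 fractions mirror the snippet quoted above). For a 4 GB region with the default 4 KB page size, that works out to ~1,048,576 pages, so the forced checkpoint fires at ~786,432 dirty pages:

```java
public class DirtyPagesThreshold {
    /**
     * Mirrors the quoted snippet: with throttling enabled the checkpoint is
     * forced at 3/4 of the region's pages dirty; with throttling disabled the
     * limit is min(2/3 of the pages, cpPoolPages).
     */
    public static long maxDirtyPages(long poolPages, boolean throttlingEnabled, long cpPoolPages) {
        return throttlingEnabled
            ? poolPages * 3L / 4
            : Math.min(poolPages * 2L / 3, cpPoolPages);
    }

    public static void main(String[] args) {
        // A 4 GB region with the default 4 KB page size holds 1_048_576 pages.
        long poolPages = 4L * 1024 * 1024 * 1024 / 4096;
        System.out.println("Pages in region: " + poolPages);
        System.out.println("Forced-checkpoint threshold (3/4): " + maxDirtyPages(poolPages, true, 0));
    }
}
```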
 
>In ( 
>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
> ), there is a mention of a dirty pages limit that is a factor that can 
>trigger check points.
> 
>I also found this issue:  
>http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html
> where "too many dirty pages" is a reason given for initiating a checkpoint.
> 
>After reviewing our logs I found this: (one example)
> 
>2020-12-15 19:07:00,999 [106] INF [MutableCacheComputeServer] Checkpoint 
>started [checkpointId=e2c31b43-44df-43f1-b162-6b6cefa24e28, 
>startPtr=FileWALPointer [idx=6339, fileOff=243287334, len=196573], 
>checkpointBeforeLockTime=99ms, checkpointLockWait=0ms, 
>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=32ms, 
>walCpRecordFsyncDuration=113ms, writeCheckpointEntryDuration=27ms, 
>splitAndSortCpPagesDuration=45ms, pages=33421, reason=' too many dirty pages 
>']   
> 
>Which suggests we may have the issue where writes are frozen until the check 
>point is completed.
> 
>Looking at the AI 2.8.1 source code, the dirty page limit fraction appears to 
>be 0.1 (10%), via this entry in GridCacheDatabaseSharedManager.java:
> 
>/**
> * Threshold to calculate limit for pages list on-heap caches.
> * 
> * Note: When a checkpoint is triggered, we need some amount of page 
>memory to store pages list on-heap cache.
> * If a checkpoint is triggered by "too many dirty pages" reason and pages 
>list cache is rather big, we can get
>* {@code IgniteOutOfMemoryException}. To prevent this, we can limit the total 
>amount of cached page list buckets,
> * assuming that checkpoint will be triggered if no more then 3/4 of pages 
>will be marked as dirty (there will be
> * at least 1/4 of clean pages) and each cached page list bucket can be 
>stored to up to 2 pages (this value is not
> * static, but depends on PagesCache.MAX_SIZE, so if PagesCache.MAX_SIZE > 
>PagesListNodeIO#getCapacity it can take
> * more than 2 pages). Also some amount of page memory needed to store 
>page list metadata.
> */
> private static final double PAGE_LIST_CACHE_LIMIT_THRESHOLD = 0.1;
> 
>This raises two questions: 
> 
>1. The data region where most writes are occurring has 4Gb allocated to it, 
>though it is permitted to start at a much lower level. 4Gb should be 1,000,000 
>pages, 10% of which should be 100,000 dirty pages.
> 
>The 'limit holder' is calculated like this:
> 
>/**
> * @return Holder for page list cache limit for given data region.
> */
>public AtomicLong pageListCacheLimitHolder(DataRegion dataRegion) {
>    if (dataRegion.config().isPersistenceEnabled()) {
>        return pageListCacheLimits.computeIfAbsent(dataRegion.config().getName(), name -> new AtomicLong(
>            (long)(((PageMemoryEx)dataRegion.pageMemory()).totalPages() * PAGE_LIST_CACHE_LIMIT_THRESHOLD)));
>    }
>    return null;
>}
> 
>... but I am unsure if totalPages() is referring to the current size of the 
>data region, or the size it is permitted to grow to. ie: Could the 'dirty page 
>limit' be a sliding limit based on the growth of the data region? Is it better 
>to set the initial and maximum sizes of data regions to be the same number?
> 
>2. We have two data regions, one supporting inbound arrival of data (with low 
>numbers of writes), and one supporting storage of processed results from the 
>arriving data (with many more writes). 
> 
>The block on writes due to the number of dirty pages appears to affect all 
>data regions, not just the one which has violated the dirty page limit. Is 
>that correct? If so, is this something that can be improved?
> 
>Thanks,
>Raymond.
>   
>On Wed, Dec 30, 2020 at 9:17 PM Raymond Wilson < raymond_wil...@trimble.com > 
>wrote:
>>I'm working on getting automatic JVM thread stack dumping occurring if we 
>>detect long delays in put (PutIfAbsent) operations. Hopefully this will 
>>provide more information.  
>>On Wed, Dec 30, 2020 at 7:48 PM Zhenya Stanilovsky < arzamas...@mail.r

Re[2]: Questions related to check pointing

2020-12-29 Thread Zhenya Stanilovsky


Don't think so — checkpointing worked perfectly well before this fix.
We need additional info to start digging into your problem; can you share the Ignite logs
somewhere?
 
>I noticed an entry in the Ignite 2.9.1 changelog:
>*  Improved checkpoint concurrent behaviour
>I am having trouble finding the relevant Jira ticket for this in the 2.9.1 
>Jira area at  
>https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
> 
>Perhaps this change may improve the checkpointing issue we are seeing?
> 
>Raymond.
>   
>On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson < raymond_wil...@trimble.com > 
>wrote:
>>Hi Zhenya,
>> 
>>1. We currently use AWS EFS for primary storage, with provisioned IOPS to 
>>provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage 
>>(with at least 5 nodes writing to it, including WAL and WAL archive), so we 
>>are not saturating the EFS interface. We use the default page size 
>>(experiments with larger page sizes showed instability when checkpointing due 
>>to free page starvation, so we reverted to the default size). 
>> 
>>2. Thanks for the detail, we will look for that in thread dumps when we can 
>>create them.
>> 
>>3. We are using the default CP buffer size, which is max(256Mb, 
>>DataRegionSize / 4) according to the Ignite documentation, so this should 
>>have more than enough checkpoint buffer space to cope with writes. As 
>>additional information, the cache which is displaying very slow writes is in 
>>a data region with relatively slow write traffic. There is a primary 
>>(default) data region with large write traffic, and the vast majority of 
>>pages being written in a checkpoint will be for that default data region.
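As an aside, the checkpoint buffer default documented in the Ignite/GridGain persistence tuning guide is tiered rather than a flat max(256Mb, DataRegionSize / 4). A sketch under that assumption (the class name is illustrative, not Ignite code):

```java
public class CpBufferSize {
    static final long MB = 1024L * 1024;
    static final long GB = 1024L * MB;

    /** Default checkpoint buffer size per the documented tiers (an assumption, not verified against source). */
    static long defaultCpBufferSize(long regionSize) {
        if (regionSize < GB)
            return Math.min(256 * MB, regionSize); // small regions: up to 256 MB
        if (regionSize < 8 * GB)
            return regionSize / 4;                 // mid-size regions: a quarter of the region
        return 2 * GB;                             // large regions: capped at 2 GB
    }
}
```

Under this rule a 4 GB region gets a 1 GB buffer (matching the figure quoted later in this thread), while a 128 MB region gets 128 MB rather than 256 MB.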
>> 
>>4. Yes, this is very surprising. Anecdotally from our logs it appears write 
>>traffic into the low write traffic cache is blocked during checkpoints.
>> 
>>Thanks,
>>Raymond.
>>    
>>   
>>On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>>wrote:
>>>*  Additionally to Ilya's reply, you can check the vendor's page for additional
>>>info; everything on that page is applicable to Ignite too [1]. Increasing the
>>>thread number leads to concurrent IO usage, thus if you have something like
>>>NVMe it's up to you, but in the case of SAS it would possibly be better to
>>>reduce this param.
>>>*  The log will show you something like:
>>>Parking thread=%Thread name% for timeout(ms)= %time% and the appropriate:
>>>Unparking thread=
>>>*  No additional logging of cp buffer usage is provided. The cp buffer needs
>>>to be more than 10% of the overall persistent DataRegions size.
>>>*  90 seconds or longer — seems like a problem in IO or system tuning; that's
>>>a very bad score, I'm afraid.
>>>[1]  
>>>https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>>>
>>>
>>> 
>>>>Hi,
>>>> 
>>>>We have been investigating some issues which appear to be related to 
>>>>checkpointing. We currently use the IA 2.8.1 with the C# client.
>>>> 
>>>>I have been trying to gain clarity on how certain aspects of the Ignite 
>>>>configuration relate to the checkpointing process:
>>>> 
>>>>1. Number of check pointing threads. This defaults to 4, but I don't 
>>>>understand how it applies to the checkpointing process. Are more threads 
>>>>generally better (eg: because it makes the disk IO parallel across the 
>>>>threads), or does it only have a positive effect if you have many data 
>>>>storage regions? Or something else? If this could be clarified in the 
>>>>documentation (or a pointer to it which Google has not yet found), that 
>>>>would be good.
>>>> 
>>>>2. Checkpoint frequency. This is defaulted to 180 seconds. I was thinking 
>>>>that reducing this time would result in smaller less disruptive check 
>>>>points. Setting it to 60 seconds seems pretty safe, but is there a 
>>>>practical lower limit that should be used for use cases with new data 
>>>>constantly being added, eg: 5 seconds, 10 seconds?
>>>> 
>>>>3. Write exclusivity constraints during checkpointing. I understand that 
>>>>while a checkpoint is occurring ongoing writes will be supported into the 
>>>>caches being check pointed, and if those are writes to existing pages then 
>>>>those will be duplicated into t

Re[2]: Feature request: On demand thread dumps from Ignite

2020-12-28 Thread Zhenya Stanilovsky


I understand that you use C#; of course, you can use var process = new Process()
for it and then a jstack -l PID call.
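A Java equivalent of that suggestion, as a hedged sketch (it assumes a JDK with jstack on the PATH; the C# System.Diagnostics.Process version is analogous):

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.List;

public class JstackRunner {
    /** Builds a jstack command line targeting the current JVM's PID. */
    static List<String> jstackCommand() {
        // RuntimeMXBean name is "pid@hostname" on HotSpot; on Java 9+ you could
        // use ProcessHandle.current().pid() instead.
        long pid = Long.parseLong(ManagementFactory.getRuntimeMXBean().getName().split("@")[0]);
        return Arrays.asList("jstack", "-l", Long.toString(pid));
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", jstackCommand()));
        // To actually capture the dump (requires jstack on the PATH):
        // new ProcessBuilder(jstackCommand()).inheritIO().start();
    }
}
```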


 
>Hi Zhenya,
> 
>We use the IA C# client (deployed in a .NET Core implementation using 
>containers on AWS EKS), so this makes it hard for us to run Java closures from 
>the C# client, which is why a client interface capability would be useful!
> 
>Thanks,
>Raymond.
>   
>On Tue, Dec 29, 2020 at 8:38 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>You can call it through the compute API [1], I suppose.
>> 
>>[1]  
>>https://ignite.apache.org/docs/latest/distributed-computing/distributed-computing
>>   
>>>Many of the discussion threads here will generate a request for the Java 
>>>Ignite thread dump to help triage an issue.
>>> 
>>>This is not difficult to do with command line Java tooling if you can easily 
>>>access the server running the node. However, access to those nodes may not 
>>>be simple (especially in production) and requires hands-on manual 
>>>intervention to produce.
>>> 
>>>There does not seem to be a way for an Ignite client (eg: the C# client we 
>>>use in our implementation) to ask the local Ignite node to dump the thread 
>>>state to the log based on conditions the client itself may determine.
>>> 
>>>If this is actually the case then please point me at it:) Otherwise, is this 
>>>something worth adding to the backlog?
>>> 
>>>Thanks,
>>>Raymond.
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>+64-21-2013317  Mobile
>>>raymond_wil...@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
> 
 
 
 
 

Re: Feature request: On demand thread dumps from Ignite

2020-12-28 Thread Zhenya Stanilovsky


You can call it through the compute API [1], I suppose.
 
[1]  
https://ignite.apache.org/docs/latest/distributed-computing/distributed-computing
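The body of such a compute task could also capture the dump in-process via the JMX thread bean rather than shelling out to jstack. A sketch (the IgniteRunnable wiring is only indicated in a comment, since it depends on your deployment):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class ThreadDumpTask {
    /** Renders a full thread dump of the local JVM as a string. */
    static String dumpThreads() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true))
            sb.append(ti.toString());
        return sb.toString();
    }

    public static void main(String[] args) {
        // In an Ignite deployment this body could live inside an IgniteRunnable
        // broadcast via ignite.compute().broadcast(...), logging the dump on each node.
        System.out.println(dumpThreads());
    }
}
```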
 
>Many of the discussion threads here will generate a request for the Java 
>Ignite thread dump to help triage an issue.
> 
>This is not difficult to do with command line Java tooling if you can easily 
>access the server running the node. However, access to those nodes may not be 
>simple (especially in production) and requires hands-on manual intervention to 
>produce.
> 
>There does not seem to be a way for an Ignite client (eg: the C# client we use 
>in our implementation) to ask the local Ignite node to dump the thread state 
>to the log based on conditions the client itself may determine.
> 
>If this is actually the case then please point me at it:) Otherwise, is this 
>something worth adding to the backlog?
> 
>Thanks,
>Raymond.
> 
 
 
 
 

Re: Questions related to check pointing

2020-12-28 Thread Zhenya Stanilovsky

*  Additionally to Ilya's reply, you can check the vendor's page for additional
info; everything on that page is applicable to Ignite too [1]. Increasing the
thread number leads to concurrent IO usage, thus if you have something like
NVMe it's up to you, but in the case of SAS it would possibly be better to
reduce this param.
*  The log will show you something like:
Parking thread=%Thread name% for timeout(ms)= %time% and the appropriate:
Unparking thread=
*  No additional logging of cp buffer usage is provided. The cp buffer needs to
be more than 10% of the overall persistent DataRegions size.
*  90 seconds or longer — seems like a problem in IO or system tuning; that's a
very bad score, I'm afraid.
[1] 
https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning


 
>Hi,
> 
>We have been investigating some issues which appear to be related to 
>checkpointing. We currently use the IA 2.8.1 with the C# client.
> 
>I have been trying to gain clarity on how certain aspects of the Ignite 
>configuration relate to the checkpointing process:
> 
>1. Number of check pointing threads. This defaults to 4, but I don't 
>understand how it applies to the checkpointing process. Are more threads 
>generally better (eg: because it makes the disk IO parallel across the 
>threads), or does it only have a positive effect if you have many data storage 
>regions? Or something else? If this could be clarified in the documentation 
>(or a pointer to it which Google has not yet found), that would be good.
> 
>2. Checkpoint frequency. This is defaulted to 180 seconds. I was thinking that 
>reducing this time would result in smaller less disruptive check points. 
>Setting it to 60 seconds seems pretty safe, but is there a practical lower 
>limit that should be used for use cases with new data constantly being added, 
>eg: 5 seconds, 10 seconds?
> 
>3. Write exclusivity constraints during checkpointing. I understand that while 
>a checkpoint is occurring ongoing writes will be supported into the caches 
>being check pointed, and if those are writes to existing pages then those will 
>be duplicated into the checkpoint buffer. If this buffer becomes full or 
>stressed then Ignite will throttle, and perhaps block, writes until the 
>checkpoint is complete. If this is the case then Ignite will emit logging 
>(warning or informational?) that writes are being throttled.
> 
>We have cases where simple puts to caches (a few requests per second) are 
>taking up to 90 seconds to execute when there is an active check point 
>occurring, where the check point has been triggered by the checkpoint timer. 
>When a checkpoint is not occurring the time to do this is usually in the 
>milliseconds. The checkpoints themselves can take 90 seconds or longer, and 
>are updating up to 30,000-40,000 pages, across a pair of data storage regions, 
>one with 4Gb in-memory space allocated (which should be 1,000,000 pages at the 
>standard 4kb page size), and one small region with 128Mb. There is no 
>'throttling' logging being emitted that we can tell, so the checkpoint buffer 
>(which should be 1Gb for the first data region and 256 Mb for the second 
>smaller region in this case) does not look like it can fill up during the 
>checkpoint.
> 
>It seems like the checkpoint is affecting the put operations, but I don't 
>understand why that may be given the documented checkpointing process, and the 
>checkpoint itself (at least via Informational logging) is not advertising any 
>restrictions.
> 
>Thanks,
>Raymond.
 
 
 
 

RE: Loosing cache data when persistence is enabled

2020-12-10 Thread Zhenya Stanilovsky


If nodes are partially offline but still in the baseline, as was mentioned earlier,
you really can lose your data.
Check, for example:
[1]  https://apacheignite.readme.io/docs/partition-loss-policies
[2]  https://www.gridgain.com/docs/latest/developers-guide/partition-loss-policy
 
If all nodes are online you can't hit such a case, I believe.  
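To make partition loss explicit rather than silent, the loss policy described in those links can be set per cache. A sketch assuming Ignite 2.x (the cache name is illustrative):

```java
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class LossPolicyConfig {
    /** Cache config that fails reads/writes to lost partitions instead of serving partial data. */
    static CacheConfiguration<Integer, String> cacheConfig() {
        CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("myCache");
        ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
        return ccfg;
    }
}
```

With READ_WRITE_SAFE, operations on lost partitions throw until the partitions are reset, which makes baseline-related data loss visible immediately.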
>
>
>--- Forwarded message ---
>From: "BEELA GAYATRI" < beela.gaya...@tcs.com >
>To: "user@ignite.apache.org" < user@ignite.apache.org >, VincentCE < 
>v...@cephei.com >
>Cc:
>Subject: RE: Loosing cache data when persistence is enabled
>Date: Thu, 10 Dec 2020 16:54:57 +0300
> 
>Hi,
> 
>  The nodes are not getting out of the baseline topology. They are showing 
>offline; after a restart of the nodes, data is getting deleted from the persistence 
>folders (sometimes the data is deleted and sometimes it is not). PFA web console.
> 
>Sent from  Mail for Windows 10
> 
>From:  VincentCE
>Sent:  Thursday, December 10, 2020 6:10 PM
>To:  user@ignite.apache.org
>Subject:  Re: Loosing cache data when persistence is enabled
> 
>"External email. Open with Caution"
>
>Hi Beela,
>
>could the root cause be that you had changes in your baseline topology when
>you saw the data loss? According to my understanding and the answer I got on
>my question
>http://apache-ignite-users.70518.x6.nabble.com/Ignite-persistence-Data-of-node-is-lost-after-being-excluded-from-baseline-topology-td34675.html
>some weeks ago the data of nodes that got excluded from the baseline
>topology will be deleted for consistency reasons.
>
>This is just an immediate guess from my side, I did not have a look into
>your attachments so far.
>
>Best,
>Vincent
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/
> 
> 
>
>  
 
 
 
 

   
--
 
 
 

Re[2]: [2.9.0]NPE on invoke IgniteCache.destroy()

2020-12-08 Thread Zhenya Stanilovsky

Looks like we deactivate pageMem concurrently with cp pageWrite; please could someone
file a ticket with the appropriate logs.

  
>Wednesday, December 9, 2020, 4:52 +03:00 from 38797715 <38797...@qq.com>:
> 
>Hi Ilya,
>This issue is not easy to reproduce.
>However, judging from the exception stack, the issue may be related to the 
>checkpoint process during the destruction of the cache.
>On 2020/12/9, 9:33 AM, Ilya Kazakov wrote:
>>Hello! Can you provide some details, or show some short reproducer?
>> 
>>-
>>Ilya Kazakov  
>>пн, 7 дек. 2020 г. в 21:31, 38797715 < 38797...@qq.com >:
>>>Hi community,
>>>Call IgniteCache.destroy Method, NPE appears,logs are as follows:
>>>2020-12-07 17:32:18.870 [] [exchange-worker-# 54 %tradecore%]  INFO 
>>>o.a.i.i.e.time - Started exchange init [topVer=AffinityTopologyVersion 
>>>[topVer= 1 , minorTopVer= 279 ], crd= true , evt=DISCOVERY_CUSTOM_EVT, 
>>>evtNode= 320935f6-3516-4b0b-9e5f-e80768696522 , 
>>>customEvt=DynamicCacheChangeBatch [id= 
>>>1f2a90e2671-562b8f51-fa6a-4094-928c-6976ce87614a , reqs=ArrayList 
>>>[DynamicCacheChangeRequest [cacheName=PksQuota, hasCfg= false , nodeId= 
>>>320935f6-3516-4b0b-9e5f-e80768696522 , clientStartOnly= false , stop= true , 
>>>destroy= false , disabledAfterStartfalse]], exchangeActions=ExchangeActions 
>>>[startCaches= null , stopCaches=[PksQuota], startGrps=[], 
>>>stopGrps=[PksQuota, destroy= true ], resetParts= null , stateChangeRequest= 
>>>null ], startCaches= false ], allowMerge= false , exchangeFreeSwitch= false ]
>>>2020-12-07 17:32:18.873 [] [exchange-worker-# 54 %tradecore%]  INFO 
>>>o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Finished waiting for 
>>>partition release future [topVer=AffinityTopologyVersion [topVer= 1 , 
>>>minorTopVer= 279 ], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
>>>2020-12-07 17:32:18.873 [] [exchange-worker-# 54 %tradecore%]  INFO 
>>>o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Finished waiting for 
>>>partitions release latch: ServerLatch [permits= 0 , pendingAcks=HashSet [], 
>>>super=CompletableLatch [id=CompletableLatchUid [id=exchange, 
>>>topVer=AffinityTopologyVersion [topVer= 1 , minorTopVer= 279 
>>>2020-12-07 17:32:18.873 [] [exchange-worker-# 54 %tradecore%]  INFO 
>>>o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Finished waiting for 
>>>partition release future [topVer=AffinityTopologyVersion [topVer= 1 , 
>>>minorTopVer= 279 ], waitTime=0ms, futInfo=NA, mode=LOCAL]
>>>2020-12-07 17:32:19.037 [] [exchange-worker-# 54 %tradecore%]  INFO 
>>>o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - 
>>>finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer= 1 , 
>>>minorTopVer= 279 ], resVer=AffinityTopologyVersion [topVer= 1 , minorTopVer= 
>>>279 ]]
>>>2020-12-07 17:32:19.438 [] [exchange-worker-# 54 %tradecore%]  INFO 
>>>o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Finish exchange future 
>>>[startVer=AffinityTopologyVersion [topVer= 1 , minorTopVer= 279 ], 
>>>resVer=AffinityTopologyVersion [topVer= 1 , minorTopVer= 279 ], err= null , 
>>>rebalanced= true , wasRebalanced= true ]
>>>2020-12-07 17:32:20.870 [] [ db-c heckpoint-thread-# 75 %tradecore%]  INFO 
>>>o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Checkpoint started 
>>>[checkpointId= f001018a-100e-4154-98c9-547dabf5015f , 
>>>startPtr=FileWALPointer [idx= 16 , fileOff= 1059360696 , len= 4770137 ], 
>>>checkpointBeforeLockTime=549ms, checkpointLockWait=0ms, 
>>>checkpointListenersExecuteTime=529ms, checkpointLockHoldTime=853ms, 
>>>walCpRecordFsyncDuration=17ms, writeCheckpointEntryDuration=7ms, 
>>>splitAndSortCpPagesDuration=4ms, pages= 10775 , reason= 'caches stop' ]
>>>2020-12-07 17:32:21.255 [] [checkpoint-runner-# 79 %tradecore%]  WARN 
>>>o.a.i.i.p.c.p.GridCacheDatabaseSharedManager -  1 checkpoint pages were not 
>>>written yet due to unsuccessful page write lock acquisition and will be 
>>>retried
>>>2020-12-07 17:32:21.261 [] [exchange-worker-# 54 %tradecore%]  ERROR 
>>>o.a.i.i.p.c.GridCacheProcessor - Failed to wait for checkpoint finish during 
>>>cache stop.
>>>org.apache.ignite.IgniteCheckedException : Compound exception for 
>>>CountDownFuture.
>>>at 
>>>org.apache.ignite.internal.util.future.CountDownFuture.addError(CountDownFuture.java:72)
>>> ~[ignite-core-2.9.0.jar!/:2.9.0]
>>>at 
>>>org.apache.ignite.internal.util.future.CountDownFuture.onDone(CountDownFuture.java:46)
>>> ~[ignite-core-2.9.0.jar!/:2.9.0]
>>>at 
>>>org.apache.ignite.internal.util.future.CountDownFuture.onDone(CountDownFuture.java:28)
>>> ~[ignite-core-2.9.0.jar!/:2.9.0]
>>>at 
>>>org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:478)
>>> ~[ignite-core-2.9.0.jar!/:2.9.0]
>>>at 
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4546)
>>> ~[ignite-core-2.9.0.jar!/:2.9.0]
>>>at 
>>>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> 

Re: Native persistence enabled: Loading the same data again takes longer than initially

2020-12-01 Thread Zhenya Stanilovsky

VincentCE, how large are your uploaded portions?
If you upload more than ${IGNITE_DEFAULT_REGION}, then after a restart you will get
page replacements (some data pages will be re-read from disk into memory). Please
check [1] closely, for example.
If the uploading speed matters to you, just increase the data region size if
possible, and disable swap and overcommitting [2].
Remove writeThrottlingEnabled — I believe the default is more suitable here.
 
[1]  
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood
[2]  https://apacheignite.readme.io/docs/durable-memory-tuning
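Increasing the region size as suggested looks roughly like this in code — a sketch only; the region name and the 10 GB figure are illustrative, not taken from this thread:

```java
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;

public class RegionSizing {
    /** Persistent data region sized large enough to hold the working set in memory. */
    static DataStorageConfiguration storageConfig() {
        long tenGb = 10L * 1024 * 1024 * 1024;
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("Default_Region")   // illustrative name
            .setPersistenceEnabled(true)
            .setInitialSize(tenGb)       // matching initial to max avoids growth-driven
            .setMaxSize(tenGb);          // resizing (and surprises) under load
        return new DataStorageConfiguration().setDefaultDataRegionConfiguration(region);
    }
}
```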
 
>
>From: VincentCE < v...@cephei.com >
>To:  user@ignite.apache.org
>Cc:
>Subject: Native persistence enabled: Loading the same data again takes
>longer than initially
>Date: Mon, 30 Nov 2020 16:55:17 +0300
>
>Hi!
>
>I made the observation that loading data initially into the ignite-cluster
>with native persistence enabled is usually a lot faster than subsequent
>loadings of the same data, that is 30 min (initially) vs 52 min (4th time)
>for 170 GB of data.
>
>Does this indicate bad configurations from our side or is this expected
>behaviour? In fact we are quite happy with the initial loading speed of our
>data but will generally need to overwrite significant parts of it (which is
>the reasons for my question). I already tried to apply all the suggestion
>mentioned in  https://apacheignite.readme.io/docs/durable-memory-tuning .
>
>We are using ignite 2.8.1 currently. Here is our data storage
>configuration:
>
>
><bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>    <property name="..." value="#{2 * 1000 * 1000 * 1000}"/>
>    <property name="..." value="#{10L * 1024 * 1024 * 1024}"/>
>    <property name="...">
>        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>            <property name="..." value="#{2L * 1024 * 1024 * 1024}"/>
>            ...
>        </bean>
>    </property>
>    ...
></bean>
>
>Thanks in advance!
>
>
>
>--
>Sent from:  http://apache-ignite-users.70518.x6.nabble.com/