[ https://issues.apache.org/jira/browse/IGNITE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Ozerov updated IGNITE-6025: ------------------------------------ Labels: performance (was: ) > SQL: improve CREATE INDEX performance > ------------------------------------- > > Key: IGNITE-6025 > URL: https://issues.apache.org/jira/browse/IGNITE-6025 > Project: Ignite > Issue Type: Improvement > Components: persistence, sql > Affects Versions: 2.1 > Reporter: Vladimir Ozerov > Labels: performance > Fix For: 2.2 > > > When bulk data load is performed, it is considered a good practice to bypass > certain facilities of underlying engine to achieve greater throughput. E.g., > TX or MVCC managers can by bypassed, global table lock can be held instead of > fine-grained page/row/field locks, etc.. > Another widely used technique is to drop table indexes and re-build them form > scratch when load finished. This is now possible with help of {{CREATE > INDEX}} command which could be executed in runtime. However, experiments with > large data sets shown that {{DROP INDEX}} -> load -> {{CREATE INDEX}} is > *much slower* than simple load to indexed table. The reasons for this are > both inefficient implementation of {{CREATE INDEX}} command, as well as some > storage architectural decisions. > 1) Index is created by a single thread; probably multiple threads could speed > it up and the cost of higher CPU usage. But how to split iteration between > several threads? > 2) Cache iteration happens through primary index. So we read an index page, > but to read entries we have to navigate to data page. If single data page is > referenced from N places in the index tree, we will read it N times. This > leads to bad cache locality in memory-only case, and to excessive disk IO in > case of persistence. This could be avoided, if we iterate over data pages, > and index all data from a single page at once. > 3) Another widely used technique is building BTree in bottom-up approach. > That is, we sort all data rows first, then build leafs, then go one level up, > etc.. This approach could give us the best build performance possible, > especially if p.2 is implemented. However it is the most difficult > optimization, which will require implementation of spilling to disk if result > set is too large. -- This message was sent by Atlassian JIRA (v6.4.14#64029)