[ https://issues.apache.org/jira/browse/IGNITE-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yury Babak reopened IGNITE-7437: -------------------------------- > Partition based dataset implementation > -------------------------------------- > > Key: IGNITE-7437 > URL: https://issues.apache.org/jira/browse/IGNITE-7437 > Project: Ignite > Issue Type: New Feature > Components: ml > Reporter: Yury Babak > Assignee: Anton Dmitriev > Priority: Major > Fix For: 2.5 > > > We want to implement our dataset based on entire partition instead of key > sets. > > *A main idea behind the partition based datasets is the classic > [MapReduce.|https://en.wikipedia.org/wiki/MapReduce]* > The most important advantage of the MapReduce is an ability to perform > computations on a data distributed across the cluster without involving > significant data transmissions over the network. This idea is adopted in the > partition based datasets in the following way: > 1. Every dataset consists of partitions. > 2. Partitions consists of a _context_ built on top of the Apache Ignite > Cache and _recoverable data_ stored locally on every node. > 3. Computations needed to be performed on a dataset splits on Map operations > which executes on every partition and Reduce operations which reduces results > of Map operations into one final result. > _Why partitions have been selected as a building block of dataset and > learning contain instead of cluster node?_ > One of the fundamental ideas of Apache Ignite Cache is that partitions are > atomic, which means that they cannot be splitted between multiply nodes. As > result in case of rebalancing or node failure partition will be recovered on > another node with the same data it contained on the previous node. > In case of machine learning algorithm it's very important because most of the > ML algorithms are iterative and require some context maintained between > iterations. This context cannot be split or merged and should be maintained > in the consistent state during the whole learning process. > *Another idea behind the partition based datasets is that we need to have > data (in every partition) in > [BLAS-|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms]like > format as much as it possible.* > [BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms] and > [CUDA|https://en.wikipedia.org/wiki/CUDA] makes machine learning 100x faster > and more reliable than algorithms based on self-written linear algebra > subroutines and it means that not using BLAS is a recipe for disaster. In > other words we need to keep data in BLAS-like format at any price. -- This message was sent by Atlassian JIRA (v7.6.3#76005)