[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484595#comment-14484595 ]
Yi Liu commented on SPARK-4590:
-------------------------------

Agree that a parameter server is a good approach to solving the memory/network I/O issues when features are high-dimensional. We are also interested in this and have done some work. IndexedRDD is a key-value store built on RDDs, which does not seem to be the right abstraction to support a PS. An RDD just represents a logical dataset that can always be reconstructed from its lineage; while the user can provide some hints, there is no mechanism to control exactly how the data is stored, distributed, replicated, etc., which a NoSQL-style system like a PS needs. [~rezazadeh], could you describe your plan in more detail?

In general, there are several ways to add a parameter server to Spark:
# Define parameter server interfaces in Spark and allow users to plug in custom implementations, but do not ship an implementation in Spark itself (a rough interface sketch follows at the end of this message).
# Besides the interfaces, also add a default implementation in Spark. Similar to the Spark shuffle service, each parameter server node would run as a service in the Spark Worker (standalone mode) or as an auxiliary service in the YARN NodeManager (Spark on YARN). The parameter server would be decentralized (no master node) and implemented in Java. Since a parameter server is memory-intensive and in this approach it shares a process with the Spark Worker or YARN NM, it is better to use off-heap memory for the parameter store.
# Similar to #2, provide a default implementation in Spark, but extend the Spark BlockManager to distribute the features across executors and to support efficiently getting/updating a subset or range of features. The Java heap can be used here.

> Early investigation of parameter server
> ---------------------------------------
>
>          Key: SPARK-4590
>          URL: https://issues.apache.org/jira/browse/SPARK-4590
>      Project: Spark
>   Issue Type: Brainstorming
>   Components: ML, MLlib
>     Reporter: Xiangrui Meng
>     Assignee: Reza Zadeh
>
> In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
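To make option 1 above more concrete, here is a minimal sketch of what a pluggable parameter-server client interface might look like. None of these names (ParameterServerClient, LocalPSClient, get/getRange/update) exist in Spark; they are assumptions for illustration only, and the local backend is just enough to exercise the interface. A real backend (the shuffle-service-style or BlockManager-based options above) would shard the parameter vector across PS nodes.

{code:scala}
// Hypothetical sketch of option 1: a pluggable parameter-server client
// interface. Not an existing Spark API; it only shows the kind of API a GLM
// solver could code against so the backing store stays swappable.
trait ParameterServerClient {
  /** Pull current values for a sparse set of feature indices. */
  def get(indices: Array[Long]): Array[Double]

  /** Pull a contiguous range of features [start, end). */
  def getRange(start: Long, end: Long): Array[Double]

  /** Push sparse additive deltas (e.g., a scaled gradient) to the shared model. */
  def update(indices: Array[Long], deltas: Array[Double]): Unit
}

// Trivial single-JVM backend, only for exercising the interface locally;
// a real backend would distribute the parameter vector across PS nodes.
class LocalPSClient(numFeatures: Int) extends ParameterServerClient {
  private val store = new Array[Double](numFeatures)

  def get(indices: Array[Long]): Array[Double] = indices.map(i => store(i.toInt))

  def getRange(start: Long, end: Long): Array[Double] = store.slice(start.toInt, end.toInt)

  def update(indices: Array[Long], deltas: Array[Double]): Unit =
    indices.zip(deltas).foreach { case (i, d) => store(i.toInt) += d }
}

object PSExample {
  def main(args: Array[String]): Unit = {
    val ps: ParameterServerClient = new LocalPSClient(numFeatures = 10)
    // A worker processing a mini-batch touches only the features that actually
    // occur in its examples, instead of receiving the full model via broadcast.
    val activeIndices = Array(1L, 4L, 7L)
    val weights = ps.get(activeIndices)
    val gradient = weights.map(w => 0.1 * (1.0 - w)) // stand-in for a real GLM gradient
    ps.update(activeIndices, gradient.map(-0.5 * _)) // push scaled negative gradient
    println(ps.getRange(0L, 10L).mkString(", "))
  }
}
{code}

The intent is that a GLM worker pulls only the feature indices present in its mini-batch and pushes sparse deltas back, rather than broadcasting and aggregating the entire model, which is what limits the current solvers to roughly 10 million features.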