[ https://issues.apache.org/jira/browse/S2GRAPH-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453523#comment-16453523 ]
ASF GitHub Bot commented on S2GRAPH-206:
----------------------------------------

GitHub user SteamShon opened a pull request:

    https://github.com/apache/incubator-s2graph/pull/162

    [S2GRAPH-206]: Generalize machine learning model serving.

    - abstract traversing edges as Fetcher interface.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SteamShon/incubator-s2graph S2GRAPH-206

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-s2graph/pull/162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #162

----
commit 72c35a39e9f739d6df941d86db546811c9cb8a2a
Author: DO YUNG YOON <steamshon@...>
Date:   2018-04-26T05:26:06Z

    - abstract traversing edges as Fetcher interface.

----

> Generalize machine learning model serving.
> ------------------------------------------
>
>                 Key: S2GRAPH-206
>                 URL: https://issues.apache.org/jira/browse/S2GRAPH-206
>             Project: S2Graph
>          Issue Type: New Feature
>          Components: s2core
>            Reporter: DOYUNG YOON
>            Assignee: DOYUNG YOON
>            Priority: Major
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> One of the top use cases of an OLTP graph database is, arguably,
> recommendation.
> Let's see how item-based collaborative filtering (item-based CF) can be
> served as a graph query.
> # fetch the user's history as the edges to clicked items.
> # fetch each item's similar items.
> There are a few problems with the naive approach above, since we need to
> insert many item pairs as edges (N^2, where N is the total number of items).
> Even though bulk load can update a large number of edges in a stable manner,
> the user still needs to generate the similarity matrix, which is often very
> large.
> Also, the approach above does not generalize to other model-based approaches.
> For example, a user who wants to use matrix factorization needs to work
> through the following steps.
> # dump the user's history as raw records.
> # convert the user history into a matrix by creating a dictionary map
> between raw values and sequences.
> # factorize the user history, usually using alternating least squares (ALS),
> which yields the factorized model U, I.
> # run k-nearest-neighbor search for each item on I, which yields an array of
> similar item sequences per item sequence.
> # convert each item sequence and its array of similar item sequences back to
> an item and its array of similar items, using the dictionary created in
> step 2.
> # bulk load the item-item similarities as edges.
> Note that these steps become tedious.
> I think the steps above can be simplified if S2Graph supports a more
> generalized way of serving machine learning models.
> Steps 1, 2, and 3 must inevitably be done by whoever focuses on building
> better models, but steps 4, 5, and 6 can be automated.
> To automate steps 4, 5, and 6, we need to provide ways to load ML models from
> a remote location and integrate the pre-loaded ML model into the graph query
> structure.
> So logically, the original query should be changed into the following.
> # fetch the user's history as the edges to clicked items.
> # convert the clicked items into item sequences.
> # run the k-nearest-neighbor search on the pre-loaded ML model and get an
> array of similar item sequences.
> # convert the array of similar item sequences into an array of similar items
> using the pre-loaded ML model's dictionary.
>
> One might argue that supporting machine learning serving is not S2Graph's
> focus.
> The reason behind this suggestion is that I believe providing a unified
> interface to traverse not only pre-stored data as vertex/edge, but also
> model-generated data on the fly as vertex/edge, can be very useful (not only
> for collaborative filtering use cases).
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
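
The four-step query proposed in the issue description can be sketched roughly as below. This is only an illustrative Python sketch under stated assumptions, not S2Graph's Fetcher interface or any code from the pull request: the names `CLICK_EDGES`, `ITEM_TO_SEQ`, `ITEM_FACTORS`, `knn`, and `similar_items_for_user` are hypothetical, the data is made up, "sequence" is assumed to mean an integer index into the item-factor matrix I, and a brute-force cosine search stands in for whatever nearest-neighbor method a real model would use.

```python
import math

# Hypothetical pre-loaded model: the dictionary between raw item ids and
# item sequences (assumed to be integer row indices), plus the item-factor
# matrix I. All values are made up for illustration.
ITEM_TO_SEQ = {"shoes": 0, "socks": 1, "laptop": 2, "mouse": 3}
SEQ_TO_ITEM = {seq: item for item, seq in ITEM_TO_SEQ.items()}
ITEM_FACTORS = [
    [1.0, 0.1],  # shoes
    [0.9, 0.2],  # socks
    [0.1, 1.0],  # laptop
    [0.2, 0.9],  # mouse
]

# Hypothetical pre-stored edges: user -> clicked items.
CLICK_EDGES = {"user_1": ["shoes"], "user_2": ["laptop", "shoes"]}


def cosine(a, b):
    """Cosine similarity between two factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(seq, k):
    """Brute-force k-nearest-neighbor search over the item factors."""
    scored = [
        (cosine(ITEM_FACTORS[seq], vec), other)
        for other, vec in enumerate(ITEM_FACTORS)
        if other != seq
    ]
    # Highest similarity first; ties broken by the lower sequence.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def similar_items_for_user(user, k=1):
    # Step 1: fetch the user's history as edges to clicked items.
    clicked = CLICK_EDGES.get(user, [])
    # Step 2: convert clicked items into item sequences.
    seqs = [ITEM_TO_SEQ[item] for item in clicked if item in ITEM_TO_SEQ]
    result = {}
    for seq in seqs:
        # Step 3: k-nearest-neighbor search on the pre-loaded model.
        neighbor_seqs = knn(seq, k)
        # Step 4: convert similar item sequences back to items.
        result[SEQ_TO_ITEM[seq]] = [SEQ_TO_ITEM[n] for n in neighbor_seqs]
    return result


if __name__ == "__main__":
    print(similar_items_for_user("user_2"))
```

In this sketch the similar items are computed at query time from the in-memory model, so no item-item similarity edges have to be bulk loaded beforehand.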