There is a large-ish data structure in the Spark version of this algorithm. Each slave gets a copy of several BiMaps that translate your IDs into and out of Mahout IDs: one is created for user IDs and one for each item ID set, so a single action means 2 BiMaps. These are broadcast values, so enough executor memory must be available to hold them, and their size depends on how many distinct user and item IDs you have.
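As a rough illustration (not the actual Mahout code), here is a minimal Scala sketch of that idea, using Guava's HashBiMap as a stand-in for the ID dictionaries and made-up user/item data. It only shows the shape of the thing: each executor holds a full broadcast copy of the dictionaries and uses them to map external IDs to internal integer IDs and back.

// Minimal sketch, assuming Guava and Spark on the classpath; names and data are hypothetical.
import com.google.common.collect.HashBiMap
import org.apache.spark.{SparkConf, SparkContext}

object BiMapBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bimap-broadcast-sketch"))

    // One dictionary for user IDs and one per item ID set (two for a single action).
    val userDict = HashBiMap.create[String, Integer]()
    val itemDict = HashBiMap.create[String, Integer]()
    Seq("u1", "u2", "u3").zipWithIndex.foreach { case (id, i) => userDict.put(id, i) }
    Seq("iPad", "iPhone", "nexus").zipWithIndex.foreach { case (id, i) => itemDict.put(id, i) }

    // Broadcast values: every executor holds a full copy, so executor memory must be
    // large enough for all dictionaries; their size grows with the number of IDs.
    val userB = sc.broadcast(userDict)
    val itemB = sc.broadcast(itemDict)

    val interactions = sc.parallelize(Seq(("u1", "iPad"), ("u3", "nexus")))

    // External IDs -> internal integer IDs on the executors.
    val asInternalIds = interactions.map { case (u, i) =>
      (userB.value.get(u), itemB.value.get(i))
    }

    // Internal integer IDs -> external IDs again, via the inverse direction of the BiMap.
    val backToExternal = asInternalIds.map { case (u, i) =>
      (userB.value.inverse.get(u), itemB.value.inverse.get(i))
    }
    backToExternal.collect().foreach(println)

    sc.stop()
  }
}

The point of the sketch is only the memory accounting: because these are broadcast, every executor pays for the whole set of dictionaries, not a partition of them.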
On Dec 23, 2014, at 8:05 AM, Ted Dunning <[email protected]> wrote:

On Tue, Dec 23, 2014 at 7:39 AM, AlShater, Hani <[email protected]> wrote:

> @Ted, It is a 3-node small cluster for POC. The Spark executor is given 2g
> and yarn is configured accordingly. I am trying to avoid Spark memory caching.

Have you tried the map-reduce version?
