[ https://issues.apache.org/jira/browse/MAHOUT-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suneel Marthi updated MAHOUT-1700:
----------------------------------
    Assignee: Dmitriy Lyubimov

> OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
> ---------------------------------------------------------
>
>                 Key: MAHOUT-1700
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1700
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.9, 0.10.0
>            Reporter: Ethan Yi
>            Assignee: Dmitriy Lyubimov
>              Labels: patch
>             Fix For: 0.12.0
>
>
> Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0). It fails with a "java heap space" OutOfMemory error in ABtDenseOutJob. I tracked down the cause; the ABtDenseOutJob map code is as follows:
>
>     protected void map(Writable key, VectorWritable value, Context context)
>         throws IOException, InterruptedException {
>       Vector vec = value.get();
>       int vecSize = vec.size();
>       if (aCols == null) {
>         aCols = new Vector[vecSize];
>       } else if (aCols.length < vecSize) {
>         aCols = Arrays.copyOf(aCols, vecSize);
>       }
>       if (vec.isDense()) {
>         for (int i = 0; i < vecSize; i++) {
>           extendAColIfNeeded(i, aRowCount + 1);
>           aCols[i].setQuick(aRowCount, vec.getQuick(i));
>         }
>       } else if (vec.size() > 0) {
>         for (Vector.Element vecEl : vec.nonZeroes()) {
>           int i = vecEl.index();
>           extendAColIfNeeded(i, aRowCount + 1);
>           aCols[i].setQuick(aRowCount, vecEl.get());
>         }
>       }
>       aRowCount++;
>     }
>
> If the input rows are RandomAccessSparseVectors, which is common with big data, vec.size() is typically Integer.MAX_VALUE (2^31 - 1), so aCols = new Vector[vecSize] triggers the OutOfMemory error. The obvious remedy is to raise every tasktracker's maximum heap:
>
>     <property>
>       <name>mapred.child.java.opts</name>
>       <value>-Xmx1024m</value>
>     </property>
>
> However, if you are NOT the Hadoop administrator or on the ops team, you have no permission to modify that configuration. So I modified the ABtDenseOutJob map code to support the RandomAccessSparseVector case: I use a HashMap to represent aCols instead of the original Vector[] aCols array. The modified code is as follows:
>
>     private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
>
>     protected void map(Writable key, VectorWritable value, Context context)
>         throws IOException, InterruptedException {
>       Vector vec = value.get();
>       int vecSize = vec.size();
>       if (vec.isDense()) {
>         for (int i = 0; i < vecSize; i++) {
>           //extendAColIfNeeded(i, aRowCount + 1);
>           if (aColsMap.get(i) == null) {
>             aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>           }
>           aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>           //aCols[i].setQuick(aRowCount, vec.getQuick(i));
>         }
>       } else if (vec.size() > 0) {
>         for (Vector.Element vecEl : vec.nonZeroes()) {
>           int i = vecEl.index();
>           //extendAColIfNeeded(i, aRowCount + 1);
>           if (aColsMap.get(i) == null) {
>             aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>           }
>           aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>           //aCols[i].setQuick(aRowCount, vecEl.get());
>         }
>       }
>       aRowCount++;
>     }
>
> With this change, the OutOfMemory error no longer occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
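
For context, a minimal standalone sketch of the allocation behavior the report describes, using Mahout's math API (RandomAccessSparseVector). The class name AColsAllocationSketch and the sample indices are made up for illustration; this is not code from the patch itself.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class AColsAllocationSketch {

      public static void main(String[] args) {
        // A sparse row as it typically reaches the ABtDenseOutJob mapper:
        // cardinality Integer.MAX_VALUE, only a handful of non-zero entries.
        Vector row = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        row.setQuick(3, 1.0);
        row.setQuick(1000000, 2.5);

        // size() reports the cardinality, not the number of non-zeros, so the
        // original "new Vector[vec.size()]" would try to allocate an array of
        // roughly 2^31 - 1 references and exhaust a default-sized task heap.
        System.out.println("row.size()           = " + row.size());
        System.out.println("non-default entries  = " + row.getNumNondefaultElements());

        // The HashMap-based variant from the report only materializes a
        // column vector for indices that actually carry data.
        Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
        int aRowCount = 0;
        for (Vector.Element e : row.nonZeroes()) {
          Vector col = aColsMap.get(e.index());
          if (col == null) {
            col = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
            aColsMap.put(e.index(), col);
          }
          col.setQuick(aRowCount, e.get());
        }
        System.out.println("columns materialized = " + aColsMap.size());
      }
    }

Run with the mahout-math jar on the classpath; the sketch prints a cardinality of 2147483647 against only two materialized columns, which is the gap between the array-based and map-based representations that the patch exploits.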