[ https://issues.apache.org/jira/browse/MAHOUT-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suneel Marthi updated MAHOUT-1700:
----------------------------------
    Assignee: Dmitriy Lyubimov

> OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
> ---------------------------------------------------------
>
>                 Key: MAHOUT-1700
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1700
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.9, 0.10.0
>            Reporter: Ethan Yi
>            Assignee: Dmitriy Lyubimov
>              Labels: patch
>             Fix For: 0.12.0
>
>
> Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0). It fails with a "java heap space" OutOfMemory error in ABtDenseOutJob. I tracked down the cause; the ABtDenseOutJob map code is as follows:
>
>     protected void map(Writable key, VectorWritable value, Context context)
>         throws IOException, InterruptedException {
>       Vector vec = value.get();
>       int vecSize = vec.size();
>       if (aCols == null) {
>         aCols = new Vector[vecSize];
>       } else if (aCols.length < vecSize) {
>         aCols = Arrays.copyOf(aCols, vecSize);
>       }
>       if (vec.isDense()) {
>         for (int i = 0; i < vecSize; i++) {
>           extendAColIfNeeded(i, aRowCount + 1);
>           aCols[i].setQuick(aRowCount, vec.getQuick(i));
>         }
>       } else if (vec.size() > 0) {
>         for (Vector.Element vecEl : vec.nonZeroes()) {
>           int i = vecEl.index();
>           extendAColIfNeeded(i, aRowCount + 1);
>           aCols[i].setQuick(aRowCount, vecEl.get());
>         }
>       }
>       aRowCount++;
>     }
>
> If the input rows are RandomAccessSparseVectors, which is common with big data, vec.size() is typically Integer.MAX_VALUE (2^31 - 1), so aCols = new Vector[vecSize] triggers the OutOfMemory error. The obvious remedy is to raise every tasktracker's maximum heap:
>
>     <property>
>       <name>mapred.child.java.opts</name>
>       <value>-Xmx1024m</value>
>     </property>
>
> However, if you are NOT the Hadoop administrator or on the ops team, you have no permission to modify that configuration. So I modified the ABtDenseOutJob map code to support the RandomAccessSparseVector case: I use a HashMap to represent aCols instead of the original Vector[] aCols array. The modified code is as follows:
>
>     private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
>
>     protected void map(Writable key, VectorWritable value, Context context)
>         throws IOException, InterruptedException {
>       Vector vec = value.get();
>       int vecSize = vec.size();
>       if (vec.isDense()) {
>         for (int i = 0; i < vecSize; i++) {
>           //extendAColIfNeeded(i, aRowCount + 1);
>           if (aColsMap.get(i) == null) {
>             aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>           }
>           aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>           //aCols[i].setQuick(aRowCount, vec.getQuick(i));
>         }
>       } else if (vec.size() > 0) {
>         for (Vector.Element vecEl : vec.nonZeroes()) {
>           int i = vecEl.index();
>           //extendAColIfNeeded(i, aRowCount + 1);
>           if (aColsMap.get(i) == null) {
>             aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>           }
>           aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>           //aCols[i].setQuick(aRowCount, vecEl.get());
>         }
>       }
>       aRowCount++;
>     }
>
> With this change, the OutOfMemory error no longer occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
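
For context, a minimal standalone sketch of the allocation behavior the report describes, using Mahout's math API (RandomAccessSparseVector). The class name AColsAllocationSketch and the sample indices are made up for illustration; this is not code from the patch itself.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class AColsAllocationSketch {

      public static void main(String[] args) {
        // A sparse row as it typically reaches the ABtDenseOutJob mapper:
        // cardinality Integer.MAX_VALUE, only a handful of non-zero entries.
        Vector row = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        row.setQuick(3, 1.0);
        row.setQuick(1000000, 2.5);

        // size() reports the cardinality, not the number of non-zeros, so the
        // original "new Vector[vec.size()]" would try to allocate an array of
        // roughly 2^31 - 1 references and exhaust a default-sized task heap.
        System.out.println("row.size()           = " + row.size());
        System.out.println("non-default entries  = " + row.getNumNondefaultElements());

        // The HashMap-based variant from the report only materializes a
        // column vector for indices that actually carry data.
        Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
        int aRowCount = 0;
        for (Vector.Element e : row.nonZeroes()) {
          Vector col = aColsMap.get(e.index());
          if (col == null) {
            col = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
            aColsMap.put(e.index(), col);
          }
          col.setQuick(aRowCount, e.get());
        }
        System.out.println("columns materialized = " + aColsMap.size());
      }
    }

Run with the mahout-math jar on the classpath; the sketch prints a cardinality of 2147483647 against only two materialized columns, which is the gap between the array-based and map-based representations that the patch exploits.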