Just Dmitriy is fine. To create a pull request, please check out the
process page http://mahout.apache.org/developers/github.html. Note that it
is written for both committers and contributors, so you can ignore the
committer-specific details.
Basically, you just need a github account: clone (fork) apache/mahout into
your account, (optionally) create a patch branch, commit your modifications
there, and then use the github UI to create a pull request against
apache/mahout.

thanks.
-d

On Mon, Apr 27, 2015 at 8:39 PM, lastarsenal <lastarse...@163.com> wrote:

> Hi, Dmitriy Lyubimov
>
> OK, I have submitted a JIRA issue at
> https://issues.apache.org/jira/browse/MAHOUT-1700
>
> I'm a newbie to mahout, so what should I do next for this issue? Thank
> you!
>
> At 2015-04-28 02:16:37, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
> >Thank you for this analysis. I can't immediately confirm it since it's
> >been a while, but this sounds credible.
> >
> >Do you mind filing a jira with all this information, and perhaps even
> >doing a PR on github?
> >
> >thank you.
> >
> >On Mon, Apr 27, 2015 at 4:32 AM, lastarsenal <lastarse...@163.com> wrote:
> >
> >> Hi, All,
> >>
> >> Recently, I tried mahout's hadoop ssvd job (mahout-0.9 or mahout-1.0).
> >> There is a Java heap space OutOfMemory problem in ABtDenseOutJob. I
> >> found the reason: the ABtDenseOutJob map code is as below:
> >>
> >> protected void map(Writable key, VectorWritable value, Context context)
> >>   throws IOException, InterruptedException {
> >>
> >>   Vector vec = value.get();
> >>
> >>   int vecSize = vec.size();
> >>   if (aCols == null) {
> >>     aCols = new Vector[vecSize];
> >>   } else if (aCols.length < vecSize) {
> >>     aCols = Arrays.copyOf(aCols, vecSize);
> >>   }
> >>
> >>   if (vec.isDense()) {
> >>     for (int i = 0; i < vecSize; i++) {
> >>       extendAColIfNeeded(i, aRowCount + 1);
> >>       aCols[i].setQuick(aRowCount, vec.getQuick(i));
> >>     }
> >>   } else if (vec.size() > 0) {
> >>     for (Vector.Element vecEl : vec.nonZeroes()) {
> >>       int i = vecEl.index();
> >>       extendAColIfNeeded(i, aRowCount + 1);
> >>       aCols[i].setQuick(aRowCount, vecEl.get());
> >>     }
> >>   }
> >>   aRowCount++;
> >> }
> >>
> >> If the input is a RandomAccessSparseVector, which is usual with big
> >> data, its vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new
> >> Vector[vecSize] introduces the OutOfMemory problem. The obvious
> >> workaround is to enlarge every tasktracker's maximum task memory:
> >>
> >> <property>
> >>   <name>mapred.child.java.opts</name>
> >>   <value>-Xmx1024m</value>
> >> </property>
> >>
> >> However, if you are NOT a hadoop administrator or ops, you have no
> >> permission to modify that config.
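> >>
> >> To make the failure mode concrete, here is a minimal standalone sketch
> >> (illustrative only, not part of the proposed patch; the class name
> >> OomDemo is made up). It shows that a RandomAccessSparseVector of
> >> unknown width reports Integer.MAX_VALUE as its size, so sizing an
> >> array from it is fatal no matter how sparse the actual data is:
> >>
> >> import org.apache.mahout.math.RandomAccessSparseVector;
> >> import org.apache.mahout.math.Vector;
> >>
> >> public class OomDemo {
> >>   public static void main(String[] args) {
> >>     // A sparse vector of unknown width conventionally gets cardinality
> >>     // Integer.MAX_VALUE, even if it holds a single nonzero element.
> >>     Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
> >>     vec.setQuick(42, 1.0);
> >>     System.out.println(vec.size()); // prints 2147483647
> >>
> >>     // This asks for ~2^31 object references (~16 GB of reference
> >>     // slots) in one allocation, so the task JVM dies with an
> >>     // OutOfMemoryError regardless of the data's sparsity.
> >>     Vector[] aCols = new Vector[vec.size()];
> >>   }
> >> }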
> >>
> >> So, I tried to modify the ABtDenseOutJob map code to support the
> >> RandomAccessSparseVector situation. I use a HashMap to represent aCols
> >> instead of the original Vector[] aCols array; the modified code is as
> >> below:
> >>
> >> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
> >>
> >> protected void map(Writable key, VectorWritable value, Context context)
> >>   throws IOException, InterruptedException {
> >>
> >>   Vector vec = value.get();
> >>   if (vec.isDense()) {
> >>     int vecSize = vec.size();
> >>     for (int i = 0; i < vecSize; i++) {
> >>       // was: extendAColIfNeeded(i, aRowCount + 1);
> >>       if (aColsMap.get(i) == null) {
> >>         aColsMap.put(i,
> >>             new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
> >>       }
> >>       // was: aCols[i].setQuick(aRowCount, vec.getQuick(i));
> >>       aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
> >>     }
> >>   } else if (vec.size() > 0) {
> >>     for (Vector.Element vecEl : vec.nonZeroes()) {
> >>       int i = vecEl.index();
> >>       // was: extendAColIfNeeded(i, aRowCount + 1);
> >>       if (aColsMap.get(i) == null) {
> >>         aColsMap.put(i,
> >>             new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
> >>       }
> >>       // was: aCols[i].setQuick(aRowCount, vecEl.get());
> >>       aColsMap.get(i).setQuick(aRowCount, vecEl.get());
> >>     }
> >>   }
> >>   aRowCount++;
> >> }
> >>
> >> With this change, the OutOfMemory problem no longer occurs; a quick
> >> sanity check of the memory behavior follows below.
> >>
> >> Thank you!
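> >>
> >> P.S. Here is that quick sanity check (illustrative only; the class
> >> name AColsMapDemo is made up). With the map-based aCols, a column
> >> vector is allocated only for indices actually seen, so memory scales
> >> with the number of touched columns rather than the declared
> >> cardinality:
> >>
> >> import java.util.HashMap;
> >> import java.util.Map;
> >>
> >> import org.apache.mahout.math.RandomAccessSparseVector;
> >> import org.apache.mahout.math.Vector;
> >>
> >> public class AColsMapDemo {
> >>   public static void main(String[] args) {
> >>     Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
> >>
> >>     // One input row with two nonzero entries and the usual declared
> >>     // cardinality of Integer.MAX_VALUE.
> >>     Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
> >>     vec.setQuick(7, 3.0);
> >>     vec.setQuick(123456789, 5.0);
> >>
> >>     int aRowCount = 0;
> >>     for (Vector.Element el : vec.nonZeroes()) {
> >>       int i = el.index();
> >>       if (aColsMap.get(i) == null) {
> >>         aColsMap.put(i,
> >>             new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
> >>       }
> >>       aColsMap.get(i).setQuick(aRowCount, el.get());
> >>     }
> >>
> >>     // Only the two touched columns were ever allocated.
> >>     System.out.println(aColsMap.size()); // prints 2
> >>   }
> >> }
> >>
> >> Each allocated column is itself a RandomAccessSparseVector with a small
> >> initial capacity (100), so the per-column overhead stays small as well.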