Just Dmitriy is fine. To create a pull request, please check out the
process page http://mahout.apache.org/developers/github.html. Note that it
is written for both committers and contributors, so you can ignore the
committer-specific details.
Basically, you just need a github account: clone (fork) apache/mahout into
your account, (optionally) create a patch branch, commit your modifications
there, and then use the github UI to create a pull request against
apache/mahout.

thanks.
-d

On Mon, Apr 27, 2015 at 8:39 PM, lastarsenal <lastarse...@163.com> wrote:

> Hi, Dmitriy Lyubimov
>
> OK, I have submitted a JIRA issue at
> https://issues.apache.org/jira/browse/MAHOUT-1700
>
> I'm a newbie to mahout, so what should I do next for this issue? Thank
> you!
>
> At 2015-04-28 02:16:37, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
> >Thank you for this analysis. I can't immediately confirm it since it's
> >been a while, but this sounds credible.
> >
> >Do you mind filing a jira with all this information, and perhaps even
> >doing a PR on github?
> >
> >thank you.
> >
> >On Mon, Apr 27, 2015 at 4:32 AM, lastarsenal <lastarse...@163.com> wrote:
> >
> >> Hi, All,
> >>
> >> Recently, I tried mahout's hadoop ssvd job (mahout-0.9 or mahout-1.0).
> >> There is a Java heap space OutOfMemory problem in ABtDenseOutJob. I
> >> found the reason: the ABtDenseOutJob map code is as below:
> >>
> >> protected void map(Writable key, VectorWritable value, Context context)
> >>   throws IOException, InterruptedException {
> >>
> >>   Vector vec = value.get();
> >>
> >>   int vecSize = vec.size();
> >>   if (aCols == null) {
> >>     aCols = new Vector[vecSize];
> >>   } else if (aCols.length < vecSize) {
> >>     aCols = Arrays.copyOf(aCols, vecSize);
> >>   }
> >>
> >>   if (vec.isDense()) {
> >>     for (int i = 0; i < vecSize; i++) {
> >>       extendAColIfNeeded(i, aRowCount + 1);
> >>       aCols[i].setQuick(aRowCount, vec.getQuick(i));
> >>     }
> >>   } else if (vec.size() > 0) {
> >>     for (Vector.Element vecEl : vec.nonZeroes()) {
> >>       int i = vecEl.index();
> >>       extendAColIfNeeded(i, aRowCount + 1);
> >>       aCols[i].setQuick(aRowCount, vecEl.get());
> >>     }
> >>   }
> >>   aRowCount++;
> >> }
> >>
> >> If the input is a RandomAccessSparseVector, which is usual with big
> >> data, its vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new
> >> Vector[vecSize] introduces the OutOfMemory problem. The obvious
> >> workaround is to enlarge every tasktracker's maximum task memory:
> >>
> >> <property>
> >>   <name>mapred.child.java.opts</name>
> >>   <value>-Xmx1024m</value>
> >> </property>
> >>
> >> However, if you are NOT a hadoop administrator or ops, you have no
> >> permission to modify that config.
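> >>
> >> To make the failure mode concrete, here is a minimal standalone sketch
> >> (illustrative only, not part of the proposed patch; the class name
> >> OomDemo is made up). It shows that a RandomAccessSparseVector of
> >> unknown width reports Integer.MAX_VALUE as its size, so sizing an
> >> array from it is fatal no matter how sparse the actual data is:
> >>
> >> import org.apache.mahout.math.RandomAccessSparseVector;
> >> import org.apache.mahout.math.Vector;
> >>
> >> public class OomDemo {
> >>   public static void main(String[] args) {
> >>     // A sparse vector of unknown width conventionally gets cardinality
> >>     // Integer.MAX_VALUE, even if it holds a single nonzero element.
> >>     Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
> >>     vec.setQuick(42, 1.0);
> >>     System.out.println(vec.size()); // prints 2147483647
> >>
> >>     // This asks for ~2^31 object references (~16 GB of reference
> >>     // slots) in one allocation, so the task JVM dies with an
> >>     // OutOfMemoryError regardless of the data's sparsity.
> >>     Vector[] aCols = new Vector[vec.size()];
> >>   }
> >> }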
> >>
> >> So, I tried to modify the ABtDenseOutJob map code to support the
> >> RandomAccessSparseVector situation. I use a HashMap to represent aCols
> >> instead of the original Vector[] aCols array; the modified code is as
> >> below:
> >>
> >> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
> >>
> >> protected void map(Writable key, VectorWritable value, Context context)
> >>   throws IOException, InterruptedException {
> >>
> >>   Vector vec = value.get();
> >>   if (vec.isDense()) {
> >>     int vecSize = vec.size();
> >>     for (int i = 0; i < vecSize; i++) {
> >>       // was: extendAColIfNeeded(i, aRowCount + 1);
> >>       if (aColsMap.get(i) == null) {
> >>         aColsMap.put(i,
> >>             new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
> >>       }
> >>       // was: aCols[i].setQuick(aRowCount, vec.getQuick(i));
> >>       aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
> >>     }
> >>   } else if (vec.size() > 0) {
> >>     for (Vector.Element vecEl : vec.nonZeroes()) {
> >>       int i = vecEl.index();
> >>       // was: extendAColIfNeeded(i, aRowCount + 1);
> >>       if (aColsMap.get(i) == null) {
> >>         aColsMap.put(i,
> >>             new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
> >>       }
> >>       // was: aCols[i].setQuick(aRowCount, vecEl.get());
> >>       aColsMap.get(i).setQuick(aRowCount, vecEl.get());
> >>     }
> >>   }
> >>   aRowCount++;
> >> }
> >>
> >> With this change, the OutOfMemory problem no longer occurs; a quick
> >> sanity check of the memory behavior follows below.
> >>
> >> Thank you!
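> >>
> >> P.S. Here is that quick sanity check (illustrative only; the class
> >> name AColsMapDemo is made up). With the map-based aCols, a column
> >> vector is allocated only for indices actually seen, so memory scales
> >> with the number of touched columns rather than the declared
> >> cardinality:
> >>
> >> import java.util.HashMap;
> >> import java.util.Map;
> >>
> >> import org.apache.mahout.math.RandomAccessSparseVector;
> >> import org.apache.mahout.math.Vector;
> >>
> >> public class AColsMapDemo {
> >>   public static void main(String[] args) {
> >>     Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
> >>
> >>     // One input row with two nonzero entries and the usual declared
> >>     // cardinality of Integer.MAX_VALUE.
> >>     Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
> >>     vec.setQuick(7, 3.0);
> >>     vec.setQuick(123456789, 5.0);
> >>
> >>     int aRowCount = 0;
> >>     for (Vector.Element el : vec.nonZeroes()) {
> >>       int i = el.index();
> >>       if (aColsMap.get(i) == null) {
> >>         aColsMap.put(i,
> >>             new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
> >>       }
> >>       aColsMap.get(i).setQuick(aRowCount, el.get());
> >>     }
> >>
> >>     // Only the two touched columns were ever allocated.
> >>     System.out.println(aColsMap.size()); // prints 2
> >>   }
> >> }
> >>
> >> Each allocated column is itself a RandomAccessSparseVector with a small
> >> initial capacity (100), so the per-column overhead stays small as well.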