Re: Re: Re: Hadoop SSVD OutOfMemory Problem

2015-04-28 Thread Dmitriy Lyubimov
I think they used to run individually in Eclipse just fine. I am sure it
will also work with IDEA.

With Maven, I never ran anything less than a module's worth of tests.
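
To answer the question in the message below: with the Maven Surefire plugin you
can usually restrict a run to a single test class via the -Dtest property, and
-pl limits the build to one module. A minimal sketch; the module name "mrlegacy"
is an assumption and should be checked against the current source tree:

  # run only one test class instead of the whole suite
  mvn test -Dtest=LocalSSVDPCASparseTest -DfailIfNoTests=false

  # or additionally restrict the build to a single module
  mvn -pl mrlegacy test -Dtest=LocalSSVDPCASparseTest -DfailIfNoTests=false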

Re: Re: Re: Hadoop SSVD OutOfMemory Problem

2015-04-28 Thread lastarsenal
OK, I have a github account and have cloned mahout into my local working directory.


I revised the code and ran the tests with "mvn test"; however, there are 3 test failures:
Failed tests:
  LocalSSVDPCASparseTest.runPCATest1:87->runSSVDSolver:222->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 null
  LocalSSVDSolverDenseTest.testSSVDSolverPowerIterations1:59->runSSVDSolver:172->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 null
  LocalSSVDSolverSparseSequentialTest.testSSVDSolverPowerIterations1:69->runSSVDSolver:177->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 null


Now, my question is: how can I run a single specified test with Maven? "mvn test"
by itself is very slow, so if I could do something like "mvn test
LocalSSVDPCASparseTest", my efficiency would be much improved.

At 2015-04-29 01:25:34, "Dmitriy Lyubimov"  wrote:
>Just Dmitriy is fine.
>
>In order to create a pull request, please check out the process page
>http://mahout.apache.org/developers/github.html. Note that it is written
>for both committers and contributors, so you need to ignore the details for
>committers.
>
>Basically, you just need a github account: clone (fork) apache/mahout into
>your account, (optionally) create a patch branch, commit your modifications
>there, and then use the github UI to create a pull request against
>apache/mahout.
>
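>A rough sketch of that flow (your-account, the branch name, and the commit
>message below are placeholders; the process page above is the authoritative
>reference):
>
>  git clone https://github.com/your-account/mahout.git
>  cd mahout
>  git checkout -b MAHOUT-1700
>  # edit the code, then commit and push the patch branch
>  git commit -am "MAHOUT-1700: avoid OOM in ABtDenseOutJob for sparse input"
>  git push origin MAHOUT-1700
>  # finally, open a pull request against apache/mahout in the github UI
>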
>thanks.
>
>-d
>
>On Mon, Apr 27, 2015 at 8:39 PM, lastarsenal  wrote:
>
>> Hi, Dmitriy Lyubimov
>>
>>
>> OK, I have submitted a JIRA issue at
>> https://issues.apache.org/jira/browse/MAHOUT-1700
>>
>>
>> I'm a newbie to mahout, so what should I do next for this issue? Thank
>> you!
>>
>> At 2015-04-28 02:16:37, "Dmitriy Lyubimov"  wrote:
>> >Thank you for this analysis. I can't immediately confirm this since it's
>> >been a while but this sounds credible.
>> >
>> >Do you mind filing a jira with all this information, and perhaps even doing
>> >a PR on github?
>> >
>> >thank you.
>> >
>> >On Mon, Apr 27, 2015 at 4:32 AM, lastarsenal  wrote:
>> >
>> >> Hi, All,
>> >>
>> >>
>> >> Recently, I tried mahout's hadoop ssvd (mahout-0.9 or mahout-1.0)
>> >> job. There is a java heap space out-of-memory problem in
>> >> ABtDenseOutJob. I found the reason; the ABtDenseOutJob map code is as
>> >> below:
>> >>
>> >>
>> >> protected void map(Writable key, VectorWritable value, Context context)
>> >>   throws IOException, InterruptedException {
>> >>
>> >>   Vector vec = value.get();
>> >>
>> >>   int vecSize = vec.size();
>> >>   if (aCols == null) {
>> >>     aCols = new Vector[vecSize];
>> >>   } else if (aCols.length < vecSize) {
>> >>     aCols = Arrays.copyOf(aCols, vecSize);
>> >>   }
>> >>
>> >>   if (vec.isDense()) {
>> >>     for (int i = 0; i < vecSize; i++) {
>> >>       extendAColIfNeeded(i, aRowCount + 1);
>> >>       aCols[i].setQuick(aRowCount, vec.getQuick(i));
>> >>     }
>> >>   } else if (vec.size() > 0) {
>> >>     for (Vector.Element vecEl : vec.nonZeroes()) {
>> >>       int i = vecEl.index();
>> >>       extendAColIfNeeded(i, aRowCount + 1);
>> >>       aCols[i].setQuick(aRowCount, vecEl.get());
>> >>     }
>> >>   }
>> >>   aRowCount++;
>> >> }
>> >>
>> >>
>> >> If the input is a RandomAccessSparseVector, which is common with big
>> >> data, its vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new
>> >> Vector[vecSize] tries to allocate about 2^31 object references (roughly
>> >> 16 GB with 8-byte references) and triggers the OutOfMemory problem no
>> >> matter how sparse the data actually is. The obvious remedy would be to
>> >> enlarge every tasktracker's maximum memory:
>> >> <property>
>> >>   <name>mapred.child.java.opts</name>
>> >>   <value>-Xmx1024m</value>
>> >> </property>
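>> >>
>> >> For concreteness, here is a stripped-down sketch of just the failing
>> >> allocation, outside the job; the class below is made up for illustration
>> >> and is not mahout code:
>> >>
>> >> import org.apache.mahout.math.RandomAccessSparseVector;
>> >> import org.apache.mahout.math.Vector;
>> >>
>> >> public class AllocationSketch {
>> >>   public static void main(String[] args) {
>> >>     // a sparse row whose nominal cardinality is Integer.MAX_VALUE
>> >>     Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
>> >>     vec.setQuick(42, 1.0);
>> >>     // sizing the column array by vec.size() asks for ~2^31 references
>> >>     // (roughly 16 GB) and fails with java.lang.OutOfMemoryError,
>> >>     // no matter how sparse the row actually is
>> >>     Vector[] aCols = new Vector[vec.size()];
>> >>   }
>> >> }
>> >>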
>> >> However, if you are NOT a hadoop administrator or ops, you have no
>> >> permission to modify that config. So I tried to modify the ABtDenseOutJob
>> >> map code to handle the RandomAccessSparseVector case: I use a HashMap to
>> >> represent aCols instead of the original Vector[] aCols array. The
>> >> modified code is as below:
>> >>
>> >>
>> >> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
>> >>
>> >> protected void map(Writable key, VectorWritable value, Context context)
>> >>   throws IOException, InterruptedException {
>> >>
>> >>   Vector vec = value.get();
>> >>   int vecSize = vec.size();
>> >>   if (vec.isDense()) {
>> >>     for (int i = 0; i < vecSize; i++) {
>> >>       //extendAColIfNeeded(i, aRowCount + 1);
>> >>       if (aColsMap.get(i) == null) {
>> >>         aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>> >>       }
>> >>       aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>> >>       //aCols[i].setQuick(aRowCount, vec.getQuick(i));
>> >>     }
>> >>   } else if (vec.size() > 0) {
>> >>     for (Vector.Element vecEl : vec.nonZeroes()) {
>> >>       int i = vecEl.index();
>> >>       //extendAColIfNeeded(i, aRowCount + 1);
>> >>       if (aColsMap.get(i) == null) {
>> >>         aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>> >>       }
>> >>       aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>> >>     }
>> >>   }
>> >>   aRowCount++;
>> >> }
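>> >>
>> >> A note on the design choice in the modification above: keying the columns
>> >> by index in a HashMap means mapper memory grows only with the number of
>> >> columns actually touched in its split, rather than with the nominal
>> >> cardinality of 2^31 - 1, which is also why the extendAColIfNeeded() calls
>> >> are commented out. Whatever code later iterates over the old aCols array
>> >> (for example in the task's cleanup) would need the same HashMap-based
>> >> treatment; that part is not shown here.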