Re: Problems installing Mahout

2010-04-06 Thread Sean Owen
I'm ready to patch this issue, but I went the other way -- fixed the
output to use Locale.ENGLISH.

Either way works; which do you prefer? Is it making the
output deterministic, or locale-friendly? I opted for fixing it to
Locale.ENGLISH because I prefer not depending on the platform, and
because the project is hardly internationalized to begin with.
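For illustration, a minimal sketch (not the actual patch) of the difference in question:
String.format() without an explicit Locale follows the platform default, so a test that
asserts on the formatted string passes or fails depending on the machine, while pinning
Locale.ENGLISH makes the output deterministic.

import java.util.Locale;

public class LocaleFormatDemo {
  public static void main(String[] args) {
    double v = 1.5;
    // Platform default locale: a German or French machine prints "1,50" here.
    String platformDependent = String.format("%.2f", v);
    // Explicit locale: always "1.50", so string assertions stay deterministic.
    String fixed = String.format(Locale.ENGLISH, "%.2f", v);
    System.out.println(platformDependent + " vs " + fixed);
  }
}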

On Tue, Apr 6, 2010 at 12:22 AM, Jeff Eastman
 wrote:
> We've been seeing a lot of similar string comparison problems recently and
> have made some progress in minimizing them. It's ironic that this problem is
> in the Printable tests, which were supposed to be an improvement to the
> situation. The NormalModel asFormatString() uses
> ClusterBase.formatVector(), which itself uses String.format(). In this case I
> think I would choose to fix the tests, since the formatString *should* track
> the local language settings, as it's intended to be user-readable.
>
> I suggest deleting the test class until I can get a patch in for the tests,
> or just run mvn install -DskipTests=true. Actually, it would be most useful
> to delete the class and see if there are any other tests like that one to
> bite us.


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sean Owen
Weak. Surely it was my changes that did it, but I don't know why I didn't see
this in a local build / test.

On Tue, Apr 6, 2010 at 10:41 AM, Apache Hudson Server
 wrote:
> See 
>
> Changes:
>
> [srowen] MAHOUT-362 last refactorings for now
>
> [srowen] MAHOUT-362 More refinement of writables
>
> [srowen] MAHOUT-362 Fix test location and merge ItemWritable/UserWritable 
> into EntityWritable
>
> [srowen] Oops, fixed compile error from last commit which missed out some 
> changes
>
> [srowen] Initial commit of MAHOUT-362. Refactoring to come.
>
> [srowen] Restore logging to SVD related code
>
> [srowen] MAHOUT-361 Hearing no objection and believing Math shouldn't have 
> log statements and seeing that they're not really used much, I commit
>
> [adeneche] MAHOUT-323 Added a Basic Mapreduce version of TestForest
>
> [srowen] MAHOUT-361 Remove logging from collections -- uncontroversial it 
> seems
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Robin Anil
Running org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.039
sec <<< FAILURE!



On Tue, Apr 6, 2010 at 3:13 PM, Sean Owen  wrote:

> Weak, surely my changes that did it but I don't know why I didn't see
> this in a local build / test.
>
> On Tue, Apr 6, 2010 at 10:41 AM, Apache Hudson Server
>  wrote:
> > See <
> http://hudson.zones.apache.org/hudson/job/Mahout%20Trunk/584/changes>
> >
> > Changes:
> >
> > [srowen] MAHOUT-362 last refactorings for now
> >
> > [srowen] MAHOUT-362 More refinement of writables
> >
> > [srowen] MAHOUT-362 Fix test location and merge ItemWritable/UserWritable
> into EntityWritable
> >
> > [srowen] Oops, fixed compile error from last commit which missed out some
> changes
> >
> > [srowen] Initial commit of MAHOUT-362. Refactoring to come.
> >
> > [srowen] Restore logging to SVD related code
> >
> > [srowen] MAHOUT-361 Hearing no objection and believing Math shouldn't
> have log statements and seeing that they're not really used much, I commit
> >
> > [adeneche] MAHOUT-323 Added a Basic Mapreduce version of TestForest
> >
> > [srowen] MAHOUT-361 Remove logging from collections -- uncontroversial it
> seems
> >
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Robin Anil
I have tasted this before; that was when I didn't do a clean install before
checking in.

On Tue, Apr 6, 2010 at 3:13 PM, Sean Owen  wrote:

> Weak, surely my changes that did it but I don't know why I didn't see
> this in a local build / test.
>
> On Tue, Apr 6, 2010 at 10:41 AM, Apache Hudson Server
>  wrote:
> > See <
> http://hudson.zones.apache.org/hudson/job/Mahout%20Trunk/584/changes>
> >
> > Changes:
> >
> > [srowen] MAHOUT-362 last refactorings for now
> >
> > [srowen] MAHOUT-362 More refinement of writables
> >
> > [srowen] MAHOUT-362 Fix test location and merge ItemWritable/UserWritable
> into EntityWritable
> >
> > [srowen] Oops, fixed compile error from last commit which missed out some
> changes
> >
> > [srowen] Initial commit of MAHOUT-362. Refactoring to come.
> >
> > [srowen] Restore logging to SVD related code
> >
> > [srowen] MAHOUT-361 Hearing no objection and believing Math shouldn't
> have log statements and seeing that they're not really used much, I commit
> >
> > [adeneche] MAHOUT-323 Added a Basic Mapreduce version of TestForest
> >
> > [srowen] MAHOUT-361 Remove logging from collections -- uncontroversial it
> seems
> >
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sean Owen
I see all tests pass in a full clean / test. :( I will look at
Hudson's output to see why it thinks it failed.

On Tue, Apr 6, 2010 at 10:48 AM, Robin Anil  wrote:
> I have tasted this before, That was when I didn't do a clean install before
> checking in.
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sean Owen
I can't reproduce this at all and don't see how to get details out of
Hudson. Does anyone know where it sticks test output? Or can anyone
repro this?

On Tue, Apr 6, 2010 at 10:58 AM, Sean Owen  wrote:
> I see all tests pass in a full clean / test. :( I will look at
> Hudson's output to see why it think it failed.
>
> On Tue, Apr 6, 2010 at 10:48 AM, Robin Anil  wrote:
>> I have tasted this before, That was when I didn't do a clean install before
>> checking in.
>>
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sebastian Schelter
Hi Sean,

I can only guess why the test fails:

Line 225 in ItemSimilarityTest is missing a / when constructing the path
to the temporary directory:

String tmpDirPath = System.getProperty("java.io.tmpdir") +
  ItemSimilarityTest.class.getCanonicalName();

 which makes it

/tmporg.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest

on my system for example. Maybe the hudson user on the CI server is not
allowed to create this path.

Sebastian

Sean Owen schrieb:
> I can't reproduce this at all and don't see how to get details out of
> Hudson. Does anyone know where it sticks test output? or can anyone
> repro this?
>
> On Tue, Apr 6, 2010 at 10:58 AM, Sean Owen  wrote:
>   
>> I see all tests pass in a full clean / test. :( I will look at
>> Hudson's output to see why it think it failed.
>>
>> On Tue, Apr 6, 2010 at 10:48 AM, Robin Anil  wrote:
>> 
>>> I have tasted this before, That was when I didn't do a clean install before
>>> checking in.
>>>
>>>   



Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sean Owen
That must be it. I had removed the '/' earlier since on OS X the temp
dir path ends with '/', and at the time I believed it was the cause of
some other failures (which I'm guessing I was wrong about). I can
easily make the logic account for both cases.

Sean

On Tue, Apr 6, 2010 at 11:24 AM, Sebastian Schelter
 wrote:
> Hi Sean,
>
> I can only do I guess why the test fails:
>
> Line 225 in ItemSimilarityTest is missing a / when constructing the path
> to the temporary directory:
>
>    String tmpDirPath = System.getProperty("java.io.tmpdir") +
>          ItemSimilarityTest.class.getCanonicalName();
>
>  which makes it
>
> /tmporg.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest
>
> on my system for example. Maybe the hudson user on the CI server is not
> allowed to create this path.
>
> Sebastian
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Benson Margulies
The File class is my usual solution here.

On Tue, Apr 6, 2010 at 7:10 AM, Sean Owen  wrote:
> That must be it. I had removed the '/' earlier since on OS X the temp
> dir path ends with '/', and at the time I believed it was the cause of
> some other failures (which I'm guessing I was wrong about). I can
> easily make the logic account for both cases.
>
> Sean
>
> On Tue, Apr 6, 2010 at 11:24 AM, Sebastian Schelter
>  wrote:
>> Hi Sean,
>>
>> I can only do I guess why the test fails:
>>
>> Line 225 in ItemSimilarityTest is missing a / when constructing the path
>> to the temporary directory:
>>
>>    String tmpDirPath = System.getProperty("java.io.tmpdir") +
>>          ItemSimilarityTest.class.getCanonicalName();
>>
>>  which makes it
>>
>> /tmporg.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest
>>
>> on my system for example. Maybe the hudson user on the CI server is not
>> allowed to create this path.
>>
>> Sebastian
>>
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sean Owen
You mean File.createTempFile()? Yes, though here the test wants to
create a temp directory. Is there a good way to do that?

On Tue, Apr 6, 2010 at 12:17 PM, Benson Margulies  wrote:
> The File class is my usual solution here.


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Benson Margulies
I mean new File(SomeDir, SomeFile) whenever I need to compose paths, and
let it worry about delimiters.
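A minimal sketch of that idea applied to the test's temp directory (the class name is the
one from Sebastian's mail; everything else is illustrative, not the committed fix): the
File constructor handles the separator whether or not java.io.tmpdir ends with one.

import java.io.File;

public class TempDirComposeDemo {
  public static void main(String[] args) {
    // java.io.tmpdir may or may not end with a separator; File copes with either.
    File tmpDir = new File(System.getProperty("java.io.tmpdir"),
        "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest");
    tmpDir.mkdirs();  // ensure the directory exists before the test uses it
    System.out.println(tmpDir.getAbsolutePath());
  }
}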

On Tue, Apr 6, 2010 at 7:19 AM, Sean Owen  wrote:
> You mean File.createTempFile()? Yes, though here the test wants to
> create a temp directory. Is there a good way to do that?
>
> On Tue, Apr 6, 2010 at 12:17 PM, Benson Margulies  
> wrote:
>> The File class is my usual solution here.
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Sebastian Schelter
Hi Sean,
I think I saw another potential problem: lines 233 to 237 should be
changed from

  if (tmpDir.exists()) {
    recursiveDelete(tmpDir);
  } else {
    tmpDir.mkdirs();
  }

to

  if (tmpDir.exists()) {
    recursiveDelete(tmpDir);
  }
  tmpDir.mkdirs();
 
to always make sure the temporary path exists.

Sebastian

Sean Owen schrieb:
> That must be it. I had removed the '/' earlier since on OS X the temp
> dir path ends with '/', and at the time I believed it was the cause of
> some other failures (which I'm guessing I was wrong about). I can
> easily make the logic account for both cases.
>
> Sean
>
> On Tue, Apr 6, 2010 at 11:24 AM, Sebastian Schelter
>  wrote:
>   
>> Hi Sean,
>>
>> I can only do I guess why the test fails:
>>
>> Line 225 in ItemSimilarityTest is missing a / when constructing the path
>> to the temporary directory:
>>
>>String tmpDirPath = System.getProperty("java.io.tmpdir") +
>>  ItemSimilarityTest.class.getCanonicalName();
>>
>>  which makes it
>>
>> /tmporg.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest
>>
>> on my system for example. Maybe the hudson user on the CI server is not
>> allowed to create this path.
>>
>> Sebastian
>>
>> 



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Hui Wen Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853919#action_12853919
 ] 

Hui Wen Han commented on MAHOUT-358:


I tested again with the latest code;
the final output is still strange.


> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Hui Wen Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui Wen Han updated MAHOUT-358:
---

Attachment: screenshot-2.jpg

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853927#action_12853927
 ] 

Sean Owen commented on MAHOUT-358:
--

Yes, the very small, near-zero values are now being output. I think it should filter 
them out instead, which I can add.

The part I don't quite understand is the negative values. Do you have negative 
ratings?

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853929#action_12853929
 ] 

Sean Owen commented on MAHOUT-358:
--

Maybe it also helps to clarify: those values are *not* estimated preferences. 
They can be very large; they are not supposed to be in a certain range.

The negative values make sense if you have negative ratings, and I suppose we 
should allow for that.

In that case, your results seem correct. The only change you may want then is 
to output very large or small numbers in scientific notation instead of a long 
string. I can make that change easily.
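As an illustration only (not the committed change): BigDecimal.toPlainString() expands
tiny or huge floats into long digit strings, whereas the default float formatting keeps
them compact in scientific notation.

import java.math.BigDecimal;
import java.math.MathContext;

public class ScoreFormatDemo {
  public static void main(String[] args) {
    float tiny = 1.0e-12f;
    BigDecimal bd = new BigDecimal(tiny).round(new MathContext(6));
    System.out.println(bd.toPlainString());    // long plain form, e.g. 0.00000000000100000
    System.out.println(Float.toString(tiny));  // compact scientific form: 1.0E-12
  }
}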

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-356) ClassNotFoundException: org.apache.mahout.math.function.IntDoubleProcedure

2010-04-06 Thread Kris Jack (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853930#action_12853930
 ] 

Kris Jack commented on MAHOUT-356:
--

No, I haven't set a CLASSPATH var (not intentionally anyway ;)).

I have managed to successfully run mahout 0.3 on a small test set and got good 
results.  Still getting the same error with 0.4 though.  I'll keep on looking 
for the cause...

> ClassNotFoundException: org.apache.mahout.math.function.IntDoubleProcedure
> --
>
> Key: MAHOUT-356
> URL: https://issues.apache.org/jira/browse/MAHOUT-356
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.3
> Environment: karmic ubuntu 9.10, java version "1.6.0_15" hadoop 0.20
>Reporter: Kris Jack
> Fix For: 0.3
>
>
> When running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob in a 
> pseudo-distributed hadoop, I get a java class not found exception.
> Full Output:
> 10/04/01 16:50:42 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 10/04/01 16:50:43 INFO mapred.JobClient: Running job: job_201004011631_0005
> 10/04/01 16:50:44 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/01 16:50:55 INFO mapred.JobClient:  map 2% reduce 0%
> 10/04/01 16:50:58 INFO mapred.JobClient:  map 14% reduce 0%
> 10/04/01 16:51:01 INFO mapred.JobClient:  map 24% reduce 0%
> 10/04/01 16:51:04 INFO mapred.JobClient:  map 33% reduce 0%
> 10/04/01 16:51:07 INFO mapred.JobClient:  map 41% reduce 0%
> 10/04/01 16:51:10 INFO mapred.JobClient:  map 50% reduce 0%
> 10/04/01 16:51:23 INFO mapred.JobClient:  map 63% reduce 0%
> 10/04/01 16:51:26 INFO mapred.JobClient:  map 72% reduce 16%
> 10/04/01 16:51:29 INFO mapred.JobClient:  map 83% reduce 16%
> 10/04/01 16:51:32 INFO mapred.JobClient:  map 92% reduce 16%
> 10/04/01 16:51:35 INFO mapred.JobClient:  map 98% reduce 16%
> 10/04/01 16:51:38 INFO mapred.JobClient:  map 100% reduce 16%
> 10/04/01 16:51:41 INFO mapred.JobClient:  map 100% reduce 25%
> 10/04/01 16:51:59 INFO mapred.JobClient:  map 100% reduce 100%
> 10/04/01 16:52:01 INFO mapred.JobClient: Job complete: job_201004011631_0005
> 10/04/01 16:52:01 INFO mapred.JobClient: Counters: 18
> 10/04/01 16:52:01 INFO mapred.JobClient:   Job Counters 
> 10/04/01 16:52:01 INFO mapred.JobClient: Launched reduce tasks=1
> 10/04/01 16:52:01 INFO mapred.JobClient: Launched map tasks=4
> 10/04/01 16:52:01 INFO mapred.JobClient: Data-local map tasks=4
> 10/04/01 16:52:01 INFO mapred.JobClient:   FileSystemCounters
> 10/04/01 16:52:01 INFO mapred.JobClient: FILE_BYTES_READ=603502320
> 10/04/01 16:52:01 INFO mapred.JobClient: HDFS_BYTES_READ=257007616
> 10/04/01 16:52:01 INFO mapred.JobClient: FILE_BYTES_WRITTEN=846533316
> 10/04/01 16:52:01 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3417233
> 10/04/01 16:52:01 INFO mapred.JobClient:   Map-Reduce Framework
> 10/04/01 16:52:01 INFO mapred.JobClient: Reduce input groups=168791
> 10/04/01 16:52:01 INFO mapred.JobClient: Combine output records=0
> 10/04/01 16:52:01 INFO mapred.JobClient: Map input records=17359346
> 10/04/01 16:52:01 INFO mapred.JobClient: Reduce shuffle bytes=179672560
> 10/04/01 16:52:01 INFO mapred.JobClient: Reduce output records=168791
> 10/04/01 16:52:01 INFO mapred.JobClient: Spilled Records=60466622
> 10/04/01 16:52:01 INFO mapred.JobClient: Map output bytes=208312152
> 10/04/01 16:52:01 INFO mapred.JobClient: Map input bytes=256995325
> 10/04/01 16:52:01 INFO mapred.JobClient: Combine input records=0
> 10/04/01 16:52:01 INFO mapred.JobClient: Map output records=17359346
> 10/04/01 16:52:01 INFO mapred.JobClient: Reduce input records=17359346
> 10/04/01 16:52:01 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 10/04/01 16:52:01 INFO mapred.JobClient: Running job: job_201004011631_0006
> 10/04/01 16:52:02 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/01 16:52:17 INFO mapred.JobClient:  map 15% reduce 0%
> 10/04/01 16:52:20 INFO mapred.JobClient:  map 25% reduce 0%
> 10/04/01 16:52:23 INFO mapred.JobClient:  map 34% reduce 0%
> 10/04/01 16:52:26 INFO mapred.JobClient:  map 45% reduce 0%
> 10/04/01 16:52:29 INFO mapred.JobClient:  map 50% reduce 0%
> 10/04/01 16:52:41 INFO mapred.JobClient:  map 62% reduce 0%
> 10/04/01 16:52:44 INFO mapred.JobClient:  map 70% reduce 16%
> 10/04/01 16:52:48 INFO mapred.JobClient:  map 81% reduce 16%
> 10/04/01 16:52:51 INFO mapred.JobClient:  map 91% reduce 16%
> 10/04/01 16:52:53 INFO mapred.JobClient:  map 96% reduce 16%
> 10/04/01 16:52:56 INFO mapred.JobClient:  map 100% reduce 16%
> 10/04/01 16:53:02 INFO mapred.JobClient:  map 100% reduce 25%
> 10/04/01 16:53:05 INFO mapred.JobClient:  map 100% reduce 0%
> 10/04/01 16:53:07 INFO mapre

[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Hui Wen Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853977#action_12853977
 ] 

Hui Wen Han commented on MAHOUT-358:


I have no negative ratings.


> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853982#action_12853982
 ] 

Sean Owen commented on MAHOUT-358:
--

That is truly strange then. The user vectors have nonnegative values, and so 
does the co-occurrence matrix. Their product can't have negative values. 
Something is going wrong somewhere in there.

I can't reproduce this to check without your data, but if you're in a position to 
debug, you'll have to see why the user vectors or the matrix have negative values. I 
have looked at the code many times and do not see a way this can happen.
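A trivial illustration of that argument, with made-up numbers rather than the job's actual
data: every output entry is a sum of products of nonnegative terms, so a negative score can
only appear if a negative value sneaks into the user vector or the co-occurrence matrix
somewhere upstream.

public class NonNegativeProductDemo {
  public static void main(String[] args) {
    double[][] cooccurrence = {{2.0, 1.0}, {0.0, 3.0}};  // nonnegative co-occurrence counts
    double[] userVector = {4.0, 0.5};                    // nonnegative preference values
    for (double[] row : cooccurrence) {
      double score = 0.0;
      for (int j = 0; j < row.length; j++) {
        score += row[j] * userVector[j];                 // nonnegative * nonnegative >= 0
      }
      System.out.println(score);                         // never negative
    }
  }
}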

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Hui Wen Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853983#action_12853983
 ] 

Hui Wen Han commented on MAHOUT-358:


I will debug and tell you the result.
Thanks :)

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-04-06 Thread Y.W.D.D.Dissanayake (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854008#action_12854008
 ] 

Y.W.D.D.Dissanayake commented on MAHOUT-345:


I'm a computer science student and I would like to join your project. Please give more 
details about the project and how to get started following it.

> [GSOC] integrate Mahout with Drupal/PHP
> ---
>
> Key: MAHOUT-345
> URL: https://issues.apache.org/jira/browse/MAHOUT-345
> Project: Mahout
>  Issue Type: Task
>  Components: Website
>Reporter: Daniel Xiaodan Zhou
>
> Drupal is a very popular open source web content management system. It's been 
> widely used in e-commerce sites, media sites, etc. This is a list of famous 
> site using Drupal: 
> http://socialcmsbuzz.com/45-drupal-sites-which-you-may-not-have-known-were-drupal-based-24092008/
> Integrate Mahout with Drupal would greatly increase the impact of Mahout in 
> web systems: any Drupal website can easily use Mahout to make content 
> recommendations or cluster contents.
> I'm a PhD student at University of Michigan, with a research focus on 
> recommender systems. Last year I participated GSOC 2009 with Drupal.org, and 
> developed a recommender system for Drupal. But that module was not as 
> sophisticated as Mahout. And I think it would be nice just to integrate 
> Mahout into Drupal rather than developing a separate Mahout-like module for 
> Drupal.
> Any comments? I can provide more information if people here are interested. 
> Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Hui Wen Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854073#action_12854073
 ] 

Hui Wen Han commented on MAHOUT-358:


If I use Text as the output format, everything is OK.

Maybe something is wrong with RecommendedItemsWritable or IdentityReducer.


public final class RecommenderMapper extends MapReduceBase implements Mapper

  private final Text user = new Text();
  private final Text recomScore = new Text();
  private static final String FIELD_SEPERATOR = ",";

  // output.collect(userID, new RecommendedItemsWritable(recommendations));
  for (RecommendedItem recommendation : recommendations) {
    user.set(String.valueOf(userID));
    recomScore.set(recommendation.getItemID() + FIELD_SEPERATOR + recommendation.getValue());
    output.collect(user, recomScore);
  }

  JobConf recommenderConf = prepareJobConf(userVectorPath, outputPath,
      SequenceFileInputFormat.class, RecommenderMapper.class, Text.class,
      Text.class, IdentityReducer.class, Text.class,
      Text.class, TextOutputFormat.class);

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854077#action_12854077
 ] 

Sean Owen commented on MAHOUT-358:
--

You mean that you do not see those negative values? Truly strange, since that 
means everything's working except output?

The output is simple:

  @Override
  public String toString() {
StringBuilder result = new StringBuilder(200);
result.append('[');
boolean first = true;
for (RecommendedItem item : recommended) {
  if (first) {
first = false;
  } else {
result.append(',');
  }
  result.append(item.getItemID());
  result.append(':');
  BigDecimal bd = new BigDecimal(item.getValue()).round(ROUNDING);
  result.append(bd.toPlainString());
}
result.append(']');
return result.toString();
  }

I can remove the BigDecimal call, which is somewhat new, but still looks 
entirely correct to me.

Are you sure the output is really different?

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test the input pref values all is positive.
> the output score value has negative value ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Benson Margulies
Hearing no other remarks, I will proceed to disconnect and make the
version 1.0-SNAPSHOT, and call a release vote RSN.


On Sun, Apr 4, 2010 at 7:58 PM, Benson Margulies  wrote:
> Last question: What's the first version going to be? I propose '1.0'.
> 0.4 would get mighty confusing. I really don't see the harm in calling
> it 1.0.
>
>
> On Sat, Apr 3, 2010 at 6:00 PM, Grant Ingersoll  wrote:
>>
>> On Apr 3, 2010, at 2:22 PM, Benson Margulies wrote:
>>
>>> On Sat, Apr 3, 2010 at 2:07 PM, Sean Owen  wrote:
>
>

 Actually it seems like this is a valid subproject of a Mahout TLP in its
 own right, if that would be a useful middle-ground status.
>>>
>>> I'm not trying to suggest anything different. I'm opposed to having
>>> 'separate committers', but I'm happy to have multiple releasable
>>> components all in the Mahout TLP.
>>
>> For those following the sub project saga in Lucene, let's not go down that 
>> road.  +1 to releasable components, though.  We can release what we want 
>> when we want.  It doesn't have to be the whole thing all the time.  But I'd 
>> say no to separate committers, etc.
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Sean Owen
This still lives in Mahout, just has a different version number?
What's the substance of the change in the short term? I think I missed
that step.

On Tue, Apr 6, 2010 at 6:41 PM, Benson Margulies  wrote:
> Hearing no other remarks, I will proceed to disconnect and make the
> version 1.0-SNAPSHOT, and call a release vote RSN.


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Benson Margulies
Substance:

1: remove collections-codegen and collections from the top-level pom's
module list.
2: change their parents to point to the apache parent.
3: tweak their poms so that the release plugin works right with them.
4: release them
5: change rest of mahout to consume release.


On Tue, Apr 6, 2010 at 1:44 PM, Sean Owen  wrote:
> This still lives in Mahout, just has a different version number?
> what's the substance of the change in the short-term; I think I missed
> that step.
>
> On Tue, Apr 6, 2010 at 6:41 PM, Benson Margulies  
> wrote:
>> Hearing no other remarks, I will proceed to disconnect and make the
>> version 1.0-SNAPSHOT, and call a release vote RSN.
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Ted Dunning
For what it is worth, I actually prefer this approach to the multi-pom
approach in many cases.  If it really is a separate thing, it might as well
have a separate release schedule and artifact.  If it isn't a separate
thing, then you might as well use a single pom.  This heuristic doesn't
always work, and I know that people with more maven experience than I have
work under different principles.  My explanation for the difference in
opinion is that the separated project may be better for those with limited
maven experience while the more complex arrangement may be better for those
with a native fluency.

As such, giving mahout-collections and ultimately mahout-math their own
version number is a fine thing.  They will also pretty much always exhibit more
maturity than the core Mahout project, if only because the needs they fulfill
are better understood.  That makes a 1.0 version for collections a reasonable
match for the upcoming 0.4 version for Mahout.

On Tue, Apr 6, 2010 at 11:17 AM, Benson Margulies wrote:

> Substance:
>
> 1: remove collections-codegen and collections from the top-level pom's
> module list.
> 2: change their parents to point to the apache parent.
> 3: tweak their poms so that the release plugin works right with them.
> 4: release them
> 5: change rest of mahout to consume release.
>
>
> On Tue, Apr 6, 2010 at 1:44 PM, Sean Owen  wrote:
> > This still lives in Mahout, just has a different version number?
> > what's the substance of the change in the short-term; I think I missed
> > that step.
> >
> > On Tue, Apr 6, 2010 at 6:41 PM, Benson Margulies 
> wrote:
> >> Hearing no other remarks, I will proceed to disconnect and make the
> >> version 1.0-SNAPSHOT, and call a release vote RSN.
> >
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Jake Mannix
I agree in principle, but having a whole different set of versionings seems
kinda... messy?  If m-collections goes 1.0, and then 1.1, and then m-math
goes 1.0, and core goes to 0.5, we have a whole pile of different version
numbers to keep track of.

Didn't Lucene and Solr just intentionally do the reverse, locking their
release
numbers and schedules?  And now we're doing the opposite on a less
mature project?  What exactly do we gain by this?

  -jake

On Tue, Apr 6, 2010 at 11:43 AM, Ted Dunning  wrote:

> For what it is worth, I actually prefer this approach to the multi-pom
> approach in many cases.  If it really is a separate thing, it might as well
> have a separate release schedule and artifact.  If it isn't a separate
> thing, then you might as well use a single pom.  This heuristic doesn't
> always work, and I know that people with more maven experience than I have
> work under different principles.  My explanation for the difference in
> opinion is that the separated project may be better for those with limited
> maven experience while the more complex arrangement may be better for those
> with a native fluency.
>
> As such, giving mahout-collections and ultimately mahout-math their own
> version number is a fine thing.  Also will pretty much always exhibit more
> maturity than the core mahout project if only because the needs they
> fulfill
> are better understood.  That makes the 1.0 version for collections match
> the
> 0.4 upcoming version for Mahout.
>
> On Tue, Apr 6, 2010 at 11:17 AM, Benson Margulies  >wrote:
>
> > Substance:
> >
> > 1: remove collections-codegen and collections from the top-level pom's
> > module list.
> > 2: change their parents to point to the apache parent.
> > 3: tweak their poms so that the release plugin works right with them.
> > 4: release them
> > 5: change rest of mahout to consume release.
> >
> >
> > On Tue, Apr 6, 2010 at 1:44 PM, Sean Owen  wrote:
> > > This still lives in Mahout, just has a different version number?
> > > what's the substance of the change in the short-term; I think I missed
> > > that step.
> > >
> > > On Tue, Apr 6, 2010 at 6:41 PM, Benson Margulies <
> bimargul...@gmail.com>
> > wrote:
> > >> Hearing no other remarks, I will proceed to disconnect and make the
> > >> version 1.0-SNAPSHOT, and call a release vote RSN.
> > >
> >
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Ted Dunning
The Lucene/Solr community have decided to loosely couple release schedules
and explicitly decided to not lock version numbers.  One of their arguments
was that it would confuse users, which doesn't apply for us.  The other
argument was that either side should be free to have a release that was
completely compatible without the other side having to bump version number.
This especially applies to major version numbers.

The question is the degree of coupling.  I expect that we will have nearly
zero coupling between collections releases and mahout releases, if only
because there should be a vanishingly small number of collections releases.
In some sense, tying the collections and core version numbers together
should be about as compelling as, say, tying Mahout to Hadoop releases.  We
may need to have a new Mahout version when Hadoop 0.21 or 1.0 comes out, but
we definitely should be free to release many times before that happens.  The
only difference is that with collections, we will really have a say in
whether the maven artifact gets pushed onto the main repositories.

On Tue, Apr 6, 2010 at 11:48 AM, Jake Mannix  wrote:

> I agree in principal, but having a whole different set of versionings seems
> kinda... messy?  If m-collections goes 1.0, and then 1.1, and then m-math
> goes 1.0, and core goes to 0.5, we have a whole pile of different version
> numbers to keep track of.
>
> Didn't Lucene and Solr just intentionally do the reverse, locking their
> release
> numbers and schedules?  And now we're doing the opposite on a less
> mature project?  What exactly do we gain by this?
>
>  -jake
>
> On Tue, Apr 6, 2010 at 11:43 AM, Ted Dunning 
> wrote:
>
> > For what it is worth, I actually prefer this approach to the multi-pom
> > approach in many cases.  If it really is a separate thing, it might as
> well
> > have a separate release schedule and artifact.  If it isn't a separate
> > thing, then you might as well use a single pom.  This heuristic doesn't
> > always work, and I know that people with more maven experience than I
> have
> > work under different principles.  My explanation for the difference in
> > opinion is that the separated project may be better for those with
> limited
> > maven experience while the more complex arrangement may be better for
> those
> > with a native fluency.
> >
> > As such, giving mahout-collections and ultimately mahout-math their own
> > version number is a fine thing.  Also will pretty much always exhibit
> more
> > maturity than the core mahout project if only because the needs they
> > fulfill
> > are better understood.  That makes the 1.0 version for collections match
> > the
> > 0.4 upcoming version for Mahout.
> >
> > On Tue, Apr 6, 2010 at 11:17 AM, Benson Margulies  > >wrote:
> >
> > > Substance:
> > >
> > > 1: remove collections-codegen and collections from the top-level pom's
> > > module list.
> > > 2: change their parents to point to the apache parent.
> > > 3: tweak their poms so that the release plugin works right with them.
> > > 4: release them
> > > 5: change rest of mahout to consume release.
> > >
> > >
> > > On Tue, Apr 6, 2010 at 1:44 PM, Sean Owen  wrote:
> > > > This still lives in Mahout, just has a different version number?
> > > > what's the substance of the change in the short-term; I think I
> missed
> > > > that step.
> > > >
> > > > On Tue, Apr 6, 2010 at 6:41 PM, Benson Margulies <
> > bimargul...@gmail.com>
> > > wrote:
> > > >> Hearing no other remarks, I will proceed to disconnect and make the
> > > >> version 1.0-SNAPSHOT, and call a release vote RSN.
> > > >
> > >
> >
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Ted Dunning
I should have said "there should SOON be a vanishingly small number of
collections releases".  Clearly that isn't so just yet.

On Tue, Apr 6, 2010 at 12:09 PM, Ted Dunning  wrote:

> if only because there should be a vanishingly small number of collections
> releases


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Benson Margulies
We gain the ability to release collections more frequently. *because*
it is less mature, it needs that.

On Tue, Apr 6, 2010 at 2:48 PM, Jake Mannix  wrote:
> I agree in principal, but having a whole different set of versionings seems
> kinda... messy?  If m-collections goes 1.0, and then 1.1, and then m-math
> goes 1.0, and core goes to 0.5, we have a whole pile of different version
> numbers to keep track of.
>
> Didn't Lucene and Solr just intentionally do the reverse, locking their
> release
> numbers and schedules?  And now we're doing the opposite on a less
> mature project?  What exactly do we gain by this?
>
>  -jake
>
> On Tue, Apr 6, 2010 at 11:43 AM, Ted Dunning  wrote:
>
>> For what it is worth, I actually prefer this approach to the multi-pom
>> approach in many cases.  If it really is a separate thing, it might as well
>> have a separate release schedule and artifact.  If it isn't a separate
>> thing, then you might as well use a single pom.  This heuristic doesn't
>> always work, and I know that people with more maven experience than I have
>> work under different principles.  My explanation for the difference in
>> opinion is that the separated project may be better for those with limited
>> maven experience while the more complex arrangement may be better for those
>> with a native fluency.
>>
>> As such, giving mahout-collections and ultimately mahout-math their own
>> version number is a fine thing.  Also will pretty much always exhibit more
>> maturity than the core mahout project if only because the needs they
>> fulfill
>> are better understood.  That makes the 1.0 version for collections match
>> the
>> 0.4 upcoming version for Mahout.
>>
>> On Tue, Apr 6, 2010 at 11:17 AM, Benson Margulies > >wrote:
>>
>> > Substance:
>> >
>> > 1: remove collections-codegen and collections from the top-level pom's
>> > module list.
>> > 2: change their parents to point to the apache parent.
>> > 3: tweak their poms so that the release plugin works right with them.
>> > 4: release them
>> > 5: change rest of mahout to consume release.
>> >
>> >
>> > On Tue, Apr 6, 2010 at 1:44 PM, Sean Owen  wrote:
>> > > This still lives in Mahout, just has a different version number?
>> > > what's the substance of the change in the short-term; I think I missed
>> > > that step.
>> > >
>> > > On Tue, Apr 6, 2010 at 6:41 PM, Benson Margulies <
>> bimargul...@gmail.com>
>> > wrote:
>> > >> Hearing no other remarks, I will proceed to disconnect and make the
>> > >> version 1.0-SNAPSHOT, and call a release vote RSN.
>> > >
>> >
>>
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Benson Margulies
On Tue, Apr 6, 2010 at 3:10 PM, Ted Dunning  wrote:
> I should have said "there should SOON be a vanishingly small number of
> collections releases".  Clearly that isn't so just yet.
>
> On Tue, Apr 6, 2010 at 12:09 PM, Ted Dunning  wrote:
>
>> if only because there should be a vanishingly small number of collections
>> releases

Until we add all the unit tests and remove all the deprecations, I
expect some releases, as per Ted's later message. Then it should
get really, really quiet, unless we decide that it's the right place
for things like Bloom filters.


I should also add that I still have hopes that collections will
transmigrate to commons, so making it more independent of mahout is
better.

>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Benson Margulies
Where are we on the consensus process?

Jake, have Ted and I satisfied you? Does this call for a VOTE to be
sure that we're on the same page?


On Tue, Apr 6, 2010 at 3:33 PM, Benson Margulies  wrote:
> On Tue, Apr 6, 2010 at 3:10 PM, Ted Dunning  wrote:
>> I should have said "there should SOON be a vanishingly small number of
>> collections releases".  Clearly that isn't so just yet.
>>
>> On Tue, Apr 6, 2010 at 12:09 PM, Ted Dunning  wrote:
>>
>>> if only because there should be a vanishingly small number of collections
>>> releases
>
> Until we all all the unit tests and remove all the deprecations, I
> expect a some releases, as per Ted's later message. Then, it should
> get really, really, quiet. Unless we decide that it's the right place
> for things like bloom filters.
>
>
> I should also add that I still have hopes that collections will
> transmigrate to commons, so making it more independent of mahout is
> better.
>
>>
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Jake Mannix
I guess I'm fine with whatever; making fast releases of collections is
in fact pretty cool, and it will give us practice with making releases in Mahout
in general.  And if we can do this for mahout-math as well, some of us who
care about, for example, eventually adding unit tests for all of the old Colt
stuff can do so bit by bit, undeprecate code which is now tested, and
then re-release there frequently with a new version number as well.

I'm in favor, I guess, of:

1: remove collections-codegen and collections from the top-level pom's
module list.
2: change their parents to point to the apache parent.
3: tweak their poms so that the release plugin works right with them.
4: release them
5: change rest of mahout to consume release.

  -jake

On Tue, Apr 6, 2010 at 12:44 PM, Benson Margulies wrote:

> Where are we on the consensus process?
>
> Jake, have Ted and I satisfied you? Does this call for a VOTE to be
> sure that we're on the same page?
>
>
> On Tue, Apr 6, 2010 at 3:33 PM, Benson Margulies 
> wrote:
> > On Tue, Apr 6, 2010 at 3:10 PM, Ted Dunning 
> wrote:
> >> I should have said "there should SOON be a vanishingly small number of
> >> collections releases".  Clearly that isn't so just yet.
> >>
> >> On Tue, Apr 6, 2010 at 12:09 PM, Ted Dunning 
> wrote:
> >>
> >>> if only because there should be a vanishingly small number of
> collections
> >>> releases
> >
> > Until we all all the unit tests and remove all the deprecations, I
> > expect a some releases, as per Ted's later message. Then, it should
> > get really, really, quiet. Unless we decide that it's the right place
> > for things like bloom filters.
> >
> >
> > I should also add that I still have hopes that collections will
> > transmigrate to commons, so making it more independent of mahout is
> > better.
> >
> >>
> >
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Ted Dunning
Very cool.  Very exciting.

Benson, that sounds like consensus to me.

On Tue, Apr 6, 2010 at 1:02 PM, Jake Mannix  wrote:

> ... I'm in favor, I guess, of:
>
> 1: remove collections-codegen and collections from the top-level pom's
> module list.
> 2: change their parents to point to the apache parent.
> 3: tweak their poms so that the release plugin works right with them.
> 4: release them
> 5: change rest of mahout to consume release.
>
>   -jake
>
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Benson Margulies
Indeed. Off I go.

On Tue, Apr 6, 2010 at 4:23 PM, Ted Dunning  wrote:
> Very cool.  Very exciting.
>
> Benson, that sounds like consensus to me.
>
> On Tue, Apr 6, 2010 at 1:02 PM, Jake Mannix  wrote:
>
>> ... I'm in favor, I guess, of:
>>
>> 1: remove collections-codegen and collections from the top-level pom's
>> module list.
>> 2: change their parents to point to the apache parent.
>> 3: tweak their poms so that the release plugin works right with them.
>> 4: release them
>> 5: change rest of mahout to consume release.
>>
>>   -jake
>>
>>
>


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Grant Ingersoll
+1.  Release early, release often. 

-Grant

On Apr 6, 2010, at 5:12 PM, Benson Margulies wrote:

> Indeed. Off I go.
> 
> On Tue, Apr 6, 2010 at 4:23 PM, Ted Dunning  wrote:
>> Very cool.  Very exciting.
>> 
>> Benson, that sounds like consensus to me.
>> 
>> On Tue, Apr 6, 2010 at 1:02 PM, Jake Mannix  wrote:
>> 
>>> ... I'm in favor, I guess, of:
>>> 
>>> 1: remove collections-codegen and collections from the top-level pom's
>>> module list.
>>> 2: change their parents to point to the apache parent.
>>> 3: tweak their poms so that the release plugin works right with them.
>>> 4: release them
>>> 5: change rest of mahout to consume release.
>>> 
>>>   -jake
>>> 
>>> 
>> 



Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Robin Anil
Great proposal. Hopefully this will push Mahout core to have faster releases.


Robin


On Wed, Apr 7, 2010 at 3:29 AM, Grant Ingersoll  wrote:

> +1.  Release early, release often.
>
> -Grant
>
> On Apr 6, 2010, at 5:12 PM, Benson Margulies wrote:
>
> > Indeed. Off I go.
> >
> > On Tue, Apr 6, 2010 at 4:23 PM, Ted Dunning 
> wrote:
> >> Very cool.  Very exciting.
> >>
> >> Benson, that sounds like consensus to me.
> >>
> >> On Tue, Apr 6, 2010 at 1:02 PM, Jake Mannix 
> wrote:
> >>
> >>> ... I'm in favor, I guess, of:
> >>>
> >>> 1: remove collections-codegen and collections from the top-level pom's
> >>> module list.
> >>> 2: change their parents to point to the apache parent.
> >>> 3: tweak their poms so that the release plugin works right with them.
> >>> 4: release them
> >>> 5: change rest of mahout to consume release.
> >>>
> >>>   -jake
> >>>
> >>>
> >>
>
>


VOTE: release mahout-collections-codegen 1.0

2010-04-06 Thread Benson Margulies
In order to decouple the mahout-collections library from the rest of
Mahout, to allow more frequent releases and other good things, we
propose to release the code generator for the collections library as a
separate Maven artifact. (Followed, in short order, by the collections
library proper.) This is proposed release 1.0 of
mahout-collections-codegen-plugin. This is intended as a maven-only
release; we'll put the artifacts in the Mahout download area as well,
but we don't ever expect anyone to use this except from Maven,
inasmuch as it is a maven plugin.

The release artifacts are in the Nexus stage, as follows.

https://repository.apache.org/content/repositories/orgapachemahout-006/

This vote will remain open for 72 hours.


[jira] Created: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-06 Thread Zaid Md. Abdul Wahab Sheikh (JIRA)
[GSOC] Proposal to implement Neural Network with backpropagation learning on 
Hadoop
---

 Key: MAHOUT-364
 URL: https://issues.apache.org/jira/browse/MAHOUT-364
 Project: Mahout
  Issue Type: New Feature
Reporter: Zaid Md. Abdul Wahab Sheikh


Proposal Title: Implement Multi-Layer Perceptrons with backpropagation learning 
on Hadoop (addresses issue Mahout-342)

Student Name: Zaid Md. Abdul Wahab Sheikh

Student E-mail: (gmail id) sheikh.zaid



I. Brief Description

A feedforward neural network (NN) reveals several degrees of parallelism within 
it such as weight parallelism, node parallelism, network parallelism, layer 
parallelism, and training parallelism. However, network-based parallelism 
requires fine-grained synchronization and communication and thus is not 
suitable for map/reduce-based algorithms. On the other hand, training-set 
parallelism is coarse-grained. This can be easily exploited on Hadoop, which can 
split up the input among different mappers. Each of the mappers will then 
propagate the 'InputSplit' through their own copy of the complete neural 
network.
The backpropagation algorithm will operate in batch mode. This is because 
updating a common set of parameters after each training example creates a 
bottleneck for parallelization. The overall error gradient vector calculation 
can be parallelized by calculating the gradients from each training vector in 
the Mapper, combining them to get partial batch gradients and then adding them 
in a reducer to get the overall batch gradient.
In a similar manner, error function evaluations during line searches (for the 
conjugate gradient and quasi-Newton algorithms) can be efficiently parallelized.
Lastly, to avoid local minima in its error function, we can take advantage of 
training session parallelism to start multiple training sessions in parallel 
with different initial weights (simulated annealing).
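To make the gradient-summing step just described concrete, here is a minimal sketch of the
reduce side using the Hadoop 0.20 mapred API; the VectorWritable value type and the
per-training-session integer key are assumptions for illustration, not part of the proposal.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sums the partial batch gradients emitted by the mappers/combiners for one training session.
public class GradientSumReducer extends MapReduceBase
    implements Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  @Override
  public void reduce(IntWritable trainingSession, Iterator<VectorWritable> partialGradients,
                     OutputCollector<IntWritable, VectorWritable> output, Reporter reporter)
      throws IOException {
    Vector sum = null;
    while (partialGradients.hasNext()) {
      Vector g = partialGradients.next().get();
      // copy the first partial gradient, then add the rest component-wise
      sum = (sum == null) ? g.clone() : sum.plus(g);
    }
    if (sum != null) {
      output.collect(trainingSession, new VectorWritable(sum));  // overall batch gradient
    }
  }
}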



II. Detailed Proposal

The most important step is to design the base neural network classes in such a 
way that other NN architectures like Hopfield nets, Boltzmann machines, SOMs, etc. 
can be easily implemented by deriving from these base classes. For that I 
propose to implement a set of core classes that correspond to basic neural 
network concepts like artificial neuron, neuron layer, neuron connections, 
weight, transfer function, input function, learning rule etc. This architecture 
is inspired by that of the open-source Neuroph neural network framework 
(http://imgur.com/gDIOe.jpg). This design of the base architecture allows for 
great flexibility in deriving newer NNs and learning rules. All that needs to 
be done is to derive from the NeuralNetwork class, provide the method for 
network creation, create a new training method by deriving from LearningRule, 
and then add that learning rule to the network during creation. In addition, 
the API is very intuitive and easy to understand (in comparison to other NN 
frameworks like Encog and JOONE).


** The approach to parallelization in Hadoop:

In the Driver class:
- The input parameters are read and the NeuralNetwork with a specified 
LearningRule (training algorithm) created.
- Initial weight values are randomly generated and written to the FileSystem. 
If number of training sessions (for simulated annealing) is specified, multiple 
sets of initial weight values are generated.
- Training is started by calling the NeuralNetwork's learn() method. For each 
iteration, every time the error gradient vector needs to be calculated, the 
method submits a Job where the input path to the training-set vectors and 
various key properties (like path to the stored weight values) are set. The 
gradient vectors calculated by the Reducers are written back to an output path 
in the FileSystem.
- After the JobClient.runJob() returns, the gradient vectors are retrieved from 
the FileSystem and tested to see if the stopping criterion is satisfied. The 
weights are then updated, using the method implemented by the particular 
LearningRule. For line searches, each error function evaluation is again done 
by submitting a job.
- The NN is trained in iterations until it converges.

In the Mapper class:
- Each Mapper is initialized using the configure method, the weights are 
retrieved and the complete NeuralNetwork created.
- The map function then takes in the training vectors as key/value pairs (the 
key is ignored), runs them through the NN to calculate the outputs and 
backpropagates the errors to find out the error gradients. The error gradient 
vectors are then output as key/value pairs where all the keys are set to a 
common value, such as the training session number (for each training session, 
all keys in the outputs of all the mappers have to be identical).

In the Combiner class:
- Iterates through the all individual error gradient vectors output by the 
mappe

GSOC [mentor idea]: Clustering visualization with GraphViz

2010-04-06 Thread Robin Anil
Here is a good project for the wish list; if anyone wishes to take it forward, I
would be willing to help mentor.

http://www.graphviz.org/
Check out one of the graphs, which I believe is a good way to represent
clusters. Creating this graph is as easy as writing cluster output to the
GraphViz format:
http://www.bioconductor.org/overview/Screenshots/photoalbum_photo_view?b_start=6
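As a rough sketch of how little is involved (the mapping and class here are illustrative,
not existing Mahout code), emitting a DOT file from a cluster-to-points assignment could
look like this; GraphViz then renders each cluster as a box with its points attached.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;

public class ClusterDotWriter {
  // Writes one box node per cluster and an edge to each point assigned to it.
  public static void writeDot(Map<Integer, List<String>> clusterToPoints, String path)
      throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter(path));
    try {
      out.println("graph clusters {");
      for (Map.Entry<Integer, List<String>> entry : clusterToPoints.entrySet()) {
        String clusterNode = "cluster_" + entry.getKey();
        out.println("  " + clusterNode + " [shape=box];");
        for (String point : entry.getValue()) {
          out.println("  " + clusterNode + " -- \"" + point + "\";");
        }
      }
      out.println("}");
    } finally {
      out.close();
    }
  }
}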

This is an excellent project which allows you to display GraphViz graphs in a
browser. Maybe we can create a generic webapp from the current Taste webapp
and add clustering functionality there.
http://code.google.com/p/canviz/

Robin


[jira] Updated: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-06 Thread Zaid Md. Abdul Wahab Sheikh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zaid Md. Abdul Wahab Sheikh updated MAHOUT-364:
---

Description: 
Proposal Title: Implement Multi-Layer Perceptrons with backpropagation learning 
on Hadoop (addresses issue Mahout-342)

Student Name: Zaid Md. Abdul Wahab Sheikh

Student E-mail: (gmail id) sheikh.zaid



h2. I. Brief Description

A feedforward neural network (NN) reveals several degrees of parallelism within 
it such as weight parallelism, node parallelism, network parallelism, layer 
parallelism, and training parallelism. However, network-based parallelism 
requires fine-grained synchronization and communication and thus is not 
suitable for map/reduce-based algorithms. On the other hand, training-set 
parallelism is coarse-grained. This can be easily exploited on Hadoop, which can 
split up the input among different mappers. Each of the mappers will then 
propagate the 'InputSplit' through their own copy of the complete neural 
network.
The backpropagation algorithm will operate in batch mode. This is because 
updating a common set of parameters after each training example creates a 
bottleneck for parallelization. The overall error gradient vector calculation 
can be parallelized by calculating the gradients from each training vector in 
the Mapper, combining them to get partial batch gradients and then adding them 
in a reducer to get the overall batch gradient.
In a similar manner, error function evaluations during line searches (for the 
conjugate gradient and quasi-Newton algorithms) can be efficiently parallelized.
Lastly, to avoid local minima in its error function, we can take advantage of 
training session parallelism to start multiple training sessions in parallel 
with different initial weights (simulated annealing).
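
For reference, the decomposition being exploited is just the additivity of the 
batch error gradient over training vectors (notation added here, not part of the 
original proposal text):

{noformat}
\nabla E(w) = \sum_{n=1}^{N} \nabla E_n(w)
            = \sum_{m=1}^{M} \sum_{n \in S_m} \nabla E_n(w)
{noformat}

where E_n is the error on the n-th training vector and S_1, ..., S_M are the M 
InputSplits handed to the mappers: each inner sum is a partial batch gradient 
computed by one mapper (and its combiner), and a single reducer adds the M 
partial sums.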



h2. II. Detailed Proposal

The most important step is to design the base neural network classes in such a 
way that other NN architectures like Hopfield nets, Boltzmann machines, SOMs, 
etc. can be easily implemented by deriving from these base classes. For that I 
propose to implement a set of core classes that correspond to basic neural 
network concepts like artificial neuron, neuron layer, neuron connections, 
weight, transfer function, input function, learning rule, etc. This architecture 
is inspired by that of the open-source Neuroph neural network framework 
(http://imgur.com/gDIOe.jpg). This design of the base architecture allows for 
great flexibility in deriving newer NNs and learning rules. All that needs to 
be done is to derive from the NeuralNetwork class, provide the method for 
network creation, create a new training method by deriving from LearningRule, 
and then add that learning rule to the network during creation. In addition, 
the API is very intuitive and easy to understand (in comparison to other NN 
frameworks like Encog and JOONE).
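
As a rough illustration of the kind of base abstractions meant here (all names 
below are placeholders for the proposed design, not an existing Mahout or 
Neuroph API; Vector is org.apache.mahout.math.Vector):

{code}
// Sketch only; each top-level class would live in its own file.
// "Layer" stands for a hypothetical class holding neurons, their input and
// transfer functions, and the incoming weighted connections.
// Imports assumed: java.io.IOException, java.util.*, org.apache.hadoop.fs.Path,
// org.apache.hadoop.mapred.JobConf, org.apache.mahout.math.Vector.
public abstract class LearningRule {
  /** Updates the network weights from the current overall batch gradient. */
  public abstract void updateWeights(NeuralNetwork network, Vector batchGradient);

  /** Stopping criterion, e.g. a threshold on the gradient norm. */
  public abstract boolean hasConverged(Vector batchGradient);
}

public abstract class NeuralNetwork {
  protected final List<Layer> layers = new ArrayList<Layer>();
  protected LearningRule learningRule;

  /** Subclasses such as a MultiLayerPerceptron define their topology here. */
  protected abstract void createNetwork(int... layerSizes);

  /** Forward pass through all layers. */
  public abstract Vector feedForward(Vector input);

  /** Backpropagation on one training vector, returning its error gradient. */
  public abstract Vector backpropagate(Vector trainingVector);

  /** Batch training loop; on Hadoop every iteration submits a gradient Job. */
  public abstract void learn(Path trainingVectors, JobConf conf) throws IOException;
}
{code}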


h3. The approach to parallelization in Hadoop:

In the Driver class:
- The input parameters are read and the NeuralNetwork with a specified 
LearningRule (training algorithm) is created.
- Initial weight values are randomly generated and written to the FileSystem. 
If number of training sessions (for simulated annealing) is specified, multiple 
sets of initial weight values are generated.
- Training is started by calling the NeuralNetwork's learn() method. For each 
iteration, every time the error gradient vector needs to be calculated, the 
method submits a Job where the input path to the training-set vectors and 
various key properties (like path to the stored weight values) are set. The 
gradient vectors calculated by the Reducers are written back to an output path 
in the FileSystem.
- After the JobClient.runJob() returns, the gradient vectors are retrieved from 
the FileSystem and tested to see if the stopping criterion is satisfied. The 
weights are then updated, using the method implemented by the particular 
LearningRule. For line searches, each error function evaluation is again done 
by submitting a job.
- The NN is trained in iterations until it converges.
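
A compressed sketch of that driver loop against the old org.apache.hadoop.mapred 
API the proposal refers to (JobClient.runJob). Everything named here 
(BackpropDriver, GradientMapper/GradientCombiner/GradientReducer, NeuralNetwork, 
LearningRule, and the two helper methods) is proposed or hypothetical, not 
existing Mahout code:

{code}
// Imports: org.apache.hadoop.fs.Path, org.apache.hadoop.io.IntWritable,
// org.apache.hadoop.mapred.*, org.apache.mahout.math.Vector,
// org.apache.mahout.math.VectorWritable, java.io.IOException.
public class BackpropDriver {

  private Path trainingVectors;   // SequenceFile(s) of training VectorWritables
  private Path weightsPath;       // current weight values on the FileSystem
  private Path workDir;           // per-iteration gradient output goes under here
  private NeuralNetwork network;
  private LearningRule learningRule;

  public void runTraining() throws IOException {
    boolean converged = false;
    int iteration = 0;
    while (!converged) {
      JobConf conf = new JobConf(BackpropDriver.class);
      conf.setInputFormat(SequenceFileInputFormat.class);
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      FileInputFormat.setInputPaths(conf, trainingVectors);
      Path gradientOut = new Path(workDir, "gradient-" + iteration);
      FileOutputFormat.setOutputPath(conf, gradientOut);
      conf.set("mahout.nn.weights.path", weightsPath.toString()); // read by each task
      conf.setMapperClass(GradientMapper.class);
      conf.setCombinerClass(GradientCombiner.class);
      conf.setReducerClass(GradientReducer.class);
      conf.setNumReduceTasks(1);                       // one reducer -> one batch gradient
      conf.setOutputKeyClass(IntWritable.class);
      conf.setOutputValueClass(VectorWritable.class);

      JobClient.runJob(conf);                          // one error-gradient evaluation

      Vector gradient = readGradient(gradientOut, conf);
      converged = learningRule.hasConverged(gradient);
      if (!converged) {
        learningRule.updateWeights(network, gradient); // gradient descent, CG or quasi-Newton step
        writeWeights(network, weightsPath, conf);      // persist weights for the next Job
        iteration++;
      }
    }
  }

  // Hypothetical helpers: read the reducer's single output vector / write weights.
  private Vector readGradient(Path dir, JobConf conf) throws IOException {
    throw new UnsupportedOperationException("sketch only");
  }

  private void writeWeights(NeuralNetwork net, Path path, JobConf conf) throws IOException {
    throw new UnsupportedOperationException("sketch only");
  }
}
{code}

Line searches for the conjugate-gradient and quasi-Newton rules would follow the 
same pattern, with one error-evaluation Job submitted per function evaluation.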

In the Mapper class:
- Each Mapper is initialized in its configure method: the weights are 
retrieved and the complete NeuralNetwork is created.
- The map function then takes in the training vectors as key/value pairs (the 
key is ignored), runs them through the NN to calculate the outputs and 
backpropagates the errors to find out the error gradients. The error gradient 
vectors are then output as key/value pairs where all the keys are set to a 
common value, such as the training session number (for each training session, 
all keys in the outputs of all the mappers have to be identical).
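
A corresponding mapper sketch (old mapred API). NeuralNetwork.read() and 
backpropagate() are placeholders for the proposed classes, and each input 
VectorWritable is assumed, as a simplification, to already carry both the inputs 
and the target values of one training example:

{code}
// Imports: java.io.IOException, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*, org.apache.mahout.math.*.
public class GradientMapper extends MapReduceBase
    implements Mapper<Writable, VectorWritable, IntWritable, VectorWritable> {

  private NeuralNetwork network;   // full network, rebuilt once per map task
  private IntWritable sessionKey;  // constant key per training session

  @Override
  public void configure(JobConf job) {
    Path weightsPath = new Path(job.get("mahout.nn.weights.path"));
    network = NeuralNetwork.read(job, weightsPath);   // hypothetical loader: weights -> full net
    sessionKey = new IntWritable(job.getInt("mahout.nn.training.session", 0));
  }

  @Override
  public void map(Writable key, VectorWritable value,
                  OutputCollector<IntWritable, VectorWritable> output,
                  Reporter reporter) throws IOException {
    // Forward pass plus backpropagation on one training vector; the key is ignored.
    Vector gradient = network.backpropagate(value.get());
    output.collect(sessionKey, new VectorWritable(gradient));
  }
}
{code}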

In the Combiner class:
- Iterates through all the individual error gradient vectors output by the 
mapper (since they all have the same key) and adds them up to get a partial 
batch gradient.
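
For example (a sketch only; it relies on nothing beyond Mahout's Vector.plus() 
and the old mapred Reducer interface):

{code}
// Sums the per-example gradients emitted by one mapper into a single
// partial batch gradient for that map task.
// Imports: java.io.IOException, java.util.Iterator, org.apache.hadoop.io.IntWritable,
// org.apache.hadoop.mapred.*, org.apache.mahout.math.*.
public class GradientCombiner extends MapReduceBase
    implements Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  @Override
  public void reduce(IntWritable key, Iterator<VectorWritable> values,
                     OutputCollector<IntWritable, VectorWritable> output,
                     Reporter reporter) throws IOException {
    Vector sum = null;
    while (values.hasNext()) {
      Vector v = values.next().get();
      sum = (sum == null) ? v.clone() : sum.plus(v);   // element-wise addition
    }
    if (sum != null) {
      output.collect(key, new VectorWritable(sum));
    }
  }
}
{code}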

In the Reducer class:
- There's a single reducer class that will combine all the partial gradients 
from the Mappers to get the overall batch gradient.
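
Since the summation is identical, a sketch of the reducer can simply extend the 
combiner above; keeping it a separate class lets the job configure the two 
independently (for instance to divide by the number of training vectors here if 
an averaged gradient is wanted):

{code}
// With setNumReduceTasks(1), all partial gradients arrive at this one reducer;
// summing them yields the overall batch gradient, which the job's output
// (an IntWritable/VectorWritable SequenceFile) hands back to the driver.
public class GradientReducer extends GradientCombiner {
  // Identical summation logic inherited from GradientCombiner.
}
{code}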

[jira] Updated: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-06 Thread Zaid Md. Abdul Wahab Sheikh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zaid Md. Abdul Wahab Sheikh updated MAHOUT-364:
---

Comment: was deleted

(was: formatting :()

> [GSOC] Proposal to implement Neural Network with backpropagation learning on 
> Hadoop
> ---
>
> Key: MAHOUT-364
> URL: https://issues.apache.org/jira/browse/MAHOUT-364
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Zaid Md. Abdul Wahab Sheikh
>
> Proposal Title: Implement Multi-Layer Perceptrons with backpropagation 
> learning on Hadoop (addresses issue Mahout-342)
> Student Name: Zaid Md. Abdul Wahab Sheikh
> Student E-mail: (gmail id) sheikh.zaid
> h2. I. Brief Description
> A feedforward neural network (NN) reveals several degrees of parallelism 
> within it such as weight parallelism, node parallelism, network parallelism, 
> layer parallelism and training parallelism. However network based parallelism 
> requires fine-grained synchronization and communication and thus is not 
> suitable for map/reduce based algorithms. On the other hand, training-set 
> parallelism is coarse grained. This can be easily exploited on Hadoop which 
> can split up the input among different mappers. Each of the mappers will then 
> propagate the 'InputSplit' through their own copy of the complete neural 
> network.
> The backpropagation algorithm will operate in batch mode. This is because 
> updating a common set of parameters after each training example creates a 
> bottleneck for parallelization. The overall error gradient vector calculation 
> can be parallelized by calculating the gradients from each training vector in 
> the Mapper, combining them to get partial batch gradients and then adding 
> them in a reducer to get the overall batch gradient.
> In a similar manner, error function evaluations during line searches (for 
> the conjugate gradient and quasi-Newton algorithms) can be efficiently 
> parallelized.
> Lastly, to avoid local minima in its error function, we can take advantage of 
> training session parallelism to start multiple training sessions in parallel 
> with different initial weights (simulated annealing).
> h2. II. Detailed Proposal
> The most important step is to design the base neural network classes in such 
> a way that other NN architectures like Hopfield nets, Boltzmann machines, SOM 
> etc can be easily implemented by deriving from these base classes. For that I 
> propose to implement a set of core classes that correspond to basic neural 
> network concepts like artificial neuron, neuron layer, neuron connections, 
> weight, transfer function, input function, learning rule etc. This 
> architecture is inspired from that of the opensource Neuroph neural network 
> framework (http://imgur.com/gDIOe.jpg). This design of the base architecture 
> allows for great flexibility in deriving newer NNs and learning rules. All 
> that needs to be done is to derive from the NeuralNetwork class, provide the 
> method for network creation, create a new training method by deriving from 
> LearningRule, and then add that learning rule to the network during creation. 
> In addition, the API is very intuitive and easy to understand (in comparison 
> to other NN frameworks like Encog and JOONE).
> h3. The approach to parallelization in Hadoop:
> In the Driver class:
> - The input parameters are read and the NeuralNetwork with a specified 
> LearningRule (training algorithm) created.
> - Initial weight values are randomly generated and written to the FileSystem. 
> If number of training sessions (for simulated annealing) is specified, 
> multiple sets of initial weight values are generated.
> - Training is started by calling the NeuralNetwork's learn() method. For each 
> iteration, every time the error gradient vector needs to be calculated, the 
> method submits a Job where the input path to the training-set vectors and 
> various key properties (like path to the stored weight values) are set. The 
> gradient vectors calculated by the Reducers are written back to an output 
> path in the FileSystem.
> - After the JobClient.runJob() returns, the gradient vectors are retrieved 
> from the FileSystem and tested to see if the stopping criterion is satisfied. 
> The weights are then updated, using the method implemented by the particular 
> LearningRule. For line searches, each error function evaluation is again done 
> by submitting a job.
> - The NN is trained in iterations until it converges.
> In the Mapper class:
> - Each Mapper is initialized using the configure method, the weights are 
> retrieved and the complete NeuralNetwork created.
> - The map function then takes in the training vectors as key/value pairs (the 
> key is ignored), runs them through the NN to calculate the outputs

Re: VOTE: release mahout-collections-codegen 1.0

2010-04-06 Thread Ted Dunning
Is that possible here instead:
https://repository.apache.org/content/repositories/staging/org/apache/mahout/?

On Tue, Apr 6, 2010 at 6:08 PM, Benson Margulies wrote:

> In order to decouple the mahout-collections library from the rest of
> Mahout, to allow more frequent releases and other good things, we
> propose to release the code generator for the collections library as a
> separate Maven artifact. (Followed, in short order, by the collections
> library proper.) This is proposed release 1.0 of
> mahout-collections-codegen-plugin. This is intended as a maven-only
> release; we'll put the artifacts in the Mahout download area as well,
> but we don't ever expect anyone to use this except from Maven,
> inasmuch as it is a maven plugin.
>
> The release artifacts are in the Nexus stage, as follows.
>
> https://repository.apache.org/content/repositories/orgapachemahout-006/
>
> This vote will remain open for 72 hours.
>


Re: VOTE: release mahout-collections-codegen 1.0

2010-04-06 Thread Benson Margulies
On Tue, Apr 6, 2010 at 9:40 PM, Ted Dunning  wrote:
> Is that possible here instead:
> https://repository.apache.org/content/repositories/staging/org/apache/mahout/?

No, that's not right. That path has our last (0.3) release in it.
However, I had forgotten to close it.

https://repository.apache.org/content/repositories/orgapachemahout-006/

It should work better now.


>
> On Tue, Apr 6, 2010 at 6:08 PM, Benson Margulies wrote:
>
>> In order to decouple the mahout-collections library from the rest of
>> Mahout, to allow more frequent releases and other good things, we
>> propose to release the code generator for the collections library as a
>> separate Maven artifact. (Followed, in short order, by the collections
>> library proper.) This is proposed release 1.0 of
>> mahout-collections-codegen-plugin. This is intended as a maven-only
>> release; we'll put the artifacts in the Mahout download area as well,
>> but we don't ever expect anyone to use this except from Maven,
>> inasmuch as it is a maven plugin.
>>
>> The release artifacts are in the Nexus stage, as follows.
>>
>> https://repository.apache.org/content/repositories/orgapachemahout-006/
>>
>> This vote will remain open for 72 hours.
>>
>


Re: A request for prospective GSOC students

2010-04-06 Thread Zaid Md Abdul Wahab Sheikh
I just submitted a proposal to implement Neural Network with
backpropagation learning
Jira issue: http://issues.apache.org/jira/browse/MAHOUT-364

On Sat, Apr 3, 2010 at 9:07 PM, Robin Anil  wrote:
> I am having a tough time separating Mahout proposals from rest of Apache on
> gsoc website. So I would request you all to reply to this thread when you
> have submitted a proposal so that we don't miss out on reading your hard
> worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
> proposal. If anyone else have applied do reply back with the title of the
> proposal.
>
> Robin
>



-- 
Zaid Md. Abdul Wahab Sheikh
Senior Undergraduate
B.Tech Computer Science and Engineering
NIT Allahabad (MNNIT)


Re: [GSOC] 2010 Timelines

2010-04-06 Thread Robin Anil
2 days to go till the close of student submissions. A request to mentors to
provide feedback to all the queries on the list so that students can go and
work on tuning their proposals.

Robin

On Sat, Apr 3, 2010 at 10:50 PM, Grant Ingersoll wrote:

>
> http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline


[GSoC 2010] Requesting feedback on my proposal for implementing Neural Network with backpropagation learning

2010-04-06 Thread Zaid Md Abdul Wahab Sheikh
Hi all,

I just submitted a GSoC proposal for implementing Neural Network with
backpropagation on Hadoop.

Jira issue: http://issues.apache.org/jira/browse/MAHOUT-364

I would appreciate your feedback and comments on the proposal and on
the working or implementation plan.

---


I. Brief Description

A feedforward neural network (NN) reveals several degrees of
parallelism within it such as weight parallelism, node parallelism,
network parallelism, layer parallelism and training parallelism.
However network based parallelism requires fine-grained
synchronization and communication and thus is not suitable for
map/reduce based algorithms. On the other hand, training-set
parallelism is coarse grained. This can be easily exploited on Hadoop
which can split up the input among different mappers. Each of the
mappers will then propagate the 'InputSplit' through their own copy of
the complete neural network.
The backpropagation algorithm will operate in batch mode. This is
because updating a common set of parameters after each training
example creates a bottleneck for parallelization. The overall error
gradient vector calculation can be parallelized by calculating the
gradients from each training vector in the Mapper, combining them to
get partial batch gradients and then adding them in a reducer to get
the overall batch gradient.
In a similar manner, error function evaluations during line searches
(for the conjugate gradient and quasi-Newton algorithms) can be
efficiently parallelized.
Lastly, to avoid local minima in its error function, we can take
advantage of training session parallelism to start multiple training
sessions in parallel with different initial weights (simulated
annealing).



II. Detailed Proposal

The most important step is to design the base neural network classes
in such a way that other NN architectures like Hopfield nets, Boltzmann
machines, SOM etc can be easily implemented by deriving from these
base classes. For that I propose to implement a set of core classes
that correspond to basic neural network concepts like artificial
neuron, neuron layer, neuron connections, weight, transfer function,
input function, learning rule, etc. This architecture is inspired by
that of the open-source Neuroph neural network framework
(http://imgur.com/gDIOe.jpg). This design of the base architecture
allows for great flexibility in deriving newer NNs and learning rules.
All that needs to be done is to derive from the NeuralNetwork class,
provide the method for network creation, create a new training method
by deriving from LearningRule, and then add that learning rule to the
network during creation. In addition, the API is very intuitive and
easy to understand (in comparison to other NN frameworks like Encog
and JOONE).


** The approach to parallelization:

In the Driver class:
- The input parameters are read and the NeuralNetwork with a specified
LearningRule (training algorithm) is created.
- Initial weight values are randomly generated and written to the
FileSystem. If number of training sessions (for simulated annealing)
is specified, multiple sets of initial weight values are generated.
- Training is started by calling the NeuralNetwork's learn() method.
For each iteration, every time the error gradient vector needs to be
calculated, the method submits a Job where the input path to the
training-set vectors and various key properties (like path to the
stored weight values) are set. The gradient vectors calculated by the
Reducers are written back to an output path in the FileSystem.
- After the JobClient.runJob() returns, the gradient vectors are
retrieved from the FileSystem and tested to see if the stopping
criterion is satisfied. The weights are then updated, using the method
implemented by the particular LearningRule. For line searches, each
error function evaluation is again done by submitting a job.
- The NN is trained in iterations until it converges.

In the Mapper class:
- Each Mapper is initialized in its configure method: the weights
are retrieved and the complete NeuralNetwork is created.
- The map function then takes in the training vectors as key/value
pairs (the key is ignored), runs them through the NN to calculate the
outputs and backpropagates the errors to find out the error gradients.
The error gradient vectors are then output as key/value pairs where
all the keys are set to a common value, such as the training session
number (for each training session, all keys in the outputs of all the
mappers have to be identical).

In the Combiner class:
- Iterates through all the individual error gradient vectors output by
the mapper (since they all have the same key) and adds them up to get
a partial batch gradient.

In the Reducer class:
- There's a single reducer class that will combine all the partial
gradients from the Mappers to get the overall batch gradient.
- The final error gradient vector is written back to an output path in the
FileSystem.

[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-06 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854304#action_12854304
 ] 

Jake Mannix commented on MAHOUT-364:


I've got to say, this is a fantastically well written proposal, with perfect 
breadth of scope as well.  

Do we have someone who can shepherd this?

> [GSOC] Proposal to implement Neural Network with backpropagation learning on 
> Hadoop
> ---
>
> Key: MAHOUT-364
> URL: https://issues.apache.org/jira/browse/MAHOUT-364
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Zaid Md. Abdul Wahab Sheikh
>
> Proposal Title: Implement Multi-Layer Perceptrons with backpropagation 
> learning on Hadoop (addresses issue Mahout-342)
> Student Name: Zaid Md. Abdul Wahab Sheikh
> Student E-mail: (gmail id) sheikh.zaid
> h2. I. Brief Description
> A feedforward neural network (NN) reveals several degrees of parallelism 
> within it such as weight parallelism, node parallelism, network parallelism, 
> layer parallelism and training parallelism. However network based parallelism 
> requires fine-grained synchronization and communication and thus is not 
> suitable for map/reduce based algorithms. On the other hand, training-set 
> parallelism is coarse grained. This can be easily exploited on Hadoop which 
> can split up the input among different mappers. Each of the mappers will then 
> propagate the 'InputSplit' through their own copy of the complete neural 
> network.
> The backpropagation algorithm will operate in batch mode. This is because 
> updating a common set of parameters after each training example creates a 
> bottleneck for parallelization. The overall error gradient vector calculation 
> can be parallelized by calculating the gradients from each training vector in 
> the Mapper, combining them to get partial batch gradients and then adding 
> them in a reducer to get the overall batch gradient.
> In a similar manner, error function evaluations during line searches (for 
> the conjugate gradient and quasi-Newton algorithms) can be efficiently 
> parallelized.
> Lastly, to avoid local minima in its error function, we can take advantage of 
> training session parallelism to start multiple training sessions in parallel 
> with different initial weights (simulated annealing).
> h2. II. Detailed Proposal
> The most important step is to design the base neural network classes in such 
> a way that other NN architectures like Hopfield nets, Boltzmann machines, SOM 
> etc can be easily implemented by deriving from these base classes. For that I 
> propose to implement a set of core classes that correspond to basic neural 
> network concepts like artificial neuron, neuron layer, neuron connections, 
> weight, transfer function, input function, learning rule etc. This 
> architecture is inspired from that of the opensource Neuroph neural network 
> framework (http://imgur.com/gDIOe.jpg). This design of the base architecture 
> allows for great flexibility in deriving newer NNs and learning rules. All 
> that needs to be done is to derive from the NeuralNetwork class, provide the 
> method for network creation, create a new training method by deriving from 
> LearningRule, and then add that learning rule to the network during creation. 
> In addition, the API is very intuitive and easy to understand (in comparison 
> to other NN frameworks like Encog and JOONE).
> h3. The approach to parallelization in Hadoop:
> In the Driver class:
> - The input parameters are read and the NeuralNetwork with a specified 
> LearningRule (training algorithm) created.
> - Initial weight values are randomly generated and written to the FileSystem. 
> If number of training sessions (for simulated annealing) is specified, 
> multiple sets of initial weight values are generated.
> - Training is started by calling the NeuralNetwork's learn() method. For each 
> iteration, every time the error gradient vector needs to be calculated, the 
> method submits a Job where the input path to the training-set vectors and 
> various key properties (like path to the stored weight values) are set. The 
> gradient vectors calculated by the Reducers are written back to an output 
> path in the FileSystem.
> - After the JobClient.runJob() returns, the gradient vectors are retrieved 
> from the FileSystem and tested to see if the stopping criterion is satisfied. 
> The weights are then updated, using the method implemented by the particular 
> LearningRule. For line searches, each error function evaluation is again done 
> by submitting a job.
> - The NN is trained in iterations until it converges.
> In the Mapper class:
> - Each Mapper is initialized using the configure method, the weights are 
> retrieved and the complete NeuralNetwork created.
> - The map function th

[jira] Commented: (MAHOUT-358) the pref value field of output of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative

2010-04-06 Thread Hui Wen Han (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854317#action_12854317
 ] 

Hui Wen Han commented on MAHOUT-358:


http://java.sun.com/javase/6/docs/api/java/math/BigDecimal.html#BigDecimal(double)
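
Presumably the link is pointing at the caveat in that constructor's javadoc: 
new BigDecimal(double) exposes the exact binary value of a double rather than 
the decimal literal that was typed in. A minimal illustration of that caveat 
(not related to the RecommenderJob code itself):

{code}
import java.math.BigDecimal;

public class BigDecimalDemo {
  public static void main(String[] args) {
    // 0.1 has no exact binary representation, so the double constructor
    // prints a long expansion (0.1000000000000000055511151231...),
    System.out.println(new BigDecimal(0.1));
    // while valueOf() goes through Double.toString() and prints 0.1.
    System.out.println(BigDecimal.valueOf(0.1));
  }
}
{code}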

> the pref value  field of output of 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob has negative
> -
>
> Key: MAHOUT-358
> URL: https://issues.apache.org/jira/browse/MAHOUT-358
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> In my test, the input pref values are all positive.
> The output score values have negative values,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-06 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854349#action_12854349
 ] 

Ted Dunning commented on MAHOUT-364:



This is a very nicely written proposal.

One technical question I have is whether you will see gains in parallelism for 
training a single model.  The experience with logistic regression makes this 
seem less likely.

The dual level model structure that John Langford proposes in this lecture 
might be of interest: http://videolectures.net/nipsworkshops09_langford_pol/  
He makes some inflammatory comments right off the bat that you might need to 
address.

All that said, having a good implementation of an ANN learner is a good thing.

> [GSOC] Proposal to implement Neural Network with backpropagation learning on 
> Hadoop
> ---
>
> Key: MAHOUT-364
> URL: https://issues.apache.org/jira/browse/MAHOUT-364
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Zaid Md. Abdul Wahab Sheikh
>
> Proposal Title: Implement Multi-Layer Perceptrons with backpropagation 
> learning on Hadoop (addresses issue Mahout-342)
> Student Name: Zaid Md. Abdul Wahab Sheikh
> Student E-mail: (gmail id) sheikh.zaid
> h2. I. Brief Description
> A feedforward neural network (NN) reveals several degrees of parallelism 
> within it such as weight parallelism, node parallelism, network parallelism, 
> layer parallelism and training parallelism. However network based parallelism 
> requires fine-grained synchronization and communication and thus is not 
> suitable for map/reduce based algorithms. On the other hand, training-set 
> parallelism is coarse grained. This can be easily exploited on Hadoop which 
> can split up the input among different mappers. Each of the mappers will then 
> propagate the 'InputSplit' through their own copy of the complete neural 
> network.
> The backpropagation algorithm will operate in batch mode. This is because 
> updating a common set of parameters after each training example creates a 
> bottleneck for parallelization. The overall error gradient vector calculation 
> can be parallelized by calculating the gradients from each training vector in 
> the Mapper, combining them to get partial batch gradients and then adding 
> them in a reducer to get the overall batch gradient.
> In a similar manner, error function evaluations during line searches (for 
> the conjugate gradient and quasi-Newton algorithms) can be efficiently 
> parallelized.
> Lastly, to avoid local minima in its error function, we can take advantage of 
> training session parallelism to start multiple training sessions in parallel 
> with different initial weights (simulated annealing).
> h2. II. Detailed Proposal
> The most important step is to design the base neural network classes in such 
> a way that other NN architectures like Hopfield nets, Boltzmann machines, SOM 
> etc can be easily implemented by deriving from these base classes. For that I 
> propose to implement a set of core classes that correspond to basic neural 
> network concepts like artificial neuron, neuron layer, neuron connections, 
> weight, transfer function, input function, learning rule etc. This 
> architecture is inspired from that of the opensource Neuroph neural network 
> framework (http://imgur.com/gDIOe.jpg). This design of the base architecture 
> allows for great flexibility in deriving newer NNs and learning rules. All 
> that needs to be done is to derive from the NeuralNetwork class, provide the 
> method for network creation, create a new training method by deriving from 
> LearningRule, and then add that learning rule to the network during creation. 
> In addition, the API is very intuitive and easy to understand (in comparison 
> to other NN frameworks like Encog and JOONE).
> h3. The approach to parallelization in Hadoop:
> In the Driver class:
> - The input parameters are read and the NeuralNetwork with a specified 
> LearningRule (training algorithm) created.
> - Initial weight values are randomly generated and written to the FileSystem. 
> If number of training sessions (for simulated annealing) is specified, 
> multiple sets of initial weight values are generated.
> - Training is started by calling the NeuralNetwork's learn() method. For each 
> iteration, every time the error gradient vector needs to be calculated, the 
> method submits a Job where the input path to the training-set vectors and 
> various key properties (like path to the stored weight values) are set. The 
> gradient vectors calculated by the Reducers are written back to an output 
> path in the FileSystem.
> - After the JobClient.runJob() returns, the gradient vectors are retrieved 
> from the FileSystem and tested to see if the stopping criterion is satisfied. 
> The weights are then

Re: VOTE: release mahout-collections-codegen 1.0

2010-04-06 Thread Ted Dunning
I confirm that the components exist and appear in good order.

Is there a way for me to test this component?  Is there any testing needed
beyond checking existence?

On Tue, Apr 6, 2010 at 7:13 PM, Benson Margulies wrote:

> On Tue, Apr 6, 2010 at 9:40 PM, Ted Dunning  wrote:
> > Is that possible here instead:
> >
> https://repository.apache.org/content/repositories/staging/org/apache/mahout/
> ?
>
> No, that's not right. That path has our last (0.3) release in it.
> However, I had forgotten to close it.
>
> https://repository.apache.org/content/repositories/orgapachemahout-006/
>
> It should work better now.
>
>
> >
> > On Tue, Apr 6, 2010 at 6:08 PM, Benson Margulies  >wrote:
> >
> >> In order to decouple the mahout-collections library from the rest of
> >> Mahout, to allow more frequent releases and other good things, we
> >> propose to release the code generator for the collections library as a
> >> separate Maven artifact. (Followed, in short order, by the collections
> >> library proper.) This is proposed release 1.0 of
> >> mahout-collections-codegen-plugin. This is intended as a maven-only
> >> release; we'll put the artifacts in the Mahout download area as well,
> >> but we don't ever expect anyone to use this except from Maven,
> >> inasmuch as it is a maven plugin.
> >>
> >> The release artifacts are in the Nexus stage, as follows.
> >>
> >> https://repository.apache.org/content/repositories/orgapachemahout-006/
> >>
> >> This vote will remain open for 72 hours.
> >>
> >
>


Re: VOTE: release mahout-collections-codegen 1.0

2010-04-06 Thread Robin Anil
Is there a patch which pulls in this dependency to build Mahout? That's a good
test for it.

Robin

On Wed, Apr 7, 2010 at 10:45 AM, Ted Dunning  wrote:

> I confirm that the components exist and appear in good order.
>
> Is there a way for me to test this component?  Is there any testing needed
> beyond checking existence?
>
> On Tue, Apr 6, 2010 at 7:13 PM, Benson Margulies  >wrote:
>
> > On Tue, Apr 6, 2010 at 9:40 PM, Ted Dunning 
> wrote:
> > > Is that possible here instead:
> > >
> >
> https://repository.apache.org/content/repositories/staging/org/apache/mahout/
> > ?
> >
> > No, that's not right. That path has our last (0.3) release in it.
> > However, I had forgotten to close it.
> >
> > https://repository.apache.org/content/repositories/orgapachemahout-006/
> >
> > It should work better now.
> >
> >
> > >
> > > On Tue, Apr 6, 2010 at 6:08 PM, Benson Margulies <
> bimargul...@gmail.com
> > >wrote:
> > >
> > >> In order to decouple the mahout-collections library from the rest of
> > >> Mahout, to allow more frequent releases and other good things, we
> > >> propose to release the code generator for the collections library as a
> > >> separate Maven artifact. (Followed, in short order, by the collections
> > >> library proper.) This is proposed release 1.0 of
> > >> mahout-collections-codegen-plugin. This is intended as a maven-only
> > >> release; we'll put the artifacts in the Mahout download area as well,
> > >> but we don't ever expect anyone to use this except from Maven,
> > >> inasmuch as it is a maven plugin.
> > >>
> > >> The release artifacts are in the Nexus stage, as follows.
> > >>
> > >>
> https://repository.apache.org/content/repositories/orgapachemahout-006/
> > >>
> > >> This vote will remain open for 72 hours.
> > >>
> > >
> >
>


Introducing Gizzard, a framework for creating distributed datastores

2010-04-06 Thread Robin Anil
It's Apache-licensed and looks like a great option for storing and querying
large graphs. It may be useful as a model store for classifiers.

http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html
http://github.com/twitter/gizzard

Robin