Need Help in Clustering

2013-06-24 Thread Rajan Gupta
Hi,
I am new to Mahout.

I have text data in a format like:

Id,age,income,perwt,sex,city,product
1,23,2200,40,2,Boston,product #1

I want to perform k-means clustering based on two fields, age and
income. I also want to specify the number of clusters.

I have already performed clustering by converting the file into sequence and
vector files, but I get an empty file when running clusterdump. I guess there
is something wrong with the way the classes are written and with the format
of my input file.

Can anyone help me with this?

Thanks in advance,
Rajan Gupta


Re: Need Help in Clustering

2013-06-24 Thread Suneel Marthi
How are you converting your data to a sequence file?
If you are not sure, check this link:
http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program

Are you getting any clustered points after running k-means?

It would help with troubleshooting if you could list the commands you
executed.
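The StackOverflow link above covers the full CSV-to-vector pipeline; the core pre-processing step it implies is pulling the two numeric columns out of each CSV row before wrapping them as Mahout `DenseVector`s and writing them to a SequenceFile. A minimal sketch of that column extraction in plain Java (the class name `CsvToPoints` and the column indices are my assumptions, inferred from the header `Id,age,income,perwt,sex,city,product`):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: extract the age and income columns from each CSV row so they can
// be wrapped in Mahout Vectors and written to a SequenceFile. Column indices
// are assumptions based on the header "Id,age,income,perwt,sex,city,product".
public class CsvToPoints {
    static final int AGE_COL = 1;
    static final int INCOME_COL = 2;

    public static List<double[]> parse(List<String> csvLines) {
        List<double[]> points = new ArrayList<>();
        for (String line : csvLines) {
            String[] cols = line.split(",");
            if (cols[0].equalsIgnoreCase("Id")) {
                continue; // skip the header row
            }
            double age = Double.parseDouble(cols[AGE_COL]);
            double income = Double.parseDouble(cols[INCOME_COL]);
            points.add(new double[] {age, income});
        }
        return points;
    }
}
```

In the real pipeline each `double[]` would become a `DenseVector` inside a `VectorWritable` keyed by the row id.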





Re: Need Help in Clustering

2013-06-24 Thread Rajan Gupta
Thanks for your response.

Yes, I get clustered points after running k-means. I have done clustering
successfully with the 20newsgroups and Reuters data, and clusterdump also
works properly with those examples.
Now,
I have text data in a format like:

Id,age,income,perwt,sex,city,product
1,23,2200,40,2,Boston,product #1

--

I want the output to be:

Id,age,income,perwt,sex,city,product,cluster
1,23,2200,25,2,Boston,product #1,1
2,26,6600,30,1,New York,product #5,3
3,24,4400,48,2,Portland,product #24,2
4,29,9900,60,1,San Jose,product #70,4

Can anyone help?


Do I need to write custom code for this? If so, please help me.

Thanks in advance,

Regards,
Rajan Gupta
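The desired output above is just the input rows with a cluster id appended, which is exactly the assignment step of k-means. As an illustration of where that last column comes from, here is a toy in-memory Lloyd's-algorithm sketch over (age, income) pairs; Mahout's `KMeansDriver` performs the same assignment at scale, and the class name, initial centroids, and iteration count here are arbitrary assumptions of mine:

```java
// Toy in-memory k-means (Lloyd's algorithm) over (age, income) pairs, to
// illustrate producing the cluster id per row that the desired output shows.
public class ToyKMeans {
    public static int[] cluster(double[][] points, double[][] centroids, int iterations) {
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = 0;
                    for (int j = 0; j < points[i].length; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assignment[i] = best;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[centroids.length][points[0].length];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < points[i].length; j++) {
                    sums[assignment[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < sums[c].length; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }
}
```

Joining `assignment[i]` back onto row `i` of the original CSV gives the extra `cluster` column.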






[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691954#comment-13691954
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

Hi,

Any progress on this?  It is the last open issue for 0.8.

Thanks,
Grant


 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et
 al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain correct clustering results for 
 the case of K > 1. We have an idea and implementation to select eigenvectors
 based on cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will generate a wrong clustering result. In 
 this case, the selected K eigenvectors actually provide a better way to 
 initialize cluster centroids, because each selected eigenvector is a relaxed 
 indicator of the memberships of one cluster. For every selected eigenvector, 
 we use the data point whose eigen-component achieves the maximum absolute 
 value. 
 We have already verified our improvement on a synthetic dataset: the 
 improved version obtains the optimal clustering result, while the current 
 0.7 version obtains a wrong one.
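The cosAngle/orthogonality selection mentioned in Issue 1 boils down to a standard numerical check: two eigenvectors are (numerically) orthogonal when the cosine of the angle between them is near zero. A hedged sketch of that check (the class name and the tolerance are my own choices, not taken from the patch):

```java
// Sketch of a cosine-based orthogonality check in the spirit of Issue 1:
// vectors are treated as nearly orthogonal when |cos(angle)| is below a
// tolerance. The tolerance value is an arbitrary assumption.
public class OrthoCheck {
    public static double cosAngle(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static boolean nearlyOrthogonal(double[] a, double[] b, double tol) {
        return Math.abs(cosAngle(a, b)) < tol;
    }
}
```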

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Need Help in Clustering

2013-06-24 Thread Ted Dunning
On Mon, Jun 24, 2013 at 12:14 PM, Rajan Gupta rajangupta0...@gmail.comwrote:

 Do i need to create custom code for this, if yes do help me


Yes.  You definitely need custom code for this.

You also need to think about your data and why you want clusters.

What does age mean to a cluster?  Are people with the same age supposed to
be the same in some sense?  What does 5 years' difference mean?  Is the
distance from 20 to 25 the same as the difference between 55 and 60?

What about city?  How many cities are there?  Do you have any sense of
which cities are more like each other than others?

What about income?  Should you perhaps use log(income) for computing
distances?

What is perwt?

Why is there just one product per line?  Which products are more similar
than others?
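One common answer to the distance questions above is to put the fields on comparable scales before clustering, e.g. z-score the age column and log-transform income. This is only an illustration of that idea, not anything prescribed in the thread; the class and method names are mine:

```java
// Illustrative feature scaling before distance-based clustering:
// z-scores make a column unit-variance, and log(income) compresses the
// long tail so a $1000 gap matters more at low incomes than at high ones.
public class FeatureScaling {
    public static double[] zScores(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;
        double var = 0;
        for (double v : values) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / values.length);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - mean) / std;
        }
        return out;
    }

    public static double[] logTransform(double[] values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Math.log(values[i]);
        }
        return out;
    }
}
```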


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Yiqun Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691969#comment-13691969
 ] 

Yiqun Hu commented on MAHOUT-1214:
--

Grant, we have addressed all review comments and uploaded the updated patch,
but haven't received any reply. Can this be included in 0.8?



Build failed in Jenkins: Mahout-Quality #2102

2013-06-24 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/2102/

--
[...truncated 7204 lines...]
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
[WARNING] Failure executing PMD: Couldn't find the class Can't find resource 
rulesets/basic.xml.  Make sure the resource is a valid file or URL or is on the 
CLASSPATH.  Here's the current classpath: 
/home/hudson/tools/maven/apache-maven-3.0.4/boot/plexus-classworlds-2.4.jar
java.lang.RuntimeException: Couldn't find the class Can't find resource 
rulesets/basic.xml.  Make sure the resource is a valid file or URL or is on the 
CLASSPATH.  Here's the current classpath: 
/home/hudson/tools/maven/apache-maven-3.0.4/boot/plexus-classworlds-2.4.jar
at 
net.sourceforge.pmd.RuleSetFactory.classNotFoundProblem(RuleSetFactory.java:244)
at 
net.sourceforge.pmd.RuleSetFactory.parseRuleSetNode(RuleSetFactory.java:234)
at 
net.sourceforge.pmd.RuleSetFactory.createRuleSet(RuleSetFactory.java:161)
at 
net.sourceforge.pmd.RuleSetFactory.createRuleSets(RuleSetFactory.java:126)
at 
net.sourceforge.pmd.RuleSetFactory.createRuleSets(RuleSetFactory.java:111)
at 
net.sourceforge.pmd.processor.AbstractPMDProcessor.createRuleSets(AbstractPMDProcessor.java:56)
at 
net.sourceforge.pmd.processor.MonoThreadProcessor.processFiles(MonoThreadProcessor.java:41)
at net.sourceforge.pmd.PMD.processFiles(PMD.java:271)
at 
org.apache.maven.plugin.pmd.PmdReport.generateReport(PmdReport.java:296)
at org.apache.maven.plugin.pmd.PmdReport.execute(PmdReport.java:194)
at 
org.apache.maven.plugin.pmd.PmdReport.executeReport(PmdReport.java:168)
at 
org.apache.maven.reporting.AbstractMavenReport.generate(AbstractMavenReport.java:190)
at 
org.apache.maven.reporting.AbstractMavenReport.execute(AbstractMavenReport.java:99)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
[INFO] 
[INFO] 
[INFO] Building Mahout Integration 0.8-SNAPSHOT
[INFO] 
[WARNING] The POM for 
com.atlassian.maven.plugins:maven-clover2-plugin:jar:3.1.11.1 is missing, no 
dependency information available
[WARNING] Failed to retrieve plugin descriptor for 
com.atlassian.maven.plugins:maven-clover2-plugin:3.1.11.1: Plugin 
com.atlassian.maven.plugins:maven-clover2-plugin:3.1.11.1 or one of its 
dependencies could not be resolved: Failed to read artifact descriptor 

Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Dmitriy Lyubimov
OK, so I was fairly easily able to build a DSL for our matrix
manipulation (similar to Breeze) in Scala:

inline matrix or vector:

val  a = dense((1, 2, 3), (3, 4, 5))

val b:Vector = (1,2,3)

block views and assignments (element/row/vector/block/block of row or
vector)


a(::, 0)
a(1, ::)
a(0 to 1, 1 to 2)

assignments

a(0, ::) :=(3, 5, 7)
a(0, 0 to 1) :=(3, 5)
a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))

operators

// hadamard
val c = a * b
 a *= b

// matrix mul
 val m = a %*% b

and bunch of other little things like sum, mean, colMeans etc. That much is
easy.

Also stuff like the ones found in breeze along the lines

val (u,v,s) = svd(a)

diag ((1,2,3))

and Cholesky in similar ways.

I don't have inline initialization for sparse things (yet), simply because
I don't need them, but of course all the regular Java constructors and methods
are retained; all of this is just syntactic sugar in the spirit of DSLs, in
the hope of making things a bit more readable.

My (very little, and very insignificantly opinionated, really) criticism of
Breeze in this context is its inconsistency between dense and sparse
representations, namely the lack of consistent overarching trait(s), so that
building structure-agnostic solvers like Mahout's Cholesky solver is
impossible, as is cross-type matrix use (say, the way I understand it,
it is pretty much impossible to multiply a sparse matrix by a dense matrix).

I suspect these problems stem from the fact that the authors for whatever
reason decided to hardwire dense things to JBlas solvers, whereas I don't
believe matrix storage structures must be. But these problems do appear to
be serious enough for me to ignore Breeze for now. If I decide to plug in
JBlas dense solvers, I guess I will just have them as yet another top-level
routine interface taking any Matrix, e.g.

val (u,v,s) = svd(m, jblas=true)



On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Thank you.
 On Jun 23, 2013 6:16 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I think that this contract has migrated a bit from the first starting
 point.

 My feeling is that there is a de facto contract now that the matrix slice
 is a single row.

 Sent from my iPhone

 On Jun 23, 2013, at 16:32, Dmitriy Lyubimov dlie...@gmail.com wrote:

  What does Matrix.iterateAll() contractually do? Practically it seems to be
  row-wise iteration for some implementations, but the javadoc doesn't
  contractually state so. What is MatrixSlice if it is neither a row nor a
  column? How can I tell what exactly it is I am iterating over?
  On Jun 19, 2013 12:21 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
  Question #2: which in-core solvers are available for Mahout
 matrices? I
  know there's SSVD, probably Cholesky, is there something else? In
   particular, I need to be solving linear systems; I guess Cholesky
 should
  be
  equipped enough to do just that?
 
  Question #3: why did we try to import Colt solvers rather than
 actually
  depend on Colt in the first place? Why did we not accept Colt's
 sparse
  matrices and created native ones instead?
 
   Colt seems to have a notion of sparse in-core matrices too, and seems
  like
  a
  well-rounded solution. However, it doesn't seem like being actively
  supported, whereas I know Mahout experienced continued enhancements
 to
  the
  in-core matrix support.
 
 
  Colt was totally abandoned, and I talked to the original author and he
  blessed it's adoption.  When we pulled it in, we found it was woefully
  undertested,
  and tried our best to hook it in with proper tests and use APIs that
 fit
  with
  the use cases we had.  Plus, we already had the start of some linear
 apis
  (i.e.
  the Vector interface) and dropping the API completely seemed not
 terribly
  worth it at the time.
 
 
  There was even more to it than that.
 
  Colt was under-tested and there have been warts that had to be pulled
 out
  in much of the code.
 
  But, worse than that, Colt's matrix and vector structure was a real
 bugger
  to extend or change.  It also had all kinds of cruft where it
 pretended to
  support matrices of things, but in fact only supported matrices of
 doubles
  and floats.
 
  So using Colt as it was (and is since it is largely abandoned) was a
  non-starter.
 
  As far as in-memory solvers, we have:
 
  1) LR decomposition (tested and kinda fast)
 
  2) Cholesky decomposition (tested)
 
  3) SVD (tested)
 




Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Ted Dunning
Dmitriy,

This is very pretty.






Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #522

2013-06-24 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/522/changes

Changes:

[smarthi] MAHOUT-944: lucene2seq - more code cleanup, removed unused imports

[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - fixed 
issue with not reading a directory list

[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - first 
round of Code cleanup based on feedback from code review

[smarthi] MAHOUT-944:lucene2seq - removed unused import

--
[...truncated 5861 lines...]
INFO: Starting flush of map output
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
sortAndSpill
INFO: Finished spill 0
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0019_m_00_0 is done. And is in the process of 
commiting
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0019_m_00_0' done.
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 57 bytes
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0019_r_00_0 is done. And is in the process of 
commiting
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0019_r_00_0 is allowed to commit now
Jun 24, 2013 6:35:28 PM 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0019_r_00_0' to 
/tmp/mahout-work-hudson/reuters-lda-model/model-19
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce  reduce
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0019_r_00_0' done.
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0019
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Counters: 17
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters 
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=389
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=1614631021
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=1629239107
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters 
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=152
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=61
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Map input records=0
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=40
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=120
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=697171968
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=119
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Combine input records=20
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=20
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=20
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Combine output records=20
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=20
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Map output records=20
Jun 24, 2013 6:35:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: About to run iteration 20 of 20
Jun 24, 2013 6:35:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: About to run: Iteration 20 of 20, input path: 
/tmp/mahout-work-hudson/reuters-lda-model/model-19
Jun 24, 2013 6:35:28 PM 

Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Jake Mannix
Yeah, I'm totally on board with a pretty Scala DSL on top of some of our
stuff.

In particular, I've been experimenting with wrapping the
DistributedRowMatrix
in a Scalding wrapper, so we can do things like

val matrixAsTypedPipe =
    DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
path, conf))

// e.g. L1 normalize:
  matrixAsTypedPipe.map((idx, v): (Int, Vector) => (idx, v.normalize(1)))
                   .write(new DistributedRowMatrixPipe(outputPath, conf))

// and anything else you would want to do with a scalding TypedPipe[Int,
Vector]

Currently I've been doing this with a package structure directly in Mahout,
in:

   mahout/contrib/scalding

What do people think about having this be something real, after 0.8 goes
out?  Are
we ready for contrib modules which fold in diverse external projects in new
ways?
Integrating directly with Pig and Scalding is a bit too wide of a tent for
Mahout core,
but putting these integrations in entirely new projects is maybe a bit too
far away.


Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Nick Pentreath
That looks great Dmitriy!


The thing about Breeze that drives the complexity in it is partly 
specialization for Float, Double and Int matrices, and partly getting the 
syntax to just work for all combinations of matrix types and operands etc. 
Mostly it does just work but occasionally not.



I am surprised that dense * sparse matrix doesn't work but I guess as I 
previously mentioned the sparse matrix support is a bit shaky. 


David Hall is pretty happy to both look into enhancements and help out for 
contributions (eg I'm hoping to find time to look into a proper Diagonal matrix 
implementation and he was very helpful with pointers etc), so please do drop 
things into the google group mailing list. Hopefully wider adoption especially 
by this type of community will drive Breeze development.


In another note, I also really like Scalding's matrix API, so Scala-ish wrappers 
for Mahout would be cool - another pet project of mine is a port of that API to 
Spark too :)


N



—
Sent from Mailbox for iPhone

On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix jake.man...@gmail.com
wrote:

 Yeah, I'm totally on board with a pretty scala DSL on top of some of our
 stuff.
 In particular, I've been experimenting with with wrapping the
 DistributedRowMatrix
 in a scalding wrapper, so we can do things like
 val matrixAsTypedPipe =
DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
 path, conf))
 // e.g. L1 normalize:
   matrixAsTypedPipe.map((idx, v) : (Int, Vector) = (idx, v.normalize(1)) )
  .write(new
 DistributedRowMatrixPipe(outputPath, conf))
 // and anything else you would want to do with a scalding TypedPipe[Int,
 Vector]
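
 The L1-normalize step reads roughly like this when modeled on a plain
 in-memory collection — a stand-in for Scalding's distributed TypedPipe;
 the pipe type and names below are hypothetical, not the actual wrapper:

 ```scala
 object PipeDemo {
   // A "pipe" of (rowIndex, rowVector) pairs; a real Scalding
   // TypedPipe[(Int, Vector)] would be distributed, not a Seq.
   type RowPipe = Seq[(Int, Seq[Double])]

   // L1-normalize each row: divide by the sum of absolute values.
   def normalizeL1(pipe: RowPipe): RowPipe =
     pipe.map { case (idx, v) =>
       val norm = v.map(math.abs).sum
       (idx, if (norm == 0.0) v else v.map(_ / norm))
     }
 }
 ```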
 Currently I've been doing this with a package structure directly in Mahout,
 in:
mahout/contrib/scalding
 What do people think about having this be something real, after 0.8 goes
 out?  Are
 we ready for contrib modules which fold in diverse external projects in new
 ways?
 Integrating directly with pig and scalding is a bit too wide of a tent for
 Mahout core,
 but putting these integrations in entirely new projects is maybe a bit too
 far away.
 On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 Dmitriy,

 This is very pretty.




 On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Ok, so i was fairly easily able to build some DSL for our matrix
  manipulation (similar to breeze) in scala:
 
  inline matrix or vector:
 
  val  a = dense((1, 2, 3), (3, 4, 5))
 
  val b:Vector = (1,2,3)
 
  block views and assignments (element/row/vector/block/block of row or
  vector)
 
 
  a(::, 0)
  a(1, ::)
  a(0 to 1, 1 to 2)
 
  assignments
 
  a(0, ::) :=(3, 5, 7)
  a(0, 0 to 1) :=(3, 5)
  a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
 
  operators
 
  // hadamard
  val c = a * b
   a *= b
 
  // matrix mul
   val m = a %*% b
 
  and bunch of other little things like sum, mean, colMeans etc. That much
 is
  easy.
 
  Also stuff like the ones found in breeze along the lines
 
  val (u,v,s) = svd(a)
 
  diag ((1,2,3))
 
  and Cholesky in similar ways.
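
  A rough, self-contained sketch of how such sugar can be layered over a
  plain Java-style matrix class via Scala implicits — the `Mat` class
  below is a toy stand-in, not Mahout's Matrix:

  ```scala
  // Toy stand-in for a Java-style matrix class (NOT Mahout's Matrix).
  class Mat(val data: Array[Array[Double]]) {
    def get(i: Int, j: Int): Double = data(i)(j)
    def set(i: Int, j: Int, v: Double): Unit = data(i)(j) = v
    def numRows: Int = data.length
    def numCols: Int = data(0).length
  }

  object MatDsl {
    // Inline constructor in the spirit of dense((1, 2, 3), (3, 4, 5)).
    def dense(rows: Seq[Double]*): Mat = new Mat(rows.map(_.toArray).toArray)

    implicit class MatOps(val m: Mat) extends AnyVal {
      // Element sugar: a(i, j) reads, a(i, j) = v writes.
      def apply(i: Int, j: Int): Double = m.get(i, j)
      def update(i: Int, j: Int, v: Double): Unit = m.set(i, j, v)

      // Hadamard (element-wise) product, spelled `a * b` in the DSL.
      def *(o: Mat): Mat =
        new Mat(Array.tabulate(m.numRows, m.numCols)((i, j) =>
          m.get(i, j) * o.get(i, j)))

      // Matrix multiplication, spelled `a %*% b` in the DSL.
      def %*%(o: Mat): Mat = {
        val out = Array.ofDim[Double](m.numRows, o.numCols)
        for (i <- 0 until m.numRows; j <- 0 until o.numCols;
             k <- 0 until m.numCols)
          out(i)(j) += m.get(i, k) * o.get(k, j)
        new Mat(out)
      }
    }
  }
  ```

  Keeping the operators in an implicit wrapper means the underlying Java
  class is untouched — the sugar really is just sugar, as described above.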
 

Development guide (ICFOSS)

2013-06-24 Thread Samiran Raj Boro
Hi,
I am Samiran. I participated in 3 day local workshop at ICFOSS (
http://community.apache.org/mentoringprogramme-icfoss-pilot.html). I am
looking forward to contribute to Mahout project.

I am a Java beginner and learning fast. My interest domain is data mining
and I am familiar with clustering algorithms. I was checking
https://issues.apache.org/jira/browse/MAHOUT-1177. Let me know if somebody
is already working on it. Also please suggest if I need to pay special
attention to something. It would be great for me if you could point me,
some bug or enhancement in jira (for beginners) so that I can have some
hands on practice and understand the code base.

Regards,
Samiran


Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Dmitriy Lyubimov
On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath nick.pentre...@gmail.comwrote:

 That looks great Dmitry!


 The thing about Breeze that drives the complexity in it is partly
 specialization for Float, Double and Int matrices, and partly getting the
 syntax to just work for all combinations of matrix types and operands
 etc. mostly it does just work but occasionally not.

yes i noticed that, but since i am wrapping Mahout matrices, there's only a
choice of double-filled matrices and vectors. Actually, i would argue
that's the way it is supposed to be in the interest of KISS principle. I am
not sure i see a value in int matrices for any problem i ever worked on,
and skipping on precision to save the space is even more far-fetched notion
as in real life numbers don't take as much space as their pre-vectorized
features and annotations. In fact. model training parts and linear algebra
are not where memory bottleneck seems to fat-up at all in my experience.
There's often exponentially growing cpu-bound behavior, yes, but not RAM.





 I am surprised that dense * sparse matrix doesn't work but I guess as I
 previously mentioned the sparse matrix support is a bit shaky.

This is solely based on eye-balling the trait architecture. I did not
actually attempt it. But there's no single unifying trait for sure.




Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Ted Dunning
I think that contrib modules would be very interesting.  Specifically, good
Scala DSL, pig integration and so on.



Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Nick Pentreath
You're right on that - so far doubles is all I've needed and all I can 
currently see needing. 


I'll take a look at your project and see how easy it is to integrate with my 
Spark ALS and other code - syntax wise it looks almost the same so swapping out 
the linear algebra backend would be quite trivial in theory.


So far I've a working implementation of both implicit and explicit ALS versions 
that matches Mahout in RMSE given same parameters on the 3 movielens data sets. 
Still some work to do and more testing at scale, plus framework stuff. But 
hopefully I'd like to open source this at some point (but the Spark guys have a 
few projects upcoming so I'm also waiting a bit to see what happens there as it 
may end up duplicating a lot of what they're doing).

—
Sent from Mailbox for iPhone


Re: Mahout vectors/matrices/solvers on spark

2013-06-24 Thread Dmitriy Lyubimov
Well one fundamental step to get there in the Mahout realm, the way i see it,
is to create DSLs for Mahout's DRMs in Spark. That's actually one of the
other reasons i chose not to follow Breeze. When we unwind Mahout DRMs, we
may see sparse or dense slices there with named vectors. To translate that
into Breeze blocks would be a problem (and annotations/named vector
treatment is yet another problem i guess).



Jenkins build is back to normal : Mahout-Quality #2103

2013-06-24 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/2103/



Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-24 Thread Bhaskar Mookerji
Hi!

Is the Google hangouts dev session tomorrow/Tuesday still happening?

Lurkingly,
Buro Mookerji


On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.orgwrote:

 It seems to be that 6 pm ET is the consensus time for the majority of
 people, although my having screwed up the poll didn't help.

 Bi-weekly is the other consensus.  It also looks like Tuesday or Thursday
 are the preferred dates.

 I can't make next week, so I'm going to propose we kick off on Tuesday,
 June 25 at 6 pm.  That will give us time to dry-run the Google Hangouts,
 etc.

 Again, just to be clear, the goal here is to work on the development of
 Mahout, not to answer questions about how to run Mahout (we could do that
 separately if there is a desire.)

 I'll send out a reminder as we get closer.

 -Grant


 On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  I am from Northern Virginia, how many of us here are from the Washington
 DC Metro area?
 
 
 
 
  
  From: Jake Mannix jake.man...@gmail.com
  To: dev@mahout.apache.org dev@mahout.apache.org
  Sent: Wednesday, June 12, 2013 1:56 PM
  Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 
 
  Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon
  when I get back from europe at the end of the summer!
 
 
  On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman 
  andrew.mussel...@gmail.com wrote:
 
  Bi-weekly is good for me; I'm in Seattle and just filled out the poll.
 
  Great idea!
 
 
  On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com
  wrote:
 
  +1, am in Seattle as well and would love to attend and be involved.
 
  Sent from my iPhone
 
  On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
  wrote:
 
  Good idea on recurring meetings. I'm very interested in participating.
  Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.
 
  An agenda for the meetings ahead of time will help us get the most of
  our
  time at the meetings.
 
  Thanks.
  On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org
  wrote:
 
 
  On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
  Angel and Suneel, you may want to re-fill out the new doodle.
 
  FYI, this week won't be representative of my schedule; I'm in the
  last
  few weeks of a job at ORNL where I travel every weekend. Normally
 I'll
  have
  more flexibility than just 6pm on weeknights.
 
  Yeah, Doodle makes you pick dates, but I just want it to be representative
  of a week-long period of time and not tied to a specific set of dates. So,
  just put in what your ideal times are in general and ignore the fact
  that
  it is set to next week.
 
 
  On 6/12/13 8:26 AM, Grant Ingersoll wrote:
  On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu
  wrote:
 
  +1, awesome idea
 
  One question: the poll, while set to GMT -5, does say it's in
  Central
  Time. Is this a daylight savings thing?
  I turned on Time Zone support, so not sure how it will look to
  others,
  but it sounds like it adjusts based on your location...  I see: 8 am,
  10,
  1, so on.
 
  I also realize, that I messed it up.  I meant 9 pm, not 9 am.
 
  Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
 
  
  Grant Ingersoll | @gsingers
  http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 
 
  --
 
-jake

 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com








Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-24 Thread Suneel Marthi
Not sure, but if we are having it I think we should focus on what's left for 
the 0.8 release.





 From: Bhaskar Mookerji mooke...@spin-one.org
To: dev@mahout.apache.org 
Cc: Suneel Marthi suneel_mar...@yahoo.com 
Sent: Monday, June 24, 2013 6:35 PM
Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 

Hi!

Is the Google hangouts dev session tomorrow/Tuesday still happening?

Lurkingly,
Buro Mookerji


On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.orgwrote:

 It seems to be that 6 pm ET is the consensus time for the majority of
 people, although my having screwed up the poll didn't help.

 Bi-weekly is the other consensus.  It also looks like Tuesday or Thursday
 are the preferred dates.

 I can't make next week, so I'm going to propose we kick off on Tuesday,
 June 25 at 6 pm.  That will give us time to dry-run the Google Hangouts,
 etc.

 Again, just to be clear, the goal here is to work on the development of
 Mahout, not to answer questions about how to run Mahout (we could do that
 separately if there is a desire.)

 I'll send out a reminder as we get closer.

 -Grant


 On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  I am from Northern Virginia, how many of us here are from the Washington
 DC Metro area?
 
 
 
 
  
  From: Jake Mannix jake.man...@gmail.com
  To: dev@mahout.apache.org dev@mahout.apache.org
  Sent: Wednesday, June 12, 2013 1:56 PM
  Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 
 
  Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon
  when I get back from europe at the end of the summer!
 
 
  On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman 
  andrew.mussel...@gmail.com wrote:
 
  Bi-weekly is good for me; I'm in Seattle and just filled out the poll.
 
  Great idea!
 
 
  On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com
  wrote:
 
  +1, am in Seattle as well and would love to attend and be involved.
 
  Sent from my iPhone
 
  On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
  wrote:
 
   Good idea on recurring meetings. I'm very interested in participating.
  Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.
 
  An agenda for the meetings ahead of time will help us get the most of
  our
  time at the meetings.
 
  Thanks.
  On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org
  wrote:
 
 
  On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
  Angel and Suneel, you may want to re-fill out the new doodle.
 
  FYI, this week won't be representative of my schedule; I'm in the
  last
  few weeks of a job at ORNL where I travel every weekend. Normally
 I'll
  have
  more flexibility than just 6pm on weeknights.
 
  Yeah, Doodle makes you pick dates, but I just want it to be
  representative
  a week long period of time and not tied to a specific set of dates.
    So,
  just put in what your ideal times are in general and ignore the fact
  that
  it is set to next week.
 
 
  On 6/12/13 8:26 AM, Grant Ingersoll wrote:
  On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu
  wrote:
 
  +1, awesome idea
 
  One question: the poll, while set to GMT -5, does say it's in
  Central
  Time. Is this a daylight savings thing?
  I turned on Time Zone support, so not sure how it will look to
  others,
  but it sounds like it adjusts based on your location...  I see: 8 am,
  10,
  1, so on.
 
  I also realize, that I messed it up.  I meant 9 pm, not 9 am.
 
  Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
 
  
  Grant Ingersoll | @gsingers
  http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 
 
  --
 
    -jake

 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com







Build failed in Jenkins: mahout-nightly » Mahout Integration #1272

2013-06-24 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/1272/changes

Changes:

[smarthi] MAHOUT-944: lucene2seq - more code cleanup, removed unused imports

[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - fixed 
issue with not reading a directory list

[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - first 
round of Code cleanup based on feedback from code review

[smarthi] MAHOUT-944:lucene2seq - removed unused import

--
[INFO] 
[INFO] 
[INFO] Building Mahout Integration 0.8-SNAPSHOT
[INFO] 
[INFO] [INFO] Deleting 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target

[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-integration 
---
[INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
mahout-integration ---
[INFO] Copying 0 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
mahout-integration ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 131 source files to 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target/classes
[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[WARNING] Note: 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java
 uses unchecked or unsafe operations.
[WARNING] Note: Recompile with -Xlint:unchecked for details.
[INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
mahout-integration ---
[INFO] Copying 10 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
mahout-integration ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 39 source files to 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target/test-classes
[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ 
mahout-integration ---
[INFO] Surefire report directory: 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target/surefire-reports

---
 T E S T S
---

parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.utils.nlp.collocations.llr.BloomTokenFilterTest
Running org.apache.mahout.utils.vectors.arff.ARFFTypeTest

Build failed in Jenkins: mahout-nightly #1272

2013-06-24 Thread Apache Jenkins Server
See https://builds.apache.org/job/mahout-nightly/1272/changes

Changes:

[smarthi] MAHOUT-944: lucene2seq - more code cleanup, removed unused imports

[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - fixed 
issue with not reading a directory list

[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - first 
round of Code cleanup based on feedback from code review

[smarthi] MAHOUT-944:lucene2seq - removed unused import

--
[...truncated 2088 lines...]
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml
 (344 B at 2.0 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-tests.jar
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-tests.jar
 (2436 KB at 9055.6 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml
 (2 KB at 20.0 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-job.jar
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-job.jar
 (19450 KB at 22381.7 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml
 (2 KB at 17.5 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-sources.jar
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-sources.jar
 (1149 KB at 3587.6 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml
 (2 KB at 20.3 KB/sec)
[INFO] 
[INFO] 
[INFO] Building Mahout Integration 0.8-SNAPSHOT
[INFO] 
[INFO] [INFO] Deleting 
https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target

[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-integration 
---
[INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
mahout-integration ---
[INFO] Copying 0 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
mahout-integration ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 131 source files to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target/classes
[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[WARNING] Note: 
https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java
 uses unchecked or unsafe operations.
[WARNING] Note: Recompile with -Xlint:unchecked for details.
[INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
mahout-integration ---
[INFO] Copying 10 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
mahout-integration ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 39 source files to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target/test-classes
[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ 
mahout-integration ---
[INFO] Surefire report directory: 
https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target/surefire-reports

---
 T E S T S
---


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692518#comment-13692518
 ] 

Robin Anil commented on MAHOUT-1214:


https://reviews.apache.org/r/11931/

I have actually replied to your comments. My comment still stands with respect 
to using a non-standard input format. Grant, can you take a look as well?

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 the eigenvectors, which is necessary to obtain correct clustering results for 
 the case of K > 1. We have an idea and an implementation that selects based on 
 the cosine angle (orthogonality).
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 a bad initialization will sometimes generate a wrong clustering result. In this 
 case, the K selected eigenvectors actually provide a better way to initialize 
 the cluster centroids, because each selected eigenvector is a relaxed indicator 
 of membership in one cluster. For every selected eigenvector, we use the 
 data point whose eigen-component achieves the maximum absolute value. 
 We have already verified our improvement on a synthetic dataset, and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains a wrong result.
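
The two fixes described above can be sketched in plain Java. This is an illustrative sketch only, not Mahout's actual API: the class and method names are made up, and the real EigenVerificationJob and k-means driver work on Mahout Vector types rather than arrays.

```java
public class SpectralKMeansSketch {
    // cosine of the angle between two vectors
    public static double cosAngle(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Issue 1: accept a candidate eigenvector only if it is near-orthogonal
    // (|cos angle| <= eps) to every previously accepted eigenvector.
    public static boolean isOrthogonalToAll(double[] candidate, double[][] accepted, double eps) {
        for (double[] v : accepted) {
            if (Math.abs(cosAngle(candidate, v)) > eps) {
                return false;
            }
        }
        return true;
    }

    // Issue 2: seed one cluster with the data point whose component in the
    // corresponding eigenvector has the largest absolute value.
    public static int seedIndex(double[] eigenvector) {
        int best = 0;
        for (int i = 1; i < eigenvector.length; i++) {
            if (Math.abs(eigenvector[i]) > Math.abs(eigenvector[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] accepted = { {1, 0, 0} };
        System.out.println(isOrthogonalToAll(new double[]{0, 1, 0}, accepted, 0.05)); // true
        System.out.println(seedIndex(new double[]{0.1, -0.9, 0.3}));                  // 1
    }
}
```

The seeding rule uses each eigenvector as a relaxed cluster-membership indicator, so the point with the largest-magnitude component is the strongest member of that cluster and makes a reasonable centroid seed.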

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Yiqun Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692589#comment-13692589
 ] 

Yiqun Hu commented on MAHOUT-1214:
--

Hi, Robin,
We have also responded to your comments about why a new input format is used; 
please check our response on the review board. We introduce new support for 
spectral k-means in Mahout: we allow users to specify the affinity between data 
points using any data identity. We believe this support is valuable for Mahout 
users. Just imagine needing to specify pairwise affinities for petabytes of 
data; asking users to map each data point first and specify a row/column id is 
inconvenient.

We have responded to the comments and await further discussion. There are two 
options here. One, if there is a way to implement this support with the 
standard input format, please suggest it, because we thought it was impossible. 
Two, if you think this support is useless, we don't mind removing it and 
keeping it to ourselves. 

Again, we need discussion to move forward.
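
The mapping step under discussion can be sketched sequentially in plain Java. The class and method names here are hypothetical, not part of Mahout: each distinct string identity is assigned the next dense integer index, so an affinity triple keyed by arbitrary ids can be rewritten as a (row, column, value) entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AffinityIdMapper {
    // insertion-ordered dictionary: string identity -> dense integer index
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // returns the existing index for id, or assigns the next free one
    public int indexOf(String id) {
        Integer idx = ids.get(id);
        if (idx == null) {
            idx = ids.size();
            ids.put(id, idx);
        }
        return idx;
    }

    public int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        AffinityIdMapper mapper = new AffinityIdMapper();
        // an affinity triple ("user-a", "user-b", 0.8) becomes (0, 1, 0.8)
        System.out.println(mapper.indexOf("user-a")); // 0
        System.out.println(mapper.indexOf("user-b")); // 1
        System.out.println(mapper.indexOf("user-a")); // 0 again: ids are stable
    }
}
```

At petabyte scale this in-memory dictionary does not fit, which is the crux of the disagreement: a distributed version needs a deterministic global assignment (e.g. sort the distinct ids and number them), which is what a dedicated MapReduce pre-pass would provide.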

Sent from my iPhone






[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Yiqun Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692593#comment-13692593
 ] 

Yiqun Hu commented on MAHOUT-1214:
--

Robin, we just saw your response. Let us digest it and then respond.



Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-24 Thread Grant Ingersoll
I'd really like to, but had a trip come up.  If possible, can we push it by one 
week?  Otherwise, if others want to go forward, I can try to set things up and 
share it w/ others.

On Jun 24, 2013, at 6:35 PM, Bhaskar Mookerji mooke...@spin-one.org wrote:

 Hi!
 
 Is the Google hangouts dev session tomorrow/Tuesday still happening?
 
 Lurkingly,
 Buro Mookerji
 
 
 On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 It seems that 6 pm ET is the consensus time for the majority of
 people, although my having screwed up the poll didn't help.
 
 Bi-weekly is the other consensus.  It also looks like Tuesday or Thursday
 are the preferred dates.
 
 I can't make next week, so I'm going to propose we kick off on Tuesday,
 June 25 at 6 pm.  That will give us time to dry-run the Google Hangouts,
 etc.
 
 Again, just to be clear, the goal here is to work on the development of
 Mahout, not to answer questions about how to run Mahout (we could do that
 separately if there is a desire.)
 
 I'll send out a reminder as we get closer.
 
 -Grant
 
 
 On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
 I am from Northern Virginia, how many of us here are from the Washington
 DC Metro area?
 
 
 
 
 
 From: Jake Mannix jake.man...@gmail.com
 To: dev@mahout.apache.org dev@mahout.apache.org
 Sent: Wednesday, June 12, 2013 1:56 PM
 Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 
 
 Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon
 when I get back from europe at the end of the summer!
 
 
 On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Bi-weekly is good for me; I'm in Seattle and just filled out the poll.
 
 Great idea!
 
 
 On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com
 wrote:
 
 +1, am in Seattle as well and would love to attend and be involved.
 
 Sent from my iPhone
 
 On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
 wrote:
 
 Good idea on recurring meetings. I'm very interested in participating.
 Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.
 
 An agenda for the meetings ahead of time will help us get the most of
 our
 time at the meetings.
 
 Thanks.
 On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
 Angel and Suneel, you may want to re-fill out the new doodle.
 
 FYI, this week won't be representative of my schedule; I'm in the
 last
 few weeks of a job at ORNL where I travel every weekend. Normally
 I'll
 have
 more flexibility than just 6pm on weeknights.
 
 Yeah, Doodle makes you pick dates, but I just want it to be
 representative
 a week long period of time and not tied to a specific set of dates.
  So,
 just put in what your ideal times are in general and ignore the fact
 that
 it is set to next week.
 
 
 On 6/12/13 8:26 AM, Grant Ingersoll wrote:
 On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
 +1, awesome idea
 
 One question: the poll, while set to GMT -5, does say it's in
 Central
 Time. Is this a daylight savings thing?
 I turned on Time Zone support, so not sure how it will look to
 others,
 but it sounds like it adjusts based on your location...  I see: 8 am,
 10,
 1, so on.
 
 I also realize, that I messed it up.  I meant 9 pm, not 9 am.
 
 Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 
 
 --
 
  -jake
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Yiqun Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692620#comment-13692620
 ] 

Yiqun Hu commented on MAHOUT-1214:
--

Robin, I understand the philosophy of Mahout, but you said we could write a 
new MapReduce job to map string ids to row/column indices. From my 
understanding, that does not solve the issue: in the new MapReduce job we would 
still have to introduce the new input format, as we did here. Am I right?

Sent from my iPhone






Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272

2013-06-24 Thread Grant Ingersoll
Can someone w/ more Hadoop experience look at this?  We are getting:

java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit 
cannot be cast to org.apache.hadoop.mapred.InputSplit
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)

AFAICT, we are using the new APIs, but this seems to think it should be the old 
APIs. Note, this is an intermittent issue.  Sometimes it goes through just 
fine.  Locally, it passes for me.

Note, this could also be related to the Parallel tests stuff.

-Grant
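
The failure mode can be reproduced in miniature with stand-in types (these are illustrative classes, not the real Hadoop interfaces): the old org.apache.hadoop.mapred.InputSplit and the new org.apache.hadoop.mapreduce.InputSplit share a simple name but have no type relationship, so a split produced for the new API fails with a ClassCastException when an old-API runner casts it.

```java
public class SplitCastDemo {
    // stand-in for the old-API org.apache.hadoop.mapred.InputSplit
    interface OldApiSplit {}

    // a split written against the old API
    static class LegacySplit implements OldApiSplit {}

    // stand-in for the new-API org.apache.hadoop.mapreduce.InputSplit
    static class NewApiSplit {}

    // what an old-API runner (e.g. MapTask.runOldMapper) effectively does:
    // cast whatever split it is handed to the old interface
    public static boolean castsToOldApi(Object split) {
        try {
            OldApiSplit s = (OldApiSplit) split;
            return s != null;
        } catch (ClassCastException e) {
            return false; // the exception seen in the Jenkins log
        }
    }

    public static void main(String[] args) {
        System.out.println(castsToOldApi(new LegacySplit()));  // true
        System.out.println(castsToOldApi(new NewApiSplit()));  // false
    }
}
```

The intermittency would then come from which code path hands the split to the runner, not from the split class itself, which fits the observation that the test sometimes passes.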

On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server jenk...@builds.apache.org 
wrote:

 Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec  
 FAILURE!
 testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest)  
 Time elapsed: 1.268 sec   FAILURE!
 org.junit.ComparisonFailure: 
 expected:TEST/subdir/[mail-messages].gz/u...@example.com but 
 was:TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com
   at org.junit.Assert.assertEquals(Assert.java:115)
   at org.junit.Assert.assertEquals(Assert.java:144)
   at 
 org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108)


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272

2013-06-24 Thread Grant Ingersoll
Never mind the noise here, I misread this!

Still, we have some errors going on w/ random failures.

On Jun 24, 2013, at 8:33 PM, Grant Ingersoll gsing...@apache.org wrote:

 Can someone w/ more Hadoop experience look at this?  We are getting:
 
 java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit 
 cannot be cast to org.apache.hadoop.mapred.InputSplit
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
 
 AFAICT, we are using the new APIs, but this seems to think it should be the 
 old APIs. Note, this is an intermittent issue.  Sometimes it goes through 
 just fine.  Locally, it passes for me.
 
 Note, this could also be related to the Parallel tests stuff.
 
 -Grant
 
 On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server 
 jenk...@builds.apache.org wrote:
 
 Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec 
  FAILURE!
 testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest)  
 Time elapsed: 1.268 sec   FAILURE!
 org.junit.ComparisonFailure: 
 expected:TEST/subdir/[mail-messages].gz/u...@example.com but 
 was:TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com
  at org.junit.Assert.assertEquals(Assert.java:115)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at 
 org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108)
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-24 Thread Suneel Marthi
I am fine with pushing by a week. 





 From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org 
Cc: Suneel Marthi suneel_mar...@yahoo.com 
Sent: Monday, June 24, 2013 8:25 PM
Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 


I'd really like to, but had a trip come up.  If possible, can we push it by one 
week?  Otherwise, if others want to go forward, I can try to set things up and 
share it w/ others.


On Jun 24, 2013, at 6:35 PM, Bhaskar Mookerji mooke...@spin-one.org wrote:

Hi!

Is the Google hangouts dev session tomorrow/Tuesday still happening?

Lurkingly,
Buro Mookerji


On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.org wrote:


It seems that 6 pm ET is the consensus time for the majority of
people, although my having screwed up the poll didn't help.

Bi-weekly is the other consensus.  It also looks like Tuesday or Thursday
are the preferred dates.

I can't make next week, so I'm going to propose we kick off on Tuesday,
June 25 at 6 pm.  That will give us time to dry-run the Google Hangouts,
etc.

Again, just to be clear, the goal here is to work on the development of
Mahout, not to answer questions about how to run Mahout (we could do that
separately if there is a desire.)

I'll send out a reminder as we get closer.

-Grant


On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com
wrote:


I am from Northern Virginia, how many of us here are from the Washington
DC Metro area?






From: Jake Mannix jake.man...@gmail.com
To: dev@mahout.apache.org dev@mahout.apache.org
Sent: Wednesday, June 12, 2013 1:56 PM
Subject: Re: (Bi-)Weekly/Monthly Dev Sessions


Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon
when I get back from europe at the end of the summer!


On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:


Bi-weekly is good for me; I'm in Seattle and just filled out the poll.

Great idea!


On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com

wrote:


+1, am in Seattle as well and would love to attend and be involved.

Sent from my iPhone

On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
wrote:


Good idea on recurring meetings. I'm very interested in participating.
Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.

An agenda for the meetings ahead of time will help us get the most of
our

time at the meetings.

Thanks.
On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org
wrote:




On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu
wrote:



Angel and Suneel, you may want to re-fill out the new doodle.

FYI, this week won't be representative of my schedule; I'm in the
last

few weeks of a job at ORNL where I travel every weekend. Normally
I'll

have

more flexibility than just 6pm on weeknights.

Yeah, Doodle makes you pick dates, but I just want it to be
representative

a week long period of time and not tied to a specific set of dates.

 So,

just put in what your ideal times are in general and ignore the fact
that

it is set to next week.



On 6/12/13 8:26 AM, Grant Ingersoll wrote:

On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu
wrote:



+1, awesome idea

One question: the poll, while set to GMT -5, does say it's in
Central

Time. Is this a daylight savings thing?

I turned on Time Zone support, so I'm not sure how it will look to others,
but it sounds like it adjusts based on your location... I see: 8 am, 10,
1, and so on.


I also realize that I messed it up.  I meant 9 pm, not 9 am.

Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv


Grant Ingersoll | @gsingers
http://www.lucidworks.com











--

 -jake












[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: (was: MAHOUT-1214.patch)

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et 
 al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for an obvious, trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute 
 it back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results 
 for the case of K > 1. We have an idea and an implementation that selects 
 eigenvectors based on their cosine angle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will produce a wrong clustering result. In 
 this case, the selected K eigenvectors actually provide a better way to 
 initialize the cluster centroids, because each selected eigenvector is a 
 relaxed indicator of the membership of one cluster. For every selected 
 eigenvector, we use the data point whose eigen component achieves the 
 maximum absolute value. 
 We have already verified our improvement on a synthetic dataset, and it 
 shows that the improved version obtains the optimal clustering result while 
 the current 0.7 version obtains a wrong result.
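
 The two fixes described above can be sketched roughly as follows. This is a 
 minimal pure-Python illustration on made-up toy data, not the actual Mahout 
 patch; the names select_orthogonal and init_centroids and the 0.1 cosine 
 threshold are hypothetical choices for the sketch:

```python
def cos_angle(u, v):
    """Cosine of the angle between two vectors; near 0 means near-orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def select_orthogonal(eigenvectors, threshold=0.1):
    """Issue 1: keep an eigenvector only if it is near-orthogonal
    to every eigenvector already kept."""
    kept = []
    for ev in eigenvectors:
        if all(abs(cos_angle(ev, k)) < threshold for k in kept):
            kept.append(ev)
    return kept

def init_centroids(points, eigenvectors):
    """Issue 2: for each selected eigenvector, seed one centroid with the
    data point whose eigen component has the maximum absolute value."""
    centroids = []
    for ev in eigenvectors:  # ev[i] is a relaxed membership score for point i
        best = max(range(len(ev)), key=lambda i: abs(ev[i]))
        centroids.append(points[best])
    return centroids

# Toy example: four 2-D points forming two obvious clusters.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
eigs = [[0.9, 0.8, -0.01, 0.02],   # indicator-like vector for cluster 1
        [0.01, -0.02, 0.85, 0.9]]  # indicator-like vector for cluster 2
selected = select_orthogonal(eigs)
print(init_centroids(points, selected))  # -> [(0.0, 0.1), (5.1, 5.0)]
```

 Seeding from the eigenvectors this way makes the subsequent KMeans run 
 deterministic for a given eigendecomposition, which is what avoids the bad 
 random initializations described in Issue 2.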

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAHOUT-1268) Wrong output directory for CVB

2013-06-24 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1268:
--

 Summary: Wrong output directory for CVB
 Key: MAHOUT-1268
 URL: https://issues.apache.org/jira/browse/MAHOUT-1268
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 0.8


I think that I introduced a bug in MAHOUT-1262 by accidentally writing to the 
wrong output dir (as reported by Mark Wicks on the mailinglist).





[jira] [Updated] (MAHOUT-1268) Wrong output directory for CVB

2013-06-24 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1268:
---

Attachment: MAHOUT-1268.patch

 Wrong output directory for CVB
 --

 Key: MAHOUT-1268
 URL: https://issues.apache.org/jira/browse/MAHOUT-1268
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 0.8

 Attachments: MAHOUT-1268.patch


 I think that I introduced a bug in MAHOUT-1262 by accidentally writing to the 
 wrong output dir (as reported by Mark Wicks on the mailinglist).



[jira] [Commented] (MAHOUT-1268) Wrong output directory for CVB

2013-06-24 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692745#comment-13692745
 ] 

Jake Mannix commented on MAHOUT-1268:
-

Has this been tested with cluster_reuters.sh? If so, +1 to get this in ASAP.

 Wrong output directory for CVB
 --

 Key: MAHOUT-1268
 URL: https://issues.apache.org/jira/browse/MAHOUT-1268
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 0.8

 Attachments: MAHOUT-1268.patch


 I think that I introduced a bug in MAHOUT-1262 by accidentally writing to the 
 wrong output dir (as reported by Mark Wicks on the mailinglist).



[jira] [Commented] (MAHOUT-1268) Wrong output directory for CVB

2013-06-24 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692749#comment-13692749
 ] 

Suneel Marthi commented on MAHOUT-1268:
---

[~jake.mannix] testing cluster_reuters.sh now as I am typing this. But this 
should fix the issue.

 Wrong output directory for CVB
 --

 Key: MAHOUT-1268
 URL: https://issues.apache.org/jira/browse/MAHOUT-1268
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 0.8

 Attachments: MAHOUT-1268.patch


 I think that I introduced a bug in MAHOUT-1262 by accidentally writing to the 
 wrong output dir (as reported by Mark Wicks on the mailinglist).



[jira] [Commented] (MAHOUT-1268) Wrong output directory for CVB

2013-06-24 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692753#comment-13692753
 ] 

Suneel Marthi commented on MAHOUT-1268:
---

[~ssc] Please commit this; I applied the patch and tested CVB with 
cluster_reuters.sh, and we are good now.

 Wrong output directory for CVB
 --

 Key: MAHOUT-1268
 URL: https://issues.apache.org/jira/browse/MAHOUT-1268
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 0.8

 Attachments: MAHOUT-1268.patch


 I think that I introduced a bug in MAHOUT-1262 by accidentally writing to the 
 wrong output dir (as reported by Mark Wicks on the mailinglist).
