Problems with mapBlock()
I've updated the codebase to work on the cooccurrence analysis algo, but I always run into this error now:

error: value mapBlock is not a member of org.apache.mahout.math.drm.DrmLike[Int]

I have the feeling that an implicit conversion might be missing, but I couldn't figure out where to put it without producing even more errors.

--sebastian
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014918#comment-14014918 ]

Sebastian Schelter commented on MAHOUT-1566:
--------------------------------------------

If it's a mere showcase, could we maybe add it as an example in an example package, not a full-fledged algorithm implementation somehow?

> Regular ALS factorizer with convergence test.
> ---------------------------------------------
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.9
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with unweighted, unregularized implementation.

--
This message was sent by Atlassian JIRA (v6.2#6252)
Re: mlib versus spark
Hi Saikat,

The differences are that MLlib offers a different set of algorithms (e.g. you won't find cooccurrence analysis or stochastic SVD) and that their codebase consists of hand-tuned, Spark-specific implementations. Mahout, on the other hand, allows you to implement algorithms in an engine-agnostic, declarative way. This allows for the automatic optimization of our algorithms as well as for running the same code on multiple backends (there has been interest from H2O as well as Apache Flink in integrating with our DSL).

--sebastian
RE: mlib versus spark
Actually the subject of my email should say spark->mlib versus mahout->spark :)
mlib versus spark
Ok, I'll admit I'm not seeing what the obvious differences are. I'm a bit confused when I think of Mahout using Spark: since Spark already ships an embedded machine learning library (MLlib), what would be the impetus to use Mahout instead? It seems like you should be able to write or add algorithms to MLlib and use Spark. Has someone from Mahout looked at MLlib to see if there will be a strong use case for using one versus the other?

http://spark.apache.org/mllib/
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014856#comment-14014856 ]

Suneel Marthi commented on MAHOUT-1505:
---------------------------------------

Let's leave Canopy for now then, the tests will be automatically removed once Canopy driver is removed

> structure of clusterdump's JSON output
> --------------------------------------
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Reporter: Terry Blankers
> Assignee: Andrew Musselman
> Labels: json
> Fix For: 1.0
>
> Attachments: MAHOUT-1505-canopy-removal.patch, MAHOUT-1505.patch
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output
> using '-of = JSON'. While digging into the structure of the representation of
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c'
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223,
> administration:0.011 r=[action:0.446, adherence:1.501,
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904,
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data
> in these items is by string parsing. For JSON deserialization I would have
> expected to see something along the lines of:
> {noformat}
> {
>     "cluster":"VL-10515",
>     "n":5924,
>     "c":
>     [
>         {"action":0.023},
>         {"adherence":0.223},
>         {"administration":0.011}
>     ],
>     "r":
>     [
>         {"action":0.446},
>         {"adherence":1.501},
>         {"administration":0.306}
>     ]
> }
> {noformat}
> and:
> {noformat}
> {
>     "point": {
>         "body": 6.904,
>         "harm": 10.101
>     },
>     "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
>     "weight": 1.0
> }
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
>     "cluster":"VL-10515",
>     "n":5924,
>     "c":
>     {
>         "action":0.023,
>         "adherence":0.223,
>         "administration":0.011
>     },
>     "r":
>     {
>         "action":0.446,
>         "adherence":1.501,
>         "administration":0.306
>     }
> }
> {noformat}
> {quote}
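To make the difference described in the issue concrete, here is a minimal sketch of how a consumer might read the two shapes. The structured record follows the object layout suggested in the quoted reply; the legacy record is abbreviated for illustration, and the variable names are hypothetical. It only uses Python's standard json module and says nothing about how clusterdump itself produces the output.

```python
import json

# Structured record, following the layout suggested in the issue
# (the "c" and "r" values are plain objects keyed by term).
structured = """
{
  "cluster": "VL-10515",
  "n": 5924,
  "c": {"action": 0.023, "adherence": 0.223, "administration": 0.011},
  "r": {"action": 0.446, "adherence": 1.501, "administration": 0.306}
}
"""

record = json.loads(structured)

# Every field is directly addressable and already has a native type.
cluster_id = record["cluster"]
num_points = record["n"]
centroid_weight = record["c"]["adherence"]
radius_weight = record["r"]["adherence"]

# Abbreviated version of the reported output: all values are packed into
# one string, so deserialization only yields an opaque string that still
# needs custom parsing.
legacy = json.loads('{"cluster": "VL-10515{n=5924 c=[action:0.023, adherence:0.223]}"}')
legacy_value = legacy["cluster"]

print(cluster_id, num_points, centroid_weight, radius_weight)
print(type(legacy_value).__name__)
```

With the structured form, a client never needs to know the internal `{n=... c=[...] r=[...]}` string syntax, which is the point the reporter raises.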
[jira] [Commented] (MAHOUT-1567) Add online sparse dictionary learning (dimensionality reduction)
[ https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014851#comment-14014851 ]

Ted Dunning commented on MAHOUT-1567:
-------------------------------------

Mairal's method is one of the absolute standards for this algorithm and sparse coding is an important algorithm.

Maciej,

Can you attempt to port to the new math DSL that Mahout is building? That will be very instructive for the development of the DSL as well.

> Add online sparse dictionary learning (dimensionality reduction)
> ----------------------------------------------------------------
>
> Key: MAHOUT-1567
> URL: https://issues.apache.org/jira/browse/MAHOUT-1567
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Reporter: Maciej Kula
>
> I have recently implemented a sparse online dictionary learning algorithm,
> with an emphasis on learning very high-dimensional and very sparse
> dictionaries. It is based on J. Mairal et al 'Online Dictionary Learning for
> Sparse Coding' (http://www.di.ens.fr/willow/pdfs/icml09.pdf). It's an online
> variant of low-rank matrix factorization, suitable for sparse binary matrices
> (such as implicit feedback matrices).
> I would be very happy to bring this up to the Mahout standard and contribute
> to the main codebase --- is this something you would in principle be
> interested in having?
> The code (as well as some examples) are here:
> https://github.com/maciejkula/dictionarylearning
[jira] [Updated] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Musselman updated MAHOUT-1505:
-------------------------------------
    Attachment: MAHOUT-1505-canopy-removal.patch

Fixes suggested by Suneel above. Note Canopy and CanopyDriver are still used in TestClusterEvaluation and other tests; how far do we want to go on this before removing Canopy entirely?
[jira] [Comment Edited] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014777#comment-14014777 ]

Suneel Marthi edited comment on MAHOUT-1505 at 5/31/14 9:45 PM:
----------------------------------------------------------------

Reviewed the patch; there are a few places where the code needs some cleanup:

1. Since Canopy Clustering has now been marked as @deprecated, please remove all references and unit tests for Canopy Clustering in TestClusterInterface, TestClusterDumper and TestClusterClassifier.
2. AbstractCluster.java, method formatVectorAsJson() - What are we doing with vectorTerms? I don't see that being used anywhere. If vectorTerms is not needed, we may also not need the static class TermIndexWeight?
3. Cluster.java - remove the following variable as it's never used anywhere in the code:
String CLUSTERED_POINTS_DIR = "clusteredPoints";

was (Author: smarthi):
Reviewing the patch and few places where the code needs some cleanup:- 1. Since Canopy Clustering has been now marked as @deprecated, please remove all references and unit tests for Canopy Clustering in TestClusterInterface, TestClusterDumper and TestClusterClassifier. 2. AbstractClusrer.java Method formatVectorAsJson() - What are we doing with vectorTerms ? If u don't need vectorTerms, then do we really need the static class TermIndexWeight ?
[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents
[ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014827#comment-14014827 ]

Andrew Palumbo commented on MAHOUT-1564:
----------------------------------------

Thanks for looking at this [~ssc]. I was worried about potential conflicts with 1252, but it's really just a minor addition to the existing implementation, doesn't require any changes and can always come out. Haven't had a chance to do much with it yet, will probably work on it later this week.

> Naive Bayes Classifier for New Text Documents
> ---------------------------------------------
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.9
> Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> MapReduce Naive Bayes implementation currently lacks the ability to classify
> a new document (outside of the training/holdout corpus). I've begun some
> work on a "ClassifyNew" job which will do the following:
> 1. Vectorize a new text document using the dictionary and document
> frequencies from the training/holdout corpus
> - assume the original corpus was vectorized using `seq2sparse`; step (1)
> will use all of the same parameters.
> 2. Score and label a new document using a previously trained model.
> I think that it will be a useful addition to the NB package. Unfortunately,
> this is going to be mostly MR workhorse code and doesn't really introduce
> much new logic. I will try to keep any new logic separate from MR code so
> that it can be called from scala for MAHOUT-1493.
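The two steps in the issue description (vectorize against the existing dictionary and document frequencies, then score with a trained model) can be sketched roughly as follows. This is a simplified in-memory illustration, not the proposed "ClassifyNew" job: the function names, the plain tf * log(N / df) weighting, the omission of class priors, and all numbers in the toy data are assumptions made for the example and may differ from what `seq2sparse` and the Mahout Naive Bayes model actually do.

```python
import math

def vectorize(tokens, dictionary, doc_freq, num_docs):
    """Build a TF-IDF vector for a new document using only the training dictionary."""
    counts = {}
    for token in tokens:
        if token in dictionary:  # terms unseen during training are dropped
            idx = dictionary[token]
            counts[idx] = counts.get(idx, 0) + 1
    # Simplified weighting: term frequency times log inverse document frequency.
    return {idx: tf * math.log(num_docs / doc_freq[idx]) for idx, tf in counts.items()}

def classify(vector, log_weights):
    """Return the label with the highest sum of weighted per-term log scores."""
    scores = {
        label: sum(weight * weights.get(idx, 0.0) for idx, weight in vector.items())
        for label, weights in log_weights.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical artifacts that would come from the training/holdout corpus.
dictionary = {"goal": 0, "vote": 1}
doc_freq = {0: 10, 1: 20}
num_docs = 100
log_weights = {
    "sports": {0: -1.0, 1: -3.0},
    "politics": {0: -3.0, 1: -1.0},
}

vector = vectorize("goal goal vote unknown".split(), dictionary, doc_freq, num_docs)
label, scores = classify(vector, log_weights)
print(label, scores)
```

Keeping `vectorize` and `classify` free of any job-framework code mirrors the intent stated in the issue of separating the new logic from the MR plumbing.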
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014826#comment-14014826 ]

Andrew Musselman commented on MAHOUT-1505:
------------------------------------------

Good catches, thanks, will fix.
[jira] [Updated] (MAHOUT-1567) Add online sparse dictionary learning (dimensionality reduction)
[ https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Kula updated MAHOUT-1567:
-------------------------------
    Summary: Add online sparse dictionary learning (dimensionality reduction)  (was: Add online sparse dictionary learning)
[jira] [Commented] (MAHOUT-1567) Add online sparse dictionary learning
[ https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014804#comment-14014804 ]

Maciej Kula commented on MAHOUT-1567:
-------------------------------------

Correct. It is effectively an online dimensionality reduction technique. We start with a set of n vectors (datapoints, rows in a matrix), and then look for a smaller number of vectors m such that the original vectors can be well approximated by linear combinations of the m new vectors. In the parlance of dictionary learning, the m vectors are called dictionary atoms, and the linear combinations are called (sparse) codes.

The idea of learning a dictionary is that, instead of picking a predefined set of vectors, we fit the dictionary vectors to the data using an SGD-like process.

I can amend the issue title to reflect this a little bit better?
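As a rough illustration of the terms in the comment above, the sketch below reconstructs one datapoint as a linear combination of dictionary atoms and applies a single gradient step to the atoms. It is a toy under stated assumptions: the code is held fixed and there is no sparsity penalty, whereas the referenced method also solves for sparse codes; the function names, learning rate and numbers are made up for the example and are not taken from the linked repository.

```python
def reconstruct(code, atoms):
    """Approximate a datapoint as a linear combination of the dictionary atoms."""
    dim = len(atoms[0])
    return [sum(c * atom[j] for c, atom in zip(code, atoms)) for j in range(dim)]

def squared_error(x, code, atoms):
    """Squared reconstruction error of datapoint x under the given code."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, reconstruct(code, atoms)))

def sgd_step(x, code, atoms, learning_rate=0.1):
    """One gradient step on the atoms for a single datapoint (code held fixed).

    For the loss 0.5 * ||x - sum_k code[k] * atoms[k]||^2 the gradient with
    respect to atom k is -code[k] * residual, so descent adds the residual
    scaled by the code entry.
    """
    approx = reconstruct(code, atoms)
    residual = [xi - ai for xi, ai in zip(x, approx)]
    return [
        [a + learning_rate * c * r for a, r in zip(atom, residual)]
        for c, atom in zip(code, atoms)
    ]

# m = 2 atoms in 3 dimensions, one datapoint and its (fixed) code.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [2.0, 1.0, 0.5]
code = [2.0, 1.0]

error_before = squared_error(x, code, atoms)
atoms = sgd_step(x, code, atoms)
error_after = squared_error(x, code, atoms)
print(error_before, error_after)
```

The atoms move toward the part of the datapoint they could not represent, which is the sense in which the dictionary is fitted to the data instead of being predefined.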
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014788#comment-14014788 ]

Hudson commented on MAHOUT-1505:
--------------------------------

SUCCESS: Integrated in Mahout-Quality #2625 (See [https://builds.apache.org/job/Mahout-Quality/2625/])
MAHOUT-1505: structure of clusterdump's JSON output (akm) (andrew.musselman: rev e0751eaaca4bb06adcf80108d85ad0dc6cc67074)
* mrlegacy/src/main/java/org/apache/mahout/clustering/AbstractCluster.java
* mrlegacy/src/main/java/org/apache/mahout/clustering/Cluster.java
* integration/src/main/java/org/apache/mahout/utils/clustering/JsonClusterWriter.java
* mrlegacy/src/test/java/org/apache/mahout/clustering/iterator/TestClusterClassifier.java
* mrlegacy/src/test/java/org/apache/mahout/clustering/TestClusterInterface.java
* CHANGELOG
* integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
MAHOUT-1505: structure of clusterdump's JSON output (akm) This closes #5 (akm: rev 1d1134ee941c6a06804fdb5fef536e3ed95ed36e)
* CHANGELOG
[jira] [Commented] (MAHOUT-1567) Add online sparse dictionary learning
[ https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014782#comment-14014782 ]

Ted Dunning commented on MAHOUT-1567:
-------------------------------------

Maciej,

Can you say a bit more about this? I think that people might be confused here by the use of the word dictionary. I think that you mean dictionary in the sense of sparse coding, which is a common first step for deep learning, as opposed to building a lookup table for converting strings to integers.

Is this correct?
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014777#comment-14014777 ]

Suneel Marthi commented on MAHOUT-1505:
---------------------------------------

Reviewed the patch; there are a few places where the code needs some cleanup:

1. Since Canopy Clustering has now been marked as @deprecated, please remove all references and unit tests for Canopy Clustering in TestClusterInterface, TestClusterDumper and TestClusterClassifier.
2. AbstractCluster.java, method formatVectorAsJson() - What are we doing with vectorTerms? If you don't need vectorTerms, then do we really need the static class TermIndexWeight?
Re: Apache group on github
Awesome, thanks.

On Sat, May 31, 2014 at 11:52 AM, Ahmet Arslan wrote:

> Hi Andrew,
>
> Taken from Noah's e-mail :
>
> If you want to be added, edit this file:
>
> https://svn.apache.org/repos/private/committers/docs/github_team.txt
>
> Ahmet
>
>
> On Saturday, May 31, 2014 8:30 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
> Does anyone know who to ask to be added to that group?
[jira] [Comment Edited] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014752#comment-14014752 ]

Dmitriy Lyubimov edited comment on MAHOUT-1566 at 5/31/14 6:56 PM:
-------------------------------------------------------------------

On the implicit feedback issue, I cannot just throw it in at the moment because, like I said, it is implemented in bare Spark + Scala bindings, but porting it to drm algebra without going native to Spark seems to be a bit of a puzzle at the moment. I am also contemplating what additional primitives we may formalize in drm linalg before I resort to falling back to a Spark-coupled method.

was (Author: dlyubimov):
on implicit feedback issue, i can not just throw it in at the moment because like i said, it is implemented in bare Spark + scala bindings but porting it to drm without going native to Spark seems to be a bit of a puzzle at the moment. I am also contemplating what additional primitives we may formalize in drm linalg before I resort to falling back to a Spark-coupled method.
[jira] [Comment Edited] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014752#comment-14014752 ]

Dmitriy Lyubimov edited comment on MAHOUT-1566 at 5/31/14 6:56 PM:
-------------------------------------------------------------------

On the implicit feedback issue, I cannot just throw it in at the moment because, like I said, it is implemented in bare Spark + Scala bindings, but porting it to drm without going native to Spark seems to be a bit of a puzzle at the moment. I am also contemplating what additional primitives we may formalize in drm linalg before I resort to falling back to a Spark-coupled method.

was (Author: dlyubimov):
on implicit feedback issue, i can just jump it up because like i said, it is implemented in bare Spark + scala bindings but porting it to drm without going native to Spark seems to be a bit of a puzzle at the moment. I am also contemplating what additional primitives we may formalize in drm linalg before I resort to falling back to a Spark-coupled method.
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014752#comment-14014752 ] Dmitriy Lyubimov commented on MAHOUT-1566: -- on implicit feedback issue, i can just jump it up because like i said, it is implemented in bare Spark + scala bindings but porting it to drm without going native to Spark seems to be a bit of a puzzle at the moment. I am also contemplating what additional primitives we may formalize in drm linalg before I resort to falling back to a Spark-coupled method. > Regular ALS factorizer with convergence test. > - > > Key: MAHOUT-1566 > URL: https://issues.apache.org/jira/browse/MAHOUT-1566 > Project: Mahout > Issue Type: Task >Affects Versions: 0.9 >Reporter: Dmitriy Lyubimov >Assignee: Dmitriy Lyubimov >Priority: Trivial > Fix For: 1.0 > > > ALS-related: let's start with unweighed, unregularized implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014747#comment-14014747 ] Dmitriy Lyubimov commented on MAHOUT-1566: --

The primary merit here is a new "showcase" of drm algebra, and a chance to iron out as many bugs and issues as possible in preparation for the implicit feedback issue. I have had the implicit feedback method implemented in spark and scala bindings for more than a year (long before it was added to mllib), but it would be harder to address all issues at once with mahout drm algebra in a bigger method than in a smaller one.

For example, in addition to the already listed improvements and fixes in spark bindings, rmse computation makes it pretty clear that the elementwise logical op is missing a physical op for identically distributed matrices (which would produce, again, a matrix distributed identically to the operands). It is also clear that there are probably some bona fide cases when the ABt operator can produce a result identically distributed to the A operand if the result does not have a significant size skew (I am still contemplating the criteria for such an identical-distribution estimate here).

On a more supportive argument side, it is obviously a false dilemma. Since you mentioned there's some nonzero merit, and since both methods are of the same family and can coexist, the merit of the two is more than the merit of either one, not less. Since drm algebra for the basic one is extremely simple, it has practically zero maintenance cost.

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.9
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Priority: Trivial
> Fix For: 1.0
>
> ALS-related: let's start with unweighed, unregularized implementation.

-- This message was sent by Atlassian JIRA (v6.2#6252)
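For reference, the unweighted, unregularized ALS that MAHOUT-1566 starts with, and the RMSE a convergence test would monitor, can be written out as below. This is the textbook formulation, stated as an assumption about what the implementation computes; the symbols A (the m x n input), U (m x k), V (n x k), the iteration index t and the tolerance epsilon are notation chosen for this note, not names taken from the patch.

```latex
\begin{aligned}
&\min_{U,V}\ \lVert A - U V^{\top} \rVert_F^2,
  \qquad A \in \mathbb{R}^{m \times n},\ U \in \mathbb{R}^{m \times k},\ V \in \mathbb{R}^{n \times k} \\
&U \leftarrow A V \,(V^{\top} V)^{-1} \quad \text{(with $V$ held fixed)} \\
&V \leftarrow A^{\top} U \,(U^{\top} U)^{-1} \quad \text{(with $U$ held fixed)} \\
&\mathrm{RMSE}_t = \frac{\lVert A - U_t V_t^{\top} \rVert_F}{\sqrt{mn}},
  \qquad \text{stop when } \frac{\mathrm{RMSE}_{t-1} - \mathrm{RMSE}_t}{\mathrm{RMSE}_{t-1}} < \varepsilon
\end{aligned}
```

The residual A - U V^T is presumably where the elementwise operation and the ABt product mentioned in the comment come into play, since it combines two identically distributed matrices entry by entry.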
Re: Apache group on github
Hi Andrew,

Taken from Noah's e-mail: If you want to be added, edit this file: https://svn.apache.org/repos/private/committers/docs/github_team.txt

Ahmet

On Saturday, May 31, 2014 8:30 PM, Andrew Musselman wrote:
Does anyone know who to ask to be added to that group?
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014737#comment-14014737 ] Hudson commented on MAHOUT-1505:

SUCCESS: Integrated in Mahout-Quality #2624 (See [https://builds.apache.org/job/Mahout-Quality/2624/])

MAHOUT-1505: structure of clusterdump's JSON output (akm) (andrew.musselman: rev e0751eaaca4bb06adcf80108d85ad0dc6cc67074)
* mrlegacy/src/main/java/org/apache/mahout/clustering/Cluster.java
* mrlegacy/src/main/java/org/apache/mahout/clustering/AbstractCluster.java
* CHANGELOG
* mrlegacy/src/test/java/org/apache/mahout/clustering/iterator/TestClusterClassifier.java
* mrlegacy/src/test/java/org/apache/mahout/clustering/TestClusterInterface.java
* integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
* integration/src/main/java/org/apache/mahout/utils/clustering/JsonClusterWriter.java

MAHOUT-1505: structure of clusterdump's JSON output (akm) This closes #5 (akm: rev 1d1134ee941c6a06804fdb5fef536e3ed95ed36e)
* CHANGELOG

> structure of clusterdump's JSON output
> --
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Reporter: Terry Blankers
> Assignee: Andrew Musselman
> Labels: json
> Fix For: 1.0
>
> Attachments: MAHOUT-1505.patch
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output
> using '-of = JSON'. While digging into the structure of the representation of
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c'
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223,
> administration:0.011 r=[action:0.446, adherence:1.501,
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904,
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data
> in these items is by string parsing. For JSON deserialization I would have
> expected to see something along the lines of:
> {noformat}
> {
>     "cluster":"VL-10515",
>     "n":5924,
>     "c":
>     [
>         {"action":0.023},
>         {"adherence":0.223},
>         {"administration":0.011}
>     ],
>     "r":
>     [
>         {"action":0.446},
>         {"adherence":1.501},
>         {"administration":0.306}
>     ]
> }
> {noformat}
> and:
> {noformat}
> {
>     "point": {
>         "body": 6.904,
>         "harm": 10.101
>     },
>     "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
>     "weight": 1.0
> }
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
>     "cluster":"VL-10515",
>     "n":5924,
>     "c":
>     {
>         "action":0.023,
>         "adherence":0.223,
>         "administration":0.011
>     },
>     "r":
>     {
>         "action":0.446,
>         "adherence":1.501,
>         "administration":0.306
>     }
> }
> {noformat}
> {quote}

-- This message was sent by Atlassian JIRA (v6.2#6252)
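To make the reporter's point concrete, here is a minimal Python sketch (not part of Mahout) contrasting the two shapes discussed in MAHOUT-1505. `parse_legacy_cluster`, `parse_weights` and the regular expression are hypothetical helpers written for this illustration; they assume the legacy value has the form `id{n=... c=[...] r=[...]}` as in the reported sample, and they tolerate the missing `]` after the `c` list seen there. The second half reads the object-valued layout suggested in the reply with nothing more than `json.loads`.

```python
import json
import re

# Hypothetical pattern for the legacy single-string form shown in the report:
#   VL-10515{n=5924 c=[term:weight, ...] r=[term:weight, ...]}
# The "]" closing the c list is optional because the reported sample lacks it.
LEGACY = re.compile(
    r"^(?P<id>[^{]+)\{n=(?P<n>\d+)\s+c=\[(?P<c>[^\]]*?)\]?\s*r=\[(?P<r>[^\]]*)\]\}$"
)


def parse_weights(text):
    """Turn 'action:0.023, adherence:0.223' into {'action': 0.023, ...}."""
    weights = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        term, _, value = item.rpartition(":")
        weights[term] = float(value)
    return weights


def parse_legacy_cluster(value):
    """String-parse the legacy 'cluster' value; raises ValueError on mismatch."""
    match = LEGACY.match(value)
    if match is None:
        raise ValueError("unrecognised cluster string: %r" % value)
    return {
        "cluster": match.group("id"),
        "n": int(match.group("n")),
        "c": parse_weights(match.group("c")),
        "r": parse_weights(match.group("r")),
    }


# Legacy shape: JSON only wraps one opaque string, so a second parser is needed.
legacy_line = (
    '{"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, '
    'administration:0.011 r=[action:0.446, adherence:1.501, '
    'administration:0.306]}"}'
)
legacy = parse_legacy_cluster(json.loads(legacy_line)["cluster"])

# Structured shape: the JSON parser alone yields the same data.
structured_line = (
    '{"cluster":"VL-10515","n":5924,'
    '"c":{"action":0.023,"adherence":0.223,"administration":0.011},'
    '"r":{"action":0.446,"adherence":1.501,"administration":0.306}}'
)
structured = json.loads(structured_line)
```

Both paths end in the same dictionary, but the legacy one depends on a hand-written pattern that breaks as soon as the string layout changes, which is the maintenance burden the issue describes.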
[jira] [Updated] (MAHOUT-1470) Topic dump
[ https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Musselman updated MAHOUT-1470: - Affects Version/s: (was: 1.0) 0.9 > Topic dump > -- > > Key: MAHOUT-1470 > URL: https://issues.apache.org/jira/browse/MAHOUT-1470 > Project: Mahout > Issue Type: New Feature > Components: Clustering >Affects Versions: 0.9 >Reporter: Andrew Musselman >Assignee: Andrew Musselman >Priority: Minor > Fix For: 1.0 > > > Per > http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E > > The script needs to be corrected to not call vectordump for LDA as > > vectordump utility (or even clusterdump) are presently not capable of > > displaying topics and relevant documents. I recall this issue was > > previously reported by Peyman Faratin post 0.9 release. > > > > Mahout's missing a clusterdump utility that reads in LDA > > topics, Document - DocumentId mapping and displays a report of the topics > > and the documents that belong to a topic. -- This message was sent by Atlassian JIRA (v6.2#6252)
Apache group on github
Does anyone know who to ask to be added to that group?
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014720#comment-14014720 ] ASF GitHub Bot commented on MAHOUT-1505: Github user asfgit closed the pull request at: https://github.com/apache/mahout/pull/5
[jira] [Resolved] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Musselman resolved MAHOUT-1505. Resolution: Fixed. Merged and pushed to master.
Re: Sketching out scala traits and 1.0 API
Many if not most Mahout committers and contributors will be new to Scala and Spark, certainly to the Mahout Scala DSL. I’m a complete noob to Spark and Scala so I dove into Scala as a first step. It is deceptively simple but you run into odd limitations and special cases quickly. Anyway a good starting point seems to be Scala, especially its functional programming features. Those plus Spark’s architecture, the Mahout Scala DSL, and (especially for the scientist types out there) the Mahout Shell will make doing new code a couple of orders of magnitude easier than java/hadoop/mapreduce.

There is very strong support for Scala on stackoverflow. You will see my simpleton questions there and I encourage everyone to take advantage because the volume of stuff to Google is much smaller than for Java (obviously?)

On May 30, 2014, at 3:12 PM, Andrew Palumbo wrote:

>> IMO we should wait on core DSL functionality if it’s
>> not there but if you are doing something that is external
>> then full blown dataframes may not block you or even help you.
>> Drms are pretty mature. You’ll have to decide that based on
>> your own needs.

Also wanted to say I agree completely; not trying to jump the gun on this.

From: ap@outlook.com
To: dev@mahout.apache.org
Subject: RE: Sketching out scala traits and 1.0 API
Date: Fri, 30 May 2014 18:04:33 -0400

Just jumping in here real quick.. not trying to derail the conversation... I have a lot of catching up to do on the status of the Dataframe implementation, the DSL, and Pat's ItemSimiliarity implementation so that I can better understand what's going on. I'm going to try to take a look at this stuff over the weekend.

I think I see how my thinking of this has been wrong in terms of "Translating a Dataframe to a DRM". Also I think that NB was a bad example because it's kind of a special case classifier.
I guess from my end what I'm wondering, in terms of laying out traits for classifiers, is: are we going to try to provide a kind of weka or R-like pluggable interface? And if so, how would that look? I guess I'm speaking specifically about batch trained, supervised, classification algorithms at this point. (I'm not sure whether anybody else is interested in these going forward, but I am.)

For example, I'm doing some work right now that involves comparing results from some off the shelf algorithms. Working in R, with a small dense dataset; nothing really novel. Once my dataframe is all set up, switching classifiers looks basically like this:

    # Train a random forest
    res.rf <- randomForest(
        formula=formula, data=d_train, nodesize=1, classwt=CLASSWT,
        sampsize=length(d_train[,1]), proximity=F,
        na.action=na.roughfix, ntree=1000)

    # Train an rPartTree
    res.rf <- rpart(
        formula=formula, data=d_train, method="class",
        control=rpart.control(minsplit=2, cp=0))

I know that this is not that useful to the typical Mahout user right now. But with a shell/script, a Linear Algebra DSL with a distributed back end and a bunch of algorithms in the library, I think that this will be, or will draw in new users. The reason I brought up the full NB pipeline is to ensure that if we are to lay out traits for new (classification) algorithms, it is done in the most robust way possible, and in a way that eases development from prototyping in the shell to deployment.

> Date: Fri, 30 May 2014 14:54:20 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: dlie...@gmail.com
> To: dev@mahout.apache.org
>
> Frankly, except for columnar organization and some math summarization
> functionality, i don't see much difference between these data frames and
> e.g. scalding tuple-based manipulations.
>
>
> On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov wrote:
>
>> I am not sure i understand the question. It would be possible to save results
>> of rowSimilarityJob as a data frame.
>> No, data frames do not support quick
>> bidirectional indexing on demand in a sense if we wanted to bring full
>> column or row to front-end process very quickly (e.g. row id -> row vector,
>> or columnName -> column). They will support iterative filtering and
>> mutating just like in dplyr package of R. (I hope).
>>
>> In general, i'd only say that data frames are called data frames because
>> the scope of functionality and intent is that of R data frames (there's no
>> other source for the term of "data frame", i.e. matlab doesn't have those i
>> think) minus quick random individual cell access which is replaced by
>> dplyr-style FP computations.
>>
>> So really i'd say one needs to look at dplyr and R to understand the scope
>> of this at this point in my head.
>>
>> Filtering over rows (including their labels) is implied by dplyr and R.
>> column selection pattern is a bit different, via %.% select() and %.%
>> mutate (it assumes data frames are like tables, few attributes but a lot of
>> rows). Data frames
[jira] [Created] (MAHOUT-1567) Add online sparse dictionary learning
Maciej Kula created MAHOUT-1567:
---
Summary: Add online sparse dictionary learning
Key: MAHOUT-1567
URL: https://issues.apache.org/jira/browse/MAHOUT-1567
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Reporter: Maciej Kula

I have recently implemented a sparse online dictionary learning algorithm, with an emphasis on learning very high-dimensional and very sparse dictionaries. It is based on J. Mairal et al 'Online Dictionary Learning for Sparse Coding' (http://www.di.ens.fr/willow/pdfs/icml09.pdf). It's an online variant of low-rank matrix factorization, suitable for sparse binary matrices (such as implicit feedback matrices).

I would be very happy to bring this up to the Mahout standard and contribute it to the main codebase. Is this something you would in principle be interested in having?

The code (as well as some examples) is here: https://github.com/maciejkula/dictionarylearning

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014575#comment-14014575 ] Suneel Marthi edited comment on MAHOUT-1552 at 5/31/14 9:03 AM: Is this related to the MR configuration not being passed across the job pipeline which was definitely an issue in Mahout 0.7 and was fixed in Mahout 0.8? If so this can be resolved as not a problem, given that this is still being reported against Mahout 0.7. I would also add that any issue that's being reported against 0.7 needs to be first confirmed against present trunk before accepting any patches (most likely to happen with existing CDH 4x distros that were packaged with Mahout 0.7). was (Author: smarthi): Is this related to the MR configuration not being passed across the job pipeline which was definitely an issue in Mahout 0.7 and was fixed in Mahout 0.8? If so this can be resolved as not a problem, given that this is still being reported against Mahout 0.7. > Avoid new Configuration() instantiation > --- > > Key: MAHOUT-1552 > URL: https://issues.apache.org/jira/browse/MAHOUT-1552 > Project: Mahout > Issue Type: Bug > Components: Classification >Affects Versions: 0.7 > Environment: CDH 4.4, CDH 4.6 >Reporter: Sergey > Fix For: 1.0 > > > Hi, it's related to MAHOUT-1498 > You get troubles when run mahout stuff from oozie java action. 
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
>     at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>     at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>     at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>     at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>     at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
> {code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014575#comment-14014575 ] Suneel Marthi commented on MAHOUT-1552: Is this related to the MR configuration not being passed across the job pipeline which was definitely an issue in Mahout 0.7 and was fixed in Mahout 0.8? If so this can be resolved as not a problem, given that this is still being reported against Mahout 0.7.
[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output
[ https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014574#comment-14014574 ] ASF GitHub Bot commented on MAHOUT-1505: Github user sscdotopen commented on the pull request: https://github.com/apache/mahout/pull/5#issuecomment-44743038 looks good to me, +1 for including this
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014573#comment-14014573 ] Sebastian Schelter commented on MAHOUT-1566: I'm not sure whether we should really include the "standard" ALS in the new codebase. It is optimized for rating prediction on Netflix-like data which rarely exists outside of academia. I think we should rather focus on the ALS version targeted for implicit data (clicks, views, etc). > Regular ALS factorizer with convergence test. > - > > Key: MAHOUT-1566 > URL: https://issues.apache.org/jira/browse/MAHOUT-1566 > Project: Mahout > Issue Type: Task >Affects Versions: 0.9 >Reporter: Dmitriy Lyubimov >Assignee: Dmitriy Lyubimov >Priority: Trivial > Fix For: 1.0 > > > ALS-related: let's start with unweighed, unregularized implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1565: --- Fix Version/s: 1.0 > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi > Fix For: 1.0 > > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents
[ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014572#comment-14014572 ] Sebastian Schelter commented on MAHOUT-1564: I don't see any reason to veto this, as it will make stuff that we have more useful. > Naive Bayes Classifier for New Text Documents > - > > Key: MAHOUT-1564 > URL: https://issues.apache.org/jira/browse/MAHOUT-1564 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.9 >Reporter: Andrew Palumbo > Fix For: 1.0 > > > MapReduce Naive Bayes implementation currently lacks the ability to classify > a new document (outside of the training/holdout corpus). I've begun some > work on a "ClassifyNew" job which will do the following: > 1. Vectorize a new text document using the dictionary and document > frequencies from the training/holdout corpus > - assume the original corpus was vectorized using `seq2sparse`; step (1) > will use all of the same parameters. > 2. Score and label a new document using a previously trained model. > I think that it will be a useful addition to the NB package. Unfortunately, > this is going to be mostly MR workhorse code and doesn't really introduce > much new logic. I will try to keep any new logic separate from MR code so > that it can be called from scala for MAHOUT-1493. -- This message was sent by Atlassian JIRA (v6.2#6252)
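Steps 1 and 2 of the MAHOUT-1564 description could be sketched roughly as follows. This is a toy, in-memory Python illustration, not the proposed MapReduce job: the dictionary, document frequencies and per-label weights are made-up values, the function names are hypothetical, and both the TF-IDF weighting (`tf * log(N / df)`) and the multinomial scoring rule are textbook choices assumed for the sketch rather than what `seq2sparse` or Mahout's Naive Bayes actually compute.

```python
import math
from collections import Counter

# Artifacts assumed to come from the training run (toy values).
dictionary = {"cheap": 0, "meeting": 1, "pills": 2, "agenda": 3}  # term -> index
doc_freq = {0: 40, 1: 30, 2: 20, 3: 25}  # index -> training docs containing the term
num_docs = 100                           # size of the training corpus

# Hypothetical trained model: a log prior and per-term log-likelihoods per label.
model = {
    "spam": {"log_prior": math.log(0.4),
             "log_theta": [math.log(p) for p in (0.45, 0.05, 0.45, 0.05)]},
    "ham": {"log_prior": math.log(0.6),
            "log_theta": [math.log(p) for p in (0.10, 0.40, 0.05, 0.45)]},
}


def vectorize(text):
    """Step 1: map a new document onto the training dictionary with TF-IDF weights.

    Terms that are not in the training dictionary are dropped.
    """
    counts = Counter(t for t in text.lower().split() if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        index = dictionary[term]
        vector[index] = tf * math.log(num_docs / doc_freq[index])
    return vector


def classify(vector):
    """Step 2: score the vector against every label and return the best one."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            weight * params["log_theta"][index] for index, weight in vector.items()
        )
    return max(scores, key=scores.get), scores


label, scores = classify(vectorize("Cheap pills cheap unseen-word"))
```

The point of the sketch is the dependency between the steps: the new document has to be vectorized with the training run's dictionary and document frequencies, otherwise the indices and weights would not line up with the trained model.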
[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014570#comment-14014570 ] Sebastian Schelter commented on MAHOUT-1543: Could you create a pull request to the current mahout codebase? > JSON output format for classifying with random forests > -- > > Key: MAHOUT-1543 > URL: https://issues.apache.org/jira/browse/MAHOUT-1543 > Project: Mahout > Issue Type: Improvement > Components: Classification >Affects Versions: 0.7, 0.8, 0.9 >Reporter: larryhu > Labels: patch > Fix For: 0.7 > > Attachments: MAHOUT-1543.patch > > > This patch adds JSON output format to build random forests, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014571#comment-14014571 ] Sebastian Schelter commented on MAHOUT-1552: Could you suggest a way to fix the bug?
[jira] [Updated] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1552: Fix Version/s: 1.0
[jira] [Updated] (MAHOUT-1551) Add document to describe how to use mlp with command line
[ https://issues.apache.org/jira/browse/MAHOUT-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1551: --- Fix Version/s: 1.0 > Add document to describe how to use mlp with command line > - > > Key: MAHOUT-1551 > URL: https://issues.apache.org/jira/browse/MAHOUT-1551 > Project: Mahout > Issue Type: Documentation > Components: Classification, CLI, Documentation >Affects Versions: 0.9 >Reporter: Yexi Jiang > Labels: documentation > Fix For: 1.0 > > > Add documentation about the usage of multi-layer perceptron in command line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1524) Script to auto-generate and view the Mahout website on a local machine
[ https://issues.apache.org/jira/browse/MAHOUT-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1524: --- Fix Version/s: 1.0 > Script to auto-generate and view the Mahout website on a local machine > --- > > Key: MAHOUT-1524 > URL: https://issues.apache.org/jira/browse/MAHOUT-1524 > Project: Mahout > Issue Type: New Feature > Components: Documentation >Reporter: Saleem Ansari > Fix For: 1.0 > > Attachments: mahout-website.sh > > > Attached with this ticket is a script that creates a simple setup for editing > Mahout Website on a local machine. > It is useful in the sense that, we can edit the source and the changes are > automatically reflected in the generated site. All we need to do is refresh > the browser. No further steps required. > So now one can review the website changes ( the complete website ), on a > developer's machine. -- This message was sent by Atlassian JIRA (v6.2#6252)