Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
Congrats! 2018-07-19 9:31 GMT+02:00 Peng Zhang : > Congrats Andrew! > > On Thu, Jul 19, 2018 at 04:01 Andrew Musselman > > wrote: > > > Thanks Andy, looking forward to it! Thank you too for your support and > > dedication over the past two years; here's to continued progress! > > > > Best > > Andrew > > > > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo > > wrote: > > > Please join me in congratulating Andrew Musselman as the new Chair of > > > the > > > Apache Mahout Project Management Committee. I would like to thank > > > Andrew > > > for stepping up; all of us who have worked with him over the years > > > know his > > > dedication to the project to be invaluable. I look forward to Andrew > > > taking the project into the future. > > > > > > Thank you, > > > > > > Andy > > >
[jira] [Commented] (MAHOUT-1884) Allow specification of dimensions of a DRM
[ https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546112#comment-15546112 ] Sebastian Schelter commented on MAHOUT-1884: I know that this is already supported internally; I want to expose it as optional parameters to drmDfsRead. I disagree that caching an input matrix on read is always what users intend; at the least, I want to retain control over what is cached and what is not. > Allow specification of dimensions of a DRM > -- > > Key: MAHOUT-1884 > URL: https://issues.apache.org/jira/browse/MAHOUT-1884 > Project: Mahout > Issue Type: Improvement > Affects Versions: 0.12.2 > Reporter: Sebastian Schelter > Assignee: Sebastian Schelter > Priority: Minor > > Currently, in many cases, a DRM must be read to compute its dimensions when a > user calls nrow or ncol. This also implicitly caches the corresponding DRM. > In some cases, the user actually knows the matrix dimensions (e.g., when the > matrices are synthetically generated, or when some metadata about them is > known). In such cases, the user should be able to specify the dimensions upon > creating the DRM and the caching should be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1884) Allow specification of dimensions of a DRM
Sebastian Schelter created MAHOUT-1884: -- Summary: Allow specification of dimensions of a DRM Key: MAHOUT-1884 URL: https://issues.apache.org/jira/browse/MAHOUT-1884 Project: Mahout Issue Type: Improvement Affects Versions: 0.12.2 Reporter: Sebastian Schelter Assignee: Sebastian Schelter Priority: Minor Currently, in many cases, a DRM must be read to compute its dimensions when a user calls nrow or ncol. This also implicitly caches the corresponding DRM. In some cases, the user actually knows the matrix dimensions (e.g., when the matrices are synthetically generated, or when some metadata about them is known). In such cases, the user should be able to specify the dimensions upon creating the DRM and the caching should be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
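To make the proposal concrete, here is a hedged sketch of what the optional dimension hints might look like. drmDfsRead exists in the Samsara DSL, but the nrowHint/ncolHint parameters below are a hypothetical illustration of the ticket's idea, not a committed API:

    import org.apache.mahout.math.drm._

    // Hypothetical signature for the proposal (NOT the current API):
    // def drmDfsRead(path: String, nrowHint: Long = -1L, ncolHint: Int = -1)
    //               (implicit dc: DistributedContext): CheckpointedDrm[Int]
    //
    // A caller who generated the matrix synthetically could then write:
    // val drmA = drmDfsRead("hdfs:///tmp/A", nrowHint = 1000000L, ncolHint = 500)
    //
    // drmA.nrow and drmA.ncol would return the hints immediately, without
    // scanning the data and without implicitly caching the DRM.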
[jira] [Commented] (MAHOUT-1748) Mahout DSL for Flink: switch to Flink Scala API
[ https://issues.apache.org/jira/browse/MAHOUT-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599247#comment-14599247 ] Sebastian Schelter commented on MAHOUT-1748: +1, makes sense Mahout DSL for Flink: switch to Flink Scala API --- Key: MAHOUT-1748 URL: https://issues.apache.org/jira/browse/MAHOUT-1748 Project: Mahout Issue Type: Task Components: Math Affects Versions: 0.10.2 Reporter: Alexey Grigorev Priority: Minor In Flink-Mahout (MAHOUT-1570), the Flink Java API is used because the Scala API caused various strange compilation problems. But the Scala API handles types better than the Flink Java API, so it's better to switch to the Scala API. It may also solve MAHOUT-1747. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585002#comment-14585002 ] Sebastian Schelter commented on MAHOUT-1739: The FileItemSimilarity class reads the output of ItemSimilarityJob. You can then use the resulting ItemSimilarity with Mahout's recommenders. [1] https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/cf/taste/impl/similarity/file/FileItemSimilarity.java maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
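For readers unfamiliar with the class, a minimal sketch of the wiring Sebastian describes, using the standard Taste APIs (FileDataModel, FileItemSimilarity, GenericItemBasedRecommender); the file paths are hypothetical:

    import java.io.File
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
    import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender

    // Hypothetical paths: the raw preferences plus the output of ItemSimilarityJob.
    val model = new FileDataModel(new File("/data/preferences.csv"))
    val similarity = new FileItemSimilarity(new File("/data/item-similarities.txt"))

    // Plug the precomputed similarities into an item-based recommender.
    val recommender = new GenericItemBasedRecommender(model, similarity)
    val topTen = recommender.recommend(42L, 10) // 10 recommendations for user 42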
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584970#comment-14584970 ] Sebastian Schelter commented on MAHOUT-1739: Actually, this is exactly what we want. All the similarity measures used in Mahout are symmetric, so the upper triangular part of the similarity matrix already contains all the information. I think I also know where this bug comes from. It's actually not a bug, but the parameter maxSimilarItemsPerItem is not named very well. Let's say maxSimilarItemsPerItem is 10. Now for an item A, we compute the 10 most similar items. There might be an item B for which A is in its 10 most similar items, but B is not in the 10 most similar items of A. In order to guarantee that we have the 10 most similar items for B, we unfortunately must output 11 similar items for A. Does that make sense? maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584986#comment-14584986 ] Sebastian Schelter commented on MAHOUT-1739: We have code that takes this triangular matrix and uses it as an ItemSimilarity for our recommenders. In that case, users don't even have to care about the internal data representation. maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1739: --- Resolution: Not A Problem Status: Resolved (was: Patch Available) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584474#comment-14584474 ] Sebastian Schelter commented on MAHOUT-1739: Could you supply a unit test that clearly shows that this is not working? maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Fix For: 0.10.0, 0.10.1 The output of ItemSimilarityJob may exceed the number of similar items we set to this parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584479#comment-14584479 ] Sebastian Schelter commented on MAHOUT-1739: Could you supply a unit test that shows a case where maxSimilarItemsPerItem is not correctly handled by the current code? maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1570: --- Comment: was deleted (was: I don't think it makes sense to issue pull requests with unfinished code.) Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1570: --- Comment: was deleted (was: I don't think it makes sense to issue pull requests with unfinished code.) Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578440#comment-14578440 ] Sebastian Schelter commented on MAHOUT-1570: I don't think it makes sense to issue pull requests with unfinished code. Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578438#comment-14578438 ] Sebastian Schelter commented on MAHOUT-1570: I don't think it makes sense to issue pull requests with unfinished code. Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578439#comment-14578439 ] Sebastian Schelter commented on MAHOUT-1570: I don't think it makes sense to issue pull requests with unfinished code. Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521581#comment-14521581 ] Sebastian Schelter commented on MAHOUT-1570: great to see this finally happening Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Sebastian Schelter Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: co-occurrence paper and code
I chose against porting all the similarity measures to the DSL version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code super hard to read. Second, in practice, I have never seen anything give better results than LLR. As Ted pointed out, a lot of the foundation for using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at the LLR paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of the p-value if it had been a classic test). LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
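To make the LLR discussion concrete, here is a small sketch of computing LLR from a 2x2 co-occurrence contingency table, using the LogLikelihood class that the cooccurrence code already imports (the counts are made-up illustration values):

    import org.apache.mahout.math.stats.LogLikelihood

    // Hypothetical counts for a pair of items A and B over all users:
    // k11 = users who interacted with both A and B
    // k12 = users who interacted with A but not B
    // k21 = users who interacted with B but not A
    // k22 = users who interacted with neither
    val llr = LogLikelihood.logLikelihoodRatio(100L, 900L, 400L, 8600L)

    // Higher scores flag co-occurrence that is unlikely to be chance,
    // i.e. LLR acts as a similarity here, as Dmitriy concludes above.
    println(f"LLR = $llr%.2f")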
Re: co-occurrence paper and code
Sounds good to me. -s On 06.08.2014 17:15, Dmitriy Lyubimov dlie...@gmail.com wrote: What I mean here is that I probably need to refactor it a little, so that there's a part of the algorithm that accepts co-occurrence input directly and is somewhat decoupled from the part that accepts user x item input and does the downsampling and co-occurrence construction. That way I could do some customization of my own to the co-occurrence construction. Would that be reasonable if I do that? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because I am considering pulling this implementation, but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (e.g. the priority queue queuing/enqueuing does twice the work it really needs to do, etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the DSL version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code super hard to read. Second, in practice, I have never seen anything give better results than LLR. As Ted pointed out, a lot of the foundation for using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at the LLR paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of the p-value if it had been a classic test). LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
Re: Mahout V2
Nice. There is still huge potential for optimization in the spark bindings. -s On 05.07.2014 15:21, Andrew Musselman andrew.mussel...@gmail.com wrote: Crazy awesome. On Jul 5, 2014, at 4:19 PM, Pat Ferrel p...@occamsmachete.com wrote: I compared spark-itemsimilarity to the Hadoop version on sample data that is 8.7 M entries, 49290 x 139738, using my little 2-machine cluster and got the following speedup.

Platform        Elapsed Time
Mahout Hadoop   0:20:37
Mahout Spark    0:02:19

This isn't quite apples to apples because the Spark version does all the dictionary management, which is usually two extra jobs tacked on before and after the Hadoop job. I've done the complete pipeline using Hadoop and Spark now and can say that not only is it faster, but the old Hadoop way required keeping track of 10x more intermediate data and connecting up many more jobs to get the pipeline working. Now it's just one job. You don't need to worry about ID translation anymore and you get over 10x faster completion — this is one of those times when speed meets ease-of-use.
Re: H2O integration - intermediate progress update
I share the impression that the tone of conversation has not been very welcoming lately, be it intentional or not. I think that we should remind ourselves why we are working on open source and try to improve our ways of communication. I think we should try to get as many people as possible together to sit at a table and have some face-to-face discussion over a beer or coffee. --sebastian On 06/19/2014 07:18 AM, Dmitriy Lyubimov wrote: On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I did not mean to discourage sincere search for answers. The tone of answers has lately been very discouraging for those sincerely searching for answers. I think we as a community have a responsibility to do better about this. There is no need to be insulting to people asking honest questions in a civil tone. Ted, we've been at this already. There have been more arguments than questions. I am just providing my counterarguments. Do you insist on the term insulting? Cause this is, you know, insulting. You are heading in the ad hominem direction again.
Re: cf/cooccurrence code
Hi Anand, Yes, this should not contain anything Spark-specific. +1 for moving it. --sebastian On 06/19/2014 08:38 PM, Anand Avati wrote: Hi Pat and others, I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is there a specific reason? I see that the code itself is completely Spark-agnostic. I tried moving the code under math-scala/src/main/scala/org/apache/mahout/math/cf/ with the following trivial patch:

diff --git a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
index ee44f90..bd20956 100644
--- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
+++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
@@ -22,7 +22,6 @@
 import scalabindings._
 import RLikeOps._
 import drm._
 import RLikeDrmOps._
-import org.apache.mahout.sparkbindings._
 import scala.collection.JavaConversions._
 import org.apache.mahout.math.stats.LogLikelihood

and it seems to work just fine. From what I see, this should work just fine on H2O as well with no changes. Why give up generality and make it Spark-specific? Thanks
[jira] [Resolved] (MAHOUT-1580) Optimize getNumNonZeroElements
[ https://issues.apache.org/jira/browse/MAHOUT-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1580. Resolution: Fixed Optimize getNumNonZeroElements -- Key: MAHOUT-1580 URL: https://issues.apache.org/jira/browse/MAHOUT-1580 Project: Mahout Issue Type: Improvement Components: Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 getNumNonZeroElements in AbstractVector uses the nonZeroes iterator internally, which adds a lot of overhead for certain types of vectors, e.g. dense ones. We should add custom implementations here. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Engine specific algos
I think rejecting that contribution is the right thing to do. I think it's very important to narrow our focus. Let us put our efforts into finishing and polishing what we are working on right now. A big problem of the old Mahout was that we set the barrier for contributions too low and ended up with lots of non-integrated, hard-to-use algorithms of varying quality. What is the problem with not accepting a contribution? We agreed with Andy that this might be better suited for inclusion in Spark's codebase, and I think that was the right decision. -s On 06/18/2014 10:29 PM, Pat Ferrel wrote: Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random forests Also, we don't have any mappings for Spark Streaming -- so if your implementation heavily relies on Spark Streaming, I think Spark itself is the right place for it to be a part of. We are discouraging engine-specific work? Even dismissing Spark Streaming as a whole? As it stands we don't have purely (c) methods, and indeed I believe these methods may be totally engine-specific, in which case MLlib is possibly one of the good homes for them. Adherence to a specific incarnation of an engine-neutral DSL has become a requirement for inclusion in Mahout? The current DSL cannot be extended? Or it can't be extended in engine-specific ways? Or it can't be extended with Spark Streaming? I would have thought all of these things desirable; otherwise we are limiting ourselves to a subset of what an engine can do, or a subset of problems that the current DSL supports. I hope I'm misreading this, but it looks like we just discouraged a contributor from adding post-Hadoop code in an interesting area to Mahout?
[jira] [Commented] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly
[ https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030429#comment-14030429 ] Sebastian Schelter commented on MAHOUT-1579: Xiaomeng, could you create a pull request to https://github.com/apache/mahout on github? That would make it easier to review your code. Implement a datamodel which can load data from hadoop filesystem directly - Key: MAHOUT-1579 URL: https://issues.apache.org/jira/browse/MAHOUT-1579 Project: Mahout Issue Type: Improvement Reporter: Xiaomeng Huang Priority: Minor Attachments: Mahout-1579.patch As we all know, FileDataModel can only load data from the local filesystem. But big data is usually stored in a hadoop filesystem (e.g. hdfs). If we want to deal with data in hdfs, we must run a mapred job. It's necessary to implement a data model which can load data from the hadoop filesystem directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
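Pending such a data model, a minimal sketch of the workaround idea, assuming the standard Hadoop and Taste APIs (the HDFS path is hypothetical): copy the file out of HDFS and hand it to the existing FileDataModel. A native HDFS-backed DataModel, as the ticket proposes, would stream the file directly instead.

    import java.io.File
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel

    // Pull the preference file out of HDFS into a local temp file...
    val fs = FileSystem.get(new Configuration())
    val local = File.createTempFile("prefs", ".csv")
    fs.copyToLocalFile(new Path("hdfs:///data/preferences.csv"),
                       new Path(local.getAbsolutePath))

    // ...and feed it to the existing local-filesystem DataModel.
    val model = new FileDataModel(local)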
[jira] [Created] (MAHOUT-1580) Optimize getNumNonZeroElements
Sebastian Schelter created MAHOUT-1580: -- Summary: Optimize getNumNonZeroElements Key: MAHOUT-1580 URL: https://issues.apache.org/jira/browse/MAHOUT-1580 Project: Mahout Issue Type: Improvement Components: Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 getNumNonZeroElements in AbstractVector uses the nonZeroes iterator internally, which adds a lot of overhead for certain types of vectors, e.g. dense ones. We should add custom implementations here. -- This message was sent by Atlassian JIRA (v6.2#6252)
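A hedged sketch of the kind of specialization the ticket asks for: counting non-zeros directly over a dense backing array instead of going through the iterator machinery. The denseValues parameter stands in for DenseVector's internal array, which a real override would access directly:

    // Illustration only: the real fix would override getNumNonZeroElements()
    // inside DenseVector; this free function just shows the core loop.
    def numNonZeroesDense(denseValues: Array[Double]): Int = {
      var count = 0
      var i = 0
      while (i < denseValues.length) {   // tight loop, no iterator allocation
        if (denseValues(i) != 0.0) count += 1
        i += 1
      }
      count
    }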
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
I'm a bit lost in this discussion. Why do we assume that getNumNonZeroElements() on a Vector only returns an upper bound? The code in AbstractVector clearly returns the non-zeros only:

int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here: why does iterateNonZero potentially return 0's? --sebastian On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345 ] ASF GitHub Bot commented on MAHOUT-1464: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/12#issuecomment-45915940 fix header to say MAHOUT-1464, then hit close and reopen, it will restart the echo. Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAHOUT-1578) Optimizations in matrix serialization
Sebastian Schelter created MAHOUT-1578: -- Summary: Optimizations in matrix serialization Key: MAHOUT-1578 URL: https://issues.apache.org/jira/browse/MAHOUT-1578 Project: Mahout Issue Type: Bug Components: Math Reporter: Sebastian Schelter Fix For: 1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028360#comment-14028360 ] Sebastian Schelter commented on MAHOUT-1464: Hi, The computation of A'A is usually done without explicitly forming A'. Instead, A'A is computed as the sum of outer products of the rows of A. --sebastian Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
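In symbols, the observation in the comment above is the standard identity (with $a_i$ denoting the $i$-th row of $A$, treated as a column vector):

    $A^{\top}A = \sum_{i} a_i\, a_i^{\top}$

so each worker can accumulate the outer products of the rows it holds locally, and no physical transpose of the distributed matrix is ever materialized.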
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Oh good catch! I had an extra binarize method before, so that the data was already binary. I merged that into the downsample code and must have overlooked that thing. You are right, numNonZeros is the way to go! On 06/10/2014 01:11 AM, Ted Dunning wrote: Sounds like a very plausible root cause. On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893 ] Pat Ferrel commented on MAHOUT-1464: seems like the downsampleAndBinarize method is returning the wrong values. It is actually summing the values where it should be counting the non-zero elements?

// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {
  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean, the sum will not be the number of interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }
}

Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: TreeBasedRecommenders(Deprecated?)
Hi Sahil, don't worry, you're not breaking any rules. We removed the tree-based recommenders because we have never heard of anyone using them over the years. --sebastian On 06/10/2014 09:01 AM, Sahil Sharma wrote: Hi, Firstly, I apologize if I'm breaking any rules by mailing this way; I'm new to this and would appreciate any help I can get. I was just playing around with the tree-based recommender (which seems to be deprecated in the current version for lack of use). Why was it deprecated? Also, I just looked at the code, and it seems to be doing a lot of redundant computation. For example, we could store a matrix of cluster-cluster distances (and hence avoid recomputing the closest clusters every time, by updating the matrix whenever we merge two clusters), and also, when trying to determine the farthest-distance-based similarity between two clusters, the pair which realizes it could be stored and updated upon merging, so that this computation need not be repeated again and again. Just wondering if this repeated computation was not a reason for deprecating the class (since people might have found a slow recommender lacking use). Would be glad to hear the thoughts of others on this, and also to implement an efficient version if the community agrees.
SparkBindings on a real cluster
Hi, I did some experimentation with the spark bindings on a real cluster yesterday, as I had to run some experiments for a paper (unrelated to Mahout) that I'm currently writing. The experiment basically consists of multiplying a sparse data matrix by a super-sparse permutation-like matrix from the left. It took me the whole day to get it working, up to matrices with 500M entries. I ran into lots of issues that we have to fix asap; unfortunately I don't have much time in the next weeks, so I'm just sharing a list of the issues that I ran into (maybe I'll find some time to create issues for these things on the weekend). I think the major challenge for us will be to get the choice of dense/sparse correct and to put lots of work into memory efficiency. This could be a great hook for collaborating with the h2o folks, as they know how to make vector-like data small and computations fast. Here's the list:

* our matrix serialization in MatrixWritable is seriously flawed, I ran into the following errors:
  - the type information is stored with every vector, although a matrix always contains only vectors of the same type
  - all entries of a TransposeView (and possibly other views) of a sparse matrix are serialized, resulting in OOM
  - for sparse row matrices, the vectors are set using assign instead of via constructor injection; this results in huge memory consumption and long creation times, as some implementations use binary search for assignment
* a dense matrix is converted into a SparseRowMatrix with dense row vectors by blockify(); after serialization this becomes a dense matrix in sparse format (triggering OOMs)!
* drmFromHDFS does not have an option to set the number of desired partitions
* SparseRowMatrix with sequential vectors times SparseRowMatrix with sequential vectors is totally broken; it uses three nested loops and calls get(row, col) on the matrices, which internally uses binary search...
* the At operator adds the column vectors it creates; this is unnecessary, as we don't need the addition, we can just merge the vectors
* we need a dedicated operator for inCoreA %*% drmB; currently this gets rewritten to (drmB.t %*% inCoreA.t).t, which is highly inefficient (I have a prototype of that operator; see the identity below)

Best, Sebastian
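For reference, the rewrite mentioned in the last bullet relies on nothing more than the transpose identity

    $AB = (B^{\top} A^{\top})^{\top}$

which lets the optimizer express an in-core-matrix-times-DRM product through the existing DRM-times-in-core operator, at the cost of transposing the distributed operand twice; that double transposition is exactly the inefficiency a dedicated operator would avoid.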
Contributions coming
Hi, as you know we still have a lot of open documentation tickets. Therefore, I decided to offer these tickets as projects to students in a university lecture that I'm giving with some colleagues: MAHOUT-1495 MAHOUT-1470 MAHOUT-1462 MAHOUT-1477 MAHOUT-1485 MAHOUT-1423 MAHOUT-1427 MAHOUT-1536 MAHOUT-1551 MAHOUT-1493 In the next weeks, the students will join the mailing list and start working on the documentation and examples. Let's give them a warm welcome and help them learn how to produce open source software. Best, Sebastian
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
The important thing here is that we test the code on a sufficiently large dataset on a real cluster. Take that on, if you want! On 02.06.2014 20:08, Pat Ferrel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667 ] Pat Ferrel commented on MAHOUT-1464: [~ssc] Should I reassign to me for now so we can get this committed? Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: mlib versus spark
Hi Saikat, The differences are that MLlib offers a different set of algorithms (e.g., you won't find cooccurrence analysis or stochastic SVD there) and that their codebase consists of hand-tuned, Spark-specific implementations. Mahout, on the other hand, allows algorithms to be implemented in an engine-agnostic, declarative way. This allows for automatic optimization of our algorithms, as well as for running the same code on multiple backends (there has been interest from H2O as well as Apache Flink in integrating with our DSL). --sebastian On 06/01/2014 01:41 AM, Saikat Kanjilal wrote: Actually the subject of my email should say spark-mlib versus mahout-spark :) From: sxk1...@hotmail.com To: dev@mahout.apache.org Subject: mlib versus spark Date: Sat, 31 May 2014 16:38:13 -0700 Ok, I'll admit I'm not seeing what the obvious differences are. I'm a bit confused when I think of Mahout using Spark: since Spark already ships an embedded machine learning library (MLlib), what would be the impetus to use Mahout instead? It seems like you should be able to write or add algorithms to MLlib and use Spark. Has someone from Mahout looked at MLlib to see if there will be a strong use case for using one versus the other? http://spark.apache.org/mllib/
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014918#comment-14014918 ] Sebastian Schelter commented on MAHOUT-1566: If it's a mere showcase, could we maybe add it as an example in an examples package, rather than as a full-fledged algorithm implementation? Regular ALS factorizer with convergence test. - Key: MAHOUT-1566 URL: https://issues.apache.org/jira/browse/MAHOUT-1566 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Priority: Trivial Fix For: 1.0 ALS-related: let's start with an unweighted, unregularized implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
Problems with mapBlock()
I've updated the codebase to work on the cooccurrence analysis algo, but I always run into this error now: error: value mapBlock is not a member of org.apache.mahout.math.drm.DrmLike[Int] I have the feeling that an implicit conversion might be missing, but I couldn't figure out where to put it without producing even more errors. --sebastian
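A guess at the fix, grounded in the imports that CooccurrenceAnalysis.scala itself pulls in (see the diff in the cf/cooccurrence thread above): mapBlock is added to DrmLike via implicits from the drm bindings, so a file that only imports the base types will not see it. Whether these exact imports suffice in Sebastian's file is an assumption:

    // Fully qualified equivalents of the imports used by CooccurrenceAnalysis.scala:
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // With these implicits in scope, mapBlock should resolve on a DrmLike[Int]:
    // val drmB = drmA.mapBlock() { case (keys, block) => (keys, block) }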
[jira] [Updated] (MAHOUT-1524) Script to auto-generate and view the Mahout website on a local machine
[ https://issues.apache.org/jira/browse/MAHOUT-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1524: --- Fix Version/s: 1.0 Script to auto-generate and view the Mahout website on a local machine --- Key: MAHOUT-1524 URL: https://issues.apache.org/jira/browse/MAHOUT-1524 Project: Mahout Issue Type: New Feature Components: Documentation Reporter: Saleem Ansari Fix For: 1.0 Attachments: mahout-website.sh Attached with this ticket is a script that creates a simple setup for editing Mahout Website on a local machine. It is useful in the sense that, we can edit the source and the changes are automatically reflected in the generated site. All we need to do is refresh the browser. No further steps required. So now one can review the website changes ( the complete website ), on a developer's machine. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1551) Add document to describe how to use mlp with command line
[ https://issues.apache.org/jira/browse/MAHOUT-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1551: --- Fix Version/s: 1.0 Add document to describe how to use mlp with command line - Key: MAHOUT-1551 URL: https://issues.apache.org/jira/browse/MAHOUT-1551 Project: Mahout Issue Type: Documentation Components: Classification, CLI, Documentation Affects Versions: 0.9 Reporter: Yexi Jiang Labels: documentation Fix For: 1.0 Add documentation about the usage of multi-layer perceptron in command line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1552: --- Fix Version/s: 1.0 Avoid new Configuration() instantiation --- Key: MAHOUT-1552 URL: https://issues.apache.org/jira/browse/MAHOUT-1552 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Environment: CDH 4.4, CDH 4.6 Reporter: Sergey Fix For: 1.0 Hi, it's related to MAHOUT-1498. You get into trouble when you run Mahout stuff from an Oozie java action. {code}
java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
  at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
  at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014570#comment-14014570 ] Sebastian Schelter commented on MAHOUT-1543: Could you create a pull request against the current mahout codebase? JSON output format for classifying with random forests -- Key: MAHOUT-1543 URL: https://issues.apache.org/jira/browse/MAHOUT-1543 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: larryhu Labels: patch Fix For: 0.7 Attachments: MAHOUT-1543.patch This patch adds a JSON output format for building random forests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014571#comment-14014571 ] Sebastian Schelter commented on MAHOUT-1552: Could you suggest a way to fix the bug? Avoid new Configuration() instantiation --- Key: MAHOUT-1552 URL: https://issues.apache.org/jira/browse/MAHOUT-1552 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Environment: CDH 4.4, CDH 4.6 Reporter: Sergey Fix For: 1.0 Hi, it's related to MAHOUT-1498. You get into trouble when you run Mahout stuff from an Oozie java action. {code}
java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
  at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
  at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
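A hedged sketch of the usual fix for this class of problem, assuming the drivers follow Hadoop's Configured/Tool pattern (which the stack trace above suggests, given ToolRunner is on it): reuse the configuration injected by the runner instead of instantiating a fresh one, so settings passed in by Oozie are not silently dropped.

    import org.apache.hadoop.conf.{Configuration, Configured}
    import org.apache.hadoop.util.Tool

    // HypotheticalDriver is an illustration, not Mahout's actual class.
    class HypotheticalDriver extends Configured with Tool {
      override def run(args: Array[String]): Int = {
        // Anti-pattern: `new Configuration()` ignores the caller's settings.
        // val conf = new Configuration()

        // Fix: use the configuration that ToolRunner (and thus Oozie) injected.
        val conf: Configuration = getConf
        // ... set up and submit the job with `conf` ...
        0
      }
    }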
[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents
[ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014572#comment-14014572 ] Sebastian Schelter commented on MAHOUT-1564: I don't see any reason to veto this, as it will make stuff that we have more useful. Naive Bayes Classifier for New Text Documents - Key: MAHOUT-1564 URL: https://issues.apache.org/jira/browse/MAHOUT-1564 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Andrew Palumbo Fix For: 1.0 MapReduce Naive Bayes implementation currently lacks the ability to classify a new document (outside of the training/holdout corpus). I've begun some work on a ClassifyNew job which will do the following: 1. Vectorize a new text document using the dictionary and document frequencies from the training/holdout corpus - assume the original corpus was vectorized using `seq2sparse`; step (1) will use all of the same parameters. 2. Score and label a new document using a previously trained model. I think that it will be a useful addition to the NB package. Unfortunately, this is going to be mostly MR workhorse code and doesn't really introduce much new logic. I will try to keep any new logic separate from MR code so that it can be called from scala for MAHOUT-1493. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1565: --- Fix Version/s: 1.0 add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Fix For: 1.0 Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014573#comment-14014573 ] Sebastian Schelter commented on MAHOUT-1566: I'm not sure whether we should really include the standard ALS in the new codebase. It is optimized for rating prediction on Netflix-like data, which rarely exists outside of academia. I think we should rather focus on the ALS version targeted at implicit data (clicks, views, etc). Regular ALS factorizer with convergence test. - Key: MAHOUT-1566 URL: https://issues.apache.org/jira/browse/MAHOUT-1566 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Priority: Trivial Fix For: 1.0 ALS-related: let's start with an unweighted, unregularized implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
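For context, the implicit-feedback variant Sebastian refers to is commonly formulated (following Hu, Koren & Volinsky, "Collaborative Filtering for Implicit Feedback Datasets") as minimizing

    $\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$

where $p_{ui}$ is a 0/1 preference indicating whether user $u$ interacted with item $i$, and the confidence $c_{ui}$ grows with the interaction count; this targets exactly the click/view-style data the comment argues Mahout should focus on.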
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012136#comment-14012136 ] Sebastian Schelter commented on MAHOUT-1565: I'd favor removing that add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010124#comment-14010124 ] Sebastian Schelter commented on MAHOUT-1529: Hi Dmitriy, the PR looks good, +1 from me, go ahead! Best, Sebastian Finalize abstraction of distributed logical plans from backend operations - Key: MAHOUT-1529 URL: https://issues.apache.org/jira/browse/MAHOUT-1529 Project: Mahout Issue Type: Improvement Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 We have a few situations when algorithm-facing API has Spark dependencies creeping in. In particular, we know of the following cases: -(1) checkpoint() accepts Spark constant StorageLevel directly;- (2) certain things in CheckpointedDRM; (3) drmParallelize etc. routines in the drm and sparkbindings package. (5) drmBroadcast returns a Spark-specific Broadcast object *Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529. *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1536) Update Creating vectors from text page
[ https://issues.apache.org/jira/browse/MAHOUT-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008319#comment-14008319 ] Sebastian Schelter commented on MAHOUT-1536: Added the changes. Can someone have a look at the Lucene part of the site? We should post the currently used Lucene version there rather than requiring users to look into the POM, for example. Update Creating vectors from text page Key: MAHOUT-1536 URL: https://issues.apache.org/jira/browse/MAHOUT-1536 Project: Mahout Issue Type: Bug Components: Documentation Affects Versions: 0.9 Reporter: Andrew Palumbo Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1536_edit1.patch, MAHOUT-1536_edit2.patch At least the seq2sparse section of the Creating vectors from text page is out of date. https://mahout.apache.org/users/basics/creating-vectors-from-text.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1480) Clean up website on 20 newsgroups
[ https://issues.apache.org/jira/browse/MAHOUT-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1480: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed, thank you very much Clean up website on 20 newsgroups - Key: MAHOUT-1480 URL: https://issues.apache.org/jira/browse/MAHOUT-1480 Project: Mahout Issue Type: Improvement Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1480_edit1.patch, MAHOUT-1480_edit2.patch The website on the twenty newsgroups example needs clean up. We need to go through the text, remove dead links and check whether the information is still consistent with the current code. https://mahout.apache.org/users/clustering/twenty-newsgroups.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1446) Create an intro for matrix factorization
[ https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1446. Resolution: Fixed Assignee: Sebastian Schelter Jian, thank you very much, you did a great job. I put the page online, could you have a look at it? Thx, Sebastian Create an intro for matrix factorization Key: MAHOUT-1446 URL: https://issues.apache.org/jira/browse/MAHOUT-1446 Project: Mahout Issue Type: New Feature Components: Documentation Reporter: Maciej Mazur Assignee: Sebastian Schelter Fix For: 1.0 Attachments: matrix-factorization.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1560) Last batch is not filled correctly in MultithreadedBatchItemSimilarities
[ https://issues.apache.org/jira/browse/MAHOUT-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1560. Resolution: Fixed Fix Version/s: 1.0 Assignee: Sebastian Schelter committed, thank you for the contribution Last batch is not filled correctly in MultithreadedBatchItemSimilarities Key: MAHOUT-1560 URL: https://issues.apache.org/jira/browse/MAHOUT-1560 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Jarosław Bojar Assignee: Sebastian Schelter Priority: Minor Fix For: 1.0 Attachments: Corrected_last_batch_size_calculation.patch, MultithreadedBatchItemSimilaritiesTest.patch In {{MultithreadedBatchItemSimilarities}} method {{queueItemIDsInBatches}} handles last batch incorrectly. Last batch length is calculated incorrectly. As a result last batch is either truncated or too long with superfluous positions filled with item indexes from previous batch (or zeros if it is also the first batch as in attached test). Attached test fails for very short model (with only 4 items) with NoSuchItemException. Attached patch corrects this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
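For reference, a minimal sketch of the corrected calculation (variable and function names are assumptions, not the patch itself): the last batch should hold only the remainder, so a 4-item model yields a single batch of length 4 rather than a batch padded with stale indexes.
{code}
// Split numItems into batches; the last batch holds the remainder only.
def batchSizes(numItems: Int, batchSize: Int): Seq[Int] = {
  val fullBatches = numItems / batchSize
  val remainder = numItems % batchSize
  Seq.fill(fullBatches)(batchSize) ++ (if (remainder > 0) Seq(remainder) else Nil)
}

batchSizes(4, 100)   // Seq(4) -- the 4-item model from the attached test
{code}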
[jira] [Updated] (MAHOUT-1558) Clean up classify-wiki.sh and add in a binary classification problem
[ https://issues.apache.org/jira/browse/MAHOUT-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1558: --- Resolution: Fixed Assignee: Sebastian Schelter Status: Resolved (was: Patch Available) committed, thank you for your great work Clean up classify-wiki.sh and add in a binary classification problem -- Key: MAHOUT-1558 URL: https://issues.apache.org/jira/browse/MAHOUT-1558 Project: Mahout Issue Type: Improvement Components: Classification, Examples Affects Versions: 1.0 Reporter: Andrew Palumbo Assignee: Sebastian Schelter Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1558.patch Some minor cleanups to classify-wiki.sh. Added in a 2 class problem: United States and United Kingdom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1561) cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true
[ https://issues.apache.org/jira/browse/MAHOUT-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1561: --- Resolution: Fixed Assignee: Sebastian Schelter Status: Resolved (was: Patch Available) committed, thank you very much cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true -- Key: MAHOUT-1561 URL: https://issues.apache.org/jira/browse/MAHOUT-1561 Project: Mahout Issue Type: Bug Components: Clustering, Examples Affects Versions: 0.9 Reporter: Andrew Palumbo Assignee: Sebastian Schelter Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1561.patch cluster-syntheticcontrol.sh is not running locally with MAHOUT_LOCAL set. Patch adds a check for MAHOUT_LOCAL. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Hadoop 2 support in a real release?
Big +1 On 23.05.2014 15:33, Ted Dunning ted.dunn...@gmail.com wrote: What do folks think about spinning out a new version of 0.9 that only changes which version of Hadoop the build uses? There have been quite a few questions lately on this topic. My suggestion would be that we use minor version numbering to maintain this and the normal 0.9 release simultaneously if we decide to do a bug fix release. Any thoughts?
[jira] [Commented] (MAHOUT-1557) Add support for sparse training vectors in MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007235#comment-14007235 ] Sebastian Schelter commented on MAHOUT-1557: Karol, your patch contains some errors, e.g. the variable position is set but never read in RunMultilayerPerceptron. Furthermore, NeuralNetwork converts the input to a DenseVector internally in getOutput(), so you also have to modify that code. Add support for sparse training vectors in MLP -- Key: MAHOUT-1557 URL: https://issues.apache.org/jira/browse/MAHOUT-1557 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Karol Grzegorczyk Priority: Minor Labels: mlp Fix For: 1.0 Attachments: mlp_sparse.diff When the number of input units of MLP is big, it is likely that input vector will be sparse. It should be possible to read input files in a sparse format. -- This message was sent by Atlassian JIRA (v6.2#6252)
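A minimal sketch of the point about getOutput(): if the first layer's affine transform iterates only the non-zero entries of the input, nothing needs to be densified. Plain Scala stand-ins below, not the NeuralNetwork API:
{code}
// Forward one layer from a sparse input: Map(index -> value) stands in
// for a sparse vector; only non-zeros contribute to each unit's sum.
def layerOutput(weights: Array[Array[Double]],   // [units][inputDim]
                bias: Array[Double],
                sparseInput: Map[Int, Double]): Array[Double] =
  weights.zip(bias).map { case (row, b) =>
    val z = b + sparseInput.iterator.map { case (j, v) => row(j) * v }.sum
    1.0 / (1.0 + math.exp(-z))                   // sigmoid squashing function
  }
{code}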
[jira] [Commented] (MAHOUT-1555) Exception thrown when a test example has the label not present in training examples
[ https://issues.apache.org/jira/browse/MAHOUT-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007243#comment-14007243 ] Sebastian Schelter commented on MAHOUT-1555: Hi Karol, Could you update the patch to at least log a warning in such a case? Exception thrown when a test example has the label not present in training examples --- Key: MAHOUT-1555 URL: https://issues.apache.org/jira/browse/MAHOUT-1555 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 1.0 Reporter: Karol Grzegorczyk Priority: Minor Fix For: 1.0 Attachments: test_label_not_present_in_training_examples.diff Currently an IllegalArgumentException is thrown when a test example has the label (belongs to the class) not present in training examples. When the number of labels is big, such a situation is likely and valid. The example of course will be misclassified, but exception should not be thrown. -- This message was sent by Atlassian JIRA (v6.2#6252)
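A hedged sketch of what such a patch could look like (not the attached diff; names are assumptions): resolve the label leniently, warn, and let the caller count the example as misclassified.
{code}
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("ResultAnalyzer")

// None means "unseen label": the caller counts the example as
// misclassified instead of aborting the whole test run.
def labelIndex(label: String, trainedLabels: Map[String, Int]): Option[Int] =
  trainedLabels.get(label) match {
    case some @ Some(_) => some
    case None =>
      log.warn("Test label '{}' not present in training examples; counting as misclassified", label)
      None
  }
{code}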
[jira] [Updated] (MAHOUT-1554) Provide more comprehensive classification statistics
[ https://issues.apache.org/jira/browse/MAHOUT-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1554: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed with a few cosmetic changes, thank you for the contribution Provide more comprehensive classification statistics Key: MAHOUT-1554 URL: https://issues.apache.org/jira/browse/MAHOUT-1554 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Karol Grzegorczyk Priority: Minor Fix For: 1.0 Attachments: statistics.diff Currently only limited classification statistics are provided. To better understand classification results, it would be worth providing at least average precision, recall and F1 score. -- This message was sent by Atlassian JIRA (v6.2#6252)
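For reference, a minimal sketch of such statistics computed from a confusion matrix (assuming m(i)(j) counts instances of true class i predicted as class j; macro averages are simply the means of the per-class values):
{code}
case class Stats(precision: Double, recall: Double, f1: Double)

def perClassStats(m: Array[Array[Long]]): Seq[Stats] =
  m.indices.map { i =>
    val tp = m(i)(i).toDouble
    val fp = m.indices.map(r => m(r)(i)).sum - tp   // column sum minus diagonal
    val fn = m(i).sum - tp                          // row sum minus diagonal
    val p  = if (tp + fp > 0) tp / (tp + fp) else 0.0
    val r  = if (tp + fn > 0) tp / (tp + fn) else 0.0
    val f1 = if (p + r > 0) 2 * p * r / (p + r) else 0.0
    Stats(p, r, f1)
  }
{code}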
[jira] [Updated] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action
[ https://issues.apache.org/jira/browse/MAHOUT-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1553: --- Resolution: Not a Problem Status: Resolved (was: Patch Available) closing this; as Suneel said, it's already fixed Fix for run Mahout stuff as oozie java action - Key: MAHOUT-1553 URL: https://issues.apache.org/jira/browse/MAHOUT-1553 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Environment: mahout-core-0.7-cdh4.4.0.jar Reporter: Sergey Attachments: MAHOUT-1553.patch Related to MAHOUT-1498, the problem is the same. The mapred.job.classpath.files property is not correctly pushed down to Mahout MR stuff because of new Configuration usage at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276) at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135) at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372) at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158) at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005681#comment-14005681 ] Sebastian Schelter commented on MAHOUT-1534: Looks good, I think we should also mention the skipTests option for packaging and add a news entry for that. Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Assignee: Gokhan Capan Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1556) Mahout for Hadoop2 - HDP2.1.1
[ https://issues.apache.org/jira/browse/MAHOUT-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005720#comment-14005720 ] Sebastian Schelter commented on MAHOUT-1556: You have to use the trunk version, 0.9 does not have the support for Hadoop 2 yet. This page has infos on how to build mahout for Hadoop 2: https://mahout.apache.org/developers/buildingmahout.html Let us know if that doesn't work for you. Mahout for Hadoop2 - HDP2.1.1 - Key: MAHOUT-1556 URL: https://issues.apache.org/jira/browse/MAHOUT-1556 Project: Mahout Issue Type: Dependency upgrade Components: Integration Affects Versions: 0.9 Environment: Ubuntu 12.04, Centos6, Java Oracle 1.7 Reporter: Prabhat K Singh Labels: hadoop2 Fix For: 0.9 Hi, I tried build and install of Mahout0.9 for hadoop HDP2.1.1 as per given methods in https://issues.apache.org/jira/browse/MAHOUT-1329, but I get errors as mentioned below. Method: mvn clean package -Dhadoop.profile=200 -Dhadoop2.version=2.2.0 -Dhbase.version=0.98 mvn clean install -Dhadoop2 -Dhadoop.2.version=2.2.0 mvn clean package -Dhadoop2 -Dhadoop.profile=200 -Dhadoop2.version=2.4.0 -Dhbase.version=0.98 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project mahout-integration: Compilation failure: Compilation failure: [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[30,31] cannot find symbol [ERROR] symbol: class HBaseConfiguration [ERROR] location: package org.apache.hadoop.hbase [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[33,31] cannot find symbol [ERROR] symbol: class KeyValue [ERROR] location: package org.apache.hadoop.hbase [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[47,36] cannot find symbol [ERROR] symbol: class Bytes [ERROR] location: package org.apache.hadoop.hbase.util [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[91,42] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[92,42] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[107,26] cannot find symbol [ERROR] symbol: variable HBaseConfiguration [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[138,51] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[206,26] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] 
/home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[207,25] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[233,15] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[265,26] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[266,25] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR
[jira] [Updated] (MAHOUT-1554) Provide more comprehensive classification statistics
[ https://issues.apache.org/jira/browse/MAHOUT-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1554: --- Fix Version/s: 1.0 Provide more comprehensive classification statistics Key: MAHOUT-1554 URL: https://issues.apache.org/jira/browse/MAHOUT-1554 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Karol Grzegorczyk Priority: Minor Fix For: 1.0 Attachments: statistics.diff Currently only limited classification statistics are provided. To better understand classification results, it would be worth providing at least average precision, recall and F1 score. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: consensus statement?
Big +1, very nicely captures what I also think --sebastian On 21.05.2014 14:27, Gokhan Capan gkhn...@gmail.com wrote: I want to express my opinions for the vision, too. I tried to capture those words from various discussions in the dev-list, and hope that most of them support the common sense of excitement the new Mahout arouses. To me, the fundamental benefit of the shift that Mahout is undergoing is a better separation of the distributed execution engine, distributed data structures, matrix computations, and algorithms layers, which will allow the users/devs of Mahout with different roles to focus on the relevant parts of the framework: 1. A machine learning scientist, independent from the underlying distributed execution engine, can utilize the matrix language and the decompositions to implement new algorithms (which implies that the current distributed Mahout algorithms are to be rewritten in the matrix language) 2. A math-scala module contributor, for the benefit of higher level algorithms, can add new, or improve existing, functions (the set of decompositions is an example) with optimization plans (such as if two matrices are partitioned in the same way, ...), where the concrete implementations of those optimizations are delegated to the distributed execution engine layer 3. A distributed execution engine author can add machine learning capabilities to her platform with i) a concrete Matrix and Matrix I/O implementation, ii) partitioning, checkpointing, broadcasting behaviors, iii) BLAS 4. A Mahout user with access to a cluster operated by a Mahout-supporting distributed execution engine can run machine learning algorithms implemented on top of the matrix language Best Gokhan On Tue, May 20, 2014 at 8:30 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: inline On Tue, May 20, 2014 at 12:42 AM, Sebastian Schelter s...@apache.org wrote: Let's take the text from our homepage as a starting point. What should we add/remove/modify? The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. We are building our future implementations on top of a Scala DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. More platforms are to be added in the future. Furthermore, there is an experimental contribution underway which aims to integrate the h2o platform into Mahout.
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004958#comment-14004958 ] Sebastian Schelter commented on MAHOUT-1534: I somehow cannot see the staged version unfortunately. Just publish it and I'll have a look. Maybe we should even add an extra page and navigation point for that site, what do you think? Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: consensus statement?
On 05/18/2014 09:28 PM, Ted Dunning wrote: On Sun, May 18, 2014 at 11:33 AM, Sebastian Schelter s...@apache.org wrote: I suggest we start with a specific draft that someone prepares (maybe Ted as he started the thread) This is a good strategy, and I am happy to start the discussion, but I wonder if it might help build consensus if somebody else started the ball rolling. Let's take the text from our homepage as a starting point. What should we add/remove/modify? The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. Furthermore, there is an experimental contribution underway which aims to integrate the h2o platform into Mahout.
Re: [jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
Can you create it in an svn-compatible way and check that it works with the current trunk? Thx, Sebastian On 19.05.2014 12:01, larryhu (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001554#comment-14001554] larryhu commented on MAHOUT-1543: - I'm so sorry for your trouble, this patch was created by git; I cloned it from github, tag: mahout-0.7. JSON output format for classifying with random forests -- Key: MAHOUT-1543 URL: https://issues.apache.org/jira/browse/MAHOUT-1543 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: larryhu Labels: patch Fix For: 0.7 Attachments: MAHOUT-1543.patch This patch adds JSON output format to build random forests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell
[ https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002271#comment-14002271 ] Sebastian Schelter commented on MAHOUT-1542: No, go ahead, that's a great idea. Tutorial for playing with Mahout's Spark shell -- Key: MAHOUT-1542 URL: https://issues.apache.org/jira/browse/MAHOUT-1542 Project: Mahout Issue Type: Improvement Components: Documentation, Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 I have created a tutorial for setting up the spark shell and implementing a simple linear regression algorithm. I'd love to make this part of the website, could someone give it a review? https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md PS: If you wanna try out the code, you have to add the patch from MAHOUT-1532 to your sources. -- This message was sent by Atlassian JIRA (v6.2#6252)
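The tutorial's core fits in a few DSL lines. A sketch under assumptions (drmX holds the feature matrix and drmY the single-column target, as in the linked tutorial; solve() is the in-core solver from MAHOUT-1532):
{code}
val drmXtX = drmX.t %*% drmX                       // X'X, computed distributedly
val drmXty = drmX.t %*% drmY                       // X'y
val beta   = solve(drmXtX.collect, drmXty.collect) // solve (X'X) beta = X'y in-core
{code}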
[jira] [Commented] (MAHOUT-1439) Update talks on Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001013#comment-14001013 ] Sebastian Schelter commented on MAHOUT-1439: @tdunning [~dlyubimov] could you add your talks from last month? Update talks on Mahout -- Key: MAHOUT-1439 URL: https://issues.apache.org/jira/browse/MAHOUT-1439 Project: Mahout Issue Type: Bug Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 The talks listed on our homepage seem to end somewhere in 2012. I know that there have been tons of other talks on Mahout since then, I've added mine already. It would be great if everybody who knows of additional talks would paste them here, so I can add them to the website. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1521) lucene2seq - Error trying to load data from stored field (when non-indexed)
[ https://issues.apache.org/jira/browse/MAHOUT-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001026#comment-14001026 ] Sebastian Schelter commented on MAHOUT-1521: [~frankscholten] what's the status here? lucene2seq - Error trying to load data from stored field (when non-indexed) Key: MAHOUT-1521 URL: https://issues.apache.org/jira/browse/MAHOUT-1521 Project: Mahout Issue Type: Bug Affects Versions: 0.9 Reporter: Terry Blankers Assignee: Frank Scholten Labels: lucene2seq Fix For: 1.0 When using lucene2seq to load data from a field that is stored but not indexed I receive the following error: {noformat}IllegalArgumentException: Field 'body' does not exist in the index{noformat} Field is described in schema.xml as: {noformat}<field name="body" type="string" stored="true" indexed="false"/>{noformat} BTW, field is copied to 'content' field for searching, schema.xml snippet: {noformat}<copyField source="body" dest="content"/>{noformat} Copy field is described in schema.xml as: {noformat}<field name="content" type="text" stored="false" indexed="true" multiValued="true"/>{noformat} If I try to load data from the copy field, lucene2seq runs with no errors but I receive empty data for each key/doc: {noformat}Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text Key: 96C4C76CF9D7449C724CA77CB8F650EAFD33E31C: Value: Key: D6842B81B8D09733B50BEDB4767C2A5C49E43B20: Value:{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1153) Implement streaming random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1153. Resolution: Won't Fix no activity for more than a month Implement streaming random forests -- Key: MAHOUT-1153 URL: https://issues.apache.org/jira/browse/MAHOUT-1153 Project: Mahout Issue Type: New Feature Components: Classification Reporter: Andy Twigg Labels: features Fix For: 1.0 The current random forest implementations are in-core and not scalable. This issue is to add an out-of-core, scalable, streaming implementation. Initially it could be based on [1], and using mappers in a master-worker style. [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1544) make Mahout DSL shell depend dynamically on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001029#comment-14001029 ] Sebastian Schelter commented on MAHOUT-1544: [~avati] What's the status here? make Mahout DSL shell depend dynamically on Spark - Key: MAHOUT-1544 URL: https://issues.apache.org/jira/browse/MAHOUT-1544 Project: Mahout Issue Type: Improvement Reporter: Anand Avati Fix For: 1.0 Attachments: 0001-spark-shell-rename-to-shell.patch, 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch Today, Mahout's Scala shell depends on Spark. Create a cleaner separation between the shell and Spark. For example, the in-core scalabindings and operators do not need Spark, so make Spark a runtime add-on to the shell. Similarly, in the future, new distributed backend engines can transparently (dynamically) be made available through the DSL shell. The new shell works, looks and feels exactly like the shell before, but has a cleaner modular architecture. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie
[ https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001014#comment-14001014 ] Sebastian Schelter commented on MAHOUT-1498: [~serega_sheypak] what's the status here? DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie - Key: MAHOUT-1498 URL: https://issues.apache.org/jira/browse/MAHOUT-1498 Project: Mahout Issue Type: Bug Affects Versions: 0.7 Environment: mahout-core-0.7-cdh4.4.0.jar Reporter: Sergey Fix For: 1.0 Hi, I get exception {code} Invocation of Main class completed Failing Oozie Launcher, Main class [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw exception, Job failed! java.lang.IllegalStateException: Job failed! at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329) at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199) at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271) {code} The root cause is: {code} Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247 {code} Looks like it happens because of the DictionaryVectorizer.makePartialVectors method. It has this code: {code} DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf); {code} which overwrites jars pushed with the job by oozie: {code} public static void setCacheFiles(URI[] files, Configuration conf) { String sfiles = StringUtils.uriToString(files); conf.set(mapred.cache.files, sfiles); } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
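One hedged fix (not necessarily what was committed upstream): append the dictionary with addCacheFile instead of replacing the whole list with setCacheFiles, so jars already registered by oozie survive.
{code}
import org.apache.hadoop.filecache.DistributedCache

// Appends to mapred.cache.files instead of overwriting it.
DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf)
{code}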
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001020#comment-14001020 ] Sebastian Schelter commented on MAHOUT-1534: Anybody willing to help here? This is important, as a lot of users keep asking about using Mahout with Hadoop2 Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1480) Clean up website on 20 newsgroups
[ https://issues.apache.org/jira/browse/MAHOUT-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001027#comment-14001027 ] Sebastian Schelter commented on MAHOUT-1480: [~Andrew_Palumbo] did you have time yet to make the confusion matrix fit? Clean up website on 20 newsgroups - Key: MAHOUT-1480 URL: https://issues.apache.org/jira/browse/MAHOUT-1480 Project: Mahout Issue Type: Improvement Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1480_edit1.patch The website on the twenty newsgroups example needs clean up. We need to go through the text, remove dead links and check whether the information is still consistent with the current code. https://mahout.apache.org/users/clustering/twenty-newsgroups.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1532) Add solve() function to the Scala DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1532. Resolution: Fixed Add solve() function to the Scala DSL -- Key: MAHOUT-1532 URL: https://issues.apache.org/jira/browse/MAHOUT-1532 Project: Mahout Issue Type: Bug Components: Math Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1532.patch, MAHOUT-1532.patch We should add a solve() function to the Scala DSL which helps with solving Ax = b for in-core matrices and vectors. -- This message was sent by Atlassian JIRA (v6.2#6252)
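A small usage sketch (import path assumed to follow the math-scala bindings):
{code}
import org.apache.mahout.math.scalabindings._

val a = dense((2.0, 1.0), (1.0, 3.0))  // A
val b = dvec(3.0, 5.0)                 // b
val x = solve(a, b)                    // x such that Ax = b
{code}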
[jira] [Resolved] (MAHOUT-1514) Contact the original Random Forest author
[ https://issues.apache.org/jira/browse/MAHOUT-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1514. Resolution: Won't Fix no answer in four weeks Contact the original Random Forest author - Key: MAHOUT-1514 URL: https://issues.apache.org/jira/browse/MAHOUT-1514 Project: Mahout Issue Type: Task Reporter: Sebastian Schelter Priority: Critical Fix For: 1.0 We should contact the original Random Forest author to ask about maintenance of the implementation. Otherwise, this becomes a candidate for removal. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1522) Handle logging levels via log4j.xml
[ https://issues.apache.org/jira/browse/MAHOUT-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001010#comment-14001010 ] Sebastian Schelter commented on MAHOUT-1522: [~andrew.musselman] what's the status here? Handle logging levels via log4j.xml --- Key: MAHOUT-1522 URL: https://issues.apache.org/jira/browse/MAHOUT-1522 Project: Mahout Issue Type: Bug Affects Versions: 0.9 Reporter: Andrew Musselman Assignee: Andrew Musselman Priority: Critical Fix For: 1.0 We don't have a properties file to tell log4j what to do, so we inherit other frameworks' settings. Suggestion is to add a log4j.xml file in a canonical place and set up logging levels, maybe separating out components for ease of setting levels during debugging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1252) Add support for Finite State Transducers (FST) as a DictionaryType.
[ https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001022#comment-14001022 ] Sebastian Schelter commented on MAHOUT-1252: [~drew.farris] what's the status here? Add support for Finite State Transducers (FST) as a DictionaryType. --- Key: MAHOUT-1252 URL: https://issues.apache.org/jira/browse/MAHOUT-1252 Project: Mahout Issue Type: Improvement Components: Integration Affects Versions: 0.7 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Add support for Finite State Transducers (FST) as a DictionaryType, this should result in an order of magnitude speedup of seq2sparse. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1545) Creating holdout sets with seq2sparse and split
[ https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1545. Resolution: Later Fix Version/s: 1.0 Closing this as it is a reminder for things to do in the future. Creating holdout sets with seq2sparse and split --- Key: MAHOUT-1545 URL: https://issues.apache.org/jira/browse/MAHOUT-1545 Project: Mahout Issue Type: Bug Components: Classification, CLI, Examples Affects Versions: 0.9 Reporter: Andrew Palumbo Fix For: 1.0 The current method for vectorizing data using seq2sparse and then split allows for a large amount of information to spill over from the training sets to the test sets, especially in the case of TF-IDF transformations. The IDF transform provides a lot of information on the holdout set to the training set if calculated prior to splitting them up. I'm not sure, given the current seq2sparse implementation's status as Legacy and the relatively minor advantages that it might give, whether or not it's worth adding something like a split option to SparseVectorsFromSequenceFiles.java. But I know that I saw a new implementation being discussed and think that it would be worth it to have an option like this built in. I think that this issue may have been raised before, but I wanted to bring it up again in light of the current move away from MapReduce and the new implementations of Mahout tools that will be coming along. -- This message was sent by Atlassian JIRA (v6.2#6252)
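The leakage is visible in the IDF term itself. With document frequencies computed over the full corpus of $N$ documents before splitting,

$$\mathrm{tfidf}(t,d) \;=\; tf(t,d)\cdot\log\frac{N}{df(t)},$$

every training vector's weights already encode $df(t)$ counted over the held-out documents as well, so the holdout set is not truly unseen.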
[jira] [Commented] (MAHOUT-1470) Topic dump
[ https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001035#comment-14001035 ] Sebastian Schelter commented on MAHOUT-1470: [~andrew.musselman] what's the status here? Topic dump -- Key: MAHOUT-1470 URL: https://issues.apache.org/jira/browse/MAHOUT-1470 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 1.0 Reporter: Andrew Musselman Assignee: Andrew Musselman Priority: Minor Fix For: 1.0 Per http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E The script needs to be corrected to not call vectordump for LDA as vectordump utility (or even clusterdump) are presently not capable of displaying topics and relevant documents. I recall this issue was previously reported by Peyman Faratin post 0.9 release. Mahout's missing a clusterdump utility that reads in LDA topics, Document - DocumentId mapping and displays a report of the topics and the documents that belong to a topic. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1453) ImplicitFeedbackAlternatingLeastSquaresSolver add support for user supplied confidence functions
[ https://issues.apache.org/jira/browse/MAHOUT-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1453. Resolution: Won't Fix no activity in four weeks ImplicitFeedbackAlternatingLeastSquaresSolver add support for user supplied confidence functions Key: MAHOUT-1453 URL: https://issues.apache.org/jira/browse/MAHOUT-1453 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Adam Ilardi Assignee: Sebastian Schelter Priority: Minor Labels: newbie, patch, performance Fix For: 1.0 double confidence(double rating) { return 1 + alpha * rating; } The original paper mentions other functions that could be used as well. At the moment it's not easy for a user to change this without recompiling the source. -- This message was sent by Atlassian JIRA (v6.2#6252)
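A hedged sketch of what user-supplied confidence functions could look like (trait and class names are assumptions, not the solver's API): the linear form quoted above plus the logarithmic alternative mentioned in the Hu/Koren/Volinsky paper.
{code}
trait ConfidenceFunction {
  def confidence(rating: Double): Double
}

// c = 1 + alpha * r, the form currently hard-coded in the solver
class LinearConfidence(alpha: Double) extends ConfidenceFunction {
  def confidence(rating: Double): Double = 1 + alpha * rating
}

// c = 1 + alpha * log(1 + r / eps), the alternative from the paper
class LogConfidence(alpha: Double, eps: Double) extends ConfidenceFunction {
  def confidence(rating: Double): Double = 1 + alpha * math.log(1 + rating / eps)
}
{code}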
[jira] [Commented] (MAHOUT-1427) Convert old .mapred API to new .mapreduce
[ https://issues.apache.org/jira/browse/MAHOUT-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001034#comment-14001034 ] Sebastian Schelter commented on MAHOUT-1427: [~smarthi] what's the status here? Convert old .mapred API to new .mapreduce - Key: MAHOUT-1427 URL: https://issues.apache.org/jira/browse/MAHOUT-1427 Project: Mahout Issue Type: Bug Components: Collaborative Filtering, Integration Affects Versions: 0.9 Reporter: Suneel Marthi Assignee: Suneel Marthi Priority: Minor Fix For: 1.0 Attachments: Mahout-1427.patch Migrate code still using the old .mapred to .mapreduce API -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1495) Create a website describing the distributed item-based recommender
[ https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001017#comment-14001017 ] Sebastian Schelter commented on MAHOUT-1495: [~apsaltis] what's the status here? Create a website describing the distributed item-based recommender -- Key: MAHOUT-1495 URL: https://issues.apache.org/jira/browse/MAHOUT-1495 Project: Mahout Issue Type: Bug Components: Collaborative Filtering, Documentation Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1549) Extracting tfidf-vectors by key
[ https://issues.apache.org/jira/browse/MAHOUT-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001030#comment-14001030 ] Sebastian Schelter commented on MAHOUT-1549: [~Pilgrim] has your question been answered yet? Extracting tfidf-vectors by key --- Key: MAHOUT-1549 URL: https://issues.apache.org/jira/browse/MAHOUT-1549 Project: Mahout Issue Type: Question Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: Richard Scharrer Labels: documentation, features, newbie Hi, I have about 20 tfidf-vectors and I need to extract 500 of them of which I have the keys. Is there some kind of magical option which allows me something like taking the output of mahout seqdumper and transforming it back into a sequencefile that I can use for trainnb/testnb? The sequencefiles of tfidf use the Text class for the keys and the VectorWritable class for the values. I tried https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java with different settings but the output always gives me the Text class for both key and value, which can't be used in trainnb and testnb. I posted this question on: http://stackoverflow.com/questions/23502362/extracting-tfidf-vectors-by-key-without-destroying-the-fileformat I ask this question in here because I've seen similar questions on stackoverflow that were asked last year and still didn't get an answer. I really need this information, so in case you know anything, please tell me. Regards, Richard -- This message was sent by Atlassian JIRA (v6.2#6252)
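In the absence of such an option, a small hedged sketch of what the asker seems to need (Hadoop 1-style SequenceFile API; function name is an assumption): copy only the entries whose keys are in a given set, keeping the Text/VectorWritable format intact so trainnb/testnb accept the result.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.mahout.math.VectorWritable

def extractByKey(in: Path, out: Path, keys: Set[String], conf: Configuration): Unit = {
  val fs = FileSystem.get(conf)
  val reader = new SequenceFile.Reader(fs, in, conf)
  val writer = SequenceFile.createWriter(fs, conf, out, classOf[Text], classOf[VectorWritable])
  val key = new Text()
  val value = new VectorWritable()
  try {
    while (reader.next(key, value)) {
      if (keys.contains(key.toString)) writer.append(key, value)  // format preserved
    }
  } finally {
    reader.close()
    writer.close()
  }
}
{code}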
[jira] [Commented] (MAHOUT-1425) SGD classifier example with bank marketing dataset
[ https://issues.apache.org/jira/browse/MAHOUT-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001024#comment-14001024 ] Sebastian Schelter commented on MAHOUT-1425: [~frankscholten] what's the status here? SGD classifier example with bank marketing dataset -- Key: MAHOUT-1425 URL: https://issues.apache.org/jira/browse/MAHOUT-1425 Project: Mahout Issue Type: Improvement Components: Examples Affects Versions: 1.0 Reporter: Frank Scholten Assignee: Frank Scholten Fix For: 1.0 Attachments: MAHOUT-1425.patch As discussed on the mailing list a few weeks back I started working on an SGD classifier example with the bank marketing dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing See https://github.com/frankscholten/mahout-sgd-bank-marketing Ted has also made further changes that were very useful so I suggest to add this example to Mahout Ted: can you tell a bit more about the log transforms? Some of them are just Math.log while others are more complex expressions. What else is needed to contribute it to Mahout? Anything that could improve the example? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1487) More understandable error message when attempt to use wrong FileSystem
[ https://issues.apache.org/jira/browse/MAHOUT-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1487. Resolution: Won't Fix no activity in four weeks More understandable error message when attempt to use wrong FileSystem -- Key: MAHOUT-1487 URL: https://issues.apache.org/jira/browse/MAHOUT-1487 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.9 Environment: Amazon S3, Amazon EMR, Local file system Reporter: Konstantin Priority: Trivial Fix For: 1.0 RandomSeedGenerator has the following code: FileSystem fs = FileSystem.get(output.toUri(), conf); ... fs.getFileStatus(input).isDir() If the output path is specified correctly but the input path is not, Mahout throws a hard-to-understand error message. Exception in thread main java.lang.IllegalArgumentException: This file system object (hdfs://172.31.41.65:9000) does not support access to the request path 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path This happens because the FileSystem object was created from the output path, while getFileStatus was given the input path, which makes the resulting message confusing. To prevent this misunderstanding, I propose to improve the error message by adding the following details: 1. Specify which filesystem type is used (DistributedFileSystem, NativeS3FileSystem, etc., using fs.getClass().getName()) 2. Then specify which path cannot be processed correctly. This can be done by a validation utility which can be applied in many places in Mahout. When we use Mahout we need to specify many paths, and we can also use many types of file systems: local for debugging, distributed on Hadoop, and s3 on Amazon. In this case better error messages can save a lot of time. -- This message was sent by Atlassian JIRA (v6.2#6252)
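A hedged sketch of the proposed validation utility (method name is an assumption): open the file system from the path's own URI and put the concrete class and the offending path into the message.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def validatedFileSystem(path: Path, conf: Configuration): FileSystem = {
  val fs = FileSystem.get(path.toUri, conf)  // derived from this path, not another one
  if (!fs.exists(path)) {
    throw new IllegalArgumentException(
      s"${fs.getClass.getName} cannot access '$path'; " +  // concrete filesystem type
        "check that the path's scheme matches the configured file system")
  }
  fs
}
{code}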
Re: Build failed in Jenkins: Mahout-Quality #2608
Could someone check why the build is still failing? On 05/13/2014 01:14 AM, Apache Jenkins Server wrote: See https://builds.apache.org/job/Mahout-Quality/2608/ -- [...truncated 8432 lines...] } Q= { 0 = {0:0.40273861426601687,1:-0.9153150324187648} 1 = {0:0.9153150324227656,1:0.40273861426427493} } - C = A %*% B mapBlock {} - C = A %*% B incompatible B keys 36495 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are not identically partitioned, performing inner join. - C = At %*% B , join 37989 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are not identically partitioned, performing inner join. - C = At %*% B , join, String-keyed 39452 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are identically distributed, performing row-wise zip. - C = At %*% B , zippable, String-keyed { 2 = {0:62.0,1:86.0,3:132.0,2:115.0} 1 = {0:50.0,1:69.0,3:105.0,2:92.0} 3 = {0:74.0,1:103.0,3:159.0,2:138.0} 0 = {0:26.0,1:35.0,3:51.0,2:46.0} } - C = A %*% inCoreB { 0 = {0:26.0,1:35.0,2:46.0,3:51.0} 1 = {0:50.0,1:69.0,2:92.0,3:105.0} 2 = {0:62.0,1:86.0,2:115.0,3:132.0} 3 = {0:74.0,1:103.0,2:138.0,3:159.0} } - C = inCoreA %*%: B 43683 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying slim A'A. - C = A.t %*% A 45370 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying non-slim non-graph A'A. 70680 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings - test done. - C = A.t %*% A fat non-graph 71986 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying slim A'A. - C = A.t %*% A non-int key - C = A + B - C = A + B side test 1 - C = A + B side test 2 - C = A + B side test 3 ArrayBuffer(0, 1, 2, 3, 4) ArrayBuffer(0, 1, 2, 3, 4) - general side - Ax - A'x - colSums, colMeans Run completed in 1 minute, 31 seconds. Total number of tests run: 38 Suites: completed 9, aborted 0 Tests: succeeded 38, failed 0, canceled 0, ignored 0, pending 0 All tests passed. [INFO] [INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact (remove-old-mahout-artifacts) @ mahout-spark --- [INFO] /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark removed.
[INFO] [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-spark --- [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar [INFO] [INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-spark --- [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar [INFO] [INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ mahout-spark --- [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar [INFO] [INFO] --- maven-install-plugin:2.5.1:install (default-install) @ mahout-spark --- [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.jar [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/pom.xml to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.pom [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-tests.jar [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-sources.jar [INFO] [INFO] maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ mahout-spark [INFO] [INFO] --- build-helper-maven-plugin:1.8:add-source (add-source) @ mahout-spark --- [INFO] Source directory: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/generated-sources/mahout added. [INFO] [INFO] --- build-helper-maven-plugin:1.8:add-test-source (add-test-source) @ mahout-spark --- [INFO] Test Source directory: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/generated-test-sources/mahout added. [INFO] [INFO] maven-javadoc-plugin:2.9.1:javadoc (default-cli) @
[jira] [Resolved] (MAHOUT-1484) Spectral algorithm for HMMs
[ https://issues.apache.org/jira/browse/MAHOUT-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1484. Resolution: Won't Fix no activity in four weeks Spectral algorithm for HMMs --- Key: MAHOUT-1484 URL: https://issues.apache.org/jira/browse/MAHOUT-1484 Project: Mahout Issue Type: New Feature Reporter: Emaad Manzoor Priority: Minor Following up with this [comment|https://issues.apache.org/jira/browse/MAHOUT-396?focusedCommentId=12898284page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12898284] by [~isabel] on the sequential HMM [proposal|https://issues.apache.org/jira/browse/MAHOUT-396], is there any interest in a spectral algorithm as described in: A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang)? I would like to take up this effort. This will enable learning the parameters of and making predictions with a HMM in a single step. At its core, the algorithm involves computing estimates from triples of observations, performing an SVD and then some matrix multiplications. This could also form the base for an implementation of Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola). -- This message was sent by Atlassian JIRA (v6.2#6252)
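Sketched from the Hsu-Kakade-Zhang paper (their notation, reproduced here only for orientation): estimate the unigram vector $\hat P_1$, the bigram co-occurrence matrix $\hat P_{2,1}$ and the trigram slices $\hat P_{3,x,1}$ from observation triples, take $\hat U$ as the top-$m$ left singular vectors of $\hat P_{2,1}$, and form the observable operators

$$\hat b_1 = \hat U^\top \hat P_1,\qquad \hat b_\infty = \bigl(\hat P_{2,1}^\top \hat U\bigr)^{+}\hat P_1,\qquad \hat B_x = \bigl(\hat U^\top \hat P_{3,x,1}\bigr)\bigl(\hat U^\top \hat P_{2,1}\bigr)^{+},$$

which suffice to compute sequence probabilities without explicitly recovering the HMM's transition and emission matrices.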
[jira] [Commented] (MAHOUT-1536) Update Creating vectors from text page
[ https://issues.apache.org/jira/browse/MAHOUT-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001036#comment-14001036 ] Sebastian Schelter commented on MAHOUT-1536: [~Andrew_Palumbo] did you have time to work on this yet? Update Creating vectors from text page Key: MAHOUT-1536 URL: https://issues.apache.org/jira/browse/MAHOUT-1536 Project: Mahout Issue Type: Bug Components: Documentation Affects Versions: 0.9 Reporter: Andrew Palumbo Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1536_edit1.patch At least the seq2sparse section of the Creating vectors from text page is out of date. https://mahout.apache.org/users/basics/creating-vectors-from-text.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1528) Source tag and source release tarball for 0.9 don't exactly match
[ https://issues.apache.org/jira/browse/MAHOUT-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1528. Resolution: Later Thank you for raising this issue, we will keep that in mind for the next release. Especially the CHANGELOG file should be part of the distribution! Source tag and source release tarball for 0.9 don't exactly match - Key: MAHOUT-1528 URL: https://issues.apache.org/jira/browse/MAHOUT-1528 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.9 Reporter: Mark Grover If you download the source tarball for the Apache Mahout 0.9 release, you'd notice that it doesn't contain CHANGELOG or .gitignore file. However, if you look at the tag for the release in the github repo (https://github.com/apache/mahout/tree/mahout-0.9), you'd notice both the files there. I think, both as a best practice and to make life of downstream integrators less miserable, it would be fantastic if we could have the release tag in the source match one to one with the source code in the released source tarball. A test to do this in particular, would be awesome! Thanks in advance! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001021#comment-14001021 ] Sebastian Schelter commented on MAHOUT-1388: [~yxjiang] what's the status here? Add command line support and logging for MLP Key: MAHOUT-1388 URL: https://issues.apache.org/jira/browse/MAHOUT-1388 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 1.0 Reporter: Yexi Jiang Assignee: Suneel Marthi Labels: mlp, sgd Fix For: 1.0 Attachments: Mahout-1388.patch, Mahout-1388.patch The user should have the ability to run the Perceptron from the command line. There are two programs to execute MLP, the training and labeling. The first one takes the data as input and outputs the model, the second one takes the model and unlabeled data as input and outputs the results. The parameters for training are as follows: --input -i (input data) --skipHeader -sk // whether to skip the first row, this parameter is optional --labels -labels // the labels of the instances, separated by whitespace. Take the iris dataset for example, the labels are 'setosa versicolor virginica'. --model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning), in labeling mode, this is the location to store the result --update -u // whether to incrementally update the model; if this parameter is not given, train the model from scratch --output -o // this is only useful in labeling mode --layersize -ls (no. of units per hidden layer) // use whitespace separated numbers to indicate the number of neurons in each layer (including input layer and output layer), e.g. '5 3 2'. --squashingFunction -sf // currently only supports Sigmoid --momentum -m --learningrate -l --regularizationweight -r --costfunction -cf // the type of cost function. For example, to train a 3-layer (including input, hidden, and output) MLP with 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the command would be: mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model. The parameters for labeling are as follows: - --input -i // input file path --columnRange -cr // the range of columns used for features, starting from 0 and separated by whitespace, e.g. 0 5 --format -f // the format of the input file, currently only supports csv --model -mo // the file path of the model --output -o // the output path for the results - If a user needs to use an existing model, they would use the following command: mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result Moreover, we should be providing default values if the user does not specify any. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1385) Caching Encoders don't cache
[ https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001032#comment-14001032 ] Sebastian Schelter commented on MAHOUT-1385: [~awmanoj] what's the status here? Caching Encoders don't cache Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, as its signature takes a String for the original form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose Java's internal caching of a String's hash code, which is used as a key in the cache map, triggering another hash code calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
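A self-contained toy (deliberately not the real Mahout classes) that shows the mechanism behind this bug: a subclass method with the same name but a String parameter is an overload, not an override, so the superclass keeps calling its own byte[] variant and the caching path is never reached.
{code}
class Encoder {
  protected int hashForProbe(byte[] originalForm, int probe) {
    return 1; // the uncached path the superclass always uses
  }

  int probe(byte[] originalForm) {
    return hashForProbe(originalForm, 0); // statically binds to the byte[] variant
  }
}

class CachingEncoder extends Encoder {
  // Same name, different parameter type: an overload the superclass never calls.
  protected int hashForProbe(String originalForm, int probe) {
    return 2; // the "caching" path, effectively dead code
  }
}

public class OverloadDemo {
  public static void main(String[] args) {
    System.out.println(new CachingEncoder().probe("x".getBytes())); // prints 1, not 2
  }
}
{code}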
[jira] [Commented] (MAHOUT-1485) Clean up Recommender Overview page
[ https://issues.apache.org/jira/browse/MAHOUT-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001089#comment-14001089 ] Sebastian Schelter commented on MAHOUT-1485: [~yash...@gmail.com] Yash, the documentation looks great. Could you create a Markdown version of it so that we can add it to the Mahout website? Clean up Recommender Overview page -- Key: MAHOUT-1485 URL: https://issues.apache.org/jira/browse/MAHOUT-1485 Project: Mahout Issue Type: Improvement Components: Documentation Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 Clean up the recommender overview page, remove outdated content, and make sure the examples work. https://mahout.apache.org/users/recommender/recommender-documentation.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell
[ https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1542. Resolution: Fixed Added to the website. I also added a new top navigation point called Spark. Shout if you don't like that naming. Tutorial for playing with Mahout's Spark shell -- Key: MAHOUT-1542 URL: https://issues.apache.org/jira/browse/MAHOUT-1542 Project: Mahout Issue Type: Improvement Components: Documentation, Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 I have created a tutorial for setting up the Spark shell and implementing a simple linear regression algorithm. I'd love to make this part of the website; could someone give it a review? https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 to your sources. -- This message was sent by Atlassian JIRA (v6.2#6252)
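For readers who want the gist without clicking through: assuming the tutorial takes the usual normal-equations route to ordinary least squares (fitting $y \approx X\beta$), the estimator it computes is

\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y

Only $X^{\top} X$ and $X^{\top} y$ need to be computed over the distributed matrix; the resulting small system can then be solved locally.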
[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001088#comment-14001088 ] Sebastian Schelter commented on MAHOUT-1543: [~larryhu] I'm having trouble applying your patch to the sources checked out from SVN. Could you check that the patch is SVN-compatible? Sorry for the trouble. JSON output format for classifying with random forests -- Key: MAHOUT-1543 URL: https://issues.apache.org/jira/browse/MAHOUT-1543 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: larryhu Labels: patch Fix For: 0.7 Attachments: MAHOUT-1543.patch This patch adds a JSON output format for building random forests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1527) Fix wikipedia classifier example
[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1527: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed this with minor changes (fixing a few typos and adding a check that MAHOUT_HOME is set). Thank you, Andrew, keep up the outstanding work. Fix wikipedia classifier example Key: MAHOUT-1527 URL: https://issues.apache.org/jira/browse/MAHOUT-1527 Project: Mahout Issue Type: Task Components: Classification, Documentation, Examples Affects Versions: 0.7, 0.8, 0.9 Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1527.patch The examples package has a classification showcase for predicting the labels of Wikipedia pages. Unfortunately, the example is totally broken: it relies on the old NB implementation, which has been removed; it suggests using the whole of Wikipedia as input, which will not work well on a single machine; and the documentation uses commands that have long been removed from bin/mahout. The example needs to be updated to use the current naive Bayes implementation, and documentation on the website needs to be written. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1385) Caching Encoders don't cache
[ https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1385: --- Resolution: Fixed Status: Resolved (was: Patch Available) I agree, Johannes is right that ideally we would want to leverage the hash-code caching of Strings. But the current code is a non-working implementation, which this patch fixes, so I'm committing this for now. Caching Encoders don't cache Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, as its signature takes a String for the original form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose Java's internal caching of a String's hash code, which is used as a key in the cache map, triggering another hash code calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie
[ https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1498: --- Resolution: Fixed Assignee: Sebastian Schelter Status: Resolved (was: Patch Available) Committed with a few cosmetic changes, thank you for the contribution! DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie - Key: MAHOUT-1498 URL: https://issues.apache.org/jira/browse/MAHOUT-1498 Project: Mahout Issue Type: Bug Affects Versions: 0.7 Environment: mahout-core-0.7-cdh4.4.0.jar Reporter: Sergey Assignee: Sebastian Schelter Labels: patch Fix For: 1.0 Attachments: MAHOUT-1498.patch Hi, I get this exception:
{code}
Invocation of Main class completed Failing Oozie Launcher, Main class [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw exception, Job failed!
java.lang.IllegalStateException: Job failed!
at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
{code}
The root cause is:
{code}
Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247
{code}
Looks like it happens because of the DictionaryVectorizer.makePartialVectors method. It has this code:
{code}
DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
{code}
which overwrites the jars pushed with the job by Oozie:
{code}
public static void setCacheFiles(URI[] files, Configuration conf) {
  String sfiles = StringUtils.uriToString(files);
  conf.set("mapred.cache.files", sfiles);
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
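A sketch of one way to avoid the overwrite (an illustration of the approach, not necessarily the exact change that was committed): Hadoop's DistributedCache.addCacheFile appends a single URI to mapred.cache.files instead of replacing the whole list, so jars registered earlier (for example by Oozie) survive. The class and method names here are hypothetical wrappers around that call.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class DictionaryCacheFix {

  // Register the dictionary file without clobbering previously registered
  // cache files (e.g. jars that Oozie pushed with the job).
  static void registerDictionary(Path dictionaryFilePath, Configuration conf) {
    // was: DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
    DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf);
  }
}
{code}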
Re: consensus statement?
I think it is important to formulate such a statement and send it out to the outside world. But we should focus the discussion. I suggest we start with a specific draft that someone prepares (maybe Ted, as he started the thread) and then we can discuss and reformulate the individual sentences. I also think the formulation "the committers work on Spark" is not precise enough (and neglects a lot of our goals), but I also don't think it was meant to be part of an official statement in that exact wording. --sebastian

On 05/18/2014 07:44 PM, Pat Ferrel wrote: Not sure why you address this to me. I agree with most of your statements. I think Ted’s intent was to find a simple consensus statement that addresses where the project is going in a general way. I look at it as something to communicate to the outside world. Why? We are rejecting new MapReduce code. This was announced as a project-wide rule and has already been used to reject one contribution I know of. OK, what replaces Hadoop MapReduce? What, therefore, should contributors look to as a model if not Hadoop MapReduce? Do we give no advice or comment on this question? For example, I’m doing drivers that read and write text files. This is quite tightly coupled to Spark. Possible contributors should know that this is OK, that it will not be rejected, and that this is indeed where most of the engine-specific work is being done by committers. You are right, most of us know what we are doing, but simply to say “no more MapReduce” without offering an alternative isn’t quite fair to everyone else. You are abstracting your code away from a specific engine, and that is great, but in practice anyone running it currently must run Spark. This also needs to be communicated. It’s as practical as answering, “What do I need to install to make Mahout 1.0-snapshot work?”

On May 15, 2014, at 7:17 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Pat, it can be as high-level or as detailed as it needs to be, I don't care, as long as it doesn't contain misstatements. It can simply state that we adhere to Apache's "power of doing" principle and accept new contributions. This is OK with me. But, as offered, it does try to enumerate strategic directions, and in doing so its wording is either vague, or incomplete, or just wrong. For example, it says "it is clear that what the committers are working on is Spark". This is less than accurate. First, if I interpret it literally, it is wrong, as our committers for the most part are not working on Spark, and even if they do, to whatever negligible degree that exists, why would Mahout care? Second, if it is meant to say we develop algorithms for Spark, this is also wrong, because whatever algorithms we have added to date have 0 Spark dependencies. Third, if it is meant to say that the majority of what we are working on is Spark bindings, this is still incorrect. Headcount-wise, the Mahout-math tweaks and Scala enablement were at least as big an effort. The Hadoop 2.0 work was at least as big. Documentation and tutorial work was the absolute leader, headcount-wise, to date. The problem I am trying to explain here is that we obviously know internally what we are doing; but this is for external consumption, so we have to be careful to avoid miscommunication. It is easy for us to pass on less-than-accurate information precisely because we already know what we are doing, and therefore our brain is happy to jump to conclusions and make up the missing connections between what is stated and what is implied.
But for an outsider, this would sound vague or lead him to make the wrong connections.

On Wed, May 7, 2014 at 9:54 AM, Pat Ferrel pat.fer...@gmail.com wrote: This doesn’t seem to be a vision statement. I was +1 to a simple consensus statement. The vision is up to you. We have an interactive shell that scales to huge datasets without resorting to massive subsampling. One that allows you to deal with the exact data your black-box algos work on. Every data tool has an interactive mode except Mahout; now it does. Virtually every complex transform, as well as basic linear algebra, works on massive datasets. The interactivity will allow people to do things with Mahout they could never do before. We also have the building blocks to make the fastest, most flexible, cutting-edge collaborative filtering + metadata recommenders in the world. Honestly, I don’t see anything like this elsewhere. We will also be able to fit into virtually any workflow and directly consume data produced in those systems with no intermediate scrubbing. This has never happened before in Mahout, and I don’t see it in MLlib either. Even the interactive shell will benefit from this. Other feature champions will be able to add to this list. Seems like the vision comes from feature champions. I may not use Mahout in the same way you do, but I rely on your code. Maybe I serve a different user type than you. I don’t see a problem with that, do you? On May 6, 2014, at 2:32 PM, Dmitriy Lyubimov
[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example
[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001165#comment-14001165 ] Sebastian Schelter commented on MAHOUT-1527: Definitely. More examples and more documentation are always welcome :) Fix wikipedia classifier example Key: MAHOUT-1527 URL: https://issues.apache.org/jira/browse/MAHOUT-1527 Project: Mahout Issue Type: Task Components: Classification, Documentation, Examples Affects Versions: 0.7, 0.8, 0.9 Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1527.patch The examples package has a classification showcase for predicting the labels of Wikipedia pages. Unfortunately, the example is totally broken: it relies on the old NB implementation, which has been removed; it suggests using the whole of Wikipedia as input, which will not work well on a single machine; and the documentation uses commands that have long been removed from bin/mahout. The example needs to be updated to use the current naive Bayes implementation, and documentation on the website needs to be written. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1388) Add command line support and logging for MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1388: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed your patch with cosmetic changes, thank you. Could you open another JIRA for adding documentation on how to use the MLP from the command line? Add command line support and logging for MLP Key: MAHOUT-1388 URL: https://issues.apache.org/jira/browse/MAHOUT-1388 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 1.0 Reporter: Yexi Jiang Assignee: Suneel Marthi Labels: mlp, sgd Fix For: 1.0 Attachments: Mahout-1388.patch, Mahout-1388.patch The user should have the ability to run the Perceptron from the command line. There are two programs to execute the MLP: training and labeling. The first takes the data as input and outputs the model; the second takes the model and unlabeled data as input and outputs the results. The parameters for training are as follows:
--input -i (input data)
--skipHeader -sk // whether to skip the first row; this parameter is optional
--labels -labels // the labels of the instances, separated by whitespace. Taking the iris dataset as an example, the labels are 'setosa versicolor virginica'.
--model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning); in labeling mode, this is the location to store the result
--update -u // whether to incrementally update the model; if this parameter is not given, the model is trained from scratch
--output -o // this is only useful in labeling mode
--layersize -ls (no. of units per hidden layer) // use whitespace-separated numbers to indicate the number of neurons in each layer (including the input and output layers), e.g. '5 3 2'.
--squashingFunction -sf // currently only supports Sigmoid
--momentum -m
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function
For example, to train a 3-layer (including input, hidden, and output) MLP with a 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the parameters would be: mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model. The parameters for labeling are as follows:
--input -i // input file path
--columnRange -cr // the range of columns used as features, starting from 0 and separated by whitespace, e.g. 0 5
--format -f // the format of the input file; currently only csv is supported
--model -mo // the file path of the model
--output -o // the output path for the results
If a user needs to use an existing model, they can use the following command: mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result Moreover, we should provide default values if the user does not specify any. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example
[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998664#comment-13998664 ] Sebastian Schelter commented on MAHOUT-1527: I'll have a look at this on the weekend. Fix wikipedia classifier example Key: MAHOUT-1527 URL: https://issues.apache.org/jira/browse/MAHOUT-1527 Project: Mahout Issue Type: Task Components: Classification, Documentation, Examples Affects Versions: 0.7, 0.8, 0.9 Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1527.patch The examples package has a classification showcase for predicting the labels of Wikipedia pages. Unfortunately, the example is totally broken: it relies on the old NB implementation, which has been removed; it suggests using the whole of Wikipedia as input, which will not work well on a single machine; and the documentation uses commands that have long been removed from bin/mahout. The example needs to be updated to use the current naive Bayes implementation, and documentation on the website needs to be written. -- This message was sent by Atlassian JIRA (v6.2#6252)